Introduction to Data Analysis | Kaggle Titanic Task (I) -> Data Loading and Preliminary Observation
2022-07-26 10:29:00 【Ape knowledge】
Series index: Introduction to Data Analysis | Kaggle Titanic Task
I. Data loading
In this part we mainly walk through the data analysis workflow and get familiar, in a hands-on way, with the basic Python operations used in data analysis, completing the Kaggle Titanic task as an end-to-end practice exercise. Reference book: Python for Data Analysis.
Data sets :https://www.kaggle.com/c/titanic/overview
(1) Load the data
Here we mainly use the numpy and pandas libraries.
import numpy as np
import pandas as pd
When loading the data you can use either a relative path or an absolute path. Load the data:
df = pd.read_csv('train.csv')
df.head(3)
You can also read the data with read_table(). However, read_table splits on \t, while read_csv() splits on commas; if you use read_table, adding the parameter sep=',' gives the same result as read_csv.
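For example, a minimal sketch (assuming the same train.csv is in the working directory) showing that the two readers produce the same result:
import pandas as pd
# read_table splits on '\t' by default; sep=',' makes it behave like read_csv
df_csv = pd.read_csv('train.csv')
df_table = pd.read_table('train.csv', sep=',')
print(df_csv.equals(df_table))  # True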
(2) Read the file in chunks
In day-to-day data analysis you will inevitably run into very large datasets, easily 20 or 30 million rows. If you read such a file straight into Python's memory, then even assuming memory is sufficient, the read time and the subsequent processing are very laborious.
Pandas' read_csv function provides two parameters, chunksize and iterator, which let you read a file in several passes by rows and avoid running out of memory.
The syntax is:
- iterator : boolean, default False — returns a TextFileReader object so the file can be processed block by block.
- chunksize : int, default None — the size of each file block (in rows); see the IO Tools docs for more information on iterator and chunksize.
chunker = pd.read_csv('train.csv',chunksize = 1000)
<pandas.io.parsers.TextFileReader at 0x2087ab29040>
The chunksize parameter of pandas.read_csv reads a large file by specifying a block size (how many rows to read at a time). This avoids exhausting memory with a single read, and it returns an iterable TextFileReader object.
Specifying iterator=True also returns an iterable TextFileReader object; iterator=True and chunksize can also be used together.
chunker = pd.read_csv('train.csv', chunksize=100)
for i, chunk in enumerate(chunker):
    print(i, ' ', len(chunk))
0 100
1 100
2 100
3 100
4 100
5 100
6 100
7 100
8 91
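A minimal sketch (not from the original post) of the iterator=True form, pulling one block manually with get_chunk:
reader = pd.read_csv('train.csv', iterator=True)
chunk = reader.get_chunk(200)  # read the next 200 rows
print(len(chunk))  # 200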
Next, let's look at the code for merging data:
import pandas as pd
df = [pd.read_csv('./data/data_' + str(i) + '.csv') for i in range(5)] # list comprehension
data = pd.concat(df, axis=0).reset_index(drop=True) # Merge
data.head()
data.tail()
When axis=0, pd.concat aligns the columns by name and stacks the rows (a vertical merge).
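A small sketch with two hypothetical DataFrames (not the Titanic data) showing that with axis=0 the rows are stacked while the columns are matched by name:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'b': [7, 8], 'a': [5, 6]})  # same columns, different order
merged = pd.concat([df1, df2], axis=0).reset_index(drop=True)
print(merged.shape)  # (4, 2): 4 rows stacked, columns 'a' and 'b' aligned by name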
A demo routine for reading a file in chunks:
import pandas as pd

filePath = r'data_csv.csv'

def read_csv_feature(filePath):
    # open the file and create an iterable reader
    f = open(filePath, encoding='utf-8')
    reader = pd.read_csv(f, sep=',', iterator=True)
    loop = True
    chunkSize = 1000000
    chunks = []
    while loop:
        try:
            # read the next chunkSize rows
            chunk = reader.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print('Iteration is END!!!')
    # concatenate all chunks into a single DataFrame
    df = pd.concat(chunks, axis=0, ignore_index=True)
    f.close()
    return df
data = read_csv_feature(filePath)
II. Preliminary observation
After importing the data, you need to look at its overall structure and some sample rows, such as the data size, format, and so on.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.head(10) # show the first 10 rows
df.tail(15) # show the last 15 rows
df.isnull().head() # check whether values are missing
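A common follow-up, added here as an assumption rather than part of the original code, is to count the missing values in each column:
df.isnull().sum()  # e.g. Age 177, Cabin 687, Embarked 2 missing, matching the non-null counts from df.info()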
df.to_csv('train_chinese.csv')
# Note: on some operating systems the saved file may be garbled; you can add encoding='GBK' or encoding='utf-8'
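For example, a minimal sketch of saving with an explicit encoding (same file name as above):
df.to_csv('train_chinese.csv', encoding='utf-8')
# or encoding='GBK' on systems where the default encoding leads to garbled characters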
The Introduction to Data Analysis | Kaggle Titanic Task series is being updated continuously; likes, bookmarks, and follows are welcome.
Previous: Introduction to Data Analysis | Kaggle Titanic Task
Next: Introduction to Data Analysis | Kaggle Titanic Task (II) -> pandas basics
My level is limited, so please point out any shortcomings in the comments below; if this post helped you, a like would be appreciated. I share interesting, substantial, and informative content from time to time; feel free to subscribe and follow my blog.