Introduction to Data Analysis | Kaggle Titanic Task (I) -> Data Loading and Preliminary Observation
2022-07-26 10:29:00 【Ape knowledge】
Series index: Introduction to Data Analysis | Kaggle Titanic Task
I. Data loading
In this part we mainly walk through the data analysis workflow and get familiar, in a hands-on way, with the basic Python operations used in data analysis, completing the Kaggle Titanic task as an end-to-end practice exercise. Reference book: Python for Data Analysis.
Data sets :https://www.kaggle.com/c/titanic/overview
(1) Load the data
Here we mainly use the numpy and pandas libraries.
import numpy as np
import pandas as pd
When loading the data you can use either a relative path or an absolute path. Load the data:
df = pd.read_csv('train.csv')
df.head(3)
You can also read the data with read_table(). However, read_table splits on \t, while read_csv() splits on commas; if you use read_table, adding the parameter sep=',' gives the same result as read_csv.
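For example, a minimal sketch (assuming the same train.csv is in the working directory) showing that the two readers produce the same result:
import pandas as pd
# read_table splits on '\t' by default; sep=',' makes it behave like read_csv
df_csv = pd.read_csv('train.csv')
df_table = pd.read_table('train.csv', sep=',')
print(df_csv.equals(df_table))  # True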
(2) Read the file in chunks
In day-to-day data analysis you will inevitably run into very large datasets, easily 20 or 30 million rows. If you read such a file straight into Python's memory, then even assuming memory is sufficient, the read time and the subsequent processing are very laborious.
Pandas' read_csv function provides two parameters, chunksize and iterator, which let you read a file in several passes by rows and avoid running out of memory.
The syntax is:
- iterator : boolean, default False — returns a TextFileReader object so the file can be processed block by block.
- chunksize : int, default None — the size of each file block (in rows); see the IO Tools docs for more information on iterator and chunksize.
chunker = pd.read_csv('train.csv',chunksize = 1000)
<pandas.io.parsers.TextFileReader at 0x2087ab29040>
The chunksize parameter of pandas.read_csv reads a large file by specifying a block size (how many rows to read at a time). This avoids exhausting memory with a single read, and it returns an iterable TextFileReader object.
Specifying iterator=True also returns an iterable TextFileReader object; iterator=True and chunksize can also be used together.
chunker = pd.read_csv('train.csv', chunksize=100)
for i, chunk in enumerate(chunker):
    print(i, ' ', len(chunk))
0 100
1 100
2 100
3 100
4 100
5 100
6 100
7 100
8 91
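A minimal sketch (not from the original post) of the iterator=True form, pulling one block manually with get_chunk:
reader = pd.read_csv('train.csv', iterator=True)
chunk = reader.get_chunk(200)  # read the next 200 rows
print(len(chunk))  # 200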
Next, let's look at the code for merging data:
import pandas as pd
df = [pd.read_csv('./data/data_' + str(i) + '.csv') for i in range(5)] # list comprehension
data = pd.concat(df, axis=0).reset_index(drop=True) # Merge
data.head()
data.tail()
When axis=0, pd.concat aligns the columns by name and stacks the rows (a vertical merge).
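A small sketch with two hypothetical DataFrames (not the Titanic data) showing that with axis=0 the rows are stacked while the columns are matched by name:
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'b': [7, 8], 'a': [5, 6]})  # same columns, different order
merged = pd.concat([df1, df2], axis=0).reset_index(drop=True)
print(merged.shape)  # (4, 2): 4 rows stacked, columns 'a' and 'b' aligned by name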
A demo routine for reading a file in chunks:
import pandas as pd

filePath = r'data_csv.csv'

def read_csv_feature(filePath):
    # open the file and create an iterable reader
    f = open(filePath, encoding='utf-8')
    reader = pd.read_csv(f, sep=',', iterator=True)
    loop = True
    chunkSize = 1000000
    chunks = []
    while loop:
        try:
            # read the next chunkSize rows
            chunk = reader.get_chunk(chunkSize)
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print('Iteration is END!!!')
    # concatenate all chunks into a single DataFrame
    df = pd.concat(chunks, axis=0, ignore_index=True)
    f.close()
    return df
data = read_csv_feature(filePath)
II. Preliminary observation
After importing the data, you need to look at its overall structure and some sample rows, such as the data size, format, and so on.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.head(10) # show the first 10 rows
df.tail(15) # show the last 15 rows
df.isnull().head() # check whether values are missing
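A common follow-up, added here as an assumption rather than part of the original code, is to count the missing values in each column:
df.isnull().sum()  # e.g. Age 177, Cabin 687, Embarked 2 missing, matching the non-null counts from df.info()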
df.to_csv('train_chinese.csv')
# Note: on some operating systems the saved file may be garbled; you can add encoding='GBK' or encoding='utf-8'
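For example, a minimal sketch of saving with an explicit encoding (same file name as above):
df.to_csv('train_chinese.csv', encoding='utf-8')
# or encoding='GBK' on systems where the default encoding leads to garbled characters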
The Introduction to Data Analysis | Kaggle Titanic Task series is being updated continuously; likes, bookmarks, and follows are welcome.
Previous: Introduction to Data Analysis | Kaggle Titanic Task
Next: Introduction to Data Analysis | Kaggle Titanic Task (II) -> pandas basics
My level is limited, so please point out any shortcomings in the comments below; if this post helped you, a like would be appreciated. I share interesting, substantial, and informative content from time to time; feel free to subscribe and follow my blog.