Dive into deep learning - 2.2 data preprocessing

2022-07-19 03:28:00 【Trehol】

1. Reading the dataset

os.makedirs(dir_name2, exist_ok=True): like os.mkdir, this function creates a new folder, but it is more convenient and does more.

os.makedirs creates nested folders recursively, including any missing parent folders.
When exist_ok is set to True, os.makedirs does nothing (and raises no error) if the folder already exists.
os.path.join('..', 'data') builds the folder path ../data. The dataset is stored in the CSV (comma-separated values) file ../data/house_tiny.csv.
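
The difference between os.mkdir and os.makedirs can be checked with a short sketch. It works inside a temporary directory so nothing is left on disk; the folder names a, b and c are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, 'a', 'b', 'c')  # hypothetical nested path

    # os.mkdir creates only the last component, so it fails when parents are missing
    try:
        os.mkdir(nested)
        mkdir_failed = False
    except FileNotFoundError:
        mkdir_failed = True

    # os.makedirs creates every missing parent folder along the way
    os.makedirs(nested)
    created = os.path.isdir(nested)

    # calling it again with exist_ok=True does nothing and raises no error ...
    os.makedirs(nested, exist_ok=True)

    # ... while the default exist_ok=False raises FileExistsError
    try:
        os.makedirs(nested)
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

print(mkdir_failed, created, second_call_failed)
```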

import os

os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:  # open data_file for writing and bind the file object to f
    f.write('NumRooms,Alley,Price\n')  # column names
    f.write('NA,Pave,127500\n')  # each row represents a data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')
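
To check what was written, the file can be loaded back with pandas. This is a minimal sketch that assumes pandas is installed; it writes the same four samples to a temporary folder instead of ../data so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = os.path.join(tmp_dir, 'house_tiny.csv')
    with open(data_file, 'w') as f:
        f.write('NumRooms,Alley,Price\n')
        f.write('NA,Pave,127500\n')
        f.write('2,NA,106000\n')
        f.write('4,NA,178100\n')
        f.write('NA,NA,140000\n')

    # read_csv treats the string "NA" as a missing value (NaN) by default
    data = pd.read_csv(data_file)

print(data)
```

The NA entries show up as NaN in the printed DataFrame, which is what the next section deals with.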

2. Handling missing values

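import pandas as pd
# inputs: the NumRooms and Alley columns, with missing NumRooms values already filled in
inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)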

dummy_na: bool, default False. If True, add a column that flags missing values (NaN); if False, missing values are ignored.

Usage of get_dummies: each category in Alley, plus NaN because dummy_na=True, becomes its own 0/1 column:

   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
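
The listing above does not show how inputs was built. The sketch below is one way to reproduce the table, assuming inputs is the first two columns of the four-sample dataset and that missing NumRooms values were filled with the column mean (consistent with the 3.0 entries in the output). The dtype=int argument is an addition here: newer pandas versions return True/False columns by default.

```python
import io

import pandas as pd

csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    '4,NA,178100\n'
    'NA,NA,140000\n'
)
data = pd.read_csv(io.StringIO(csv_text))

# the first two columns are the inputs, the last one is the output
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]

# fill missing numeric entries with the column mean (Alley is text and is skipped)
inputs = inputs.fillna(inputs.mean(numeric_only=True))

# one-hot encode the text column, adding an extra indicator column for NaN
inputs = pd.get_dummies(inputs, dummy_na=True, dtype=int)
print(inputs)
```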

3. Exercises

Create a raw dataset with more rows and columns .

Delete the column with the most missing values .
Convert the preprocessed data set into tensor format .

Solution to the first exercise
Counting missing values

null_all = data.isnull().sum()

# isnull checks each element and returns a Boolean: True if the element is empty or NaN, otherwise False

# data.isnull().any() reports which columns contain missing values: True if the column has at least one missing value, otherwise False

# data.isnull().sum() returns the number of missing values in each column
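
The three calls can be compared side by side on a small table. This is an illustrative sketch; the values are made up and are not the exercise dataset.

```python
import numpy as np
import pandas as pd

# toy table; the values are made up for illustration
data = pd.DataFrame({
    'NumRooms': [np.nan, 2.0, 4.0, np.nan],
    'Alley': ['Pave', None, None, None],
    'Price': [127500, 106000, 178100, 140000],
})

mask = data.isnull()      # DataFrame of Booleans, True where a value is missing
has_null = mask.any()     # per column: does it contain any missing value?
null_count = mask.sum()   # per column: how many values are missing?

print(has_null)
print(null_count)
```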

Deleting missing values

# dropna(thresh=2): thresh sets the minimum number of non-missing values required. A row (axis=0) or column (axis=1) with fewer than thresh non-missing values is deleted
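
Because thresh counts non-missing values rather than missing ones, it is easy to get backwards. A small sketch on made-up data shows which rows and columns survive:

```python
import numpy as np
import pandas as pd

# made-up table: row 0 has 3 non-missing values, row 1 has 2, row 2 has 1
df = pd.DataFrame({
    'a': [1.0, 2.0, np.nan],
    'b': [1.0, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0],
})

# keep rows with at least 2 non-missing values (axis=0 is the default)
rows_kept = df.dropna(thresh=2)

# keep columns with at least 2 non-missing values
cols_kept = df.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```

Row 2 and column b each hold a single non-missing value, so they are the ones removed.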

Relevant parameters of the drop function:
Parameter axis=0 (the default) operates on rows; to operate on columns, change it to axis=1.

Parameter inplace=False (the default) means the deletion does not change the original data and a new DataFrame is returned. To delete from the original data directly, change it to inplace=True.
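
A short sketch of both parameters on made-up data. Note that with inplace=True the call returns None, so its result should not be assigned back to the variable.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})  # made-up data

no_row_0 = df.drop(0, axis=0)    # drop the row labelled 0; df itself is unchanged
no_col_y = df.drop('y', axis=1)  # drop the column 'y'; df itself is unchanged

df_copy = df.copy()
result = df_copy.drop('y', axis=1, inplace=True)  # modifies df_copy and returns None

print(df.shape, no_row_0.shape, no_col_y.shape, df_copy.shape, result)
```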

import pandas as pd
import os
import torch

os.makedirs(os.path.join('F:/Pycharm/DIVE INTO DL', 'data2.2'), exist_ok=True)
data_file = os.path.join('F:/Pycharm/DIVE INTO DL', 'data2.2', 'house_tiny.csv')

with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')
    f.write('NA,Pave,127500\n')  # each row represents a data sample
    f.write('2,NA,106000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')
    f.write('2,NA,106000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')

data = pd.read_csv(data_file)
# Handle missing values: first delete the column with the most missing values
col_null = data.isna().sum(axis=0)  # number of missing values in each column
col_null_dict = col_null.to_dict()  # convert to a dictionary
# col_max = col_null.max(axis=0)  # largest per-column count of missing values

max_key = max(col_null_dict.keys(), key=col_null_dict.get)
# Passing col_null_dict itself has the same effect, because iterating over a dict yields its keys.
# The key argument is the comparison criterion, here col_null_dict.get(k),
# so max returns the key whose value is largest.
print(col_null)
print('The key corresponding to the maximum value is: ' + max_key)
del data[max_key]
print('After deleting the column with the most missing values, the data is:')
print(data)
# data is a DataFrame, so torch.tensor(data) cannot be used directly to convert it to a tensor
data_post = data.iloc[:, :2]  # still a DataFrame at this point

data_tensor = torch.tensor(data_post.values)  # .values gives a NumPy array, which torch.tensor accepts
print('After converting to tensor format:')
print(data_tensor)

This solution only deletes the column with the most missing values from the original dataset; the missing values that remain are not handled.
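
One possible way to deal with the remaining missing values is to fill them with the column mean, as was done for NumRooms in section 2. The sketch below is an assumption about how the exercise could be finished, not part of the original solution. It uses a shortened, made-up variant of the dataset and idxmax as a more compact way to find the column with the most missing values.

```python
import io

import pandas as pd

# shortened, made-up variant of the exercise data
csv_text = (
    'NumRooms,Alley,Price\n'
    'NA,Pave,127500\n'
    '2,NA,106000\n'
    'NA,NA,178100\n'
    '4,NA,140000\n'
)
data = pd.read_csv(io.StringIO(csv_text))

# idxmax returns the label of the column with the largest missing-value count
worst_col = data.isna().sum().idxmax()
data = data.drop(columns=worst_col)

# fill what is still missing with the mean of each remaining numeric column
data = data.fillna(data.mean(numeric_only=True))

array = data.to_numpy(dtype=float)  # purely numeric, no NaN left
print(worst_col)
print(array)
```

The resulting array can be passed to torch.tensor in the same way as data_post.values in the listing above.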

Original site

Copyright notice
This article was written by [Trehol]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/200/202207170100479458.html
