Dive into deep learning - 2.2 data preprocessing
2022-07-19 03:28:00 【Trehol】
1. Reading the dataset
os.makedirs(dir_name2, exist_ok=True): like os.mkdir, this creates a new folder, but it is more convenient and does more.
os.makedirs can create nested folders recursively, including any missing parent folders.
With exist_ok=True, os.makedirs does nothing if the folder already exists instead of raising an error.
os.path.join('..', 'data') builds the path ../data; the data is stored there as the CSV (comma-separated values) file ../data/house_tiny.csv.
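To make the difference between os.mkdir and os.makedirs concrete, here is a minimal sketch. It works inside a temporary directory, and the folder names "a", "b" and "data" are made up for illustration.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "a", "b", "data")

    # os.mkdir creates only the last component, so it fails while
    # the parent folders "a" and "a/b" do not exist yet.
    try:
        os.mkdir(nested)
        mkdir_failed = False
    except FileNotFoundError:
        mkdir_failed = True

    # os.makedirs creates the missing parent folders as well.
    os.makedirs(nested, exist_ok=True)
    created = os.path.isdir(nested)

    # With exist_ok=True a second call is a no-op.
    os.makedirs(nested, exist_ok=True)

    # Without exist_ok=True an existing folder raises FileExistsError.
    try:
        os.makedirs(nested)
        second_call_failed = False
    except FileExistsError:
        second_call_failed = True

print(mkdir_failed, created, second_call_failed)
```

This is why the code in this section can be rerun safely: the call with exist_ok=True succeeds whether or not ../data already exists.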
import os
os.makedirs(os.path.join('..', 'data'), exist_ok=True)
data_file = os.path.join('..', 'data', 'house_tiny.csv')
with open(data_file, 'w') as f:  # Open data_file for writing and bind it to the name f
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data sample
    f.write('2,NA,106000\n')
    f.write('4,NA,178100\n')
    f.write('NA,NA,140000\n')
2. Handling missing values
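The post does not show how inputs is built before get_dummies is called. The printed result (NumRooms is 3.0 wherever the CSV had NA) is consistent with reading the CSV, taking the first two columns as inputs, and filling missing numeric values with the column mean. The sketch below works under that assumption; the split into inputs and outputs and the mean imputation are assumptions, not code from the post. It reads the same four rows from memory so that it does not depend on the file path.

```python
import io
import pandas as pd

# Same four rows as the CSV written in section 1.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "4,NA,178100\n"
    "NA,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))  # "NA" is parsed as NaN by default

# Assumed split: the first two columns are inputs, the last is the target.
inputs, outputs = data.iloc[:, 0:2], data.iloc[:, 2]

# Assumed imputation: fill missing numeric values with the column mean.
# numeric_only=True skips the string column Alley.
inputs = inputs.fillna(inputs.mean(numeric_only=True))
print(inputs)
```

After this step NumRooms has no missing values, while Alley still contains NaN, which is what get_dummies with dummy_na=True turns into a separate indicator column.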
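inputs = pd.get_dummies(inputs, dummy_na=True)
print(inputs)
dummy_na: bool, default False. If True, add a column that indicates missing values (NaN); if False, missing values are ignored. In pandas 2.0 and later the indicator columns are bool by default, so the output shows True/False instead of 1/0 unless dtype=int is passed.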
   NumRooms  Alley_Pave  Alley_nan
0       3.0           1          0
1       2.0           0          1
2       4.0           0          1
3       3.0           0          1
3. Exercises
Create a raw dataset with more rows and columns.
Delete the column with the most missing values.
Convert the preprocessed dataset to tensor format.
Solving the exercise
Ways to count missing values
null_all = data.isnull().sum()
# isnull checks each element and returns a Boolean: True if the element is None or NaN, otherwise False
# data.isnull().any() reports which columns contain missing values: True if the column has at least one, otherwise False
# data.isnull().sum() returns the number of missing values in each column
# dropna(thresh=2): thresh is the minimum number of non-missing values required; any row (axis=0) or column (axis=1) with fewer than thresh non-missing values is deleted
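The calls described above can be tried on a small frame. This is a minimal sketch on made-up data; the column names mirror the post's CSV, but the frame itself is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "Alley": ["Pave", np.nan, np.nan, np.nan],
    "Price": [127500, 106000, 178100, 140000],
})

missing_per_column = df.isnull().sum()  # number of NaN in each column
has_missing = df.isnull().any()         # True for columns with any NaN

# thresh counts NON-missing values. The last row has only one
# non-missing value (Price), so thresh=2 drops it.
kept_rows = df.dropna(thresh=2)

# Along axis=1 the same rule applies to columns: Alley has a single
# non-missing value and is dropped.
kept_cols = df.dropna(axis=1, thresh=2)

print(missing_per_column)
print(has_missing)
print(kept_rows)
print(kept_cols)
```

Note that thresh is easy to read backwards: a larger thresh is stricter, because more non-missing values are needed for a row or column to survive.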
Parameters of the drop function:
axis=0 (the default) operates on rows; to operate on columns, pass axis=1.
inplace=False (the default) leaves the original data unchanged and returns a new DataFrame with the deletion applied; to delete from the original data directly, pass inplace=True.
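A short sketch of how axis and inplace change the behaviour of drop, using a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical toy frame for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# axis=0 (the default) drops rows by index label; df is unchanged.
without_row0 = df.drop(0, axis=0)

# axis=1 drops columns by name; again a new DataFrame is returned.
without_b = df.drop("B", axis=1)

# inplace=True modifies df itself and returns None.
result = df.drop("C", axis=1, inplace=True)

print(list(without_row0.index))
print(list(without_b.columns))
print(list(df.columns), result)
```

Because the inplace=True call returns None, writing df = df.drop(..., inplace=True) would replace the frame with None; use one style or the other.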
import pandas as pd
import os
import torch
os.makedirs(os.path.join('F:/Pycharm/DIVE INTO DL', 'data2.2'), exist_ok=True)
data_file = os.path.join('F:/Pycharm/DIVE INTO DL', 'data2.2', 'house_tiny.csv')
with open(data_file, 'w') as f:
    f.write('NumRooms,Alley,Price\n')  # Column names
    f.write('NA,Pave,127500\n')  # Each row represents a data sample
    f.write('2,NA,106000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')
    f.write('2,NA,106000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')
    f.write('NA,NA,178100\n')
    f.write('NA,NA,140000\n')
data = pd.read_csv(data_file)
# Handle missing values: first delete the column with the most missing values
col_null = data.isna().sum(axis=0)  # Number of missing values per column
col_null_dict = col_null.to_dict()  # Convert to a dictionary
# col_max = col_null.max(axis=0)  # Largest per-column count of missing values
max_key = max(col_null_dict.keys(), key=col_null_dict.get)
# Passing col_null_dict itself has the same effect, since iterating a dict yields its keys.
# The key argument is the comparison criterion: each key k is ranked by col_null_dict.get(k),
# so max returns the key whose value is largest.
print(col_null)
print('The key corresponding to the maximum value is: ' + max_key)
del data[max_key]
print('After deleting the column with the most missing values, the data is:')
print(data)
# data is a pandas DataFrame rather than a numeric array, so torch.tensor(data) is not used directly
data_post = data.iloc[:, :2]  # The two remaining columns, still a DataFrame
data_tensor = torch.tensor(data_post.values)  # .values gives a NumPy array that torch.tensor accepts
print('After converting to tensor format:')
print(data_tensor)
This only deletes the column with the most missing values from the original dataset; the missing values that remain (the NaN entries in NumRooms) are not handled.
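One possible way to close that gap is sketched below, assuming the remaining missing values should be replaced by the column mean (the same strategy the section 2 output suggests). The data is a small made-up set in the same format as the exercise file, and idxmax is used as a shorter alternative to the dictionary lookup above.

```python
import io
import pandas as pd

# Illustrative data in the same format as the exercise CSV.
csv_text = (
    "NumRooms,Alley,Price\n"
    "NA,Pave,127500\n"
    "2,NA,106000\n"
    "NA,NA,178100\n"
    "4,NA,140000\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# idxmax returns the label of the largest count, i.e. the column
# with the most missing values (the first one if there is a tie).
worst_column = data.isna().sum().idxmax()
data = data.drop(columns=worst_column)

# Assumed strategy: replace the remaining NaN with the column mean.
data = data.fillna(data.mean())

# A purely numeric array with no NaN left.
array = data.to_numpy(dtype=float)
print(worst_column)
print(array)
```

The resulting array can be passed to torch.tensor in the same way as data_post.values in the code above.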