
Boston house price analysis assignment summary

2022-07-19 03:41:00 Tomorrowave

How to load data


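import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"  # data source
# Read the raw text with pandas: skip the first 22 header lines
# and split columns on any run of whitespace
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two rows: join every even row with the first two values of the following odd row
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The third value of each odd row is the target
target = raw_df.values[1::2, 2]
# Load a local CSV copy of the data; the rest of this summary works on this frame
house = pd.read_csv("./data/boston.csv")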
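The `[::2]` / `[1::2]` slicing works because the raw file spreads each record over two physical lines. A minimal sketch of the same stitching on a small made-up array (toy values and a 4 + 3 column layout chosen for illustration; the real file uses 11 + 3):

```python
import numpy as np

# Toy stand-in for raw_df.values: each record spans two rows.
# Even rows hold the first 4 features; odd rows hold 2 more features
# plus the target, padded with NaN to the full width.
raw = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 24.0, np.nan],
    [10.0, 20.0, 30.0, 40.0],
    [50.0, 60.0, 21.6, np.nan],
])

# Features: all of each even row, followed by the first two values of the odd row.
data = np.hstack([raw[::2, :], raw[1::2, :2]])
# Target: the third value of each odd row.
target = raw[1::2, 2]

print(data.shape)  # (2, 6): two records, six features each
print(data)
print(target)
```

Each output row of `data` is one complete record, and `target` has one entry per record.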

Understanding the data

house.head()  # show the first five rows

# Shape of the data: (number of rows, number of columns)
house.shape

# Column names of the dataset
house.columns

# Basic descriptive statistics of the dataset
# Very convenient: for each numeric column it reports the count, mean, standard deviation, minimum, quartiles, and maximum
house.describe()

Sometimes df.info() is also useful for a quick overview of the dataset: it lists each column's type and non-null count. Which processing method to use depends on the actual situation and on the variable types.

house.info()
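One way to act on the variable types that `info()` reports is to split the columns by dtype before choosing a processing method. A minimal sketch on a made-up frame (the column names and values are illustrative, not the assignment data):

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
df = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "AGE": [65.2, 78.9, 61.1],
    "TOWN": ["A", "B", "A"],
})

# Numeric columns are candidates for mean or median filling.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Non-numeric columns usually need mode filling or encoding instead.
other_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```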

Data processing

Missing value processing

Missing data falls into several cases:

①: Too many values are missing, for example more than 1/2 of the column. There is no need to think about how to fill it; keeping the feature would only add error, so it can be dropped.
②: Fewer than 1/2 of the values are missing, but the gaps are consecutive, which also counts as a large gap. If the gap is at the start, no filling is needed: keep the values as NaN and treat those rows as a separate kind of sample. If it is in the middle or at the end, then depending on how much is missing, consider filling with the mean, linear regression, grey prediction, or similar methods.
③: Far fewer than 1/2 of the values are missing, and the gaps are not consecutive. Here a more elaborate interpolation can be used, or the average of the neighboring values before and after, or the mode. Note that filling can have unexpected side effects.
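The three cases can be turned into a rough rule of thumb. The sketch below is a simplification, not a definitive procedure: the helper names `longest_nan_run` and `suggest_strategy`, the run-length threshold of 3, and the toy series are all assumptions made for illustration, and the position of the gap (start, middle, end) from case ② is not modeled.

```python
import numpy as np
import pandas as pd


def longest_nan_run(s: pd.Series) -> int:
    """Length of the longest run of consecutive missing values in s."""
    is_na = s.isna()
    if not is_na.any():
        return 0
    # A new group starts wherever the missing/non-missing flag changes.
    groups = (is_na != is_na.shift(fill_value=False)).cumsum()
    # Summing the boolean flag per group gives the length of each NaN run.
    return int(is_na.groupby(groups).sum().max())


def suggest_strategy(s: pd.Series, long_run: int = 3) -> str:
    """Hypothetical rule of thumb following the three cases above."""
    ratio = s.isna().mean()  # fraction of missing values
    if ratio > 0.5:
        return "drop"                      # case 1
    if longest_nan_run(s) >= long_run:
        return "mean or model-based fill"  # case 2
    if ratio > 0:
        return "interpolate"               # case 3
    return "nothing to fill"


s1 = pd.Series([np.nan, np.nan, np.nan, 1.0])
s2 = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0])
s3 = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])

for s in (s1, s2, s3):
    print(suggest_strategy(s))

# For case 3, linear interpolation fills each gap from its neighbors.
print(s3.interpolate().tolist())
```

For a single isolated gap, linear interpolation gives the same result as averaging the values just before and after it.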

Count missing values directly

house.isnull().sum()  # number of missing values in each column

Compute the percentage of missing values in each column

# Collect (column name, percentage of missing values) for every column
A = []
for col in house.columns:
    A.append((col,
              house[col].isnull().sum() * 100 / house.shape[0]))
pd.DataFrame(A, columns=['Features', 'missing rate'])
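The same table can also be produced without an explicit loop, since the mean of a boolean mask is the fraction of `True` values. A minimal sketch on a made-up frame (column names borrowed from the Boston data, values invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with some missing entries.
df = pd.DataFrame({
    "CRIM": [0.1, np.nan, 0.3, 0.4],
    "ZN": [18.0, 0.0, np.nan, np.nan],
    "RM": [6.5, 6.4, 7.1, 6.9],
})

# isnull() gives a True/False mask; its column-wise mean is the missing fraction.
missing = (
    (df.isnull().mean() * 100)
    .rename("missing rate")
    .rename_axis("Features")
    .reset_index()
)
print(missing)
```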

Filling missing values (fillna)

  1. Fill with a fixed value
train_data.fillna(0, inplace=True)  # fill with 0
  2. Fill with the mean
    Missing values in each column are filled with that column's mean.
train_data.fillna(train_data.mean(), inplace=True)  # fill with the mean
  3. Fill with the median
train_data.fillna(train_data.median(), inplace=True)  # fill with the median
  4. Fill with the mode
# mode() returns a DataFrame, so take its first row; if a column is entirely missing, its mode is NaN and nothing is filled
train_data.fillna(train_data.mode().iloc[0], inplace=True)  # fill with the mode
  5. Fill with KNN
from fancyimpute import KNN

# train_data_x is assumed to be the feature matrix and features its list of column names
train_data_x = pd.DataFrame(KNN(k=6).fit_transform(train_data_x), columns=features)
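To compare the pandas-based fills side by side, here is a minimal sketch on a made-up frame (toy values; the KNN variant is left out because it needs the third-party fancyimpute package). It returns new frames instead of using `inplace=True`, so the original stays available for each strategy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
train_data = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 2.0, 10.0],
    "b": [np.nan, 4.0, 4.0, 6.0, 8.0],
})

filled_zero = train_data.fillna(0)
filled_mean = train_data.fillna(train_data.mean())      # a: 3.75, b: 5.5
filled_median = train_data.fillna(train_data.median())  # a: 2.0,  b: 5.0

# mode() returns a DataFrame with one row per tied mode.
# Passing it directly aligns on the row index, so only row 0 can be filled.
naive_mode = train_data.fillna(train_data.mode())
# Taking the first row gives a Series of per-column modes, which fills every row.
filled_mode = train_data.fillna(train_data.mode().iloc[0])  # a: 2.0, b: 4.0

print(naive_mode)
print(filled_mode)
```

In this toy example the mean of column `a` is pulled up by the outlier 10.0, while the median and mode are not, which is one reason to prefer them for skewed columns.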