Summary of the Boston Housing Price Analysis Assignment
2022-07-17 01:33:00 【Tomorrowave】
How to load the data

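import numpy as np
import pandas as pd

data_url = "http://lib.stat.cmu.edu/datasets/boston"  # data source
# Read the file with pandas: skip the first 22 lines (the description header);
# fields are separated by whitespace of any length
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two lines: join the even rows with the first two columns of the odd rows
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The third value on each odd row is the target
target = raw_df.values[1::2, 2]
# Load a local CSV copy of the dataset
house = pd.read_csv("./data/boston.csv")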
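The slicing in the loading code is easier to follow on a tiny example. The sketch below is only an illustration: `raw_text` is a made-up miniature of the file layout (11 values on the first physical line of a record, 3 on the second, the last of which is the target), not real data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of the layout: every record spans two physical lines.
raw_text = (
    "1 2 3 4 5 6 7 8 9 10 11\n"
    "12 13 100\n"
    "21 22 23 24 25 26 27 28 29 30 31\n"
    "32 33 200\n"
)

# Short lines are padded with NaN, so the frame has 11 columns.
raw_df = pd.read_csv(io.StringIO(raw_text), sep=r"\s+", header=None)

# Even rows carry 11 features; the first 2 values of the odd rows carry the rest.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# The third value of each odd row is the target.
target = raw_df.values[1::2, 2]

print(data.shape)  # (records, 13 features)
print(target)
```

Each pair of physical lines becomes one row with 13 features, and the targets are collected separately.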
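Getting a first look at the data
house.head()  # first five rows
# Size of the dataset (rows, columns)
house.shape
# Name of every column in the dataset
house.columns
# Basic descriptive statistics of the dataset
# This command is very handy: it shows the basic statistical distribution of each column, including the maximum and minimum, among other statistics
house.describe()
# Sometimes df.info() also gives a quick overview of the dataset, mainly to check for missing values and the type of each variable; the variable types help decide how to process the data.
house.info()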
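Data processing
Handling missing values
Missing data usually falls into one of these cases:
①: Too much is missing, for example more than 1/2 of the values. There is no point in working out how to fill the gaps; keeping the feature only adds error, so it can be dropped.
②: Less than 1/2 is missing, but the gaps are contiguous, i.e. they come in long stretches. If the stretch is at the beginning, it can be left alone and the NaN rows kept in the sample as they are. If it is in the middle or at the end, then depending on how much is missing, the mean, linear regression, or grey prediction can be used to salvage it.
③: Far less than 1/2 is missing and the gaps are not contiguous. Here a more sophisticated interpolation can be used, or the gaps can be filled with the average of the neighboring values or with the mode; the result after filling is sometimes surprisingly good.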
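The three cases above can be turned into a rough triage routine. This is a minimal sketch under stated assumptions: the helper names `longest_nan_run` and `triage`, the `run_threshold` of 3 used to decide what counts as a "long stretch", and the toy columns are all hypothetical, and the position of the gap (beginning versus middle or end) is not modeled.

```python
import numpy as np
import pandas as pd


def longest_nan_run(series: pd.Series) -> int:
    """Length of the longest run of consecutive missing values."""
    if len(series) == 0:
        return 0
    is_na = series.isna()
    # The group id changes at every non-missing value, so summing is_na
    # inside a group counts the consecutive NaNs in that stretch.
    groups = (~is_na).cumsum()
    return int(is_na.groupby(groups).sum().max())


def triage(df: pd.DataFrame, drop_threshold: float = 0.5, run_threshold: int = 3) -> dict:
    """Suggest a strategy per column; both thresholds are illustrative assumptions."""
    plan = {}
    for col in df.columns:
        rate = df[col].isna().mean()
        if rate > drop_threshold:
            plan[col] = "drop"              # case 1: too much missing
        elif longest_nan_run(df[col]) >= run_threshold:
            plan[col] = "model-based fill"  # case 2: long contiguous gap
        elif rate > 0:
            plan[col] = "interpolate"       # case 3: few, scattered gaps
        else:
            plan[col] = "keep"
    return plan


nan = np.nan
toy = pd.DataFrame({
    "a": [1, nan, nan, nan, nan, nan, 7, 8],
    "b": [1, nan, nan, nan, 5, 6, 7, 8],
    "c": [1, nan, 3, 4, nan, 6, 7, 8],
    "d": [1, 2, 3, 4, 5, 6, 7, 8],
})
plan = triage(toy)
print(plan)
```

The thresholds are worth tuning per dataset; the 1/2 cut-off comes from the rule of thumb above, while the run length is only a placeholder.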
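Direct count
null.isnull().sum()  # number of missing values per column (null is the DataFrame being inspected)
Computing the proportion of missing values
A = []
for col in null.columns:
    # percentage of missing values in this column
    A.append((col,
              null[col].isnull().sum() * 100 / null.shape[0]))
pd.DataFrame(A, columns=['Features', 'missing rate'])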
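The same table can be built without an explicit loop, since the mean of a boolean mask is the fraction of `True` entries. A small sketch, where `toy` is a hypothetical stand-in for the `null` DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the `null` DataFrame.
toy = pd.DataFrame({
    "RM": [6.5, np.nan, 7.1, 6.0],
    "AGE": [65.2, 78.9, np.nan, np.nan],
    "TAX": [296.0, 242.0, 242.0, 222.0],
})

# isnull() yields booleans; their column-wise mean is the missing fraction.
missing_rate = (toy.isnull().mean() * 100).rename("missing rate")
report = missing_rate.rename_axis("Features").reset_index()
print(report)
```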
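Filling missing values (fillna)
- Fill with a fixed value
train_data.fillna(0, inplace=True)  # fill with 0
- Fill with the mean
Each column's missing values are filled with that column's mean.
train_data.fillna(train_data.mean(), inplace=True)  # fill with the mean
- Fill with the median
train_data.fillna(train_data.median(), inplace=True)  # fill with the median
- Fill with the mode
# mode() returns a DataFrame, and fillna aligns a DataFrame by index, so take the first row of modes.
# The author notes that with too much missing data the mode can come out as NaN.
train_data.fillna(train_data.mode().iloc[0], inplace=True)  # fill with the mode
- Fill with KNN
from fancyimpute import KNN
# features is the list of column names of train_data_x
train_data_x = pd.DataFrame(KNN(k=6).fit_transform(train_data_x), columns=features)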
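To compare the simpler strategies side by side, the sketch below applies them to a small made-up frame (`toy` and its values are hypothetical). It uses the non-inplace form of `fillna` so each result can be inspected separately, and adds linear interpolation as one way to get the "average of the neighboring values" mentioned earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one isolated gap per column.
toy = pd.DataFrame({
    "RM": [6.0, np.nan, 7.0, 8.0],
    "CHAS": [0.0, 0.0, np.nan, 1.0],
})

filled_mean = toy.fillna(toy.mean())      # per-column mean
filled_median = toy.fillna(toy.median())  # per-column median
# mode() returns a DataFrame (one row per tied mode); use its first row.
filled_mode = toy.fillna(toy.mode().iloc[0])
# For an isolated gap, linear interpolation equals the mean of its two neighbors.
filled_interp = toy.interpolate(method="linear")

print(filled_mean, filled_median, filled_mode, filled_interp, sep="\n\n")
```

Which strategy fits best depends on the column: the mode is usually reserved for categorical or indicator columns, while the mean and median suit continuous ones.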