Travel data acquisition, data analysis and data mining [2022.5.30]
2022-07-19 05:06:00 · Big Data Da Wenxi
Full source code:
Link: https://pan.baidu.com/s/1b39J-dEfUt1ZROO93aEkag
Extraction code: 8848
Key points:
1. Parsing is done mainly with BeautifulSoup; you should be comfortable with its find_all and find methods (look them up if they are new to you).
2. pandas and numpy handle the data cleaning and mining, with plenty of practical detail.
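As a quick refresher on the two methods mentioned above: find_all returns every matching tag as a list, while find returns only the first match (or None). A minimal sketch, using a made-up HTML fragment standing in for a travelogue list page:

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML fragment for illustration only
html = '''
<ul>
  <li class="item"><a href="/note/1">Trip to Xiamen</a></li>
  <li class="item"><a href="/note/2">Chengdu food tour</a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of every matching tag
items = soup.find_all('li', class_='item')
print(len(items))             # 2

# find returns only the first match
first_link = soup.find('a')
print(first_link.get_text())  # Trip to Xiamen
print(first_link['href'])     # /note/1
```

The real page structure on Qunar differs, of course; the pattern of locating tags by name and class attribute is the same.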
Crawler code:

import requests
from bs4 import BeautifulSoup
import re
import time
import csv
import random
from lxml import etree

def get_html(i):
    url = 'https://travel.qunar.com/travelbook/list.htm?page={}&order=hot_heat'.format(i)
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
        'cookie': '<your current cookies here>',  # replace with your own, freshly copied from the browser
        'referer': 'https://travel.qunar.com/?from=header'}
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    html_ = response.text
    # print(html_)
    return html_

# The parsing function get_info() is not published here due to restrictions;
# it is included in the full source code linked above.
Saving code:

with open('D:/Crawled content/Qunar travelogue crawl/Travel.csv', 'a+', encoding='utf-8-sig', newline='') as csvFile:
    csv.writer(csvFile).writerow(['Short review', 'Departure time', 'Days', 'Per capita cost',
                                  'People', 'How to play', 'Views', 'Route'])
# The with block closes the file automatically; no explicit close() is needed.

for i in range(1, 110):
    html = get_html(i)
    get_info(html)   # parses the page and appends its rows to Travel.csv
    time.sleep(random.randint(3, 5))   # random delay to avoid hammering the site
    print('Page {} crawled successfully!'.format(i))
print('Crawl finished')
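The pattern above writes the header row once and then lets each page append its rows. A self-contained sketch of that write-header-then-append flow, using a temporary file and a single made-up row:

```python
import csv
import os
import tempfile

# One hypothetical parsed travelogue, matching the column order above
row = ['Nice trip', 'June', '5 days', '3000', 'Solo', 'Sightseeing', '3.5万', 'Shanghai>Xiamen']

path = os.path.join(tempfile.gettempdir(), 'Travel_demo.csv')

# Write the header once; newline='' prevents blank lines on Windows
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(['Short review', 'Departure time', 'Days', 'Per capita cost',
                            'People', 'How to play', 'Views', 'Route'])

# Append data rows in a separate pass, as each crawled page does
with open(path, 'a', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(row)

with open(path, encoding='utf-8-sig') as f:
    print(sum(1 for _ in f))   # 2 (header + one data row)
```

The utf-8-sig encoding adds a BOM so that Excel opens the Chinese text correctly.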
Result: (screenshot of the crawled Travel.csv omitted)

Data cleaning and mining:

import pandas as pd
df = pd.read_csv('D:/Crawled content/Qunar travelogue crawl/Travel.csv', encoding='utf-8-sig')
df

A first look shows the data is quite messy, so we clean it in the following steps:
1. Extract the departure time.
2. Extract the number of days.
3. Fill missing per capita cost values with the mean.
4. Fill missing People values with the mode (the most common value).
5. Extract the view count; values containing 万 (ten thousand) must be multiplied by 10,000.
6. Split the route into two fields: origin and destination.
The code for each step follows.
Extract the number of days into a new column:

df['New days'] = '0'
df

import re
for index, row in df.iterrows():
    day = re.findall(r'\d+', str(row['Days']))[0]
    print(day)
    # Assigning to `row` only modifies a copy; write back through df.loc
    df.loc[index, 'New days'] = day
df
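The loop above can also be collapsed into a single vectorized call with str.extract, which pulls the first run of digits out of every value at once. A sketch on a toy DataFrame shaped like the crawled Days column:

```python
import pandas as pd

# Toy data in the same shape as the crawled 'Days' column
df = pd.DataFrame({'Days': ['5 days', '12 days', '3 days']})

# expand=False returns a Series of the first captured group per row
df['New days'] = df['Days'].str.extract(r'(\d+)', expand=False)
print(df['New days'].tolist())   # ['5', '12', '3']
```

This avoids the iterrows pitfall entirely, since no per-row copies are involved.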

Extract the departure date into a new column:

df['New departure date'] = df['Departure time'].str[:-2]   # drop the trailing two-character suffix
df
Extract the view count into a new column:

df['Views'].count()   # 1090, no null values
df['New views'] = '0'
for index, row in df.iterrows():
    if '万' in row['Views']:
        view = float(re.findall(r'\d+\.?\d*', str(row['Views']))[0]) * 10000
        print(view)
        df.loc[index, 'New views'] = str(int(view))
    else:
        view = re.findall(r'\d+', str(row['Views']))[0]
        df.loc[index, 'New views'] = view
df
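The 万 handling above can be expressed as a small conversion function mapped over the column; a self-contained sketch with made-up view strings:

```python
import pandas as pd

# Made-up view counts in the two formats the crawl produces
df = pd.DataFrame({'Views': ['3.5万', '812', '1.2万']})

def to_int_views(v: str) -> int:
    # Values ending in 万 (ten thousand) are scaled by 10,000
    if '万' in v:
        return int(float(v.replace('万', '')) * 10000)
    return int(v)

df['New views'] = df['Views'].map(to_int_views)
print(df['New views'].tolist())   # [35000, 812, 12000]
```

Storing the result as integers (rather than strings, as above) also makes later sorting and aggregation behave numerically.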

Split the route into origin and destination:

df['Origin'] = '.'
df['Destination'] = '.'
df['Route'].count()   # 1090, no null values
for index, row in df.iterrows():
    if '-' in row['Route']:
        # '-' marks a missing route; keep a placeholder
        df.loc[index, 'Origin'] = '-'
        df.loc[index, 'Destination'] = '-'
    else:
        s_next = row['Route'].split(':')[1]
        if '>' in s_next:
            li = s_next.split('>')
            df.loc[index, 'Origin'] = li[0]
            df.loc[index, 'Destination'] = li[-1]
        else:
            # A single stop is both origin and destination
            df.loc[index, 'Origin'] = s_next
            df.loc[index, 'Destination'] = s_next
df
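The branching above is easier to see in isolation. A sketch of the same split logic as a plain function, on two made-up route strings in the prefix:stop>stop shape the code expects:

```python
# Made-up routes in the 'prefix:stop>stop>stop' shape the code above expects
routes = ['Route:Shanghai>Hangzhou>Xiamen', 'Route:Chengdu']

def split_route(route: str):
    stops = route.split(':')[1]        # drop the label before the colon
    if '>' in stops:
        parts = stops.split('>')
        return parts[0], parts[-1]     # origin, destination
    return stops, stops                # single-stop trip

print([split_route(r) for r in routes])
# [('Shanghai', 'Xiamen'), ('Chengdu', 'Chengdu')]
```

Intermediate stops are deliberately discarded; only the endpoints are kept as fields.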

Fill the missing values in the People column:

df['New people'] = 'Solo'   # the most common value; '-' marks a missing entry
df
for index, row in df.iterrows():
    if '-' not in row['People']:
        df.loc[index, 'New people'] = row['People']
df
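The checklist called this a mode fill, and the code above hard-codes the most common value. pandas can compute the mode instead, which is safer if the data changes; a sketch on a toy column (the values here are made up):

```python
import numpy as np
import pandas as pd

# Toy People column with missing entries as NaN
df = pd.DataFrame({'People': ['Solo', np.nan, 'Couple', 'Solo', np.nan]})

# mode() returns a Series of the most frequent values; take the first
mode_val = df['People'].mode()[0]
df['People'] = df['People'].fillna(mode_val)
print(df['People'].tolist())   # ['Solo', 'Solo', 'Couple', 'Solo', 'Solo']
```

mode() ignores NaN by default, so the placeholder rows do not skew the result.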

Fill the missing per capita cost values with the mean:

import numpy as np
import re
df['New per capita cost'] = '0'
df
for index, row in df.iterrows():
    if '-' not in row['Per capita cost']:
        a = float(re.findall(r'\d+', str(row['Per capita cost']))[0])
        df.loc[index, 'New per capita cost'] = str(a)
df
# '0' still marks the rows that had no cost; turn them into NaN
df['New per capita cost'].replace('0', np.nan, inplace=True)
df
# Fill with the mean:
mean_val = round(df['New per capita cost'].astype(float).mean(), 2)
mean_val
df['New per capita cost'].fillna(mean_val, inplace=True)
df
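The extract-then-replace dance above can be shortened with pd.to_numeric, whose errors='coerce' turns any non-numeric placeholder into NaN in one step. A sketch on made-up cost values:

```python
import pandas as pd

# Made-up costs; '-' is the missing-value placeholder from the crawl
df = pd.DataFrame({'Per capita cost': ['3000', '-', '5000', '-']})

# Non-numeric placeholders become NaN, so no regex or replace pass is needed
cost = pd.to_numeric(df['Per capita cost'], errors='coerce')
df['New per capita cost'] = cost.fillna(round(cost.mean(), 2))
print(df['New per capita cost'].tolist())   # [3000.0, 4000.0, 5000.0, 4000.0]
```

mean() skips NaN by default, so only the real costs contribute to the fill value.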

Finally, drop the old columns and save as CSV:

df.drop(labels=['Departure time', 'Per capita cost', 'People', 'Views', 'Days', 'Route'],
        axis=1, inplace=True)
df
df.to_csv('new_Travel.csv', encoding='utf-8-sig')

That completes this simple practice case of travel data collection, cleaning, and mining. I have accumulated many such cases and will keep writing them up and sharing them. If you found this useful, please follow me; your support is my greatest motivation.