SmartScraper: A Simple, Automatic, and Fast Python Web Scraper

Last update: Apr 16, 2022

Related tags

Web Crawling smartscraper

Overview

Note: The original code behind SmartScraper was developed by Alireza Mika as AutoScraper; I have changed only a small amount of that code.

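SmartScraper makes scraping page data easy. You no longer need to learn element-locating packages such as pyquery or beautifulsoup; you only provide a URL and some sample data, and it learns the rules for locating that data on the page.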
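To make the idea of "learning a locating rule from an example" concrete, here is a toy sketch built only on the standard library. It is not SmartScraper's or AutoScraper's actual algorithm: the rule format (a path of tag and class pairs), the helper names `text_nodes`, `learn_rule`, and `apply_rule`, and the sample markup are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathParser(HTMLParser):
    """Collect (path, text) pairs, where path is the chain of
    (tag, class) pairs from the root down to the enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def text_nodes(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def learn_rule(html, example):
    """Return the path of the first text node equal to `example`, or None."""
    for path, text in text_nodes(html):
        if text == example:
            return path
    return None


def apply_rule(html, rule):
    """Return every text node whose path matches the learned rule."""
    return [text for path, text in text_nodes(html) if path == rule]


# Made-up markup standing in for two listing pages.
PAGE_ONE = ('<ul><li class="item"><h2>活着</h2><p class="pub">余华</p></li>'
            '<li class="item"><h2>白夜行</h2><p class="pub">东野圭吾</p></li></ul>')
PAGE_TWO = ('<ul><li class="item"><h2>房思琪的初恋乐园</h2><p class="pub">林奕含</p></li>'
            '<li class="item"><h2>索拉里斯星</h2><p class="pub">斯坦尼斯瓦夫·莱姆</p></li></ul>')

rule = learn_rule(PAGE_ONE, "活着")
print(apply_rule(PAGE_ONE, rule))
print(apply_rule(PAGE_TWO, rule))
```

One example value is enough to pick out a path, and the same path then selects the sibling items on the training page and on a second page with the same layout.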

1. Installation

pip install smartscraper

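2. Quick Start

2.1 Getting similar results

For example, suppose we want the titles and publication details of 20 books from the Douban Books (豆瓣读书) fiction (小说) tag page.

We use the P1 link (the first page) to train two fields: title and publication details.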

from smartscraper import SmartScraper

# URL of the page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Define the wanted fields, each with one example value copied from the page
wanted_dict = {"title": ["活着"],
               "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"]
              }

# Train: learn from the page at url the rules that locate the wanted_dict values
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

Running the code returns the following results:

{'title': ['活着', 
           '房思琪的初恋乐园', 
           '白夜行', 
           '索拉里斯星', 
           '鄙视',
           ...], 
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元', 
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元', 
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50', 
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元', 
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
          ...]
}
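The results above are a dict of parallel lists. Below is a minimal sketch of turning them into one row per book and then into CSV text. It assumes the i-th title and the i-th pub entry describe the same book, which the output above suggests but the source does not guarantee. The helper names `to_rows` and `rows_to_csv` are illustrative, and the sample reuses two entries from the output above.

```python
import csv
import io


def to_rows(results):
    """Pair up the parallel lists in `results` into one dict per book."""
    titles = results.get("title", [])
    pubs = results.get("pub", [])
    if len(titles) != len(pubs):
        # Refuse to guess the pairing when the lists do not line up.
        raise ValueError("title and pub lists differ in length")
    return [{"title": t, "pub": p} for t, p in zip(titles, pubs)]


def rows_to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "pub"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = {
    "title": ["活着", "白夜行"],
    "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元",
            "[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50"],
}
rows = to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The length check is deliberate: if a field is missing for some book, silently zipping the lists would shift every later pairing.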

Use the scraper we just trained to try fetching titles and publication details from the P2 link (the second page):

scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
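The same call can be repeated to cover several pages. The sketch below assumes the listing keeps the `start` offset pattern seen in the P1 and P2 links (0, 20, ...). `page_urls` and `scrape_pages` are hypothetical helper names, and the pause between requests is a courtesy choice, not something the library requires. Results are stored per URL, so the sketch makes no assumption about what `get_result_similar` returns.

```python
import time


def page_urls(tag, pages, page_size=20):
    """Build listing URLs following the start=0, 20, 40, ... pattern."""
    return [
        f"https://book.douban.com/tag/{tag}?start={i * page_size}&type=T"
        for i in range(pages)
    ]


def scrape_pages(scraper, urls, delay=1.0):
    """Call scraper.get_result_similar on each URL, pausing between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # be polite to the server
        results[url] = scraper.get_result_similar(url)
    return results
```

With a trained scraper, usage would look like `scrape_pages(scraper, page_urls('小说', 3))`.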

2.2 Saving the model

A trained SmartScraper model can be saved and then loaded directly later.

scraper.save('douban_Book.pkl')

Code for loading the model:

scraper.load('douban_Book.pkl')
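Saving and loading combine naturally into a load-or-train pattern. The sketch below uses only the `build`, `save`, and `load` calls shown above. The helper name `load_or_build` is hypothetical, and the scraper is passed in as an argument so the function assumes nothing about how it is constructed.

```python
import os


def load_or_build(scraper, model_path, url, wanted_dict):
    """Reuse a saved model if one exists, otherwise train and save one.

    Returns True if the model was loaded from disk, False if it was trained.
    """
    if os.path.exists(model_path):
        scraper.load(model_path)
        return True
    scraper.build(url, wanted_dict=wanted_dict)
    scraper.save(model_path)
    return False
```

In practice this would be called as `load_or_build(SmartScraper(), 'douban_Book.pkl', url, wanted_dict)`, so the training request is only made the first time.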

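3. Miscellaneous

3.1 Additional project notes

SmartScraper makes only minor modifications (a few lines of code) to AutoScraper, solely to simplify usage.
Original project: https://github.com/alirezamika/autoscraper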

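3.2 Related course

If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting, processing, and analyzing large volumes of text data, I personally recommend the video course 《python网络爬虫与文本数据分析》 (Python Web Scraping and Text Data Analysis). I studied the liberal arts myself and also started out knowing nothing; this course condenses five years of experience. I believe it is explained in a very accessible way o(￣︶￣)o. It covers:

Introduction to Python
Web scraping
Reading data
Introduction to text analysis
Machine learning and text analysis
Applications of text analysis in economics and management research

If you are interested, take a look at 《python网络爬虫与文本数据分析》.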

3.3 Social media

Bilibili: 大邓和他的python
WeChat official account: 大邓和他的python

Owner

DaDeng
