SmartScraper: 简单、自动、快捷的Python网络爬虫

Overview

SmartScraper: 简单、自动、快捷的Python网络爬虫

Note: The origin developer of SmartScraper is Alireza Mika, I only change a little code of AutoScraper.

img

SmartScraper使页面数据抓取变得容易,不再需要学习诸如pyquery、beautifulsoup等定位包,我们只需要提供的url和数据给ta学习网页定位规律即可。

一、安装

pip install smartscraper



二、快速上手

2.1 获取相似结果

例如 我们想从 豆瓣读书-小说 页面获得20本书的书名和出版信息

我们使用P1链接训练书名、出版信息这两个字段

from smartscraper import SmartScraper

# 待训练的网页链接
url = 'https://book.douban.com/tag/小说?start=0&type=T'

#定义 想要的字段
wanted_dict = {"title":["活着"],
               "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"]
              }

# 训练/在url对应的页面中寻找wanted_dict规律
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

运行代码,采集到的results如下

{'title': ['活着', 
           '房思琪的初恋乐园', 
           '白夜行', 
           '索拉里斯星', 
           '鄙视',
           ...], 
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元', 
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元', 
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50', 
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元', 
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
          ...]
}

使用刚刚训练的scraper尝试从 P2链接 获取书名和出版信息

scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')

2.2 保存模型

训练的smartscraper模型可以保存,后续直接调用

scraper.save('douban_Book.pkl')

模型导入代码

scraper.load('douban_Book.pkl')



三、其他

3.1 项目补充说明



3.2 相关课程

如果您是经管人文社科专业背景,编程小白,面临海量文本数据采集和处理分析艰巨任务,个人建议学习《python网络爬虫与文本数据分析》视频课。作为文科生,一样也是从两眼一抹黑开始,这门课程是用五年时间凝缩出来的。自认为讲的很通俗易懂o( ̄︶ ̄)o,

  • python入门
  • 网络爬虫
  • 数据读取
  • 文本分析入门
  • 机器学习与文本分析
  • 文本分析在经管研究中的应用

感兴趣的童鞋不妨 戳一下《python网络爬虫与文本数据分析》进来看看~



3.3 自媒体

Owner
DaDeng
圣 马家沟男子职业技术学院在读博士
DaDeng
A list of Python Bots used to extract data from several websites

A list of Python Bots used to extract data from several websites. Data extraction is for products on e-commerce (ecommerce) websites. Data fetched i

Sahil Ladhani 1 Jan 14, 2022
feapder 是一款简单、快速、轻量级的爬虫框架。以开发快速、抓取快速、使用简单、功能强大为宗旨。支持分布式爬虫、批次爬虫、多模板爬虫,以及完善的爬虫报警机制。

feapder 是一款简单、快速、轻量级的爬虫框架。起名源于 fast、easy、air、pro、spider的缩写,以开发快速、抓取快速、使用简单、功能强大为宗旨,历时4年倾心打造。支持轻量爬虫、分布式爬虫、批次爬虫、爬虫集成,以及完善的爬虫报警机制。 之

boris 1.4k Dec 29, 2022
UsernameScraperTool - Username Scraper Tool With Python

UsernameScraperTool Username Scraper for 40+ Social sites. How To use git clone

E4crypt3d 1 Dec 20, 2022
Html Content / Article Extractor, web scrapping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

Xavier Grangier 3.8k Jan 02, 2023
京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

MaoTai 129 Dec 14, 2022
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
A command-line program to download media, like and unlike posts, and more from creators on OnlyFans.

onlyfans-scraper A command-line program to download media, like and unlike posts, and more from creators on OnlyFans. Installation You can install thi

185 Jul 23, 2022
Scraping weather data using Python to receive umbrella reminders

A Python package which scrapes weather data from google and sends umbrella reminders to specified email at specified time daily.

Edula Vinay Kumar Reddy 1 Aug 23, 2022
Scrapegoat is a python library that can be used to scrape the websites from internet based on the relevance of the given topic irrespective of language using Natural Language Processing

Scrapegoat is a python library that can be used to scrape the websites from internet based on the relevance of the given topic irrespective of language using Natural Language Processing. It can be ma

10 Jul 06, 2022
A package designed to scrape data from Yahoo Finance.

yahoostock A package designed to scrape data from Yahoo Finance. Installation The most simple installation method is through PIP. pip install yahoosto

Rohan Singh 2 May 28, 2022
Web Scraping OLX with Python and Bsoup.

webScrap WebScraping first step. Authors: Paulo, Claudio M. First steps in Web Scraping. Project carried out for training in Web Scrapping. The export

claudio paulo 5 Sep 25, 2022
A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

Charles Dungy 1 Mar 28, 2022
A leetcode scraper to compile all questions in leetcode free tier to text file. pdf also available.

A leetcode scraper to compile all questions in leetcode free tier to text file, pdf also available. if new questions get added, run again to get new questions.

3 Dec 07, 2021
Demonstration on how to use async python to control multiple playwright browsers for web-scraping

Playwright Browser Pool This example illustrates how it's possible to use a pool of browsers to retrieve page urls in a single asynchronous process. i

Bernardas Ališauskas 8 Oct 27, 2022
Scrapy-based cyber security news finder

Cyber-Security-News-Scraper Scrapy-based cyber security news finder Goal To keep up to date on the constant barrage of information within the field of

2 Nov 01, 2021
SkyScrapers: A collection of variety of Scraping Apps

SkyScrapers Collection of variety of Web Scraping Apps The web-scrapers involved

Biplov Pokhrel 3 Feb 17, 2022
让中国用户使用git从github下载的速度提高1000倍!

序言 github上有很多好项目,但是国内用户连github却非常的慢.每次都要用插件或者其他工具来解决. 这次自己做一个小工具,输入github原地址后,就可以自动替换为代理地址,方便大家更快速的下载. 安装 pip install cit 主要功能与用法 主要功能 change 将目标地址转换为

35 Aug 29, 2022
Python Web Scrapper Project

Web Scrapper Projeto desenvolvido em python, sobre tudo com Selenium, BeautifulSoup e Pandas é um web scrapper que puxa uma tabela com as principais e

Jordan Ítalo Amaral 2 Jan 04, 2022