SmartScraper: 简单、自动、快捷的Python网络爬虫

Overview

SmartScraper: 简单、自动、快捷的Python网络爬虫

Note: The origin developer of SmartScraper is Alireza Mika, I only change a little code of AutoScraper.

img

SmartScraper使页面数据抓取变得容易,不再需要学习诸如pyquery、beautifulsoup等定位包,我们只需要提供的url和数据给ta学习网页定位规律即可。

一、安装

pip install smartscraper



二、快速上手

2.1 获取相似结果

例如 我们想从 豆瓣读书-小说 页面获得20本书的书名和出版信息

我们使用P1链接训练书名、出版信息这两个字段

from smartscraper import SmartScraper

# 待训练的网页链接
url = 'https://book.douban.com/tag/小说?start=0&type=T'

#定义 想要的字段
wanted_dict = {"title":["活着"],
               "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"]
              }

# 训练/在url对应的页面中寻找wanted_dict规律
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

运行代码,采集到的results如下

{'title': ['活着', 
           '房思琪的初恋乐园', 
           '白夜行', 
           '索拉里斯星', 
           '鄙视',
           ...], 
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元', 
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元', 
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50', 
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元', 
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
          ...]
}

使用刚刚训练的scraper尝试从 P2链接 获取书名和出版信息

scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')

2.2 保存模型

训练的smartscraper模型可以保存,后续直接调用

scraper.save('douban_Book.pkl')

模型导入代码

scraper.load('douban_Book.pkl')



三、其他

3.1 项目补充说明



3.2 相关课程

如果您是经管人文社科专业背景,编程小白,面临海量文本数据采集和处理分析艰巨任务,个人建议学习《python网络爬虫与文本数据分析》视频课。作为文科生,一样也是从两眼一抹黑开始,这门课程是用五年时间凝缩出来的。自认为讲的很通俗易懂o( ̄︶ ̄)o,

  • python入门
  • 网络爬虫
  • 数据读取
  • 文本分析入门
  • 机器学习与文本分析
  • 文本分析在经管研究中的应用

感兴趣的童鞋不妨 戳一下《python网络爬虫与文本数据分析》进来看看~



3.3 自媒体

Owner
DaDeng
圣 马家沟男子职业技术学院在读博士
DaDeng
Python scraper to check for earlier appointments in Clalit Health Services

clalit-appt-checker Python scraper to check for earlier appointments in Clalit Health Services Some background If you ever needed to schedule a doctor

Dekel 16 Sep 17, 2022
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 03, 2021
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

Porames Vatanaprasan 31 Apr 17, 2022
Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022
Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

Evgeniy Tatarkin 61 Jun 21, 2021
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
京东秒杀商品抢购Python脚本

Jd_Seckill 非常感谢原作者 https://github.com/zhou-xiaojun/jd_mask 提供的代码 也非常感谢 https://github.com/wlwwu/jd_maotai 进行的优化 主要功能 登陆京东商城(www.jd.com) cookies登录 (需要自

Andy Zou 1.5k Jan 03, 2023
Nekopoi scraper using python3

Features Scrap from url Todo [+] Search by genre [+] Search by query [+] Scrap from homepage Example # Hentai Scraper from nekopoi import Hent

MhankBarBar 9 Apr 06, 2022
IGLS - Instagram Like Scraper CLI tool

IGLS - Instagram Like Scraper It's a web scraping command line tool based on python and selenium. Description This is a trial tool for learning purpos

Shreshth Goyal 5 Oct 29, 2021
Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Christian Gracia 0 Jan 22, 2022
Deep Web Miner Python | Spyder Crawler

Webcrawler written in Python. This crawler does dig in till the 3 level of inside addressed and mine the respective data accordingly

Karan Arora 17 Jan 24, 2022
This is my CS 20 final assesment.

eeeeeSpider This is my CS 20 final assesment. How to use: Open program Run to your hearts content! There are no external dependancies that you will ha

1 Jan 17, 2022
Web Scraping Practica With Python

Web-Scraping-Practica Integrants: Guillem Vidal Pallarols. Lídia Bandrés Solé Fitxers: Aquest document és el primer que trobem. A continuació trobem u

2 Nov 08, 2021
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
Explore scraping with BeautifulSoup!

beautifulsoup-scrape Explore scraping with BeautifulSoup! Part One: Start from Shakespeare As my professor is a poet (yes, and he teaches me data and

Chuqin 2 Oct 05, 2022
Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. 🔧 Installation Clone the repo locally.

Rokas 3 Jan 02, 2022
High available distributed ip proxy pool, powerd by Scrapy and Redis

高可用IP代理池 README | 中文文档 本项目所采集的IP资源都来自互联网,愿景是为大型爬虫项目提供一个高可用低延迟的高匿IP代理池。 项目亮点 代理来源丰富 代理抓取提取精准 代理校验严格合理 监控完备,鲁棒性强 架构灵活,便于扩展 各个组件分布式部署 快速开始 注意,代码请在release

SpiderClub 5.2k Jan 03, 2023
PyQuery-based scraping micro-framework.

demiurge PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x. Documentation: http://demiurge.readthedocs.org Installing demiurge $ pip

Matias Bordese 109 Jul 20, 2022
Scrapes all articles and their headlines from theonion.com

The Onion Article Scraper Scrapes all articles and their headlines from the satirical news website https://www.theonion.com Also see Clickhole Article

0 Nov 17, 2021
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

Aditya Gupta 15 May 17, 2022