A crawler of doubamovie

Overview

豆瓣电影

A crawler of doubamovie

一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。

spider.py

start_requests方法为scrapy的方法,我们对它进行重写。

def start_requests(self):
    # 将start_url中的链接通过for循环进行遍历。
    for url in self.start_urls:
        # 通过yield发送Request请求。
        # 这里的Reques注意是scrapy下的Request类。注意不到导错类了。
        # 这里的有3个参数:
        #        1、url为遍历后的链接
        #        2、callback为发送完请求后通过什么方法进行处理,这里通过parse方法进行处理。
        #        3、如果网站设置了防爬措施,需要加上headers伪装浏览器发送请求。

        yield scrapy.Request(url=url, callback=self.parse, headers=self.headler)

使用scrapy的css选择器,定位选取范围div.item

重写parse对start_request()请求到的数据进行处理,通过yield对爬取到的网页数据进行封装

def parse(self, response):
    # 这里使用scrapy的css选择器,既然数据在class为item的div下,那么把选取范围定位div.item
    for quote in response.css('div.item'):
        # 通过yield对网页数据进行循环抓取
        yield {
            # 抓取排名、电影名、导演、主演、上映日期、制片国家 / 地区、类型,评分、评论数量、一句话评价以及电影链接
            "电影链接": quote.css('div.info div.hd a::attr(href)').extract_first(),
            "排名": quote.css('div.pic em::text').extract(),
            "电影名": quote.css('div.info div.hd a span.title::text')[0].extract(),
            "上映年份": quote.css('div.info div.bd p::text')[1].extract().split('/')[0].strip(),
            "制片国家": quote.css('div.info div.bd p::text')[1].extract().split('/')[1].strip(),
            "类型": quote.css('div.info div.bd p::text')[1].extract().split('/')[2].strip(),
            "评分": quote.css('div.info div.bd div.star span.rating_num::text').extract(),
            "评论数量": quote.css('div.info div.bd div.star span::text')[1].re(r'\d+'),
            "引言": quote.css('div.info div.bd p.quote span.inq::text').extract(),
        }
    next_url = response.css('div.paginator span.next a::attr(href)').extract()
    if next_url:
        next_url = "https://movie.douban.com/top250" + next_url[0]
        print(next_url)
        yield scrapy.Request(next_url, headers=self.headler)

pipelines.py

将其存入mongodb数据库中,不用提前创建表。

def __init__(self):
    self.client = pymongo.MongoClient('localhost', 27017)
    scrapy_db = self.client['doubanmovie']  # 创建数据库
    self.coll = scrapy_db['movie']  # 创建数据库中的表格

def process_item(self, item, spider):
    self.coll.insert_one(dict(item))
    return item

def close_spider(self, spider):
    self.client.close()

item.py

数据封装

url = scrapy.Field()
name = scrapy.Field()
order = scrapy.Field()
content = scrapy.Field()
contentnum = scrapy.Field()
year = scrapy.Field()
country = scrapy.Field()
score = scrapy.Field()
vary = scrapy.Field()
Owner
Cats without dried fish
Cats without dried fish
Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

COVID19-WEB-SCRAPER Open Source Tech Lab - Project [SEMESTER IV] OSTL Assignments OSTL Assignments - 1 OSTL Assignments - 2 Project COVID19 India Data

AMEY THAKUR 8 Apr 28, 2022
Create crawler get some new products with maximum discount in banimode website

crawler-banimode create crawler and get some new products with maximum discount in banimode website. این پروژه کوچک جهت یادگیری و کار با ابزار سلنیوم

nourollah rezaei 2 Feb 17, 2022
This is a webscraper for a specific website

This is a webscraper for a specific website. It is tuned to extract the headlines of that website. With some little adjustments the webscraper is able to extract any part of the website.

Rahul Siyanwal 1 Dec 13, 2021
This is a sport analytics project that combines the knowledge of OOP and Webscraping

This is a sport analytics project that combines the knowledge of Object Oriented Programming (OOP) and Webscraping, the weekly scraping of the English Premier league table is carried out to assess th

Dolamu Oludare 1 Nov 26, 2021
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

The open-source web scrapers that feed the Los Angeles Times' California coronavirus tracker. Processed data ready for analysis is available at datade

Los Angeles Times Data and Graphics Department 51 Dec 14, 2022
Iptvcrawl - A scrapy project for crawl IPTV playlist

iptvcrawl a scrapy project for crawl IPTV playlist. Dependency Python3 pip insta

Zhijun 18 May 05, 2022
Complete pipeline for crawling online newspaper article.

Complete pipeline for crawling online newspaper article. The articles are stored to MongoDB. The whole pipeline is dockerized, thus the user does not need to worry about dependencies. Additionally, d

newspipe 4 May 27, 2022
A web scraping pipeline project that retrieves TV and movie data from two sources, then transforms and stores data in a MySQL database.

New to Streaming Scraper An in-progress web scraping project built with Python, R, and SQL. The scraped data are movie and TV show information. The go

Charles Dungy 1 Mar 28, 2022
👁️ Tool for Data Extraction and Web Requests.

httpmapper 👁️ Project • Technologies • Installation • How it works • License Project 🚧 For educational purposes. This is a project that I developed,

15 Dec 05, 2021
京东秒杀商品抢购Python脚本

Jd_Seckill 非常感谢原作者 https://github.com/zhou-xiaojun/jd_mask 提供的代码 也非常感谢 https://github.com/wlwwu/jd_maotai 进行的优化 主要功能 登陆京东商城(www.jd.com) cookies登录 (需要自

Andy Zou 1.5k Jan 03, 2023
Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

Evgeniy Tatarkin 61 Jun 21, 2021
Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

slocpi-scraper Sun Life of Canada Philippines Inc Investment Funds Scraper Install dependencies pip install -r requirements.txt Usage General format:

Daryl Yu 2 Jan 07, 2022
Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 04, 2023
Twitter Claimer / Swapper / Turbo - Proxyless - Multithreading

Twitter Turbo / Auto Claimer / Swapper Version: 1.0 Last Update: 01/26/2022 Use this at your own descretion. I've only used this on test accounts and

Underscores 6 May 02, 2022
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but i eventually gave up and recoded it many many many many

Kevin 12 Dec 19, 2022
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
学习强国 自动化 百分百正确、瞬间答题,分值45分

项目简介 学习强国自动化脚本,解放你的时间! 使用Selenium、requests、mitmpoxy、百度智能云文字识别开发而成 使用说明 注:Chrome版本 驱动会自动下载 首次使用会生成数据库文件db.db,用于提高文章、视频任务效率。 依赖安装 pip install -r require

lisztomania 359 Dec 30, 2022
SkyScrapers: A collection of variety of Scraping Apps

SkyScrapers Collection of variety of Web Scraping Apps The web-scrapers involved

Biplov Pokhrel 3 Feb 17, 2022