🐞 Douban Movie / Douban Book Scarpy

Overview

ScrapyDouban

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

The purpose of maintaining this project is to share some of my practice in the process of using Scrapy, the project covers about 80% of my knowledge of Scrapy, I hope to help friends who are learning Scrapy, please note that the current version of the project is Scrapy 2.5.0.

Docker


Project contains douban_scrapyd douban_db douban_adminer three containers.

The douban_scrapyd container is based on python:3.9-slim-buster, the default installed Python3 libraries are scrapy scrapyd pymysql pillow arrow, default mapping port 6800:6800 to facilitate user access to scrapyd management interface via host IP:6800, login required parameters, username:scrapyd password:public.

The douban_db container is based on mysql:8, root password is public, and the default initialization is to import the docker/mysql/douban.sql file to the douban database.

douban_adminer container is based on adminer:4, default mapping port 8080:8080 to facilitate users to access the database management interface through the host IP:8080, login required parameters, server:mysql username:root password:public.

Project SQL


The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process


First collect Subject ID --> then crawl the detail page by Subject ID to collect data --> finally collect comments by Subject ID

method


$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# enter douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# enter scrapy content
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Grabbing movie data
$ scrapy crawl movie_subject # collect movie Subject ID
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comment
# Grabbing book data
$ scrapy crawl book_subject # collect book Subject ID
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comment

If you want to make changes to your code more easily while testing, you can mount your project in the scrapy directory to the douban_scrapyd container. If you are used to working with scrapyd, you can deploy your project directly to the douban_scrapyd container via scrapyd-client.

Proxy IP


Due to douban's anti-crawler mechanism, the only way to bypass it now is through a proxy IP. ProxyMiddleware middleware is not enabled in the default settings.py. If you really need to use Douban's data to do some research, you can go rent a paid proxy pool.

image download


douban.pipelines.CoverPipeline processes the cover download logic by filtering spider.name, and the save path of the downloaded image files is the /srv/ScrapyDouban/storage directory of the douban_scrapy container.

Owner
Xingbo Jia
~1 year of professional Experience as a Software Engineer with a background in web development data science. Actively interested in software engineering interns
Xingbo Jia
Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

Albert Marrero 1 Jan 12, 2022
Incredibly fast crawler designed for OSINT.

Photon Incredibly fast crawler designed for OSINT. Photon Wiki • How To Use • Compatibility • Photon Library • Contribution • Roadmap Key Features Dat

Somdev Sangwan 9.3k Jan 02, 2023
HappyScrapper - Google news web scrapper with python

HappyScrapper ~ Google news web scrapper INSTALLATION ♦ Clone the repository ♦ O

Jhon Aguiar 0 Nov 07, 2022
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 07, 2023
Scraping weather data using Python to receive umbrella reminders

A Python package which scrapes weather data from google and sends umbrella reminders to specified email at specified time daily.

Edula Vinay Kumar Reddy 1 Aug 23, 2022
Crawl BookCorpus

These are scripts to reproduce BookCorpus by yourself.

Sosuke Kobayashi 590 Jan 03, 2023
Web Scraping OLX with Python and Bsoup.

webScrap WebScraping first step. Authors: Paulo, Claudio M. First steps in Web Scraping. Project carried out for training in Web Scrapping. The export

claudio paulo 5 Sep 25, 2022
A multithreaded tool for searching and downloading images from popular search engines. It is straightforward to set up and run!

🕳️ CygnusX1 Code by Trong-Dat Ngo. Overviews 🕳️ CygnusX1 is a multithreaded tool 🛠️ , used to search and download images from popular search engine

DatNgo 32 Dec 31, 2022
Visual scraping for Scrapy

Portia Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web pag

Scrapinghub 8.7k Jan 05, 2023
Scrapy-soccer-games - Scraping information about soccer games from a few websites

scrapy-soccer-games Esse projeto tem por finalidade pegar informação de tabela d

Caio Alves 2 Jul 20, 2022
✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

XforWorks 3 Mar 07, 2022
Create crawler get some new products with maximum discount in banimode website

crawler-banimode create crawler and get some new products with maximum discount in banimode website. این پروژه کوچک جهت یادگیری و کار با ابزار سلنیوم

nourollah rezaei 2 Feb 17, 2022
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021
NASA APOD Discord Bot - Fetches information from NASA APOD site.

NASA APOD Discord Bot - Fetches information from NASA APOD site.

Astronomy Club IITK 4 Apr 23, 2022
A module for CME that spiders hashes across the domain with a given hash.

hash_spider A module for CME that spiders hashes across the domain with a given hash. Installation Simply copy hash_spider.py to your CME module folde

37 Sep 08, 2022
哔哩哔哩爬取器:以个人为中心

Open Bilibili Crawer 哔哩哔哩是一个信息非常丰富的社交平台,我们基于此构造社交网络。在该网络中,节点包括用户(up主),以及视频、专栏等创作产物;关系包括:用户之间,包括关注关系(following/follower),回复关系(评论区),转发关系(对视频or动态转发);用户对创

Boshen Shi 3 Oct 21, 2021
Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

Lakhdar Belkharroubi 5 Oct 17, 2022
Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a bot

Aliexpress to telegram post Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a b

Fernando 6 Dec 06, 2022
京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

MaoTai 129 Dec 14, 2022
Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Toxicity comments crawler Crawler job that scrapes comments from social media posts and saves them in a S3 bucket. Twitter Tweets and replies are scra

Douglas Trajano 2 Jan 24, 2022