🐞 Douban Movie / Douban Book Scarpy

Overview

ScrapyDouban

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

The purpose of maintaining this project is to share some of my practice in the process of using Scrapy, the project covers about 80% of my knowledge of Scrapy, I hope to help friends who are learning Scrapy, please note that the current version of the project is Scrapy 2.5.0.

Docker


Project contains douban_scrapyd douban_db douban_adminer three containers.

The douban_scrapyd container is based on python:3.9-slim-buster, the default installed Python3 libraries are scrapy scrapyd pymysql pillow arrow, default mapping port 6800:6800 to facilitate user access to scrapyd management interface via host IP:6800, login required parameters, username:scrapyd password:public.

The douban_db container is based on mysql:8, root password is public, and the default initialization is to import the docker/mysql/douban.sql file to the douban database.

douban_adminer container is based on adminer:4, default mapping port 8080:8080 to facilitate users to access the database management interface through the host IP:8080, login required parameters, server:mysql username:root password:public.

Project SQL


The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process


First collect Subject ID --> then crawl the detail page by Subject ID to collect data --> finally collect comments by Subject ID

method


$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# enter douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# enter scrapy content
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Grabbing movie data
$ scrapy crawl movie_subject # collect movie Subject ID
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comment
# Grabbing book data
$ scrapy crawl book_subject # collect book Subject ID
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comment

If you want to make changes to your code more easily while testing, you can mount your project in the scrapy directory to the douban_scrapyd container. If you are used to working with scrapyd, you can deploy your project directly to the douban_scrapyd container via scrapyd-client.

Proxy IP


Due to douban's anti-crawler mechanism, the only way to bypass it now is through a proxy IP. ProxyMiddleware middleware is not enabled in the default settings.py. If you really need to use Douban's data to do some research, you can go rent a paid proxy pool.

image download


douban.pipelines.CoverPipeline processes the cover download logic by filtering spider.name, and the save path of the downloaded image files is the /srv/ScrapyDouban/storage directory of the douban_scrapy container.

Owner
Xingbo Jia
~1 year of professional Experience as a Software Engineer with a background in web development data science. Actively interested in software engineering interns
Xingbo Jia
A python tool to scrape NFT's off of OpenSea

Right Click Bot A script to download NFT PNG's from OpenSea. All the NFT's you could ever want, no blockchain, for free. Usage Must Use Python 3! Auto

15 Jul 16, 2022
A leetcode scraper to compile all questions in leetcode free tier to text file. pdf also available.

A leetcode scraper to compile all questions in leetcode free tier to text file, pdf also available. if new questions get added, run again to get new questions.

3 Dec 07, 2021
Scraping weather data using Python to receive umbrella reminders

A Python package which scrapes weather data from google and sends umbrella reminders to specified email at specified time daily.

Edula Vinay Kumar Reddy 1 Aug 23, 2022
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022
A python module to parse the Open Graph Protocol

OpenGraph is a module of python for parsing the Open Graph Protocol, you can read more about the specification at http://ogp.me/ Installation $ pip in

Erik Rivera 213 Nov 12, 2022
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022
A tool for scraping and organizing data from NewsBank API searches

nbscraper Overview This simple tool automates the process of copying, pasting, and organizing data from NewsBank API searches. Curerntly, nbscrape onl

0 Jun 17, 2021
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021
CRI Scrape is a tool for get general info about Italian Red Cross in GAIA Platform

CRI Scrape CRI Scrape is a tool for get general info about Italian Red Cross in GAIA Platform Disclaimer This code is only for educational purpose. So

Vincenzo Cardone 0 Jul 23, 2022
FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

FilmMikirAPI - A simple rest-api which is used for scrapping on the Kincir website using the Python and Flask package

UserGhost411 1 Nov 17, 2022
Scraping script for stats on covid19 pandemic status in Chiba prefecture, Japan

About 千葉県の地域別の詳細感染者統計(Excelファイル) をCSVに変換し、かつ地域別の日時感染者集計値を出力するスクリプトです。 Requirement POSIX互換なシェル, e.g. GNU Bash (1) curl (1) python = 3.8 pandas = 1.1.

Conv4Japan 1 Nov 29, 2021
Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

nell 2 Dec 22, 2021
Binance Smart Chain Contract Scraper + Contract Evaluator

Pulls Binance Smart Chain feed of newly-verified contracts every 30 seconds, then checks their contract code for links to socials.Returns only those with socials information included, and then submit

14 Dec 09, 2022
mlscraper: Scrape data from HTML pages automatically with Machine Learning

🤖 Scrape data from HTML websites automatically with Machine Learning

Karl Lorey 798 Dec 29, 2022
河南工业大学 完美校园 自动校外打卡

HAUT-checkin 河南工业大学自动校外打卡 由于github actions存在明显延迟,建议直接使用腾讯云函数 特点 多人打卡 使用简单,仅需账号密码以及用于微信推送的uid 自动获取上一次打卡信息用于打卡 向所有成员微信单独推送打卡状态 完美校园服务器繁忙时造成打卡失败会自动重新打卡

36 Oct 27, 2022
A modern CSS selector implementation for BeautifulSoup

Soup Sieve Overview Soup Sieve is a CSS selector library designed to be used with Beautiful Soup 4. It aims to provide selecting, matching, and filter

Isaac Muse 151 Dec 23, 2022
Simple python tool for the purpose of swapping latinic letters with cirilic ones and vice versa in txt, docx and pdf files in Serbian language

Alpha Swap English This is a simple python tool for the purpose of swapping latinic letters with cirylic ones and vice versa, in txt, docx and pdf fil

Aleksandar Damnjanovic 3 May 31, 2022
A social networking service scraper in Python

snscrape snscrape is a scraper for social networking services (SNS). It scrapes things like user profiles, hashtags, or searches and returns the disco

2.4k Jan 01, 2023
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
Find thumbnails and original images from URL or HTML file.

Haul Find thumbnails and original images from URL or HTML file. Demo Hauler on Heroku Installation on Ubuntu $ sudo apt-get install build-essential py

Vinta Chen 150 Oct 15, 2022