A Scrapy spider that uses Postgres as its database, Squid as a proxy server, Redis for de-duplication, and Splash to render JavaScript, all in a microservices architecture built with Docker and Docker Compose

Overview

This is George's Scraping Project

  • To get started, cd into the theZoo directory and run:

  • chmod +x script.sh

  • then: ./script.sh

  • This will spin up a Postgres container, the Python environment, a Redis container, a Squid container (for the proxy) and a Splash container

  • The Docker container will automatically run the JS spider, which is the most complicated one. The other spiders are located under the spider directory, and there are some tests under the /validate directory. These tests use pandas to run SQL queries against Postgres to make sure the data was added to the DB.

  • The project took me two days to complete. I spent most of my time learning about Docker Compose and container networking, as well as the rotating proxies and user agents people add to their spiders.

Below I have outlined the steps I took as I completed the project

Docker

  • I downloaded the Docker Desktop application for macOS
  • Then, as I read through the PDF, I looked up Docker images for the technologies used and found images for Postgres, Squid, Splash, and Redis; a sketch of how the spider reaches each container follows below
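
On the Compose network, each container is reachable by its service name instead of localhost. Below is a minimal sketch of the connection settings, assuming the services are named postgres, redis, squid, and splash; the credentials and database name are placeholders, not the project's actual values.

```python
# settings.py (sketch) -- hostnames are docker-compose service names, which
# resolve on the shared Docker network; ports are the images' defaults.
SPLASH_URL = "http://splash:8050"       # Splash JS-rendering service
REDIS_URL = "redis://redis:6379"        # Redis, used for de-duplication
SQUID_PROXY = "http://squid:3128"       # Squid forward proxy (placeholder name)
DATABASE_URL = "postgresql+psycopg2://user:password@postgres:5432/quotes"
```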

Python Environment

  • I set up a Python virtual environment in my IDE and developed the whole project inside it to keep my packages isolated from the global packages on my machine. Once I was finished and had tested the spiders to make sure they worked properly, I dockerized everything and zipped it up to turn in
  • Packages I installed: pip, setuptools, wheel, Scrapy, Pandas, SQLAlchemy, scrapy-splash, scrapy-redis, and psycopg2-binary
  • I created a requirements.txt file so I could write my pinned package versions into it (e.g., with pip freeze) for easy replication
  • The models.py file contains the SQLAlchemy code and the database schema
  • The pipelines.py file is where our data is sent to Postgres; a sketch of both files follows this list
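
Below is a sketch of how the two files might fit together, assuming a single quotes table; the column names and connection string are illustrative rather than the project's exact schema.

```python
# models.py / pipelines.py sketch -- illustrative schema, not the exact one.
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Quote(Base):
    __tablename__ = "quotes"
    id = Column(Integer, primary_key=True)
    text = Column(Text)
    author = Column(String(128))
    tags = Column(String(256))

class PostgresPipeline:
    """Item pipeline that writes each scraped item to Postgres."""

    def open_spider(self, spider):
        engine = create_engine(
            "postgresql+psycopg2://user:password@postgres:5432/quotes"
        )
        Base.metadata.create_all(engine)   # create the table if it is missing
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        try:
            session.add(Quote(
                text=item.get("text"),
                author=item.get("author"),
                tags=",".join(item.get("tags", [])),
            ))
            session.commit()
        finally:
            session.close()
        return item
```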

The Default Spider

This crawler grabs quotes from the Default endpoint using pagination.

The data is scraped and sent to Postgres as well as written to a JSON file called items.json.
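
A minimal sketch of a spider like this, assuming quotes.toscrape.com-style markup (the selectors are illustrative):

```python
import scrapy

class DefaultSpider(scrapy.Spider):
    name = "default"
    start_urls = ["http://quotes.toscrape.com/page/1/"]

    # Write items to a local JSON file in addition to the Postgres pipeline.
    custom_settings = {"FEEDS": {"items.json": {"format": "json"}}}

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Pagination: follow the "Next" link until it disappears.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```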

The Scroll Spider

This crawler uses scrolling to grab quotes from the Scroll endpoint.

Previously I had used a Puppeteer-like bot where you can specify how far the bot should scroll to reach the data you want. I did not know how to do that with Scrapy, so I looked for an alternative method and found that the data is still paginated in the underlying request. When you inspect the page in the browser dev tools, a console log names the page you are on, so I looked at the request body and worked out how the data was being loaded. At that point I could have used the requests library, but instead I found how to do it with Scrapy. This scraper works the same way as the default one: the page number is appended to the URL to retrieve the next batch of data.
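
A sketch of that approach, assuming the scroll page is backed by a paginated JSON endpoint like /api/quotes?page=N; the response key names below are assumptions:

```python
import json
import scrapy

class ScrollSpider(scrapy.Spider):
    name = "scroll"
    api_url = "http://quotes.toscrape.com/api/quotes?page={}"
    start_urls = [api_url.format(1)]

    def parse(self, response):
        data = json.loads(response.text)
        for quote in data["quotes"]:
            yield {
                "text": quote["text"],
                "author": quote["author"]["name"],
                "tags": quote["tags"],
            }
        # The API says whether more pages exist; if so, bump the page number.
        if data.get("has_next"):
            yield scrapy.Request(
                self.api_url.format(data["page"] + 1),
                callback=self.parse,
            )
```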

The JS Spider

This crawler uses a JS rendering service called Splash to query the JavaScript endpoint in order to grab the quotes.

I had to add the Splash-specific middlewares to the Scrapy settings to make this work, and I added a Splash service to my docker-compose file to run the Splash instance. From there the scraping worked just like the default spider.
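
A sketch of the Splash integration; the middleware settings are the ones the scrapy-splash README documents, and the selectors are illustrative:

```python
# settings.py additions required by scrapy-splash (per its README):
# DOWNLOADER_MIDDLEWARES = {
#     "scrapy_splash.SplashCookiesMiddleware": 723,
#     "scrapy_splash.SplashMiddleware": 725,
#     "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
# }
# SPIDER_MIDDLEWARES = {"scrapy_splash.SplashDeduplicateArgsMiddleware": 100}
# DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

import scrapy
from scrapy_splash import SplashRequest

class JSSpider(scrapy.Spider):
    name = "js"

    def start_requests(self):
        # Splash renders the JavaScript and hands back the finished HTML.
        yield SplashRequest(
            "http://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 1},   # give the page a moment to render
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```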

The Login Spider

This crawler scrapes the login form's hidden input field for the CSRF token, then submits a form request to authenticate and scrapes the rest of the data the same way the default spider does.
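
A sketch using Scrapy's FormRequest.from_response, which copies the form's hidden fields (including the CSRF token) and merges in the values we supply; the credentials and selectors are placeholders:

```python
import scrapy

class LoginSpider(scrapy.Spider):
    name = "login"
    start_urls = ["http://quotes.toscrape.com/login"]

    def parse(self, response):
        # from_response picks up the hidden csrf_token input automatically.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Authenticated now; scrape the quotes like the default spider does.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```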

Notes

  • In the settings file I added a user agent that makes the scraper look like a realistic browser, along with the item pipeline, some configuration for the Docker containers, and a download delay of 2 seconds so the scraper does not hit the site too fast (see the settings sketch at the end of this list).

  • Adding the proxy was a bit tricky for me. I first tried a project called Scylla, but it did not work with my environment, so I looked for alternatives. I ended up using Squid: I created a Docker image for it and added the proxy configuration in the middleware.py file.

  • The pause/resume scraping functionality comes from setting SCHEDULER_PERSIST to True in the settings via the scrapy-redis package.

  • Before containerizing this application I had never used Docker Compose, SQLAlchemy, or Redis, so I learned them quickly in order to integrate them into my project.
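
A settings and middleware sketch pulling these notes together; the module paths, user agent string, and Redis URL are illustrative, not the project's exact configuration:

```python
# settings.py (sketch)
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
              "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36")
DOWNLOAD_DELAY = 2                      # throttle: 2 seconds between requests

ITEM_PIPELINES = {
    "project.pipelines.PostgresPipeline": 300,   # hypothetical module path
}

# scrapy-redis: a Redis-backed scheduler; SCHEDULER_PERSIST keeps the request
# queue in Redis across runs, which is what enables pause/resume.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
SCHEDULER_PERSIST = True
REDIS_URL = "redis://redis:6379"

DOWNLOADER_MIDDLEWARES = {
    "project.middleware.SquidProxyMiddleware": 350,  # hypothetical module path
}
```

```python
# middleware.py (sketch): route every request through the Squid container.
# Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
class SquidProxyMiddleware:
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://squid:3128"
```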

Potential Features in the Future

  • I did not collect much metadata, but I saw a package called scrapy-magicfields that I would have liked to implement to add timestamps and scraped URLs to the DB items

  • I did not set up GUI tools for Postgres and Redis to make the data easier to view; this would have been a nice addition

  • Since only the JS spider is triggered by the script and the others are run manually, I only set up a single table. For a more distributed process, separate models and tables for each spider would have been better, but I wanted to reuse the code, so I left it as it is.

  • Cron job functionality

Owner
George Reyes