A training task for web scraping using Python multithreading and a list of available proxy servers that is updated in real time.

Last update: Feb 10, 2022

Overview

Parallel web scraping

The project is a training task for web scraping with Python multithreading and a list of available proxy servers that is refreshed in real time.

Goal

The script extracts the names and prices of the top 100 crypto coins and stores the data in a database.

Disclaimer

The task is contrived and serves mainly for study purposes. There are countless mature sources that already provide both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with one level of nesting are scraped. The crawl gathers the internal links from the main page and then loops over them.
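The link-gathering step could look roughly like the sketch below. It is only an illustration: the `LinkCollector` class, the `internal_links` helper, the `example.com` URLs, and the sample markup are all assumptions, and the actual script may use a different HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def internal_links(base_url, html):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hypothetical main-page markup standing in for a real response body.
sample_html = """
<a href="/coins/alpha">Alpha</a>
<a href="/coins/beta">Beta</a>
<a href="https://other.example.org/ad">Ad</a>
<a href="/coins/alpha">Alpha again</a>
"""
links = internal_links("https://example.com/", sample_html)
for url in links:
    print(url)  # the real script would fetch and parse each page here
```

External links and repeated links are dropped, so the loop visits each nested page once.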
To avoid getting banned by the remote server, a mechanism for routing requests through proxy servers was implemented.
Free public proxy servers are commonly considered unreliable in terms of availability. To overcome this issue:
- another scraping script extracts a list of free public proxy servers from a website.
- on each launch of the script, the list of 10 proxy servers is refreshed with currently available ones.
- during script execution, some proxy servers become unavailable, so each scraping query walks through the list until it finds a live proxy server to execute the query.
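The "walk the list until a live proxy answers" idea can be sketched as follows. This is a minimal illustration, not the project's code: `fetch_via_first_alive` and `fake_fetch` are hypothetical names, the proxy addresses are placeholders from documentation ranges, and the transport is injected so the example runs without network access.

```python
def fetch_via_first_alive(url, proxies, fetch):
    """Try each proxy in order; return (proxy, body) from the first that works."""
    last_error = None
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except OSError as exc:  # connection refused, timeout, and similar
            last_error = exc
    raise RuntimeError(f"no alive proxy for {url}") from last_error


# Stand-in transport so the sketch runs offline; only one proxy is "alive".
def fake_fetch(url, proxy):
    if proxy != "203.0.113.7:3128":
        raise ConnectionError(f"{proxy} is down")
    return f"<html>fetched {url}</html>"


proxy_pool = ["198.51.100.1:8080", "203.0.113.7:3128", "192.0.2.9:80"]
used_proxy, body = fetch_via_first_alive(
    "https://example.com/coins/alpha", proxy_pool, fake_fetch
)
print(used_proxy)
```

In a real run, `fetch` would wrap an HTTP client call that sends the request through the given proxy with a short timeout, so dead proxies are skipped quickly.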
To speed up the scraping of all 101 web pages, multithreading is used. The work is divided among 4 threads running almost simultaneously.
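One way to split the pages among 4 threads is a thread pool, sketched below. Whether the project uses `concurrent.futures` or raw `threading.Thread` objects is not stated, so treat this as an assumption; `scrape_page` is a placeholder for the real fetch-and-parse step and the URLs are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Placeholder for the real fetch-and-parse step; returns (name, price)."""
    name = url.rsplit("/", 1)[-1]
    return name, float(len(name))  # dummy price so the sketch runs offline


# Hypothetical URLs of the 100 nested coin pages.
urls = [f"https://example.com/coins/coin{i}" for i in range(100)]

# Four worker threads pull URLs from a shared queue managed by the pool.
with ThreadPoolExecutor(max_workers=4) as pool:
    rows = list(pool.map(scrape_page, urls))

print(len(rows))
```

Threads suit this workload because each worker spends most of its time waiting on the network; `pool.map` also returns results in the same order as the input URLs.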
The extracted data is written directly to a database.
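The database engine is not named here, so the sketch below assumes SQLite through Python's built-in `sqlite3` module purely for illustration. The `coins` table, its columns, and the `save_coin` helper are hypothetical.

```python
import sqlite3

# In-memory database for the sketch; a real run would pass a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS coins (name TEXT PRIMARY KEY, price REAL)"
)


def save_coin(conn, name, price):
    """Insert a coin row, overwriting the price if the name already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)",
            (name, price),
        )


save_coin(conn, "Alpha", 12.5)
save_coin(conn, "Beta", 0.75)
save_coin(conn, "Alpha", 13.0)  # a later scrape replaces the earlier price

stored = conn.execute("SELECT name, price FROM coins ORDER BY name").fetchall()
print(stored)
```

By default a `sqlite3` connection may only be used from the thread that created it, so in a multithreaded scraper the rows would either be collected and written from the main thread or each worker would open its own connection.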

Owner

Kushal Shingote

WebScraper - A script that prints out a list of all EXTERNAL references in the HTML response to an HTTP/S request

🐞 Douban Movie / Douban Book Scarpy

A simple app to scrap data from Twitter.

Parse feeds in Python

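Latest optimized version of a Taobao Moutai purchase script: Taobao Moutai flash-sale sniping with an optimized purchase thread queue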

Audio media crawler for lbry.

For those who don't want to pay $10/month for high school game footage with ads

👨🏼‍⚖️ reddit bot that turns comment chains into ace attorney scenes

This is python to scrape overview and reviews of companies from Glassdoor.

A low-code tool that generates python crawler code based on curl or url

Web scraper for getting price quotes on items

Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN

The open-source web scrapers that feed the Los Angeles Times California coronavirus tracker.

Taobao and Tmall half-price flash-sale sniping: grab TVs, grab Moutai, and beat the scalpers

Dictionary - Application focused on word search through web scraping

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Scraping news from Ucsal portal with Scrapy.

A simple scraper written in Python using the BeautifulSoup library

WebScrapping Project - G1 Latest News

NASA APOD Discord Bot - Fetches information from NASA APOD site.