Web crawling framework based on asyncio.

Last update: Jan 05, 2023

Overview

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
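Because uvloop is an optional, platform-dependent install, code that wants to benefit from it usually falls back to the standard asyncio loop when it is missing. The following is a minimal sketch of that pattern, not a copy of gain's internals; the helper name new_event_loop and the ping coroutine are hypothetical.

```python
import asyncio


def new_event_loop():
    """Return a uvloop event loop if uvloop is importable, else asyncio's default loop."""
    try:
        import uvloop  # optional dependency; may be absent on non-Linux systems
    except ImportError:
        return asyncio.new_event_loop()
    return uvloop.new_event_loop()


async def ping():
    # Trivial coroutine used only to show that the chosen loop runs tasks.
    await asyncio.sleep(0)
    return "pong"


loop = new_event_loop()
try:
    result = loop.run_until_complete(ping())
finally:
    loop.close()
```

The same script then runs unchanged whether or not uvloop is installed.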

Usage

Write spider.py:

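from gain import Css, Item, Parser, Spider
import aiofiles  # third-party: pip install aiofiles


class Post(Item):
    # Each field is extracted from the page with a CSS selector.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a local file.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d intact.
    # URLs matching the first pattern are followed; URLs matching the second are parsed into Post items.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]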


MySpider.run()
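To see which URLs each parser pattern in the spider above would accept, the two regular expressions can be checked on their own with the standard re module. This is only a sketch: the sample URLs and the classify helper are hypothetical, and whether gain applies the patterns with a full match or a search is an assumption here (full match is used).

```python
import re

# Patterns taken from the two parsers in the spider, written as raw strings.
PAGE_PATTERN = r'https://blog.scrapinghub.com/page/\d+/'
POST_PATTERN = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'


def classify(url):
    """Return which pattern a URL matches: 'post', 'page', or 'ignored'."""
    if re.fullmatch(POST_PATTERN, url):
        return 'post'
    if re.fullmatch(PAGE_PATTERN, url):
        return 'page'
    return 'ignored'


# Hypothetical sample URLs, used only to exercise the patterns.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2016/10/27/an-introduction-to-xpath/',
    'https://blog.scrapinghub.com/about/',
]
labels = [classify(url) for url in samples]
```

Checking patterns this way before a crawl helps catch a regex that matches nothing or far too much.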

Or use XPathParser:

from gain import Css, Item, XPathParser, Spider


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    # Each XPath expression selects link URLs; the last one also extracts a Post.
    parsers = [
        XPathParser('//span[@class="category-name"]/a/@href'),
        XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
        XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post),
    ]
    # Optional: send requests through a proxy.
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider, as shown above.
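Both spiders above also set concurrency = 5. As a rough illustration of what such a limit usually means in asyncio code, the sketch below caps the number of simultaneous "requests" with a semaphore. It is a generic pattern, not gain's actual implementation; fake_fetch, crawl, and the example.com URLs are hypothetical, and asyncio.run requires Python 3.7 or later.

```python
import asyncio

CONCURRENCY = 5  # mirrors the spider's concurrency attribute


async def fake_fetch(url, semaphore, state):
    """Pretend to download a URL while holding one of the semaphore's slots."""
    async with semaphore:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0.01)  # stands in for a network request
        state['active'] -= 1
        return url


async def crawl(urls):
    """Fetch all URLs, never running more than CONCURRENCY fetches at once."""
    semaphore = asyncio.Semaphore(CONCURRENCY)
    state = {'active': 0, 'peak': 0}
    results = await asyncio.gather(*(fake_fetch(u, semaphore, state) for u in urls))
    return results, state['peak']


urls = ['https://example.com/page/{}/'.format(i) for i in range(20)]
results, peak = asyncio.run(crawl(urls))
```

Here peak records the largest number of fetches that were in flight together, which stays at or below CONCURRENCY.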

Run python spider.py to start the crawl.

Example

The examples are in the /example/ directory.

Contribution

Open a pull request.
Open an issue.


Owner

Jiuli Gao

