Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia

🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio.

Write less, run faster.

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Write less, run faster (a minimal spider sketch follows the feature list):

Features

  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support
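
A minimal sketch of a spider (hedged: it follows the tutorial-style API referenced in the comments below; the html= keyword and the process_item hook are taken from the official examples and may differ slightly between versions):

from ruia import AttrField, Item, Spider, TextField

class DocItem(Item):
    # Selectors borrowed from the item tutorial
    title = TextField(css_select="title")
    tutorial_link = AttrField(xpath_select="//a[text()='Tutorial']", attr="href")

class DocSpider(Spider):
    start_urls = ["https://docs.python.org/3/"]

    async def parse(self, response):
        # Yield items from the parsed page; each one is handed to process_item
        yield await DocItem.get_item(html=await response.text())

    async def process_item(self, item: DocItem):
        print(item.title, item.tutorial_link)

if __name__ == "__main__":
    DocSpider.start()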

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview
  2. Installation
  3. Define Data Items
  4. Spider Control
  5. Request & Response
  6. Customize Middleware
  7. Write a Plugin

TODO

  • Cache responses for debugging, to cut down repeated requests: ruia-cache
  • Provide an easy way to debug scripts: ruia-shell
  • Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs
  • Request or publish plugins
  • Write or fix documentation
  • Add test cases

Notice: we use black to format the code.

Thanks

Comments
  • Add rtds support.

    I notice that you have tried to use mkdocs to generate the website.

    Here's an example at readthedocs.org, powered by sphinx.

    There's a little bug, but it is still great.

    RTDs

    opened by panhaoyu 32
  • Log crucial information regardless of log-level

    I've reduced the log level of a Spider in my script as I find it too verbose; however, this also filters out crucial info, particularly the after-completion summary (number of requests, time taken, etc.) - https://github.com/howie6879/ruia/blob/651fac54540fe0030d3a3d5eefca6c67d0dcb3c3/ruia/spider.py#L280-L287

    This is code I currently use to reduce verbosity:

    import logging
    
    # Disable logging (for speed)
    logging.root.setLevel(logging.ERROR)
    

    I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?
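
    One interim workaround, as a hedged sketch using only the standard logging module (the "ruia" logger name is an assumption - use whatever name appears in your log lines):

    import logging

    # Quieten everything by default...
    logging.root.setLevel(logging.ERROR)

    # ...but keep one named logger at INFO so its records still reach the
    # root handlers (handler levels permitting).
    logging.getLogger("ruia").setLevel(logging.INFO)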

    opened by abmyii 13
  • `DELAY` attribute specifically for retries

    I assumed the DELAY attr would set the delay for retries but instead it applies to all requests. I would appreciate it if there was a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.
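
    A rough sketch of the behaviour being proposed (RETRY_DELAY here is the proposed setting, not an existing ruia option): sleep only between retries rather than before every request:

    import asyncio

    async def fetch_with_retry(fetch, retries=10, retry_delay=0.1):
        # fetch: any zero-argument coroutine function that performs the request
        for attempt in range(retries + 1):
            try:
                return await fetch()
            except Exception:
                if attempt == retries:
                    raise
                await asyncio.sleep(retry_delay)  # the delay applies to retries only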

    Thank you for this great library!

    opened by abmyii 13
  • Calling `self.start` as an instance method for a `Spider`

    I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

    class Downloader(Spider):
        concurrency = 15
        worker_numbers = 2
    
        # RETRY_DELAY (secs) is time between retries
        request_config = {
            "RETRIES": 10,
            "DELAY": 0,
            "RETRY_DELAY": 0.1
        }
    
        db_name = "DB"
        db_url = "postgresql://..."
        main_table = "test"
    
        def __init__(self, *args, **kwargs):
            # Initialise DB connection
            self.db = DB(self.db_url, self.db_name, self.main_table)
    
        def download(self):
            self.start()
            
            # After completion, commit to DB
            self.db.commit()
    

    I use it by sub-classing for each different spider. However, it seems that self.start cannot be called as an instance method on spiders (since it's a classmethod), and I get this error:

    Traceback (most recent call last):
      File "src/scraper.py", line 107, in <module>
        scraper = Scraper()
      File "src/downloader.py", line 31, in __init__
        super(Downloader, self).__init__(*args, **kwargs)
      File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
        self.request_session = ClientSession()
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
        loop = get_running_loop(loop)
      File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
        loop = asyncio.get_event_loop()
      File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
        raise RuntimeError('There is no current event loop in thread %r.'
    RuntimeError: There is no current event loop in thread 'MainThread'.
    Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
        if not self.closed:
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
        return self._connector is None or self._connector.closed
    AttributeError: 'ClientSession' object has no attribute '_connector'
    

    Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?
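
    One untested sketch worth trying: the traceback comes from aiohttp's ClientSession asking asyncio for a current event loop inside Spider.__init__, so creating and setting a loop before the subclass is instantiated may get past it:

    import asyncio

    # Make sure a current event loop exists in the main thread before the
    # Spider subclass (and therefore its aiohttp ClientSession) is constructed.
    asyncio.set_event_loop(asyncio.new_event_loop())

    downloader = Downloader()  # the subclass defined above
    downloader.download()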

    opened by abmyii 11
  • asyncio `RuntimeError`

    ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
    handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
        self._context.run(self._callback, *self._args)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
        self.remove_writer(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
        self._ensure_fd_no_transport(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
        raise RuntimeError(
    RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>
    

    Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

    One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

        async def clean_<...>(self, value):
            return <function>(value)
    

    Could that be causing it? I tried doing return await ... but the error still persisted.
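
    If <function> does blocking work, one hedged sketch is to push it onto the default thread-pool executor so it never blocks the event loop thread (blocking_work below is a stand-in name; whether this removes the fd error above is untested):

    import asyncio

    async def clean_example(self, value):
        loop = asyncio.get_running_loop()
        # Run the synchronous call in the default thread pool instead of
        # calling it directly on the event loop thread.
        return await loop.run_in_executor(None, blocking_work, value)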

    opened by abmyii 11
  • Show URL in Error for easier debugging

    I think errors would be more useful if they also showed the URL of the parsed page. Example:

    ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

    I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).
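
    In the meantime, a hedged workaround sketch: catch the extraction error in the spider's parse method and log the URL yourself (MyItem is illustrative, and this assumes the response object exposes a url attribute):

    import logging

    async def parse(self, response):
        try:
            item = await MyItem.get_item(html=await response.text())
        except Exception as e:  # deliberately broad for the sketch
            logging.error(f"{e}, url={response.url}")
            return
        yield item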

    opened by abmyii 11
  • Running the example code raises an error

    I'm following the code from https://github.com/howie6879/ruia/blob/master/docs/en/tutorials/item.md

    import asyncio
    from ruia import Item, TextField, AttrField
    
    
    class PythonDocumentationItem(Item):
        title = TextField(css_select='title')
        tutorial_link = AttrField(xpath_select="//a[text()='Tutorial']", attr='href')
    
    
    async def main():
        url = 'https://docs.python.org/3/'
        item = await PythonDocumentationItem.get_item(url=url)
        print(item.title)
        print(item.tutorial_link)
    
    
    if __name__ == '__main__':
        # Python 3.7 required
        asyncio.run(main())
    

    The run produces the expected output (3.9.5 Documentation and tutorial/index.html), but it then raises RuntimeError: Event loop is closed. The full output is shown below:

    [2021:05:06 14:11:55] INFO  Request <GET: https://docs.python.org/3/>
    3.9.5 Documentation
    tutorial/index.html
    Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000018664A679D0>
    Traceback (most recent call last):
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 116, in __del__
        self.close()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 108, in close
        self._loop.call_soon(self._call_connection_lost, None)
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 746, in call_soon
        self._check_closed()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 510, in _check_closed
        raise RuntimeError('Event loop is closed')
    RuntimeError: Event loop is closed
    

    PC: Windows 10 64-bit; Python: 3.9.4 64-bit
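
    A common workaround on Windows, as a hedged sketch (it sidesteps the aiohttp/ProactorEventLoop teardown noise rather than fixing ruia itself): switch to the selector event loop policy before asyncio.run():

    import asyncio
    import sys

    if sys.platform == "win32":
        # Avoid "RuntimeError: Event loop is closed" from
        # _ProactorBasePipeTransport.__del__ at interpreter shutdown.
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

    asyncio.run(main())  # main() as defined in the snippet above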

    opened by qgyhd1234 10
  • I'd like to compare my distributed function scheduling framework against Ruia: whose code is shorter and more flexible for crawling arbitrary sites? Happy to exchange ideas.

    https://github.com/ydf0509/distributed_framework/blob/master/test_frame/car_home_crawler_sample/car_home_consumer.py

    Feel free to compare. If you'd rather not test against Autohome, pick any two-level website and specify the crawl: let's see whose code is shorter, whose style is quicker and more flexible to write, who has more ways to control the crawl, and whose runs faster.

    opened by ydf0509 9
  • `TextField` strips strings which may not be desirable

    https://github.com/howie6879/ruia/blob/8a91c0129d38efd8fcd3bee10b78f694a1c37213/ruia/field.py#L120

    My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.

    opened by abmyii 9
  • Trouble scraping deck.tk/deckstats.net

    For example:

    import asyncio
    from ruia import Request
    
    
    async def request_example():
        url = "https://deck.tk/07Pw8tfr"
        params = {
            'name': 'ruia',
        }
        headers = {
            'User-Agent': 'Python3.6',
        }
        request = Request(url=url, method='GET', params=params, headers=headers)
        response = await request.fetch()
        json_result = await response.json()
        print(json_result)
    
    
    if __name__ == '__main__':
        asyncio.get_event_loop().run_until_complete(request_example())
    

    This simply hangs without resolution. That is, the request is never resolved, and I must Ctrl-C out of it. Scrapy handles this without issue, but I was hoping to transition to ruia. Any ideas?
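
    As a diagnostic, a hedged sketch using plain aiohttp with an explicit timeout (bypassing ruia) can show whether the hang comes from the site or the framework:

    import asyncio
    import aiohttp

    async def probe():
        timeout = aiohttp.ClientTimeout(total=15)
        async with aiohttp.ClientSession(timeout=timeout) as session:
            async with session.get("https://deck.tk/07Pw8tfr",
                                   headers={"User-Agent": "Python3.6"}) as resp:
                print(resp.status, (await resp.text())[:200])

    if __name__ == "__main__":
        asyncio.get_event_loop().run_until_complete(probe())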

    bug enhancement 
    opened by Triquetra 7
  • Would be nice to be able to pass in "start_urls"

    Ruia seems like a brilliant way to write simple and elegant web scrapers, but I can't figure out how to use a different "start_urls" value per run. I want a scraper that can check all links on any given web page, not just whatever start_urls leads me to, while keeping the simplicity and asynchronous power that Ruia provides. Maybe this is already possible, but I can't tell from the documentation or code.
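
    A hedged sketch of one way to do this today: start_urls is just a class attribute, so it can be assigned right before calling start() (the class and function names below are illustrative, and response.url is assumed to exist):

    from ruia import Spider

    class LinkSpider(Spider):
        start_urls = []  # filled in at call time

        async def parse(self, response):
            print(response.url)

    def crawl(urls):
        LinkSpider.start_urls = list(urls)
        LinkSpider.start()

    crawl(["https://example.com/"])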

    opened by JacobJustice 7
  • Improve Chinese documentation

    Toc: Ruia Chinese documentation (Ruia中文文档)

    • [x] Quick start
    • [ ] Getting started
      • [ ] 1. Overview
      • [ ] 2. 爱美妆 (the example site used in the tutorial)
      • [ ] 3. Define an Item
      • [ ] 4. Run a Spider
      • [ ] 5. Customization
      • [ ] 6. Plugins
      • [ ] 7. Help
    • [ ] Core concepts
      • [ ] 1. Request
      • [ ] 2. Response
      • [ ] 3. Item
      • [ ] 4. Field
      • [ ] 5. Spider
      • [ ] 6. Middleware
    • [ ] Development guide
      • [ ] 1. Set up a development environment
      • [ ] 2. Ruia architecture
      • [ ] 3. Write plugins for Ruia
      • [ ] 4. Contribute code
    • [ ] Practical guide
      • [ ] 1. Thoughts on web scraping with Python
    enhancement 
    opened by howie6879 0
Releases (v0.8.0)
Owner
howie.hu
"Remarkable writings we appreciate together; doubtful points we work out together." (奇文共欣赏,疑义相与析)
Crawls the employment websites of several major universities in Jiangsu with Python, notifying users in three ways: WeChat messages, command-line output, or Windows balloon notifications.

crawler_for_university Crawls the employment websites of several major universities in Jiangsu with Python, notifying users in three ways: WeChat messages, command-line output, or Windows balloon notifications. Environment dependencies: wxpy, requests, bs4 and other libraries. Feature description: this Python-based project crawls each university's employment information site for recruitment

8 Aug 16, 2021
A simple flask application to scrape gogoanime website.

gogoanime-api-flask A simple flask application to scrape gogoanime website. Used for demo and learning purposes only. How to use the API The base api

1 Oct 29, 2021
HappyScrapper - Google news web scrapper with python

HappyScrapper ~ Google news web scrapper INSTALLATION ♦ Clone the repository ♦ O

Jhon Aguiar 0 Nov 07, 2022
download NCERT books using scrapy

download_ncert_books download NCERT books using scrapy Downloading Books: You can either use the spider by cloning this repo and following the instruc

1 Dec 02, 2022
Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

Mohammad Sadegh Salimi 4 Aug 30, 2022
Luis M. Capdevielle 1 Jan 14, 2022
Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Pyrics Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes. ./test/run.py provides the full function in terminal cmd

MisterDK 1 Feb 12, 2022
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022
A simple app to scrap data from Twitter.

Twitter-Scraping-App A simple app to scrap data from Twitter. Available Features Search query. Select number of data you want to fetch from twitter. C

Davis David 2 Oct 31, 2022
A scrapy pipeline that provides an easy way to store files and images using various folder structures.

scrapy-folder-tree This is a scrapy pipeline that provides an easy way to store files and images using various folder structures. Supported folder str

Panagiotis Simakis 7 Oct 23, 2022
UdemyBot - A Simple Udemy Free Courses Scrapper

UdemyBot - A Simple Udemy Free Courses Scrapper

Gautam Kumar 112 Nov 12, 2022
A distributed crawler for weibo, building with celery and requests.

A distributed crawler for weibo, building with celery and requests.

SpiderClub 4.8k Jan 03, 2023
Dictionary - Application focused on word search through web scraping

Dictionary - Application focused on word search through web scraping, in addition to other functions such as dictation, spell and conjugation of syllables.

Juan Manuel 2 May 09, 2022
Kusonime scraper using python3

Features Scrap from url Scrap from recommendation Search by query Todo [+] Search by genre Example # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 06, 2023
A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Universal Online Judge Spider Introduction This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/). It also works for all other Onl

TriNitroTofu 1 Dec 07, 2021
API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

NameMC Scrape API This is an api to scrape NameMC using message previews generated by discord. NameMC makes it a pain to scrape their website, but som

Twilak 2 Dec 22, 2021
Auto Join: A GitHub action script to automatically invite everyone to the organization who star your repository.

Auto Invite To The Organization By Star A GitHub Action script to automatically invite everyone to your organization that stars your repository. What

Max Base 11 Dec 11, 2022
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 07, 2023