Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia

Write less, run faster.

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Features

  • Easy: Declarative programming
  • Fast: Powered by asyncio
  • Extensible: Middlewares and plugins
  • Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

  1. Overview
  2. Installation
  3. Define Data Items
  4. Spider Control
  5. Request & Response
  6. Customize Middleware
  7. Write a Plugin

TODO

  • Cache for debugging, to avoid hitting request limits: ruia-cache
  • Provide an easy way to debug the script, ruia-shell
  • Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

  • Report or fix bugs
  • Request or publish plugins
  • Write or fix documentation
  • Add test cases

Notice: We use black to format the code.

Thanks

Comments
  • Add rtds support.

    I noticed that you have tried to use MkDocs to generate the website.

    Here's an example at readthedocs.org, powered by Sphinx.

    There's a little bug, but it is still great.

    RTDs

    opened by panhaoyu 32
  • Log crucial information regardless of log-level

    I've reduced the log level of a Spider in my script because I find it too verbose; however, this also filters out crucial information, particularly the post-completion summary (number of requests, time, etc.): https://github.com/howie6879/ruia/blob/651fac54540fe0030d3a3d5eefca6c67d0dcb3c3/ruia/spider.py#L280-L287

    This is code I currently use to reduce verbosity:

    import logging
    
    # Suppress all log records below ERROR (for speed)
    logging.root.setLevel(logging.ERROR)
    

    I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?
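One way to keep a completion summary visible while everything else is quietened is to give the summary its own logger with an explicit level. The following is a minimal sketch using only the standard `logging` module; the logger names `spider.summary` and `spider.verbose` and the helper `make_summary_logger` are hypothetical and are not part of ruia.

```python
import io
import logging


def make_summary_logger(stream, name="spider.summary"):
    """Return a logger whose INFO records are emitted regardless of the root level."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)   # explicit level, so the root level is not inherited
    logger.propagate = False        # handle records here only; avoids duplicate output
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.handlers = [handler]     # replace handlers so repeated calls don't stack them
    return logger


# Quieten everything else, as in the snippet above.
logging.root.setLevel(logging.ERROR)

buffer = io.StringIO()
summary = make_summary_logger(buffer)

# This logger has no level of its own, so it inherits ERROR from root and stays silent.
logging.getLogger("spider.verbose").info("Request <GET: https://example.com>")

# The summary logger has its own level, so this record is still written.
summary.info("Total requests: %d", 42)
print(buffer.getvalue(), end="")
```

Because the summary logger is separate, a user who really does not want the summary can still silence it by raising that one logger's level.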

    opened by abmyii 13
  • `DELAY` attribute specifically for retries

    I assumed the DELAY attr would set the delay for retries but instead it applies to all requests. I would appreciate it if there was a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.

    Thank you for this great library!
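The distinction requested above can be sketched with plain asyncio: one delay applied before a request, and a separate delay applied only between retries. This is an illustration under assumptions, not ruia's implementation; `fetch_with_retries`, its parameters, and the injectable `sleep` argument are hypothetical names chosen for the example.

```python
import asyncio


async def fetch_with_retries(fetch, retries=3, delay=0.0, retry_delay=0.1,
                             sleep=asyncio.sleep):
    """Call `fetch()` with a per-request delay and a separate per-retry delay."""
    await sleep(delay)                    # plays the role of DELAY: before the request
    for attempt in range(retries + 1):    # one initial attempt plus `retries` retries
        try:
            return await fetch()
        except Exception:
            if attempt == retries:        # out of retries: surface the last error
                raise
            await sleep(retry_delay)      # plays the role of RETRY_DELAY


# Usage with a fake request that fails twice and then succeeds.
calls = []
sleeps = []


async def fake_sleep(seconds):
    sleeps.append(seconds)                # record instead of actually waiting


async def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = asyncio.run(
    fetch_with_retries(flaky, retries=5, delay=0, retry_delay=0.1, sleep=fake_sleep)
)
```

Keeping the two delays as separate parameters means a spider can run with no delay between normal requests while still backing off after a failure.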

    opened by abmyii 13
  • Calling `self.start` as an instance method for a `Spider`

    I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

    class Downloader(Spider):
        concurrency = 15
        worker_numbers = 2
    
        # RETRY_DELAY (secs) is time between retries
        request_config = {
            "RETRIES": 10,
            "DELAY": 0,
            "RETRY_DELAY": 0.1
        }
    
        db_name = "DB"
        db_url = "postgresql://..."
        main_table = "test"
    
        def __init__(self, *args, **kwargs):
            # Initialise DB connection
            self.db = DB(self.db_url, self.db_name, self.main_table)
    
        def download(self):
            self.start()
            
            # After completion, commit to DB
            self.db.commit()
    

    I use it by subclassing it for each spider. However, it seems that self.start cannot be called on a spider instance (since it's a classmethod), giving this error:

    Traceback (most recent call last):
      File "src/scraper.py", line 107, in <module>
        scraper = Scraper()
      File "src/downloader.py", line 31, in __init__
        super(Downloader, self).__init__(*args, **kwargs)
      File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
        self.request_session = ClientSession()
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
        loop = get_running_loop(loop)
      File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
        loop = asyncio.get_event_loop()
      File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
        raise RuntimeError('There is no current event loop in thread %r.'
    RuntimeError: There is no current event loop in thread 'MainThread'.
    Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
        if not self.closed:
      File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
        return self._connector is None or self._connector.closed
    AttributeError: 'ClientSession' object has no attribute '_connector'
    

    Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?
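The traceback shows a ClientSession being created in `__init__` when no event loop is available. One general pattern that keeps a `download()` entry point is to build the instance inside the running loop and do the post-run work afterwards. The sketch below uses only asyncio; `LoopBoundSpider`, `crawl`, and `_run` are hypothetical stand-ins and do not reflect ruia's actual `Spider` API.

```python
import asyncio


class LoopBoundSpider:
    """Stand-in for a spider whose constructor needs a running event loop."""

    def __init__(self):
        # Raises RuntimeError when called outside a running loop, which mirrors
        # the kind of failure shown in the traceback above.
        self.loop = asyncio.get_running_loop()
        self.pages = []
        self.committed = False

    async def crawl(self):
        await asyncio.sleep(0)            # placeholder for real requests
        self.pages.append("page-1")

    @classmethod
    async def _run(cls):
        spider = cls()                    # constructed inside the loop, so __init__ succeeds
        await spider.crawl()
        return spider

    @classmethod
    def download(cls):
        spider = asyncio.run(cls._run())  # the loop exists for the whole crawl
        spider.committed = True           # stand-in for self.db.commit()
        return spider


spider = LoopBoundSpider.download()
```

Subclasses inherit `download()` unchanged, so per-spider classes only need to override the crawling logic.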

    opened by abmyii 11
  • asyncio `RuntimeError`

    ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
    handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
    Traceback (most recent call last):
      File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
        self._context.run(self._callback, *self._args)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
        self.remove_writer(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
        self._ensure_fd_no_transport(fd)
      File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
        raise RuntimeError(
    RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>
    

    Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

    One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

        async def clean_<...>(self, value):
            return <function>(value)
    

    Could that be causing it? I tried doing return await ... but the error still persisted.
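Independently of whether it relates to the error above, a blocking helper called from a coroutine holds up the event loop until it returns. A minimal sketch of moving such a call to a worker thread with `loop.run_in_executor` follows; `slow_normalize` and `clean_title` are hypothetical names standing in for the elided function and clean method.

```python
import asyncio
import time


def slow_normalize(value):
    """Hypothetical blocking helper (stands in for the elided function above)."""
    time.sleep(0.01)                      # simulates blocking work
    return value.strip().lower()


async def clean_title(value):
    loop = asyncio.get_running_loop()
    # Run the blocking call in the default thread pool so the event loop keeps running.
    return await loop.run_in_executor(None, slow_normalize, value)


result = asyncio.run(clean_title("  Hello World "))
```

Note that `return await f(value)` only works when `f` is itself a coroutine function; wrapping a plain function this way is what makes it awaitable.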

    opened by abmyii 11
  • Show URL in Error for easier debugging

    I think errors would be more useful if they also showed the URL of the parsed page. Example:

    ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

    I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).
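One alternative to passing the url parameter through every call is to attach it to the exception at the point where the page is known. The sketch below is a plain-Python illustration; `ExtractionError` and `extract_title` are hypothetical and are not ruia classes or functions.

```python
class ExtractionError(Exception):
    """Error that optionally carries the URL of the page being parsed."""

    def __init__(self, message, url=None):
        super().__init__(message)
        self.url = url

    def __str__(self):
        base = super().__str__()
        # Append the URL only when one was supplied.
        return f"{base}, {self.url}" if self.url else base


def extract_title(html, url=None):
    """Toy extractor: fails with the page URL attached when no <title> is found."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        raise ExtractionError("extract title error", url=url)
    return html[start + len("<title>"):end]


try:
    extract_title("<html></html>", url="https://example.com/page")
except ExtractionError as exc:
    message = str(exc)
```

Only the code that already knows the URL has to pass it; anything that logs the exception gets the URL for free.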

    opened by abmyii 11
  • Error when running the example code

    I am following the code in https://github.com/howie6879/ruia/blob/master/docs/en/tutorials/item.md

    import asyncio
    from ruia import Item, TextField, AttrField
    
    
    class PythonDocumentationItem(Item):
        title = TextField(css_select='title')
        tutorial_link = AttrField(xpath_select="//a[text()='Tutorial']", attr='href')
    
    
    async def main():
        url = 'https://docs.python.org/3/'
        item = await PythonDocumentationItem.get_item(url=url)
        print(item.title)
        print(item.tutorial_link)
    
    
    if __name__ == '__main__':
        # Python 3.7 required
        asyncio.run(main())
    

    The script runs and returns the expected results (3.9.5 Documentation and tutorial/index.html), but it then reports an error: RuntimeError: Event loop is closed. The full output is shown below:

    [2021:05:06 14:11:55] INFO  Request <GET: https://docs.python.org/3/>
    3.9.5 Documentation
    tutorial/index.html
    Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000018664A679D0>
    Traceback (most recent call last):
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 116, in __del__
        self.close()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 108, in close
        self._loop.call_soon(self._call_connection_lost, None)
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 746, in call_soon
        self._check_closed()
      File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 510, in _check_closed
        raise RuntimeError('Event loop is closed')
    RuntimeError: Event loop is closed
    

    PC: Windows 10 64-bit; Python: 3.9.4 64-bit
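The traceback above comes from the Proactor event loop, which is the default on Windows since Python 3.8. A commonly reported workaround for this "Event loop is closed" message at interpreter shutdown is to switch to the selector event loop policy before calling `asyncio.run`. The sketch below assumes that workaround applies here; `use_selector_loop_on_windows` is a hypothetical helper name, and `main` is a placeholder for the coroutine in the example above.

```python
import asyncio
import sys


def use_selector_loop_on_windows():
    """Switch to the selector event loop policy on Windows; do nothing elsewhere."""
    if sys.platform.startswith("win"):
        # WindowsSelectorEventLoopPolicy only exists on Windows builds of Python.
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    return False


async def main():
    await asyncio.sleep(0)                # placeholder for the real scraping coroutine
    return "done"


switched = use_selector_loop_on_windows()
result = asyncio.run(main())
```

The selector loop does not support subprocesses on Windows, so this trade-off is only reasonable for scripts that do not spawn them through asyncio.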

    opened by qgyhd1234 10
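  • I'd like to compare my distributed function scheduling framework with yours, to see whose code is shorter and more flexible for crawling any website. Discussion welcome.

    https://github.com/ydf0509/distributed_framework/blob/master/test_frame/car_home_crawler_sample/car_home_consumer.py

    You're welcome to compare. If you'd rather not test with Autohome (汽车之家), you can pick a crawl of any two-level website, and we can see whose code is shorter, whose style is faster to write and more flexible, who offers more control mechanisms, and who runs faster.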

    opened by ydf0509 9
  • `TextField` strips strings which may not be desirable

    https://github.com/howie6879/ruia/blob/8a91c0129d38efd8fcd3bee10b78f694a1c37213/ruia/field.py#L120

    My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.

    opened by abmyii 9
  • Trouble scraping deck.tk/deckstats.net

    For example:

    import asyncio
    from ruia import Request
    
    
    async def request_example():
        url = "https://deck.tk/07Pw8tfr"
        params = {
            'name': 'ruia',
        }
        headers = {
            'User-Agent': 'Python3.6',
        }
        request = Request(url=url, method='GET', params=params, headers=headers)
        response = await request.fetch()
        json_result = await response.json()
        print(json_result)
    
    
    if __name__ == '__main__':
        asyncio.get_event_loop().run_until_complete(request_example())
    

    This simply hangs without resolution. That is, the request is never resolved, and I must Ctrl-C out of it. Scrapy handles this without issue, but I was hoping to transition to ruia. Any ideas?
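While the cause of the hang is investigated, the wait can at least be bounded so that no Ctrl-C is needed. The sketch below wraps an arbitrary coroutine in `asyncio.wait_for`; `fetch_with_timeout` is a hypothetical helper, and `never_answers` simulates a request that never resolves rather than calling ruia.

```python
import asyncio


async def fetch_with_timeout(make_request, timeout):
    """Await `make_request()` but give up after `timeout` seconds."""
    try:
        return await asyncio.wait_for(make_request(), timeout=timeout)
    except asyncio.TimeoutError:
        return None                       # caller decides how to handle a timed-out request


async def never_answers():
    await asyncio.sleep(3600)             # simulates a request that never resolves
    return "unreachable"


result = asyncio.run(fetch_with_timeout(never_answers, timeout=0.05))
```

This does not explain why the request stalls; it only turns an indefinite hang into a recoverable `None`.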

    bug enhancement 
    opened by Triquetra 7
  • Would be nice to be able to pass in "start_urls"

    Ruia seems like a brilliant way to write simple and elegant web scrapers, but I can't figure out how to have a different "start_urls" value. I want a web scraper that can check all links on any GIVEN web page, not just whatever the start_urls lead me to, but also with the simplicity and asynchronous power that Ruia provides. Maybe this is already a feature, but I can't tell from the documentation or code.
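If the spider reads "start_urls" from a class attribute, one way to supply a value at run time is to create a subclass on the fly. The sketch below shows that idea in plain Python; `with_start_urls` is a hypothetical helper, and `LinkChecker` is a stand-in class (real code would subclass ruia's Spider, which this sketch assumes accepts such subclassing).

```python
def with_start_urls(spider_cls, urls):
    """Return a subclass of `spider_cls` whose start_urls is set to `urls`."""
    return type(spider_cls.__name__, (spider_cls,), {"start_urls": list(urls)})


class LinkChecker:
    """Hypothetical stand-in for a spider class with a class-level start_urls."""
    start_urls = []


# The URLs could come from the command line or any other run-time source.
Configured = with_start_urls(
    LinkChecker, ["https://example.com/a", "https://example.com/b"]
)
```

The original class is left untouched, so several differently configured spiders can be created from the same definition.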

    opened by JacobJustice 7
  • Improve Chinese documentation

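    TOC: Ruia Chinese documentation (Ruia中文文档)

    • [x] Quick start
    • [ ] Getting started guide
      • [ ] 1. Overview
      • [ ] 2. Installation
      • [ ] 3. Define an Item
      • [ ] 4. Run a Spider
      • [ ] 5. Customization
      • [ ] 6. Plugins
      • [ ] 7. Help
    • [ ] Basic concepts
      • [ ] 1. Request
      • [ ] 2. Response
      • [ ] 3. Item
      • [ ] 4. Field
      • [ ] 5. Spider
      • [ ] 6. Middleware
    • [ ] Development guide
      • [ ] 1. Set up the development environment
      • [ ] 2. Ruia architecture
      • [ ] 3. Write plugins for Ruia
      • [ ] 4. Contribute code
    • [ ] Practical guide
      • [ ] 1. Thoughts on understanding Python crawlers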
    enhancement 
    opened by howie6879 0
Releases (v0.8.0)
Owner
howie.hu
A remarkable piece of writing is to be enjoyed together; doubtful points are to be analyzed together
howie.hu
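Half-price flash-sale grabbing on Taobao and Tmall: grab TVs, grab Moutai, beat the scalpers

taobao_seckill Half-price flash-sale grabbing on Taobao and Tmall: grab TVs, grab Moutai, beat the scalpers. Dependencies: install the Chrome browser, then download and install the chromedriver that matches your browser version. Web version usage: 1. Before the sale, calibrate your local clock, then add the items you want to grab to the shopping cart. 2. To package it as an executable, you can use pyinstalle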

2k Jan 05, 2023
A Python package that scrapes Google News article data while remaining undetected by Google.

A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https

Geminid Systems, Inc 6 Aug 10, 2022
ChromiumJniGenerator - Jni Generator module extracted from Chromium project

ChromiumJniGenerator - Jni Generator module extracted from Chromium project

allenxuan 4 Jun 12, 2022
Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

slocpi-scraper Sun Life of Canada Philippines Inc Investment Funds Scraper Install dependencies pip install -r requirements.txt Usage General format:

Daryl Yu 2 Jan 07, 2022
Dictionary - Application focused on word search through web scraping

Dictionary - Application focused on word search through web scraping, in addition to other functions such as dictation, spell and conjugation of syllables.

Juan Manuel 2 May 09, 2022
This project was created using Python technology and flask tools to scrape a music site

python-scrapping This project was created using Python technology and flask tools to scrape a music site You need to install the following packages to

hosein moradi 1 Dec 07, 2021
Iptvcrawl - A scrapy project for crawl IPTV playlist

iptvcrawl a scrapy project for crawl IPTV playlist. Dependency Python3 pip insta

Zhijun 18 May 05, 2022
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 06, 2023
A helper library to scrape data from TikTok in one line, using the Influencer Hunters APIs.

TikTok Scraper A utility library to scrape data from TikTok hassle-free Go to the website » View Demo · Report Bug · Request Feature About The Projec

6 Jan 08, 2023
API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

NameMC Scrape API This is an api to scrape NameMC using message previews generated by discord. NameMC makes it a pain to scrape their website, but som

Twilak 2 Dec 22, 2021
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
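Automatic web check-in implemented with Python + Selenium + daily email sending + iCIBA (金山词霸) sentence of the day + "anti-inspirational" quotes (running stably since February)

Automatic web check-in implemented with Python + Selenium. Notes: this check-in script is designed for Zhengzhou University's health check-in; other web check-ins can also use it as a reference. (For my own use; running stably since February.) For learning and exchange only; please do not rely on it. The developer accepts no responsibility for problems caused by using this script, makes no guarantee about its results, and in principle provides no technical support of any kind. To prevent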

Sunday 1 Aug 27, 2022
A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Parallel web scraping The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy serv

Kushal Shingote 1 Feb 10, 2022
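A collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, Baidu Index, VIP/Wanfang, Zlibraty, Oalib, novels, bidding sites, procurement sites, Xiaohongshu

lxSpider A collection of crawler examples, including but not limited to: Taobao, JD.com, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQIYI, Ctrip, 12306, 58, Sohu, Baidu Index, VIP/Wanfang, Zlibraty, Oalib, novel sites, bidding and procurement sites. Introduction: Time flies; I have lost count of how many examples I have written.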

lx 793 Jan 05, 2023
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit for

Dan Claudiu Pop 79 Nov 27, 2022
Current Antarctic large iceberg positions derived from ASCAT and OSCAT-2

Iceberg Locations Antarctic large iceberg positions derived from ASCAT and OSCAT-2. All data collected here are from the NASA SCP website Overview Thi

Joel Hanson 5 Jul 27, 2022
Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Pyrics Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes. ./test/run.py provides the full function in terminal cmd

MisterDK 1 Feb 12, 2022
A powerful annex BUBT, BUBT Soft, and BUBT website scraping script.

Annex Bubt Scraping Script I think this is the first public repository that provides free annex-BUBT, BUBT-Soft, and BUBT website scraping API script

Md Imam Hossain 4 Dec 03, 2022
🤖 Threaded Scraper to get discord servers from disboard.org written in python3

Disboard-Scraper Threaded Scraper to get discord servers from disboard.org written in python3. Setup. One thread / tag If you want to look for multip

Ѵιcнч 11 Nov 01, 2022