Subscrape - A Python scraper for substrate chains

Overview

subscrape

A Python scraper for substrate chains that uses Subscan.

Usage

  • copy config/sample_scrape_config.json to config/scrape_config.json and configure it as needed.
  • make sure there is a data/parachains folder
  • run
  • corresponding files will be created in data/

If a file already exists in data/, that operation will be skipped in subsequent runs.
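
The setup steps and the skip rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `prepare_workspace` and `should_skip` are hypothetical and are not part of subscrape.

```python
import shutil
from pathlib import Path


def prepare_workspace(root: Path) -> Path:
    """Copy the sample config if no config exists yet and create data/parachains."""
    sample = root / "config" / "sample_scrape_config.json"
    target = root / "config" / "scrape_config.json"
    if not target.exists():
        shutil.copyfile(sample, target)
    (root / "data" / "parachains").mkdir(parents=True, exist_ok=True)
    return target


def should_skip(output_file: Path) -> bool:
    """Hypothetical helper: an operation is skipped when its output file exists."""
    return output_file.exists()
```

Deleting a file in data/ would then cause the corresponding operation to run again on the next invocation.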

Configuration

To query extrinsics from Substrate chains, only the module and call are needed.

To query transactions from Moonbeam chains, the contract address and the hex-formatted method id are needed. The method id can be found by opening a transaction on Moonbeam, scrolling to its input, and copying the method id. Example: https://moonriver.moonscan.io/tx/0x35c7d7bdd33c77c756e7a9b41b587a6b6193b5a94956d2c8e4ea77af1df2b9c3
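
If you have the raw input data of a transaction, the method id is its first 4 bytes (8 hex characters after the `0x` prefix). A minimal sketch, where the function name `method_id_from_input` is hypothetical and not part of subscrape:

```python
def method_id_from_input(tx_input: str) -> str:
    """Return the 4-byte method id (0x-prefixed hex) of EVM transaction input data."""
    data = tx_input.lower()
    if data.startswith("0x"):
        data = data[2:]
    if len(data) < 8:
        raise ValueError("input data is shorter than 4 bytes")
    # The first 4 bytes of the call data select the contract method.
    return "0x" + data[:8]
```

For example, input data starting with `0xa9059cbb` followed by the encoded arguments yields the method id `0xa9059cbb`.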

Architecture

An overview is given in this Twitter thread: https://twitter.com/alice_und_bob/status/1493714489014956037

General

We use the following methods in the projects:

SubscanWrapper

There is a class SubscanWrapper that encapsulates the logic around calling Subscan (API documentation: https://docs.api.subscan.io/). If you have a Subscan API key, you can put it in a file called "subscan-key" in the main folder and it will be applied to your calls.
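
How the key file could be picked up is sketched below. This is not SubscanWrapper's actual code: the function names are hypothetical, and the `X-API-Key` header name is an assumption that should be checked against the Subscan API documentation.

```python
from pathlib import Path
from typing import Dict, Optional


def load_subscan_key(folder: Path, filename: str = "subscan-key") -> Optional[str]:
    """Return the API key stored in the key file, or None if there is none."""
    key_file = folder / filename
    if not key_file.exists():
        return None
    key = key_file.read_text().strip()
    return key or None


def build_headers(api_key: Optional[str]) -> Dict[str, str]:
    """Build request headers, adding the key only when one is available."""
    headers = {"Content-Type": "application/json"}
    if api_key is not None:
        # Assumption: Subscan expects the key in an "X-API-Key" header.
        headers["X-API-Key"] = api_key
    return headers
```

Without a key file the calls are simply sent without the extra header.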

ParachainScraper

This is a scraper that knows how to use the SubscanWrapper to fetch data for a parachain and serialize it to disk.

Currently it knows how to fetch addresses and extrinsics.

Comments
  • Introduce SubscanV2 API

    This PR introduces a SubscanV2 API Wrapper as a configurable option. In a future version of Subscrape, this should become the default version.

    What's new?

    • You can now configure which API is used for scraping by setting the _api param in the configuration of a chain.
    • SubscrapeDB was rewritten to store hydrated extrinsics and events in a dedicated SQLite file. The previous storage is being redefined as indexes (or summary elements).
    • New operations extrinsics-list and events-list were added. You can submit lists of extrinsic or event indices and receive hydrated elements.
    • events and extrinsics scraping: module can now be set to None to scrape all extrinsics or events, and name can now be set to None to scrape all calls from the module.
    • The improved paging mechanism of SubscanV2 is used.
    • The 'transfers' operation was fixed.

    This changes the output format of the results. Any code that uses the new scraper might break.

    Configuring the new SubscanV2 API

    config = {
        "kusama": {
            "_api": "SubscanV2",
            "extrinsics": {
                "crowdloan": ["create"]
            },
            "extrinsics-list": [
                "14238250-2"
            ],
            "events": {
                "crowdloan": ["created"]
            },
            "events-list": [
                "14238250-39"
            ],
            "transfers": [
                "Dp27W3mGpkT7BG9SxNsTtWoKcvJSwzF4BU8tadYirkn6Kwx"
            ]
        },
    }
    

    For now, SubscanV1 is still the default version and is used automatically if _api is not given. At some point in the future, the default may change to SubscanV2, and support for SubscanV1 might eventually be discontinued.

    What changed

    SubscanWrapper was refactored to be SubscanBase, from which SubscanV1 and SubscanV2 now inherit. Most logic is in SubscanBase for now. The specific classes only override params at the moment, but I suspect specific method implementations will follow in the future (paging changes in V2).

    A bit of the logic from ParachainScraper had to move into SubscanBase, namely the fetch functions that parameterized the call. I wanted to avoid code duplication or ugly forks in code.
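
The class layout described above can be pictured roughly as follows. This is a simplified sketch, not the code from the PR: the class names and the `_api` param come from the description, while `path_prefix`, `endpoint`, `make_api`, and the URL layout are assumptions made for illustration.

```python
class SubscanBase:
    """Shared logic lives here; subclasses only override parameters."""

    # Assumption: in this sketch the path prefix is the only per-version parameter.
    path_prefix = "api/scan"

    def __init__(self, chain: str) -> None:
        self.chain = chain

    def endpoint(self, method: str) -> str:
        # Assumption: illustrative URL layout, check the Subscan docs.
        return f"https://{self.chain}.api.subscan.io/{self.path_prefix}/{method}"


class SubscanV1(SubscanBase):
    pass


class SubscanV2(SubscanBase):
    path_prefix = "api/v2/scan"


API_CLASSES = {"SubscanV1": SubscanV1, "SubscanV2": SubscanV2}


def make_api(chain: str, chain_config: dict) -> SubscanBase:
    """Pick the wrapper class from the chain's _api param (SubscanV1 if absent)."""
    api_name = chain_config.get("_api", "SubscanV1")
    if api_name not in API_CLASSES:
        raise ValueError(f"unknown _api value: {api_name!r}")
    return API_CLASSES[api_name](chain)
```

Keeping the version-specific differences in class attributes means a later V2-only paging implementation can be added as an overridden method without touching the base class.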

    • [x] Updated Documentation + sample config
    • [x] Updated Tests
    opened by Tomen 1
  • MOVR liquidity provisioning

    • refactor to break apart large re-usable MOVR methods.
    • nearly complete support of scraping liquidity provisioning (but no tax analysis yet)
    • export the MOVR transactions to an .xlsx spreadsheet (using pandas)
    opened by spazcoin 1
  • Add new `extrinsics-list` scrape operation. Refactor SubscrapeDB. Add Unit Tests and Github Automations

    New Feature

    This PR adds a new operation extrinsics-list to scrape extrinsics for a list of extrinsic indexes.

    config = {
        "kusama": {
            "extrinsics-list": [
                "14238250-2"
            ]
        }
    }
    

    To showcase the usage, bin/sample_extrinsic_list.py has been added.
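
An extrinsic index such as "14238250-2" combines the block number and the position of the extrinsic within that block. A small sketch of splitting it apart; the names `ExtrinsicIndex` and `parse_extrinsic_index` are hypothetical and not taken from subscrape:

```python
import re
from typing import NamedTuple


class ExtrinsicIndex(NamedTuple):
    block_number: int
    index_in_block: int


def parse_extrinsic_index(value: str) -> ExtrinsicIndex:
    """Split an index like '14238250-2' into block number and position."""
    match = re.fullmatch(r"(\d+)-(\d+)", value, flags=re.ASCII)
    if match is None:
        raise ValueError(f"not a valid extrinsic index: {value!r}")
    return ExtrinsicIndex(int(match.group(1)), int(match.group(2)))
```

Validating the entries before scraping would catch typos in the config early instead of failing on the API call.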

    Changes to SubscrapeDB

    To read and write single extrinsics to and from the DB the write_extrinsic and read_extrinsic methods have been added.

    This highlighted a weakness of the current design in SubscrapeDB: extrinsics are assumed to be known by module/call. To work around the issue, a new meta index was introduced. SectorizedStorageManager is replaced with a sqlitedict key-value store wrapper, which itself is a wrapper on top of SQLite. This is only the first of a few optimization steps I see, but it is non-breaking so far.
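
To illustrate the key-value idea, here is a minimal sketch that stores extrinsics by index. It deliberately uses the standard library's sqlite3 module as a stand-in for sqlitedict, and although the method names write_extrinsic and read_extrinsic come from this PR, the class name, signatures, and table layout are assumptions.

```python
import json
import sqlite3
from typing import Optional


class ExtrinsicStore:
    """Hypothetical key-value store: extrinsic index -> JSON-encoded extrinsic."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS extrinsics "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )

    def write_extrinsic(self, index: str, extrinsic: dict) -> None:
        # The connection context manager commits on success.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO extrinsics (key, value) VALUES (?, ?)",
                (index, json.dumps(extrinsic)),
            )

    def read_extrinsic(self, index: str) -> Optional[dict]:
        row = self._conn.execute(
            "SELECT value FROM extrinsics WHERE key = ?", (index,)
        ).fetchone()
        return None if row is None else json.loads(row[0])
```

Because the key is the extrinsic index itself, a single extrinsic can be looked up without knowing its module or call.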

    Addition of Unit Tests and Github Automations

    Since this has a huge impact on the storage, I decided to start adding unit tests. I also added Github CI Automations to automatically run the unit tests on every PR.

    opened by Tomen 1
  • export EVM/MOVR transactions to spreadsheet

    Either as a built-in functionality or as an example for the user, add script capability to export decoded MOVR/GLMR transactions into a CSV or XLS file format to be imported into Excel. Extra decoded data like exact input/output quantities and token identifiers should also be exported.

    This feature would not yet try to line up transactions to perform Ins/Outs transaction analysis. This feature would not yet try to format the CSV/XLS export data to match an existing format to be imported into other programs like Rotki.

    Estimate: 3 hours

    enhancement 
    opened by spazcoin 1
  • finish EVM/MOVR token swap decoding

    Subscrape can now decode contract interaction transactions using the contract's ABI. For transactions like token swaps where both input and output tokens are specified, subscrape can now decode those (including event logs). However, swapETHForTokens still needs to be decoded/interpreted: it doesn't specify an input quantity, most likely because it uses native tokens. For this task, complete the analysis and implementation for those specific methods so that DEX token swap decoding is complete.

    Estimate: 2 hours.

    enhancement 
    opened by spazcoin 1
  • new MOVR/GLMR operation: "account_transactions"

    Description

    • support new operation "account_transactions" for retrieving all transactions for an account. Structured so that I think support for filters and skipping can be added later.
    • refactor moonbeam_scraper's "fetch_transactions" so that it can be used to retrieve transactions both for accounts and for an entire contract method (by pulling the processor definition out).
    • a little PEP8 cleanup and reStructured text docstrings for methods

    Testing

    • to make sure your "transactions" operation still works, I ran subscrape.py using your example config file and it chugged through Solarbeam transactions. Looks like about 7935 unique addresses, so hopefully that matches your previous results.

    Future

    • I'd suggest renaming your "transactions" operation to something like contract_transactions.
    • I'd love to use your filtering module on moonriver/moonbeam, but it looks like you've only integrated it with ParachainScraper so far.
    opened by spazcoin 1
  • SubscrapeDB V2 - transition to full SQLite

    Changes:

    • Introducing Extrinsic and Event classes with properties as delivered by Subscan.
    • SubscrapeDB completely overhauled to use full SQLite capabilities instead of just having key-value pairs.
    • _db_connection_string: New config param on the chain level that allows you to define a connection string for SQLAlchemy to use.
    • _auto_hydrate: New config param that defines if scraping extrinsics and events shall automatically hydrate them.
    • removed SubscanV1 and the _api config param
    • _stop_at_known_data: New config param that defines if the scraper should stop scraping metadata when hitting a known element.
    • SubscanWrapper's fetch_extrinsic() and fetch_event() now update any already existing item in the DB. You can explicitly override this by setting the optional update_existing param to False.
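
The effect of _stop_at_known_data can be sketched as a paging loop that ends as soon as it meets an item that is already stored. This is an illustration only: the function name, the "id" field, and the assumption that pages arrive newest first are not taken from the PR.

```python
from typing import Dict, Iterable, List


def scrape_pages(
    pages: Iterable[List[dict]],
    known: Dict[str, dict],
    stop_at_known_data: bool = True,
) -> int:
    """Store new items page by page and return how many were added."""
    added = 0
    for page in pages:
        for item in page:
            key = item["id"]  # hypothetical unique identifier
            if key in known:
                if stop_at_known_data:
                    # Assumption: results are ordered newest first, so
                    # everything after a known item is already stored.
                    return added
                continue
            known[key] = item
            added += 1
    return added
```

With the flag set to False the loop walks through all pages and only skips the items it already has, which is slower but does not rely on the ordering assumption.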
    opened by Tomen 0
  • real life use cases explained

    Add explanations of how to use our scripts and example config files to perform several of the operations that we wrote subscrape to accomplish ourselves.

    contributes to #22

    opened by spazcoin 0
  • shim `setup.py` so editable install possible

    I've previously used setup.py but noticed you were using a pyproject.toml instead so I thought I'd learn about the differences. I found this SO post describing how you could use a shim setup.py to still allow editable installs of the subscrape package. (I didn't even know editable installs were possible before!) Anyways, I've added in that shim to give us maximal utility in the future. https://stackoverflow.com/questions/62983756/what-is-pyproject-toml-file-for

    contributes to #20

    opened by spazcoin 0
  • Feature/sectorized storage

    This pull request abstracts the sectorized storage and retrieval of extrinsics into a dedicated class. This is the first step toward storing types other than extrinsics with the same logic.

    The pull request has been tested with scrape.py

    opened by Tomen 0
  • generically decode DEX swap transactions, including event log decoding

    Description:

    This PR uses @yifei_huang's scripts to decode contract input data as well as to interpret a transaction log file (to figure out the result of a contract interaction, not just the inputs). See article explanation: https://towardsdatascience.com/decoding-ethereum-smart-contract-data-eed513a65f76

    • retrieve the ABIs for each contract interface, instead of archiving copies of each one. This allows us to support all APIs without needing to archive each one.
    • Get token name, symbol, and calculate real transaction values. This required adding support for the Blockscout API also.
    • moonbeam_scraper.py is now able to decode contract input data, and then retrieve transaction receipts for contract interactions so that it can decode the internal dex 'trace' transactions to determine the exact input and output quantities for swap transactions. Previously, we were using the internal 'Swap' transactions, but those were too difficult to match up for a multi-hop swap (i.e. ROME -> MOVR -> SOLAR) and a few implementations were confusing with multiple Swap inputs. So instead the code now only sums up the Transfer input quantities from the source acct. This is done in a generic way without requiring classes for each DEX contract implementation which works as long as DEX contracts all follow general patterns and naming conventions.
    • decode errors for DPS contracts were due to @yifei_huang's initial convert_to_hex functionality not being able to handle more complex data structures. When working down through nested structures, it missed buried byte arrays and didn't convert them to hex. json.dumps(decoded_func_params) was throwing an exception because the bytes weren't JSON serializable. The solution was to look for lists embedded inside a tuple/struct.
    • decode errors for 0xTaylor's puzzle contract were intentionally caused by poor utf8 string handling. Therefore still print out a warning the first time that contract is encountered and can't be decoded. Include full traceback and diagnostic messages instead of swallowing them.
    • add rate limiting to our block explorer API calls, to make sure we always get a valid response from the APIs. Adapt the rate limit for subscan.io depending on whether an API key is provided.
    • support moonscan.io api keys
    • use httpx for GET and POST instead of the requests library. Queries to the moonscan.io API started failing. Using Wireshark I tracked it down to the use of the "requests" python package, which doesn't support HTTP/2. It was submitting the moonscan.io GET request with HTTP/1.1, and moonscan.io then responded with HTTP/1.1 403 Forbidden and the cloudflare error message. Whereas if I hand-crafted the same GET request as a URL string, it would be accepted and I received JSON back with the transaction history. Wireshark showed that these handcrafted URLs were being submitted as HTTP/2. The solution was to migrate GET and POST HTTP messages to use the httpx package instead of requests (and with pip you must install the httpx optional support for HTTP/2). I went ahead and changed subscan.io calls to use httpx also, but with minimal testing.
    • update docs to explain how moonbeam_scraper works now, how to interpret its output data, what typical warnings/errors are.
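
The byte-array problem mentioned in the list above can be illustrated with a small recursive converter. This is a sketch written for this description, not the convert_to_hex from the PR or from @yifei_huang's scripts; only the function name is reused.

```python
import json
from collections.abc import Mapping
from typing import Any


def convert_to_hex(value: Any) -> Any:
    """Recursively replace bytes with 0x-prefixed hex strings."""
    if isinstance(value, (bytes, bytearray)):
        return "0x" + bytes(value).hex()
    if isinstance(value, Mapping):
        return {key: convert_to_hex(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Lists nested inside tuples/structs are handled by the recursion.
        return [convert_to_hex(item) for item in value]
    return value


# Made-up decoded parameters with a byte array buried in a nested structure.
decoded_func_params = {"order": (b"\x01\x02", [b"\xff", 5]), "amount": 10}
serialized = json.dumps(convert_to_hex(decoded_func_params))
```

Calling json.dumps on the unconverted parameters would raise a TypeError, because bytes are not JSON serializable.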
    opened by spazcoin 0
  • DB path should be configurable

    In case many different processes are performed from the same integration, the DB size might grow to a point where execution time suffers from overlapping data.

    Making the DB path configurable can mitigate this issue.

    enhancement 
    opened by Tomen 0
  • Transfers scraping might complete prematurely

    If a transfer (or even a balances.transfer event) is queried from two different addresses, the later one might result in reporting an existing entry in the DB which leads the scrape to complete prematurely.

    bug 
    opened by Tomen 0
  • extract ABI for unverified EVM contracts

    Currently, we only parse contract interactions for verified contracts since they have published ABIs to help us interpret the call data. However, there's a new tool to guess the ABI from an unverified contract. https://mobile.twitter.com/w1nt3r_eth/status/1575848008985370624

    Get to it!

    opened by spazcoin 0
  • scrape EVM/MOVR token transfers

    We can already decode contract interactions for EVM DEX token swaps. However, what about simply sending ERC-20 from one account to another? (or even XC-20 tokens) This task is to investigate what those operations look like on-chain and add a script to decode them.
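
As a starting point for the investigation, a plain ERC-20 transfer is a call to transfer(address,uint256), whose input data is the 4-byte selector 0xa9059cbb followed by two 32-byte words. A minimal sketch, where the function name is hypothetical and XC-20 tokens, transferFrom, and event logs are not covered:

```python
TRANSFER_SELECTOR = "a9059cbb"  # selector of transfer(address,uint256)


def decode_erc20_transfer(tx_input: str) -> dict:
    """Decode recipient and raw amount from ERC-20 transfer input data."""
    data = tx_input.lower()
    if data.startswith("0x"):
        data = data[2:]
    # 4-byte selector plus two 32-byte arguments, all hex encoded.
    if len(data) != 8 + 2 * 64 or data[:8] != TRANSFER_SELECTOR:
        raise ValueError("not a plain ERC-20 transfer call")
    recipient_word = data[8:72]
    amount_word = data[72:136]
    return {
        # An address is the last 20 bytes of its 32-byte word.
        "to": "0x" + recipient_word[24:],
        # Raw amount in the token's smallest unit (decimals not applied).
        "amount": int(amount_word, 16),
    }
```

The token itself is identified by the contract address the transaction was sent to, and the token's decimals are still needed to turn the raw amount into a human-readable quantity.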

    Estimate: 4 hours

    enhancement 
    opened by spazcoin 0
  • Scrape EVM/MOVR liquidity provisioning DEX transactions

    decode generic liquidity provisioning transaction types for EVM DEXes after decoding the transaction input data using the DEX contract's ABI.

    estimate: 8 hours

    enhancement 
    opened by spazcoin 0
Releases (v1.0)
  • v1.0 (Mar 21, 2022)

    This is the first stable version of Subscrape. It is able to scrape extrinsics from parachains and store them locally in SubscrapeDB to then query and transform them as needed.

    Full Changelog: https://github.com/ChaosDAO-org/subscrape/commits/v1.0

Owner
ChaosDAO