A universal package of scraper scripts for humans



Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides scraper scripts for the domains most commonly used in machine learning and data science. It scrapes directly and asynchronously from public API endpoints, which removes the heavy browser overhead and makes it fast and robust to DOM changes. Scrapera currently supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to gather common scraping tasks in one place, so that ML researchers and engineers can focus on their models rather than on the data collection process.
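To illustrate the general idea of scraping API endpoints asynchronously without a browser, here is a minimal sketch using only the standard library. It is not Scrapera's actual implementation, which is not shown here; the names `fetch_json`, `fetch_all`, and `max_concurrency` are hypothetical and chosen only for this example.

```python
import asyncio
import json
import urllib.request


def fetch_json(url, timeout=10.0):
    """Blocking GET that returns the decoded JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


async def fetch_all(urls, fetch=fetch_json, max_concurrency=5):
    """Fetch every URL concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)
    loop = asyncio.get_running_loop()

    async def fetch_one(url):
        async with semaphore:
            # Run the blocking call in a worker thread so the event loop stays free.
            return await loop.run_in_executor(None, fetch, url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(fetch_one(url) for url in urls))
```

Because no browser is launched, the cost per request is a single HTTP round trip, and the semaphore keeps the number of simultaneous requests bounded so the target endpoint is not flooded.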

    DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Prerequisites

    Prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git
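After either install command, one way to confirm that the package is visible to the current interpreter is to query its installed version. This is a small sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution name `scrapera` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("scrapera")
    print(found if found is not None else "scrapera is not installed")
```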

    Usage

    To use any sub-module, import it, instantiate the scraper, and call its scrape method:

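    # Import the scraper class for the target site
    from scrapera.video.vimeo import VimeoScraper
    # Instantiate the scraper
    scraper = VimeoScraper()
    # Scrape the given video URL; '540p' is the requested resolution
    scraper.scrape('https://vimeo.com/191955190', '540p')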

    For more examples, refer to the test folders in the respective modules.
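When several URLs need to be scraped, the same import, instantiate, and execute pattern can be wrapped in a loop that records failures instead of stopping at the first one. The helper below is a hypothetical sketch, not part of Scrapera; it only assumes that the scraper object exposes a `scrape(url, ...)` method, as `VimeoScraper` does in the example above.

```python
def scrape_many(scraper, urls, *scrape_args):
    """Call scraper.scrape(url, *scrape_args) for each URL and collect failures.

    Returns a dict mapping each failed URL to the exception it raised.
    """
    failures = {}
    for url in urls:
        try:
            scraper.scrape(url, *scrape_args)
        except Exception as exc:
            # Broad on purpose: the exception types a scraper may raise
            # are not documented here, so this is an assumption.
            failures[url] = exc
    return failures


# Example usage (requires scrapera to be installed):
# from scrapera.video.vimeo import VimeoScraper
# failures = scrape_many(VimeoScraper(), ['https://vimeo.com/191955190'], '540p')
```

Any URLs left in the returned dict are good candidates for a retry or for an issue report.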

    Contributing

    Scrapera welcomes all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors

    Logo

    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
