A universal package of scraper scripts for humans


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a Chromedriver-free package that provides a variety of scraper scripts for the most commonly used machine learning and data science domains. It scrapes directly and asynchronously from public API endpoints, which removes the heavy browser overhead and makes Scrapera fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.
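
As a rough illustration of the asynchronous, browser-free approach described above, the sketch below fetches several endpoints concurrently with `asyncio`. It is not Scrapera's actual implementation: `fetch_json`, `scrape_all`, and the endpoint names are hypothetical, and the network call is simulated so the example runs offline.

```python
import asyncio
from typing import Dict, List


async def fetch_json(endpoint: str) -> Dict:
    # Hypothetical stand-in for a request to a public JSON endpoint.
    # A real scraper would issue an HTTP request here instead of sleeping.
    await asyncio.sleep(0.01)
    return {"endpoint": endpoint, "items": [1, 2, 3]}


async def scrape_all(endpoints: List[str]) -> List[Dict]:
    # Start every request at once; gather returns results in input order.
    return await asyncio.gather(*(fetch_json(e) for e in endpoints))


def run(endpoints: List[str]) -> List[Dict]:
    # Synchronous entry point that drives the event loop.
    return asyncio.run(scrape_all(endpoints))
```

Because no browser is started and the requests overlap, total time is roughly that of the slowest request rather than the sum of all of them.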

    DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Getting Started

    Prerequisites

    Prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git
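
To confirm that the installation succeeded, one option is to check whether the package can be found on the import path. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper, and the top-level package name `scrapera` is assumed from the import shown in the Usage section.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if the top-level package can be found on the import path."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("scrapera installed:", is_installed("scrapera"))
```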

    Usage

    To use any sub-module, import its scraper class, instantiate it, and call it:

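    # Import the scraper class for the target site
    from scrapera.video.vimeo import VimeoScraper
    # Instantiate it, then pass the video URL and the requested quality ('540p' here)
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')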

    For more examples, please refer to the individual test folders in the respective modules.
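
The import, instantiate, and execute pattern above extends naturally to a batch of URLs. The sketch below is only an illustration: `scrape_many` and `DummyScraper` are hypothetical names, not part of Scrapera, and it assumes only that the scraper exposes a `scrape(url, quality)` method as in the Vimeo example.

```python
from typing import Dict, Iterable, List


def scrape_many(scraper, urls: Iterable[str], quality: str) -> Dict[str, List[str]]:
    """Call scraper.scrape(url, quality) for each URL and record the outcome."""
    outcome: Dict[str, List[str]] = {"ok": [], "failed": []}
    for url in urls:
        try:
            scraper.scrape(url, quality)
            outcome["ok"].append(url)
        except Exception as exc:
            # Keep going so that one bad URL does not stop the whole batch.
            outcome["failed"].append(f"{url}: {exc}")
    return outcome


class DummyScraper:
    """Stand-in used only to exercise the helper without network access."""

    def scrape(self, url: str, quality: str) -> None:
        if "bad" in url:
            raise ValueError("unsupported URL")
```

With a real scraper, the call would look like `scrape_many(VimeoScraper(), urls, '540p')`. The entries collected under `failed` are a convenient starting point when reporting an issue.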

    Contributing

    Scrapera welcomes all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
