Libextract: extract data from websites

    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_
/_/_/_.___/\___/_/|_|\__/_/   \__,_/\___/\__/

Libextract is a statistics-enabled data extraction library for HTML and XML documents, written in Python. Originating from eatiht, the extraction algorithm rests on one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.

Overview

libextract.api.extract(document, encoding='utf-8', count=5)
Given an HTML document (and, optionally, its encoding), return a list of nodes likely to contain data (5 by default).
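
A minimal sketch of the call, assuming a locally saved page.html (hypothetical filename); count caps how many candidate nodes come back:

from libextract.api import extract

# Read raw bytes; extract() handles decoding via the encoding argument
with open('page.html', 'rb') as f:
    nodes = list(extract(f.read(), count=3))  # top 3 candidates instead of the default 5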

Installation

pip install libextract

Usage

Because our definition of "data" is so simple, we expose a single method; post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml's built-in methods for post-processing:

>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...
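
Since the returned nodes are plain lxml elements, the same calls can, for example, stitch the extracted nodes into a single block of article text (a minimal sketch, reusing textnodes from above):

article = '\n'.join(node.text_content() for node in textnodes)
print(article[:200])  # first 200 characters of the recovered article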

The extraction algorithm is as agnostic to article text as it is to tabular data:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
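
Because extract() yields ordinary lxml elements, the table above can be reshaped with lxml's standard iteration methods. A minimal sketch (assuming, as in the output above, that header cells are <th> and data cells are <td>):

table = tabs[0]
headers = [th.text_content().strip() for th in table.iter('th')]
rows = []
for tr in table.iter('tr'):
    cells = [td.text_content().strip() for td in tr.iter('td')]
    if cells:  # skip the header row, which has no <td> cells
        rows.append(dict(zip(headers, cells)))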

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; advice and suggestions as to what this library could and should be are greatly appreciated.

:)
