Scrapegoat is a Python library for scraping websites from the internet based on their relevance to a given topic, irrespective of language, using Natural Language Processing

Last update: Jul 06, 2022

Related tags

Web Crawling scrapegoat

Overview

SCRAPEGOAT

Scrapegoat is a Python library for scraping websites from the internet based on their relevance to a given topic, irrespective of language, using Natural Language Processing. It is mainly intended for non-English languages, where it helps return accurate and relevant scraped text.

Concept

First, the data is scraped from a website and processed (for example, to remove English words if the required data is in another language). The BERT model is then fed the processed data and the topic, and the cosine similarity between the topic and each word of the scraped data is computed. The mean of these cosine similarity scores is taken, and if the mean is greater than a threshold, the scraped data is returned as output. The library also includes a part that uses an adaptive threshold.
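The relevance check described above can be sketched as follows. This is a minimal illustration, not Scrapegoat's actual code: the function names (cosine, mean_similarity, keep_if_relevant), the default threshold of 0.5, and the toy 2-dimensional vectors standing in for BERT embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_similarity(topic_vec, word_vecs):
    """Mean cosine similarity between the topic vector and each word vector."""
    if not word_vecs:
        return 0.0
    return sum(cosine(topic_vec, w) for w in word_vecs) / len(word_vecs)


def keep_if_relevant(text, topic_vec, word_vecs, threshold=0.5):
    """Return (text, score) if the mean score exceeds the threshold, else (None, score)."""
    score = mean_similarity(topic_vec, word_vecs)
    return (text, score) if score > threshold else (None, score)


# Toy vectors (hypothetical); a real pipeline would use model embeddings here.
topic = [1.0, 0.0]
words = [[1.0, 0.0], [0.0, 1.0]]
text, score = keep_if_relevant("sample", topic, words, threshold=0.4)
```

With these toy values, one word points in the same direction as the topic and the other is orthogonal to it, so the mean score sits halfway between the two and the text is kept only for thresholds below that mean.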

BERT Model

BERT, which stands for Bidirectional Encoder Representations from Transformers, is based on the Transformer, a deep learning architecture in which every output element is connected to every input element and the weightings between them are calculated dynamically based on their connection. The BERT framework was pre-trained using text from Wikipedia. The Transformer is the part of the model that gives BERT its increased capacity for understanding context and ambiguity in language. It does this by processing any given word in relation to all other words in a sentence, rather than processing them one at a time. By looking at all surrounding words, the Transformer allows the BERT model to understand the full context of the word, and therefore better understand searcher intent.
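The idea that every output is a dynamically weighted mix of every input can be illustrated with a toy version of scaled dot-product self-attention. This is a simplified sketch under stated assumptions, not BERT itself: it has a single head and no learned projection matrices, and the function names and the 2-dimensional input vectors are made up for the example.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """For each input vector, return a weighted average of all input vectors.

    The weights come from scaled dot products between the current vector
    (the query) and every vector in the sequence (the keys).
    """
    d = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        out = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(out)
    return outputs


# Three hypothetical word vectors forming one "sentence".
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual = self_attention(sentence)
```

Each vector in contextual depends on all three inputs, which is the sense in which a word is processed in relation to every other word in the sentence.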

Cosine Similarity

Cosine similarity is one of the metrics used in Natural Language Processing to measure the text similarity between two documents irrespective of their size. A word can be represented as a vector, so text documents can be represented in an n-dimensional vector space. A cosine similarity score of 1 means the two vectors have the same orientation, and a value closer to 0 indicates that the two documents have less similarity. In general, cosine similarity ranges from -1 to 1; for vectors with only non-negative components, such as word-count vectors, it ranges from 0 to 1.
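For reference, the standard definition can be written out with a small worked example. The vectors a and b below are illustrative values, not data from Scrapegoat.

```latex
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}

% Same direction, different length: a = (1, 2), b = (2, 4)
\cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{5} \, \sqrt{20}} = \frac{10}{10} = 1

% Orthogonal vectors: a = (1, 0), b = (0, 1)
\cos(\theta) = \frac{1 \cdot 0 + 0 \cdot 1}{1 \cdot 1} = 0
```

The first example shows why the measure is independent of document size: b is twice as long as a, yet the score is still 1 because only the direction matters.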

Multi Processing

The multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. The basic idea of multiprocessing is that if an algorithm can be divided among different workers (small processors/cores), the program can be sped up. Machines nowadays come with 4, 6, 8, or 16 cores, so parts of the code can run in parallel.
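As a rough sketch of this idea, the standard library's multiprocessing.Pool can spread independent pieces of work across worker processes. The functions count_words and count_words_parallel are hypothetical names chosen for this example and are not part of Scrapegoat; counting words simply stands in for whatever per-page processing is needed.

```python
from multiprocessing import Pool


def count_words(text):
    """Worker function: number of whitespace-separated tokens in one text."""
    return len(text.split())


def count_words_parallel(texts, workers=2):
    """Apply count_words to every text using a pool of worker processes.

    Pool.map splits the list among the workers and returns the results
    in the same order as the inputs.
    """
    with Pool(processes=workers) as pool:
        return pool.map(count_words, texts)
```

When using this from a script, call count_words_parallel from under an if __name__ == "__main__": guard so that worker processes can import the module safely. The speed-up is only worthwhile when each piece of work is expensive enough to outweigh the cost of starting processes and passing data between them.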

Using Scrapegoat

The examples/test.py file contains the following:

url = "https://hindi.newslaundry.com/2021/01/22/loan-developing-countries-and-epidemics#:~:text=120%20%E0%A4%A8%E0%A4%BF%E0%A4%AE%E0%A5%8D%E0%A4%A8%20%E0%A4%94%E0%A4%B0%20%E0%A4%AE%E0%A4%A7%E0%A5%8D%E0%A4%AF%E0%A4%AE%20%E0%A4%86%E0%A4%AF,8.1%20%E0%A4%85%E0%A4%B0%E0%A4%AC%20%E0%A4%A1%E0%A5%89%E0%A4%B2%E0%A4%B0%20%E0%A4%B9%E0%A5%8B%20%E0%A4%97%E0%A4%AF%E0%A4%BE."
topic = "Debt of developing countries"
language = 'hi'

if __name__ == "__main__":
    from scrapegoat.utils import automate
    from scrapegoat.main import getLinkData

    # Scrape the page (tag='p') and get the text with its score for the topic
    text, score = getLinkData(url, topic, language=language, tag='p')
    print(text, score)


Comments

Type Error
Describe the bug: While running generateData(), a type error was encountered, which displays:

Search() got an unexpected keyword argument 'tld'

To Reproduce: Steps to reproduce the behavior:

# scrape and download data
topic = " cricket"
language = 'hi'
generateData(topic, language, n_links=20
bug
opened by pritamkandula 1
Feature Request to Scrape images given a topic from web based on relevance

Is your feature request related to a problem? Please describe. Finding similar images and downloading them is time-consuming. Please provide a feature for scraping images as well, given a topic.

Describe the solution you'd like: Compare images on the internet and collect the most similar ones, using a transformer or deep learning based approach.
enhancement

opened by Navaneeth-Sharma 0
Need a code to preview wikipedia page

Describe the bug: If a word has several meanings and I want to extract data for the least popular one, the tool opens Wikipedia and collects data for the most popular meaning, so the relevance check cannot work as intended. There should be a print statement to preview the Wikipedia data, and also an option to supply relevant data that we already have.
bug

opened by spect-o-sagar 0

Releases(v1.0.0.7)

v1.0.0.7(Jul 6, 2022)
The Major Features

[x] Scrape automatically with Deep Learning

[x] Generate Data, given a topic

[x] Progress Bar (NEW)

v1.0.0.0(Jul 5, 2022)
The Major Features

Scrape automatically with Deep Learning

Generate Data given a topic

Latest(Sep 29, 2021)

This version scrapes the text faster and computes the similarity score more quickly.
