Web-Scraper-for-a-news-website

This is a web scraper for a specific website, the Economic Times. It is tuned to extract that site's headlines, but with minor adjustments it can extract any other part of the site.

Installation

Install the following:

Selenium: Follow the instructions at https://selenium-python.readthedocs.io/installation.html to install Selenium.
Chromedriver: Check your Chrome browser's version (Menu -> Help -> About Google Chrome) and download the matching Chromedriver from https://sites.google.com/chromium.org/driver/home
TQDM: https://pypi.org/project/tqdm/
BeautifulSoup4: https://pypi.org/project/beautifulsoup4/
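
As a quick check after installing, the sketch below reports which of the Python packages listed above cannot be imported. It assumes the usual import names (selenium, tqdm, bs4). The helper missing_packages is hypothetical and not part of this repository. Chromedriver is a separate binary, so it is not covered here.

```python
from importlib.util import find_spec

# Import names for the packages listed above
# (BeautifulSoup4 is imported as "bs4").
REQUIRED = ["selenium", "tqdm", "bs4"]


def missing_packages(names=REQUIRED):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result only means the modules can be located. It does not confirm that the installed Chromedriver matches your Chrome version.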

Using the webscraper

The scripts must be run in a fixed order. Follow the sequence below:

ET_Archive_Links.py: Run this script first, as everything that follows depends on its output. It collects the initial links from the Archive page of the website.
ET_All_Links_Inside_Archive.py: This script takes the output (CSV file) of the previous script and produces a new file containing the URLs of all archived news on the website since 2002.
ET_Content.py: Finally, this script scrapes the headlines along with their dates. (If you want to scrape any other part of the website, this is the script to edit.)
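
The sequence above can be sketched as a small runner that executes the three scripts in order and stops at the first failure. This is only an illustration: run_pipeline and its runner parameter are hypothetical and not part of this repository, and the sketch assumes each script can be started with no command-line arguments from the repository folder.

```python
import subprocess
import sys

# Script names as listed above. Order matters: the second script
# reads the CSV file written by the first.
PIPELINE = [
    "ET_Archive_Links.py",
    "ET_All_Links_Inside_Archive.py",
    "ET_Content.py",
]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and return the ones that finished successfully.

    Stops at the first script that exits with a non-zero return code,
    so later steps never run on missing or partial input.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            break
        completed.append(script)
    return completed
```

Calling run_pipeline() from the repository folder would run all three steps. The runner parameter exists only so the order logic can be tried out without launching a browser.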

Dataset

I also used the scraper on another news website, "Businessline". Its dataset is available on Kaggle (https://www.kaggle.com/rsiyanwal/20182019-businessline-headlines).

Owner

Rahul Siyanwal
