Complete pipeline for crawling online newspaper articles.

Related tags

Web Crawling, newspipe
Overview

NewsPipe

This repository contains the complete pipeline for collecting online newspaper articles. The articles are stored in MongoDB. The whole pipeline is dockerized, so the user does not need to worry about dependencies. Additionally, a docker-compose setup is available to improve usability.

Requirement

To use this system, you need to create a .env file that provides the database credentials:

MONGO_ROOT_USER=devroot
MONGO_ROOT_PASSWORD=devroot
MONGOEXPRESS_LOGIN=dev
MONGOEXPRESS_PASSWORD=dev
MONGO_CHART_USERNAME=dev
MONGO_CHART_PASSWORD=dev
POSTGRES_USER=airflow
POSTGRES_PASS=airflow
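
Once the stack is running (see Getting Started), these root credentials can also be used to connect to the MongoDB instance on port 27017, for example with pymongo. This is only a sketch: the database and collection names below are placeholders, not necessarily what NewsPipe writes to.

from pymongo import MongoClient

# Root credentials from the .env file above; port 27017 is exposed by docker-compose.
client = MongoClient("mongodb://devroot:devroot@localhost:27017/")

# "newspipe" and "articles" are placeholder names; adjust to the actual database/collection.
collection = client["newspipe"]["articles"]
print(collection.count_documents({}))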

If you want to specify the number of threads, open airflow-newspipe-docker and adjust the sed command in airflow-docker/Dockerfile. For example, for 4 threads per process:

&& sed -i'.orig' 's/max_threads = 2/max_threads = 4/g' ${AIRFLOW_HOME}/airflow.cfg \

You can also specify the number of processes (2 processes in this case):

&& sed -i'.orig' 's/parallelism = 32/parallelism = 2/g' ${AIRFLOW_HOME}/airflow.cfg \
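
With these two substitutions in place, the rebuilt image's ${AIRFLOW_HOME}/airflow.cfg contains:

parallelism = 2
max_threads = 4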

Getting Started

To start this application, run:

docker-compose up
  • To browse the database collections, mongo-express is available at localhost:8081. The MongoDB instance itself is exposed on port 27017.
  • The Airflow web UI should be available at localhost:8083. You will see the Airflow dashboard with the default examples.
  • For the MongoDB Charts dashboard, open localhost

Adding article sources

Each crawler is defined as a DAG in the dags folder. To add a data source, you therefore add a new DAG there. A DAG is a Python script that contains the settings for an entire crawling pipeline. Use the default example as a template; the DAGs are simple and straightforward.

import os
import datetime

from dag_factory import create_dag

url = "taz.de" # url of newspaper source

# Defining the crawling intervals
airflow_config = {'schedule_interval': '@hourly', # set a interval, for continuous crawling
                  'start_date': datetime.datetime(2020, 6, 4, 21), # set a date, on which the dag will run
                  'end_date':datetime.datetime(2020, 6, 5, 6), # optinal, set if it is needed
                  }

# Create crawling DAG
DAG = create_dag(url=url,
                 airflow_config=airflow_config,
                 name=os.path.basename(__file__))
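
For intuition, a DAG factory such as create_dag typically builds an airflow.DAG from the passed configuration and attaches the crawling task(s) to it. The following is only a rough sketch under that assumption, not the actual implementation in dag_factory:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.x import path


def create_dag(url, airflow_config, name):
    # Build the DAG from the settings passed in by the DAG file.
    dag = DAG(dag_id=name,
              schedule_interval=airflow_config['schedule_interval'],
              start_date=airflow_config['start_date'],
              end_date=airflow_config.get('end_date'))

    def crawl():
        # Placeholder for the actual crawling and MongoDB insert logic.
        print(f"crawling {url}")

    PythonOperator(task_id='crawl', python_callable=crawl, dag=dag)
    return dag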

Options for schedule_interval:

preset      meaning                                                      cron
@once       Schedule once and only once
@hourly     Run once an hour at the beginning of the hour                0 * * * *
@daily      Run once a day at midnight                                   0 0 * * *
@weekly     Run once a week at midnight on Sunday morning                0 0 * * 0
@monthly    Run once a month at midnight of the first day of the month   0 0 1 * *
@quarterly  Run once a quarter at midnight on the first day              0 0 1 */3 *
@yearly     Run once a year at midnight of January 1                     0 0 1 1 *
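
Besides these presets, schedule_interval also accepts a plain cron expression. As an illustration (the values below are chosen purely for the example), a DAG that crawls every 30 minutes could be configured like this:

import datetime

# Illustrative only: a cron expression instead of a preset.
airflow_config = {'schedule_interval': '*/30 * * * *',  # every 30 minutes
                  'start_date': datetime.datetime(2020, 6, 4, 21),
                  }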

Mongo Charts

MongoDB Charts is a data visualization tool integrated into the MongoDB ecosystem. By default, no visualizations are shipped with NewsPipe, so you have to create dashboards to fit your needs. This involves the following three steps, all of which are well documented on docs.mongodb.com:

  • Set up the data source
  • Aggregate the data (see the sketch after this list)
  • Create the dashboard
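
To illustrate the aggregation step, the sketch below counts stored articles per newspaper source with a MongoDB aggregation pipeline, shown here via pymongo. The database, collection, and field names are assumptions; in Charts itself an equivalent pipeline can be defined in the chart's query/aggregation settings.

from pymongo import MongoClient

client = MongoClient("mongodb://devroot:devroot@localhost:27017/")
articles = client["newspipe"]["articles"]  # placeholder database/collection names

# Hypothetical pipeline: number of stored articles per newspaper source.
pipeline = [
    {"$group": {"_id": "$source", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in articles.aggregate(pipeline):
    print(row)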

Credentials:

The credentials for MongoDB Charts are:

Connection URI


Comments
  • Airflow Web UI crashing

    Web UI crashes after some time. It seems that the problem lies with Postgres. This is not a serious issue, because the pipeline will continue working without the UI.

    opened by steven-mi 1
  • docs: Fix a few typos

    There are small typos in:

    • README.md

    Fixes:

    • Should read usability rather than useability.
    • Should read optional rather than optinal.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
Releases(1.1)
  • 1.1(Jan 2, 2021)

    New Features:

    • MongoDB Chart for data visualization
    • Refactored DAG factory interface
    • No data cleaning in the DAG; data is cleaned by the crawler (less code duplication, better testability of the cleaning module)
    Source code(tar.gz)
    Source code(zip)
  • 1.0(Oct 12, 2020)

    New features:

    • DAG for updating old MongoDB documents
    • Added goose for text extraction to make it more stable (e.g., Spiegel works now!)
    • Code refactor for DAG factory
    Source code(tar.gz)
    Source code(zip)
  • 0.1.1-distributed(Sep 7, 2020)

  • 0.1-parallel(Sep 7, 2020)

  • 0.1-local(Sep 7, 2020)

Owner
newspipe