Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Overview

Toxicity comments crawler

Quality Gate Status

Crawler job that scrapes comments from social media posts and saves them in a S3 bucket.

Twitter

Tweets and replies are scraped from Twitter API for a given list of users.

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxic level of a given comment is calculated using the Perspective API.

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

Variable Description Default Required
AWS_ROLE_ARN AWS Role ARN None Optional
AWS_WEB_IDENTITY_TOKEN_FILE AWS Web Identity Token File None Optional
AWS_ACCESS_KEY_ID AWS Access Key ID None Optional
AWS_SECRET_ACCESS_KEY AWS Secret Access Key None Optional
AWS_S3_BUCKET AWS S3 Bucket None Required
AWS_S3_BUCKET_PREFIX AWS S3 Bucket Prefix None Required
LOG_LEVEL Log level INFO Optional
PERSPECTIVE_API_KEY Perspective API Key None Required
PERSPECTIVE_THRESHOLD Perspective Threshold 0.5 Required
FILTER_TOXIC_COMMENTS Filter Toxic Comments True Required
TWITTER_CONSUMER_KEY Twitter Consumer Key None Required
TWITTER_CONSUMER_SECRET Twitter Consumer Secret None Required
TWITTER_ACCESS_TOKEN Twitter Access Token None Required
TWITTER_ACCESS_TOKEN_SECRET Twitter Access Token Secret None Required
TWITTER_MAX_TWEETS Twitter Max Tweets or replies None Required

If AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are provided, the crawler will use them to assume a role, and will not use AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.

Running

Prerequisites

Then, you can run the crawler with the following command:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest

License

The project is licensed under the Apache 2.0 License.

You might also like...
This program scrapes information and images for movies and TV shows.

Media-WebScraper This program scrapes information and images for movies and TV shows. Summary For more information on the program, read the WebScrape_

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js. Documentation Documentation

A web crawler script that crawls the target website and lists its links

A web crawler script that crawls the target website and lists its links || A web crawler script that lists links by scanning the target website.

An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post

Autoscraper-n-blogger An Automated udemy coupons scraper which scrapes coupons and autopost the result in blogspot post and notifies via Telegram bot

This is a script that scrapes the longitude and latitude on food.grab.com
This is a script that scrapes the longitude and latitude on food.grab.com

grab This is a script that scrapes the longitude and latitude for any restaurant in Manila on food.grab.com, location can be adjusted. Search Result p

Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

mcc-mnc.com-webscraper Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX) A Python script for web scraping mcc-mnc.com Link: mcc

Scrapes all articles and their headlines from theonion.com

The Onion Article Scraper Scrapes all articles and their headlines from the satirical news website https://www.theonion.com Also see Clickhole Article

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country To run the file: Open terminal

Rottentomatoes, Goodreads and IMDB sites crawler. Semantic Web final project.

Crawler Rottentomatoes, Goodreads and IMDB sites crawler. Crawler written by beautifulsoup, selenium and lxml to gather books and films information an

Releases(0.2.1)
  • 0.2.1(Dec 27, 2021)

    What's Changed

    • Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

    Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1

    Source code(tar.gz)
    Source code(zip)
  • 0.2.0(Dec 25, 2021)

    What's Changed

    • Fixed an issue with tweet content in TwitterAPI by @DougTrajano
    • Added an exploratory notebook to test TwitterAPI by @DougTrajano
    • Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12
    • Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26
    • Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

    Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0

    Source code(tar.gz)
    Source code(zip)
  • 0.1.4(Sep 26, 2021)

  • 0.1.3(Sep 24, 2021)

  • 0.1.2(Sep 24, 2021)

  • 0.1.1(Sep 24, 2021)

  • 0.1.0(Sep 24, 2021)

Owner
Douglas Trajano
Data Scientist
Douglas Trajano
Jobinja.ir jobs scraper.

Jobinja.ir Dataset Introduction This project is a simple web scraper that scraps pages of jobinja.ir concurrently and writes and update (if file gets

Iman Kermani 3 Apr 15, 2022
热搜榜-python爬虫+正则re+beautifulsoup+xpath

仓库简介 微博热搜榜, 参数wb 百度热搜榜, 参数bd 360热点榜, 参数360 csdn热榜接口, 下方查看 其他热搜待加入 如何使用? 注册vercel fork到你的仓库, 右上角 点击这里完成部署(一键部署) 请求参数 vercel配置好的地址+api?tit=+参数(仓库简介有参数信息

Harry 3 Jul 08, 2022
Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key.

Facebook Scraper Use Flask API to wrap Facebook data. Grab the wapper of Facebook public pages without an API key. (Currently working 2021) Setup Befo

Encore Shao 2 Dec 27, 2021
An automated, headless YouTube Watcher and Scraper

Searches YouTube, queries recommended videos and watches them. All fully automated and anonymised through the Tor network. The project consists of two independently usable components, the YouTube aut

44 Oct 18, 2022
Nekopoi scraper using python3

Features Scrap from url Todo [+] Search by genre [+] Search by query [+] Scrap from homepage Example # Hentai Scraper from nekopoi import Hent

MhankBarBar 9 Apr 06, 2022
feapder 是一款简单、快速、轻量级的爬虫框架。以开发快速、抓取快速、使用简单、功能强大为宗旨。支持分布式爬虫、批次爬虫、多模板爬虫,以及完善的爬虫报警机制。

feapder 是一款简单、快速、轻量级的爬虫框架。起名源于 fast、easy、air、pro、spider的缩写,以开发快速、抓取快速、使用简单、功能强大为宗旨,历时4年倾心打造。支持轻量爬虫、分布式爬虫、批次爬虫、爬虫集成,以及完善的爬虫报警机制。 之

boris 1.4k Dec 29, 2022
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework Pastebin is a very helpful tool to store or rather share ascii encoded data online. In the world of OSINT,

Rico 105 Dec 29, 2022
A scrapy pipeline that provides an easy way to store files and images using various folder structures.

scrapy-folder-tree This is a scrapy pipeline that provides an easy way to store files and images using various folder structures. Supported folder str

Panagiotis Simakis 7 Oct 23, 2022
A Web Scraping Program.

Web Scraping AUTHOR: Saurabh G. MTech Information Security, IIT Jammu. If you find this repository useful. I would appreciate if you Star it and Fork

Saurabh G. 2 Dec 14, 2022
Twitter Claimer / Swapper / Turbo - Proxyless - Multithreading

Twitter Turbo / Auto Claimer / Swapper Version: 1.0 Last Update: 01/26/2022 Use this at your own descretion. I've only used this on test accounts and

Underscores 6 May 02, 2022
Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and written in Python

499 Dec 09, 2022
The first public repository that provides free BUBT website scraping API script on Github.

BUBT WEBSITE SCRAPPING SCRIPT I think this is the first public repository that provides free BUBT website scraping API script on github. When I was do

Md Imam Hossain 3 Feb 10, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
Open Crawl Vietnamese Text

Open Crawl Vietnamese Text This repo contains crawled Vietnamese text from multiple sources. This list of a topic-centric public data sources in high

QAI Research 4 Jan 05, 2022
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党

taobao_seckill 淘宝、天猫半价抢购,抢电视、抢茅台,干死黄牛党 依赖 安装chrome浏览器,根据浏览器的版本找到对应的chromedriver下载安装 web版使用说明 1、抢购前需要校准本地时间,然后把需要抢购的商品加入购物车 2、如果要打包成可执行文件,可使用pyinstalle

2k Jan 05, 2023
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
Searching info from Google using Python Scrapy

Python-Search-Engine-Scrapy || Python-爬虫-索引/利用爬虫获取谷歌信息**/ Searching info from Google using Python Scrapy /* 利用 PYTHON 爬虫获取天气信息,以及城市信息和资料**/ translatio

HONGVVENG 1 Jan 06, 2022
Scrape all the media from an OnlyFans account - Updated regularly

Scrape all the media from an OnlyFans account - Updated regularly

CRIMINAL 3.2k Dec 29, 2022
Pelican plugin that adds site search capability

Search: A Plugin for Pelican This plugin generates an index for searching content on a Pelican-powered site. Why would you want this? Static sites are

22 Nov 21, 2022