Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Last update: Jan 24, 2022

Overview

Toxicity comments crawler

Twitter

Tweets and replies are scraped from the Twitter API for a given list of users.

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxicity level of each comment is calculated using the Perspective API.
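As a rough illustration of the scoring step, the sketch below builds a Perspective API `comments:analyze` request body for the `TOXICITY` attribute and reads the summary score from a response. The helper names (`build_analyze_request`, `extract_toxicity`, `is_toxic`) are hypothetical and not taken from this project. The comparison against 0.5 mirrors the `PERSPECTIVE_THRESHOLD` default, but how the crawler actually applies the threshold is an assumption. No network call is made; the response is a hand-written sample in the documented shape.

```python
from typing import Any, Dict


def build_analyze_request(text: str, language: str = "en") -> Dict[str, Any]:
    """Build the JSON body for a Perspective `comments:analyze` call."""
    return {
        "comment": {"text": text},
        "languages": [language],
        "requestedAttributes": {"TOXICITY": {}},
    }


def extract_toxicity(response: Dict[str, Any]) -> float:
    """Read the overall TOXICITY probability (0.0 to 1.0) from a response."""
    return float(response["attributeScores"]["TOXICITY"]["summaryScore"]["value"])


def is_toxic(score: float, threshold: float = 0.5) -> bool:
    """Assumed rule: a comment counts as toxic when its score reaches the threshold."""
    return score >= threshold


# Hand-written sample in the shape of a Perspective API response.
sample_response = {
    "attributeScores": {
        "TOXICITY": {"summaryScore": {"value": 0.87, "type": "PROBABILITY"}}
    },
    "languages": ["en"],
}

score = extract_toxicity(sample_response)
print(score, is_toxic(score))
```

Keeping request building and response parsing as pure functions makes the scoring logic testable without an API key.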

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

Variable	Description	Default	Required
`AWS_ROLE_ARN`	AWS Role ARN	`None`	Optional
`AWS_WEB_IDENTITY_TOKEN_FILE`	AWS Web Identity Token File	`None`	Optional
`AWS_ACCESS_KEY_ID`	AWS Access Key ID	`None`	Optional
`AWS_SECRET_ACCESS_KEY`	AWS Secret Access Key	`None`	Optional
`AWS_S3_BUCKET`	AWS S3 Bucket	`None`	Required
`AWS_S3_BUCKET_PREFIX`	AWS S3 Bucket Prefix	`None`	Required
`LOG_LEVEL`	Log level	`INFO`	Optional
`PERSPECTIVE_API_KEY`	Perspective API Key	`None`	Required
`PERSPECTIVE_THRESHOLD`	Perspective Threshold	`0.5`	Required
`FILTER_TOXIC_COMMENTS`	Filter Toxic Comments	`True`	Required
`TWITTER_CONSUMER_KEY`	Twitter Consumer Key	`None`	Required
`TWITTER_CONSUMER_SECRET`	Twitter Consumer Secret	`None`	Required
`TWITTER_ACCESS_TOKEN`	Twitter Access Token	`None`	Required
`TWITTER_ACCESS_TOKEN_SECRET`	Twitter Access Token Secret	`None`	Required
`TWITTER_MAX_TWEETS`	Twitter Max Tweets or replies	`None`	Required
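
One way to read these settings is sketched below. It is a minimal, hypothetical loader (the names `REQUIRED_VARS`, `DEFAULTS`, and `load_settings` are illustrative and not from this project). It assumes that variables marked Required with a `None` default must be set, while the others fall back to the defaults listed in the table.

```python
import os
from typing import Dict, Mapping, Optional

# Variables marked Required in the table that have no default value.
REQUIRED_VARS = (
    "AWS_S3_BUCKET",
    "AWS_S3_BUCKET_PREFIX",
    "PERSPECTIVE_API_KEY",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_MAX_TWEETS",
)

# Defaults listed in the table, kept as strings like environment values.
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "PERSPECTIVE_THRESHOLD": "0.5",
    "FILTER_TOXIC_COMMENTS": "True",
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return settings from `env`, failing fast on missing required variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing required variables: " + ", ".join(missing))
    settings = {name: env[name] for name in REQUIRED_VARS}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings
```

Calling `load_settings()` with no argument reads from `os.environ`; passing a plain dictionary is convenient for tests.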

If `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` are both provided, the crawler uses them to assume a role and ignores `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
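
The precedence rule above can be sketched as a small decision function. The function name `select_aws_auth` and the returned labels are hypothetical. The final `"default"` branch (falling back to whatever credentials the AWS SDK finds on its own) is an assumption, made because all four AWS credential variables are optional.

```python
from typing import Mapping


def select_aws_auth(env: Mapping[str, str]) -> str:
    """Return which credential source applies under the rule described above."""
    # Role assumption wins whenever both role variables are present.
    if env.get("AWS_ROLE_ARN") and env.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        return "web_identity"
    # Otherwise use static keys if both halves of the key pair are set.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "access_key"
    # Assumption: nothing set means the SDK's default lookup is used.
    return "default"


both = {
    "AWS_ROLE_ARN": "arn:aws:iam::123456789012:role/example",  # placeholder
    "AWS_WEB_IDENTITY_TOKEN_FILE": "/path/to/token",  # placeholder
    "AWS_ACCESS_KEY_ID": "placeholder-key-id",
    "AWS_SECRET_ACCESS_KEY": "placeholder-secret",
}
print(select_aws_auth(both))
```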

Running

Prerequisites

Docker

With Docker installed and the environment variables saved in a `.env` file, run the crawler with the following command:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest
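
The `.env` file passed to `--env-file` is a plain list of `KEY=VALUE` lines. The helper below is a hypothetical convenience for producing one (`render_env_file` is not part of this project, and every value shown is a placeholder). Values are written without quotes, on the understanding that Docker passes quote characters through literally.

```python
from typing import Mapping


def render_env_file(values: Mapping[str, str]) -> str:
    """Render KEY=VALUE lines in the format accepted by `docker run --env-file`."""
    lines = []
    for key, value in values.items():
        if "\n" in value:
            raise ValueError(f"{key}: env-file values must be single-line")
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# Placeholder values only; replace them with real settings.
example = {
    "AWS_S3_BUCKET": "my-bucket",
    "AWS_S3_BUCKET_PREFIX": "toxicity/",
    "PERSPECTIVE_API_KEY": "changeme",
    "LOG_LEVEL": "INFO",
}

print(render_env_file(example), end="")
```

The remaining required variables from the table would be added to the dictionary in the same way before writing the result to `.env`.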

License

The project is licensed under the Apache 2.0 License.

Releases (0.2.1)

0.2.1 (Dec 27, 2021)

What's Changed

Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1

0.2.0 (Dec 25, 2021)

What's Changed

Fixed an issue with tweet content in TwitterAPI by @DougTrajano

Added an exploratory notebook to test TwitterAPI by @DougTrajano

Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12

Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26

Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0

0.1.4 (Sep 26, 2021)

Changes

Bump google-api-python-client from 2.21.0 to 2.22.0 #3

Fix Python path in Dockerfile

0.1.3 (Sep 24, 2021)

Changes

Updated GitHub Action

Fix error in Docker execution

0.1.2 (Sep 24, 2021)

Updated GitHub Action

0.1.1 (Sep 24, 2021)

Updated GitHub Action

0.1.0 (Sep 24, 2021)

Initial version

Owner

Douglas Trajano

Data Scientist
