The Toxicity Dataset

Saving the internet is fun. Combing through thousands of online comments to build a toxicity dataset isn't. That's why we're creating the world's largest dataset of social media toxicity, so you can skip the slog and get to work.

We hope you find this dataset useful, whether you want to flag hateful speech, develop content moderation tools, or build classifiers to detect toxic messages.

Need a larger dataset of toxicity to train your ML models, or toxicity in other languages (Spanish, French, German, Japanese, Portuguese, and 17+ more)? We work with top AI and Safety companies around the world. Reach out to [email protected]!

Dataset

This repo contains 500 toxic and 500 non-toxic comments from a variety of popular social media platforms. Click on toxicity_en.csv to see a spreadsheet of 1000 English examples. Rather than operating under a strict definition of toxicity, we asked our team to identify comments that they personally found toxic.

Columns

text: the text of the comment
is_toxic: whether or not the comment is toxic
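
As a minimal sketch of how the two columns above might be read, the snippet below parses a CSV with a `text` and an `is_toxic` column using only Python's standard library and counts the comments per label. The helper names `load_comments` and `label_counts` are hypothetical, and the label strings in the in-memory sample ("Toxic" / "Not Toxic") are placeholders: check toxicity_en.csv for the values it actually uses.

```python
import csv
import io
from collections import Counter


def load_comments(csv_file):
    """Read (text, is_toxic) pairs from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    return [(row["text"], row["is_toxic"]) for row in reader]


def label_counts(rows):
    """Count how many comments carry each is_toxic label."""
    return Counter(label for _, label in rows)


# Tiny in-memory stand-in for the real file. The label strings here are
# assumed placeholders, not confirmed values from the dataset.
sample = io.StringIO(
    "text,is_toxic\n"
    '"thanks, this was helpful",Not Toxic\n'
    "placeholder toxic comment,Toxic\n"
)
rows = load_comments(sample)
print(len(rows), "comments")
print(label_counts(rows))

# For the real file (assuming it sits in the working directory):
#     with open("toxicity_en.csv", newline="", encoding="utf-8") as f:
#         rows = load_comments(f)
```

On the full file, the per-label counts should come out balanced if the 500/500 split described above holds, which makes this a quick sanity check before training anything on the data.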

Future

We'll be adding more languages and annotations over time (e.g., augmenting each comment with a severity ranking or adding categories).

If you're also interested in a dataset of profanity, check out our obscenity list.

Owner

Surge AI
