Masader

The first online catalogue for Arabic NLP datasets. The catalogue contains 200 datasets, each annotated with more than 25 metadata attributes. You can browse the full list of datasets on the website: https://arbml.github.io/masader/

Title: Masader: Metadata Sourcing for Arabic Text and Speech Data Resources
Authors: Zaid Alyafeai, Maraim Masoud, Mustafa Ghaleb, Maged S. Al-shaibani
Paper: https://arxiv.org/abs/2110.06744

Abstract: The NLP pipeline has evolved dramatically in the last few years. The first step in the pipeline is to find suitable annotated datasets to evaluate the tasks we are trying to solve. Unfortunately, most of the published datasets lack metadata annotations that describe their attributes. Not to mention, the absence of a public catalogue that indexes all the publicly available datasets related to specific regions or languages. When we consider low-resource dialectical languages, for example, this issue becomes more prominent. In this paper we create Masader, the largest public catalogue for Arabic NLP datasets, which consists of 200 datasets annotated with 25 attributes. Furthermore, we develop a metadata annotation strategy that could be extended to other languages. We also make remarks and highlight some issues about the current status of Arabic NLP datasets and suggest recommendations to address them.

Metadata

No.: dataset number
Name: name of the dataset
Subsets: subsets of the dataset
Link: direct link to the dataset or instructions on how to download it
License: license of the dataset
Year: year the dataset or paper was published
Language: ar or multilingual
Dialect: region, e.g. ar-LEV (Arabic (Levant)); country, e.g. ar-EGY (Arabic (Egypt)); or type, e.g. ar-MSA (Arabic (Modern Standard Arabic))
Domain: social media, news articles, reviews, commentary, books, transcribed audio or other
Form: text, audio or sign language
Collection style: crawling, crawling and annotation (translation), crawling and annotation (other), machine translation, human translation, human curation or other
Description: short statement describing the dataset
Volume: the size of the dataset, as a number
Unit: unit of the volume; one of tokens, sentences, documents, MB, GB, TB, hours or other
Provider: company or university providing the dataset
Related Datasets: any datasets related in content to this dataset
Paper Title: title of the paper
Paper Link: direct link to the paper PDF
Script: writing system; one of Arab, Latn, Arab-Latn or other
Tokenized: whether the dataset is segmented using morphology: Yes or No
Host: the website hosting the data, e.g. GitHub
Access: whether the data is free, upon-request or with-fee
Cost: cost of the data if access is with-fee
Test split: whether the data contains a test split: Yes or No
Tasks: the tasks included in the dataset, separated by commas
Evaluation Set: whether the data is included in the BigScience evaluation suite
Venue Title: the venue title, e.g. ACL
Citations: the number of citations
Venue Type: conference, workshop, journal or preprint
Venue Name: full name of the venue, e.g. Association for Computational Linguistics
authors: list of the paper authors, separated by commas
affiliations: list of the paper authors' affiliations, separated by commas
abstract: abstract of the paper
Added by: name of the person who added the entry
Notes: any extra notes on the dataset
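
As an illustration of how these attributes can be used, here is a minimal Python sketch that models a few of them and filters entries by dialect, form and access. The `DatasetEntry` class, the helper functions and the three sample entries are hypothetical and are not part of Masader; only the field names and the allowed values (`ar-MSA`, `text`, `free`, and so on) come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetEntry:
    """A hypothetical, reduced view of one catalogue row."""
    name: str
    dialect: str      # e.g. "ar-MSA", "ar-EGY", "ar-LEV"
    form: str         # "text", "audio" or "sign language"
    volume: float     # size of the dataset, as a number
    unit: str         # e.g. "tokens", "sentences", "hours"
    access: str       # "free", "upon-request" or "with-fee"
    tasks: List[str]  # parsed from the comma-separated Tasks field


def parse_tasks(raw: str) -> List[str]:
    """Split the comma-separated Tasks field into a clean list."""
    return [task.strip() for task in raw.split(",") if task.strip()]


def free_text_datasets(entries: List[DatasetEntry], dialect: str) -> List[str]:
    """Return names of freely available text datasets for one dialect."""
    return [
        e.name
        for e in entries
        if e.access == "free" and e.form == "text" and e.dialect == dialect
    ]


# Made-up sample entries, for illustration only.
entries = [
    DatasetEntry("example-a", "ar-MSA", "text", 1000, "sentences", "free",
                 parse_tasks("sentiment analysis, topic classification")),
    DatasetEntry("example-b", "ar-EGY", "audio", 12, "hours", "free",
                 parse_tasks("speech recognition")),
    DatasetEntry("example-c", "ar-MSA", "text", 5, "GB", "with-fee",
                 parse_tasks("language modeling")),
]

print(free_text_datasets(entries, "ar-MSA"))
```

Keeping `Volume` and `Unit` as separate fields, as the catalogue does, means sizes are only comparable between entries that share the same unit.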

Contribution

If you want to add a new dataset, feel free to update the sheet. Please follow the instructions there when adding the entry.
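
Before adding an entry, it can help to check that fields with a fixed set of values are filled in consistently. The sketch below is one hypothetical way to do that in Python; it is not the project's own tooling, and the sheet's instructions remain the authority. The `ALLOWED` table and `validate_entry` function are illustrative names, and the table covers only the attributes that the Metadata section describes with a closed list of values.

```python
# Attributes that the Metadata section describes with a closed set of values.
ALLOWED = {
    "Language": {"ar", "multilingual"},
    "Form": {"text", "audio", "sign language"},
    "Tokenized": {"Yes", "No"},
    "Access": {"free", "upon-request", "with-fee"},
    "Test split": {"Yes", "No"},
    "Venue Type": {"conference", "workshop", "journal", "preprint"},
}


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = entry.get(field)
        if value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")
    # Cost only applies to paid datasets (assumption: it must be non-empty then).
    if entry.get("Access") == "with-fee" and not entry.get("Cost"):
        problems.append("Cost: required when Access is with-fee")
    return problems


# Made-up entry, for illustration only.
candidate = {
    "Name": "example-dataset",
    "Language": "ar",
    "Form": "text",
    "Tokenized": "No",
    "Access": "free",
    "Test split": "Yes",
    "Venue Type": "workshop",
}

for problem in validate_entry(candidate):
    print(problem)
```

Open-ended attributes such as Domain or Unit accept "other", so this sketch leaves them unchecked.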

Citation

@misc{alyafeai2021masader,
      title={Masader: Metadata Sourcing for Arabic Text and Speech Data Resources}, 
      author={Zaid Alyafeai and Maraim Masoud and Mustafa Ghaleb and Maged S. Al-shaibani},
      year={2021},
      eprint={2110.06744},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

