DANeS is an open-source E-newspaper dataset by collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

Overview

DANeS - Open-source E-newspaper dataset

12613 Source: Technology vector created by macrovector - www.freepik.com.

DANeS is an open-source E-newspaper dataset by collaboration between DATASET .JSC (dataset.vn) and AIV Group (aivgroup.vn) that contains over 600.000 online paper's articles. The articles are gathered from a number of Vietnamese Publishing Houses such as: tuoitre.vn, baobinhduong.vn, baoquangbinh.vn, kinhtechungkhoan.vn, doanhnghiep.vn, vnexpress.net, ...

We hope to support the community by providing a multi-purpose set of raw data for different subjects (students, developers, companies, …). So if you create something with this dataset, please share with us through our e-mail: [email protected]

Table of Contents

  1. Folder Tree
  2. Data format
  3. Labeling process
  4. Reviewing process
  5. Updating process
  6. License of annotated dataset
  7. About-us

Folder Tree

DANeS
  |
  |____README.md
  |
  |____raw_data
  |	   |____ DANeS_batch_#1.json
  |	   |____ DANeS_batch_#2.json
  |	   |____ DANeS_batch_#3.json
  |	   |____ DANeS_batch_#4.json
  |	   |____ DANeS_batch_#5.json
  |	   |____ DANeS_batch_#6.json
  |	   |____ DANeS_batch_#7.json
  |	   |____ DANeS_batch_#8.json
  |	   |____ README.md
  |
  |____annotated_data
  |	   |____ #contains annotated data
  |
  |____model
	   |____ Train_opensource.py
	   |____ README.md
	   |____ LICENSE

Data format

The raw dataset is stored in raw_data folder with .json format and has been divided into 8 batches. Each batch has an array that contains many json and each json is a record of the dataset. Here’s the example of each record's format:

Key Type Description
text string title of the digital news
meta json metadata of the digital news
uri string link to the digital news
description string description of the digital news

Example for a record of dataset:

{
        "text": "Ba ra đi vào ngày nhận điểm thi, nữ sinh được hỗ trợ học phí",
        "meta": {
            		"description": "Ngày nhận được tin đỗ đại học cũng là lúc bố mất vì Covid-19, L.A dường như gục ngã. Thế nhưng, bên cạnh em đã có các mạnh thường quân hỏi han, hỗ trợ về kinh tế.",
            		"uri": "https://yan.vn/ba-ra-di-vao-ngay-nhan-diem-thi-nu-sinh-duoc-ho-tro-hoc-phi-277328.html"
        	}
}

Labeling process

  • Log in:

DANeS 1 (1)

  • Annotating:

    • The article should be classified under one out of three sentiment: Negative, Positive and Neutral.
    • The article will then be classified by 22 topics: World, Politics, Economics, Sports, Cultures, Entertainment,Technology, Science, Education, Daily life, Regulations, Real estate, Social, Traffic, Environment, Stock market, Covid-19, Breaking news, Game, Movies, Health, Travel, Unidentified. Each article can carry numerous relevant and suitable topics.

DANeS 2

Reviewing process

The admin or the owner of the project will select qualified reviewers based on their attitude and performance. Reviewing process contains two main phases: cross validation and project reviewing.

  • The person who is assigned to cross validating will be given 20% of the annotated records from other annotators. This person will also be in charge of re-correcting the mislabeled records.
  • After the cross validation phase, the person who is assigned to review the project will randomly pick 20 - 50% of the total annotated records. Records that are not meet the given quality can either be:
    • Re-corrected by the project reviewer.
    • Re-assigned and re-corrected by the formal annotator.

Updating process

  • The raw data is expected to be fully uploaded at one time.

  • The annotated records are expected to be updated once a month to official repository of DANeS (https://github.com/dataset-vn/DANeS)

License of annotated dataset

Giấy phép Creative Commons
The annotated dataset of DANeS is licensed under Creative Commons Attribution 4.0 International License.

This license lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation. This is the most accommodating of licenses offered. Recommended for maximum dissemination and use of licensed materials.

About us

DATASET .JSC - (+84) 98 442 0826 - [email protected]

Dataset’s mission is to support individuals and organizations with data collecting and data processing services by providing tools that simplify and enhance the efficiency of the processes. With the large and professional workers system, Dataset aspires to provide partners with a comprehensive and quality solution, suitable with the characteristics of the technology market.

Website: Dataset.vn

LinkedIn: Dataset.vn - Data Crowdsourcing Platform

Facebook: Dataset.vn - Data Crowdsourcing Platform

AIV Group - (+84) 931 458 189 - [email protected]

AIV Group aims to apply advanced technologies, especially Artificial Intelligence (AI), Cloud Computing, Big Data, … to digitize, modernize the long-established processes of information production and consumption in Viet Nam society. At the same time, we are working on solutions that solve new problems arising in the field of communication that relate to technology’s problems such as: fake news, images, videos are automatically cut and merged ..

Website: AIV Group

Facebook: AIV Group

Owner
DATASET .JSC
DATASET .JSC - A Data Crowdsourcing Platform
DATASET .JSC
The FinQA dataset from paper: FinQA: A Dataset of Numerical Reasoning over Financial Data

Data and code for EMNLP 2021 paper "FinQA: A Dataset of Numerical Reasoning over Financial Data"

Zhiyu Chen 114 Dec 29, 2022
A method for cleaning and classifying text using transformers.

NLP Translation and Classification The repository contains a method for classifying and cleaning text using NLP transformers. Overview The input data

Ray Chamidullin 0 Nov 15, 2022
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)

IndoBERTweet 🐦 🇮🇩 1. Paper Fajri Koto, Jey Han Lau, and Timothy Baldwin. IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effe

IndoLEM 40 Nov 30, 2022
Word2Wave: a framework for generating short audio samples from a text prompt using WaveGAN and COALA.

Word2Wave is a simple method for text-controlled GAN audio generation. You can either follow the setup instructions below and use the source code and CLI provided in this repo or you can have a play

Ilaria Manco 91 Dec 23, 2022
SummerTime - Text Summarization Toolkit for Non-experts

A library to help users choose appropriate summarization tools based on their specific tasks or needs. Includes models, evaluation metrics, and datasets.

Yale-LILY 213 Jan 04, 2023
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents [Project Page] [Paper] [Video] Wenlong Huang1, Pieter Abbee

Wenlong Huang 114 Dec 29, 2022
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Dec 30, 2022
An open source framework for seq2seq models in PyTorch.

pytorch-seq2seq Documentation This is a framework for sequence-to-sequence (seq2seq) models implemented in PyTorch. The framework has modularized and

International Business Machines 1.4k Jan 02, 2023
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

GuwenModels: 古文自然语言处理模型合集, 收录互联网上的古文相关模型及资源. A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Ethan 66 Dec 26, 2022
Outreachy TFX custom component project

Schema Curation Custom Component Outreachy TFX custom component project This repo contains the code for Schema Curation Custom Component made as a par

Robert Crowe 5 Jul 16, 2021
Blazing fast language detection using fastText model

Luga A blazing fast language detection using fastText's language models Luga is a Swahili word for language. fastText provides a blazing fast language

Prayson Wilfred Daniel 18 Dec 20, 2022
Build Text Rerankers with Deep Language Models

Reranker is a lightweight, effective and efficient package for training and deploying deep languge model reranker in information retrieval (IR), question answering (QA) and many other natural languag

Luyu Gao 140 Dec 06, 2022
This repository contains Python scripts for extracting linguistic features from Filipino texts.

Filipino Text Linguistic Feature Extractors This repository contains scripts for extracting linguistic features from Filipino texts. The scripts were

Joseph Imperial 1 Oct 05, 2021
Finetune gpt-2 in google colab

gpt-2-colab finetune gpt-2 in google colab sample result (117M) from retraining on A Tale of Two Cities by Charles Di

212 Jan 02, 2023
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
ConvBERT: Improving BERT with Span-based Dynamic Convolution

ConvBERT Introduction In this repo, we introduce a new architecture ConvBERT for pre-training based language model. The code is tested on a V100 GPU.

YITUTech 237 Dec 10, 2022
Fastseq 基于ONNXRUNTIME的文本生成加速框架

Fastseq 基于ONNXRUNTIME的文本生成加速框架

Jun Gao 9 Nov 09, 2021
Named Entity Recognition API used by TEI Publisher

TEI Publisher Named Entity Recognition API This repository contains the API used by TEI Publisher's web-annotation editor to detect entities in the in

e-editiones.org 14 Nov 15, 2022
Shirt Bot is a discord bot which uses GPT-3 to generate text

SHIRT BOT · Shirt Bot is a discord bot which uses GPT-3 to generate text. Made by Cyclcrclicly#3420 (474183744685604865) on Discord. Support Server EX

31 Oct 31, 2022