Open Crawl Vietnamese Text

Last update: Jan 05, 2022


This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
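The four cleaning steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the repository does not publish its exact rules, so the regular expressions, the emoji ranges, and the function name `clean_text` are all hypothetical choices, not the authors' actual pipeline.

```python
import re

# All patterns below are assumptions made for illustration only.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# A small, non-exhaustive set of Unicode emoji ranges.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, faces, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # emoji variation selector
    "]+"
)
# Western-style text emoticons such as :) ;-( =D, not adjacent to word characters.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-o\*']?[\)\]\(\[dDpP/\\|](?!\w)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, emojis, and emoticons, then collapse whitespace."""
    # Tags go first so that URLs inside attributes disappear with the tag.
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào 😀 :) xem https://example.com nhé</p>"))
```

Stripping tags before URLs is a deliberate ordering choice in this sketch: a greedy URL pattern applied first could swallow the closing part of an `<a href="...">` tag along with the link text.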

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. After duplicates are discarded, OSCAR contains roughly 32 GB of Vietnamese text.

link_raw, link_clean

3. Vietnamese story dataset:

Short and long stories, 10 GB of text in total, crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
