Open Crawl Vietnamese Text

Last update: Jan 05, 2022

This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
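
The four cleaning steps above can be sketched in Python as follows. This is only an illustration: the README does not show the actual cleaning code, so the function name `clean_text`, every regular expression, and the order of the steps are assumptions. The emoji ranges cover common Unicode blocks and are not exhaustive.

```python
import re

# All patterns below are assumptions made for illustration; the repository's
# real cleaning rules are not shown in the README.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Common Unicode emoji blocks (not exhaustive).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoji faces, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Stand-alone ASCII emoticons such as :) ;-( :D =P
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][\-o\*']?[\)\(\]\[dDpP/\\|](?!\S)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, emojis, and emoticons, then normalize spaces."""
    text = HTML_TAG_RE.sub(" ", text)   # tags first, so URLs inside attributes vanish with them
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("<p>Xin chào 😀 :) xem https://example.com/a?b=1</p>"))
```

Vietnamese letters with diacritics fall outside the emoji ranges used here, so ordinary text passes through unchanged. For corpora of the sizes listed below, a function like this would typically be applied line by line while streaming the file rather than loading it into memory.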

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Its Vietnamese portion contains roughly 32 GB of text after duplicates are discarded.

link_raw, link_clean

3. Vietnamese story dataset:

Contains short and long stories, 10 GB of text in total, crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

Contains more than 1 million sentences collected from the internet by QAI.

link_clean

Owner

QAI Research
