A collection of Classical Chinese natural language processing models, including Classical Chinese related models and resources on the Internet.

Overview



GitHub issues GitHub stars GitHub license

古文自然语言处理模型合集,收录互联网上的古文相关模型及资源。

更多内容请参考:

古文预训练语言模型

古文预训练语言模型是处理各种古文任务的基础模型,需要结合各种下游任务数据微调,才能发挥最大作用。这里收集了所有互联网上公开的古文预训练语言模型:

名称 简/繁 下载链接 备注
guwenbert-base Hugging Face 基于殆知阁语料和中文模型训练
guwenbert-large Hugging Face
guwenbert-fs-base One Drive 基于殆知阁语料从头训练
roberta-classical-chinese-base-char 简繁 Hugging Face 基于guwenbert训练,扩展了繁体词表
roberta-classical-chinese-large-char 简繁 Hugging Face
sikubert Hugging Face 基于四库全书语料和中文模型训练
sikuroberta Hugging Face

古文应用模型

古文应用模型是基于古文预训练模型,结合特定领域数据微调得到的模型,能够实现古文的各种实际应用。其中guwen-X模型使用的训练数据可以在CCLUE中下载,如果输入包含繁体字请先使用本页最下方提到的工具进行转换。

古文断句

guwen-seg: 基于guwenbert-fs-base的断句模型。

古文标点

guwen-punc: 基于guwenbert-fs-base的标点模型。

古文引号检测

guwen-quote: 基于guwenbert-fs-base的引号检测模型。

注意:如下图所示,使用Transformers自带的序列标注模型存在一定误差,请在实际场景中使用CRF模型解码。相关代码参考 crf_example.ipynb

古文命名实体识别

guwen-ner: 基于guwenbert-base的命名实体识别模型。

注意:为取得最好表现,推荐在实际场景中使用CRF模型解码。相关代码参考 crf_example.ipynb

古文分类

guwen-cls: 基于guwenbert-fs-base的古文分类模型。

古诗情感分类

guwen-sent: 基于guwenbert-base的古文分类模型。

其他古文相关资源

  • OpenCC: 简繁转换工具
  • zhconv: 简繁转换工具 (注意需使用zh-hans选项,只转换单字,避免转换地区词)
  • 甲言Jiayan: 古汉语处理的NLP工具包,古文分词,词性标注,断句,标点等工具
  • UD-Kanbun: 古文分词,词性标注,句法解析
  • daizhigev20: 殆知阁古代文献v2.0语料库
  • chinese-poetry: 最全中文诗歌古典文集数据库
  • Classical-Chinese: 古文现代文翻译平行语料库

关于

本仓库旨在收集互联网上的开源古文NLP模型,版权归原作者所有,欢迎补充更多资源,如有问题可以在Issue区讨论,或邮件联系ethanyt at qq.com

Owner
Ethan
Natural Language Processing, Deep Learning, Information Retrieval, Full-stack Development.
Ethan
xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Facebook Research 2.3k Jan 08, 2023
Main repository for the chatbot Bobotinho.

Bobotinho Bot Main repository for the chatbot Bobotinho. ℹ️ Introduction Twitch chatbot with entertainment commands. ‎ 💻 Technologies Concurrent code

Bobotinho 14 Nov 29, 2022
SciBERT is a BERT model trained on scientific text.

SciBERT is a BERT model trained on scientific text.

AI2 1.2k Dec 24, 2022
IEEEXtreme15.0 Questions And Answers

IEEEXtreme15.0 Questions And Answers IEEEXtreme is a global challenge in which teams of IEEE Student members – advised and proctored by an IEEE member

Dilan Perera 15 Oct 24, 2022
The official implementation of "BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?, ACL 2021 main conference"

BERT is to NLP what AlexNet is to CV This is the official implementation of BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Iden

Asahi Ushio 20 Nov 03, 2022
A highly sophisticated sequence-to-sequence model for code generation

CoderX A proof-of-concept AI system by Graham Neubig (June 30, 2021). About CoderX CoderX is a retrieval-based code generation AI system reminiscent o

Graham Neubig 39 Aug 03, 2021
Open-Source Toolkit for End-to-End Speech Recognition leveraging PyTorch-Lightning and Hydra.

OpenSpeech provides reference implementations of various ASR modeling papers and three languages recipe to perform tasks on automatic speech recogniti

Soohwan Kim 26 Dec 14, 2022
BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

BMInf (Big Model Inference) is a low-resource inference package for large-scale pretrained language models (PLMs).

OpenBMB 377 Jan 02, 2023
Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

537 Jan 05, 2023
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

Maarten van Gompel 46 Dec 14, 2022
Transformers Wav2Vec2 + Parlance's CTCDecodeTransformers Wav2Vec2 + Parlance's CTCDecode

🤗 Transformers Wav2Vec2 + Parlance's CTCDecode Introduction This repo shows how 🤗 Transformers can be used in combination with Parlance's ctcdecode

Patrick von Platen 9 Jul 21, 2022
An Open-Source Package for Neural Relation Extraction (NRE)

OpenNRE We have a DEMO website (http://opennre.thunlp.ai/). Try it out! OpenNRE is an open-source and extensible toolkit that provides a unified frame

THUNLP 3.9k Jan 03, 2023
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
Prompt tuning toolkit for GPT-2 and GPT-Neo

mkultra mkultra is a prompt tuning toolkit for GPT-2 and GPT-Neo. Prompt tuning injects a string of 20-100 special tokens into the context in order to

61 Jan 01, 2023
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
结巴中文分词

jieba “结巴”中文分词:做最好的 Python 中文分词组件 "Jieba" (Chinese for "to stutter") Chinese text segmentation: built to be the best Python Chinese word segmentation

Sun Junyi 29.8k Jan 02, 2023
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022