TPLinker for NER: Chinese/English Named Entity Recognition

Overview

TPLinker-NER

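If you like this project, please click the star in the upper-right corner. Thank you to everyone who does.

Project Introduction

This project borrows the HandshakingTagging idea from TPLinker and converts TPLinker from a relation extraction (RE) model into a named entity recognition (NER) model.

[Note] The base model used here is actually TPLinker_plus, because a model that strictly follows the original TPLinker design is almost unusable on NER tasks. The reasons are explained in the Q&A section.

Compared with earlier NER approaches such as sequence labeling and half-pointer/half-tagging models, TPLinker-NER handles nested entities more effectively. TPLinker has already achieved strong results in RE, so TPLinker-NER, as a sub-function extracted from it, should in theory not perform too badly either. Because my compute is limited, I could not experiment on large-scale corpora; so far the model has only been evaluated on the CLUENER dataset.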

F1 on the CLUENER validation set

Best F1 on dev: 0.9111

Usage

Environment

The experiments were run with Python 3.6. The other main third-party libraries are:

  • pytorch==1.8.1
  • wandb==0.10.26 #for logging the result
  • glove-python-binary==0.1.0
  • transformers==4.1.1
  • tqdm==4.54.1

NOTE:

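  1. wandb is an excellent visualization library for machine learning. It is disabled by default in this project; to manage logs with wandb, change the relevant settings in tplinker_plus_ner/config.py.
  2. If you are on Windows and have not installed the Glove library, or you only want to use BERT as the encoder, use train_only_bert.py as the main script.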

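Data Preparation

Format Requirements

TPLinker-NER expects datasets in the following format:

  • Training set train_data.json and validation set valid_data.json
[
    {
        "id":"",
        "text":"original sentence",
        "entity_list":[{"text":"entity","type":"entity type","char_span":"char-level span of the entity","token_span":"token-level span of the entity"}]
    },
    ...
]
  • Test set test_data.json
[
    {
        "id":"",
        "text":"original sentence"
    },
    ...
]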
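
As a rough illustration of the training format above, the sketch below builds a single record. It rests on assumptions the README does not confirm: spans are stored as end-exclusive [start, end] lists, and every character counts as one token (a simplification that roughly holds for pure Chinese text with a character-level vocabulary). `make_record` is a hypothetical helper, not part of the project, and the entity types are only examples.

```python
import json


def make_record(sample_id, text, entities):
    """Build one training record; entities is a list of (entity_text, entity_type)."""
    entity_list = []
    for ent_text, ent_type in entities:
        start = text.find(ent_text)  # only the first occurrence is used
        if start == -1:
            raise ValueError(f"entity {ent_text!r} not found in text")
        span = [start, start + len(ent_text)]  # assumption: end-exclusive span
        entity_list.append({
            "text": ent_text,
            "type": ent_type,
            "char_span": span,
            # Assumption: one token per character, so token_span equals char_span.
            "token_span": list(span),
        })
    return {"id": sample_id, "text": text, "entity_list": entity_list}


record = make_record("train_0", "小明在北京工作", [("小明", "name"), ("北京", "address")])
print(json.dumps([record], ensure_ascii=False, indent=4))
```

With a real tokenizer, token_span would have to be derived from the tokenizer's offsets instead of being copied from char_span.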

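Data Conversion

To convert a dataset in another format into the TPLinker-NER format, refer to the conversion logic in raw_data/convert_dataset.py.

Data Location

Place the prepared data in data4bert/{exp_name} or data4bilstm/{exp_name}, where exp_name is the experiment name configured in tplinker_plus_ner/config.py.

Pretrained Models and Word Embeddings

Download the Chinese pretrained BERT model bert-base-chinese into pretrained_models/, and set the correct bert_path in tplinker_plus_ner/config.py.

If you want to use BiLSTM, prepare pretrained word embeddings and place them in pretrained_emb/. For how to pretrain them, see preprocess/Pretrain_Word_Embedding.ipynb.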

Train

Read through tplinker_plus_ner/config.py and adjust the configuration and hyperparameters to your needs.

Then start training:

cd tplinker_plus_ner
python train.py

Evaluation

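You still need to set the evaluation-related parameters in tplinker_plus_ner/config.py. In particular, make sure the model_state_dict_dir value in eval_config matches the logging module you used.

Then start the evaluation: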

cd tplinker_plus_ner
python evaluate.py

Q&A

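The answers below are my own thoughts from adapting the project and are for reference only. Corrections are welcome.

  1. Why is TPLinker unsuitable for direct use on NER, so that TPLinker_plus is needed instead?

    My understanding: to discuss this, we first need to look at the original TPLinker design. Besides HandShaking, the authors predefine three major tag groups, ent, head_rel and tail_rel, each with its own sub-tags: ent: {"O":0, "ENT-H2T":1}, head_rel: {"O":0, "REL-SH2OH":1, "REL-OH2SH":2}, tail_rel: {"O":0, "REL-ST2OT":1, "REL-OT2ST":2}. When the model classifies, the three groups are independent of each other. Taking head_rel as an example, the y_true matrix built from the raw data has shape (batch_size, rel_size, shaking_seq_len), where rel_size is the number of relation types, and the predicted y_pred matrix has shape (batch_size, rel_size, shaking_seq_len, 3). As you can imagine, such a y_true matrix is already very sparse, even with three labels (0, 1 and 2).

    For NER, the corresponding (batch_size, ent_size, shaking_seq_len) matrix would be even sparser (only the labels 0 and 1): in a single (ent_size, shaking_seq_len) matrix, perhaps only one or two positions are 1. This drives the model to predict 0 everywhere, so learning fails (which is also what happened in my experiments).

    How do the authors of TPLinker deal with this? They sidestep it with a small trick: entity types are no longer distinguished and all entities are treated as a single DEFAULT type, which compresses y_true to (batch_size, shaking_seq_len) and reduces the sparsity. Their justification is "Because it is not necessary to recognize the type of entities for the relation extraction task since a predefined relation usually has fixed types for its subject and object.", i.e. entity type information matters little for relation extraction, because each relation already fixes the types of its entities to some extent. In short, applying TPLinker directly to NER is not appropriate.

    TPLinker_plus changes this. It no longer treats ent, head_rel and tail_rel as three independent tasks. Instead, it combines every relation with every link tag to form one large tag set, and uses a single HandShaking matrix to represent all relations in a sentence. For example, given the 3 relations (or entity types) 主演 (starring), 出生于 (born in) and 作者 (author), combining them with the link tags EH-ET, SH-OH, OH-SH, ST-OT and OT-ST yields 15 tags, which greatly enlarges the tag set. Accordingly, the input of TPLinker_plus becomes (batch_size, shaking_seq_len, tag_size). This change increases the relative number of non-zero values and reduces the sparsity of the matrix. (This is only one of the reasons; for the more important one, see question 2.)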
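
The tag combination described above can be sketched in a few lines. This is only an illustration of the counting argument: the "-" separator used to join a relation with a link tag, the English relation names and the batch_size/seq_len values are all assumptions made for the example, not taken from the project.

```python
from itertools import product

# Illustrative relation names (English stand-ins for 主演, 出生于, 作者).
relations = ["starring", "born_in", "author"]
link_types = ["EH-ET", "SH-OH", "OH-SH", "ST-OT", "OT-ST"]

# Every relation is paired with every link tag to form one flat tag set.
# Assumption: tags are joined with "-"; the project may use another separator.
tags = [f"{rel}-{link}" for rel, link in product(relations, link_types)]
tag2id = {tag: idx for idx, tag in enumerate(tags)}


def shaking_seq_len(seq_len):
    """Number of (i, j) pairs with i <= j, i.e. the upper triangle size."""
    return seq_len * (seq_len + 1) // 2


batch_size, seq_len = 4, 10  # arbitrary example values
label_shape = (batch_size, shaking_seq_len(seq_len), len(tags))

print(len(tags))    # 15
print(label_shape)  # (4, 55, 15)
```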

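  2. What other optimizations does TPLinker_plus make?

    • Change of task formulation: as the conclusion of question 1 shows, while enlarging the tag set, TPLinker_plus also turns the task from multi-class classification into multi-label classification. Each position of the shaking_seq built from a sentence can carry several tags, and the number of tags is not fixed. For example:
    # Suppose the sentence has seq_len=10, so shaking_seq_len=55
    # There are 8 tag combinations, so tag_size=8
    [
        [0,0,1,0,1,0,1,0],
        [1,0,1,0,0,0,0,1],
        ...
        # the remaining 53 rows
    ]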
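
A multi-hot label matrix of this kind could be built roughly as follows. This is a minimal sketch, not the project's code: `build_shaking_tag` is a hypothetical helper, and it assumes each spot is given as (start, end, tag_id) with an inclusive end position.

```python
def build_shaking_tag(seq_len, tag_size, spots):
    """Return a (shaking_seq_len x tag_size) multi-hot matrix as nested lists.

    spots: iterable of (start, end, tag_id) with start <= end.
    Assumption: end is inclusive.
    """
    # Enumerate the upper triangle row by row: (0,0), (0,1), ..., (n-1,n-1).
    index_of = {}
    for i in range(seq_len):
        for j in range(i, seq_len):
            index_of[(i, j)] = len(index_of)

    matrix = [[0] * tag_size for _ in range(len(index_of))]
    for start, end, tag_id in spots:
        # Several tags may be switched on in the same row (multi-label).
        matrix[index_of[(start, end)]][tag_id] = 1
    return matrix


# The span (0, 0) carries two tags at once; the other spans carry one each.
shaking_tag = build_shaking_tag(10, 8, [(0, 0, 2), (0, 0, 4), (0, 1, 0), (3, 5, 7)])
print(len(shaking_tag), len(shaking_tag[0]))  # 55 8
print(shaking_tag[0])                         # [0, 0, 1, 0, 1, 0, 0, 0]
```

Because each cell is an independent 0/1 decision, a multi-label loss over the tag axis is the natural fit, in contrast to a softmax over mutually exclusive classes.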
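  3. How should the key terms in TPLinker-NER be understood?

    For a text containing n tokens:

    • shaking_matrix: an n*n matrix, where shaking_matrix[i][j]=1 means that the span from the i-th token to the j-th token is an entity. (Only the upper triangle is actually used, because an entity's start position never comes after its end position.)
    • matrix_index: the coordinates of the upper triangle: (0,0),(0,1),(0,2)...(0,n-1),(1,1),(1,2)...(1,n-1)...(n-1,n-1)
    • shaking_index: the flat indices of the upper triangle, of length $\frac{n(n+1)}{2}$, i.e. [0,1,2,...,n(n+1)/2 - 1]
    • shaking_ind2matrix_ind: maps an index to matrix coordinates, i.e. [(0,0),(0,1),...,(n-1,n-1)]
    • matrix_ind2shaking_ind: maps coordinates to an index, i.e.
      [[0, 1, 2,    ...,        n-1],
       [0, n, n+1, n+2,  ...,  2n-2],
       ...,
       [0, 0, 0, ...,  n(n+1)/2 - 1]]

    • spot: the start/end span and type id of an entity. For example, if the entity “北京” (Beijing) starts at position 7 and ends at position 9 in the matrix and has type "LOC" (id: 3), its spot is (7, 9, 3).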
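
The two index mappings above can be reproduced with a short sketch. The names `build_index_maps` and `spot_to_shaking` are hypothetical helpers written for this illustration, and the sentence length of 12 used for the spot example is an arbitrary assumption.

```python
def build_index_maps(n):
    """Build both mappings for an n-token text."""
    # Flat index -> (row, col), enumerating the upper triangle row by row.
    shaking_ind2matrix_ind = [(i, j) for i in range(n) for j in range(i, n)]
    # (row, col) -> flat index; unused lower-triangle cells stay 0.
    matrix_ind2shaking_ind = [[0] * n for _ in range(n)]
    for shaking_ind, (i, j) in enumerate(shaking_ind2matrix_ind):
        matrix_ind2shaking_ind[i][j] = shaking_ind
    return shaking_ind2matrix_ind, matrix_ind2shaking_ind


def spot_to_shaking(spot, matrix_ind2shaking_ind):
    """Convert a (start, end, type_id) spot to (shaking_index, type_id)."""
    start, end, type_id = spot
    return matrix_ind2shaking_ind[start][end], type_id


s2m, m2s = build_index_maps(4)
print(len(s2m))  # 10, i.e. 4 * (4 + 1) // 2
for row in m2s:
    print(row)
# [0, 1, 2, 3]
# [0, 4, 5, 6]
# [0, 0, 7, 8]
# [0, 0, 0, 9]

# The spot (7, 9, 3) from the text, in an assumed 12-token sentence.
_, m2s_12 = build_index_maps(12)
print(spot_to_shaking((7, 9, 3), m2s_12))  # (65, 3)
```

The same flat index can also be computed in closed form as i*n - i*(i-1)/2 + (j - i), which avoids materializing the n*n table for long sequences.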

Acknowledgements

Owner
GodK