Chinese named entity recognization with BiLSTM using Keras

Overview

Chinese named entity recognization (Bilstm with Keras)

Project Structure

./
├── README.md
├── data
│   ├── README.md
│   ├── data							数据集
│   │   ├── test.txt
│   │   └── train.txt
│   ├── plain_text.txt
│   └── vocab.txt                       词表
├── evaluate
│   ├── __init__.py
│   └── f1_score.py                     计算实体F1得分
├── keras_contrib                       keras_contrib包,也可以pip装
├── log                                 训练nohup日志
│   ├── __init__.py
│   └── nohup.out
├── model                               模型
│   ├── BiLSTMCRF.py
│   ├── __init__.py
│   └── __pycache__
├── predict                             输出预测
│   ├── __init__.py
│   ├── __pycache__
│   ├── predict.py
│   └── predict_process.py
├── preprocess                          数据预处理
│   ├── README.md
│   ├── __pycache__
│   ├── convert_jsonl.py
│   ├── data_add_line.py
│   ├── generate_vocab.py               生成词表
│   ├── process_data.py                 数据处理转换
│   ├── splite.py
│   └── vocab.py                        词表对应工具
├── public
│   ├── __init__.py
│   ├── __pycache__
│   ├── config.py                       训练设置
│   ├── generate_label_id.py            生成label2id文件
│   ├── label2id.json                   标签dict
│   ├── path.py                         所有路径
│   └── utils.py                        小工具
├── report
│   └── report.out                      F1评估报告
├── train.py
└── weight                              保存的权重
    └── bilstm_ner.h5

52 directories, 214 files

Dataset

三甲医院肺结节数据集,20000+字,BIO格式,形如:

中	B-ORG
共	I-ORG
中	I-ORG
央	I-ORG
致	O
中	B-ORG
国	I-ORG
致	I-ORG
公	I-ORG
党	I-ORG
十	I-ORG
一	I-ORG
大	I-ORG
的	O
贺	O
词	O

ATTENTION: 在处理自己数据集的时候需要注意:

  • 字与标签之间用tab("\t")隔开
  • 其中句子与句子之间使用空行隔开

Steps

  1. 替换数据集
  2. 修改public/path.py中的地址
  3. 使用public/generate_label_id.py生成label2id.txt文件,将其中的内容填到preprocess/vocab.py的get_tag2index中。注意:序号必须从0开始
  4. 修改public/config.py中的MAX_LEN(超过截断,少于填充,最好设置训练集、测试集中最长句子作为MAX_LEN)
  5. 运行preprocess/generate_vocab.py生成词表,词表按词频生成
  6. 根据需要修改BiLSTMCRF.py模型结构
  7. 修改public/config.py的参数
  8. 训练前debug看下train_data,train_label对不对
  9. 训练

Model

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, None)              0
_________________________________________________________________
embedding_1 (Embedding)      (None, None, 128)         81408
_________________________________________________________________
bidirectional_1 (Bidirection (None, None, 256)         263168
_________________________________________________________________
dropout_1 (Dropout)          (None, None, 256)         0
_________________________________________________________________
bidirectional_2 (Bidirection (None, None, 128)         164352
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 128)         0
_________________________________________________________________
time_distributed_1 (TimeDist (None, None, 29)          3741
_________________________________________________________________
dropout_3 (Dropout)          (None, None, 29)          0
_________________________________________________________________
crf_1 (CRF)                  (None, None, 29)          1769
=================================================================
Total params: 514,438
Trainable params: 514,438
Non-trainable params: 0
_________________________________________________________________

Train

运行train.py

Epoch 1/500
806/806 [==============================] - 15s 18ms/step - loss: 2.4178 - crf_viterbi_accuracy: 0.9106

Epoch 00001: loss improved from inf to 2.41777, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 2/500
806/806 [==============================] - 10s 13ms/step - loss: 0.6370 - crf_viterbi_accuracy: 0.9106

Epoch 00002: loss improved from 2.41777 to 0.63703, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 3/500
806/806 [==============================] - 11s 14ms/step - loss: 0.5295 - crf_viterbi_accuracy: 0.9106

Epoch 00003: loss improved from 0.63703 to 0.52950, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 4/500
806/806 [==============================] - 11s 13ms/step - loss: 0.4184 - crf_viterbi_accuracy: 0.9064

Epoch 00004: loss improved from 0.52950 to 0.41838, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 5/500
806/806 [==============================] - 12s 14ms/step - loss: 0.3422 - crf_viterbi_accuracy: 0.9104

Epoch 00005: loss improved from 0.41838 to 0.34217, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 6/500
806/806 [==============================] - 10s 13ms/step - loss: 0.3164 - crf_viterbi_accuracy: 0.9106

Epoch 00006: loss improved from 0.34217 to 0.31637, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 7/500
806/806 [==============================] - 10s 12ms/step - loss: 0.3003 - crf_viterbi_accuracy: 0.9111

Epoch 00007: loss improved from 0.31637 to 0.30032, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 8/500
806/806 [==============================] - 10s 12ms/step - loss: 0.2906 - crf_viterbi_accuracy: 0.9117

Epoch 00008: loss improved from 0.30032 to 0.29058, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 9/500
806/806 [==============================] - 9s 12ms/step - loss: 0.2837 - crf_viterbi_accuracy: 0.9118

Epoch 00009: loss improved from 0.29058 to 0.28366, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 10/500
806/806 [==============================] - 9s 11ms/step - loss: 0.2770 - crf_viterbi_accuracy: 0.9142

Epoch 00010: loss improved from 0.28366 to 0.27696, saving model to /home/bureaux/Projects/BiLSTMCRF_TimeDistribute/weight/bilstm_ner.h5
Epoch 11/500
806/806 [==============================] - 10s 12ms/step - loss: 0.2713 - crf_viterbi_accuracy: 0.9160

Evaluate

运行evaluate/f1_score.py

100%|█████████████████████████████████████████| 118/118 [00:38<00:00,  3.06it/s]
TP: 441
TP+FP: 621
precision: 0.7101449275362319
TP+FN: 604
recall: 0.7301324503311258
f1: 0.72

classification report:
              precision    recall  f1-score   support

     ANATOMY       0.74      0.75      0.74       220
    BOUNDARY       1.00      0.75      0.86         8
     DENSITY       0.78      0.88      0.82         8
    DIAMETER       0.82      0.88      0.85        16
     DISEASE       0.54      0.72      0.62        43
   LUNGFIELD       0.83      0.83      0.83         6
      MARGIN       0.57      0.67      0.62         6
      NATURE       0.00      0.00      0.00         6
       ORGAN       0.62      0.62      0.62        13
    QUANTITY       0.88      0.87      0.87        83
       SHAPE       1.00      0.43      0.60         7
        SIGN       0.66      0.65      0.65       189
     TEXTURE       0.75      0.43      0.55         7
   TREATMENT       0.25      0.33      0.29         9

   micro avg       0.71      0.71      0.71       621
   macro avg       0.67      0.63      0.64       621
weighted avg       0.71      0.71      0.71       621

Predict

运行predict/predict_bio.py

Human Activity Recognition example using TensorFlow on smartphone sensors dataset and an LSTM RNN. Classifying the type of movement amongst six activity categories - Guillaume Chevalier

LSTMs for Human Activity Recognition Human Activity Recognition (HAR) using smartphones dataset and an LSTM RNN. Classifying the type of movement amon

Guillaume Chevalier 3.1k Dec 30, 2022
Efficient and Scalable Physics-Informed Deep Learning and Scientific Machine Learning on top of Tensorflow for multi-worker distributed computing

Notice: Support for Python 3.6 will be dropped in v.0.2.1, please plan accordingly! Efficient and Scalable Physics-Informed Deep Learning Collocation-

tensordiffeq 74 Dec 09, 2022
A Library for Modelling Probabilistic Hierarchical Graphical Models in PyTorch

A Library for Modelling Probabilistic Hierarchical Graphical Models in PyTorch

Korbinian Pöppel 47 Nov 28, 2022
OpenPCDet Toolbox for LiDAR-based 3D Object Detection.

OpenPCDet OpenPCDet is a clear, simple, self-contained open source project for LiDAR-based 3D object detection. It is also the official code release o

OpenMMLab 3.2k Dec 31, 2022
Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

[Paper] [Project page] This repository contains code for the paper: Andrew Owens, Alexei A. Efros. Audio-Visual Scene Analysis with Self-Supervised Mu

Andrew Owens 202 Dec 13, 2022
This is an open source library implementing hyperbox-based machine learning algorithms

hyperbox-brain is a Python open source toolbox implementing hyperbox-based machine learning algorithms built on top of scikit-learn and is distributed

Complex Adaptive Systems (CAS) Lab - University of Technology Sydney 21 Dec 14, 2022
使用yolov5训练自己数据集(详细过程)并通过flask部署

使用yolov5训练自己的数据集(详细过程)并通过flask部署 依赖库 torch torchvision numpy opencv-python lxml tqdm flask pillow tensorboard matplotlib pycocotools Windows,请使用 pycoc

HB.com 19 Dec 28, 2022
Personals scripts using ageitgey/face_recognition

HOW TO USE pip3 install requirements.txt Add some pictures of known people in the folder 'people' : a) Create a folder called by the name of the perso

Antoine Bollengier 1 Jan 06, 2022
A study project using the AA-RMVSNet to reconstruct buildings from multiple images

3d-building-reconstruction This is part of a study project using the AA-RMVSNet to reconstruct buildings from multiple images. Introduction It is exci

17 Oct 17, 2022
Real-time face detection and emotion/gender classification using fer2013/imdb datasets with a keras CNN model and openCV.

Real-time face detection and emotion/gender classification using fer2013/imdb datasets with a keras CNN model and openCV.

Octavio Arriaga 5.3k Dec 30, 2022
A collection of 100 Deep Learning images and visualizations

A collection of Deep Learning images and visualizations. The project has been developed by the AI Summer team and currently contains almost 100 images.

AI Summer 65 Sep 12, 2022
My implementation of Image Inpainting - A deep learning Inpainting model

Image Inpainting What is Image Inpainting Image inpainting is a restorative process that allows for the fixing or removal of unwanted parts within ima

Joshua V Evans 1 Dec 12, 2021
The Deep Learning with Julia book, using Flux.jl.

Deep Learning with Julia DL with Julia is a book about how to do various deep learning tasks using the Julia programming language and specifically the

Logan Kilpatrick 67 Dec 25, 2022
Özlem Taşkın 0 Feb 23, 2022
Code for Paper: Self-supervised Learning of Motion Capture

Self-supervised Learning of Motion Capture This is code for the paper: Hsiao-Yu Fish Tung, Hsiao-Wei Tung, Ersin Yumer, Katerina Fragkiadaki, Self-sup

Hsiao-Yu Fish Tung 87 Jul 25, 2022
Top #1 Submission code for the first https://alphamev.ai MEV competition with best AUC (0.9893) and MSE (0.0982).

alphamev-winning-submission Top #1 Submission code for the first alphamev MEV competition with best AUC (0.9893) and MSE (0.0982). The code won't run

70 Oct 29, 2022
Deep Learning segmentation suite designed for 2D microscopy image segmentation

Deep Learning segmentation suite dessigned for 2D microscopy image segmentation This repository provides researchers with a code to try different enco

7 Nov 03, 2022
Discord bot for notifying on github events

Git-Observer Discord bot for notifying on github events ⚠️ This bot is meant to write messages to only one channel (implementing this for multiple pro

ilu_vatar_ 0 Apr 19, 2022
HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks

HiFiGAN Denoiser This is a Unofficial Pytorch implementation of the paper HiFi-GAN: High Fidelity Denoising and Dereverberation Based on Speech Deep F

Rishikesh (ऋषिकेश) 134 Dec 27, 2022