ReaLiSe
ReaLiSe is a multi-modal Chinese spell checking model.
This is the official code for the paper Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking.
The paper has been accepted to Findings of ACL 2021.
Environment
- Python: 3.6
- Cuda: 10.0
- Packages: pip install -r requirements.txt
Data
Raw Data
- SIGHAN Bake-off 2013: http://ir.itc.ntnu.edu.tw/lre/sighan7csc.html
- SIGHAN Bake-off 2014: http://ir.itc.ntnu.edu.tw/lre/clp14csc.html
- SIGHAN Bake-off 2015: http://ir.itc.ntnu.edu.tw/lre/sighan8csc.html
- Wang271K: https://github.com/wdimmy/Automatic-Corpus-Generation
Data Processing
The code and cleaned data are in the data_process directory.
You can also download the processed data directly from this link and put the files in the data directory. The data directory should then look like this:
data
|- trainall.times2.pkl
|- test.sighan15.pkl
|- test.sighan15.lbl.tsv
|- test.sighan14.pkl
|- test.sighan14.lbl.tsv
|- test.sighan13.pkl
|- test.sighan13.lbl.tsv
Pretrain
- BERT: chinese-roberta-wwm-ext
  - Huggingface: hfl/chinese-roberta-wwm-ext (https://huggingface.co/hfl/chinese-roberta-wwm-ext)
  - Local: /data/dobby_ceph_ir/neutrali/pretrained_models/roberta-base-ch-for-csc/
- Phonetic Encoder: pretrain_pho.sh
- Graphic Encoder: pretrain_res.sh
- Merge: merge.py
You can also download the pretrained and merged BERT, Phonetic Encoder, and Graphic Encoder directly from this link and put the files in the pretrained directory:
pretrained
|- pytorch_model.bin
|- vocab.txt
|- config.json
Train
After preparing the data and the pretrained model, you can train ReaLiSe by executing the train.sh script. Note that you should set PRETRAINED_DIR, DATE_DIR, and OUTPUT_DIR in the script first.
sh train.sh
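Before launching training, it can help to confirm that the data and pretrained directories contain the files listed in this README. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and it assumes both directories sit under the current working directory and are named data and pretrained.

```python
from pathlib import Path

# File names copied from the directory listings in this README.
EXPECTED = {
    "data": [
        "trainall.times2.pkl",
        "test.sighan15.pkl",
        "test.sighan15.lbl.tsv",
        "test.sighan14.pkl",
        "test.sighan14.lbl.tsv",
        "test.sighan13.pkl",
        "test.sighan13.lbl.tsv",
    ],
    "pretrained": ["pytorch_model.bin", "vocab.txt", "config.json"],
}


def missing_files(root=".", expected=EXPECTED):
    """Return the expected paths (relative to root) that are not regular files.

    Hypothetical helper for illustration; adjust `root` if your
    directories live somewhere other than the working directory.
    """
    root = Path(root)
    missing = []
    for directory, names in expected.items():
        for name in names:
            if not (root / directory / name).is_file():
                missing.append(str(Path(directory) / name))
    return missing


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing files:")
        for path in missing:
            print("  " + path)
    else:
        print("All expected files are present.")
```

If the script reports missing files, check the paths you assigned to PRETRAINED_DIR and DATE_DIR in train.sh against where the downloads were unpacked.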
Test
Test ReaLiSe using the test.sh script. You should set DATE_DIR, CKPT_DIR, and OUTPUT_DIR in the script first. CKPT_DIR is the OUTPUT_DIR of the training process.
sh test.sh
Well-trained Model
You can also download the well-trained model from this link and use it directly. The performance scores of ReaLiSe and some baseline models on the SIGHAN13, SIGHAN14, and SIGHAN15 test sets are listed below:
Methods
- FASpell: FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm
- Soft-Masked BERT: Spelling Error Correction with Soft-Masked BERT
- SpellGCN: SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check
- BERT: Our implementation
Metrics
- "D" means "Detection Level", "C" means "Correction Level".
- "A", "P", "R", and "F" mean "Accuracy", "Precision", "Recall", and "F1", respectively.
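As a reminder of how these quantities relate, here is a minimal sketch that computes them as percentages from raw counts. It is an illustration, not the repository's evaluation code: the function names are hypothetical, and it assumes the true positives, false positives, and false negatives have already been tallied at whichever level (detection or correction) is being scored.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) as percentages from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


def accuracy(correct, total):
    """Return accuracy as a percentage; 0.0 when there are no examples."""
    return 100 * correct / total if total else 0.0


if __name__ == "__main__":
    # Toy counts chosen for illustration only; they are not from the paper.
    p, r, f = precision_recall_f1(tp=3, fp=1, fn=1)
    print("{:.1f} {:.1f} {:.1f}".format(p, r, f))  # prints "75.0 75.0 75.0"
    print("{:.1f}".format(accuracy(correct=9, total=10)))  # prints "90.0"
```

Recomputing F1 from the rounded precision and recall values in the tables can differ from the reported F1 in the last digit, since the reported scores are presumably computed before rounding.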
SIGHAN15
| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F | 
|---|---|---|---|---|---|---|---|---|
| FASpell | 74.2 | 67.6 | 60.0 | 63.5 | 73.7 | 66.6 | 59.1 | 62.6 | 
| Soft-Masked BERT | 80.9 | 73.7 | 73.2 | 73.5 | 77.4 | 66.7 | 66.2 | 66.4 | 
| SpellGCN | - | 74.8 | 80.7 | 77.7 | - | 72.1 | 77.7 | 75.9 | 
| BERT | 82.4 | 74.2 | 78.0 | 76.1 | 81.0 | 71.6 | 75.3 | 73.4 | 
| ReaLiSe | 84.7 | 77.3 | 81.3 | 79.3 | 84.0 | 75.9 | 79.9 | 77.8 | 
SIGHAN14
| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F | 
|---|---|---|---|---|---|---|---|---|
| Pointer Network | - | 63.2 | 82.5 | 71.6 | - | 79.3 | 68.9 | 73.7 | 
| SpellGCN | - | 65.1 | 69.5 | 67.2 | - | 63.1 | 67.2 | 65.3 | 
| BERT | 75.7 | 64.5 | 68.6 | 66.5 | 74.6 | 62.4 | 66.3 | 64.3 | 
| ReaLiSe | 78.4 | 67.8 | 71.5 | 69.6 | 77.7 | 66.3 | 70.0 | 68.1 | 
SIGHAN13
| Method | D-A | D-P | D-R | D-F | C-A | C-P | C-R | C-F | 
|---|---|---|---|---|---|---|---|---|
| FASpell | 63.1 | 76.2 | 63.2 | 69.1 | 60.5 | 73.1 | 60.5 | 66.2 | 
| SpellGCN | 78.8 | 85.7 | 78.8 | 82.1 | 77.8 | 84.6 | 77.8 | 81.0 | 
| BERT | 77.0 | 85.0 | 77.0 | 80.8 | 77.4 | 83.0 | 75.2 | 78.9 | 
| ReaLiSe | 82.7 | 88.6 | 82.5 | 85.4 | 81.4 | 87.2 | 81.2 | 84.1 | 
Citation
@misc{xu2021read,
      title={Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking}, 
      author={Heng-Da Xu and Zhongli Li and Qingyu Zhou and Chao Li and Zizhen Wang and Yunbo Cao and Heyan Huang and Xian-Ling Mao},
      year={2021},
      eprint={2105.12306},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
