10 Firsts and 2 Seconds: the DAMO Academy NLP Team's Championship Run at SemEval 2022
2022-07-18 18:10:00 【Alibaba Technology】
Author: Yongjiang

The DAMO Academy NLP team won 10 first places and 2 second places in the international multilingual complex named entity recognition competition, with an average F1 across all 13 tracks more than 2% above the second-ranked team. The related NER techniques have appeared in 10+ papers at top international conferences such as ACL and EMNLP, and are delivered both inside and outside the Group through the AliNLP platform and Alibaba Cloud NLP; within the Group, the current focus is multilingual search for AE and ICBU.
1. Background
SemEval (Semantic Evaluation), organized by SIGLEX, a special interest group of the Association for Computational Linguistics (ACL), is the most influential, largest, and most widely attended semantic evaluation competition in natural language processing (NLP). Since 2001, SemEval has been held 15 times, and multilingual understanding has attracted attention from the very first edition.
In the SemEval task we entered, the goal was to build NER systems for 11 languages: English, Spanish, Dutch, Russian, Turkish, Korean, Persian, German, Chinese, Hindi, and Bengali. The task comprises 13 tracks: 1 multilingual track, 11 monolingual tracks, and 1 code-mixed track. The multilingual track requires a single multilingual entity recognition model that handles all languages; each monolingual track requires a monolingual model for one language; and in the code-mixed track a single sentence contains multiple languages at once. The competition data mainly consists of sentences from three domains: Wikipedia, web Q&A, and user search queries. These sentences are typically short and lack context, and they often contain semantically ambiguous and complex entities, which makes the problem harder. We proposed an NER system based on multilingual knowledge-base retrieval; the submitted system won 10 first places and 2 second places, with an average F1 across the 13 tracks more than 2% above the second-ranked team.
2. Team Background
As the basic algorithm group of the DAMO Academy NLP team, we build information extraction capabilities for e-commerce, news, entertainment, address, power, and other industries internally, and commercialize these capabilities externally.
In this competition, we tried most of the multilingual NER techniques we have accumulated over the past few years across our business scenarios, including the following work:
| Conference | Paper Title | Topic |
| --- | --- | --- |
| ACL 2020 | Structure-Level Knowledge Distillation for Multilingual Sequence Labeling | Distillation / unified model |
| EMNLP 2020 | AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network | Model acceleration |
| EMNLP 2020 | More Embeddings, Better Sequence Labelers? | Performance optimization |
| EMNLP 2020 | An Investigation of Potential Function Designs for Neural CRF | Performance optimization |
| ACL 2021 | Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor | Distillation / unified model |
| ACL 2021 | Automated Concatenation of Embeddings for Structured Prediction | Peak performance |
| ACL 2021 | Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning | Knowledge enhancement |
| ACL 2021 | Multi-View Cross-Lingual Structured Prediction with Minimum Supervision | Cross-lingual |
| ACL 2021 | Risk Minimization for Zero-shot Sequence Labeling | Cross-domain / cross-lingual |
| EMNLP 2021 | Word Reordering for Zero-shot Cross-lingual Structured Prediction | Cross-lingual |
| NAACL 2022 | ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition | Multimodal |
3. Why We Entered: The Competition as a Training Ground
Across the text understanding needs of various industries, entity extraction is one of the most fundamental and most widely deployed NLP applications. Whether on AliNLP within the Group or on the public cloud, entity extraction ranks among the top services by call volume and number of users. We face the same data challenges across massive business scenarios: search queries, product titles, express waybills, power dispatching text, news articles, post-ASR speech transcripts, and so on. These texts come from different sources:
| Industry | Translated text | Short text | Highly divergent text | Noisy text |
| --- | --- | --- | --- | --- |
| E-commerce query | Y | Y | Y | N |
| E-commerce title | Y | N | Y | Y |
| Address | N | Y | N | N |
| Voice NER | N | Y | Y | Y |
| News | N | Y | Y | N |
| SemEval | Y | Y | Y | Y |
As the table shows, the competition data basically inherits all the problems we encounter in our business scenarios, making it a particularly good training ground for our techniques.
4. Challenges of the Competition
The difficulty of this multilingual information extraction task lies in two aspects:
From the data perspective:
1. Annotating multilingual corpora is expensive. Labeling named entities in multilingual corpora requires annotators with different language skills; for some low-resource languages there are few qualified annotators, so annotation is costly. Meanwhile, samples produced by annotation methods that rely on translation or distant supervision are of poor quality and can hardly meet the needs of model training and evaluation.
2. Samples are sparse in low-resource languages. Corpora for low-resource languages are scarce, and cross-lingual data augmentation methods struggle to annotate named entities in both the source and target languages while keeping the semantics coherent and the syntax correct.
3. Data is imbalanced. High-resource languages generally have far more corpus data than low-resource ones, causing imbalance across languages. A model learned directly from imbalanced data performs very differently across languages and is hard to apply in real scenarios (a common mitigation is sketched after this list).
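One common mitigation for such imbalance, though not necessarily the exact recipe used here, is temperature-based sampling when drawing training batches. A minimal sketch with hypothetical corpus sizes:

```python
import random

def sampling_weights(counts, temperature=0.7):
    """Per-language sampling probabilities under temperature scaling.

    counts: language -> number of training sentences. A temperature
    below 1 flattens the distribution, so low-resource languages are
    drawn more often than their raw share of the data.
    """
    scaled = {lang: n ** temperature for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: w / total for lang, w in scaled.items()}

# Hypothetical corpus sizes: English dwarfs Bengali.
probs = sampling_weights({"en": 100_000, "zh": 40_000, "bn": 2_000})
langs, weights = zip(*probs.items())
batch_lang = random.choices(langs, weights=weights, k=1)[0]  # language of the next batch
```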
From the method perspective:
1. Understanding multilingual common-sense knowledge: in the absence of context, identifying even simple common entities in a sentence is hard for most NER models. How to use large amounts of external knowledge to strengthen the model's common-sense understanding is therefore a problem we need to solve.
2. Conflicts and connections between languages: on the one hand, task-related knowledge in different languages can reinforce each other; on the other hand, noise in different languages (annotation noise, cross-lingual semantic differences) also propagates across them. Designing a unified multilingual model must balance knowledge against noise, making full use of multilingual data to achieve the maximum performance gain in the multilingual setting.
5. How We Did It
Our final optimized solution consists of multiple stages. Here I mainly introduce what I consider the core technique (also the one that brought the biggest improvement): the knowledge-based named entity recognition system. Below is a brief introduction to the technical scheme; the complete report is available on arXiv.
From optimizing models for various business scenarios and on academic public datasets in the past, we learned one lesson above all: introducing external knowledge can greatly improve a model's ability to understand entities.
So after receiving the official competition data (training set + validation set), we analyzed it and made several interesting findings:
1. Most training sentences are long.
2. The validation set is more diverse, including many short translated queries.
Before receiving the test data, we expected the domain-shift challenge to be severe. At the start of the competition, when designing our models, we considered the following factors:
1. Since there are 13 tracks, the schemes for different tracks should be as unified as possible, which makes model iteration easier.
2. When choosing between models, we use the English dataset for debugging.
3. The test phase lasted only four days (later extended to six), and the test sets are fairly large, so model inference speed must not become a bottleneck.
4. Facing domain shift, we wanted to incorporate external knowledge so that the model learns to predict from retrieved context rather than overfit the training data.
Meanwhile, analyzing examples showed that this competition also places a heavy demand on knowledge. Consider the example "köpings is rate": here "köpings is" is a sports club, so it is a group entity (GRP). Without extra knowledge, under the grammatical pattern (xx is xxx) a model easily predicts "köpings" as a location (LOC).

By querying search engines, we can obtain rich contexts that supply additional knowledge to help the model disambiguate.
Therefore we adopted the method from our ACL 2021 paper Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning, which we had already found to significantly improve model performance. But search results from Google/Baidu are noisy, and even fed such noisy data the model still performed better. This made me think further: how could we obtain more complete and cleaner knowledge from other sources? Multilingual Wikipedia, which is easy to obtain and covers more than a dozen languages, is a good knowledge base. We then analyzed what additional knowledge Wikipedia could provide for model training:
1. Rich text from a wide range of domains.
2. Massive phrase information (span/mention knowledge).
3. Links from phrases to entity names (the hyperlink function in wiki pages, i.e., mention -> entity information).

We propose a multilingual information extraction system based on a general-purpose retrieval knowledge base. By retrieving knowledge related to the input sentence from the knowledge base, entities become easier to recognize and extract. First, from the Wikipedias of the 11 languages, we build a multilingual knowledge base for retrieving knowledge related to input sentences; this can be seen as building a text index over the wiki documents. When processing the wiki text, we did two important things:
1. Mark the phrase (mention) information in the retrieved text.
2. Mark phrases together with the entities they link to.
For example, Wikipedia provides rich entity link information such as '''Apple -> Apple Inc''' and '''Steve Jobs -> Steve Jobs''', so the sentence "Steve Jobs founded Apple" can be transformed into <e:Steve_Jobs>Steve Jobs</e> founded <e:Apple_Inc>Apple</e>. Through the <e> </e> tags we can easily inject the full name of each entity, which we believe provides extra disambiguation information.
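As an illustration, here is a minimal sketch of producing such markup, assuming a simple `anchors` mapping extracted from Wikipedia link data (the function and variable names are hypothetical):

```python
def mark_entities(sentence, anchors):
    """Wrap known mention phrases in <e:entity>...</e> tags.

    anchors: mapping from a surface mention to the title of the
    Wikipedia page it links to, e.g. {"Apple": "Apple_Inc"}.
    Longer mentions are matched first so "Steve Jobs" wins over "Jobs".
    Note: naive string replacement; a real pipeline would match on
    token boundaries and handle overlapping mentions.
    """
    for mention in sorted(anchors, key=len, reverse=True):
        sentence = sentence.replace(
            mention, f"<e:{anchors[mention]}>{mention}</e>")
    return sentence

anchors = {"Apple": "Apple_Inc", "Steve Jobs": "Steve_Jobs"}
print(mark_entities("Steve Jobs founded Apple", anchors))
# -> <e:Steve_Jobs>Steve Jobs</e> founded <e:Apple_Inc>Apple</e>
```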
We use ElasticSearch to build an index over the wiki text. During retrieval we consider the following methods (both are sketched after this list):
1. Sentence retrieval: feed the text to be processed directly into ElasticSearch as the query.
2. Iterative entity retrieval: first tag the text with an existing model, then combine the predicted mentions with the whole sentence in an OR query.
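A minimal sketch of both retrieval modes with the official Python client, assuming the wiki dump is indexed under an index named `wiki` with a `text` field (index and field names are hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sentence_retrieval(sentence, k=10):
    """Mode 1: query the index with the raw input sentence."""
    resp = es.search(index="wiki", query={"match": {"text": sentence}}, size=k)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

def entity_retrieval(sentence, mentions, k=10):
    """Mode 2: OR the whole sentence together with the mentions that a
    first-pass NER model predicted, so documents containing a mention
    rank higher. `mentions` comes from that existing model."""
    should = [{"match": {"text": sentence}}]
    should += [{"match_phrase": {"text": m}} for m in mentions]
    resp = es.search(index="wiki", query={"bool": {"should": should}}, size=k)
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

contexts = entity_retrieval("köpings is rate", mentions=["köpings is"])
```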
For the retrieved text, we consider the following ways of using it:
1. Use only the retrieved sentences.
2. Use the retrieved paragraphs.
3. We also add an ablation experiment with the "phrase and its entity" markup removed.
The final scheme is shown in the figure below:

In application, we use the retrieved knowledge by introducing it as context: we concatenate the input sentence with the retrieved knowledge and feed the concatenated string into the information extraction model. Concretely, for a sentence x we obtain its corresponding context x', combine the two into a new input, and pass it through an optimized XLMR-large pre-trained model (see the sketch below).
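A minimal sketch of this sentence-plus-context input with Hugging Face transformers; the label count and truncation policy are simplified assumptions, and the team's additional training tricks are omitted:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# 6 MultiCoNER entity types in a BIO scheme plus O -> 13 labels.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=13)

sentence = "köpings is rate"
# An illustrative retrieved snippet; in the real system this comes
# from the ElasticSearch knowledge base described above.
context = "<e:Köpings_IS>köpings is</e> is a sports club in Köping ..."

# Encode the pair "sentence </s></s> context". Only the sentence
# tokens are labeled during training; the context tokens just
# condition the representations. Truncate the context if too long.
enc = tokenizer(sentence, context, return_tensors="pt",
                truncation="only_second", max_length=512)
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, 13)
pred_label_ids = logits.argmax(-1)
```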

Part of this approach comes from the paper we published at ACL last year, Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning.
The experimental results are as follows (LOWNER is in-domain; MSQ and ORCAS are out-of-domain):

We found that knowledge enhancement based on knowledge-base retrieval greatly improves the information extraction system: it brings an absolute F1 improvement of 7% on in-distribution data, and a 10%-20% F1 improvement under cross-domain conditions (the MSQ web Q&A dataset and the ORCAS user search query dataset).
Finally, our submitted system won 10 first places and 2 second places, with an average F1 across the 13 tracks more than 2% above the second-ranked team. 47 teams took part, including NetEase, iFLYTEK, Ping An Technology, Huawei, IBM, Cisco, Samsung Electronics, Shenzhen Apple Tree, USTC, the Chinese Academy of Sciences, Humboldt University, Aalto University, the Indian Institute of Technology, and others. The detailed results are available here. Comparing the results of several selected teams, you can see that our scheme surpasses the second-ranked system by more than 2% average F1, and on languages such as English and Russian it greatly surpasses the other submitted systems.

Other tricks also brought significant gains. The following techniques further improve on the baseline method above, and some are general enough to apply to all kinds of NLP tasks:
1. After receiving the data, we continued pretraining the multilingual pre-trained language model XLMR-large with masked language modeling on the competition datasets, which brings a 0.5%-1% F1 improvement on all datasets.
2. We first pool all the data together for fine-tuning, then fine-tune a second time on each track's own dataset, which brings a 2% F1 improvement.
3. The embedding concatenation techniques from our EMNLP 2020 and ACL 2021 papers bring a further improvement of about 0.8%.
4. Finally, we train many models and ensemble their predictions, which improves performance by another 0.5%-1% (a minimal voting sketch follows).
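The system's ensembling may be more elaborate, but at its simplest an ensemble of sequence labelers can be a token-level majority vote, as in this sketch:

```python
from collections import Counter

def majority_vote(all_preds):
    """Token-level majority vote over label sequences predicted by
    several independently trained models for the same sentence.

    all_preds: one label sequence per model, all of equal length,
    e.g. [["B-GRP", "O"], ["B-GRP", "O"], ["B-LOC", "O"]].
    Note: voting can produce invalid BIO transitions (an "I-" with no
    preceding "B-"), which a real system would repair afterwards.
    """
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*all_preds)]

print(majority_vote([["B-GRP", "O"], ["B-GRP", "O"], ["B-LOC", "O"]]))
# -> ['B-GRP', 'O']
```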
The gains from the above strategies do not conflict with one another. In the end, our scheme won 10 first places and 2 second places, with an average F1 across the 13 tracks more than 2% above the second-ranked team.
6. Applications
NER is one of the most widely used NLP technologies, and we promote it both inside and outside the Group, including:
1. Multilingual search for AE & ICBU: we support query and product-title entity extraction for AE and ICBU, and work with AE and ICBU business partners to improve search relevance.
2. AliNLP platform and Alibaba Cloud NLP: the NER techniques described in this article are offered to customers inside and outside the Group through the AliNLP platform and Alibaba Cloud NLP. You are welcome to try them and send feedback.
Links:
1. Competition website: SemEval 2022 Task 11: MultiCoNER
2. Competition code (open source): GitHub - Alibaba-NLP/KB-NER: Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks.
3. Our report: https://arxiv.org/pdf/2203.00545.pdf
4. Final per-track rankings: SemEval 2022 Task 11: MultiCoNER