10 Firsts and 2 Seconds: The DAMO Academy NLP Team Wins SemEval 2022

2022-07-18 18:10:00 Alibaba Technology

Author: Yongjiang



The DAMO Academy NLP team won 10 firsts and 2 seconds in the international multilingual complex named entity recognition competition, and its average F1 across the 13 tracks exceeded the second-ranked team by more than 2%. The underlying NER technology has been published in 10+ papers at top international conferences such as ACL and EMNLP, is delivered both inside and outside the Group through the AliNLP platform and Alibaba Cloud NLP, and is currently being applied to multilingual search within the Group for AE and ICBU.

1. Background

SemEval (Semantic Evaluation), organized by SIGLEX, a special interest group of the Association for Computational Linguistics (ACL), is the most influential, largest, and most widely attended semantic evaluation competition in natural language processing (NLP). Since 2001, SemEval has been held 15 times. Multilingual understanding has attracted wide attention ever since the first SemEval.

In the edition of SemEval we entered, the goal was to build NER systems for 11 languages: English, Spanish, Dutch, Russian, Turkish, Korean, Persian, German, Chinese, Hindi, and Bengali. The task comprised 13 tracks: 1 multilingual track, 11 monolingual tracks, and 1 code-mixed track. The multilingual track requires training a single entity recognition model that can handle all languages; each monolingual track requires a model for a single language; and in the code-mixed track a single sentence contains several languages at once. The competition data mainly consist of sentences from three domains: Wikipedia, web Q&A, and user search. These sentences are typically short and lack context, and they often contain semantically ambiguous, complex entities, which makes the problem harder. We propose an NER system based on multilingual knowledge base retrieval; the submitted system took first place in 10 tracks and second place in 2, with an average F1 across the 13 tracks more than 2% above the second-ranked team.

2. Team Background

As the basic algorithm group of the DAMO Academy NLP team, we build information extraction capabilities for e-commerce, news, entertainment, address, power, and other industries inside the Group, and commercialize these capabilities externally.

In this competition we tried most of the multilingual NER technology we have accumulated across business scenarios over the past few years, including the following work:

| Conference | Paper title | Topic |
| --- | --- | --- |
| ACL 2020 | Structure-Level Knowledge Distillation for Multilingual Sequence Labeling | Distillation / unified model |
| EMNLP 2020 | AIN: Fast and Accurate Sequence Labeling with Approximate Inference Network | Model acceleration |
| EMNLP 2020 | More Embeddings, Better Sequence Labelers? | Performance optimization |
| EMNLP 2020 | An Investigation of Potential Function Designs for Neural CRF | Performance optimization |
| ACL 2021 | Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor | Distillation / unified model |
| ACL 2021 | Automated Concatenation of Embeddings for Structured Prediction | Ultimate performance |
| ACL 2021 | Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning | Knowledge augmentation |
| ACL 2021 | Multi-View Cross-Lingual Structured Prediction with Minimum Supervision | Cross-lingual |
| ACL 2021 | Risk Minimization for Zero-shot Sequence Labeling | Cross-domain / cross-lingual |
| EMNLP 2021 | Word Reordering for Zero-shot Cross-lingual Structured Prediction | Cross-lingual |
| NAACL 2022 | ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition | Multimodal |



3. Why We Entered: The Competition Is an Excellent Training Ground for Our Technology

In text understanding across industries, entity extraction is one of the most fundamental and most widely deployed NLP applications. Whether on the in-Group AliNLP platform or on the public cloud, entity extraction ranks among the top services in both call volume and number of users. We face similar data challenges across massive business scenarios: search queries, product titles, express waybills, power dispatching text, news articles, post-ASR speech text, and more. These texts come from different sources:

| Industry | Translated text | Short text | Highly divergent text | Poor quality |
| --- | --- | --- | --- | --- |
| E-commerce query | Y | Y | Y | N |
| E-commerce title | Y | N | Y | Y |
| Address | N | Y | N | N |
| Speech NER | N | Y | Y | Y |
| News | N | Y | Y | N |
| SemEval | Y | Y | Y | Y |

As the table shows, the competition data essentially inherit the various problems we encounter in our business scenarios, which makes the competition a particularly good training ground for our technology.

4. Challenges of the Competition

This multilingual information extraction task is difficult in two respects:

From the data perspective:

1. High annotation cost for multilingual corpora. Annotating named entities in multilingual corpora requires annotators proficient in each language; for some low-resource languages in particular, qualified annotators are scarce and annotation is expensive. Meanwhile, samples produced by annotation methods that rely on translation or distant supervision are of poor quality and can hardly meet the needs of model training and evaluation.

2. Sparse samples for low-resource languages. Corpora for low-resource languages are scarce, and existing cross-lingual data augmentation methods struggle to annotate named entities in both the source and target languages while keeping the text semantically coherent and syntactically correct.

3. Data imbalance. High-resource languages generally have far larger corpora than low-resource languages, leading to imbalance across languages. Models learned directly from such unbalanced data perform very differently from language to language and are hard to apply in real scenarios.

From the method perspective:

1. Understanding multilingual common-sense knowledge: recognizing even simple, common entities in a sentence is difficult for most NER models when context is missing. How to use large amounts of knowledge external to the text to strengthen the model's common-sense understanding is therefore a problem we need to solve.

2. Conflicts and connections between languages: on the one hand, task-related knowledge in different languages can reinforce each other; on the other hand, noise from different languages (annotation noise, cross-lingual semantic differences) also interferes across languages. A unified multilingual model must balance knowledge against noise, making full use of multilingual data to achieve the maximum performance gain in the multilingual setting.

5. How We Did It

Our final optimized solution consists of multiple stages. Here we mainly introduce what we consider the core technology (and the source of the biggest improvement): the knowledge-based named entity recognition system. Below is a brief overview of our technical scheme; the complete report is available on arXiv.

From optimizing models for different business scenarios and for public academic datasets, the most important lesson we have learned is that introducing external knowledge can greatly improve a model's ability to understand entities.

After receiving the official competition data (training set + validation set), we analyzed it and made several interesting findings:

1. Most training-set sentences are long.

2. The validation set is more diverse and includes many short translated queries.

Before we received the test data, we expected domain transfer to be a major challenge. At the start of the competition, we considered the following factors when designing our models:

1. Since there are 13 tracks, the schemes for the different tracks should be as unified as possible, which makes model iteration easier.

2. When choosing between different models, we used the English dataset to debug.

3. The test phase lasted only four days (later extended to six), and the test sets were fairly large, so model inference speed must not become a bottleneck.

4. Facing the domain transfer problem, we wanted to integrate external knowledge so that the model learns from context grounded in that knowledge rather than overfitting the training data.

At the same time, analysis of individual examples showed that this competition places heavy demands on knowledge. Consider the sentence "köpings is rate.": here "köpings is" is a sports club, so it is a group entity (GRP). Without additional knowledge, under this syntactic pattern (xx is xxx) a model will readily predict "köpings" as a location (LOC).



By querying search engines we can obtain rich contexts, and these contexts provide additional knowledge that helps the model disambiguate.

Therefore, we adopted the method from our ACL 2021 paper Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning, which we had found to significantly improve model performance. However, Google/Baidu search results are noisy; even when fed such noisy data, the model still performs better. This led us to think further about how to obtain richer and cleaner knowledge from other sources. Multilingual Wikipedia, which is easy to obtain and covers more than a dozen languages, is a good knowledge base. We then analyzed what additional knowledge Wikipedia can provide for model training:

1. Rich text from a variety of domains

2. Massive phrase information (span/mention knowledge)

3. Mappings from phrases to entity names (the link function in wiki pages, i.e., mention -> entity information)





We propose a multilingual information extraction system based on retrieval from a general knowledge base. By retrieving knowledge relevant to the input sentence from the knowledge base, entities become easier to recognize and extract. First, from Wikipedia in the 11 languages, we build a multilingual knowledge base for retrieving knowledge relevant to input sentences; this can be viewed as building a text index over the wiki documents. During retrieval we do two important things:

1. Mark the phrase (span) information in the retrieved text

2. Mark phrases together with the entities they link to

For example, Wikipedia provides rich entity link information, such as Apple -> Apple Inc and Steve Jobs -> Steve Jobs, so the sentence "Steve Jobs founded Apple" can be transformed into <e:Steve_Jobs>Steve Jobs</e> founded <e:Apple_Inc>Apple</e>. Through the <e> </e> tags we can easily introduce the full name of each entity, which we believe provides additional disambiguation information.
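To make this transformation concrete, below is a minimal Python sketch of the tagging step. It assumes the Wikipedia dump has already been reduced to plain text in which links survive as [[Entity name|surface form]] markup; the function name and the regex are ours for illustration, not taken from the released KB-NER code.

```python
import re

# Wiki-style links: [[target|surface form]] or [[target]].
WIKI_LINK = re.compile(r"\[\[([^|\]]+)\|([^\]]+)\]\]|\[\[([^\]]+)\]\]")

def tag_entity_links(text: str) -> str:
    """Rewrite wiki links into <e:entity>mention</e> tags."""
    def repl(m: re.Match) -> str:
        if m.group(1):                       # [[target|surface]]
            target, surface = m.group(1), m.group(2)
        else:                                # [[target]]: surface == target
            target = surface = m.group(3)
        return f"<e:{target.replace(' ', '_')}>{surface}</e>"
    return WIKI_LINK.sub(repl, text)

print(tag_entity_links("[[Steve Jobs]] founded [[Apple Inc|Apple]]."))
# -> <e:Steve_Jobs>Steve Jobs</e> founded <e:Apple_Inc>Apple</e>.
```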

We use ElasticSearch to build an index over the wiki text. During retrieval, we consider the following retrieval methods (a sketch in Python follows the list):

1. Sentence retrieval: feed the text to be processed directly into ElasticSearch as the search query

2. Iterative entity retrieval: first tag the text with an existing model, then combine the predicted mentions and the whole sentence with OR and search
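As an illustration of the two strategies, here is a hedged sketch using the official elasticsearch Python client. The index name, the text field, and the function names are assumptions made for this example; the schema in the released system may differ.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "wiki_multilingual"  # index over entity-tagged wiki text (name assumed)

def sentence_retrieval(sentence: str, k: int = 10) -> list[str]:
    """Strategy 1: search the index with the raw input sentence."""
    resp = es.search(index=INDEX, size=k,
                     query={"match": {"text": sentence}})
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

def iterative_entity_retrieval(sentence: str, mentions: list[str],
                               k: int = 10) -> list[str]:
    """Strategy 2: OR the sentence with mentions from a first-pass model."""
    should = [{"match": {"text": sentence}}]
    should += [{"match_phrase": {"text": m}} for m in mentions]
    resp = es.search(index=INDEX, size=k,
                     query={"bool": {"should": should}})
    return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]

# e.g. iterative_entity_retrieval("köpings is rate .", ["köpings is"])
```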

For the retrieved text, we consider the following ways of using it:

1. Use only the retrieved sentences

2. Use the retrieved paragraphs

3. An ablation in which the "phrases and their entities" markup is removed

The final scheme is shown in the figure below:





In application, we use the retrieved knowledge by introducing it as context. Concretely, we concatenate the input sentence with the retrieved knowledge and feed the concatenated string into the information extraction model: for a sentence x, we obtain its corresponding context x', combine them into a new input, and pass it through a fine-tuned XLM-R large pre-trained model.
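A minimal sketch of this concatenation with the Hugging Face transformers library is shown below. The sentence and retrieved context are encoded as a sentence pair, so XLM-R sees <s> x </s></s> x' </s>. The label count (13 = BIO tags over the task's 6 entity types plus O) and the context string are illustrative, and the full system additionally uses a CRF layer and cooperative learning, per the ACL 2021 paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# 13 labels = B/I tags for the 6 MultiCoNER entity types + "O" (illustrative).
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=13)

sentence = "köpings is rate ."
# A context string as it might come back from the knowledge base (illustrative).
context = "<e:Köpings_IS>köpings is</e> is a sports club in Köping ."

# Sentence-pair encoding; only the token positions belonging to `sentence`
# are scored and labelled downstream.
enc = tokenizer(sentence, context, return_tensors="pt",
                truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
print(logits.shape)
```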





Part of this approach comes from our paper published last year at ACL, the top conference in natural language processing: Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning.

The experimental results are as follows (LOWNER is in-domain; MSQ and ORCAS are out-of-domain):





We found that knowledge augmentation based on knowledge-base retrieval greatly improves the performance of the information extraction system: it brings an absolute F1 improvement of 7% on in-distribution data, and a 10%-20% F1 improvement under cross-domain conditions (the MSQ web Q&A dataset and the ORCAS user search query dataset).

In the end, our submitted system took first place in 10 tracks and second place in 2, with an average F1 across the 13 tracks more than 2% above the second-ranked team. 47 teams participated, including NetEase, iFLYTEK, Ping An Technology, Huawei, IBM, Cisco, Samsung Electronics, Shenzhen Apple Tree, the University of Science and Technology of China, the Chinese Academy of Sciences, Humboldt University, Aalto University, the Indian Institute of Technology, and others. The detailed results are available here. Comparing the results of several selected teams below, our solution surpasses the second-ranked system by more than 2% average F1, and in languages such as English and Russian it greatly surpasses the other submitted systems.





Other tricks also significantly improve performance. The following techniques bring further gains on top of the method above, and some of them are general enough to apply to many NLP tasks:

1. After receiving the data, we continue pretraining the multilingual pre-trained language model XLM-R large with masked language modeling on the competition datasets, which brings a 0.5%-1% F1 improvement on all datasets (a sketch follows this list).

2. We first pool all the data together for fine-tuning, and then fine-tune a second time on each track's own dataset, which brings a 2% F1 improvement.

3. The embedding concatenation techniques from our EMNLP 2020 and ACL 2021 papers bring a further performance improvement of about 0.8%.

4. Finally, we train many models and ensemble their results, which improves performance by another 0.5%-1%.
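As an illustration of technique 1, here is a minimal sketch of continued masked-language-model pretraining with the Hugging Face Trainer. The input file name and the hyperparameters are assumptions for the example, not the settings used in the competition.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-large")

# "multiconer_sentences.txt": one competition sentence per line (path assumed).
ds = load_dataset("text", data_files={"train": "multiconer_sentences.txt"})
tokenized = ds["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

# Standard MLM objective: randomly mask 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="xlmr-large-multiconer-mlm",
                         per_device_train_batch_size=16,
                         num_train_epochs=3,
                         learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()
```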

The gains from these strategies do not conflict with one another. In the end, our solution won 10 firsts and 2 seconds, with an average F1 across the 13 tracks more than 2% above the second-ranked team.

6. Applications

NER is one of the most widely used NLP technologies. We promote it both inside and outside the Group, including:

1. Multilingual search for AE & ICBU: we support query and product entity extraction for AE and ICBU, and work with AE and ICBU business partners to improve search relevance.

2. AliNLP platform and Alibaba Cloud NLP: the NER technology described in this article is offered to internal and external customers of the Group through the AliNLP platform and Alibaba Cloud NLP; trial use and feedback are welcome.

Related links:

1. Competition website: SemEval 2022 Task 11: MultiCoNER

2. Competition code (open source): GitHub - Alibaba-NLP/KB-NER: Winner system (DAMO-NLP) of SemEval 2022 MultiCoNER shared task over 10 out of 13 tracks

3. Our report: https://arxiv.org/pdf/2203.00545.pdf

4. Final per-track rankings: SemEval 2022 Task 11: MultiCoNER
