This is the code for Assignment 1 of the Web Data Processing System.

Last update: Dec 04, 2022

Overview

First Assignment - Entity Linking

Web Data Processing System Assignment 1 - 2021 - Group 26

Zhining Bai
Bowen Lyu
Tianshi Chen
Yiming Xu

Description

This is a Python program that performs entity linking on WARC files. It recognizes entities in web pages and links them to a knowledge base (Wikidata). The pipeline is as follows:

Read WARC

Use pyspark to read large-scale WARC files, so the program supports parallel computing.
Extract the text from the HTML pages using beautifulsoup.
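The reading step can be illustrated with a minimal, single-process sketch. It is not the project's code: it uses only the Python standard library (`html.parser` stands in for beautifulsoup, and no pyspark is involved), it assumes records start with the `WARC/1.0` version line, and the helper names `split_records`, `find_page_id` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: every record in the decoded payload starts with this version line.
RECORD_DELIMITER = "WARC/1.0"


def split_records(payload):
    """Yield each WARC record (headers plus body) found in a decoded payload."""
    for chunk in payload.split(RECORD_DELIMITER):
        if chunk.strip():
            yield chunk


def find_page_id(record, keyname):
    """Return the value of the header named `keyname`, or None if it is absent."""
    prefix = keyname + ":"
    for line in record.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as one space-joined string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Splitting on the version string is a simplification: a page body that happens to contain `WARC/1.0` would be cut in two. In a Spark job, functions like these would typically be applied per record so that pages are processed in parallel.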

Named entity recognition

Extract entities using the recognize_entities_bert model from sparknlp.

Disambiguation and NIL

To disambiguate, we consider both the popularity of each candidate page and the semantic similarity between the sentence containing the entity and the candidate's description.

Popularity: Rank candidates using the Elasticsearch scoring algorithm and the number of properties the entity has in the knowledge graph.
Sentence similarity: Measure the difference between the sentence text and the candidate description using the Levenshtein distance.

NIL: Retain only results with a Levenshtein distance < 40; mentions without such a candidate are left unlinked.
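The distance-based selection and the NIL cut-off can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the edit distance is written in pure Python instead of calling python-Levenshtein, the candidate list is assumed to arrive already ordered by popularity, and `link_mention` and the candidate identifiers are hypothetical.

```python
NIL_THRESHOLD = 40  # from the text: only results with distance < 40 are kept


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def link_mention(sentence, candidates, threshold=NIL_THRESHOLD):
    """Pick the candidate whose description is closest to the sentence.

    `candidates` is a list of (entity_id, description) pairs. Assumption:
    the list is ordered from most to least popular, so a tie in distance
    keeps the more popular candidate. Returns None (NIL) when no candidate
    has a distance below `threshold`.
    """
    best_id, best_distance = None, None
    for entity_id, description in candidates:
        distance = levenshtein(sentence.lower(), description.lower())
        if best_distance is None or distance < best_distance:
            best_id, best_distance = entity_id, distance
    if best_distance is None or best_distance >= threshold:
        return None
    return best_id
```

Because the raw edit distance grows with the length difference between the sentence and the description, a fixed threshold such as 40 tends to reject long sentences paired with short descriptions.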

Prerequisites

The code runs on the DAS cluster at /var/scratch/wdps2106/wdps_2126, where result1 is a conda virtual environment that has already been created. The packages below must be installed to run the assignment.

# To install the packages with pip (pip for Python 3; Python version 3.8), use the following commands
pip install pyspark==3.1.2
pip install spark-nlp==3.3.3
pip install beautifulsoup4
pip install python-Levenshtein
pip install elasticsearch

# To install the packages with conda (recommended), use the following commands
conda create -n <env_name> python=3.8
conda install pyspark
conda install bs4
conda install elasticsearch
pip install python-Levenshtein
pip install sparknlp

Run

To run the program, use the command below. The Keyname parameter is the name of the page ID field in the WARC files, such as WARC_TREC_ID, and you must pass it explicitly. Be aware that the result file will be renamed to result.tsv.

sh run.sh /path/to/warc/file.warc.gz /path/to/result/ Keyname

If you use the DAS cluster, you also need to run this command first:

export OPENBLAS_NUM_THREADS=10

To check the score of the result file, use the command below.

python3 score.py /sample/annotation/file/sample.tsv /generated/result/file/result.tsv

Result

We tested our entity linking code on sample.warc.gz. Since sample_annotations.tsv only contains entities whose page_id is less than 92, our test results only include entity links with page_id <= 92. The F1 score on the sample data is 0.1122.

Metric	Value
Gold	500
Predicted	480
Correct	55
Precision	0.1145
Recall	0.11
F1 Score	0.1122
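The values in the table follow from the three counts under the standard definitions of precision, recall and F1. The sketch below recomputes them; it is not taken from score.py, and the function name `score` is hypothetical.

```python
def score(gold, predicted, correct):
    """Return (precision, recall, f1) from entity-link counts.

    gold:      number of annotated links in the reference file
    predicted: number of links the program produced
    correct:   number of produced links that match the reference
    """
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = score(gold=500, predicted=480, correct=55)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With these counts, F1 simplifies to 2 * 55 / (480 + 500) = 110 / 980, which matches the reported 0.1122.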

Releases(wdps)

wdps(Jun 1, 2022)

This is a release test.
