This is Assignment1 code for the Web Data Processing System.

Overview

First Assignment - Entity Linking

Web Data Processing System Assignment 1 - 2021 - Group 26

  • Zhining Bai
  • Bowen Lyu
  • Tianshi Chen
  • Yiming Xu

Description

This is a Python program to Entity Linking by processing WARC files. We recognize entities from web pages and link them to a Knowledge Base(Wikidata). The pipeline for this program as below:

image

Read WARC

  • Use pyspark to read large-scale warc files, so the program supports parallel computing.
  • Extract text information from HTML files by using beautifulsoup.

Named entity recognition

  • Extract entities by using recognize_entities_bert model from sparknlp.

Disambiguation and NIL

We considered the popularity of the candidate page as well as the semantic similarity between the sentence where the entity is located and the candidate description to achieve Disambiguation.

  • Popularity: Calculate popularity rankings using the Elasticsearch scoring algorithm and the number of properties of the mention from the knowledge graph.
  • Sentence similarity: Measure the difference between text and description using the Levenshtein distance.

NIL: Retain results with distances < 40.

image

Prerequisites

Codes are run on the DAS cluster at /var/scratch/wdps2106/wdps_2126, result1 is a conda virtual environment that has been created. Below are the packages installed to run the assignment.

# if you want to use pip(pip for python3) to install the packages, use the following command(python version 3.8)
pip install pyspark==3.1.2
pip install spark-nlp==3.3.3
pip install beautifulsoup4
pip install python-Levenshtein
pip install elasticsearch

# if you want to use conda to install the packages, use the following command(recommended)
conda create -n 
   
     python=3.8
conda install pyspark
conda install bs4
conda install elasticsearch
pip install python-Levenshtein
pip install sparknlp

   

Run

To run the program, you can simply use the command below. The parameter Keyname is the name of page ID in WARC files such as WARC_TREC_ID. You need to declare the name of the page ID using this parameter. Be aware that the result file will be renamed as result.tsv.

sh run.sh /path/to/warc/file.warc.gz /path/to/result/ Keyname

If you use DAS cluster, you also need to add this command before running:

export OPENBLAS_NUM_THREADS=10

To check the score of the result file, use the command below.

python3 score.py /sample/annotation/file/sample.tsv /generated/result/file/result.tsv

Result

We tested our entity linking code using sample.warc.gz. Since sample_annotations.tsv only contains the entities that page_id is less than 92, our test results only output entity links with page_id <= 92. The f1 score of the sample data is 0.1122.

Metric Value
Gold 500
Predicted 480
Correct 55
Precision 0.1145
Recall 0.11
F1 Score 0.1122
You might also like...
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Basic Utilities for PyTorch Natural Language Processing (NLP)

Basic Utilities for PyTorch Natural Language Processing (NLP) PyTorch-NLP, or torchnlp for short, is a library of basic utilities for PyTorch NLP. tor

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing Trankit is a light-weight Transformer-based Pyth

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing
PORORO: Platform Of neuRal mOdels for natuRal language prOcessing

PORORO: Platform Of neuRal mOdels for natuRal language prOcessing pororo performs Natural Language Processing and Speech-related tasks. It is easy to

💫 Industrial-strength Natural Language Processing (NLP) in Python

spaCy: Industrial-strength NLP spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest researc

🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.
🤗Transformers: State-of-the-art Natural Language Processing for Pytorch and TensorFlow 2.0.

State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0 🤗 Transformers provides thousands of pretrained models to perform tasks o

A very simple framework for state-of-the-art Natural Language Processing (NLP)

A very simple framework for state-of-the-art NLP. Developed by Humboldt University of Berlin and friends. IMPORTANT: (30.08.2020) We moved our models

Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.

TextBlob: Simplified Text Processing Homepage: https://textblob.readthedocs.io/ TextBlob is a Python (2 and 3) library for processing textual data. It

State of the Art Natural Language Processing

Spark NLP: State of the Art Natural Language Processing Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provide

Releases(wdps)
Text classification on IMDB dataset using Keras and Bi-LSTM network

Text classification on IMDB dataset using Keras and Bi-LSTM Text classification on IMDB dataset using Keras and Bi-LSTM network. Usage python3 main.py

Hamza Rashid 2 Sep 27, 2022
Summarization, translation, sentiment-analysis, text-generation and more at blazing speed using a T5 version implemented in ONNX.

Summarization, translation, Q&A, text generation and more at blazing speed using a T5 version implemented in ONNX. This package is still in alpha stag

Abel 211 Dec 28, 2022
Trex is a tool to match semantically similar functions based on transfer learning.

Trex is a tool to match semantically similar functions based on transfer learning.

62 Dec 28, 2022
Espial is an engine for automated organization and discovery of personal knowledge

Live Demo (currently not running, on it) Espial is an engine for automated organization and discovery in knowledge bases. It can be adapted to run wit

Uzay-G 159 Dec 30, 2022
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

Texar-PyTorch is a toolkit aiming to support a broad set of machine learning, especially natural language processing and text generation tasks. Texar

ASYML 726 Dec 30, 2022
Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.

Ivan Didur 106 Jan 01, 2023
Tools for curating biomedical training data for large-scale language modeling

Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop 242 Dec 25, 2022
Sinkhorn Transformer - Practical implementation of Sparse Sinkhorn Attention

Sinkhorn Transformer This is a reproduction of the work outlined in Sparse Sinkhorn Attention, with additional enhancements. It includes a parameteriz

Phil Wang 217 Nov 25, 2022
REST API for sentence tokenization and embedding using Multilingual Universal Sentence Encoder.

What is MUSE? MUSE stands for Multilingual Universal Sentence Encoder - multilingual extension (16 languages) of Universal Sentence Encoder (USE). MUS

Dani El-Ayyass 47 Sep 05, 2022
A deep learning-based translation library built on Huggingface transformers

DL Translate A deep learning-based translation library built on Huggingface transformers and Facebook's mBART-Large 💻 GitHub Repository 📚 Documentat

Xing Han Lu 244 Dec 30, 2022
Code for Discovering Topics in Long-tailed Corpora with Causal Intervention.

Code for Discovering Topics in Long-tailed Corpora with Causal Intervention ACL2021 Findings Usage 0. Prepare environment Requirements: python==3.6 te

Xiaobao Wu 8 Dec 16, 2022
GPT-3 command line interaction

Writer_unblock Straight-forward command line interfacing with GPT-3. Finding yourself stuck at a conceptual stage? Spinning your wheels needlessly on

Seth Nuzum 6 Feb 10, 2022
A highly sophisticated sequence-to-sequence model for code generation

CoderX A proof-of-concept AI system by Graham Neubig (June 30, 2021). About CoderX CoderX is a retrieval-based code generation AI system reminiscent o

Graham Neubig 39 Aug 03, 2021
Smart discord chatbot integrated with Dialogflow to manage different classrooms and assist in teaching!

smart-school-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
Code for CVPR 2021 paper: Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning

Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning This is the PyTorch companion code for the paper: A

Amazon 69 Jan 03, 2023
ADCS - Automatic Defect Classification System (ADCS) for SSMC

Table of Contents Table of Contents ADCS Overview Summary Operator's Guide Demo System Design System Logic Training Mode Production System Flow Folder

Tam Zher Min 2 Jun 24, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources.

L3Cube-MahaCorpus L3Cube-MahaCorpus a Marathi monolingual data set scraped from different internet sources. We expand the existing Marathi monolingual

21 Dec 17, 2022
Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

Flexible interface for high performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra. What is Lightning Tran

Pytorch Lightning 581 Dec 21, 2022