Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

Overview

Part of Speech Tagging using Hidden Markov Model (HMM) POS Tagger and Brill Tagger

In this project, our aim is to tune, compare, and contrast the performance of the Hidden Markov Model (HMM) POS tagger and the Brill POS tagger. To perform this task, we will train these two taggers using data from a specific domain and test their accuracy in predicting tag sequences from data belonging to the same domain and data from a different domain.

How to Execute?

To run this project,

  1. Download the repository as a zip file.

  2. Extract the zip to get the project folder.

  3. Open Terminal in the directory you extracted the project folder to.

  4. Change directory to the project folder using:

    cd part-of-speech-taggers-main

  5. Install the required libraries, NLTK and scikit-learn using the following commands:

    pip3 install nltk

    pip3 install -U scikit-learn

  6. Now to execute the code, use any of the following commands (in the current directory):

HMM Tagger Predictions: python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

Brill Tagger Predictions: python3 src/main.py --tagger brill --train data/train.txt --test data/test.txt --output output/test_brill.txt

Description of the execution command

Our program src/main.py that takes four command-line options. The first is --tagger to indicate the tagger type, second is --train for the path to a training corpus, the third option is --test for the path to a test corpus, and the fourth option is --output for the output file.

The two possible values for --tagger option are:

  • hmm for the Hidden Markov Model POS Tagger

  • brill for the Brill POS Tagger

The training data can be found in data/train.txt, the in-domain test data can be found in data/test.txt, and the out-of-domain test data can be found in data/test_ood.txt.

The output file must be generated in the output/ directory.

So specifying these paths, one example of a possible execution command is:

python3 src/main.py --tagger hmm --train data/train.txt --test data/test.txt --output output/test_hmm.txt

References

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.hmm.HiddenMarkovModelTrainer-class.html

https://tedboy.github.io/nlps/generated/generated/nltk.tag.HiddenMarkovModelTagger.html

https://www.kite.com/python/docs/nltk.HiddenMarkovModelTagger.train

https://gist.github.com/blumonkey/007955ec2f67119e0909

https://docs.huihoo.com/nltk/0.9.5/api/nltk.tag.brill-module.html

https://www.nltk.org/api/nltk.tag.brill_trainer.html

https://www.nltk.org/_modules/nltk/tag/brill.html

https://www.geeksforgeeks.org/nlp-brill-tagger/

https://www.nltk.org/howto/probability.html

Owner
Chirag Daryani
Software Engineer | Data Science | Machine Learning | Python | Blog: https://chiragdaryani.medium.com/
Chirag Daryani
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

GPT Neo 🎉 1T or bust my dudes 🎉 An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. If you're just here t

EleutherAI 6.7k Dec 28, 2022
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
pkuseg多领域中文分词工具; The pkuseg toolkit for multi-domain Chinese word segmentation

pkuseg:一个多领域中文分词工具包 (English Version) pkuseg 是基于论文[Luo et. al, 2019]的工具包。其简单易用,支持细分领域分词,有效提升了分词准确度。 目录 主要亮点 编译和安装 各类分词工具包的性能对比 使用方式 论文引用 作者 常见问题及解答 主要

LancoPKU 6k Dec 29, 2022
[ICLR'19] Trellis Networks for Sequence Modeling

TrellisNet for Sequence Modeling This repository contains the experiments done in paper Trellis Networks for Sequence Modeling by Shaojie Bai, J. Zico

CMU Locus Lab 460 Oct 13, 2022
ADCS - Automatic Defect Classification System (ADCS) for SSMC

Table of Contents Table of Contents ADCS Overview Summary Operator's Guide Demo System Design System Logic Training Mode Production System Flow Folder

Tam Zher Min 2 Jun 24, 2022
Smart discord chatbot integrated with Dialogflow

academic-NLP-chatbot Smart discord chatbot integrated with Dialogflow to interact with students naturally and manage different classes in a school. De

Tom Huynh 5 Oct 24, 2022
NLP-Project - Used an API to scrape 2000 reddit posts, then used NLP analysis and created a classification model to mixed succcess

Project 3: Web APIs & NLP Problem Statement How do r/Libertarian and r/Neoliberal differ on Biden post-inaguration? The goal of the project is to see

Adam Muhammad Klesc 2 Mar 29, 2022
justCTF [*] 2020 challenges sources

justCTF [*] 2020 This repo contains sources for justCTF [*] 2020 challenges hosted by justCatTheFish. TLDR: Run a challenge with ./run.sh (requires Do

justCatTheFish 25 Dec 27, 2022
Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

Phil Wang 167 Dec 27, 2022
Predict the spans of toxic posts that were responsible for the toxic label of the posts

toxic-spans-detection An attempt at the SemEval 2021 Task 5: Toxic Spans Detection. The Toxic Spans Detection task of SemEval2021 required participant

Ilias Antonopoulos 3 Jul 24, 2022
NVDA, the free and open source Screen Reader for Microsoft Windows

NVDA NVDA (NonVisual Desktop Access) is a free, open source screen reader for Microsoft Windows. It is developed by NV Access in collaboration with a

NV Access 1.6k Jan 07, 2023
Simple translation demo showcasing our headliner package.

Headliner Demo This is a demo showcasing our Headliner package. In particular, we trained a simple seq2seq model on an English-German dataset. We didn

Axel Springer News Media & Tech GmbH & Co. KG - Ideas Engineering 16 Nov 24, 2022
End-to-end MLOps pipeline of a BERT model for emotion classification.

image source EmoBERT-MLOps The goal of this repository is to build an end-to-end MLOps pipeline based on the MLOps course from Made with ML, but this

Dimitre Oliveira 4 Nov 06, 2022
SDL: Synthetic Document Layout dataset

SDL is the project that synthesizes document images. It facilitates multiple-level labeling on document images and can generate in multiple languages.

Sơn Nguyễn 0 Oct 07, 2021
Datasets of Automatic Keyphrase Extraction

This repository contains 20 annotated datasets of Automatic Keyphrase Extraction made available by the research community. Following are the datasets and the original papers that proposed them. If yo

LIAAD - Laboratory of Artificial Intelligence and Decision Support 163 Dec 23, 2022
This is a project of data parallel that running on NLP tasks.

This is a project of data parallel that running on NLP tasks.

2 Dec 12, 2021
中文空间语义理解评测

中文空间语义理解评测 最新消息 2021-04-10 🚩 排行榜发布: Leaderboard 2021-04-05 基线系统发布: SpaCE2021-Baseline 2021-04-05 开放数据提交: 提交结果 2021-04-01 开放报名: 我要报名 2021-04-01 数据集 pa

40 Jan 04, 2023
MiCECo - Misskey Custom Emoji Counter

MiCECo Misskey Custom Emoji Counter Introduction This little script counts custo

7 Dec 25, 2022
An open collection of annotated voices in Japanese language

声庭 (Koniwa): オープンな日本語音声とアノテーションのコレクション Koniwa (声庭): An open collection of annotated voices in Japanese language 概要 Koniwa(声庭)は利用・修正・再配布が自由でオープンな音声とアノテ

Koniwa project 32 Dec 14, 2022