TFIDF-based QA system for AIO2 competition

Last update: Feb 19, 2022

Overview

AIO2 TF-IDF Baseline

This is a very simple question answering system, developed as a lightweight baseline for the AIO2 competition.

In the training stage, the model builds a sparse matrix of TF-IDF features from the questions in the training dataset. In the inference stage, it answers an unseen question by computing dot-product scores between the TF-IDF features of the input and those of every training question, then returning the answer of the most similar training question.

Therefore, in principle, the model cannot predict answers unseen in the training data.
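
The retrieval idea described above can be illustrated with a minimal sketch. This is not the code in this repository: the function names, the whitespace tokenizer, the smoothed IDF formula, and the toy English data are all assumptions made for illustration (the actual baseline works on Japanese questions and would need morphological analysis).

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization stands in for the morphological
    # analysis that Japanese text would require.
    return text.lower().split()


def vectorize(tokens, idf):
    """Map tokens to an L2-normalized sparse TF-IDF vector (a dict).

    Terms that never occurred in the training questions are dropped.
    """
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def fit_tfidf(questions):
    """Compute IDF weights and one sparse TF-IDF vector per training question."""
    tokenized = [tokenize(q) for q in questions]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: one common variant, not necessarily the one the repo uses.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return idf, vectors


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def predict(question, idf, vectors, answers):
    """Return the answer attached to the most similar training question."""
    query = vectorize(tokenize(question), idf)
    scores = [dot(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return answers[best]


# Toy training data (hypothetical, for illustration only).
train_questions = [
    "what is the capital of japan",
    "which planet is known as the red planet",
    "who wrote the tale of genji",
]
train_answers = ["Tokyo", "Mars", "Murasaki Shikibu"]

idf, vectors = fit_tfidf(train_questions)
print(predict("the capital city of japan is what", idf, vectors, train_answers))
```

Because `predict` only ever returns an element of `train_answers`, the sketch also makes the limitation concrete: a question whose correct answer never appeared in training cannot be answered correctly.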

Steps to experiment with the model

Install requirements

$ pip install -r requirements.txt

Train

$ python train.py \
--train_file <data dir>/aio_02_train.jsonl \
--output_dir model \
--pos_list 名詞 \
--stop_words でしょ う \
--max_features 10000

Predict

$ python predict.py \
--model_dir model \
--test_file <data dir>/aio_02_dev_unlabeled_v1.0.jsonl \
--prediction_file <output dir>/predictions.jsonl

Building Docker image

$ docker build -t aio2-tfidf-baseline .

Test locally:

$ docker run --rm -v ":/app/input" -v ":/app/output" aio2-tfidf-baseline bash ./submission.sh input/aio_02_dev_unlabeled_v1.0.jsonl output/predictions.jsonl

Save the Docker image to a file:

$ docker save aio2-tfidf-baseline | gzip > aio2-tfidf-baseline.tar.gz

License

The code in this repository is open-sourced under the MIT License.

Owner

Masatoshi Suzuki
