The aim of this task is to predict someone's English proficiency based on a text input.

Last update: Dec 13, 2021

Overview

English_proficiency_prediction_NLP


This project uses the NICT JLE Corpus, available at https://alaginrc.nict.go.jp/nict_jle/index_E.html

The corpus consists of transcripts of audio-recorded speech samples from 1,281 participants (1.2 million words, 300 hours in total) taking an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.

The goal is to build a machine learning model that predicts each participant's SST score from their transcript.

Steps:

1 - Pre-process the dataset: extract the participant transcript (all tags). Inside the participant transcript, you can remove all other tags and keep only the English words.

2 - Process the dataset: extract features with the Bag of Words (BoW) technique.

3 - Train a classifier to predict the SST score.

4 - Compute the accuracy of your system (the fraction of participants classified correctly) and plot the confusion matrix.

5 - Try to improve your system (for example you can try to use GloVe instead of BoW).
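
Steps 1 to 4 can be sketched end to end as follows. This is a minimal illustration and not the project's actual implementation. It assumes that participant utterances are wrapped in <B>...</B> tags (check the corpus documentation for the real tag names), it uses a small hand-written multinomial Naive Bayes as a stand-in classifier, and the transcripts and scores in the example are invented toy data. Only the Python standard library is used, so the confusion matrix is computed but not plotted.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: participant utterances are wrapped in <B>...</B>.
# Verify the real tag names against the corpus documentation.
PARTICIPANT_RE = re.compile(r"<B>(.*?)</B>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def extract_words(raw):
    """Step 1: keep participant utterances, strip nested tags, return lowercase words."""
    utterances = PARTICIPANT_RE.findall(raw)
    text = " ".join(TAG_RE.sub(" ", u) for u in utterances)
    return WORD_RE.findall(text.lower())


def bag_of_words(words):
    """Step 2: a Bag of Words is the count of each word, ignoring word order."""
    return Counter(words)


class NaiveBayes:
    """Step 3: multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, bags, labels):
        self.vocab = {w for bag in bags for w in bag}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for bag, label in zip(bags, labels):
            self.word_counts[label].update(bag)
        self.totals = {c: sum(wc.values()) for c, wc in self.word_counts.items()}
        return self

    def predict(self, bag):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)

        def log_posterior(c):
            score = math.log(self.class_counts[c] / n_docs)
            for word, count in bag.items():
                if word in self.vocab:  # words unseen in training are ignored
                    p = (self.word_counts[c][word] + 1) / (self.totals[c] + vocab_size)
                    score += count * math.log(p)
            return score

        return max(self.class_counts, key=log_posterior)


def evaluate(y_true, y_pred, labels):
    """Step 4: accuracy and confusion matrix (rows: true score, columns: predicted)."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    correct = sum(matrix[i][i] for i in range(len(labels)))
    return correct / len(y_true), matrix


# Invented toy transcripts with invented SST scores, for illustration only.
train = [
    ("<A>Hello.</A><B>I <F>er</F> like dog</B>", 2),
    ("<A>Hi.</A><B>I like cat</B>", 2),
    ("<A>Hello.</A><B>I would rather discuss contemporary literature</B>", 8),
    ("<A>Hi.</A><B>Contemporary politics would fascinate me</B>", 8),
]
test = [("<B>I like dog and cat</B>", 2), ("<B>I would discuss politics</B>", 8)]

model = NaiveBayes().fit(
    [bag_of_words(extract_words(text)) for text, _ in train],
    [score for _, score in train],
)
y_true = [score for _, score in test]
y_pred = [model.predict(bag_of_words(extract_words(text))) for text, _ in test]
accuracy, matrix = evaluate(y_true, y_pred, labels=[2, 8])
print(accuracy, matrix)
```

On the real corpus the labels would be the nine SST levels, and the data should be split into training and test participants before computing the accuracy, so that the score reflects transcripts the classifier has not seen.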
