NLP

T5 Project proposal

Topic Modeling and Clustering of News-Articles-and-Essays

Students:

Nasser Alshehri
Abdullah Bushnag
Abdulrhman Alqurashi

OVERVIEW

News comes in different formats, types, and categories. Here we use topic modeling and clustering to determine what each article is about, first based on its content and then based only on its title.

The process would be: Load the data. Keep what we need from the data. Clean the text (e.g., remove stopwords).

Build the bag of words for all documents. Build the bag of words for each document.
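The cleaning and bag-of-words steps can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the stopword list, the helper names `clean_text` and `bag_of_words`, and the two sample documents are all assumptions made for the example. In the project itself, the nltk stopword corpus listed under TOOLS would be the natural replacement for the tiny list used here.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); nltk's English stopword
# corpus would be the natural replacement in the real project.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "are", "for"}


def clean_text(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens):
    """Count how often each token occurs."""
    return Counter(tokens)


# Made-up sample documents standing in for the 'content' column.
documents = [
    "The election results are in and the markets react",
    "Markets fall on fears of a trade war",
]

cleaned = [clean_text(doc) for doc in documents]

# One bag of words per document.
doc_bags = [bag_of_words(tokens) for tokens in cleaned]

# One bag of words for all documents, obtained by summing the per-document bags.
corpus_bag = sum(doc_bags, Counter())

print(corpus_bag.most_common(3))
```

Summing the per-document counters keeps the two views consistent: the corpus-level count of a word is always the total of its per-document counts.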

Vectorize the data. Run the LDA model. Run the model on all data and save the output to a DataFrame.

Run the clustering algorithm. Save the data to CSV. Make the charts.
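The vectorize, LDA, and clustering steps above might look roughly like the sketch below. It rests on several assumptions: it uses scikit-learn's `LatentDirichletAllocation` rather than gensim (both are listed under TOOLS, and the proposal does not say which one runs the LDA model), it uses k-means as the clustering algorithm (the proposal does not name one), and the six sample documents, the choice of two topics, and the two clusters are made up for illustration. Charting is omitted.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sample documents standing in for the cleaned 'content' column.
docs = [
    "stock markets fall as investors fear a trade war",
    "investors cheer as stock markets rally after trade deal",
    "central bank raises interest rates to fight inflation",
    "the team wins the championship after a dramatic final game",
    "star player injured before the championship game",
    "coach praises team after winning the final",
]

# Vectorize: turn each document into a row of word counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Run LDA: each row of doc_topics is a document's topic distribution.
# n_components=2 is an arbitrary choice for this small example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Cluster the documents in topic space (k-means is an assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(doc_topics)

# Save the output to a DataFrame.
results = pd.DataFrame(
    {
        "content": docs,
        "dominant_topic": doc_topics.argmax(axis=1),
        "cluster": clusters,
    }
)

# Serialize to CSV text; passing a file path to to_csv would write to disk.
csv_text = results.to_csv(index=False)
print(results)
```

Clustering on the topic distributions rather than on the raw word counts keeps the feature space small, which is one plausible way to combine the two steps; clustering the count matrix directly would be an equally valid reading of the proposal.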

DATA

The data is acquired from: https://components.one/datasets/all-the-news-articles-dataset

The raw data contains 12 features: id, title, author, date, content, year, month, publication, category, digital, section, url.

The features we are using are only the 'title' and 'content'.

The data we are not interested in will be dropped/ignored.

The 'title' is the headline of the news article or essay. The 'content' is the body of the news article or essay.
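Loading the data and keeping only the two features of interest could be done along these lines with Pandas. This is a sketch under stated assumptions: the in-memory CSV below is made-up sample data with only a subset of the 12 columns, and dropping rows with a missing title or content is an extra cleaning step not stated in the proposal. With the real dataset, the path to the downloaded file would replace `sample_csv`.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV file (made-up rows, subset of the columns).
sample_csv = io.StringIO(
    "id,title,author,content,publication\n"
    "1,Example headline,Jane Doe,Example body text.,Example Times\n"
    "2,Another headline,John Roe,,Example Post\n"
)

# Read only the 'title' and 'content' columns; the rest are ignored.
df = pd.read_csv(sample_csv, usecols=["title", "content"])

# Assumption: rows missing either field are not useful for topic modeling.
df = df.dropna(subset=["title", "content"]).reset_index(drop=True)

print(df)
```

Using `usecols` at read time avoids loading the unused columns into memory at all, which matters for a dataset with long article bodies.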

TOOLS

Pandas, NumPy, scikit-learn, Matplotlib, Seaborn, NLTK, gensim
