NLP topic mdel LDA - Gathered from New York Times website

Overview

NLP-topic-mdel-LDA

1. Dataset

the dataset were gathered from New York Times website, Energy section. (nytimes.com). the Website offers the journals by categories, and I used the category energy. For the text mining, I had to check the structure of website. The websiste basically using HTML base, and had four big frames. To create the crawler, I used selenium chrome web driver and python. For the first put the url and access address. In this step, I already put the url which is energy section so that I can avoid additional step. The journals I wanted to crawl is only for renewable energy, so I used send_keys function from BeautifulSoup. Then make the sorting option as newest. This sorting option was found as Xpath from chrome instpection. Then use the selenium to scroll down and at the end download the date, title and headline and save as csv file.

This dataset has date, title and headline of the journals related renewable energy from Dec 11 2020 to Feb 26, 2021, and it has total 110 rows without missing values. The ‘news’ column is combination of ‘title’ column and ‘headline’ column. for the topic modeling, mostly the ‘news’ column has been used.

2. text pre-processing

  1. special characters, numbers and punctuation marks are removed. For this step, python replace function has been applied. Every character excludes English al-phabet (a-zA-Z) is replaced to blank. (“ “).

  2. Second step is removing the short length words. In this project, the words have less than 3 alphabet character are assumed as not useful information. For example, “if”, “it”, “of”, “at”. For this step, for loop and if statement has been applied.

  3. convert capital letters to lower letters. By this steps, the total number of words can be re-duced. For this step, apply function has been applied

3. LDA

LDA is an unsupervised machine learning model that find topics from the literature and one of the representative algorithms of topic modeling. in this code, gensim library has been applied for the model.

4. Visualization

For the visualization of LDA model, pyLDAvis package has been applied. The distance of each circle shows how different each topic is from each other. If the two circles overlapped, it indicates that these two topics are similar topics

By clicking each circle, each words term frequency is shown as bar chart representation. The blue bar indicates overall term frequency and the red bar indicates estimated term frequency within the selected topic, and the bar chart is sorted by the red line LDA is an unsupervised machine learning model that find topics from the literature and one of the representative algorithms of topic modeling

image

A sentence aligner for comparable corpora

About Yalign is a tool for extracting parallel sentences from comparable corpora. Statistical Machine Translation relies on parallel corpora (eg.. eur

Machinalis 128 Aug 24, 2022
Almost State-of-the-art Text Generation library

Ps: we are adding transformer model soon Text Gen 🐐 Almost State-of-the-art Text Generation library Text gen is a python library that allow you build

Emeka boris ama 63 Jun 24, 2022
Two-stage text summarization with BERT and BART

Two-Stage Text Summarization Description We experiment with a 2-stage summarization model on CNN/DailyMail dataset that combines the ability to filter

Yukai Yang (Alexis) 6 Oct 22, 2022
A Python 3.6+ package to run .many files, where many programs written in many languages may exist in one file.

RunMany Intro | Installation | VSCode Extension | Usage | Syntax | Settings | About A tool to run many programs written in many languages from one fil

6 May 22, 2022
To be a next-generation DL-based phenotype prediction from genome mutations.

Sequence -----------+-- 3D_structure -- 3D_module --+ +-- ? | |

Eric Alcaide 18 Jan 11, 2022
Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Ye

Yi-Chang Chen 5 Dec 15, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
Huggingface Transformers + Adapters = ❤️

adapter-transformers A friendly fork of HuggingFace's Transformers, adding Adapters to PyTorch language models adapter-transformers is an extension of

AdapterHub 1.2k Jan 09, 2023
Model parallel transformers in JAX and Haiku

Table of contents Mesh Transformer JAX Updates Pretrained Models GPT-J-6B Links Acknowledgments License Model Details Zero-Shot Evaluations Architectu

Ben Wang 4.9k Jan 04, 2023
Textpipe: clean and extract metadata from text

textpipe: clean and extract metadata from text textpipe is a Python package for converting raw text in to clean, readable text and extracting metadata

Textpipe 298 Nov 21, 2022
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023
Community and sentiment analysis based on tweets

The project has set itself the goal of analyzing the thoughts and interaction of Italian users through the social posts expressed through the Twitter platform on the day of the entry into force of th

3 Nov 17, 2022
Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

NeX: Real-time View Synthesis with Neural Basis Expansion Project Page | Video | Paper | COLAB | Shiny Dataset We present NeX, a new approach to novel

537 Jan 05, 2023
A simple visual front end to the Maya UE4 RBF plugin delivered with MetaHumans

poseWrangler Overview PoseWrangler is a simple UI to create and edit pose-driven relationships in Maya using the MayaUE4RBF plugin. This plugin is dis

Christopher Evans 105 Dec 18, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
Stand-alone language identification system

langid.py readme Introduction langid.py is a standalone Language Identification (LangID) tool. The design principles are as follows: Fast Pre-trained

2k Jan 04, 2023
Repository for the paper: VoiceMe: Personalized voice generation in TTS

🗣 VoiceMe: Personalized voice generation in TTS Abstract Novel text-to-speech systems can generate entirely new voices that were not seen during trai

Pol van Rijn 80 Dec 29, 2022
This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection"

Splinter This repository contains the code, models and datasets discussed in our paper "Few-Shot Question Answering by Pretraining Span Selection", to

Ori Ram 88 Dec 31, 2022
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022