NLP Space News Topic Modeling
Project Idea
This project aims to learn the topics covered in space news articles from the Guardian (UK) news publication [1].

[1]: Articles were also retrieved from the blog Space.com (via web scraping), the New York Times (space news from the science section) and the Hubble Telescope news archive, but these data sources were not used in the analysis.
Data acquisition
Primary data source
News articles are retrieved using the official API provided by the Guardian.
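A minimal sketch of such a request, assuming the `requests` package (the query parameters shown are illustrative, and an API key from the Guardian Open Platform is required):

```python
import requests

# Illustrative sketch: query the Guardian content API for space news.
# Replace GUARDIAN_API_KEY with a key from https://open-platform.theguardian.com/
GUARDIAN_API_KEY = "your-api-key"

response = requests.get(
    "https://content.guardianapis.com/search",
    params={
        "section": "science",
        "q": "space",
        "page-size": 50,
        "api-key": GUARDIAN_API_KEY,
    },
)
response.raise_for_status()

# The Guardian wraps results in a "response" object; each result carries
# metadata such as publication date, section and headline.
for article in response.json()["response"]["results"]:
    print(article["webPublicationDate"], article["webTitle"], article["webUrl"])
```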
Supplementary data sources
Data is also acquired from articles published in the Hubble Telescope news archive, by the New York Times (US) and by the blog Space.com.
Although these articles were acquired, they were not used in the analysis.
Data file creation
- Use `1_get_list_of_urls.ipynb`
  - programmatically retrieves URLs from the API or archive of each publication
  - retrieves metadata such as date and time, section, sub-section, headline/abstract/short summary, etc.
- Use `2_scrape_urls.ipynb`
  - scrapes news article text from each publication URL
- Use `3_merge_scraped_and_filter.ipynb`
  - merges metadata (from `1_get_list_of_urls.ipynb`) with scraped article text (from `2_scrape_urls.ipynb`); a minimal sketch of this step follows the list
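A sketch of the merge-and-filter step using `pandas`; the file paths and column names (`url`, `text`) are illustrative assumptions, not the notebooks' actual identifiers:

```python
import pandas as pd

# Hypothetical file paths and column names, for illustration only.
metadata = pd.read_csv("data/raw/guardian_urls.csv")      # output of 1_get_list_of_urls.ipynb
articles = pd.read_csv("data/raw/guardian_articles.csv")  # output of 2_scrape_urls.ipynb

# Join article text onto its metadata using the shared URL column.
merged = metadata.merge(articles, on="url", how="inner")

# Keep only articles of at least 500 words.
merged = merged[merged["text"].str.split().str.len() >= 500]

merged.to_csv("data/processed/guardian_processed.csv", index=False)
```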
Analysis
Analysis is performed using an unsupervised learning model. Details are included in the `8_gensim_coherence_nlp_trials_v3.ipynb` notebook in the root directory.
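A minimal sketch of the topic-count selection idea, using Gensim's NMF and built-in coherence model (the toy corpus and search range are illustrative; the notebook's preprocessing and experiments are more extensive):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models.nmf import Nmf

# Toy tokenized corpus; replace with the project's preprocessed articles.
docs = [
    ["nasa", "launch", "mars", "rover", "landing"],
    ["spacex", "rocket", "launch", "booster", "test"],
    ["hubble", "telescope", "galaxy", "image", "stars"],
    ["astronauts", "station", "spacewalk", "orbit", "nasa"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Score a range of candidate topic counts by coherence and keep the best.
scores = {}
for num_topics in range(2, 5):  # the notebook sweeps a much wider range
    nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=42)
    cm = CoherenceModel(model=nmf, texts=docs, dictionary=dictionary, coherence="c_v")
    scores[num_topics] = cm.get_coherence()

best_num_topics = max(scores, key=scores.get)
print(scores, best_num_topics)
```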
Usage
- Clone this repository

  $ git clone

- Create a Python virtual environment, install packages and launch an interactive Python platform

  $ make build
- Run notebooks in the following order
  - `3_merge_scraped_and_filter.ipynb` (view) (covers data from the Hubble news feed, New York Times and Space.com)
    - merges multiple files of article text data retrieved from each news publication's API or archive
    - filters out articles of fewer than 500 words
    - exports to a `*.csv` file for use in unsupervised machine learning models
  - `8_gensim_coherence_nlp_trials_v3.ipynb` (view) (does not cover data from the Hubble news feed, New York Times and Space.com)
    - experiments in selecting the number of topics using
      - the coherence score from the built-in coherence model to score Gensim's NMF
      - `sklearn`'s implementation of TF-IDF + NMF, using the best number of topics found with Gensim's NMF (see the sketch after this list)
      - manually reading articles that NMF associates with each topic
  - `9_nlp_workflow.ipynb` (view)
    - code-only version of `8_gensim_coherence_nlp_trials_v3.ipynb`, with necessary considerations for deployment of the topic model
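The `sklearn` stage referenced above might look roughly like this sketch (the vectorizer settings and toy corpus are illustrative, not the notebook's tuned configuration):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy article strings; replace with the filtered corpus from 3_merge_scraped_and_filter.ipynb.
texts = [
    "NASA launches a new rover to Mars",
    "SpaceX tests its reusable rocket",
    "Hubble captures an image of a distant galaxy",
]

# TF-IDF features (max_df/min_df tuning omitted for brevity).
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

# Fit NMF with the number of topics selected via Gensim's coherence experiments.
best_num_topics = 2  # placeholder value, for illustration only
nmf = NMF(n_components=best_num_topics, random_state=42)
doc_topic = nmf.fit_transform(tfidf)

# Show the top words per topic, to support manual reading of each topic.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(nmf.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top)}")
```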
Project Organization
├── .pre-commit-config.yaml <- configuration file for pre-commit hooks
├── .github
│   └── workflows
│       └── integrate.yml   <- configuration file for GitHub Actions
├── LICENSE
├── environment.yml <- configuration file to create environment to run project on Binder
├── Makefile <- Makefile with commands like `make lint` or `make build`
├── README.md <- The top-level README for developers using this project.
├── app
│   ├── data                <- data exported from training topic modeler, for use with API
│   ├── tests               <- Source code for use in API tests
│   │   ├── test-logs       <- Reports from running unit tests on API
│   │   ├── testing_utils   <- Source code for use in unit tests
│   │   │   └── *.py        <- Scripts to use in testing API routes
│   │   ├── __init__.py     <- Allows Python modules to be imported from testing_utils
│   │   └── test_api.py     <- Unit tests for API
│   ├── api.py              <- Defines API routes
│   ├── pytest.ini          <- Test configuration
│   ├── requirements.txt    <- Packages required to run and test API
│   └── s*,t*.py            <- Scripts to use in defining API routes
├── data
│   ├── raw                 <- raw data retrieved from news publication
│   └── processed           <- merged and filtered data
├── executed-notebooks <- Notebooks with output.
├── *.ipynb <- Jupyter notebooks. Naming convention is a number (for ordering),
│ and a short `-` delimited description
├── requirements.txt <- packages required to execute all Jupyter notebooks interactively (not from CI)
├── setup.py <- makes project pip installable (pip install -e .) so `src` can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ └── *.py <- Scripts to use in analysis for pre-processing, training, etc.
├── papermill_runner.py <- Python functions that execute system shell commands.
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
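The tree above lists `papermill_runner.py`, which executes the notebooks non-interactively. A minimal sketch of that idea using `papermill` directly (paths are illustrative, and this is not the script's actual code):

```python
import papermill as pm

# Execute each analysis notebook in order, saving executed copies
# to executed-notebooks/ (paths are illustrative assumptions).
for notebook in [
    "3_merge_scraped_and_filter.ipynb",
    "8_gensim_coherence_nlp_trials_v3.ipynb",
    "9_nlp_workflow.ipynb",
]:
    pm.execute_notebook(
        input_path=notebook,
        output_path=f"executed-notebooks/{notebook}",
        parameters={},  # override notebook parameters here if needed
    )
```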
Project based on the cookiecutter data science project template. #cookiecutterdatascience