topic modeling on unstructured data in Space news articles retrieved from the Guardian (UK) newspaper using API

Overview

NLP Space News Topic Modeling

Photos by nasa.gov (1, 2, 3, 4, 5) and extremetech.com

Binder Open In Colab nbviewer pre-commit CI CodeQL License: MIT OpenSource Code style: black prs-welcome pyup

Table of Contents

  1. Project Idea
  2. Data acquisition
  3. Analysis
  4. Usage
  5. Project Organization

Project Idea

This project aims to learn topics published in Space news from the Guardian (UK) news publication1.

1: articles were also retrieved from the blog Space.com (web scraping), the New York Times (space news from the science section) and from the Hubble Telescope news archive, but these data sources were not used in analysis

Data acquisition

Primary data source

News articles are retrieved using the official API provided by the Guardian.

Supplementary data sources

Data is also acquired from articles published by the Hubble Telescope, the New York Times (US) and blog publication Space.com

Although these articles were acquired, they were not used in analysis.

Data file creation

  1. Use 1_get_list_of_urls.ipynb
    • programmatically retrieves urls from API or archive of publication
    • retrieves metadata such as date and time, section, sub-section, headline/abstract/short summary, etc.
  2. Use 2_scrape_urls.ipynb
    • scrapes news article text from publication url
  3. Use 3_merge_scraped_and_filter.ipynb
    • merge metadata (1_get_list_of_urls.ipynb) with scraped article text (2_scrape_urls.ipynb)

Analysis

Analysis will be performed using an un-supervised learning model. Details are included in the 8_gensim_coherence_nlp_trials_v3.ipynb notebook in the root directory.

Usage

  1. Clone this repository
    $ git clone
  2. Create Python virtual environment, install packages and launch interactive Python platform
    $ make build
  3. Run notebooks in the following order
    • 3_merge_scraped_and_filter.ipynb (view) (covers data from the Hubble news feed, New York Times and Space.com)
      • merge multiple files of articles text data retrieved from news publications API or archive
      • filter out articles of less than 500 words
      • export to *.csv file for use in unsupervised machine learning models
    • 8_gensim_coherence_nlp_trials_v3.ipynb (view) (does not cover data from the Hubble news feed, New York Times and Space.com)
      • experiments in selecting number of topics using
        • coherence score from built-in coherence model to score Gensim's NMF
        • sklearn's implementation of TFIDF + NMF, using best number of topics found using Gensim's NMF
      • manually reading articles that NMF associates with each topic
    • 9_nlp_workflow.ipynb (view)
      • code-only version of 9_gensim_coherence_nlp_trials_v3.ipynb, with necessary considerations for deployment of topic model

Project Organization

├── .pre-commit-config.yaml       <- configuration file for pre-commit hooks
├── .github
│   ├── workflows
│       └── integrate.yml         <- configuration file for Github Actions
├── LICENSE
├── environment.yml               <- configuration file to create environment to run project on Binder
├── Makefile                      <- Makefile with commands like `make lint` or `make build`
├── README.md                     <- The top-level README for developers using this project.
├── app
│   ├── data                      <- data exported from training topic modeler, for use with API
|   └── tests                     <- Source code for use in API tests
|       ├── test-logs             <- Reports from running unit tests on API
|       └── testing_utils         <- Source code for use in unit tests
|           └── *.py              <- Scripts to use in testing API routes
|       ├── __init__.py           <- Allows Python modules to be imported from testing_utils
|       └── test_api.py           <- Unit tests for API
├── api.py                        <- Defines API routes
├── pytest.ini                    <- Test configuration
├── requirements.txt              <- Packages required to run and test API
├── s*,t*.py                      <- Scripts to use in defining API routes
├── data
│   ├── raw                       <- raw data retrieved from news publication
|   └── processed                 <- merged and filtered data
├── executed-notebooks            <- Notebooks with output.
├── *.ipynb                       <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                    and a short `-` delimited description
├── requirements.txt              <- packages required to execute all Jupyter notebooks interactively (not from CI)
├── setup.py                      <- makes project pip installable (pip install -e .) so `src` can be imported
├── src                           <- Source code for use in this project.
│   ├── __init__.py               <- Makes src a Python module
│   └── *.py                      <- Scripts to use in analysis for pre-processing, training, etc.
├── papermill_runner.py           <- Python functions that execute system shell commands.
└── tox.ini                       <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Owner
edesz
edesz
PyTorch source code of NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models"

This repository contains source code for NAACL 2019 paper "An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models" (P

Alexandra Chronopoulou 89 Aug 12, 2022
Flaxformer: transformer architectures in JAX/Flax

Flaxformer: transformer architectures in JAX/Flax Flaxformer is a transformer library for primarily NLP and multimodal research at Google. It is used

Google 114 Dec 29, 2022
edge-SR: Super-Resolution For The Masses

edge-SR: Super Resolution For The Masses Citation Pablo Navarrete Michelini, Yunhua Lu and Xingqun Jiang. "edge-SR: Super-Resolution For The Masses",

Pablo 40 Nov 10, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 02, 2022
Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Chinese NER(Named Entity Recognition) using BERT(Softmax, CRF, Span)

Weitang Liu 1.6k Jan 03, 2023
Unlimited Call - Text Bombing Tool

FastBomber Unlimited Call - Text Bombing Tool Installation On Termux

Aryan 6 Nov 10, 2022
Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

RoBERTaABSA This repo contains the code for NAACL 2021 paper titled Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoB

106 Nov 28, 2022
American Sign Language (ASL) to Text Converter

Signterpreter American Sign Language (ASL) to Text Converter Recommendations Although there is grayscale and gaussian blur, we recommend that you use

0 Feb 20, 2022
Large-scale Knowledge Graph Construction with Prompting

Large-scale Knowledge Graph Construction with Prompting across tasks (predictive and generative), and modalities (language, image, vision + language, etc.)

ZJUNLP 161 Dec 28, 2022
Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets

Easy Language Model Pretraining leveraging Huggingface's Transformers and Datasets What is LASSL • How to Use What is LASSL LASSL은 LAnguage Semi-Super

LASSL: LAnguage Self-Supervised Learning 116 Dec 27, 2022
Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration

Phrase-BERT: Improved Phrase Embeddings from BERT with an Application to Corpus Exploration This is the official repository for the EMNLP 2021 long pa

70 Dec 11, 2022
Shared code for training sentence embeddings with Flax / JAX

flax-sentence-embeddings This repository will be used to share code for the Flax / JAX community event to train sentence embeddings on 1B+ training pa

Nils Reimers 23 Dec 30, 2022
Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding

Bethge Lab 61 Dec 21, 2022
मराठी भाषा वाचविण्याचा एक प्रयास. इंग्रजी ते मराठीचा शब्दकोश. An attempt to preserve the Marathi language. A lightweight and ad free English to Marathi thesaurus.

For English, scroll down मराठी शब्द मराठी भाषा वाचवण्यासाठी मी हा ओपन सोर्स प्रोजेक्ट सुरू केला आहे. माझ्या मते, आपली भाषा हळूहळू आणि कोणाचाही लक्षात

मुक्त स्त्रोत 20 Oct 11, 2022
Framework for fine-tuning pretrained transformers for Named-Entity Recognition (NER) tasks

NERDA Not only is NERDA a mesmerizing muppet-like character. NERDA is also a python package, that offers a slick easy-to-use interface for fine-tuning

Ekstra Bladet 141 Dec 30, 2022
[AAAI 21] Curriculum Labeling: Revisiting Pseudo-Labeling for Semi-Supervised Learning

◥ Curriculum Labeling ◣ Revisiting Pseudo-Labeling for Semi-Supervised Learning Paola Cascante-Bonilla, Fuwen Tan, Yanjun Qi, Vicente Ordonez. In the

UVA Computer Vision 113 Dec 15, 2022
DeepSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.

(简体中文|English) Quick Start | Documents | Models List PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks i

5.6k Jan 03, 2023
⚖️ A Statutory Article Retrieval Dataset in French.

A Statutory Article Retrieval Dataset in French This repository contains the Belgian Statutory Article Retrieval Dataset (BSARD), as well as the code

Maastricht Law & Tech Lab 19 Nov 17, 2022
Twitter-Sentiment-Analysis - Twitter sentiment analysis for india's top online retailers(2019 to 2022)

Twitter-Sentiment-Analysis Twitter sentiment analysis for india's top online retailers(2019 to 2022) Project Overview : Sentiment Analysis helps us to

Balaji R 1 Jan 01, 2022
Turn clang-tidy warnings and fixes to comments in your pull request

clang-tidy pull request comments A GitHub Action to post clang-tidy warnings and suggestions as review comments on your pull request. What platisd/cla

Dimitris Platis 30 Dec 13, 2022