DRIFT is a tool for Diachronic Analysis of Scientific Literature.



About

DRIFT is a tool for Diachronic Analysis of Scientific Literature. The application offers user-friendly and customizable utilities for two modes: Training and Analysis. Currently, it supports customizable training of diachronic word embeddings with the TWEC model, along with a variety of analysis methods to monitor trends and patterns of development in scientific literature:

  1. Word Cloud
  2. Productivity/Frequency Plot
  3. Acceleration Plot
  4. Semantic Drift
  5. Tracking Clusters
  6. Acceleration Heatmap
  7. Track Trends with Similarity
  8. Keyword Visualisation
  9. LDA Topic Modelling

NOTE: The online demo is hosted on Streamlit sharing. It is a single-instance, single-process deployment that is publicly accessible to all visitors (avoid sharing sensitive information on the demo). Hence, we highly recommend running your own local deployment of the application for a seamless and private experience. Alternatively, you can fork this repository and host it with Streamlit sharing.

We would love to hear about any issues you find in this repository. Please submit an issue for any query, or contact us here. If you use this application in your work, please cite this repository and the paper here.


Setup

Clone the repository:

git clone https://github.com/rajaswa/DRIFT.git
cd DRIFT

Install the requirements:

make install_req

Data

The dataset we used for our demo and for the analysis in the paper was scraped using the arXiv API (see script). We scraped papers from the cs.CL subject. This dataset is available here.

The user can upload their own dataset to the DRIFT application. The unprocessed dataset should be present in the following format (as a JSON file):

{
   <year_1>:[
      <paper_1>,
      <paper_2>,
      ...
   ],
   <year_2>:[
      <paper_1>,
      <paper_2>,
      ...
   ],
   ...,
   <year_m>:[
      <paper_1>,
      <paper_2>,
      ...
   ]
}

where year_x is a string (e.g., "1998"), and paper_x is a dictionary. An example is given below:

{
   "url":"http://arxiv.org/abs/cs/9809020v1",
   "date":"1998-09-15 23:49:32+00:00",
   "title":"Linear Segmentation and Segment Significance",
   "authors":[
      "Min-Yen Kan",
      "Judith L. Klavans",
      "Kathleen R. McKeown"
   ],
   "abstract":"We present a new method for discovering a segmental discourse structure of a\ndocument while categorizing segment function. We demonstrate how retrieval of\nnoun phrases and pronominal forms, along with a zero-sum weighting scheme,\ndetermines topicalized segmentation. Futhermore, we use term distribution to\naid in identifying the role that the segment performs in the document. Finally,\nwe present results of evaluation in terms of precision and recall which surpass\nearlier approaches.",
   "journal ref":"Proceedings of 6th International Workshop of Very Large Corpora\n  (WVLC-6), Montreal, Quebec, Canada: Aug. 1998. pp. 197-205",
   "category":"cs.CL"
}

The only required key is "abstract", which holds the raw text. The user can name this key differently; see the Train Mode section below for more details.
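
For reference, here is a minimal sketch of loading and inspecting a dataset in this format (the file name papers.json is hypothetical):

import json

# Load the unprocessed dataset (file name is illustrative)
with open("papers.json") as f:
    data = json.load(f)

# Keys are year strings, values are lists of paper dictionaries
for year, papers in sorted(data.items()):
    print(year, len(papers), "papers")

# The raw text lives under the "abstract" key (or a custom key)
first_year = sorted(data)[0]
print(data[first_year][0]["abstract"][:100])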


Usage

Launch the app

To launch the app, run the following command from the terminal:

streamlit run app.py

Train Mode

Preprocessing

The preprocessing stage takes a JSON file structured as shown in the Data section, along with the key under which the raw text is stored (e.g., "abstract"). During preprocessing, year-wise text files are created in the desired directory, and the following steps are applied to the text:

  • All HTML tags are removed from the text.
  • Contractions are expanded (e.g., "don't" becomes "do not").
  • Punctuation, non-ASCII characters, and stopwords are removed.
  • All verbs are lemmatized.

After this, each processed text is stored in its respective year file, one document per line. All the processed data is also written to a single file, compass.txt.
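
The app performs these steps internally; the following is a rough, self-contained sketch of such a pipeline, assuming the contractions and nltk packages (the helper choices are illustrative, not the app's exact implementation):

import json
import os
import re

import contractions                       # pip install contractions
from nltk.corpus import stopwords         # assumes the nltk data is downloaded
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)           # remove HTML tags
    text = contractions.fix(text)                  # "don't" -> "do not"
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # drop punctuation / non-ASCII
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t, pos="v") for t in tokens)  # lemmatize verbs

with open("papers.json") as f:      # hypothetical input file
    data = json.load(f)

os.makedirs("data", exist_ok=True)
with open(os.path.join("data", "compass.txt"), "w") as compass:
    for year, papers in data.items():
        with open(os.path.join("data", f"{year}.txt"), "w") as year_file:
            for paper in papers:
                line = preprocess(paper["abstract"])
                year_file.write(line + "\n")   # one processed document per line
                compass.write(line + "\n")     # everything also goes to compass.txt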

Training


The training mode uses the path where the processed text files are stored, and trains the TWEC model on the given text. The TWEC model trains a Word2Vec model on compass.txt, and the respective time slices are then trained against this model to obtain aligned word vectors. In the sidebar, we provide several options, such as whether to use skip-gram instead of CBOW, the number of dynamic and static training iterations, negative sampling, etc. After training, we store the models at the specified path; they are used later in the analysis.
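
Under the hood, training with the twec library looks roughly like the following sketch (the parameter values and file paths are illustrative; the sidebar exposes the full set of options):

from twec.twec import TWEC

# The compass (a Word2Vec model over compass.txt) is trained first;
# each time slice is then trained against it to produce aligned embeddings.
aligner = TWEC(size=100, sg=1, siter=10, diter=10, ns=10, workers=4)
aligner.train_compass("data/compass.txt", overwrite=False)

# One aligned model per year file
model_1998 = aligner.train_slice("data/1998.txt", save=True)
model_1999 = aligner.train_slice("data/1999.txt", save=True)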

Analysis Mode

Word Cloud


A word cloud, or tag cloud, is a textual data visualization that lets the viewer see at a glance which words occur most frequently in a given body of text. Word clouds are typically used as a tool for processing, analyzing, and disseminating qualitative sentiment data.
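
A minimal sketch of generating one with the wordcloud package (the input path is illustrative):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

with open("data/1998.txt") as f:    # a processed year file
    text = f.read()

wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()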


Productivity/Frequency Plot


Our main reference for this method is this paper. In short, the paper uses normalized term frequency and term productivity as its measures.

  • Term Frequency: This is the normalized frequency of a given term in a given year.
  • Term Productivity: This is a measure of the ability of a concept to produce new multi-word terms; in our case, we use bigrams. For each year y and single-word term t with n associated multi-word terms m, the productivity is given by the entropy:

e(t, y) = - Σ_{i=1..n} p(m_i, y) · log2 p(m_i, y)

where p(m_i, y) = f(m_i, y) / Σ_{j=1..n} f(m_j, y), and f(m_i, y) is the frequency of the multi-word term m_i in year y.

Based on these two measures, they hypothesize three kinds of terms:

  • Growing Terms: Those with increasing frequency and productivity in recent years.
  • Consolidated Terms: Those that are growing in frequency, but not in productivity.
  • Terms in Decline: Those which have reached an upper bound of productivity and are declining in frequency.

Then, they perform clustering of the terms based on their frequency and productivity curves over the years to test their hypothesis. They find that the clusters formed show similar trends as expected.

NOTE: They also evaluate the quality of their clusters using pseudo-labels, but we do not use any automated labels here. They also experiment with and without double-counting multi-word terms; we stick to double-counting, which they suggest is more explainable.
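
A short sketch of the productivity computation from bigram counts (the example counts are made up):

import math
from collections import Counter

def productivity(bigram_counts):
    """Entropy of the bigram distribution of one term in one year."""
    total = sum(bigram_counts.values())
    if total == 0:
        return 0.0
    probs = [count / total for count in bigram_counts.values()]
    return -sum(p * math.log2(p) for p in probs)

# Illustrative: bigrams containing the term "neural" in some year
counts = Counter({"neural network": 120, "neural machine": 45, "neural model": 30})
print(productivity(counts))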

Acceleration Plot


This plot is based on word-pair acceleration over time. Our inspiration for this method is this paper. Acceleration is a metric that measures how quickly the embeddings of a pair of words get closer together or farther apart. If they are getting closer, the two terms have started appearing more frequently in similar contexts, which leads to similar embeddings. In the paper, it is described as:

acceleration(w_i, w_j) = sim(w_i, w_j)^(t+1) - sim(w_i, w_j)^(t)

where sim(w_i, w_j) is the cosine similarity between the embeddings u_i and u_j of the two words:

sim(w_i, w_j) = (u_i · u_j) / (||u_i|| · ||u_j||)

Below, we display the top few pairs between the given start and end year in a dataframe; one can then select years and word-pairs in the Plot Parameters expander. A reduced-dimension plot of the embeddings is displayed.

NOTE: They suggest using the skip-gram method over CBOW for the model. They use a t-SNE representation to view the embeddings, but their way of aligning the embeddings is different, and they use a stability measure to find the best Word2Vec model. They also use Word2Phrase, which we plan to add soon.
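
A sketch of the acceleration computation between two consecutive aligned year models (the model paths are assumptions; TWEC slices are gensim Word2Vec models):

import numpy as np
from gensim.models.word2vec import Word2Vec

def cosine_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def acceleration(w1, w2, model_t, model_t1):
    """Similarity at t+1 minus similarity at t for the pair (w1, w2)."""
    sim_t = cosine_sim(model_t.wv[w1], model_t.wv[w2])
    sim_t1 = cosine_sim(model_t1.wv[w1], model_t1.wv[w2])
    return sim_t1 - sim_t

model_1998 = Word2Vec.load("model/1998.model")   # illustrative paths
model_1999 = Word2Vec.load("model/1999.model")
print(acceleration("neural", "network", model_1998, model_1999))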

Semantic Drift


This plot represents the change in meaning of a word over time, shown on a 2-dimensional representation of the embedding space. To find the drift of a word, we calculate the distance between the embeddings of the word in the final year and in the initial year. We compute this for all words and sort them in descending order to find the most drifted words. We give an option to use one of two distance metrics, defined below: Euclidean distance and cosine distance.

d_euclidean(u, v) = ||u - v||_2

d_cosine(u, v) = 1 - (u · v) / (||u|| · ||v||)

where u and v are the embeddings of the selected word in the initial and final years.

We plot the top-K (sim.) most similar words around the two representations of the selected word.

In the Plot Parameters expander, the user can select the range of years over which the drift will be computed, as well as the dimensionality-reduction method used for plotting the embeddings.

Below the graph, we provide a list of the most drifted words (from the top-K keywords). The user can also choose a custom word.
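
A sketch of ranking candidate words by drift between two aligned year models (the paths and candidate list are illustrative):

import numpy as np
from gensim.models.word2vec import Word2Vec

model_start = Word2Vec.load("model/1998.model")
model_end = Word2Vec.load("model/2017.model")

def drift(word, metric="euclidean"):
    u, v = model_start.wv[word], model_end.wv[word]
    if metric == "euclidean":
        return float(np.linalg.norm(u - v))
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

candidates = ["translation", "parsing", "neural"]   # e.g., top-K keywords
most_drifted = sorted(candidates, key=drift, reverse=True)
print(most_drifted)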

Tracking Clusters


Word meanings change over time: words come closer together or drift apart. In a given year, certain words are clumped together, i.e., they belong to one cluster; over time, a cluster can break into several, or several can coalesce into one. Unlike the previous module, which tracks the movement of one word at a time, here we track the movement of clusters.

We plot the clusters formed for all the years in the selected range. NOTE: We give an option to use one of two libraries for clustering: sklearn or faiss. faiss' KMeans implementation is around 10 times faster than sklearn's.
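
A sketch of clustering one year's embeddings with either library (the model path and number of clusters are illustrative):

import numpy as np
import faiss
from gensim.models.word2vec import Word2Vec
from sklearn.cluster import KMeans

model = Word2Vec.load("model/2010.model")
words = list(model.wv.index_to_key)   # on gensim < 4: list(model.wv.vocab)
vectors = np.array([model.wv[w] for w in words], dtype=np.float32)

# Option 1: sklearn
labels_sklearn = KMeans(n_clusters=8, n_init=10).fit_predict(vectors)

# Option 2: faiss (typically around 10x faster)
kmeans = faiss.Kmeans(d=vectors.shape[1], k=8, niter=20)
kmeans.train(vectors)
_, labels_faiss = kmeans.index.search(vectors, 1)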

Acceleration Heatmap


This plot is based on word-pair acceleration over time. Our inspiration for this method is this paper. Acceleration is a metric that measures how quickly the embeddings of a pair of words get closer together or farther apart. If they are getting closer, the two terms have started appearing more frequently in similar contexts, which leads to similar embeddings. In the paper, it is described as:

acceleration(w_i, w_j) = sim(w_i, w_j)^(t+1) - sim(w_i, w_j)^(t)

where sim is the cosine similarity between the word embeddings (see the Acceleration Plot section above).

For all the selected keywords, we display a heatmap in which the brightness of a cell is directly proportional to the acceleration value between that pair of keywords.

NOTE: They suggest using the skip-gram method over CBOW for the model.
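
A sketch of assembling the pairwise matrix behind the heatmap, reusing the acceleration helper and year models sketched in the Acceleration Plot section (the keyword list is illustrative):

import numpy as np
import matplotlib.pyplot as plt

keywords = ["neural", "network", "parsing", "translation"]

# acceleration(), model_1998, model_1999 as defined in the earlier sketch
matrix = np.zeros((len(keywords), len(keywords)))
for i, w1 in enumerate(keywords):
    for j, w2 in enumerate(keywords):
        matrix[i, j] = acceleration(w1, w2, model_1998, model_1999)

plt.imshow(matrix, cmap="viridis")   # brighter cell = higher acceleration
plt.xticks(range(len(keywords)), keywords, rotation=45)
plt.yticks(range(len(keywords)), keywords)
plt.colorbar()
plt.show()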

Track Trends with Similarity


In this method, we wish to chart the trajectory of a word/topic from year 1 to year 2.

To accomplish this, we allow the user to pick a word from year 1 and provide a desired stride. We search for the most similar words in the next stride years, and keep doing this iteratively until we reach year 2, updating the word at each step.

The user has to select a word and click on Generate Dataframe. This gives a list of the most similar words in the next stride years. The user can then iteratively select the next word from the drop-down until the final year is reached.
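
A sketch of this iterative search (model paths and stride handling are assumptions; in the app the user picks each next word, while this sketch auto-picks the top candidate):

from gensim.models.word2vec import Word2Vec

def track(word, years, stride=1, topn=5):
    """Follow a word's most similar neighbours across year models."""
    trajectory = [(years[0], word)]
    current = word
    for year in years[stride::stride]:
        model = Word2Vec.load(f"model/{year}.model")
        candidates = model.wv.most_similar(current, topn=topn)
        current = candidates[0][0]    # the app lets the user choose instead
        trajectory.append((year, current))
    return trajectory

print(track("translation", list(range(2010, 2018)), stride=2))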

Keyword Visualisation


Here, we use the YAKE Keyword Extraction method to extract keywords. You can read more about YAKE here.

In our code, we use an open-source implementation of YAKE.

NOTE: YAKE returns scores that are inversely proportional to keyword importance. Hence, we invert the scores before reporting them, so that a higher final score corresponds to a more important keyword.
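
A sketch with the yake package (the parameters and path are illustrative; raw YAKE scores are lower for more important keywords):

import yake

with open("data/2017.txt") as f:
    text = f.read()

extractor = yake.KeywordExtractor(lan="en", n=1, top=10)
keywords = extractor.extract_keywords(text)   # list of (keyword, raw_score)

# Invert the raw scores so that higher means more important
# (the exact rescaling in the app may differ)
for keyword, score in keywords:
    print(keyword, 1.0 / score)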

LDA Topic Modelling


Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of documents, generally used for topic modelling and extraction. LDA clusters the text data into latent topics.

Every topic can be represented as a probability distribution over n-grams, and every document can be represented as a probability distribution over these generated topics.

We train LDA on a corpus where each document contains the abstracts of a particular year. We then express every year as a probability distribution over topics.

In the first bar graph, we show how a year can be decomposed into topics. The graphs below the first one show a decomposition of the relevant topics.
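
A sketch of this with gensim (the paths, year range, and number of topics are illustrative):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# One document per year: all of that year's processed abstracts
years = list(range(1998, 2018))
docs = [open(f"data/{y}.txt").read().split() for y in years]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5)

# A year as a probability distribution over topics
for topic_id, prob in lda.get_document_topics(corpus[0]):
    print(topic_id, prob, lda.print_topic(topic_id, topn=5))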

Citation

You can cite our work as:

@misc{sharma2021drift,
      title={DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature}, 
      author={Abheesht Sharma and Gunjan Chhablani and Harshit Pandey and Rajaswa Patil},
      year={2021},
      eprint={2107.01198},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

OR

Sharma, A., Chhablani, G., Pandey, H., & Patil, R. (2021). DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature.