PyTorch implementation of the paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

Overview

Efficient Nearest Neighbor Language Models

This is an implementation of the paper:

Efficient Nearest Neighbor Language Models
Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick
EMNLP 2021

This repo implements several techniques to speed up the evaluation of non-parametric, nearest neighbor language models. Specifically, we improve efficiency along three axes: adaptive retrieval, datastore pruning, and dimension reduction.

Install Dependencies

This repository is largely based on the knnlm repo, which is a fork of Fairseq (commit da544b). Please use the exact commit page to determine the software requirements for using this code.

git clone git@github.com:jxhe/efficient-knnlm.git

cd efficient-knnlm
pip install --editable .
pip install faiss

Hardware

Experiments for this paper were conducted on machines that contain 32 CPUs, 100GB of RAM, and one NVIDIA 3090 24GB GPU. Saving the Wikitext-103 datastore requires 200GB of disk space. Note that the number of CPUs has a great impact on the speed.

Running Efficient kNNLM

Preparation

Data

We reproduce Fairseq's instructions for preparing the data here.

mkdir -p datasets/wikitext-103
cp examples/language_model/wikitext-103/prepare-wikitext-103.sh datasets/wikitext-103

cd datasets/wikitext-103
bash prepare-wikitext-103.sh
cd ../..

TEXT=datasets/wikitext-103
python preprocess.py \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20

Download the language model checkpoint pretrained on WikiText-103

# the model checkpoint link is from the knnlm repo
wget https://nlp.stanford.edu/projects/knnlm/wt103_checkpoint_best.pt -P knnlm_ckpt

Save the datastore

mkdir -p dstore

python eval_lm.py data-bin/wikitext-103 \
    --path knnlm_ckpt/wt103_checkpoint_best.pt \
    --sample-break-mode none --max-tokens 3072 \
    --softmax-batch 1024 --gen-subset train \
    --context-window 1536 --tokens-per-sample 1536 \
    --dstore-mmap dstore/dstore --knn-keytype 'last_ffn_input' \
    --dstore-size 103225485 --model-overrides "{'knn_keytype': 'last_ffn_input'}" \
    --save-knnlm-dstore --fp16 --dstore-fp16
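As a rough sanity check on the disk requirement mentioned under Hardware, the arithmetic below estimates the size of the key vectors alone. It takes the entry count from `--dstore-size`, the key dimension from the `embed1024` part of the datastore prefix, and assumes 2 bytes per element because of `--dstore-fp16`. The stored values (target tokens) are not counted, since their on-disk type is not stated here.

```python
# Rough disk-footprint estimate for the saved datastore (keys only).
DSTORE_SIZE = 103_225_485   # --dstore-size from the command above
KEY_DIM = 1024              # "embed1024" in the datastore file prefix
BYTES_PER_FP16 = 2          # assumption: --dstore-fp16 stores 2 bytes per element


def key_bytes(num_entries: int, dim: int, bytes_per_element: int) -> int:
    """Bytes needed to store num_entries dense key vectors of size dim."""
    return num_entries * dim * bytes_per_element


total = key_bytes(DSTORE_SIZE, KEY_DIM, BYTES_PER_FP16)
print(f"keys: {total / 1e9:.1f} GB ({total / 2**30:.1f} GiB)")
```

The keys come to roughly 211 GB (about 197 GiB), which is in line with the 200GB figure quoted above.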

Dimension Reduction

# the script applies PCA of dimension 512 by default 
# the PCA hyperparameter can be tuned in this script
# set pca=0 to revert to the vanilla version
bash ef_knnlm/build_faiss.sh

The faiss index is saved into dstore. Try it out:

bash ef_knnlm/utils_cmd/eval_knnlm.sh \
    -d wikitext-103 \
    -s valid \
    -p dstore/dstore_size103225485_embed1024_fp16 \
    -i dstore/knn.103225485.pca512.m64.index \
    -n 103225485

You should already observe a speedup.
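To make the dimension-reduction step more concrete, here is a minimal NumPy sketch of PCA applied to key vectors. It is not the code in `ef_knnlm/build_faiss.sh`, which builds a faiss index; the function names `fit_pca` and `apply_pca` are hypothetical, and the toy sizes (64 to 16 dimensions) stand in for the 1024 to 512 reduction used by default.

```python
import numpy as np


def fit_pca(keys: np.ndarray, out_dim: int):
    """Return the mean and the top out_dim principal directions of keys."""
    mean = keys.mean(axis=0)
    centered = keys - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:out_dim].T  # components has shape (in_dim, out_dim)


def apply_pca(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray):
    """Project vectors (datastore keys or queries) onto the PCA subspace."""
    return (vectors - mean) @ components


rng = np.random.default_rng(0)
keys = rng.standard_normal((1000, 64)).astype(np.float32)  # toy datastore keys

mean, components = fit_pca(keys, out_dim=16)
reduced = apply_pca(keys, mean, components)
print(reduced.shape)
```

Queries must be projected with the same `mean` and `components` as the keys, so that distances are computed in the same reduced space.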

Adaptive Retrieval

prepare heldout data to train the retrieval adaptor

# this randomly selects 90% of validation data as the training data to 
# train the retrieval adaptor
bash ef_knnlm/adaptive_retrieval/prepare_heldout.sh wikitext-103
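The split described in the comment above can be pictured with the following sketch. It is an illustration only: `split_heldout` is a hypothetical helper, and splitting at the line level with a fixed seed is an assumption, not necessarily what `prepare_heldout.sh` does.

```python
import random


def split_heldout(items, train_percent=90, seed=0):
    """Randomly assign train_percent% of items to training, the rest to held-out."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = len(items) * train_percent // 100
    # Sort the chosen indices so each part keeps the original order.
    train = [items[i] for i in sorted(indices[:cut])]
    heldout = [items[i] for i in sorted(indices[cut:])]
    return train, heldout


lines = [f"line {i}" for i in range(100)]  # stand-in for validation data
train_lines, heldout_lines = split_heldout(lines)
print(len(train_lines), len(heldout_lines))
```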

prepare features

bash ef_knnlm/adaptive_retrieval/prepare_feature_pipeline.sh

train

bash ef_knnlm/adaptive_retrieval/train_ar.sh

This saves the retrieval adaptor checkpoints into checkpoint/wikitext-103-valid.

evaluation

# the cutoff ratio in adaptive retrieval
# by default we cut off half of the retrieval
cutoff=50

# set this to the path of the .pt checkpoint produced by the previous step
ar_ckpt=xxx

# this hyperparameter needs to be changed if 
# the datastore sizes change (e.g. datastore pruning)
size=103225485

dstore_prefix=dstore/dstore_size${size}_embed1024_fp16
index_file=dstore/knn.${size}.pca512.m64.index

bash ef_knnlm/utils_cmd/eval_knnlm.sh \
    -d wikitext-103 \
    -s test \
    -p ${dstore_prefix} \
    -i ${index_file} \
    -c knnlm_ckpt/wt103_checkpoint_best.pt \
    -n ${size} \
    -f datasets/wikitext-103 \
    -a ctxt,freq,lm_ent,lm_max,fert \
    -u ${cutoff} \
    -h ${ar_ckpt} \
    # -w "True"
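One way to read the `cutoff` setting is as a percentile threshold on a per-token score: tokens whose score falls in the lowest `cutoff` percent skip retrieval and use the base language model alone. The sketch below illustrates that idea under this assumption. The score list, `cutoff_threshold`, and `retrieval_mask` are hypothetical names, not part of the repo.

```python
def cutoff_threshold(scores, cutoff_percent):
    """Score below which retrieval is skipped for about cutoff_percent% of tokens."""
    ordered = sorted(scores)
    skip = len(ordered) * cutoff_percent // 100
    if skip == 0:
        return float("-inf")   # never skip retrieval
    if skip >= len(ordered):
        return float("inf")    # always skip retrieval
    return ordered[skip]


def retrieval_mask(scores, threshold):
    """True where the token should still query the datastore."""
    return [score >= threshold for score in scores]


# Hypothetical adaptor outputs for eight tokens (higher = retrieval more useful).
heldout_scores = [0.05, 0.40, 0.10, 0.75, 0.30, 0.90, 0.20, 0.60]

threshold = cutoff_threshold(heldout_scores, cutoff_percent=50)
mask = retrieval_mask(heldout_scores, threshold)
print(threshold, sum(mask), "of", len(mask), "tokens retrieve")
```

With `cutoff_percent=50`, half of the tokens in this toy example skip the nearest-neighbor search, which is where the speedup comes from.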

Datastore Pruning

precompute all the retrieval results for every record in the datastore:

# This operation can be parallelized by changing the
# "--start-point" and "--num" arguments so that the training
# data is split into multiple smaller chunks. In this case
# the retrieval results are saved into multiple files.
bash ef_knnlm/dstore_compression/save_retrieval_results.sh
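If you do split the work, the per-job values for `--start-point` and `--num` can be computed as below. This is a small helper sketch: `make_shards` is a hypothetical name, and the choice of four jobs is only an example. How the two arguments are passed into `save_retrieval_results.sh` is not shown here.

```python
def make_shards(total, num_shards):
    """Split range(total) into contiguous (start, num) chunks of near-equal size."""
    base, extra = divmod(total, num_shards)
    shards = []
    start = 0
    for i in range(num_shards):
        num = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        shards.append((start, num))
        start += num
    return shards


DSTORE_SIZE = 103_225_485  # number of datastore records, as used throughout this README

shards = make_shards(DSTORE_SIZE, num_shards=4)
for start, num in shards:
    print(f"--start-point {start} --num {num}")
```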

The retrieval results are saved into dstore/greedy_merge. Other datastore pruning algorithms can be tried out using these precomputed results.

greedy merging

# perform greedy merging to yield a new smaller datastore, 
# and build faiss index from the new datastore
bash ef_knnlm/dstore_compression/merge_compression.sh

The pruned datastore and index are saved into dstore/greedy_merging. Replace the previous dstore_prefix/index_file with the new ones to use the pruned datastore. The option -w "True" needs to be passed to eval_knnlm.sh to read the datastore weights file generated by greedy merging.
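The following toy sketch shows the general idea behind greedy merging and why a weights file is needed afterwards. It is a simplified illustration under stated assumptions, not the implementation in `merge_compression.sh`: each entry starts with weight 1, an entry absorbs unmerged neighbors that share its target token, and absorbed entries are dropped. The function name and the toy data are hypothetical.

```python
def greedy_merge(values, neighbors):
    """Merge datastore entries that share a target token with a nearby entry.

    values[i]    : target token of entry i
    neighbors[i] : precomputed nearest-neighbor ids of entry i
    Returns (kept_ids, kept_weights).
    """
    weights = [1] * len(values)
    for i, nbrs in enumerate(neighbors):
        if weights[i] == 0:
            continue  # entry i was already absorbed by an earlier entry
        for j in nbrs:
            # Only absorb entries that are still untouched and predict the same token.
            if j != i and weights[j] == 1 and values[j] == values[i]:
                weights[i] += 1
                weights[j] = 0
    kept = [i for i, w in enumerate(weights) if w > 0]
    return kept, [weights[i] for i in kept]


# Toy datastore: five entries, two distinct target tokens.
values = ["a", "a", "b", "a", "b"]
neighbors = [[1, 2], [0, 3], [4, 0], [1, 0], [2, 3]]

kept_ids, kept_weights = greedy_merge(values, neighbors)
print(kept_ids, kept_weights)
```

The total weight is preserved, so a surviving entry with weight 2 stands in for two original records. This is the kind of information the weights file carries at evaluation time.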

Reference

@inproceedings{he2021eff,
title={Efficient Nearest Neighbor Language Models},
author={Junxian He and Graham Neubig and Taylor Berg-Kirkpatrick},
booktitle={Proceedings of EMNLP},
year={2021}
}