Search Git commits in natural language

Overview

NaLCoS - NAtural Language COmmit Search

Search commit messages in your repository in natural language.


NaLCoS (NAtural Language COmmit Search) is a command-line tool for searching commit messages in your repository in natural language.

The key features are:

  • Search commit messages in both local and remote GitHub repositories.
  • Search for commits in a specific branch.
  • Restrict the number of commits to look back in history while searching.
  • Increase the number of retrieved results.

Internally, NaLCoS uses Sentence Transformers with pre-trained weights from multi-qa-MiniLM-L6-cos-v1. I chose this model because it offers a good trade-off between search quality and speed. It was designed for semantic search and pre-trained on 215M (question, answer) pairs from diverse sources, which makes it a good fit for tasks such as measuring the similarity between two sentences.

NaLCoS encodes the query string and all the commit messages into vector embeddings and computes the cosine similarity between the query embedding and each commit embedding. These similarity scores are then used to rank the commits.
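
The ranking step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names `cosine_similarity` and `rank_commits` are hypothetical, and the three-dimensional vectors are toy values standing in for the embeddings that the sentence-embedding model would produce.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero embedding as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_commits(query_embedding, commit_embeddings, n_matches=10):
    """Return (commit_index, score) pairs, best match first."""
    scored = [
        (index, cosine_similarity(query_embedding, embedding))
        for index, embedding in enumerate(commit_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_matches]


# Toy embeddings (hypothetical values, for illustration only).
# In practice these would come from the sentence-embedding model.
query = [1.0, 0.0, 1.0]
commits = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 1.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
top = rank_commits(query, commits, n_matches=2)
print(top)
```

Because cosine similarity depends only on the direction of the vectors, a long commit message and a short one can score equally well if they point the same way in embedding space.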

Why did I build this?

So far, most of my Machine Learning work has been in dedicated environments such as Google Colab or Kaggle. I had been learning Natural Language Processing for a while and wanted to use transformers to build something different: a tool that is not very resource (read: GPU) intensive and can be used every day.

Though many Transformer models are far from fitting this description, I found that distilled models are not nearly as resource-hungry as their larger siblings are notorious for being. I could not find any existing tool for searching Git commits using natural language, so I decided to give it a shot.

Though there is plenty left to improve, I'm happy with how this first version turned out. I'm eager to see what further enhancements can make it more efficient and useful.

Requirements

NaLCoS uses the following packages:

Installation

Installing with pip (Recommended)

Install with pip or your favourite PyPI package manager:

$ pip install nalcos

Run NaLCoS with the --help flag to see all the available options:

$ nalcos --help

Note: The first time you run the nalcos command, it downloads the model. The model is then cached and reused on subsequent runs.

Installing bleeding edge from the GitHub repository

  • Clone the repository:
    $ git clone https://github.com/thepushkarp/nalcos.git

    This also downloads the model weights stored in the nalcos/models directory, so you don't have to download them when you run NaLCoS for the first time.

  • Create a virtual environment (see the virtualenv documentation for details on activating it):
    $ virtualenv venv
  • Activate virtualenv (for Linux and macOS):
    $ source ./venv/bin/activate
  • Activate virtualenv (for Windows):
    $ venv\Scripts\activate
  • Install the requirements:
    $ pip install -r requirements.txt
  • Change directory to the nalcos directory:
    $ cd nalcos/
  • Run NaLCoS with the --help flag to see all the available options:
    $ python nalcos.py --help

Usage

Detailed usage information for NaLCoS is shown below:

usage: nalcos [-h] [-g] [-n N_MATCHES] [-b BRANCH] [-l LOOK_PAST] [-v] query location

Search a commit in your git repository using natural language.

positional arguments:
  query                 The query to search for similar commit messages.
  location              The repository path to search in. If `-g` flag is not passed, searches locally in the path specified, else
                        takes in a remote GitHub repository name in the format '{owner}/{repo_name}'

optional arguments:
  -h, --help            show this help message and exit
  -g, --github          Flag to search on GitHub instead of searching in a local repository. Due to API limits currently this
                        allows for around 15 lookups per hour from your IP.
  -n N_MATCHES, --n-matches N_MATCHES
                        The number of matching results to return. Default 10.
  -b BRANCH, --branch BRANCH
                        The branch to search in. If not specified, the current branch will be used by default.
  -l LOOK_PAST, --look-past LOOK_PAST
                        Look back this many commits. Default 100.
  -v, --version         show program's version number and exit
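
For readers who want to see how an interface like the one above maps onto code, here is a minimal `argparse` sketch that accepts the same arguments and defaults as the help text. It is an illustration only and not taken from the NaLCoS source: `build_parser` is a hypothetical name, the help strings are shortened, and the version string `0.0.0` is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options listed in the help text above."""
    parser = argparse.ArgumentParser(
        prog="nalcos",
        description="Search a commit in your git repository using natural language.",
    )
    parser.add_argument("query", help="The query to search for similar commit messages.")
    parser.add_argument(
        "location",
        help="Local repository path, or '{owner}/{repo_name}' when -g is passed.",
    )
    parser.add_argument(
        "-g", "--github", action="store_true",
        help="Search on GitHub instead of in a local repository.",
    )
    parser.add_argument(
        "-n", "--n-matches", type=int, default=10,
        help="The number of matching results to return. Default 10.",
    )
    parser.add_argument(
        "-b", "--branch", default=None,
        help="The branch to search in. Defaults to the current branch.",
    )
    parser.add_argument(
        "-l", "--look-past", type=int, default=100,
        help="Look back this many commits. Default 100.",
    )
    # Placeholder version string; the real value would come from the package.
    parser.add_argument("-v", "--version", action="version", version="%(prog)s 0.0.0")
    return parser


# Parse the same arguments used in the example invocation in this README.
args = build_parser().parse_args(["improve language", "github/docs", "--github"])
print(args)
```

Note that `argparse` converts the dashes in `--n-matches` and `--look-past` to underscores, so the parsed values are read as `args.n_matches` and `args.look_past`.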

Examples

  • Input:
$ python nalcos.py "improve language" "github/docs" --github
  • Output:
Found 100 commits.

                                        Commits related to "improve language" in "github/docs"
┏━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ No. ┃ Commit ID ┃ Commit Message                                                        ┃ Commit Author      ┃ Commit Date          ┃
┡━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│  1. │ 51bfdbb95 │ Merge branch 'main' into fatenhealy-fix-supportedlanguage             │ Faten Healy        │ 2021-09-12T22:26:31Z │
│  2. │ a9c2c8eea │ fix deprecation label spelling (#21474)                               │ Rachael Sewell     │ 2021-09-13T18:12:03Z │
│  3. │ 94e3c092d │ English search sync (#21446)                                          │ Rachael Sewell     │ 2021-09-13T17:30:08Z │
│  4. │ b048e27e9 │ Merge pull request #9909 from github/fatenhealy-fix-supportedlanguage │ Ramya Parimi       │ 2021-09-12T22:35:19Z │
│  5. │ 73c2717f7 │ Fix typo                                                              │ Adrian Mato        │ 2021-09-13T06:35:27Z │
│  6. │ 86b571982 │ Export changes to a branch for codespaces (#21462)                    │ Matthew Isabel     │ 2021-09-13T14:55:50Z │
│  7. │ 969288662 │ Update diff limit to 500KB (#20616)                                   │ jjkennedy3         │ 2021-09-11T09:12:38Z │
│  8. │ f28ee46d4 │ Update OpenAPI Descriptions (#21447)                                  │ github-openapi-bot │ 2021-09-11T09:22:28Z │
│  9. │ 92af3a469 │ update search indexes                                                 │ GitHub Actions     │ 2021-09-12T09:50:46Z │
│ 10. │ e6018f2aa │ update search indexes                                                 │ GitHub Actions     │ 2021-09-11T02:05:19Z │
└─────┴───────────┴───────────────────────────────────────────────────────────────────────┴────────────────────┴──────────────────────┘

Future plans

Please visit the NaLCoS To Do Project Board to see current status and future plans.

Known issues

Not all retrieved results are relevant. I can think of two primary reasons for this:

  • The data the model was pre-trained on is not representative of how people write commit messages. Commit messages often contain technical jargon, merge boilerplate, abbreviations and other uncommon terms, so the model (which has a limited vocabulary) does not generalize well to them.
  • Two commits may be related even when their messages are not similar, and two commits may be unrelated even when their messages are similar. Additional metadata (such as lines changed or files changed) is often needed to make the predictions more accurate.

Contributing

Any suggestions, improvements or bug reports are welcome.

Contributors

Thanks goes to these wonderful people (emoji key):


Pushkar Patel

💻 📖 🚧

This project follows the all-contributors specification. Contributions of any kind welcome!

License

This project is licensed under the terms of the MIT license.

Comments
  • Patches

    Patches

    • :truck: Renames .cache directory to models and add Project Board link in README
    • Adds version.py
    • Adds contributing section in README
    • :lipstick: Add black code style
    • Adds back torch cuda support
    • :zap: Improve similarity computation
    • :tada: Upload to PyPi
    opened by thepushkarp 3
  • Version 0.2: Visual changes

    Version 0.2: Visual changes

    • Fix status message typo
    • :sparkles: Adds a flag to show similarity scores of the result
    • :sparkles: Adds an flag to display the entire commit message
    • Module error fixes
    • :sparkles: Adds commit links for results from GitHub
    • :art: Improve download prograss bar display when loading for first time
    • :bookmark: Bump version to 0.2
    opened by thepushkarp 1
  • Add a flag to download the model

    Add a flag to download the model

    Currently, if the model is not downloaded, the program downloads it during the first run, in the middle of the "Retrieving the commits ..." status message.

    This can be improved by adding a flag through which the user can download/redownload the model when they need it.

    Additionally, the program should prompt the user when it is run without downloading the weights with a choice to download it now or to abort the program.

    opened by thepushkarp 1
  • Use a Python Wrapper to the GitHub API

    Use a Python Wrapper to the GitHub API

    We can use a Python wrapper for the GitHub API, such as ghapi or PyGithub, instead of using the requests library.

    Additional Reference: https://docs.github.com/en/rest/overview/libraries#python

    This can help with #11

    opened by thepushkarp 1
  • Visual improvements

    Visual improvements

    • :sparkles: Adds a flag to show similarity scores of the result
    • :sparkles: Adds an flag to display the entire commit message
    • Module error fixes
    • :sparkles: Adds commit links for results from GitHub
    opened by thepushkarp 0
  • Add Automated Testing

    Add Automated Testing

    Automate the testing of the module (preferably with GitHub Actions for CI).

    NOTE: Considering the large installation size and time of the Torch and HuggingFace modules, the resources allocated may go over the GH Actions limit. This is something we have to take care of.

    Follow up of #13

    opened by thepushkarp 0
  • Adds README and some bug fixes

    Adds README and some bug fixes

    • Adds API limit exceeded warnning
    • :zap: Reverts back to using whole commit msg for serarch; displays only title
    • :memo: Add README
    • :zap: Reduces default value of look_past from 1000 to 100
    • :bug: Retrieves all branch names for GitHub repos and add branch not found Exception
    opened by thepushkarp 0
  • Try out other models.

    Try out other models.

    Currently, we are using multi-qa-MiniLM-L6-cos-v1, which has a speed (sentences encoded/sec on 1 V100 GPU) of 14200 and a model size of 80 MB. We should try out other models to see if we can get better performance and speed out of them.

    Additionally, we can also try using other types of tokenizers.

    Further reading:

    • https://www.sbert.net/docs/pretrained_models.html
    • https://huggingface.co/sentence-transformers
    • https://huggingface.co/transformers/tokenizer_summary.html
    help wanted 
    opened by thepushkarp 0
  • Add personal API token support

    Add personal API token support

    Do #28 before this

    Currently, the project uses the unauthenticated GitHub API, which is capped at 60 requests per hour per IP address.

    We can add the option to add a user's personal API access token to increase this limit.

    enhancement 
    opened by thepushkarp 0
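
As a rough illustration of the personal API token idea raised in the last issue above, the sketch below builds (without sending) a request to GitHub's REST endpoint for listing commits, optionally attaching a token. It uses only the standard library rather than the requests library the project currently uses. `build_commits_request` is a hypothetical helper, `<your-token>` is a placeholder, and nothing here is taken from the NaLCoS source.

```python
import urllib.parse
import urllib.request

GITHUB_API = "https://api.github.com"


def build_commits_request(repo, branch=None, look_past=100, token=None):
    """Build, but do not send, a request for a repository's recent commits.

    `repo` is in the '{owner}/{repo_name}' format used by the -g flag.
    """
    # The commits endpoint returns at most 100 results per page.
    params = {"per_page": min(look_past, 100)}
    if branch:
        params["sha"] = branch  # 'sha' accepts a branch name or a commit SHA
    url = f"{GITHUB_API}/repos/{repo}/commits?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(url)
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # Authenticated requests get a higher rate limit than anonymous ones.
        request.add_header("Authorization", f"Bearer {token}")
    return request


anonymous = build_commits_request("github/docs")
authenticated = build_commits_request("github/docs", branch="main", token="<your-token>")
print(anonymous.full_url)
print(authenticated.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual call. Keeping request construction separate from sending makes the token handling easy to check without touching the network.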
Releases(v0.2)
  • v0.2(Sep 18, 2021)

    Changelog

    • Adds an option of showing the similarity score for the results using the -s flag.
    • Adds option of viewing the entire commit message instead of just the commit title using the -v flag.
    • Commits in results retrieved from GitHub have links to the commits
    • Improved the model download progress bar display when loading model for the first time
  • v0.1.1(Sep 14, 2021)

    Changelog

    • Make similarity computation more efficient
    • Add support for computation on CUDA
    • Add Black Code style in requirements
    • Published to PyPi at https://pypi.org/project/nalcos/ 🥳
  • v0.1(Sep 13, 2021)

    Features ✨

    • Search commit messages in both local and remote GitHub repositories.
    • Search for commits in a specific branch.
    • Restrict the number of commits to look back in history while searching.
    • Increase the number of retrieved results.
Owner
Pushkar Patel
Research Intern at SPIRE Labs, IISC Bangalore | GitHub Campus Expert @iiitv