Search Git commits in natural language

Overview

NaLCoS - NAtural Language COmmit Search

Search commit messages in your repository in natural language.



NaLCoS (NAtural Language COmmit Search) is a command-line tool for searching commit messages in your repository in natural language.

The key features are:

  • Search commit messages in both local and remote GitHub repositories.
  • Search for commits in a specific branch.
  • Restrict the number of commits to look back in history while searching.
  • Increase the number of retrieved results.
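
For a local repository, the branch and look-back features map naturally onto `git log`. The sketch below is one possible way to collect commit messages, assuming the `git` executable is on the PATH. The helper names (`git_log_command`, `parse_git_log`, `read_commits`) and the separator-based output format are illustrative choices, not taken from the NaLCoS source.

```python
import subprocess

# ASCII unit/record separators are unlikely to occur inside commit messages.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"


def git_log_command(branch=None, look_past=100):
    """Build a `git log` command limited to `look_past` commits."""
    # %h = short hash, %an = author name, %aI = ISO 8601 date, %B = full message
    fmt = "%h%x1f%an%x1f%aI%x1f%B%x1e"
    cmd = ["git", "log", f"--max-count={look_past}", f"--format={fmt}"]
    if branch:
        cmd.append(branch)  # without a branch, git uses the current HEAD
    return cmd


def parse_git_log(raw):
    """Split the raw `git log` output into a list of commit dictionaries."""
    commits = []
    for record in raw.split(RECORD_SEP):
        record = record.strip("\n")
        if not record:
            continue
        commit_id, author, date, message = record.split(FIELD_SEP, 3)
        commits.append(
            {"id": commit_id, "author": author, "date": date, "message": message.strip()}
        )
    return commits


def read_commits(repo_path, branch=None, look_past=100):
    """Run `git log` inside `repo_path` and return the parsed commits."""
    result = subprocess.run(
        git_log_command(branch, look_past),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_git_log(result.stdout)
```

Calling `read_commits(".", branch="main", look_past=50)` would then return up to 50 commits from the `main` branch of the repository in the current directory.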


Internally, NaLCoS uses Sentence Transformers with pre-trained weights from multi-qa-MiniLM-L6-cos-v1. I chose this particular model because it offers a good trade-off between performance and speed. Since this model was designed for semantic search and has been pre-trained on 215M (question, answer) pairs from diverse sources, it is a good choice for tasks such as measuring the similarity between two sentences.

NaLCoS encodes the query string and all the commits into their corresponding vector embeddings and computes the cosine similarity between the query and all the commits. This is then used to rank the commits.
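
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the actual NaLCoS code: the toy three-dimensional vectors stand in for the embeddings that the Sentence Transformers model would produce, and `cosine_similarity` and `rank_commits` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_commits(query_vec, commit_vecs, n_matches=10):
    """Return (commit index, score) pairs for the best matches, highest first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(commit_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_matches]


# Toy "embeddings"; a real model would output vectors with hundreds of dimensions.
query = [1.0, 0.0, 1.0]
commits = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
top = rank_commits(query, commits, n_matches=2)
```

Here the commit pointing in the same direction as the query ranks first and the partially aligned one second. Because cosine similarity ignores vector length, only the direction of each embedding matters.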

Why did I build this?

Until now, most of my Machine Learning work has been in dedicated environments such as Google Colab or Kaggle. I had been learning Natural Language Processing for a while and wanted to use transformers to build something different: a tool that is not very resource (read: GPU) intensive and can be used every day.

Though many Transformer models are far from fitting this description, I found that distilled models are not nearly as resource-hungry as their notoriously heavy older siblings. I could not find any existing tool for searching Git commits in natural language, so I decided to give it a shot.

Though there are various improvements left, I'm happy with what this initially turned out to be. I'm eager to see what further enhancements can be made to this to make it more efficient and useful.

Requirements

NaLCoS uses the following packages:

Installation

Installing with pip (Recommended)

Install with pip or your favourite PyPI package manager:

$ pip install nalcos

Run NaLCoS with the --help flag to see all the available options:

$ nalcos --help

Note: When you run the nalcos command for the first time, it downloads the model. The model is then cached and reused on later runs.

Installing bleeding edge from the GitHub repository

  • Clone the repository:
$ git clone https://github.com/thepushkarp/nalcos.git

This also downloads the model weights stored in the nalcos/models directory so you don't have to download them while running the model for the first time.

  • Create a virtual environment:
$ virtualenv venv
  • Activate the virtual environment (Linux and macOS):
$ source ./venv/bin/activate
  • Activate the virtual environment (Windows):
$ venv\Scripts\activate
  • Install the requirements:
$ pip install -r requirements.txt
  • Change directory to the nalcos directory:
$ cd nalcos/
  • Run NaLCoS with the --help flag to see all the available options:
$ python nalcos.py --help

Usage

Detailed usage information for NaLCoS is shown below:

usage: nalcos [-h] [-g] [-n N_MATCHES] [-b BRANCH] [-l LOOK_PAST] [-v] query location

Search a commit in your git repository using natural language.

positional arguments:
  query                 The query to search for similar commit messages.
  location              The repository path to search in. If `-g` flag is not passed, searches locally in the path specified, else
                        takes in a remote GitHub repository name in the format '{owner}/{repo_name}'

optional arguments:
  -h, --help            show this help message and exit
  -g, --github          Flag to search on GitHub instead of searching in a local repository. Due to API limits currently this
                        allows for around 15 lookups per hour from your IP.
  -n N_MATCHES, --n-matches N_MATCHES
                        The number of matching results to return. Default 10.
  -b BRANCH, --branch BRANCH
                        The branch to search in. If not specified, the current branch will be used by default.
  -l LOOK_PAST, --look-past LOOK_PAST
                        Look back this many commits. Default 100.
  -v, --version         show program's version number and exit
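
The help text above corresponds to a fairly standard `argparse` setup. The following sketch is reconstructed from that help output only; the real parser may be declared differently, `build_parser` is a hypothetical name, and the version string is a placeholder.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="nalcos",
        description="Search a commit in your git repository using natural language.",
    )
    parser.add_argument("query", help="The query to search for similar commit messages.")
    parser.add_argument(
        "location",
        help="Local repository path, or '{owner}/{repo_name}' when -g is passed.",
    )
    parser.add_argument(
        "-g", "--github", action="store_true",
        help="Search on GitHub instead of searching in a local repository.",
    )
    parser.add_argument(
        "-n", "--n-matches", type=int, default=10,
        help="The number of matching results to return. Default 10.",
    )
    parser.add_argument(
        "-b", "--branch", default=None,
        help="The branch to search in. Defaults to the current branch.",
    )
    parser.add_argument(
        "-l", "--look-past", type=int, default=100,
        help="Look back this many commits. Default 100.",
    )
    # Placeholder version; the actual value would come from the package.
    parser.add_argument("-v", "--version", action="version", version="%(prog)s 0.0.0")
    return parser


args = build_parser().parse_args(["improve language", "github/docs", "--github"])
```

With these arguments, `args.github` is `True` while `args.n_matches` and `args.look_past` keep their defaults of 10 and 100.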

Examples

  • Input:
$ python nalcos.py "improve language" "github/docs" --github
  • Output:
Found 100 commits.

                                        Commits related to "improve language" in "github/docs"
┏━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ No. ┃ Commit ID ┃ Commit Message                                                        ┃ Commit Author      ┃ Commit Date          ┃
┡━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━┩
│  1. │ 51bfdbb95 │ Merge branch 'main' into fatenhealy-fix-supportedlanguage             │ Faten Healy        │ 2021-09-12T22:26:31Z │
│  2. │ a9c2c8eea │ fix deprecation label spelling (#21474)                               │ Rachael Sewell     │ 2021-09-13T18:12:03Z │
│  3. │ 94e3c092d │ English search sync (#21446)                                          │ Rachael Sewell     │ 2021-09-13T17:30:08Z │
│  4. │ b048e27e9 │ Merge pull request #9909 from github/fatenhealy-fix-supportedlanguage │ Ramya Parimi       │ 2021-09-12T22:35:19Z │
│  5. │ 73c2717f7 │ Fix typo                                                              │ Adrian Mato        │ 2021-09-13T06:35:27Z │
│  6. │ 86b571982 │ Export changes to a branch for codespaces (#21462)                    │ Matthew Isabel     │ 2021-09-13T14:55:50Z │
│  7. │ 969288662 │ Update diff limit to 500KB (#20616)                                   │ jjkennedy3         │ 2021-09-11T09:12:38Z │
│  8. │ f28ee46d4 │ Update OpenAPI Descriptions (#21447)                                  │ github-openapi-bot │ 2021-09-11T09:22:28Z │
│  9. │ 92af3a469 │ update search indexes                                                 │ GitHub Actions     │ 2021-09-12T09:50:46Z │
│ 10. │ e6018f2aa │ update search indexes                                                 │ GitHub Actions     │ 2021-09-11T02:05:19Z │
└─────┴───────────┴───────────────────────────────────────────────────────────────────────┴────────────────────┴──────────────────────┘

Future plans

Please visit the NaLCoS To Do Project Board to see current status and future plans.

Known issues

The retrieved results are not always relevant. I can think of two primary reasons for this:

  • The data the model was pre-trained on is not representative of how people write commit messages. Since commit messages usually contain technical jargon, merge commit messages, abbreviations and other non-common terms, the model (which has a limited vocabulary) is not able to generalize well to this data.
  • Two commits may be related even when their messages are not similar, and two commits may be unrelated even when their messages are similar. More metadata (such as the lines changed and files changed) is often needed to make the predictions more accurate.

Contributing

Any suggestions, improvements or bug reports are welcome.

Contributors

Thanks goes to these wonderful people (emoji key):


Pushkar Patel

💻 📖 🚧

This project follows the all-contributors specification. Contributions of any kind welcome!

License

This project is licensed under the terms of the MIT license.

Comments
  • Patches

    • :truck: Renames .cache directory to models and add Project Board link in README
    • Adds version.py
    • Adds contributing section in README
    • :lipstick: Add black code style
    • Adds back torch cuda support
    • :zap: Improve similarity computation
    • :tada: Upload to PyPi
    opened by thepushkarp 3
  • Version 0.2: Visual changes

    • Fix status message typo
    • :sparkles: Adds a flag to show similarity scores of the result
    • :sparkles: Adds an flag to display the entire commit message
    • Module error fixes
    • :sparkles: Adds commit links for results from GitHub
    • :art: Improve download prograss bar display when loading for first time
    • :bookmark: Bump version to 0.2
    opened by thepushkarp 1
  • Add a flag to download the model

    Currently, if the model is not downloaded, the program downloads it during the first run, in the middle of the "Retrieving the commits ..." status message.

    This can be improved by adding a flag through which the user can download/redownload the model when they need it.

    Additionally, when the program is run before the weights have been downloaded, it should prompt the user to either download them now or abort.

    opened by thepushkarp 1
  • Use a Python Wrapper to the GitHub API

    We can use a Python wrapper for the GitHub API, such as ghapi or PyGithub, instead of using the requests library.

    Additional Reference: https://docs.github.com/en/rest/overview/libraries#python

    This can help with #11

    opened by thepushkarp 1
  • Visual improvements

    • :sparkles: Adds a flag to show similarity scores of the result
    • :sparkles: Adds an flag to display the entire commit message
    • Module error fixes
    • :sparkles: Adds commit links for results from GitHub
    opened by thepushkarp 0
  • Add Automated Testing

    Automate the testing of the module (preferably with GitHub Actions for CI).

    NOTE: Considering the large installation size and time of the Torch and HuggingFace modules, the resources allocated may go over the GH Actions limit. This is something we have to take care of.

    Follow up of #13

    opened by thepushkarp 0
  • Adds README and some bug fixes

    • Adds API limit exceeded warnning
    • :zap: Reverts back to using whole commit msg for serarch; displays only title
    • :memo: Add README
    • :zap: Reduces default value of look_past from 1000 to 100
    • :bug: Retrieves all branch names for GitHub repos and add branch not found Exception
    opened by thepushkarp 0
  • Try out other models.

    Currently, we are using multi-qa-MiniLM-L6-cos-v1, which has a speed (sentences encoded/sec on 1 V100 GPU) of 14200 and a model size of 80 MB. We should try out other models to see if we can get better performance and speed out of them.

    Additionally, we can also try using other types of tokenizers.

    Further reading:

    • https://www.sbert.net/docs/pretrained_models.html
    • https://huggingface.co/sentence-transformers
    • https://huggingface.co/transformers/tokenizer_summary.html
    help wanted 
    opened by thepushkarp 0
  • Add personal API token support

    Do #28 before this

    Currently, the project uses the GitHub API without authentication, which is capped at 60 requests per hour per IP address.

    We can add the option to add a user's personal API access token to increase this limit.

    enhancement 
    opened by thepushkarp 0
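
    As a rough idea of what token support could look like, the sketch below builds a request for GitHub's "list commits" REST endpoint and attaches an `Authorization` header only when a token is supplied. It uses the standard library rather than the project's own HTTP code, and `build_commits_request` and its parameters are hypothetical names chosen for illustration.

```python
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def build_commits_request(repo, branch=None, per_page=100, token=None):
    """Prepare (but do not send) a request listing commits of '{owner}/{repo_name}'."""
    params = {"per_page": per_page}
    if branch:
        params["sha"] = branch  # the API accepts a branch name in the `sha` parameter
    url = f"{API_ROOT}/repos/{repo}/commits?{urllib.parse.urlencode(params)}"

    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests count against the user's own, higher rate limit.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)


request = build_commits_request("github/docs", branch="main", token="example-token")
```

    The prepared request can be sent with `urllib.request.urlopen(request)`. Reading the token from an environment variable rather than a command-line flag would keep it out of the shell history.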
Releases(v0.2)
  • v0.2(Sep 18, 2021)

    Changelog

    • Adds an option of showing the similarity score for the results using the -s flag.
    • Adds option of viewing the entire commit message instead of just the commit title using the -v flag.
    • Commits in results retrieved from GitHub have links to the commits
    • Improved the model download progress bar display when loading model for the first time
  • v0.1.1(Sep 14, 2021)

    Changelog

    • Make similarity computation more efficient
    • Add support for computation on CUDA
    • Add Black Code style in requirements
    • Published to PyPi at https://pypi.org/project/nalcos/ 🥳
  • v0.1(Sep 13, 2021)

    Features ✨

    • Search commit messages in both local and remote GitHub repositories.
    • Search for commits in a specific branch.
    • Restrict the number of commits to look back in history while searching.
    • Increase the number of retrieved results.
Owner
Pushkar Patel
Research Intern at SPIRE Labs, IISC Bangalore | GitHub Campus Expert @iiitv