Cherche

Neural search



Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium-sized corpora. Its main strength is its ability to build diverse, end-to-end pipelines.

Installation 🤖

pip install cherche

To install the development version:

pip install git+https://github.com/raphaelsty/cherche

Documentation 📜

Documentation is available here. It provides details about retrievers, rankers, pipelines, question answering, summarization, and examples.

QuickStart 💨

Documents 📑

Cherche finds the right documents within a list of objects. Here is an example of a corpus.

from cherche import data

documents = data.load_towns()

documents[:3]
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
  }]

Retriever ranker 🔍

Here is an example of a neural search pipeline composed of a TfIdf retriever that quickly retrieves candidate documents, followed by a ranking model. The ranking model sorts the documents produced by the retriever by the semantic similarity between the query and each document.

from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Rank on fields title and article
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=3,
    path="encoder.pkl",  # per the 1.0.0 release notes, `path` is no longer used
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

search("Bordeaux")
[{'id': 57, 'similarity': 0.69513476},
 {'id': 63, 'similarity': 0.6214991},
 {'id': 65, 'similarity': 0.61809057}]
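
The `retriever + ranker` composition can be pictured as operator overloading that chains callables. The sketch below is only an illustration of that idea, not Cherche's implementation: `Step`, `Pipeline`, `KeywordRetriever`, and `LengthRanker` are hypothetical names, and the "ranker" is a toy that prefers shorter articles instead of a neural model.

```python
class Step:
    """Hypothetical base class: a callable stage that can be chained with +."""

    def __add__(self, other):
        return Pipeline([self, other])


class Pipeline(Step):
    """Runs each stage in order, feeding the output of one into the next."""

    def __init__(self, steps):
        self.steps = steps

    def __add__(self, other):
        return Pipeline(self.steps + [other])

    def __call__(self, q, documents=None):
        for step in self.steps:
            documents = step(q, documents)
        return documents


class KeywordRetriever(Step):
    """Toy retriever: keeps documents whose article contains the query."""

    def __init__(self, documents, k):
        self.documents, self.k = documents, k

    def __call__(self, q, documents=None):
        hits = [d for d in self.documents if q.lower() in d["article"].lower()]
        return hits[: self.k]


class LengthRanker(Step):
    """Toy ranker: shortest article first (stand-in for a neural model)."""

    def __init__(self, k):
        self.k = k

    def __call__(self, q, documents):
        return sorted(documents, key=lambda d: len(d["article"]))[: self.k]


docs = [
    {"id": 0, "article": "Bordeaux is a port city on the river Garonne."},
    {"id": 1, "article": "Bordeaux wine."},
    {"id": 2, "article": "Paris is the capital of France."},
]
search_sketch = KeywordRetriever(docs, k=30) + LengthRanker(k=1)
results = search_sketch("Bordeaux")
```

The retriever narrows the corpus cheaply, and the ranker only has to score the survivors, which is the division of labour the pipeline above relies on.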

Map the index to the documents to access their contents.

search += documents
search("Bordeaux")
[{'id': 57,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.69513476},
 {'id': 63,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'The term "Bordelais" may also refer to the city and its surrounding region.',
  'similarity': 0.6214991},
 {'id': 65,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.61809057}]
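
Conceptually, mapping the index to the documents is a join on the `id` key. The following is a minimal pure-Python sketch of such a join, assuming results and documents are lists of dicts as shown above; `attach_documents` is a hypothetical helper, not part of Cherche.

```python
def attach_documents(results, documents, key="id"):
    """Merge each scored result with the document that shares its key."""
    index = {doc[key]: doc for doc in documents}
    # Document fields come first, then the result fields (e.g. similarity).
    return [{**index[res[key]], **res} for res in results]


documents_sketch = [
    {"id": 57, "title": "Bordeaux", "article": "Bordeaux is a port city."},
    {"id": 63, "title": "Bordeaux", "article": "The term Bordelais may also refer to the city."},
]
scored = [{"id": 63, "similarity": 0.62}, {"id": 57, "similarity": 0.69}]
enriched = attach_documents(scored, documents_sketch)
```

The order of `scored` is preserved, so the ranking produced by the pipeline is not disturbed by the join.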

Retrieve 👻

Cherche provides different retrievers that filter input documents based on a query.

  • retrieve.Elastic
  • retrieve.TfIdf
  • retrieve.Lunr
  • retrieve.BM25Okapi
  • retrieve.BM25L
  • retrieve.Flash
  • retrieve.Encoder
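
To give an intuition for what a lexical retriever such as TfIdf does, here is a small standard-library sketch of TF-IDF retrieval with cosine similarity. It is an illustration under simplifying assumptions (whitespace tokenization, a smoothed IDF of the form `log((1 + n) / (1 + df)) + 1`), not the code Cherche runs; `tfidf_retrieve` and its parameters are hypothetical names chosen to mirror the `key`, `on`, and `k` arguments used above.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def tfidf_retrieve(query, documents, on, key="id", k=2):
    """Return up to k documents ranked by TF-IDF cosine similarity to the query."""
    texts = [tokenize(" ".join(doc[field] for field in on)) for doc in documents]
    n = len(texts)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in texts for token in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

    def vectorize(tokens):
        # Unit-length sparse TF-IDF vector; unknown tokens are ignored.
        counts = Counter(t for t in tokens if t in idf)
        vec = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        return {t: v / norm for t, v in vec.items()} if norm else {}

    q = vectorize(tokenize(query))
    scored = []
    for doc, tokens in zip(documents, texts):
        d = vectorize(tokens)
        sim = sum(w * d.get(t, 0.0) for t, w in q.items())
        if sim > 0:  # filter out documents with no lexical overlap
            scored.append({key: doc[key], "similarity": sim})
    scored.sort(key=lambda r: r["similarity"], reverse=True)
    return scored[:k]


towns = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Bordeaux", "article": "Bordeaux is a port city on the river Garonne."},
    {"id": 2, "title": "Toulouse", "article": "Toulouse is the capital of Occitania."},
]
hits = tfidf_retrieve("bordeaux", towns, on=["title", "article"])
```

Documents without any token in common with the query are dropped, which is what makes a retriever a filter rather than just a sorter.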

Rank 🤗

Cherche rankers are compatible with SentenceTransformers models, Hugging Face sentence similarity models, Hugging Face zero shot classification models, and of course with your own models.
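
As a rough illustration of plugging in your own model, the QuickStart passes the ranker a callable (`SentenceTransformer(...).encode`) that turns a list of texts into one vector per text. The sketch below mimics that contract with a deliberately naive stand-in encoder (letter counts) and a cosine-similarity re-ranker. `toy_encoder` and `rerank` are hypothetical names, and the exact interface Cherche expects from a custom encoder should be checked in its documentation.

```python
import math


def toy_encoder(texts):
    """Stand-in for a real model: one 26-dimensional letter-count vector per text."""
    vectors = []
    for text in texts:
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - ord("a")] += 1.0
        vectors.append(vec)
    return vectors


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(query, candidates, encoder, on, k=3):
    """Sort candidate documents by embedding similarity to the query."""
    query_embedding = encoder([query])[0]
    embeddings = encoder([" ".join(doc[f] for f in on) for doc in candidates])
    scored = [
        {**doc, "similarity": cosine(query_embedding, emb)}
        for doc, emb in zip(candidates, embeddings)
    ]
    scored.sort(key=lambda d: d["similarity"], reverse=True)
    return scored[:k]


candidates = [
    {"id": 0, "title": "Paris", "article": "capital of France"},
    {"id": 1, "title": "Bordeaux", "article": "wine capital"},
]
top = rerank("Bordeaux wine", candidates, toy_encoder, on=["title", "article"], k=1)
```

Swapping `toy_encoder` for a real sentence-embedding function changes the quality of the ranking but not the shape of the code.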

Summarization and question answering

Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and can be fully integrated into neural search pipelines.

Acknowledgements 👏

The BM25 models available in Cherche are wrappers around rank_bm25. The Elastic retriever is a wrapper around the Python Elasticsearch Client. The TfIdf retriever is a wrapper around scikit-learn's TfidfVectorizer. The Lunr retriever is a wrapper around Lunr.py. The Flash retriever is a wrapper around FlashText. The DPR and Encoder rankers are wrappers dedicated to using the pre-trained models of SentenceTransformers in a neural search pipeline. The ZeroShot ranker is a wrapper dedicated to using the zero-shot sequence classifiers of Hugging Face in a neural search pipeline.

See also 👀

Cherche is a minimalist solution and meets a need for modularity. Cherche is the way to go if you start with a list of documents as JSON with multiple fields to search on and want to create pipelines. Cherche is also well suited for medium-sized corpora.

Do not hesitate to look at Haystack, Jina, or TxtAi which offer very advanced solutions for neural search and are great.

Dev Team 💾

The Cherche dev team is made up of Raphaël Sourty and François-Paul Servant 🥳

Comments
  • Added spelling corrector object

    Hello! I added a spelling corrector base class as well as the original implementation of the Norvig spelling corrector. The spelling corrector can be fitted directly on the pipeline's documents with the '.add(documents)' method. I also provided an optional external dictionary (disabled by default), the one originally used by Norvig.

    I have no issue updating my code for improvements, so feel free to suggest any modification!

    opened by NicolasBizzozzero 4
  • 0.0.5

    Pull request for Cherche version 0.0.5

    • RAG: add RAG generator for open domain question answering
    • RapidFuzzy: New blazing fast retriever
    • Retrievers: Provide similarities for each retriever
    • Union & Intersection: Keep similarity scores
    opened by raphaelsty 1
  • Batch processing

    Retrieving documents with a batch of queries can significantly speed things up. It is now available for a few models in the development version via the batch method.

    Models involved are:

    • TfIdf retriever
    • Encoder retriever (milvus + faiss)
    • Encoder ranker (milvus)
    • DPR retriever (milvus + faiss)
    • DPR ranker (milvus)
    • Recommend retriever

    Batch is not yet compatible with pipelines.

    enhancement 
    opened by raphaelsty 0
  • Cherche 1.0.0

    Here is an essential update for Cherche. The update retains the previous API and is compatible with previous versions. 🥳

    Main additions:

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
    opened by raphaelsty 0
  • "IndexError: index out of range in self" while adding documents to a Cherche pipeline

    I'm using a Cherche pipeline built from a TfIdf retriever and a SentenceTransformer ranker, as follows: search = (retriever + ranker). While trying to add documents to the pipeline with search.add(documents=documents), I got this error:

    /usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
       2181         # remove once script supports set_grad_enabled
       2182         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
    -> 2183     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
       2184
       2185

    IndexError: index out of range in self

    opened by delmetni 0
  • incomplete doc about metrics

    opened by fpservant 0
Releases(1.0.1)
  • 1.0.1(Oct 27, 2022)

  • 1.0.0(Oct 26, 2022)

    What's Changed

    Here is an essential update for Cherche! 🥳

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • The path parameter is no longer used.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
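
The softmax normalization mentioned above can be sketched in a few lines. This is an illustration of the general technique only, assuming each model returns a list of dicts with a `similarity` field; `softmax` and `normalize` are hypothetical helpers, and the exact way Cherche applies the normalization inside its pipelines is not shown here.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalize(results):
    """Replace each raw similarity with its softmax-normalized value."""
    probs = softmax([r["similarity"] for r in results])
    return [{**r, "similarity": p} for r, p in zip(results, probs)]


# Two models scoring on very different scales (made-up numbers).
lexical = normalize([{"id": 1, "similarity": 12.0}, {"id": 2, "similarity": 10.0}])
neural = normalize([{"id": 2, "similarity": 0.7}, {"id": 3, "similarity": 0.6}])
```

After normalization both lists sum to 1, so scores from the two models live on a common scale, although they remain only loosely comparable, as the scare quotes around "compare" suggest.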

    Cherche is now fully compatible with large-scale corpora and deeply integrates collaborative filtering. The update retains the previous API and is compatible with previous versions.

  • 0.1.0(Jun 16, 2022)

    Added compatibility with the ONNX environment and quantization to significantly speed up sentence transformers and question answering models. 🏎

    It is now possible to choose the type of index for the Encoder and DPR retrievers in order to process the largest corpora while using the GPU.

  • 0.0.9(Apr 13, 2022)

  • 0.0.8(Mar 7, 2022)

  • 0.0.7(Mar 7, 2022)

  • 0.0.6(Mar 3, 2022)

    • Update documentation
    • Update retriever Encoder and DPR, path is optional
    • Add deployment documentation
    • Update similarity type
    • Avoid round similarity
  • 0.0.5(Feb 8, 2022)

    • Loading and Saving tutorial
    • Fuzzy retriever
    • Similarities everywhere (retrievers, union, intersection provide similarity scores)
    • RAG generation
  • 0.0.4(Jan 20, 2022)

    Update of the encoder retriever and the DPR retriever. Documents in the Faiss index will not be duplicated. Query embeddings can now be pre-computed for ranker Encoder and ranker DPR to speed up evaluation without having to compute it again.

  • 0.0.3(Jan 13, 2022)

  • 0.0.2(Jan 12, 2022)

    Update of the Cherche dependencies. The previous dependencies were too strict and restrictive as they were limited to a specific version for each package.

Owner
Raphael Sourty
PhD Student @ IRIT and Renault