A Python framework for conversational search

Overview

Chatty Goose

Multi-stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting



Installation

  1. Make sure Java 11+ and Python 3.7+ are installed

  2. Install the chatty-goose PyPI module

pip install chatty-goose

  3. If you are using T5 or BERT, install PyTorch 1.4.0 - 1.7.1 following the instructions for your platform. PyTorch 1.8 is currently incompatible with the transformers version we use. Also install the corresponding torchtext version.

  4. Download the English model for spaCy

python -m spacy download en_core_web_sm
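
The PyTorch version constraint above can be checked before loading any models. The following is a minimal sketch, not part of chatty-goose: the helper names parse_version and is_supported_torch are hypothetical, and the bounds are simply the 1.4.0 - 1.7.1 range stated in the installation notes.

```python
import re

# Supported range taken from the installation notes (inclusive bounds).
MIN_TORCH = (1, 4, 0)
MAX_TORCH = (1, 7, 1)


def parse_version(version: str) -> tuple:
    """Turn a string such as '1.7.1+cu110' into (1, 7, 1), ignoring build suffixes."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_supported_torch(version: str) -> bool:
    """Return True if the version falls inside the documented range."""
    return MIN_TORCH <= parse_version(version) <= MAX_TORCH
```

In an environment where PyTorch is installed, the check could be called as is_supported_torch(torch.__version__).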

Quickstart Guide

The following example shows how to initialize a searcher and build a ConversationalQueryRewriter agent from scratch using HQE and T5 as first-stage retrievers, and a BERT reranker. To see a working example agent, see chatty_goose/agents/chat.py.

First, load a searcher

from pyserini.search import SimpleSearcher

# Option 1: load a prebuilt index
searcher = SimpleSearcher.from_prebuilt_index("INDEX_NAME_HERE")
# Option 2: load a local Lucene index instead
# searcher = SimpleSearcher("PATH_TO_INDEX")

# Set the BM25 parameters (k1=0.82, b=0.68)
searcher.set_bm25(0.82, 0.68)

Next, initialize one or more first-stage CQR retrievers

from chatty_goose.cqr import Hqe, Ntr
from chatty_goose.settings import HqeSettings, NtrSettings

hqe = Hqe(searcher, HqeSettings())
ntr = Ntr(NtrSettings())

Load a reranker

from chatty_goose.util import build_bert_reranker

reranker = build_bert_reranker()

Create a new RetrievalPipeline

from chatty_goose.pipeline import RetrievalPipeline

rp = RetrievalPipeline(searcher, [hqe, ntr], searcher_num_hits=50, reranker=reranker)

And we're done! Simply call rp.retrieve(query) to retrieve passages, or call rp.reset_history() to reset the conversational history of the retrievers.
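
As a rough illustration of how the two calls fit together over a multi-turn conversation, here is a sketch that resets the history once and then retrieves turn by turn. The run_conversation helper and the EchoPipeline class are hypothetical: EchoPipeline is only a stand-in with the same two method names, so the snippet runs without an index or models. In practice you would pass the rp object built above.

```python
from typing import Any, Dict, List


def run_conversation(pipeline: Any, queries: List[str]) -> Dict[str, List[Any]]:
    """Retrieve passages for each turn of one conversation, in order."""
    # Start from a clean slate so earlier conversations do not leak in.
    pipeline.reset_history()
    results = {}
    for query in queries:
        # Each call sees the history built up by the previous turns.
        results[query] = pipeline.retrieve(query)
    return results


class EchoPipeline:
    """Hypothetical stand-in for RetrievalPipeline, used only for illustration."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def retrieve(self, query: str) -> List[str]:
        self.history.append(query)
        # Pretend the "hits" are simply the turns seen so far.
        return list(self.history)

    def reset_history(self) -> None:
        self.history = []


hits_per_turn = run_conversation(EchoPipeline(), ["first question", "follow-up"])
```

Calling reset_history() between conversations matters because the rewriters keep state across turns.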

Running Experiments

  1. Clone the repo and all submodules (git submodule update --init --recursive)

  2. Clone and build Anserini for evaluation tools

  3. Install dependencies

pip install -r requirements.txt

  4. Follow the instructions under docs/cqr_experiments.md to run experiments using HQE, T5, or fusion.
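
For intuition about what a fusion run combines, here is a minimal sketch of reciprocal rank fusion (RRF), a common way to merge ranked lists from several retrievers. That the experiments use exactly this formula, and the constant k=60, are assumptions made for illustration; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document ids into one fused ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it ranks in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


hqe_hits = ["d1", "d2", "d3"]  # illustrative ids, not real results
t5_hits = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([hqe_hits, t5_hits])
```

Documents that appear near the top of both lists (here d2 and d1) end up ahead of documents found by only one retriever.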

Example Agent

To run an interactive conversational search agent with ParlAI, simply run chat.py. By default, we use the CAsT 2019 pre-built Pyserini index, but it is possible to specify other indexes using the --from_prebuilt flag. See the file for other possible arguments:

python -m chatty_goose.agents.chat

Alternatively, run the agent using ParlAI's command line interface:

python -m parlai interactive --model chatty_goose.agents.chat:ChattyGooseAgent

We also provide instructions to deploy the agent to Facebook Messenger using ParlAI under examples/messenger.

Comments
  • Add baselines for CAsT 2020

    Need someone to help add CAsT 2020 baseline results:

    • [ ] Naive: CQR without canonical responses
    • [ ] Canonical: CQR with canonical (manual) response

    CQR methods: HQE / Ntr (T5)

    enhancement help wanted 
    opened by justram 2
  • Running HQE and getting the reformulated queries

    Dear authors,

    I am trying to use your method in some of my work. For that, I need to get the reformulated queries (instead of only the generated ranked hits).

    I am trying to run the HQE experiment as indicated using:

    python -m experiments.run_retrieval \
          --experiment hqe \
          --hits 1000 \
          --sparse_index cast2019 \
          --qid_queries $input_query_json \
          --output ./output/hqe_bm25 
    

    However, when I print the arguments passed to the retrieval pipeline (L101 of retrieval_pipeline.py), the query is the raw last-turn query string and manual_context_buffer[turn_id] is simply None. If I'm not mistaken, that means running this experiment performs no reformulation at all. Can you confirm this?

    Digging further into the code, the queries I'd like to access seem to be inside cqr_queries, but the context still appears to be empty/None in that case, probably resulting in no reformulation being done at all.

    opened by littlewine 1
  • Query rewriting fix

    Thank you for the project.

    The fix for the line below is hits = rp.retrieve(query, manual_context_buffer[turn_id-1] if turn_id!=0 else None), which passes the previous turn's canonical response.

    https://github.com/castorini/chatty-goose/blob/f9c21c8b7b6194d11d7aec5b4e218174cde98418/experiments/run_retrieval.py#L100
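
The indexing in the proposed fix can be isolated in a small helper. This is only a sketch of the off-by-one logic described above: the function name previous_canonical_response is hypothetical, and the buffer is assumed to hold one canonical response (or None) per turn, indexed by turn_id starting at 0.

```python
from typing import List, Optional


def previous_canonical_response(
    manual_context_buffer: List[Optional[str]], turn_id: int
) -> Optional[str]:
    """Return the canonical response from the turn before turn_id, if any."""
    if turn_id == 0:
        # The first turn has no earlier response to condition on.
        return None
    return manual_context_buffer[turn_id - 1]


buffer = ["response to turn 0", "response to turn 1", "response to turn 2"]
context_for_turn_2 = previous_canonical_response(buffer, 2)
```

Using turn_id - 1 ensures the rewriter only sees a response that was available before the current query was asked.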

    opened by xeniaqian94 1
  • Update based on Pyserini==0.14.0 and fix canonical response bug

    Main changes:

    1. Change --dense_index from a temporary index to a Pyserini prebuilt index name
    2. Fix the canonical response bug, which previously added the current response to the context
    3. Rename the --index option to --sparse_index, since we now also have --dense_index
    opened by jacklin64 0
  • Add chatty goose support for dense retrieval and hybrid search for T5 and CQE

    New features added (only for T5 and CQE; HQE may be considered in the future): (1) dense retrieval, (2) dense-sparse hybrid retrieval.

    Some arguments may be confusing and may change in the future: (1) --index and --dense_index may change to --sparse_index and --dense_index; (2) --experiment now has the options (hqe, cqe, t5, fusion, cqe_t5_fusion), which may change to (hqe, cqe, t5, hqe_t5fusion, cqe_t5_fusion).

    opened by jacklin64 0
  • Add cast2020 baseline

    This PR adds both naive and canonical baselines for CAsT 2020 topics. The results are lower overall than for CAsT 2019, and the canonical run is only slightly better than the naive run on some metrics.

    Resolves #23

    opened by saileshnankani 0
  • Add support for canonical response

    This PR adds support for using manual_canonical_result_id in the CAsT2020 data for both ntr and hqe (for #23).

    For ntr, the rewrite uses the passage corresponding to the canonical document in the history. We only use one passage in the historical context, as otherwise the input exceeds the 512-token limit. For example, it uses q1/P1/q2, then q1/q2/P2/q3, and so on.
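
The history layout described above (all earlier queries, only the most recent canonical passage, then the current query) can be sketched as follows. The function name build_rewrite_input and the " ||| " separator are assumptions for illustration; the actual input format used by the ntr rewriter may differ.

```python
from typing import List

SEPARATOR = " ||| "  # assumed delimiter between history items


def build_rewrite_input(queries: List[str], passages: List[str], turn: int) -> str:
    """Build the rewriter input for a 0-based turn.

    passages[i] is the canonical passage returned for queries[i].
    """
    # Keep every earlier query, but no earlier passages.
    parts = list(queries[:turn])
    if turn > 0:
        # Only the passage from the immediately preceding turn is included.
        parts.append(passages[turn - 1])
    parts.append(queries[turn])
    return SEPARATOR.join(parts)


queries = ["q1", "q2", "q3"]
passages = ["P1", "P2"]
second_turn = build_rewrite_input(queries, passages, 1)
third_turn = build_rewrite_input(queries, passages, 2)
```

Dropping older passages keeps the input length bounded as the conversation grows, at the cost of losing earlier response context.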

    enhancement 
    opened by saileshnankani 0
  • CQR Replication

    Add CQR replication for Fusion BM25

    Library versions used: torch==1.7.0 torchvision==0.8.1 torchtext==0.8


    Results:

    map                   	all	0.2584
    recall_1000           	all	0.8028
    ndcg_cut_1            	all	0.3353
    ndcg_cut_3            	all	0.3247
    

    Details and reproduction results can be found in the notebook

    opened by saileshnankani 0
  • Rename classes and update messenger bot

    Breaking changes:

    Renamed several classes to follow Python conventions / be more consistent

    • chatty_goose.agents.cqragent -> chatty_goose.agents.chat
    • HQE -> Hqe
    • T5_NTR -> Ntr
    • HQESettings -> HqeSettings
    • T5Settings -> NtrSettings
    • CQRType -> CqrType
    • CQRSettings -> CqrSettings
    • CQR -> ConversationalQueryRewriter
    opened by edwinzhng 0
  • document spaCy model dependency

    With a fresh install, we get the following error if we try to run anything:

    OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
    

    Solution is:

    $ python -m spacy download en_core_web_sm
    

    We should document this.

    opened by lintool 0
  • PyTorch version: needs Torch 1.7 (won't work with 1.8)

    With a from-scratch installation, the module pulls in Torch 1.8, which causes this error:

    ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (/anaconda3/envs/chatty-goose-test/lib/python3.7/site-packages/torch/optim/lr_scheduler.py)
    

    Downgrading fixes the issue:

    $ pip install torch==1.7.1 torchtext==0.8.1
    

    Should we pin the version in our module dependencies? Or at the very least this needs to be documented.

    opened by lintool 0
  • dependency conflict

    Hi,

    When I install chatty-goose from github using:

    python -m pip install git+https://github.com/castorini/chatty-goose.git
    
    

    I ran into this issue:

    ERROR: Cannot install chatty-goose and chatty-goose==0.2.0 because these package versions have conflicting dependencies.
    
    The conflict is caused by:
        chatty-goose 0.2.0 depends on pyserini==0.14.0
        pygaggle 0.0.3.1 depends on pyserini==0.10.1.0
    

    It seems that chatty-goose requires pyserini==0.14.0 as well as pygaggle 0.0.3.1. However, pygaggle 0.0.3.1 and pyserini==0.14.0 do not play nicely with each other.

    Could someone provide some help?

    Thanks!

    opened by dayuyang1999 1
  • Expansion to new datasets

    Does it make sense to expand Chatty Goose to new datasets? For example:

    • MANtIS - a multi-domain information seeking dialogues dataset: https://guzpenha.github.io/MANtIS/
    • ClariQ - Search-oriented Conversational AI (SCAI) EMNLP https://github.com/aliannejadi/ClariQ
    enhancement 
    opened by lintool 1
  • Checkpoint transformation

    According to @edwinzhng's replication log, we have a reranker checkpoint mismatch issue. Currently, our reranking model differs from PyGaggle's default model.

    Related to this issue: I think we need a folder to store and track our tf2torch checkpoint conversion and sanity-check scripts.

    opened by justram 0
Releases(0.2.0)
  • 0.2.0(May 7, 2021)

    Breaking changes

    Renamed several classes to follow Python conventions / be more consistent

    • chatty_goose.agents.cqragent -> chatty_goose.agents.chat
    • HQE -> Hqe
    • T5_NTR -> Ntr
    • HQESettings -> HqeSettings
    • T5Settings -> NtrSettings
    • CQRType -> CqrType
    • CQRSettings -> CqrSettings
    • CQR -> ConversationalQueryRewriter

  • v0.1.0(Mar 8, 2021)

    BREAKING CHANGES

    • Integrate ParlAI Facebook Messenger example for a demo by @edwinzhng
    • Integrate Pyserini/Pygaggle for a reference implementation of multi-stage passage retrieval by @edwinzhng
    • Add replication log for TREC CAsT 2019 conversational passage retrieval task by @edwinzhng
Owner
Castorini
Deep learning for natural language processing and information retrieval at the University of Waterloo