EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

Overview

BioLAMA

BioLAMA

BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CTD, UMLS, and Wikidata. Please see our paper Can Language Models be Biomedical Knowledge Bases? (Sung et al., 2021) for more details.

* The dataset for the BioLAMA probe is available at data.tar.gz

Getting Started

After the installation, you can easily try BioLAMA with manual prompts. When a subject is "flu" and you want to probe its symptoms from an LM, the input should be like "Flu has symptom such as [Y]."

# Set MODEL to bert-base-cased for BERT or dmis-lab/biobert-base-cased-v1.2 for BioBERT
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
python ./BioLAMA/cli_demo.py \
    --model_name_or_path ${MODEL}

Result:

Please enter input (e.g., Flu has symptoms such as [Y].):
hepatocellular carcinoma has symptoms such as [Y].
-------------------------
Rank    Prob    Pred
-------------------------
1       0.648   jaundice
2       0.223   abdominal pain
3       0.127   jaundice and ascites
4       0.11    ascites
5       0.086   hepatomegaly
6       0.074   obstructive jaundice
7       0.06    abdominal pain and jaundice
8       0.059   ascites and jaundice
9       0.043   anorexia and jaundice
10      0.042   fever and jaundice
-------------------------
Top1 prediction sentence:
"hepatocellular carcinoma has symptoms such as jaundice."

Quick Link

Installation

# Install torch with conda (please check your CUDA version)
conda create -n BioLAMA python=3.7
conda activate BioLAMA
conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch

# Install BioLAMA
git clone https://github.com/dmis-lab/BioLAMA.git
cd BioLAMA
pip install -r requirements.txt

Resources

Models

For BERT and BioBERT, we use checkpoints provided in the Huggingface Hub:

Bio-LM is not provided in the Huggingface Hub. Therefore, we use the Bio-LM checkpoint released in link. Among the various versions of Bio-LMs, we use `RoBERTa-base-PM-Voc-hf'.

wget https://dl.fbaipublicfiles.com/biolm/RoBERTa-base-PM-Voc-hf.tar.gz
tar -xzvf RoBERTa-base-PM-Voc-hf.tar.gz 
rm -rf RoBERTa-base-PM-Voc-hf.tar.gz

Datasets

The dataset will take about 78 MB of space. Download data.tar.gz and uncompress it.

tar -xzvf data.tar.gz
rm -rf data.tar.gz

The directory tree of the data is like:

data
├── ctd
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── CD1
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
├── wikidata
│   ├── entities
│   ├── meta
│   ├── prompts
│   └── triples_processed
│       └── P2175
│           ├── dev.jsonl
│           ├── test.jsonl
│           └── train.jsonl
└── umls
    ├── meta
    └── prompts

Important: Triples of UMLS is not provided due to the license. For those who want to probe LMs using triples of UMLS, we provide the pre-processing scripts for UMLS. Please follow this instruction.

Experiments

We provide two ways of probing PLMs with BioLAMA:

Manual Prompt

Manual Prompt probes PLMs using pre-defined manual prompts. The predictions and scores will be logged in '/output'.

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl

python ./BioLAMA/run_manual.py \
    --model_name_or_path ${MODEL} \
    --prompt_path ${PROMPT_PATH} \
    --test_path "${TEST_PATH}" \
    --init_method confidence \
    --iter_method none \
    --num_mask 10 \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --output_dir ./output/${TASK}_manual

Result:

PID     [email protected]   [email protected]
-------------------------
P2175   9.40    21.11
P2176   22.46   39.75
P2293   2.24    11.43
P4044   9.47    19.47
P780    16.30   37.85
-------------------------
MACRO   11.97   25.92

OptiPrompt

OptiPrompt probes PLMs using embedding-based prompts starting from embeddings of manual prompts. The predictions and scores will be logged in '/output'.

# Set TASK to 'ctd' for CTD or 'umls' for UMLS
# Set MODEL to 'bert-base-cased' for BERT or 'dmis-lab/biobert-base-cased-v1.2' for BioBERT
TASK=wikidata
MODEL=./RoBERTa-base-PM-Voc/RoBERTa-base-PM-Voc-hf
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl
TRAIN_PATH=./data/${TASK}/triples_processed/*/train.jsonl
DEV_PATH=./data/${TASK}/triples_processed/*/dev.jsonl
TEST_PATH=./data/${TASK}/triples_processed/*/test.jsonl
PROMPT_PATH=./data/${TASK}/prompts/manual.jsonl

python ./BioLAMA/run_optiprompt.py \
    --model_name_or_path ${MODEL} \
    --train_path "${TRAIN_PATH}" \
    --dev_path "${DEV_PATH}" \
    --test_path "${TEST_PATH}" \
    --prompt_path ${PROMPT_PATH} \
    --num_mask 10 \
    --init_method confidence \
    --iter_method none \
    --max_iter 10 \
    --beam_size 5 \
    --batch_size 16 \
    --lr 3e-3 \
    --epochs 10 \
    --seed 0 \
    --prompt_token_len 5 \
    --init_manual_template \
    --output_dir ./output/${TASK}_optiprompt

Result:

PID     [email protected]   [email protected]
-------------------------
P2175   9.47    24.94
P2176   20.14   39.57
P2293   2.90    9.21
P4044   7.53    18.58
P780    12.98   33.43
-------------------------
MACRO   7.28    18.51

Acknowledgement

Parts of the code are modified from genewikiworld, X-FACTR, and OptiPrompt. We appreciate the authors for making their projects open-sourced.

Citations

@inproceedings{sung2021can,
    title={Can Language Models be Biomedical Knowledge Bases},
    author={Sung, Mujeen and Lee, Jinhyuk and Yi, Sean and Jeon, Minji and Kim, Sungdong and Kang, Jaewoo},
    booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2021},
}
Owner
DMIS Laboratory - Korea University
Data Mining & Information Systems Laboratory @ Korea University
DMIS Laboratory - Korea University
Amazon Multilingual Counterfactual Dataset (AMCD)

Amazon Multilingual Counterfactual Dataset (AMCD)

35 Sep 20, 2022
Lumped-element impedance calculator and frequency-domain plotter.

fastZ: Lumped-Element Impedance Calculator fastZ is a small tool for calculating and visualizing electrical impedance in Python. Features include: Sup

Wesley Hileman 47 Nov 18, 2022
Train BPE with fastBPE, and load to Huggingface Tokenizer.

BPEer Train BPE with fastBPE, and load to Huggingface Tokenizer. Description The BPETrainer of Huggingface consumes a lot of memory when I am training

Lizhuo 1 Dec 23, 2021
Machine translation models released by the Gourmet project

Gourmet Models Overview The Gourmet project has released several machine translation models to translate low-resource languages. This repository conta

Edinburgh NLP 5 Dec 08, 2021
A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode

Bloxflip Smart Bet A program that uses real statistics to choose the best times to bet on BloxFlip's crash gamemode. https://bloxflip.com/crash. THIS

43 Jan 05, 2023
ChatBotProyect - This is an unfinished project about a simple chatbot.

chatBotProyect This is an unfinished project about a simple chatbot. (union_todo.ipynb) Reminders for the project: Find why one of the vectorizers fai

Tomás 0 Jul 24, 2022
AI-powered literature discovery and review engine for medical/scientific papers

AI-powered literature discovery and review engine for medical/scientific papers paperai is an AI-powered literature discovery and review engine for me

NeuML 819 Dec 30, 2022
pyMorfologik MorfologikpyMorfologik - Python binding for Morfologik.

Python binding for Morfologik Morfologik is Polish morphological analyzer. For more information see http://github.com/morfologik/morfologik-stemming/

Damian Mirecki 18 Dec 29, 2021
Official code for Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Official code for our Interspeech 2021 - Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset [1]*. Visually-grounded spoken language datasets c

Ian Palmer 3 Jan 26, 2022
초성 해석기 based on ko-BART

초성 해석기 개요 한국어 초성만으로 이루어진 문장을 입력하면, 완성된 문장을 예측하는 초성 해석기입니다. 초성: ㄴㄴ ㄴㄹ ㅈㅇㅎ 예측 문장: 나는 너를 좋아해 모델 모델은 SKT-AI에서 공개한 Ko-BART를 이용합니다. 데이터 문장 단위로 이루어진 아무 코퍼스나

Dawoon Jung 29 Oct 28, 2022
CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

fastNLP 342 Jan 05, 2023
Finding Label and Model Errors in Perception Data With Learned Observation Assertions

Finding Label and Model Errors in Perception Data With Learned Observation Assertions This is the project page for Finding Label and Model Errors in P

Stanford Future Data Systems 17 Oct 14, 2022
Code Generation using a large neural network called GPT-J

CodeGenX is a Code Generation system powered by Artificial Intelligence! It is delivered to you in the form of a Visual Studio Code Extension and is Free and Open-source!

DeepGenX 389 Dec 31, 2022
Grover is a model for Neural Fake News -- both generation and detectio

Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.

Rowan Zellers 856 Dec 24, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
Code and data accompanying Natural Language Processing with PyTorch

Natural Language Processing with PyTorch Build Intelligent Language Applications Using Deep Learning By Delip Rao and Brian McMahan Welcome. This is a

Joostware 1.8k Jan 01, 2023
Unsupervised intent recognition

INTENT author: steeve LAQUITAINE description: deployment pattern: currently batch only Setup & run git clone https://github.com/slq0/intent.git bash

sl 1 Apr 08, 2022
Crowd sourced training data for Rasa NLU models

NLU Training Data Crowd-sourced training data for the development and testing of Rasa NLU models. If you're interested in grabbing some data feel free

Rasa 169 Dec 26, 2022
Pytorch NLP library based on FastAI

Quick NLP Quick NLP is a deep learning nlp library inspired by the fast.ai library It follows the same api as fastai and extends it allowing for quick

Agis pof 283 Nov 21, 2022
Host your own GPT-3 Discord bot

GPT3 Discord Bot Host your own GPT-3 Discord bot i'd host and make the bot invitable myself, however GPT3 terms of service prohibit public use of GPT3

[something hillarious here] 8 Jan 07, 2023