A large-scale (194k), Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

Overview

MedMCQA

MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

A large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address realworld medical entrance exam questions.

The MedMCQA task can be formulated as X = {Q, O} where Q represents the questions in the text, O represents the candidate options, multiple candidate answers are given for each question O = {O1, O2, ..., On}. The goal is to select the single or multiple answers from the option set.

If you would like to use the data or code, please cite the paper:

@InProceedings{pmlr-v174-pal22a,
  title = 	 {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author =       {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  booktitle = 	 {Proceedings of the Conference on Health, Inference, and Learning},
  pages = 	 {248--260},
  year = 	 {2022},
  editor = 	 {Flores, Gerardo and Chen, George H and Pollard, Tom and Ho, Joyce C and Naumann, Tristan},
  volume = 	 {174},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {07--08 Apr},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v174/pal22a/pal22a.pdf},
  url = 	 {https://proceedings.mlr.press/v174/pal22a.html},
  abstract = 	 {This paper introduces MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which requires a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.}
}

GitHub license GitHub commit PRs Welcome

Dataset Description

Links
Homepage: https://medmcqa.github.io
Repository: https://github.com/medmcqa/medmcqa
Paper: https://arxiv.org/abs/2203.14371
Leaderboard: https://paperswithcode.com/dataset/medmcqa
Point of Contact: Aaditya Ura, Logesh

Dataset Summary

MedMCQA is a large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.

MedMCQA has more than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity.

Each sample contains a question, correct answer(s), and other options which require a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.

MedMCQA provides an open-source dataset for the Natural Language Processing community. It is expected that this dataset would facilitate future research toward achieving better QA systems. The dataset contains questions about the following topics:

  • Anesthesia
  • Anatomy
  • Biochemistry
  • Dental
  • ENT
  • Forensic Medicine (FM)
  • Obstetrics and Gynecology (O&G)
  • Medicine
  • Microbiology
  • Ophthalmology
  • Orthopedics
  • Pathology
  • Pediatrics
  • Pharmacology
  • Physiology
  • Psychiatry
  • Radiology
  • Skin
  • Preventive & Social Medicine (PSM)
  • Surgery

Requirements

pip3 install -r requirements.txt

Data Download and Preprocessing

download the data from below link

data : https://drive.google.com/uc?export=download&id=15VkJdq5eyWIkfb_aoD3oS8i4tScbHYky

Experiments code

To run the experiments mentioned in the paper, follow the below steps

  • Clone the repo
  • Install the dependencies

pip3 install -r requirements.txt

  • Download the data from google drive link
  • Unzip the data
  • run below command with the data path

python3 train.py --model bert-base-uncased --dataset_folder_name "/content/medmcqa_data/"

Supported Tasks and Leaderboards

multiple-choice-QA, open-domain-QA: The dataset can be used to train a model for multi-choice questions answering, open domain questions answering. Questions in these exams are challenging and generally require deeper domain and language understanding as it tests the 10+ reasoning abilities across a wide range of medical subjects & topics.

Languages

The questions and answers are available in English.

Dataset Structure

Data Instances

{
    "question":"A 40-year-old man presents with 5 days of productive cough and fever. Pseudomonas aeruginosa is isolated from a pulmonary abscess. CBC shows an acute effect characterized by marked leukocytosis (50,000 mL) and the differential count reveals a shift to left in granulocytes. Which of the following terms best describes these hematologic findings?",
    "exp": "Circulating levels of leukocytes and their precursors may occasionally reach very high levels (>50,000 WBC mL). These extreme elevations are sometimes called leukemoid reactions because they are similar to the white cell counts observed in leukemia, from which they must be distinguished. The leukocytosis occurs initially because of the accelerated release of granulocytes from the bone marrow (caused by cytokines, including TNF and IL-1) There is a rise in the number of both mature and immature neutrophils in the blood, referred to as a shift to the left. In contrast to bacterial infections, viral infections (including infectious mononucleosis) are characterized by lymphocytosis Parasitic infestations and certain allergic reactions cause eosinophilia, an increase in the number of circulating eosinophils. Leukopenia is defined as an absolute decrease in the circulating WBC count.",
    "cop":1,
    "opa":"Leukemoid reaction",
    "opb":"Leukopenia",
    "opc":"Myeloid metaplasia",
    "opd":"Neutrophilia",
    "subject_name":"Pathology",
    "topic_name":"Basic Concepts and Vascular changes of Acute Inflammation",
    "id":"4e1715fe-0bc3-494e-b6eb-2d4617245aef",
    "choice_type":"single"
}

Data Fields

  • id : a string question identifier for each example
  • question : question text (a string)
  • opa : Option A
  • opb : Option B
  • opc : Option C
  • opd : Option D
  • cop : Correct option (Answer of the question)
  • choice_type : Question is single-choice or multi-choice
  • exp : Expert's explanation of the answer
  • subject_name : Medical Subject name of the particular question
  • topic_name : Medical topic name from the particular subject

Data Splits

The goal of MedMCQA is to emulate the rigor of real word medical exams. To enable that, a predefined split of the dataset is provided. The split is by exams instead of the given questions. This also ensures the reusability and generalization ability of the models.

  • The training set of MedMCQA consists of all the collected mock & online test series.
  • The test set consists of all AIIMS PG exam MCQs (years 1991-present).
  • The development set consists of NEET PG exam MCQs (years 2001-present) to approximate real exam evaluation.

Similar questions from train , test and dev set were removed based on similarity. The final split sizes are as follow:

Train Valid Test
Question # 182,822 6,150 4,183
Vocab 94,231 11,218 10,800
Max Ques tokens 220 135 88
Max Ans tokens 38 21 25

Model Submission and Test Set Evaluation

To preserve the integrity of test results, we do not release the test set's ground-truth to the public. Instead, we require you to use the test-set to evaluate the model and send the predictions along with unique question id id in csv format to below email addresses

Example :

id Prediction (correct option)
84f328d3-fca4-422d-8fb2-19d55eb31503 2
bb85e248-b2e9-48e8-a887-67c1aff15b6d 3

aadityaura [at] gmail.com, logesh.umapathi [at] saama.com

Owner
MedMCQA
MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering
MedMCQA
基于百度的语音识别,用python实现,pyaudio+pyqt

Speech-recognition 基于百度的语音识别,python3.8(conda)+pyaudio+pyqt+baidu-aip 百度有面向python

J-L 1 Jan 03, 2022
Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment Analysis with Affective Knowledge. Proceedings of EMNLP 2021

AAGCN-ACSA EMNLP 2021 Introduction This repository was used in our paper: Beta Distribution Guided Aspect-aware Graph for Aspect Category Sentiment An

Akuchi 36 Dec 18, 2022
NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles

NewsMTSC: (Multi-)Target-dependent Sentiment Classification in News Articles NewsMTSC is a dataset for target-dependent sentiment classification (TSC)

Felix Hamborg 79 Dec 30, 2022
ACL'22: Structured Pruning Learns Compact and Accurate Models

☕ CoFiPruning: Structured Pruning Learns Compact and Accurate Models This repository contains the code and pruned models for our ACL'22 paper Structur

Princeton Natural Language Processing 130 Jan 04, 2023
Backend for the Autocomplete platform. An AI assisted coding platform.

Introduction A custom predictor allows you to deploy your own prediction implementation, useful when the existing serving implementations don't fit yo

Tatenda Christopher Chinyamakobvu 1 Jan 31, 2022
2021搜狐校园文本匹配算法大赛baseline

sohu2021-baseline 2021搜狐校园文本匹配算法大赛baseline 简介 分享了一个搜狐文本匹配的baseline,主要是通过条件LayerNorm来增加模型的多样性,以实现同一模型处理不同类型的数据、形成不同输出的目的。 线下验证集F1约0.74,线上测试集F1约0.73。

苏剑林(Jianlin Su) 45 Sep 06, 2022
code for "AttentiveNAS Improving Neural Architecture Search via Attentive Sampling"

AttentiveNAS: Improving Neural Architecture Search via Attentive Sampling This repository contains PyTorch evaluation code, training code and pretrain

Facebook Research 94 Oct 26, 2022
This project aims to conduct a text information retrieval and text mining on medical research publication regarding Covid19 - treatments and vaccinations.

Project: Text Analysis - This project aims to conduct a text information retrieval and text mining on medical research publication regarding Covid19 -

1 Mar 14, 2022
Turkish Stop Words Türkçe Dolgu Sözcükleri

trstop Turkish Stop Words Türkçe Dolgu Sözcükleri In this repository I put Turkish stop words that is contained in the first 10 thousand words with th

Ahmet Aksoy 103 Nov 12, 2022
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding

⚠️ Checkout develop branch to see what is coming in pyannote.audio 2.0: a much smaller and cleaner codebase Python-first API (the good old pyannote-au

pyannote 2.2k Jan 09, 2023
Tools for curating biomedical training data for large-scale language modeling

Tools for curating biomedical training data for large-scale language modeling

BigScience Workshop 242 Dec 25, 2022
端到端的长本文摘要模型(法研杯2020司法摘要赛道)

端到端的长文本摘要模型(法研杯2020司法摘要赛道)

苏剑林(Jianlin Su) 334 Jan 08, 2023
End-to-End Speech Processing Toolkit

ESPnet: end-to-end speech processing toolkit system/pytorch ver. 1.0.1 1.1.0 1.2.0 1.3.1 1.4.0 1.5.1 1.6.0 1.7.1 1.8.1 ubuntu18/python3.8/pip ubuntu18

ESPnet 5.9k Jan 03, 2023
Kurumi ChatBot

KurumiChatBot Just another Telegram AI chat bot written in Python using Pyrogram. A public running instance can be found on telegram as @TokisakiChatB

Yoga Pranata 3 Jun 28, 2022
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Udit Arora 19 Oct 28, 2022
ByT5: Towards a token-free future with pre-trained byte-to-byte models

ByT5: Towards a token-free future with pre-trained byte-to-byte models ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword

Google Research 409 Jan 06, 2023
To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

Ragesh Hajela 0 Feb 08, 2022
Natural Language Processing library built with AllenNLP 🌲🌱

Custom Natural Language Processing with big and small models 🌲🌱

Recognai 65 Sep 13, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023