This repository contains all the source code needed for the project: An Efficient Pipeline for Bloom’s Taxonomy Using Natural Language Processing and Deep Learning.

Overview

Pipeline For NLP with Bloom's Taxonomy Using Improved Question Classification and Question Generation using Deep Learning

The full title of the project is: An Efficient Pipeline for Bloom’s Taxonomy with Question Generation Using Natural Language Processing and Deep Learning.

Outline :

Examination assessment is an essential process for educational institutions, since it is one of the fundamental ways to determine a student’s progress and achievement in a given subject or course. To meet learning objectives, the questions must cover the topics that students are expected to master. Generating examination questions from a large body of text material is difficult: given the length of the available textbooks, manually writing good-quality, well-balanced questions is slow and time-consuming for faculty. As a result, faculty rely on the cognitive domain of Bloom’s taxonomy, a popular framework for assessing students’ intellectual abilities. The primary goal of this research paper is therefore to demonstrate an effective pipeline that uses deep learning to generate questions from a given text corpus. We also employ several neural network architectures to classify questions into the levels of the cognitive domain of Bloom’s taxonomy, in order to judge the complexity and specificity of those questions. The findings of this study show that the proposed pipeline is effective at generating questions that are similar to manually annotated questions, and at classifying questions from multiple domains according to Bloom’s taxonomy.

Main Proposed Pipeline Layout :

Used Datasets

  • SQuAD 2.0 Dataset - Used in the Question Generation Module. Released in 2018, it has over 150,000 questions.

  • Dataset introduced by Yahya et al. (2012) - Used in the Question Classification Module. It consists of around 600 open-ended questions covering the different levels of the cognitive domain. The original dataset required some basic pre-processing and was then manually converted into a dataframe. See the main paper cited here.

  • Quora Question Pairs Dataset - Used in the case study that computes the semantic similarity between questions generated by the T5 Transformer and manually annotated questions from a survey form.

Question Generation Module:

The dataset used for question generation is SQuAD 2.0 (the Stanford Question Answering Dataset). SQuAD 2.0 is an extension of the original SQuAD v1.1, which was published in 2016 by Stanford University.
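
To make the dataset's shape concrete, here is a minimal sketch of how SQuAD-style JSON can be flattened into (context, answer, question) triples for question generation. The nested keys (data, paragraphs, qas, answers, is_impossible) follow the published SQuAD 2.0 file layout. The function name, the decision to skip unanswerable questions, and the tiny sample record are assumptions made for illustration, not taken from this repository.

```python
from typing import Iterator, Tuple


def iter_qg_examples(squad: dict) -> Iterator[Tuple[str, str, str]]:
    """Yield (context, answer_text, question) triples from SQuAD-style JSON."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # SQuAD 2.0 marks unanswerable questions with is_impossible=True.
                # They carry no answer span, so this sketch skips them (assumption).
                if qa.get("is_impossible", False) or not qa["answers"]:
                    continue
                # Use the first annotated answer as the answer context.
                yield context, qa["answers"][0]["text"], qa["question"]


# Made-up sample in the SQuAD 2.0 layout, for illustration only.
sample = {
    "data": [{
        "title": "Proteins",
        "paragraphs": [{
            "context": "The word protein is derived from Greek word, proteios.",
            "qas": [
                {"question": "Where does the word protein come from?",
                 "answers": [{"text": "Greek word", "answer_start": 33}],
                 "is_impossible": False},
                {"question": "Who discovered proteins?",
                 "answers": [], "is_impossible": True},
            ],
        }],
    }]
}

examples = list(iter_qg_examples(sample))
for context, answer, question in examples:
    print(answer, "->", question)
```

Only the answerable question survives the filter, which suits a generator that is always given an answer context.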

In this paper, we fine-tune a T5 Transformer on the SQuAD 2.0 dataset using PyTorch Lightning. T5 is an encoder-decoder model that casts every NLP problem into a text-to-text format.
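
Because T5 treats every task as text in, text out, a question-generation example has to be serialised into a single source string and a target string. The sketch below shows one common way to do that. The "answer: ... context: ..." template and the helper names are hypothetical, since this README does not state the exact prompt format used during fine-tuning.

```python
def build_qg_input(answer: str, context: str) -> str:
    """Serialise an answer and its passage into one text-to-text source string.

    The template is an assumption for illustration, not the project's format.
    """
    # Collapse runs of whitespace so line breaks in the passage do not leak in.
    clean_context = " ".join(context.split())
    return f"answer: {answer.strip()} context: {clean_context}"


def build_qg_pair(answer: str, context: str, question: str) -> dict:
    """Return the (source, target) pair a seq2seq model would be trained on."""
    return {"source": build_qg_input(answer, context), "target": question.strip()}


pair = build_qg_pair(
    answer="Greek Word",
    context="The word protein is derived from Greek word,\n“proteios”.",
    question="What does the word protein come from?",
)
print(pair["source"])
print(pair["target"])
```

At inference time only the source string is built, and the model's decoded output is read as the generated question.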

Table 1

Passage | Answer Context
The term health is very frequently used by everybody. How do we define it? Health does not simply mean "absence of disease" or "physical fitness". It could be defined as a state of complete physical, mental and social well-being. When people are healthy, they are more efficient at work. This increases productivity and brings economic prosperity. Health also increases longevity of people and reduces infant and maternal mortality. When the functioning of one or more organs or systems of the body is adversely affected, characterized by appearance of various signs and symptoms, we say that we are not healthy, i.e., we have a disease. Diseases can be broadly grouped into infectious and non-infectious. Diseases which are easily transmitted from one person to another, are called infectious diseases. | Easily transmitted from one person to another
Proteins are the most abundant biomolecules of the living system. Chief sources of proteins are milk, cheese, pulses, peanuts, fish, meat, etc. They occur in every part of the body and form the fundamental basis of structure and functions of life. They are also required for growth and maintenance of the body. The word protein is derived from Greek word, “proteios” which means primary or of prime importance. | Greek Word

Table 1 shows the passages that we fed into the model and the answers for which we want questions to be generated. We took these passages from various high-school-level books.

Table 2

Answer Context | Easily transmitted from one person to another | Greek Word
Questions Generated | How are infectious diseases defined? | What does the word protein come from?
Questions Received | What do you mean by infectious disease? | What is "proteios"? From which language was it derived from?

As Table 2 shows, the Questions Generated row contains the questions our model generated for each answer context. The Questions Received row contains the questions we obtained by circulating a survey with the same passage and answer context.

Results

After training, we observed a steady decrease in training loss (Fig. 3). The validation loss fluctuated (Fig. 4). Because of limited computational resources, we could train for only a limited amount of time, hence the fluctuations in validation loss.

  • Training Loss = 0.070
  • Validation Loss = 2.39

Question Classification Module :

A deep learning-based model for multi-class classification that takes a question's text as input and assigns it to one of the categories (levels) in the cognitive domain of Bloom's taxonomy.
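
The last step of such a multi-class classifier can be sketched as follows: the network emits one score per category, a softmax turns the scores into probabilities, and the highest one gives the predicted level. The label list below uses the six levels of the original cognitive domain and is an assumption, as are the label order and the example scores. The README does not list the classes or their encoding.

```python
import math
from typing import List, Tuple

# Assumed label set and order: the six levels of the original cognitive domain.
BLOOM_LEVELS = [
    "Knowledge", "Comprehension", "Application",
    "Analysis", "Synthesis", "Evaluation",
]


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_level(logits: List[float]) -> Tuple[str, float]:
    """Return the most probable Bloom level and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return BLOOM_LEVELS[best], probs[best]


# Hypothetical output scores for one question.
level, confidence = predict_level([0.1, 0.2, 3.0, 0.0, -1.0, 0.5])
print(level, round(confidence, 3))
```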

Dataset Used : Yahya et al. (2012)

Model Pipeline :

Model Architecture :

Results :

Summarised Evaluation :

S.No | Model | Optimizer | Accuracy (%) | Loss | Dropout
1 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam | 80.83 | 0.6842 | -
2 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop | 80.00 | 1.50 | -
3 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam with ClipNorm=1.25 | 83.33 | 0.86 | -
4 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop with ClipNorm=1.25 | 79.17 | 2.10 | -
5 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam | 86.67 | 0.59 | Recurrent Dropout=0.1
6 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop | 78.83 | 2.54 | Recurrent Dropout=0.1
7 | ConvNet 1D + 2 Bidirectional LSTM Layers | Adam with ClipNorm=1.25 | 85.83 | 0.56 | Recurrent Dropout=0.1
8 | ConvNet 1D + 2 Bidirectional LSTM Layers | RMSprop with ClipNorm=1.25 | 75.83 | 0.76 | Recurrent Dropout=0.1
9 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 100-D | Adam with ClipNorm=1.25 | 73.33 | 1.28 | -
10 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 300-D | Adam with ClipNorm=1.25 | 75.83 | 0.88 | -
11 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 100-D | RMSprop with ClipNorm=1.25 | 73.33 | 2.31 | -
12 | ConvNet 1D + 2 Bidirectional LSTM Layers + GloVe 300-D | RMSprop with ClipNorm=1.25 | 80.00 | 1.12 | -
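
Several rows above use ClipNorm=1.25. As a rough illustration of what clipping by norm does, the sketch below rescales a gradient vector whenever its L2 norm exceeds the threshold and leaves it unchanged otherwise. It is a plain-Python stand-in written for this explanation. It assumes the setting refers to a Keras-style clipnorm optimizer argument, which applies this rule to each weight's gradient separately. The example gradient values are made up.

```python
import math
from typing import List


def clip_by_norm(grad: List[float], clip_norm: float) -> List[float]:
    """Rescale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0 or norm <= clip_norm:
        return list(grad)  # already within the limit, keep as is
    scale = clip_norm / norm  # shrink the length, keep the direction
    return [g * scale for g in grad]


# A gradient of norm 5.0 is scaled by 1.25 / 5.0 = 0.25.
clipped = clip_by_norm([3.0, 4.0], clip_norm=1.25)
print(clipped)

# A small gradient passes through unchanged.
print(clip_by_norm([0.3, 0.4], clip_norm=1.25))
```

Clipping bounds the size of each update without changing its direction, which is one way to limit exploding gradients in recurrent layers.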

The best performance was exhibited by the following network: a 1D ConvNet with 2 bidirectional LSTM layers, trained with the Adam optimizer and with recurrent dropout = 0.1 as a regulariser.

The following results were obtained :

  • Accuracy : 86.67 %
  • Loss : 0.59

Accuracy vs Loss Plot :

Siamese Neural Network for Computing Sentence Similarity – A Case Study :

After a thorough analysis of the questions generated by the proposed model, we carried out a case study to evaluate how semantically similar the generated questions are to manually annotated ones. For this evaluation, we used a pipeline based on Siamese neural networks. The study was intended to give insight into the effectiveness of the proposed pipeline: how well the model generates questions compared with manual annotation, which requires considerably more time and effort.
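
The core idea of a Siamese setup is that both sentences pass through the same encoder, and a similarity function then compares the two resulting vectors. The sketch below illustrates only that structure. The word-count encoder stands in for the trained network, and cosine similarity is an assumed scoring function, since this README does not state which encoder or distance the project uses. Its scores are not comparable to those reported by the project.

```python
import math
from collections import Counter


def encode(sentence: str) -> Counter:
    """Toy shared encoder: lower-cased word counts (stand-in for a trained network)."""
    return Counter(sentence.lower().replace("?", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity(question_a: str, question_b: str) -> float:
    """Encode both questions with the SAME encoder, then compare them."""
    return cosine(encode(question_a), encode(question_b))


score = similarity(
    "How are infectious diseases defined?",
    "Define infectious disease.",
)
print(round(score, 4))
```

Sharing one encoder between both branches keeps the score symmetric: swapping the two questions gives the same result.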

Model Architecture :

Generated Questions | Manually Annotated Questions | Context | Similarity Score
Why is health more efficient at work? | How does health affect efficiency at work? | Increases Productivity And Brings Economic Prosperity | 0.4464
What is the health of people more efficient at work? | What are the outcomes of being more efficient at work as a result of good health? | Increases Productivity And Brings Economic Prosperity | 0.4811
What is the term infectious disease? | What do you mean by infectious disease? | Easily Transmitted From One Person To Another | 0.3505
How are infectious diseases defined? | Define infectious disease. | Easily Transmitted From One Person To Another | 0.2489
According to classical electromagnetic theory, an accelerating charged particle does what ? | According to electromagnetic theory what happens when a charged particle accelerates ? | Emits Radiation In The Form Of Electromagnetic Waves | 0.2074
What does the theory of an accelerating charged particle imply ? | What does the classical electromagnetic theory state ? | Emits Radiation In The Form Of Electromagnetic Waves | 0.0474
What was the Harappans's strategy of sending expeditions to ? | What was the primary reason for settlements and expeditions as seen from Harappans's ? | Strategy For Procuring Raw Materials | 0.4222
What was the idea behind sending expeditions to Rajasthan ? | Why did the Harappans's send expeditions to areas in Rajasthan ? | Strategy For Procuring Raw Materials | 0.6870
What was a feature of the Ganeshwar culture ? | What was the distinctive feature of the Ganeshwar culture ? | Non-Harappan Pottery | 0.6439
What type of artefacts are from the Ganeshwar culture ? | What kind of artefacts are from Ganeshwar culture ? | Non-Harappan Pottery | 0.4309
Proteins form the basis of what? | What is the significance of proteins ? | Function Of Life | 0.1907
What are proteins the fundamental basis of ? | What does protein form along with fundamental basis of structure ? | Function Of Life | 0.1775

The table above is a sample from the set of recorded observations evaluated by our network. It shows the similarity scores between the questions generated by the transformer and the manually annotated questions from the survey. In this sample the scores range from 0.0474 to 0.6870.

Accuracy vs Loss Plot :

Owner
Rohan Mathur