Summary
Question answering using BERT (Bidirectional Encoder Representations from Transformers) has significantly advanced natural language understanding and information retrieval. As a transformer model, BERT captures contextual relationships in text through its bidirectional attention mechanism. In QA tasks, BERT analyzes both a question and the associated context, creating an embedding for each word that takes the complete surrounding context into account. The model then identifies the relevant answer span within that context, allowing it to comprehend and extract pertinent information. When fine-tuned for question answering, BERT shows strong performance across various datasets and is widely used in applications such as search engines, virtual assistants, and information retrieval systems, improving the accuracy and efficiency of extracting precise answers from text. In this notebook, the BERT model is fine-tuned on the SQuAD 2.0 question-answering dataset obtained from Kaggle.
Python functions and data files needed to run this notebook are available via this link.
import warnings
warnings.filterwarnings('ignore')
from transformers import BertTokenizerFast, BertForQuestionAnswering, pipeline, \
DataCollatorWithPadding, TrainingArguments, Trainer, \
AutoModelForQuestionAnswering, AutoTokenizer
from datasets import Dataset
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
There are two types of question answering: extractive and abstractive.

Extractive Answering | Abstractive Answering |
---|---|
The answer to a question is a direct substring of the given context | The answer to a question is a free-form phrase based on the context |
An encoder-only model such as BERT is sufficient | A decoder is required, e.g. GPT, T5 |
We can give BERT a question and some context, and it can extract a span of that context that answers the question. This is extractive answering. See the figure below:
# We use the large uncased BERT: labeled question-answering examples are limited,
# so we start from a model pretrained on a large corpus
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)
qa_bert = BertForQuestionAnswering.from_pretrained('bert-large-uncased')
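To make extractive answering concrete, below is a minimal sketch of how a span is recovered from the model's start/end logits. The helper extract_span is purely illustrative; note that the QA head of a freshly loaded bert-large-uncased is randomly initialized, so its spans are only meaningful after fine-tuning.

import torch

def extract_span(question, context, tokenizer=bert_tokenizer, model=qa_bert):
    # Encode question and context as one sequence: [CLS] question [SEP] context [SEP]
    inputs = tokenizer(question, context, return_tensors='pt', truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # The model scores every token as a candidate start and end of the answer span
    start = torch.argmax(outputs.start_logits)
    end = torch.argmax(outputs.end_logits)
    # Decode the tokens between the most likely start and end positions
    return tokenizer.decode(inputs['input_ids'][0][start:end + 1])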
The SQuAD 2.0 question-answering data set was downloaded from Kaggle. In this data set, the answer to each entry in the question column lies within the context column; the text column holds the answer itself, and answer_start gives the character index at which the answer begins within context.
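As a quick sanity check of this schema (a hypothetical snippet, assuming the Kaggle CSV uses the column names described above), the answer text should appear in the context at the character offset given by answer_start:

row = pd.read_csv('train-squad.csv', nrows=1).iloc[0]
start = int(row['answer_start'])
# 'text' should be a substring of 'context' beginning at 'answer_start'
assert row['context'][start:start + len(row['text'])] == row['text']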
pd.set_option('display.max_colwidth', None)
# load training data set
df_qa = pd.read_csv('train-squad.csv')
df_qa.rename({'text': 'answer'}, axis=1,inplace=True)
df_qa = df_qa[['context','question','answer']]
print(df_qa.shape)
df_qa[:3]
def find_idx(big_index, small_index):
    """
    Find the indices covered by the sequence 'small_index' within 'big_index'.

    Parameters:
    - big_index (list): The larger sequence of token ids.
    - small_index (list): The smaller sequence of token ids to locate within 'big_index'.

    Returns:
    - list: The indices in 'big_index' spanned by the first occurrence of
      'small_index', or None if the sequence is not found.
    """
    # Slide over 'big_index', comparing a window of len(small_index) tokens at a time;
    # the range bound guarantees the window never runs past the end of 'big_index'
    for i in range(len(big_index) - len(small_index) + 1):
        if big_index[i:i + len(small_index)] == small_index:
            return list(range(i, i + len(small_index)))
    return None
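A quick illustration of find_idx on made-up token ids:

print(find_idx([101, 2054, 2003, 102, 2023], [2003, 102]))  # [2, 3]
print(find_idx([101, 2054, 2003], [9999]))                  # None (not found)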
def file_add(x):
    """
    Tokenize the input question and context with the BERT tokenizer and find the
    token indices corresponding to the answer within the tokenized sequence.

    Parameters:
    - x (dict): Input row containing 'question', 'context', and 'answer' keys.

    Returns:
    - tuple: The starting and ending token indices of the answer within the
      tokenized sequence. If the answer is not found, returns (-1, -1).
    """
    # Tokenize the question and context together: [CLS] question [SEP] context [SEP]
    qst_contxt = bert_tokenizer.encode(x['question'], x['context'])
    try:
        # Tokenize the answer, dropping its [CLS] and [SEP] special tokens
        answr = bert_tokenizer.encode(x['answer'])[1:-1]
        # Locate the answer tokens within the tokenized question and context
        answr_idx = find_idx(qst_contxt, answr)
        try:
            # Use the first and last matched indices as the answer span
            # (for a single-token answer these are the same index)
            tkn_strt, tkn_end = answr_idx[0], answr_idx[-1]
        except TypeError:
            # find_idx returned None: the answer tokens were not found
            tkn_strt, tkn_end = -1, -1
        # Return the starting and ending token indices of the answer
        return tkn_strt, tkn_end
    except TypeError:
        # The answer could not be tokenized (e.g., it is missing/NaN)
        return -1, -1
tmp = df_qa.apply(lambda x: file_add(x), axis=1)
df_qa['start_positions'], df_qa['end_positions'] = [i[0] for i in tmp], [i[1] for i in tmp]
df_qa = df_qa[['question', 'context', 'start_positions', 'end_positions', 'answer']]
df_qa[:4]
df_qa.iloc[0]['context']
# tokens at indices 75-78 of the encoded (question, context) pair correspond to the answer;
# the offsets include the question tokens, since the question is encoded first
bert_tokenizer.decode(bert_tokenizer.encode(df_qa.iloc[0].question, df_qa.iloc[0].context)[75:79])
Dataset.from_pandas is a method provided by the Hugging Face datasets library (not PyTorch's torch.utils.data). It is used to create a Dataset object, which integrates smoothly with PyTorch, from a pandas DataFrame.
We only grab 8,000 examples because the fine-tuning process is very expensive.
qa_dataset = Dataset.from_pandas(df_qa.sample(8000, random_state=32))
# Dataset has a built-in train/test split method
qa_dataset = qa_dataset.train_test_split(test_size=0.2)
qa_dataset
# Preprocess with truncation to handle longer text
def preprocess(data):
    # anything past BERT's 512-token window is truncated
    return bert_tokenizer(data['question'], data['context'], truncation=True)
qa_dataset = qa_dataset.map(preprocess, batched=True)
# To speed up training, freeze the early encoder layers in BERT
for name, param in qa_bert.bert.named_parameters():
    # bert-large has 24 encoder layers; stop at layer 20 so the last
    # four layers (20-23) remain trainable
    if 'encoder.layer.20' in name:
        break
    param.requires_grad = False  # freeze this parameter
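We can verify the effect of freezing by counting trainable parameters (an illustrative check):

trainable = sum(p.numel() for p in qa_bert.parameters() if p.requires_grad)
total = sum(p.numel() for p in qa_bert.parameters())
print(f'trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.1f}%)')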
# Dynamic padding to speed up training
data_collator = DataCollatorWithPadding(tokenizer=bert_tokenizer)
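To see what dynamic padding buys us, here is a small hypothetical illustration: the collator pads each batch only to the length of its longest member, rather than always padding to the full 512-token window.

features = [bert_tokenizer('a short question', 'a short context'),
            bert_tokenizer('a somewhat longer question', 'paired with a somewhat longer context')]
batch = data_collator(features)
print(batch['input_ids'].shape)  # (2, length of the longest sequence in this batch)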
batch_size = 5
epochs = 2
training_args = TrainingArguments(
output_dir='./qsn_anw/results',
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
logging_dir='./qsn_anw/logs',
save_strategy='epoch',
logging_steps=10,
evaluation_strategy='epoch',
load_best_model_at_end=True
)
trainer = Trainer(
model=qa_bert, # pretrained BERT
args=training_args,
train_dataset=qa_dataset['train'],
eval_dataset=qa_dataset['test'],
data_collator=data_collator
)
# Get initial metrics
trainer.evaluate()
# The question-answering model is very large, so training takes a while
trainer.train()
trainer.save_model()
pipe = pipeline("question-answering", './qsn_anw/results', tokenizer=bert_tokenizer)
txt = """The brain is an organ that serves as the center of the nervous system in
all vertebrate and most invertebrate animals. Only a few invertebrates such as sponges,
jellyfish, adult sea squirts and starfish do not have a brain; diffuse or localised
nerve nets are present instead. The brain is located in the head, usually close to the
primary sensory organs for such senses as vision, hearing, balance, taste, and smell.
The brain is the most complex organ in a vertebrate's body. In a typical human, the
cerebral cortex (the largest part) is estimated to contain 15–33 billion neurons,
each connected by synapses to several thousand other neurons. These neurons communicate
with one another by means of long protoplasmic fibers called axons, which carry trains
of signal pulses called action potentials to distant parts of the brain or body targeting
specific recipient cells."""
pipe("How are neurons connected?", txt)
The answer to the question above is 'synapses'.
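For reference, the question-answering pipeline returns a dictionary containing the extracted span, its character offsets within the context, and a confidence score:

result = pipe("How are neurons connected?", txt)
# result looks like {'score': ..., 'start': ..., 'end': ..., 'answer': 'synapses'}
print(result['answer'], round(result['score'], 3))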
We can look someone up on Google as shown below:
PERSON = 'Mehdi Rezvandehy'
# Note this is NOT an efficient way to search Google; it is done purely for educational purposes
google_html = BeautifulSoup(requests.get(f'https://www.google.com/search?q={PERSON}').text, 'html.parser').get_text()[:512]
pipe(f'Who is {PERSON}?', google_html)
# From Huggingface: https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad
squad_pipe = pipeline("question-answering", "bert-large-uncased-whole-word-masking-finetuned-squad")
squad_pipe("Where is Mehdi living these days?", "Mehdi lives in Calgary but Hamid lives in Edmonton.")
Now the model answers correctly, with a confidence score of about 99%.
ir = 4000
print(f'question is: \n{df_qa.question.iloc[ir]}\n\n')
print(f'context is: \n{df_qa.context.iloc[ir]}\n\n')
print(f'real answer is: \n{df_qa.answer.iloc[ir]}\n\n')
print(f'Prediction with the SQuAD fine-tuned model: \n{squad_pipe(df_qa.question.iloc[ir], df_qa.context.iloc[ir])}\n\n')
txt = """The brain is an organ that serves as the center of the nervous system in
all vertebrate and most invertebrate animals. Only a few invertebrates such as sponges,
jellyfish, adult sea squirts and starfish do not have a brain; diffuse or localised
nerve nets are present instead. The brain is located in the head, usually close to the
primary sensory organs for such senses as vision, hearing, balance, taste, and smell.
The brain is the most complex organ in a vertebrate's body. In a typical human, the
cerebral cortex (the largest part) is estimated to contain 15–33 billion neurons,
each connected by synapses to several thousand other neurons. These neurons communicate
with one another by means of long protoplasmic fibers called axons, which carry trains
of signal pulses called action potentials to distant parts of the brain or body targeting
specific recipient cells."""
pipe("How are neurons connected?", txt)
ir = 4000
print(f'question is: \nHow are neurons connected?\n\n')
print(f'context is: \n{txt}\n\n')
print(f'real answer is: \nsynapses\n\n')
print(f'Prediction with the SQuAD fine-tuned model: \n{squad_pipe("How are neurons connected?", txt)}\n\n')