In this assignment you will experiment with implementing a deep-learning model for a real language understanding task, using existing libraries and pre-trained models.
You will implement a deep learning model, evaluate it, attempt to achieve the best possible score, and write a report on the process of doing so.
You will implement an ``Extractive Question Answering'' model, similar to models trained on the SQuAD2.0 dataset. SQuAD2.0 is a very famous dataset for English Extractive Question Answering (also called Machine Reading Comprehension). The task is defined as follows: you are given a paragraph and a question, and the model needs to mark a span from the paragraph that answers the question, or indicate that no such span is available (if the paragraph does not answer the question). The dataset page for SQuAD2.0 is https://rajpurkar.github.io/SQuAD-explorer/. It describes the dataset and the task, lets you download the data, and links to the paper describing it.
Besides the dataset, there is also a leaderboard, listing various works on that dataset, and the scores they obtained. You can look at these for inspiration.
Your job, however, will not be to train and test a model on SQuAD2.0, but rather on HeQ, a Hebrew question answering dataset with a similar structure.
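For concreteness, a single record in the SQuAD2.0 format looks roughly like the sketch below (shown here as a Python dict; the field names follow SQuAD2.0 and the toy content is invented, so check the actual HeQ files for their exact layout):

    # A toy SQuAD2.0-style record (invented content, for illustration only).
    # Each paragraph carries a context string plus a list of questions;
    # answerable questions mark their span by its text and character offset,
    # unanswerable ones leave "answers" empty and set "is_impossible".
    example = {
        "context": "SQuAD2.0 was released in 2018 by researchers at Stanford University.",
        "qas": [
            {
                "id": "q1",
                "question": "When was SQuAD2.0 released?",
                "answers": [{"text": "2018", "answer_start": 25}],
                "is_impossible": False,
            },
            {
                "id": "q2",
                "question": "How many examples does the dataset contain?",
                "answers": [],
                "is_impossible": True,
            },
        ],
    }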
Working on HeQ means that while you can rely on the SQuAD2.0 works for inspiration, you will need to adapt them somewhat to the Hebrew data. Some models take longer to train than others. Some require a GPU to train in a reasonable time. Some have many moving parts, some have fewer. Take this into account when choosing what to implement.
You need to produce the best model you can for this task. As it is a new dataset, we don't yet know what is achievable in terms of score. So do your best, and you may even be the best in the world :) You should be at least as good as the baseline results reported in Table 5 of the HeQ paper above. This should be easy. (You will lose points if you are not at least as good as the baseline in the paper.)
The HuggingFace Transformers library will be a useful building block: https://huggingface.co/docs
You may use a pre-trained model such as AlephBERT, mBERT, mT5, or any other pre-trained model you find.
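For example, a minimal sketch of loading a Hebrew checkpoint with a span-prediction head and extracting an answer for one question/paragraph pair might look like the following; onlplab/alephbert-base is just one possible checkpoint name, and the QA head on top of it is randomly initialized until you fine-tune it on HeQ:

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # One Hebrew checkpoint from the Hub; any other pre-trained encoder can be
    # substituted. The span-prediction head is untrained until fine-tuned on HeQ.
    model_name = "onlplab/alephbert-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    question, context = "Example question?", "Example paragraph text."
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # The model scores every token as a possible start/end of the answer span;
    # the predicted answer is the text between the highest-scoring positions.
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
    print(answer)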
You may fine-tune the entire model, or use a parameter-efficient tuning method like BitFit or LoRA (these might be faster to train, and will likely work well).
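If you go the parameter-efficient route, the peft library implements LoRA; a rough sketch is below (the target module names depend on the underlying architecture, and the hyperparameters are common defaults rather than tuned values):

    from transformers import AutoModelForQuestionAnswering
    from peft import LoraConfig, TaskType, get_peft_model

    model = AutoModelForQuestionAnswering.from_pretrained("onlplab/alephbert-base")

    # LoRA adds small trainable low-rank matrices to selected weight matrices
    # and freezes everything else, so far fewer parameters are updated.
    lora_config = LoraConfig(
        task_type=TaskType.QUESTION_ANS,
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["query", "value"],  # BERT-style names; verify for your model
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # sanity check: only a small fraction is trainable

The wrapped model can then be trained the same way as a fully fine-tuned one, for example with the regular Transformers Trainer.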
Compute your F1 and TLNLS scores with this evaluation script: eval.tgz
Run it as: python eval_qa.py gold_file.json prediction_file.json
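The exact prediction-file format is defined by eval_qa.py, so check the script; if it follows the official SQuAD2.0 convention (an assumption here), the prediction file is a single JSON object mapping each question id to the predicted answer string, with an empty string meaning "no answer", e.g.:

    import json

    # Assumed SQuAD-style prediction format: {question_id: answer_string},
    # with an empty string for questions the model judges unanswerable.
    # Verify against eval_qa.py before submitting.
    predictions = {"q1": "2018", "q2": ""}

    with open("prediction_file.json", "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False)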
Submit using the submission system. All the text/PDF files should include your ID and username in a clear location.
You need to submit all your code, a plain-text (ASCII) README
file explaining how to run your code in order to replicate your best result,
and a PDF file called report.pdf
describing your implementation and results.
Your grade will be based on a combination of your achieved score and your report, so make sure the report is presentable and contains enough detail to be graded properly.
At a minimum, it should describe: