In this assignment you will experiment with implementing a deep-learning model for a real language understanding task, using existing libraries and pre-trained models.
You will implement a deep learning model, evaluate it, attempt to achieve the best possible score, and write a report on the process of doing so.
You will implement an ``Extractive Question Answering'' model, similar to models trained on the SQuAD2.0 dataset. SQuAD2.0 is a very famous dataset for English Extractive Question Answering (also called Machine Reading Comprehension). The task is defined as follows: you are given a paragraph and a question, and the model needs to mark a span from the paragraph that answers the question, or indicate that no such span is available (if the paragraph does not answer the question). The dataset page for SQuAD2.0 is https://rajpurkar.github.io/SQuAD-explorer/. It describes the dataset and the task, lets you download the data, and links to the paper describing it.
Besides the dataset, there is also a leaderboard, listing various works on that dataset, and the scores they obtained. You can look at these for inspiration.
Your job, however, will not be to train and test a model on SQuAD2.0, but rather on HeQ, a Hebrew question answering dataset with a similar structure.
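For concreteness, a single record in the SQuAD2.0 format looks roughly like the sketch below (shown here as a Python dict; the field names follow SQuAD2.0 and the toy content is invented, so check the actual HeQ files for their exact layout):

    # A toy SQuAD2.0-style record (invented content, for illustration only).
    # Each paragraph carries a context string plus a list of questions;
    # answerable questions mark their span by its text and character offset,
    # unanswerable ones leave "answers" empty and set "is_impossible".
    example = {
        "context": "SQuAD2.0 was released in 2018 by researchers at Stanford University.",
        "qas": [
            {
                "id": "q1",
                "question": "When was SQuAD2.0 released?",
                "answers": [{"text": "2018", "answer_start": 25}],
                "is_impossible": False,
            },
            {
                "id": "q2",
                "question": "How many examples does the dataset contain?",
                "answers": [],
                "is_impossible": True,
            },
        ],
    }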
Working on HeQ means that while you can rely on the SQuAD2.0 works for inspiration, you will need to adapt them somewhat to the Hebrew data. Some models take longer to train than others. Some require a GPU to train in a reasonable time. Some have many moving parts, some have fewer. Take this into account when choosing what to implement.
You need to produce the best model you can for this task. As it is a new dataset, we don't yet know what is achievable in terms of score. So do your best, and you may even be the best in the world :) You should be at least as good as the baseline results reported in Table 5 of the HeQ paper above. This should be easy. (You will lose points if you are not at least as good as the baseline in the paper.)
The HuggingFace Transformers library will be a useful building block: https://huggingface.co/docs
You may use a pre-trained model such as AlephBERT, mBERT, mT5, or any other pre-trained model you find.
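For example, a minimal sketch of loading a Hebrew checkpoint with a span-prediction head and extracting an answer for one question/paragraph pair might look like the following; onlplab/alephbert-base is just one possible checkpoint name, and the QA head on top of it is randomly initialized until you fine-tune it on HeQ:

    import torch
    from transformers import AutoTokenizer, AutoModelForQuestionAnswering

    # One Hebrew checkpoint from the Hub; any other pre-trained encoder can be
    # substituted. The span-prediction head is untrained until fine-tuned on HeQ.
    model_name = "onlplab/alephbert-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForQuestionAnswering.from_pretrained(model_name)

    question, context = "Example question?", "Example paragraph text."
    inputs = tokenizer(question, context, return_tensors="pt", truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # The model scores every token as a possible start/end of the answer span;
    # the predicted answer is the text between the highest-scoring positions.
    start = int(outputs.start_logits.argmax())
    end = int(outputs.end_logits.argmax())
    answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
    print(answer)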
You may fine-tune the entire model, or use a parameter-efficient tuning method like BitFit or LoRA (these might be faster to train, and will likely work well).
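If you go the parameter-efficient route, the peft library implements LoRA; a rough sketch is below (the target module names depend on the underlying architecture, and the hyperparameters are common defaults rather than tuned values):

    from transformers import AutoModelForQuestionAnswering
    from peft import LoraConfig, TaskType, get_peft_model

    model = AutoModelForQuestionAnswering.from_pretrained("onlplab/alephbert-base")

    # LoRA adds small trainable low-rank matrices to selected weight matrices
    # and freezes everything else, so far fewer parameters are updated.
    lora_config = LoraConfig(
        task_type=TaskType.QUESTION_ANS,
        r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=["query", "value"],  # BERT-style names; verify for your model
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # sanity check: only a small fraction is trainable

The wrapped model can then be trained the same way as a fully fine-tuned one, for example with the regular Transformers Trainer.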
Compute your F1 and TLNLS scores with this evaluation script: eval.tgz
Run it as: python eval_qa.py gold_file.json prediction_file.json
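The exact prediction-file format is defined by eval_qa.py, so check the script; if it follows the official SQuAD2.0 convention (an assumption here), the prediction file is a single JSON object mapping each question id to the predicted answer string, with an empty string meaning "no answer", e.g.:

    import json

    # Assumed SQuAD-style prediction format: {question_id: answer_string},
    # with an empty string for questions the model judges unanswerable.
    # Verify against eval_qa.py before submitting.
    predictions = {"q1": "2018", "q2": ""}

    with open("prediction_file.json", "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False)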
Submit using the submission system. All the text/PDF files should include your ID and username in a clear location.
You need to submit all your code, a plain-text (ASCII) README
file explaining how to run your code in order to replicate your best result,
and a PDF file called report.pdf
describing your implementation and results.
Your grade will be based on a combination of your achieved score and your report, so make sure the report is presentable and contains enough detail to be graded properly.
At a minimum, it should describe: