<img src="https://upload.wikimedia.org/wikipedia/en/6/6d/Nvidia_image_logo.svg" style="width: 90px; float: right;">

# BERT Question Answering in TensorFlow with Mixed Precision

Copyright 2021 NVIDIA Corporation. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

In [None]:
!nvidia-smi

## 1. Overview

Bidirectional Encoder Representations from Transformers (BERT) is a method of pre-training language representations that obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

The original paper can be found here: https://arxiv.org/abs/1810.04805.

NVIDIA's BERT is an optimized version of Google's official implementation. It uses mixed precision arithmetic and Tensor Cores on V100 GPUs to train faster while maintaining target accuracy.
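To make the precision trade-off concrete: FP16 (IEEE 754 half precision) keeps only 10 explicit mantissa bits and has a largest finite value of 65504, so it is both coarser and narrower in range than FP32. The sketch below uses Python's standard `struct` module, whose `'e'` format is IEEE 754 half precision, to round-trip values through FP16. `to_fp16` is a hypothetical helper for illustration only and says nothing about how NVIDIA's BERT assigns precisions to individual operations. These limits are why mixed precision training typically keeps numerically sensitive parts, such as master weights, in FP32 and applies loss scaling.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Precision: with 10 mantissa bits, the spacing just above 1.0 is 2**-10.
print(to_fp16(1 + 2**-10))   # 1.0009765625 (exactly representable)
print(to_fp16(1 + 2**-12))   # 1.0 (the difference is too small and rounds away)

# Range: tiny magnitudes underflow to zero ...
print(to_fp16(1e-8))         # 0.0
# ... and 65504 is the largest finite FP16 value; larger values cannot be packed.
print(to_fp16(65504.0))      # 65504.0
try:
    to_fp16(70000.0)
except OverflowError as err:
    print("overflow:", err)
```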

### Learning objectives

This notebook demonstrates:
- Inference on the Question Answering (QA) task with the BERT Large model
- Downloading and using fine-tuned NVIDIA BERT models from [NGC](https://ngc.nvidia.com)
- Using mixed precision models for inference

## 2. Setup

### Pre-Trained NVIDIA BERT TensorFlow Models on NGC

<img src="https://blogs.nvidia.com/wp-content/uploads/2019/03/18-ngc-software-stack-447x500.png" style="width: 360px;">

We will be using the following configuration of BERT in this example:

| **Model** | **Hidden layers** | **Hidden unit size** | **Attention heads** | **Feedforward filter size** | **Max sequence length** | **Parameters** |
|:---------:|:----------:|:----:|:---:|:--------:|:---:|:----:|
|BERT-Large|24 encoder|1024|16|4 x 1024|512|330M|

**To do so, we will take advantage of the pre-trained models available on the [NGC Model Registry](https://ngc.nvidia.com/catalog/models).**

Among the many configurations available, we will download the following one:

 - **bert_tf_ckpt_large_qa_squad2_amp_384**

which was fine-tuned on the [SQuAD 2.0 Dataset](https://rajpurkar.github.io/SQuAD-explorer/).

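We use the mixed precision (AMP) model, which takes much less time to train than the FP32 version without losing accuracy. The flag below records this choice; note that the download cell that follows always fetches the AMP checkpoint (`bert_tf_ckpt_large_qa_squad2_amp_384`) regardless of its value: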

In [None]:
use_mixed_precision_model = True

In [None]:
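# Replace <your_workspace> with your own path; keep the trailing slash,
# because the paths below are built by string concatenation (root_dir + '...')
root_dir="/scratch/ws/1/<your_workspace>/bert/"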

In [None]:
# bert_tf_ckpt_large_qa_squad2_amp_384
DATA_DIR_FT = root_dir+'data/finetuned_large_model_SQUAD2.0'
!mkdir -p $DATA_DIR_FT
    
!wget --content-disposition -O $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip  \
https://api.ngc.nvidia.com/v2/models/nvidia/bert_tf_ckpt_large_qa_squad2_amp_384/versions/19.03.1/zip \
&& unzip -n -d $DATA_DIR_FT/ $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip \
&& rm -rf $DATA_DIR_FT/bert_tf_ckpt_large_qa_squad2_amp_384_19.03.1.zip
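After the download, it can help to confirm that the checkpoint used later as `init_checkpoint` (`data/finetuned_large_model_SQUAD2.0/model.ckpt`) actually unpacked. A TensorFlow checkpoint prefix such as `model.ckpt` is stored as several files: an `.index` file plus one or more `.data-XXXXX-of-YYYYY` shards (TF1 savers usually also write a `.meta` graph). The sketch below checks for the index and data files using only the standard library. It assumes the zip unpacks to a `model.ckpt` prefix directly inside `DATA_DIR_FT`, which is the path this notebook uses later; `checkpoint_files_present` is a hypothetical helper name.

```python
import glob
import os

def checkpoint_files_present(prefix):
    """Return True if an index file and at least one data shard exist for a checkpoint prefix.

    Hypothetical helper: it only checks file names and does not validate file contents.
    """
    has_index = os.path.isfile(prefix + ".index")
    # glob.escape protects against special characters in the directory path
    data_shards = glob.glob(glob.escape(prefix) + ".data-*-of-*")
    return has_index and len(data_shards) > 0

# Usage in this notebook (assumes the zip unpacked to DATA_DIR_FT/model.ckpt.*):
# print(checkpoint_files_present(os.path.join(DATA_DIR_FT, "model.ckpt")))
```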

### NGC Model Scripts

While we're at it, we'll also pull down some BERT helper scripts from the [NGC Model Scripts Registry](https://ngc.nvidia.com/catalog/model-scripts/nvidia:bert_for_tensorflow).

In [None]:
# Download BERT helper scripts
!wget -nc --show-progress -O bert_scripts.zip \
     https://api.ngc.nvidia.com/v2/recipes/nvidia/bert_for_tensorflow/versions/1/zip
!mkdir -p $root_dir
!unzip -n -d $root_dir bert_scripts.zip

### BERT Config

In [None]:
# Download BERT vocab file
!mkdir -p $root_dir/config.qa
!wget -nc https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt \
    -O $root_dir/config.qa/vocab.txt
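The vocabulary file lists one WordPiece token per line. BERT's tokenizer splits each word into the longest vocabulary pieces it can find, scanning left to right, and marks word-internal pieces with a `##` prefix. The sketch below is a simplified, standalone illustration of that greedy longest-match-first step; it uses a toy vocabulary in the test and omits the lower-casing, accent stripping, and punctuation splitting that precede this step in the real tokenizer. `load_vocab` and `wordpiece_tokenize` are hypothetical names, not the NGC scripts' API.

```python
def load_vocab(path):
    """Read a BERT-style vocab file (one token per line) into a set for membership tests."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}

def wordpiece_tokenize(word, vocab, unk_token="[UNK]", max_chars=100):
    """Split one word into WordPiece tokens using greedy longest-match-first."""
    if len(word) > max_chars:
        return [unk_token]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, shrinking until a vocab hit
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no piece matched: the whole word becomes [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Usage with the downloaded vocabulary:
# vocab = load_vocab(root_dir + 'config.qa/vocab.txt')
# print(wordpiece_tokenize("spaceflight", vocab))
```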

In [None]:
%%writefile $root_dir/config.qa/bert_config.json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
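As a sanity check on the configuration table in the Setup section, the parameter count can be derived from the hyperparameters in `bert_config.json`. The sketch below (a hypothetical helper, not part of the NGC scripts) counts the weights and biases of the embeddings, the 24 Transformer layers, and the pooler, assuming the standard BERT layout: Q/K/V/output attention projections, a two-layer feed-forward block, and two LayerNorms per layer. The SQuAD span-prediction output layer is excluded. The total comes to about 335M, in line with the approximate 330M figure in the table.

```python
# Hyperparameters copied from bert_config.json above
config = {
    "hidden_size": 1024,
    "intermediate_size": 4096,
    "max_position_embeddings": 512,
    "num_hidden_layers": 24,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

def bert_encoder_param_count(cfg):
    """Count weights and biases in a BERT encoder (embeddings, layers, pooler); task heads excluded."""
    H = cfg["hidden_size"]
    I = cfg["intermediate_size"]
    layer_norm = 2 * H  # gain and bias

    # Word, position and token-type embedding tables, plus one LayerNorm
    embeddings = (cfg["vocab_size"] + cfg["max_position_embeddings"]
                  + cfg["type_vocab_size"]) * H + layer_norm
    attention = 4 * (H * H + H)               # Q, K, V and output projections
    feed_forward = (H * I + I) + (I * H + H)  # H -> I -> H
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = H * H + H
    return embeddings + cfg["num_hidden_layers"] * per_layer + pooler

total = bert_encoder_param_count(config)
print(f"{total:,}")  # 335,141,888
```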

### Helper Functions

In [None]:
import json
from html import escape
from IPython.display import display, HTML

# Create a SQuAD-format JSON input file from a context and a list of questions
def write_input_file(context, qinputs, predict_file):
    # Collapse newlines so the context is a single paragraph
    # (json.dump escapes quotes and other special characters itself)
    context = context.replace('\n', ' ')
    # Create JSON dict to write
    json_dict = {
      "data": [
        {
          "title": "BERT QA",
          "paragraphs": [
            {
              "context": context,
              "qas": qinputs
            }
          ]
        }
      ]
    }
    # Write JSON to input file
    with open(predict_file, 'w') as json_file:
        json.dump(json_dict, json_file, indent=2)

# Display inference results as an HTML table
def display_results(predict_file, output_prediction_file):
    # Only the best prediction per question is shown here;
    # n-best predictions are also written to the output directory
    with open(predict_file, 'r') as query_file:
        input_data = json.load(query_file)["data"]
    with open(output_prediction_file, 'r') as result_file:
        predictions = json.load(result_file)

    rows = ""
    for entry in input_data:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                rows += "<tr><td>{}</td><td>{}</td><td>{}</td></tr>".format(
                    escape(qa["id"]), escape(qa["question"]), escape(predictions[qa["id"]]))

    display(HTML("<table><tr><th>Id</th><th>Question</th><th>Answer</th></tr>{}</table>".format(rows)))

## 3. BERT Inference: Question Answering

We can run inference on a fine-tuned BERT model for tasks like Question Answering.

Here we use a BERT model fine-tuned on the [SQuAD 2.0 Dataset](https://rajpurkar.github.io/SQuAD-explorer/), which combines 100,000+ question-answer pairs on 500+ articles with over 50,000 new, unanswerable questions.

### Paragraph and Queries

In this example we will ask our BERT model questions related to the following paragraph:

**The Apollo Program**
_"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975."_

  
---

The paragraph and the questions can be easily customized by changing the code below:

---

In [None]:
# Create BERT input file with (1) context and (2) questions to be answered based on that context
predict_file = root_dir+'config.qa/input.json'

In [None]:
%%writefile $predict_file
{"data": 
 [
     {"title": "Project Apollo",
      "paragraphs": [
          {"context":"The Apollo program, also known as Project Apollo, was the third United States human spaceflight program carried out by the National Aeronautics and Space Administration (NASA), which accomplished landing the first humans on the Moon from 1969 to 1972. First conceived during Dwight D. Eisenhower's administration as a three-man spacecraft to follow the one-man Project Mercury which put the first Americans in space, Apollo was later dedicated to President John F. Kennedy's national goal of landing a man on the Moon and returning him safely to the Earth by the end of the 1960s, which he proposed in a May 25, 1961, address to Congress. Project Mercury was followed by the two-man Project Gemini. The first manned flight of Apollo was in 1968. Apollo ran from 1961 to 1972, and was supported by the two-man Gemini program which ran concurrently with it from 1962 to 1966. Gemini missions developed some of the space travel techniques that were necessary for the success of the Apollo missions. Apollo used Saturn family rockets as launch vehicles. Apollo/Saturn vehicles were also used for an Apollo Applications Program, which consisted of Skylab, a space station that supported three manned missions in 1973-74, and the Apollo-Soyuz Test Project, a joint Earth orbit mission with the Soviet Union in 1975.", 
           "qas": [
               { "question": "What project put the first Americans into space?", 
                 "id": "Q1"
               },
               { "question": "What program was created to carry out these projects and missions?",
                 "id": "Q2"
               },
               { "question": "What year did the first manned Apollo flight occur?",
                 "id": "Q3"
               },                
               { "question": "What President is credited with the notion of putting Americans on the moon?",
                 "id": "Q4"
               },
               { "question": "Who did the U.S. collaborate with on an Earth orbit mission in 1975?",
                 "id": "Q5"
               },
               { "question": "How long did Project Apollo run?",
                 "id": "Q6"
               },               
               { "question": "What program helped develop space travel techniques that Project Apollo used?",
                 "id": "Q7"
               },                
               {"question": "What space station supported three manned missions in 1973-1974?",
                 "id": "Q8"
               }
]}]}]}
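Before launching inference, a quick structural check of the input can save a failed run. The sketch below walks the SQuAD-style nesting used above (`data`, then `paragraphs`, then `qas`) and reports empty contexts, empty questions, and duplicate question ids. Duplicate ids matter because the predictions are looked up by `id` (as `display_results` does), so two questions sharing an id cannot both get their own answer. `squad_input_problems` is a hypothetical helper, not part of the NGC scripts.

```python
def squad_input_problems(doc):
    """Return a list of human-readable problems found in a SQuAD-style input dict."""
    problems = []
    seen_ids = set()
    for article in doc.get("data", []):
        for paragraph in article.get("paragraphs", []):
            if not paragraph.get("context", "").strip():
                problems.append("empty context in article {!r}".format(article.get("title")))
            for qa in paragraph.get("qas", []):
                qid = qa.get("id")
                if qid in seen_ids:
                    problems.append("duplicate id {!r}".format(qid))
                seen_ids.add(qid)
                if not qa.get("question", "").strip():
                    problems.append("empty question for id {!r}".format(qid))
    return problems

# Usage in this notebook (predict_file is defined above):
# import json
# with open(predict_file) as f:
#     print(squad_input_problems(json.load(f)) or "input looks OK")
```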

## 4. Running Question/Answer Inference

To run QA inference, we launch the `run_squad.py` script with the following parameters:

In [None]:
import os

# This specifies the model architecture.
bert_config_file = root_dir+'config.qa/bert_config.json'

# The vocabulary file that the BERT model was trained on.
vocab_file = root_dir+'config.qa/vocab.txt'

# Initial checkpoint: the fine-tuned BERT Large model downloaded above
init_checkpoint = os.path.join(root_dir, 'data/finetuned_large_model_SQUAD2.0/model.ckpt')

# Output directory where all the results are saved.
output_dir = root_dir+'results'
output_prediction_file = os.path.join(output_dir,'predictions.json')
    
# Whether to lower case the input - True for uncased models / False for cased models.
do_lower_case = True
  
# Total batch size for predictions
predict_batch_size = 8

# Whether to run prediction on predict_file.
do_predict = True

# When splitting up a long document into chunks, how much stride to take between chunks.
doc_stride = 128

# The maximum total input sequence length after WordPiece tokenization.
# Sequences longer than this will be truncated, and sequences shorter than this will be padded.
max_seq_length = 384
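To see how `doc_stride` and `max_seq_length` interact, here is a sketch of the sliding-window logic used by Google's original `run_squad.py`, from which NVIDIA's optimized implementation is derived. Each window holds the question, a chunk of the document, and three special tokens (`[CLS]` and two `[SEP]`), and successive windows start `doc_stride` tokens apart. Treat it as an approximation of the script's behavior; the function name `doc_spans` and the token counts in the example are hypothetical.

```python
from typing import List, Tuple

def doc_spans(num_doc_tokens: int, max_seq_length: int, doc_stride: int,
              num_query_tokens: int) -> List[Tuple[int, int]]:
    """Return (start, length) windows over the document's WordPiece tokens."""
    # Room left for document tokens after the query, [CLS] and two [SEP] tokens
    max_tokens_for_doc = max_seq_length - num_query_tokens - 3
    spans = []
    start = 0
    while start < num_doc_tokens:
        length = min(num_doc_tokens - start, max_tokens_for_doc)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # this window reaches the end of the document
        start += min(length, doc_stride)
    return spans

# A hypothetical 1,000-token document with a 20-token question
print(doc_spans(1000, max_seq_length=384, doc_stride=128, num_query_tokens=20))
# [(0, 361), (128, 361), (256, 361), (384, 361), (512, 361), (640, 360)]
```

Overlapping windows mean every document token appears in at least one window with some surrounding context; at prediction time the answer is taken from whichever window scores best.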

### 4a. Run Inference

In [None]:
python_file=root_dir+"run_squad.py"
python_file

In [None]:
# Ask BERT questions
!python $python_file \
  --bert_config_file=$bert_config_file \
  --vocab_file=$vocab_file \
  --init_checkpoint=$init_checkpoint \
  --output_dir=$output_dir \
  --do_predict=$do_predict \
  --predict_file=$predict_file \
  --predict_batch_size=$predict_batch_size \
  --doc_stride=$doc_stride \
  --max_seq_length=$max_seq_length \
  --do_lower_case=$do_lower_case
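`run_squad.py` turns the model's per-token start and end logits into an answer. The sketch below is a simplified, brute-force version of that selection, loosely following Google's original `run_squad.py`: choose the span with the highest start+end score subject to `end >= start` and a maximum answer length and, for SQuAD 2.0, return "no answer" when the score at position 0 (`[CLS]`) beats the best span by more than a threshold. The real script only considers the n-best start/end candidates, skips question and padding tokens, and maps tokens back to the original text. The function name `best_span` and its default values are assumptions.

```python
from typing import Optional, Sequence, Tuple

def best_span(start_logits: Sequence[float], end_logits: Sequence[float],
              max_answer_length: int = 30,
              null_score_diff_threshold: float = 0.0) -> Optional[Tuple[int, int]]:
    """Return (start, end) token indices of the best answer, or None for "no answer".

    Index 0 is assumed to be the [CLS] token, whose score stands for "no answer".
    """
    best, best_score = None, float("-inf")
    for start in range(1, len(start_logits)):
        # Only spans with end >= start and at most max_answer_length tokens
        for end in range(start, min(start + max_answer_length, len(end_logits))):
            score = start_logits[start] + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    null_score = start_logits[0] + end_logits[0]
    if best is None or null_score - best_score > null_score_diff_threshold:
        return None
    return best

# Toy logits for a 4-token window: [CLS] followed by three document tokens
print(best_span([0.0, 1.0, 5.0, 0.5], [0.0, 0.2, 0.1, 4.0]))  # (2, 3)
print(best_span([10.0, 1.0, 1.0], [10.0, 1.0, 1.0]))          # None
```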

### 4b. Display Results

In [None]:
display_results(predict_file, output_prediction_file)

<details>
  <summary><b>Click to reveal expected answers to the questions above</b></summary>
  
| Id | Question | Answer |
|----|----------|--------|
| Q1 | What project put the first Americans into space? | Project Mercury |
| Q2 | What program was created to carry out these projects and missions? | The Apollo program |
| Q3 | What year did the first manned Apollo flight occur? | 1968 |
| Q4 | What President is credited with the notion of putting Americans on the moon? | John F. Kennedy |
| Q5 | Who did the U.S. collaborate with on an Earth orbit mission in 1975? | Soviet Union |
| Q6 | How long did Project Apollo run? | 1961 to 1972 |
| Q7 | What program helped develop space travel techniques that Project Apollo used? | Gemini missions |
| Q8 | What space station supported three manned missions in 1973-1974? | Skylab |

</details>
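To compare the model's answers with the expected ones above, the usual SQuAD metrics are exact match and token-level F1, both computed after normalizing the text (lower-casing, removing punctuation and the articles a/an/the, collapsing whitespace). The sketch below re-implements that normalization in the spirit of the official SQuAD evaluation script; it is not the official script itself. Following SQuAD 2.0 conventions, an empty answer (no answer) only scores when both sides are empty.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lower-case, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(ground_truth).split()
    if not pred_tokens or not truth_tokens:
        # Unanswerable case: full credit only if both answers are empty
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Apollo program", "Apollo program"))  # True
print(f1_score("Gemini missions", "the two-man Gemini program"))
```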

## 5. Custom Inputs

Now that you are familiar with running QA inference on BERT, you may want to try your own paragraphs and queries.


1. Copy and paste your context from Wikipedia, news articles, etc., when prompted below.
2. Enter questions based on the context when prompted below.
3. Run the inference script.
4. Display the inference results.

In [None]:
predict_file = root_dir+'config.qa/custom_input.json'
num_questions = 4           # You can configure this number

In [None]:
# Create your own context to ask questions about.
context = input("Paste your context here: ")

In [None]:
# Get questions from user input
questions = [input("Question {}/{}: ".format(i+1, num_questions)) for i in range(num_questions)]
# Format questions and write to JSON input file
qinputs = [{ "question":q, "id":"Q{}".format(i+1)} for i,q in enumerate(questions)]
write_input_file(context, qinputs, predict_file)
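If you press Enter at a prompt without typing anything, `questions` will contain an empty string. A minimal sketch that drops blank entries before numbering the ids, producing the same `{"question", "id"}` entries the cell above passes to `write_input_file` (`build_qinputs` is a hypothetical helper, not part of the NGC scripts):

```python
def build_qinputs(questions):
    """Drop blank questions and number the rest Q1, Q2, ... in SQuAD format."""
    kept = [q.strip() for q in questions if q.strip()]
    return [{"question": q, "id": "Q{}".format(i + 1)} for i, q in enumerate(kept)]

# Usage in this notebook:
# qinputs = build_qinputs(questions)
# write_input_file(context, qinputs, predict_file)
```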

In [None]:
# Ask BERT questions
!python $python_file \
  --bert_config_file=$bert_config_file \
  --vocab_file=$vocab_file \
  --init_checkpoint=$init_checkpoint \
  --output_dir=$output_dir \
  --do_predict=$do_predict \
  --predict_file=$predict_file \
  --predict_batch_size=$predict_batch_size \
  --doc_stride=$doc_stride \
  --max_seq_length=$max_seq_length \
  --do_lower_case=$do_lower_case

In [None]:
display_results(predict_file, output_prediction_file)