Graphcore - Medium

Text Summarization with BART-Large on IPU

Graphcore — Fri, 11 Aug 2023 14:45:45 GMT

Try one of the most powerful and versatile applications of the BART language model using a Paperspace Gradient Notebook

by Goran Katalinic

Text summarization is one of the best examples of AI natural language processing (NLP) being put to practical use.

With vast amounts of information produced every day, the ability to quickly understand, evaluate and act on that information can be extremely valuable, both in the commercial world and in other fields such as scientific research.

Summarization is the task of producing a shorter version of a document while preserving its important information. Fundamentally, it involves extracting text from the original input and then generating new text that describes the essence of the original. In some cases the two parts may be managed by different AI models.

In this blog, we will demonstrate how to run the entire summarization process using BART-Large on Graphcore IPUs.

What is BART and why is it good for text summarization?

When Google launched BERT (Bidirectional Encoder Representations from Transformers) in 2018, it was described as a model for “language understanding”, covering a broad range of applications including sentiment analysis, text classification and question answering. Summarization was not explicitly called out as a use case at that time.

In the same year, OpenAI further advanced the field of Natural Language Understanding, proposing the concept of Generative Pre-Training (GPT).

In late 2019, Facebook AI researchers proposed a combination of a bidirectional encoder (like BERT) and a left-to-right decoder (like GPT) and named it BART, which stands for Bidirectional and Auto-Regressive Transformers.

According to the original paper, the novelty in pretraining lies in randomly shuffling the order of the original sentences and using a new in-filling scheme, in which spans of text are replaced with a single mask token. The authors claimed that BART is particularly effective when fine-tuned for text generation, and that it also works well for comprehension tasks, both of which are needed for text summarization.
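
To make the two pretraining corruptions more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names `shuffle_sentences` and `infill_span` are hypothetical, tokens are simply whitespace-separated words, and the masked span has a fixed position and length chosen for illustration.

```python
import random

MASK = "<mask>"

def shuffle_sentences(sentences, rng):
    """Return the sentences in a random order (sentence shuffling)."""
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return shuffled

def infill_span(tokens, start, length, mask=MASK):
    """Replace tokens[start:start + length] with a single mask token (in-filling)."""
    return tokens[:start] + [mask] + tokens[start + length:]

rng = random.Random(0)  # fixed seed so the demo is repeatable
sentences = ["The cat sat on the mat.", "It was warm.", "Then it fell asleep."]
shuffled = shuffle_sentences(sentences, rng)

# Assumption: whitespace-separated words stand in for real subword tokens.
tokens = " ".join(shuffled).split()
corrupted = infill_span(tokens, start=2, length=3)
print(" ".join(corrupted))
```

During pretraining the model would be asked to reconstruct the original text from a corrupted input like this one, which is why it has to learn both to understand the input and to generate fluent output.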

Text summarization on Graphcore IPUs with Hugging Face pipeline

BART is one of the many NLP models supported within Optimum Graphcore, which is an interface between Hugging Face and Graphcore IPUs.

Here we demonstrate a text summarization task running BART-Large inference on Graphcore IPUs.

For each code block below, you can simply click to run the block in Paperspace — making any modifications to code/parameters, where relevant. We explain how to run the process in environments other than Paperspace Gradient Notebooks at the end of this blog.

Install dependencies

%pip install optimum-graphcore==0.7.1 wikipedia graphcore-cloud-tools[logger]@git+https://github.com/graphcore/graphcore-cloud-tools

%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

import os

exec_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/")

Model preparation

We start by preparing the model. First, we define the configuration needed to run the model on the IPU. IPUConfig is a class that specifies attributes and configuration parameters to compile and put the model on the device:

from optimum.graphcore import IPUConfig

ipu_config = IPUConfig(
    layers_per_ipu=[12, 12],
    matmul_proportion=0.15,
    executable_cache_dir=exec_cache_dir,
    inference_parallelize_kwargs={
        "max_length": 150,
        "num_beams": 3,
        "use_encoder_output_buffer": True,
        "on_device_generation_steps": 16,
    }
)

Next, let’s import pipeline from optimum.graphcore and create our summarization pipeline:

from optimum.graphcore import pipeline

summarizer = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    tokenizer="facebook/bart-large-cnn",
    ipu_config=ipu_config,
    config="facebook/bart-large-cnn",
    max_input_length=1024,
    truncation=True
)

We define an input to test the model.

input_test = 'In computing, a compiler is a computer program that translates computer code written in one programming language (the source language) into another language (the target language). The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language (e.g. assembly language, object code, or machine code) to create an executable program.'
input_test

Compilation time for the 1st run: ~ 2:30

%%time
summarizer(input_test, max_length=150, num_beams=3)

Faster fairy tales

The first call to the pipeline was slow because the model is compiled on the first call. Subsequent calls reuse the compiled model and are much faster:

the_princess_and_the_pea = 'Once upon a time there was a prince who wanted to marry a princess; but she would have to be a real princess. He travelled all over the world to find one, but nowhere could he get what he wanted. There were princesses enough, but it was difficult to find out whether they were real ones. There was always something about them that was not as it should be. So he came home again and was sad, for he would have liked very much to have a real princess. One evening a terrible storm came on; there was thunder and lightning, and the rain poured down in torrents. Suddenly a knocking was heard at the city gate, and the old king went to open it. It was a princess standing out there in front of the gate. But, good gracious! what a sight the rain and the wind had made her look. The water ran down from her hair and clothes; it ran down into the toes of her shoes and out again at the heels. And yet she said that she was a real princess. Well, we\'ll soon find that out, thought the old queen. But she said nothing, went into the bed-room, took all the bedding off the bedstead, and laid a pea on the bottom; then she took twenty mattresses and laid them on the pea, and then twenty eider-down beds on top of the mattresses. On this the princess had to lie all night. In the morning she was asked how she had slept. "Oh, very badly!" said she. "I have scarcely closed my eyes all night. Heaven only knows what was in the bed, but I was lying on something hard, so that I am black and blue all over my body. It\'s horrible!" Now they knew that she was a real princess because she had felt the pea right through the twenty mattresses and the twenty eider-down beds. Nobody but a real princess could be as sensitive as that. So the prince took her for his wife, for now he knew that he had a real princess; and the pea was put in the museum, where it may still be seen, if no one has stolen it. There, that is a true story.'
the_princess_and_the_pea

%%time
summarizer(the_princess_and_the_pea, max_length=150, num_beams=3)

Summarization of Wikipedia articles

Now let’s use the Wikipedia API to search for some long text that can be summarized:

import wikipedia

# TRY IT YOURSELF BY CHANGING THE PAGE TITLE BELOW
page_title = "Queen (band)"
text = wikipedia.page(page_title).content
text

%%time
summarizer(
    text,  # NOTE: the input text is truncated to max_input_length=1024
    max_length=150,
    num_beams=3,
)
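
Because the pipeline was created with max_input_length=1024 and truncation=True, only the beginning of a long article is actually summarized. One possible workaround, sketched below under stated assumptions, is to split the text into chunks and summarize each chunk separately. The helpers `chunk_words` and `summarize_long` are hypothetical, word counts are only a rough proxy for token counts, and a stand-in function replaces the real summarizer so the example runs anywhere.

```python
from typing import Callable, List

def chunk_words(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_long(text: str, summarize_fn: Callable[[str], str], max_words: int = 600) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize_fn(chunk) for chunk in chunk_words(text, max_words))

# Stand-in summarizer for demonstration only: keeps the first five words of each chunk.
def first_five_words(chunk: str) -> str:
    return " ".join(chunk.split()[:5])

demo_text = " ".join(f"w{i}" for i in range(25))
print(summarize_long(demo_text, first_five_words, max_words=10))
```

With the real pipeline, `summarize_fn` could wrap the `summarizer` call, assuming it returns the usual Hugging Face list of dictionaries with a `summary_text` key.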

Summarization of medical health records

Summarization can also be useful for condensing medical health records (MHR). Let’s import an open source dataset with some medical samples.

from datasets import load_dataset

dataset = load_dataset("rungalileo/medical_transcription_40")
dataset

We focus on the medical report labeled as “text” and from the training dataset select a random patient ID.

import random

# RUN THIS CELL AGAIN TO SELECT ANOTHER REPORT
# randrange excludes the upper bound, so the index is always valid
random_patient_id = random.randrange(len(dataset["train"]))

exemplary_medical_report = dataset["train"][random_patient_id]["text"]
exemplary_medical_report

%%time
summarizer(exemplary_medical_report, max_length=150, num_beams=3)

Running BART-Large on IPUs in non-Paperspace environments

To run the demo using other IPU hardware, you need to have the Poplar SDK enabled and the relevant PopTorch wheels installed. Refer to the getting started guide for your system for details on how to enable the Poplar SDK and install the PopTorch wheels.

Text Summarization with BART-Large on IPU was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

Fine-tune OpenAI’s Whisper Automatic Speech Recognition (ASR) Model

Graphcore — Wed, 09 Aug 2023 22:31:30 GMT

Get the most out of Whisper by optimising it for new use cases, including better comprehension of specific languages and dialects, as well as technical and industry-specific terminology.

by Goran Katalinic

Whisper — the open source automatic speech recognition (ASR) model created by OpenAI — is incredibly powerful out of the box.

It is trained on 680,000 hours of labelled audio data, 117,000 hours of which cover 96 languages other than English, meaning that it can be applied to a wide range of applications with great results.

The vanilla version of Whisper is available to run for inference in a Paperspace Gradient Notebook, powered by Graphcore IPUs.

There are also good reasons to fine-tune Whisper for a particular use case. This could include accounting for the complex and sometimes subtle differences in speech and vocabulary as influenced by:

A less common spoken language
Locale and dialect
A particular domain, such as scientific, medical, and legal

Where can I get audio data for fine-tuning Whisper?

Some organisations may have large amounts of proprietary audio data that can be used in the fine-tuning process. For others, gathering the audio necessary for fine-tuning is not a trivial undertaking.

Thankfully, there are several open-sourced speech recognition datasets available, covering multiple languages. The largest of these are:

Common Voice: 16,400 hours spanning 100 languages
Multilingual LibriSpeech: 6,000 hours for seven languages other than English
LibriSpeech: 1,000 hours, English only

There are smaller datasets covering many more languages and dialects, such as:

VoxPopuli: 1,800 hours, 16 languages
Fleurs: 12 hours per language, 102 languages
There are also individual datasets hosted by OpenSLR

In our Paperspace Gradient Notebook, we demonstrate fine-tuning using the Catalan subset of OpenSLR.

How to fine-tune Whisper on Graphcore IPUs

Get started by running the Whisper Small Fine Tuning notebook on Paperspace.

Install dependencies

# Install optimum-graphcore from source 
!pip install git+https://github.com/huggingface/optimum-graphcore.git@v0.7.1 "soundfile" "librosa" "evaluate" "jiwer"
%pip install "graphcore-cloud-tools[logger] @ git+https://github.com/graphcore/graphcore-cloud-tools"
%load_ext graphcore_cloud_tools.notebook_logging.gc_logger

import os

n_ipu = int(os.getenv("NUM_AVAILABLE_IPU", 4))
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "/tmp/exe_cache/") + "/whisper"

# Generic imports
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import evaluate
import numpy as np
import torch
from datasets import load_dataset, Audio, Dataset, DatasetDict

# IPU-specific imports
from optimum.graphcore import (
    IPUConfig, 
    IPUSeq2SeqTrainer, 
    IPUSeq2SeqTrainingArguments, 
)
from optimum.graphcore.models.whisper import WhisperProcessorTorch

# HF-related imports
from transformers import WhisperForConditionalGeneration

Load dataset

Common Voice datasets consist of recordings of speakers reading text from Wikipedia in different languages. The code below instead loads the Catalan subset of OpenSLR (SLR69) and splits it 80/20 into training and evaluation sets. 🤗 Datasets enables us to easily download and prepare both splits.

If you want to use Common Voice instead, first ensure you have accepted the terms of use on the 🤗 Hub: mozilla-foundation/common_voice_13_0. Once you have accepted the terms, you will have full access to the dataset and be able to download the data locally.

dataset = DatasetDict()
split_dataset = Dataset.train_test_split(
    load_dataset("openslr", "SLR69", split="train", token=False), test_size=0.2, seed=0
)
dataset["train"] = split_dataset["train"]
dataset["eval"] = split_dataset["test"]
print(dataset)

The columns of interest are:

audio: the raw audio samples
sentence: the corresponding ground truth transcription.

We drop the path column.

dataset = dataset.remove_columns(["path"])

Since Whisper was pre-trained on audio sampled at 16 kHz, we must ensure the audio samples are resampled accordingly.

dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))

Prepare Dataset

We prepare the datasets by extracting features from the raw audio inputs and injecting labels which are simply transcriptions with some basic processing.

The feature extraction is provided by 🤗 Transformers WhisperFeatureExtractor. To decode generated tokens into text after running the model, we will similarly require a tokenizer, WhisperTokenizer. Both of these are wrapped by an instance of WhisperProcessor.

MODEL_NAME = "openai/whisper-small"
LANGUAGE = "spanish"
TASK = "transcribe"
MAX_LENGTH = 224

processor = WhisperProcessorTorch.from_pretrained(MODEL_NAME, language=LANGUAGE, task=TASK)
processor.tokenizer.pad_token = processor.tokenizer.eos_token
processor.tokenizer.max_length = MAX_LENGTH
processor.tokenizer.set_prefix_tokens(language=LANGUAGE, task=TASK)

def prepare_dataset(batch, processor):
    inputs = processor.feature_extractor(
        raw_speech=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    batch["input_features"] = inputs.input_features[0].astype(np.float16)

    transcription = batch["sentence"]
    batch["labels"] = processor.tokenizer(text=transcription).input_ids
    return batch

columns_to_remove = dataset.column_names["train"]
dataset = dataset.map(
    lambda elem: prepare_dataset(elem, processor),
    remove_columns=columns_to_remove,
    num_proc=1,
)

train_dataset = dataset["train"]
eval_dataset = dataset["eval"]

Lastly, we pre-process the labels by padding them with values that will be ignored during fine-tuning. This padding is to ensure tensors of static shape are provided to the model. We do this on the fly via the data collator below.

@dataclass
class DataCollatorSpeechSeq2SeqWithLabelProcessing:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        batch = {}
        batch["input_features"] = torch.tensor([feature["input_features"] for feature in features])
        
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt", padding="longest", pad_to_multiple_of=MAX_LENGTH)
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        batch["labels"] = labels

        return batch
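
To see what the label padding does without any tensors involved, here is a minimal pure-Python sketch. The helper `pad_labels` is hypothetical and works on plain lists. It pads every row to the longest row, rounded up to a multiple of a given size, and fills the padding with -100, the value the collator above writes into padded positions.

```python
from typing import List

IGNORE_INDEX = -100  # value the collator above uses for padded label positions

def pad_labels(label_rows: List[List[int]], multiple_of: int) -> List[List[int]]:
    """Pad every row to the longest row, rounded up to a multiple of multiple_of."""
    longest = max(len(row) for row in label_rows)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    return [row + [IGNORE_INDEX] * (target - len(row)) for row in label_rows]

padded = pad_labels([[7, 8, 9], [5]], multiple_of=4)
print(padded)
```

With multiple_of set to MAX_LENGTH (224 in this walkthrough), every batch of labels shorter than 224 tokens ends up with the same width, which is what gives the model tensors of static shape.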

Define metrics

The performance of our fine-tuned model will be evaluated using word error rate (WER).

metric = evaluate.load("wer")


def compute_metrics(pred, tokenizer):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    pred_ids = np.where(pred_ids != -100, pred_ids, tokenizer.pad_token_id)
    label_ids = np.where(label_ids != -100, label_ids, tokenizer.pad_token_id)

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    normalized_pred_str = [tokenizer._normalize(pred).strip() for pred in pred_str]
    normalized_label_str = [tokenizer._normalize(label).strip() for label in label_str]

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)
    normalized_wer = 100 * metric.compute(predictions=normalized_pred_str, references=normalized_label_str)

    return {"wer": wer, "normalized_wer": normalized_wer}
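
For readers who have not met WER before, the following is a minimal sketch of how it can be computed for a single reference and hypothesis pair: the word-level edit distance divided by the number of reference words. This is an illustration, not the implementation behind `evaluate.load("wer")`. The function names are hypothetical, and a non-empty reference is assumed.

```python
from typing import List

def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER as a percentage: edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return 100 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

# One substitution (sat -> sit) and one deletion (the) against six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

The normalized WER reported above follows the same idea, but compares the strings after text normalization, so differences in casing or punctuation are less likely to count as errors.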

Load pre-trained model

model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

model.config.max_length = MAX_LENGTH
model.generation_config.max_length = MAX_LENGTH

Ensure language-appropriate tokens, if any, are set for generation. We set them on both the config and the generation_config to ensure they are used correctly during generation.

model.config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.config.suppress_tokens = []
model.generation_config.forced_decoder_ids = processor.tokenizer.get_decoder_prompt_ids(
    language=LANGUAGE, task=TASK
)
model.generation_config.suppress_tokens = []

Fine-tuning Whisper on the IPU

The model can be directly fine-tuned on the IPU using the IPUSeq2SeqTrainer class.

The IPUConfig object specifies how the model will be pipelined across the IPUs.

For fine-tuning, we place the encoder on two IPUs, and the decoder on two IPUs.

For inference, the encoder is placed on one IPU, and the decoder on a different IPU.

replication_factor = n_ipu // 4
ipu_config = IPUConfig.from_dict(
    {
        "optimizer_state_offchip": True,
        "recompute_checkpoint_every_layer": True,
        "enable_half_partials": True,
        "executable_cache_dir": executable_cache_dir,
        "gradient_accumulation_steps": 16,
        "replication_factor": replication_factor,
        "layers_per_ipu": [5, 7, 5, 7],
        "matmul_proportion": [0.2, 0.2, 0.6, 0.6],
        "projection_serialization_factor": 5,
        "inference_replication_factor": 1,
        "inference_layers_per_ipu": [12, 12],
        "inference_parallelize_kwargs": {
            "use_cache": True,
            "use_encoder_output_buffer": True,
            "on_device_generation_steps": 16,
        }
    }
)
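
As a rough picture of what layers_per_ipu means, the sketch below assigns consecutive transformer layers to IPUs. It assumes Whisper Small has 12 encoder layers followed by 12 decoder layers, and it ignores everything else in the model (embeddings, convolutions, output projection). The helper `assign_layers` is hypothetical and is not part of Optimum Graphcore.

```python
from typing import Dict, List

def assign_layers(layer_names: List[str], layers_per_ipu: List[int]) -> Dict[int, List[str]]:
    """Assign consecutive layers to IPUs according to layers_per_ipu."""
    if sum(layers_per_ipu) != len(layer_names):
        raise ValueError("layers_per_ipu must account for every layer")
    placement, start = {}, 0
    for ipu, count in enumerate(layers_per_ipu):
        placement[ipu] = layer_names[start:start + count]
        start += count
    return placement

# Assumption: 12 encoder layers followed by 12 decoder layers.
layers = [f"encoder.{i}" for i in range(12)] + [f"decoder.{i}" for i in range(12)]
training = assign_layers(layers, [5, 7, 5, 7])
inference = assign_layers(layers, [12, 12])
for ipu, names in training.items():
    print(ipu, names[0], "...", names[-1])
```

Under these assumptions, the training layout puts the encoder on IPUs 0 and 1 and the decoder on IPUs 2 and 3, while the inference layout uses one IPU for each.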

Lastly, we specify the arguments controlling the training process.

total_steps = 1000 // replication_factor
training_args = IPUSeq2SeqTrainingArguments(
    output_dir="./whisper-small-ipu-checkpoints",
    do_train=True,
    do_eval=True,
    predict_with_generate=True,
    learning_rate=1e-5 * replication_factor,
    warmup_steps=total_steps // 4,
    evaluation_strategy="steps",
    eval_steps=total_steps,
    max_steps=total_steps,
    save_strategy="steps",
    save_steps=total_steps,
    logging_steps=25,
    dataloader_num_workers=16,
    dataloader_drop_last=True,
)
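
The step count and learning rate above both depend on the replication factor. The short sketch below reproduces that arithmetic for 4 and 16 IPUs so the effect is easy to see. The function `training_schedule` is a hypothetical helper, and it assumes at least four IPUs are available (fewer would give a replication factor of zero).

```python
def training_schedule(n_ipu: int, ipus_per_replica: int = 4,
                      base_steps: int = 1000, base_lr: float = 1e-5) -> dict:
    """Reproduce the arithmetic used above for a given number of IPUs."""
    replication_factor = n_ipu // ipus_per_replica
    total_steps = base_steps // replication_factor
    return {
        "replication_factor": replication_factor,
        "total_steps": total_steps,
        "warmup_steps": total_steps // 4,
        "learning_rate": base_lr * replication_factor,
    }

for n_ipu in (4, 16):
    print(n_ipu, training_schedule(n_ipu))
```

With 16 IPUs there are four replicas, so the number of optimizer steps drops to a quarter while the learning rate is scaled up by the same factor.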

Then, we just need to pass all of this together with our datasets to the IPUSeq2SeqTrainer class:

trainer = IPUSeq2SeqTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorSpeechSeq2SeqWithLabelProcessing(processor),
    compute_metrics=lambda x: compute_metrics(x, processor.tokenizer),
    tokenizer=processor.feature_extractor,
)

To gauge the improvement in WER, we run an evaluation step before fine-tuning.

trainer.evaluate()

All that remains is to fine-tune the model! The fine-tuning process should take between 6 and 18 minutes, depending on how many replicas are used, and achieve a final WER of around 10%.

trainer.train()

Fine-tuning on IPUs in non-Paperspace environments

To run the Whisper Small fine-tuning demo using IPU hardware other than in a Paperspace Gradient Notebook, you need to have the Poplar SDK enabled.

Refer to the Getting Started guide for your system for details on how to enable the Poplar SDK. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

Conclusion

In this notebook, we demonstrated how to fine-tune Whisper for multi-lingual speech recognition and transcription on the IPU.

We used a single replica on a total of four IPUs. To reduce the fine-tuning time, more than one replica, and hence more IPUs, is required. On Paperspace, you can use either an IPU Pod16 or a Bow Pod16, both with 16 IPUs. Please contact Graphcore if you need assistance running on larger platforms.

For all available notebooks, check IPU-powered Jupyter Notebooks to see how IPUs perform on other tasks.

Have a question? Please contact us on our Graphcore community channel.

Fine-tune OpenAI’s Whisper Automatic Speech Recognition (ASR) Model was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

Llama 2: Run Meta’s latest Open Source Generative Large Language Model on IPUs

Tim Santos — Tue, 01 Aug 2023 13:56:15 GMT

Try out 7bn and 13bn parameter versions of the groundbreaking LLM using a Paperspace Gradient Notebook

by Tim Santos and Arsalan Udin

Llama-2 is the latest open-source Large Language Model (LLM) from Meta. It has been described as a game-changer for adoption and commercialisation of LLMs because of its comparable performance with much larger models and its permissive open-source license that allows its use and distribution in commercial applications.

Now, Graphcore is making the model available to run — for free — using Paperspace Gradient Notebooks.

What is Llama-2?

Llama 2 is a large language model that offers significant improvements over its predecessor, which was released in February 2023. While the original Llama was only licensed for research use, Llama-2 is free to use, even for commercial applications, under a new open-source licence.

The latest version of Llama is trained on 40% more data and offers, according to Meta’s accompanying paper, performance on par with proprietary models such as ChatGPT and PaLM, and better than some of the most widely hailed open source LLMs.

The Llama-2 model repository offers a variety of model sizes, namely 7B, 13B, and 70B parameters, which makes deployment of applications viable without having to spend a fortune on infrastructure. There are also accompanying variants, fine-tuned for chat-type interaction, using human feedback.

How can I use Llama-2 for free on IPUs?

At Graphcore we believe in the importance of open-source models in realising the transformative power of AI. They fast-track innovation and breakthroughs, allowing users to build cutting-edge products and services with the minimal burden of starting from scratch or relying on costly proprietary models.

We are committed to bringing the latest, most exciting models to Graphcore IPU users as soon as possible, so we’re delighted to make the 7bn and 13bn parameter versions of Llama-2 available.

The accompanying notebook guides you through creating and configuring an inference pipeline on Graphcore IPUs using a Paperspace Gradient notebook which pre-packages libraries and prerequisite files so you can get started with Llama-2 easily.

New users can try out Llama-2 on a free IPU-Pod4, with Paperspace’s six-hour free trial. For a higher performance implementation, you can scale up to an IPU-Pod16.

1. Request access to the model weights

To download and use the pre-trained Llama-2 base model and fine-tuned Llama-2-chat checkpoints, you will need to authenticate with the Hugging Face Hub and create a read access token on the Hugging Face website. You will need this access token when prompted after executing the following cell:

from huggingface_hub import notebook_login  

notebook_login()

Llama-2 is open-sourced under the Meta license, which you’ll need to accept to get access to the model weights and tokenizer. The models on the Hugging Face Hub are also gated, so you will need to request access through the model cards (see llama-2-7b-chat, llama-2-13b-chat).

2. Select Model Weights

Meta released 7B, 13B, and 70B parameter versions of the model. Llama-2-7B and Llama-2-13B fit in our Paperspace free tier environment, using a Graphcore IPU-Pod4 system. Users can also scale up to a paid IPU-Pod16 system for faster inference.

3. Create Inference Pipeline

Our next goal is to set up the inference pipeline. We’ll specify the maximum sequence length and the micro batch size. Before a model can be deployed on IPUs, it must be compiled into an executable format.

This step takes place when the pipeline is created. It is crucial that all input dimensions are defined prior to this compilation. We have supplied the necessary configurations allowing you to run everything seamlessly. This compilation step could take 15–30 minutes depending on the size of the model, but we also provide the pre-compiled execution file in the cache to bring this step down to a minute.

Selecting a longer sequence length or larger batch size will use more IPU memory. This means that increasing one may require you to decrease the other. As your next task, try enhancing the system’s efficiency by adjusting these hyperparameters.

Remember, if you make changes, the pipeline will require recompilation which could take between 10–30 minutes depending on the size of the model.

If you would like to try another model with a different number of parameters, select a different checkpoint in the previous cell and then re-run this step to load the correct model to the IPUs.

4. Run Inference

Call the llama_pipeline object you have just created to generate text from a prompt.

There are some optional parameters to the pipeline call you can use to control the generation behaviour:

temperature – Indicates whether you want more or less creative output. A value of 1.0 corresponds to the model’s default behaviour. Temperatures greater than 1.0 flatten the next token distribution, making more unusual next tokens more likely. The temperature must be zero or positive.

k – Indicates that only the k most probable tokens can be sampled. This is known as “top k” sampling. Set to 0 to disable top k sampling and sample from all possible tokens. The value for k must be between a minimum of 0 and a maximum of config.model.embedding.vocab_size, which is 32000. The default is 5.

output_length – Sets a maximum output length in tokens. Generation normally stops when the model generates its end_key text, but can be made to stop before that by specifying this option. A value of None disables the limit. The default is None.

You can start with a different user input and play around with the optional parameters. For instance, let’s use the prompt “How do I get help if I am stuck on a deserted island?”, set temperature=0.4, and set top-K sampling to 10.
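
To build some intuition for how temperature and k interact, here is a minimal sketch of temperature-scaled top-k sampling over a toy list of logits. It is not the pipeline's implementation: `sample_next_token` and the toy logits are hypothetical, and treating a temperature of zero as greedy selection is an assumption based on the usual convention.

```python
import math
import random
from typing import List

def sample_next_token(logits: List[float], temperature: float, k: int,
                      rng: random.Random) -> int:
    """Sample a token index using temperature scaling and top-k filtering."""
    if temperature < 0:
        raise ValueError("temperature must be zero or positive")
    if temperature == 0:
        # Assumed convention: greedy decoding, always pick the most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # k == 0 disables top-k, so every token stays a candidate.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = ranked if k == 0 else ranked[:k]
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax weights
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
samples = [sample_next_token(toy_logits, temperature=0.4, k=3, rng=rng) for _ in range(10)]
print(samples)
```

Lower temperatures concentrate the samples on the highest-scoring tokens, while a larger k widens the pool of candidates that can be drawn at all.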

Prompting Llama 2

One of the advantages of using your own open source LLM is the ability to control the system prompt which is inaccessible in models served behind APIs and chat applications. This lets you pre-define the behaviour of your generative text agent, inject personality, and provide a more streamlined experience for an end-customer or a client application.

To see the full system prompt and format, you can call the last_instruction_prompt attribute on the pipeline.

Let us see the default prompt format as described in this Llama 2 prompting guide.
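
As an illustration of what such a prompt can look like, the sketch below wraps a system prompt and a single user message in the chat markers published for the Llama-2-chat models. The function `build_llama2_prompt` and the example system prompt are hypothetical, and whether the notebook's pipeline assembles its prompt in exactly this way is an assumption. The beginning-of-sequence token is left to the tokenizer.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and a single user message in the Llama 2 chat markers."""
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

prompt = build_llama2_prompt(
    "You are a concise, helpful assistant.",  # illustrative system prompt
    "How do I get help if I am stuck on a deserted island?",
)
print(prompt)
```

Changing only the system prompt is often enough to alter the tone or persona of the replies while keeping the user-facing question the same.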

Running Llama-2-chat on non-Paperspace IPU environments

To run the demo using IPU hardware other than in Paperspace, you can get the code from this repository.

Is Llama-2 right for me?

Llama-2 is a very powerful model for building your own generative text and chat applications. It comes with very competitive performance and a permissive license for research and commercialisation. Its good performance with a relatively small memory footprint makes it a viable and cost-effective model for wide adoption and deployment in production.

It is easy to access on IPUs on Paperspace and it can be expanded and deployed to handle a variety of applications.

You can contact Graphcore to discuss building and scaling applications using Llama in the cloud, or anything else.

Llama 2: Run Meta’s latest Open Source Generative Large Language Model on IPUs was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

OASST1 Fine-tuned Pythia-12B chatbot: open-source ChatGPT alternative

Graphcore — Wed, 31 May 2023 09:32:56 GMT

This powerful open-source chatbot has no usage restrictions, so is perfect for commercial applications. Try it now on Paperspace, powered by IPUs.

Author: Steve Barlow, Member of Engineering Team at Graphcore

One of the most exciting open-source chatbot alternatives to ChatGPT — OpenAssistant’s OASST1 fine-tuned Pythia-12B — is now available to run for free on Paperspace, powered by Graphcore IPUs. This truly open-source model can be used commercially without restrictions.

oasst-sft-4-pythia12b is a variant of EleutherAI’s Pythia model family, fine-tuned using the Open Assistant Conversations (OASST1) dataset, a crowdsourced “human-generated, human-annotated assistant-style conversation corpus”.

The OASST1 dataset consists of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees.

Running OASST1 Fine-tuned Pythia-12B inference on Paperspace

Open Assistant’s fine-tuned Pythia can easily be run on Graphcore IPUs using a Paperspace Gradient notebook. New users can try out Pythia on an IPU-POD4, with Paperspace’s six-hour free trial. For a higher performance implementation, you can scale up to an IPU-POD16.

The notebook guides you through creating and configuring an inference pipeline and running the pipeline to build a turn-by-turn conversation.

Because the OpenAssistant model uses the same underlying Pythia-12B model as Dolly, we run it using the Dolly pipeline.

Let’s begin by loading the inference config. We use the same configuration file as Dolly and manually modify the vocab size which is the only difference between the model graphs. A configuration suitable for your instance will automatically be selected.

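import os

from utils.setup import dolly_config_setup

# Assumption: number_of_ipus is not defined in the original snippet; here it is
# read from the NUM_AVAILABLE_IPU environment variable, defaulting to 4.
number_of_ipus = int(os.getenv("NUM_AVAILABLE_IPU", 4))

config_name = "dolly_pod4" if number_of_ipus == 4 else "dolly_pod16"
config, *_ = dolly_config_setup("config/inference.yml", "release", config_name)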

# Update vocab size for oasst-sft-4-pythia-12b-epoch-3.5 - 50288 rather than 50280
config.model.embedding.vocab_size = 50288

config

Next, we want to create our inference pipeline. Here we define the maximum sequence length and maximum micro batch size. Before executing a model on IPUs it needs to be turned into an executable format by compiling it. This will happen when the pipeline is created. All input shapes must be known before compiling, so if the maximum sequence length or micro batch size is changed, the pipeline will need to be recompiled.

Selecting a longer sequence length or larger batch size will use more IPU memory. This means that increasing one may require you to decrease the other.

This cell will take approximately 18 minutes to complete, which includes downloading the model weights.

import api

# changing these parameters will trigger a recompile.
sequence_length = 512  # max 2048
micro_batch_size = 1

# The pipeline is set to load the OpenAssistant checkpoint rather than the default Dolly one

# We override the Dolly prompt templating by specifying a prompt_format. Setting the format to
# just echo the instruction means that the pipeline does no formatting and it is up to the
# application to provide correctly templated prompts

# We set the text string that OpenAssistant uses to mark that it has finished generation, which
# is different from the Dolly one

oasst_pipeline = api.DollyPipeline(
    config,
    sequence_length=sequence_length,
    micro_batch_size=micro_batch_size,
    hf_dolly_checkpoint="OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5",
    prompt_format="{instruction}",
    end_key="<|endoftext|>",
)

Call the oasst_pipeline object you have just created to generate text from a prompt.

To make a chatbot, we will take user input and feed it to the model. So that the notebook can be tested automatically, we create a function similar to the Python built-in function input() that collects input from the user and returns it, but which will return canned input from a test environment variable EXAMPLE_PROMPTS instead if that variable is set. The variable should be set to a JSON list of strings - for example:

export EXAMPLE_PROMPTS='["Are you related to Dolly?", "Ah OK. How many islands are there in Scotland?"]'

import os
import json

example_prompts = os.getenv("EXAMPLE_PROMPTS")
if example_prompts is not None:
    example_prompts = json.loads(example_prompts)

# Get input from user like input() but with possibility of automation via EXAMPLE_PROMPTS environment variable
def auto_input(prompt: str) -> str:
    if example_prompts is None:
        return input(prompt)
    auto_in = example_prompts.pop(0) if example_prompts != [] else ""
    print(prompt, "AUTO", auto_in)
    return auto_in

A chatbot conversation is built up from a number of turns of user input and the model writing a reply. As a conversation develops, the prompt should be extended turn by turn, so the model has access to the full context.

The model has been trained on a specific prompt template to represent the conversation as it is built up:


<|prompter|>user1<|endoftext|> <|assistant|>reply1<|endoftext|> <|prompter|>user2<|endoftext|> <|assistant|>...
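The template can be assembled with a small helper. This is a minimal sketch: `build_prompt` and the three constants are illustrative names, not part of the pipeline API. It assumes turns are concatenated directly, with no separator between them.

```python
from typing import List, Tuple

PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def build_prompt(history: List[Tuple[str, str]], user_input: str) -> str:
    """Return the prompt for the next model call.

    history holds the completed (user, reply) turns; user_input is the new
    message. Turns are joined without separators (an assumption of this sketch).
    """
    prompt = ""
    for user, reply in history:
        prompt += f"{PROMPTER}{user}{END}{ASSISTANT}{reply}{END}"
    # Finish with an open assistant tag so the model writes the next reply.
    prompt += f"{PROMPTER}{user_input}{END}{ASSISTANT}"
    return prompt


example = build_prompt([("user1", "reply1")], "user2")
print(example)
```

Because the whole history is replayed on every call, the prompt grows with each turn and must stay within the sequence length the pipeline was compiled for.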

There are some optional parameters to the pipeline call that you can use to control the generation behaviour:

temperature – Indicates whether you want more or less creative output. A value of 1.0 corresponds to the model's default behaviour. Smaller values than this accentuate the next token distribution and make the model more likely to pick a highly probable next token. A value of 0.0 means the model will always pick the most probable token. Temperatures greater than 1.0 flatten the next token distribution making more unusual next tokens more likely. Temperature must be zero or positive.
k – Indicates that only among the highest k probable tokens can be sampled. This is known as "top k" sampling. Set to 0 to disable top k sampling and sample from all possible tokens. The value for k must be between a minimum of 0 and a maximum of config.model.embedding.vocab_size which is 50,288. The default is 5.
output_length – Sets a maximum output length in tokens. Generation normally stops when the model generates its end_key text, but can be made to stop before that by specifying this option. A value of None disables the limit. The default is None.
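To make the effect of temperature and k concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k sampling. It illustrates the general technique only and is not the pipeline's actual implementation; `sample_next_token` is a hypothetical helper.

```python
import math
import random
from typing import List, Optional


def sample_next_token(
    logits: List[float],
    temperature: float = 1.0,
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token index from raw logits using temperature and top-k."""
    if temperature < 0:
        raise ValueError("temperature must be zero or positive")
    if not 0 <= k <= len(logits):
        raise ValueError("k must be between 0 and the vocabulary size")
    rng = rng or random.Random()

    # Temperature 0.0: always take the most probable token (greedy decoding).
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # k == 0 disables top-k; otherwise keep only the k highest-scoring tokens.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = order if k == 0 else order[:k]

    # Dividing by the temperature sharpens (<1.0) or flattens (>1.0) the distribution.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 0.5, 1.9, 0.0]
print(sample_next_token(logits, temperature=0.0))       # greedy: index 1
print(sample_next_token(logits, temperature=1.0, k=1))  # top-1: index 1
```

With k=2 and a temperature above 1.0, the two strongest candidates (indices 1 and 4 here) are drawn with nearly equal probability, which is the "more creative" end of the range.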

You can start with any user input. For instance “What other animals are similar to Alpacas?”

import logging

# Conduct a complete conversation - with ability to set pipeline optional parameters
def chat(temperature=None, top_k=None, output_length=None):

    options = {}
    if temperature is not None:
        options["temperature"] = temperature
    if top_k is not None:
        options["k"] = top_k
    if output_length is not None:
        options["output_length"] = output_length

    # Suppress INFO logging to make a better interactive chat display
    logging.disable(logging.INFO)

    print("To complete the chat, enter an empty prompt")

    prompt = ""
    while True:
        user_input = auto_input("Prompter:")
        if user_input == "":
            break

        prompt += f"<|prompter|>{user_input}<|endoftext|><|assistant|>"
        chat_output = oasst_pipeline(prompt, **options)[0]
        prompt += chat_output + "<|endoftext|>"

    # Restore logging to what it was before
    logging.disable(logging.NOTSET)


chat(temperature=0.0)

# Note the first time you run this cell and enter a prompt, there will be a delay of ~1 minute where
# nothing appears to happen. This is where the server is attaching to the IPUs

See image below for an example of the model in action, using a different prompt.

Remember to detach your pipeline when you are finished to free up resources:

oasst_pipeline.detach()

Running OASST1 Fine-tuned Pythia in non-Paperspace IPU environments

To run the demo using IPU hardware other than in Paperspace, you need to have the Poplar SDK enabled. Refer to the Getting Started guide for your system for details on how to enable the Poplar SDK. Also refer to the Jupyter Quick Start guide for how to set up Jupyter to be able to run this notebook on a remote IPU machine.

OASST1 Fine-tuned Pythia-12B chatbot: open-source ChatGPT alternative was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

Running Flan-T5-XL inference in Float16 for IPU — how we did it

Graphcore — Tue, 30 May 2023 13:28:03 GMT

T5 has only worked with bfloat16, until now. Find out how Graphcore made it run on float16 hardware.

Author: Harry Mellor, AI Engineer at Graphcore

The T5 language model has proved hugely popular since it first appeared in Hugging Face Transformers. There have also been constant demands to make T5 runnable at float16 precision.

Until now, T5 has only worked with hardware that supports bfloat16, the format that the model was originally trained with. This has limited its use to select CPUs, TPUs beyond v2, and GPUs beyond A100.

The best alternative — using float32 — typically leads to exceeding hardware memory limits or simply taking too long to execute, compared to running in float16.

With the release of FLAN-T5, we were keen to offer these models running on our IPUs — which means using float16.

In this blog, we are delighted to present our FLAN-T5 for IPU solution. While this has been developed specifically for the T5 model, the methods are reusable and can help you in similar scenarios.

Porting T5 to float16 on IPU

Identifying dynamic parts of the computational graph

Before running the model we need to carry out a quick visual inspection of the model code to look for parts that won’t compile into a static graph. We found dynamic branching of the graph in the T5Block. Coincidentally, the branches that are created clamp the data if it has already overflowed in float16:

# clamp inf values to enable fp16 training 
if hidden_states.dtype == torch.float16 and torch.isinf(hidden_states).any(): 
    clamp_value = torch.finfo(hidden_states.dtype).max - 1000 
    hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)

We chose to remove the dynamic condition, torch.isinf(hidden_states).any(), from this branch* because:

We cannot statically compile this dynamic branching condition
While clamping the hidden states only treats the symptom of float16 issues, it is still needed for training and so cannot be removed entirely. See the “FeedForward’s down projection” section for details on how we treated the cause for inference.

*this change has also been made in the latest version of Transformers

Enabling Poplar’s floating-point exception detection

Our Poplar backend has floating-point exception detection built-in, which makes tracking down the source of numerical issues far more straightforward. The process consists of the following steps:

Enable floating-point exceptions in your application. In PopTorch, you can use opts.Precision.enableFloatingPointExceptions(True)
(For more information see the PopTorch User Guide)
Run your application with graph profiling enabled: POPLAR_ENGINE_OPTIONS='{"autoReport.all":"true", "autoReport.outputExecutionProfile": "false", "autoReport.directory":"./report"}'
For more details see the section on capturing IPU reports in the Graph Analyser User Guide*.
If a floating-point exception is triggered a poptorch_error.log file will be generated. Open this file and scroll down to (or search for) Backtrace. Find the ID nearest the top of the backtrace, denoted by (Id: 1234), and search for it in the graph profile’s program tree. From here you should be able to examine the debug information of the offending operation and figure out where in the model it came from.

*note that we use "autoReport.outputExecutionProfile": "false" to avoid the overhead of profiling the execution. We can do this because we are only interested in the program tree.
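When working in a notebook, it can be less error-prone to build the option string with `json.dumps` than to quote it by hand. The sketch below only sets the environment variable. It assumes this happens before the model is compiled, so that the setting is picked up.

```python
import json
import os

# Profiling options from the steps above; the execution profile is disabled
# because only the program tree is needed.
engine_options = {
    "autoReport.all": "true",
    "autoReport.outputExecutionProfile": "false",
    "autoReport.directory": "./report",
}

# json.dumps produces correctly quoted JSON for the environment variable.
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps(engine_options)
print(os.environ["POPLAR_ENGINE_OPTIONS"])
```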

Using this method, we solved the rest of the floating-point exceptions.

Resolving the floating-point exceptions

Attention Masking

The first two exceptions were found in the attention masking. In two places the attention mask was “inverted” and used additively. The mask value was set to torch.finfo(torch.float16).min (i.e. -65504) and the pass value was set to 0. This was done so that when the masked attention values are passed to softmax they have minimum relevance in the resulting output. However, if what you were masking was negative and had an absolute value greater than the resolution of float16 at -65504, then you would end up with a negative infinity:

>>> torch.tensor([-65504], dtype=torch.float16) - 10 
tensor([-65504.], dtype=torch.float16) 
>>> torch.tensor([-65504], dtype=torch.float16) - 100 
tensor([-inf], dtype=torch.float16)

We solved these two exceptions by simply scaling the mask down by 25%, meaning that you could have attention values as low as -16376 without the mask causing an overflow.
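The headroom gained by the smaller mask can be checked without an IPU. The sketch below simulates float16 rounding with Python's `struct` module. `to_fp16` is a helper written for this illustration, and multiplying the mask by 0.75 is an assumption about how "scaling down by 25%" is applied.

```python
import math
import struct

FP16_MIN = -65504.0  # most negative finite float16 value


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


full_mask = to_fp16(FP16_MIN)
scaled_mask = to_fp16(0.75 * FP16_MIN)  # mask scaled down by 25%

score = -100.0  # a negative attention score at a masked position
print(to_fp16(full_mask + score))         # -inf: the full mask overflows
print(to_fp16(scaled_mask + score))       # finite
print(to_fp16(scaled_mask + -16376.0))    # still finite at the quoted limit
```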

GeLU approximated by tanh

The third exception was found in the explicit definition of the tanh GeLU approximation used by the FLAN-T5 model (the original T5 model used ReLU activations). The formula

0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))

cubes the input, which will cause an overflow if the absolute value of the input is larger than approximately 39. We fixed this by reverting to ReLU when the input was larger than 39, which is a safe approximation to make since ReLU==GeLU when the absolute value of the input is >5.
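A scalar sketch of the guarded activation is shown below. It is an illustration rather than the code used in the port. `gelu_tanh_guarded` and `to_fp16` are hypothetical helpers, and guarding on the absolute value of the input is an assumption.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def gelu_tanh(x: float) -> float:
    """tanh approximation of GeLU, as in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_tanh_guarded(x: float, threshold: float = 39.0) -> float:
    """Fall back to ReLU for large |x| so the cube is never formed."""
    if abs(x) > threshold:
        return max(x, 0.0)
    return gelu_tanh(x)


print(to_fp16(39.0 ** 3))        # finite in float16
print(to_fp16(41.0 ** 3))        # inf: the cube no longer fits in float16
print(gelu_tanh_guarded(50.0))   # 50.0, taken from the ReLU branch
print(abs(gelu_tanh(6.0) - 6.0)) # tiny: GeLU and ReLU agree for large inputs
```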

Pre-norm residual connections

The fourth exception was found in the residual additions in the encoder’s FF layers. We were seeing that, when the output of the FF network was added to its input, the operation was overflowing. We solved this by:

1. Cast the embeddings input to the first encoder block to float32
2. For the SelfAttention and FeedForward layers in every encoder block:
   - Cast to float16 after LayerNorm* so that the bulk of the compute still happens in float16
   - Cast to float32 after the dropout before adding to the float32 residual connection
3. Cast the output of final_layer_norm* after all the encoder blocks back to float16 ready for the decoder, which is all float16

*this actually happened automatically because of the way that LayerNorm was implemented for T5

The following diagrams are colour coded as follows to represent the precision of the data:

The T5 encoder consists of a chain of blocks, each block contains a SelfAttention layer and a FeedForward layer:

Each of these layers has the same fundamental structure, with the only difference being the Attention/Hidden layer:

After the casting changes mentioned in step 2 above, these layers look like:

This prevents overflow in the pre-norm residuals that get passed all the way through the encoder.
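The effect of keeping the residual stream in higher precision can be illustrated with a toy accumulation. This sketch uses made-up activation values, simulates float16 with Python's `struct` module, and uses ordinary Python floats as a stand-in for float32. `to_fp16` is a helper written for this illustration.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


# Hypothetical layer outputs, each computed in float16.
layer_outputs = [16384.0] * 4

fp16_stream = 0.0   # residual stream held in float16
mixed_stream = 0.0  # residual stream held in higher precision

for out in layer_outputs:
    out16 = to_fp16(out)
    # All-float16: the running sum is rounded back to float16 after every add.
    fp16_stream = to_fp16(fp16_stream + out16)
    # Mixed: the float16 layer output is cast up before the residual add.
    mixed_stream = mixed_stream + out16

print(fp16_stream)   # inf
print(mixed_stream)  # 65536.0
```

Each layer's compute still happens in float16; only the running sum that is carried from block to block is widened.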

FeedForward’s down projection

The final floating-point exception was found in the down projection in the Hidden part of the encoder’s FeedForward layer. In the code this is the wo layer, which we shall refer to as DownProject for clarity. Currently, the FeedForward layer and its Hidden component look like this:

We were able to resolve the overflow in DownProject by scaling down its input and then scaling up its output once it was safely in float32 again.

The scaling factor was chosen by examining the standard deviation of the activations coming out of DownProject and identifying a suitable power of 2 that would tame these activations. We want to use a power of two because then only the exponents of the float16 activations need to be changed, avoiding lossy modification of the mantissa.

We found that the standard deviation was ~4400 and so we chose 8 as our scaling factor to reduce the standard deviation to ~550. After implementing this scaling, the FeedForward layer and its Hidden component look like this:
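Both properties of the power-of-two scaling can be checked in a few lines: dividing by 8 is exact in float16, and a dot product that overflows without scaling survives with it. The values below are made up for illustration, `to_fp16` and `dot_fp16` are hypothetical helpers, and Python floats stand in for float32.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow becomes +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)


def dot_fp16(xs, ws):
    """Dot product with every product and partial sum rounded to float16."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(w)))
    return acc


SCALE = 8.0  # power of two: only the exponent bits change

# Dividing by 8 is lossless, dividing by 10 is not.
v = to_fp16(4403.7)
print(to_fp16(v / SCALE) == v / SCALE)  # True
print(to_fp16(v / 10.0) == v / 10.0)    # False

hidden = [256.0] * 4
weights = [64.0] * 4

unscaled = dot_fp16(hidden, weights)
scaled = dot_fp16([h / SCALE for h in hidden], weights)
restored = scaled * SCALE  # scale back up in higher precision

print(unscaled)  # inf
print(restored)  # 65536.0
```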

The solution to this problem in the latest version of Transformers keeps this layer in float32 at all times.

Validation

Since we’ve changed a few things in the model, you’re probably wondering if the model still performs as it is supposed to. We wondered this too, and so validated it on a subset* of the MMLU benchmark on CPU in float32 and on IPU in float16. The CPU and IPU achieved overall averages of 49.3% and 49.4% respectively, indicating that we have not degraded the performance of the original model.

*Our current implementation of FLAN-T5-XL has a maximum input length of 896 tokens, so we used the subset of MMLU where the examples did not exceed this length.

Conclusion

With this, we now have a FLAN-T5-XL implementation that can be used for inference on IPU in float16. Please head over to Paperspace to try it out for yourself!

You can also try other variations of T5 on IPUs:

Flan-T5: sweet results with the smaller, more efficient LLM

Graphcore — Tue, 30 May 2023 13:27:48 GMT

Flan-T5 offers outstanding performance for a range of NLP applications, even compared to very large language models. Try now on Paperspace, powered by IPUs

Author: Harry Mellor, AI Engineer at Graphcore

In the world of AI language models, there’s no one-size-fits-all solution.

Commercial users are increasingly coming to the realisation that Ultra-Large Language Models, while broadly capable, are AI overkill for many applications.

The penny (or dollar) usually drops when they receive an outsize bill from the owners of their preferred proprietary model, or cloud compute provider. That’s assuming they can even secure GPU availability for the A100 and H100 systems needed to run advanced models.

Instead, many are looking to more efficient, open-source alternatives to the likes of GPT-3/4.

Flan T5

In December 2022 Google published Scaling Instruction-Finetuned Language Models in which they perform extensive fine-tuning for a broad collection of tasks across a variety of models (PaLM, T5, U-PaLM).

Part of this publication was the release of Flan-T5 checkpoints, “which achieve strong few-shot performance” with relatively modest parameter counts “even compared to much larger models” like the largest members of the GPT family.

In this blog, we will show how you can use Flan-T5 running on a Paperspace Gradient Notebook, powered by Graphcore IPUs. Flan-T5-Large can be run on an IPU-POD4, using Paperspace’s six hour free trial, while Flan-T5-XL can be run on a paid IPU-POD16.

We will look at a range of common NLP workloads and consider the following:

How good is Flan-T5, really?
How do I run Flan-T5 on IPUs?
What can I use Flan-T5 for?
Why would I move up to Flan-T5-XL?

How good is Flan-T5, really?

Let’s start by looking at some performance numbers from the Google-authored paper:

Part of table 5 from Scaling Instruction-Finetuned Language Models

These results are astounding. Notice that:

Flan-T5 performs ~2x better than T5 in MMLU, BBH & MGSM
In TyDiQA we even see the emergence of new abilities
Flan-T5-Large is better than all previous variants of T5 (even XXL)

This establishes Flan-T5 as an entirely different beast to the T5 that you may know. Now let’s see how Flan-T5-Large and Flan-T5-XL compare to other models in the MMLU benchmark:

Part of the MMLU leaderboard from Papers With Code (CoT = Chain of Thought)

Noting that Flan-T5 had MMLU held out from training, this table shows that:

Flan-T5-Large and Flan-T5-XL (with 0.8B and 3B parameters respectively) perform similarly to other models with significantly more parameters, for example GPT-3 (175B parameters) and Galactica (120B parameters).
GPT-3 needs to be fine-tuned for the benchmark task in order to beat Flan-T5-XL.
Flan-T5 outperforms smaller versions of more recent LLMs like PaLM and LLaMA (while also being multiple times smaller).

How do I run Flan-T5 on IPUs?

Since the Flan-T5 checkpoints are available on Hugging Face, you can use Graphcore’s Hugging Face integration (🤗 Optimum Graphcore) to easily run Flan-T5 with a standard inference pipeline.

If you already have an existing Hugging Face-based application that you’d like to try on IPUs, then it is as simple as:

- from transformers import pipeline 
+ from optimum.graphcore import pipeline 
 
- text_generator = pipeline("text2text-generation", model="google/flan-t5-large") 
+ text_generator = pipeline("text2text-generation", model="google/flan-t5-large", ipu_config="Graphcore/t5-large-ipu") 

text_generator("Please solve the following equation: x^2 - 9 = 0") 
[{'generated_text': '3'}]

Now let’s define a text generator of our own to use in the rest of this notebook. First, make sure that your Python virtual environment has the latest version of 🤗 Optimum Graphcore installed:

%pip install "optimum-graphcore>=0.6.1, <0.7.0"

The location of the cache directories can be configured through environment variables or directly in the notebook:

import os 
executable_cache_dir=os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/") 
num_available_ipus=int(os.getenv("NUM_AVAILABLE_IPU", 4))

Next, let’s import pipeline from optimum.graphcore and create our Flan-T5 pipeline for the appropriate number of IPUs:

from optimum.graphcore import pipeline 
 
size = {4: "large", 16: "xl"} 
flan_t5 = pipeline( 
    "text2text-generation", 
    model=f"google/flan-t5-{size[num_available_ipus]}", 
    ipu_config=f"Graphcore/t5-{size[num_available_ipus]}-ipu", 
    max_input_length=896,
) 
flan_t5.model.ipu_config.executable_cache_dir = executable_cache_dir

Now, let’s ask it some random questions:

questions = [ 
    "Solve the following equation for x: x^2 - 9 = 0", 
    "At what temperature does nitrogen freeze?", 
    "In order to reduce symptoms of asthma such as tightness in the chest, wheezing, and difficulty breathing, what do you recommend?", 
    "Which country is home to the tallest mountain in the world?" 
] 
for out in flan_t5(questions): 
    print(out)

Graph compilation: 100%|██████████| 100/100 [05:20<00:00] 
Graph compilation: 100%|██████████| 100/100 [02:56<00:00] 
 
 
{'generated_text': '3'} 
{'generated_text': '-32 °C'} 
{'generated_text': 'ibuprofen'} 
{'generated_text': 'nepal'}

Note that some of these answers may be wrong: information retrieval from the model itself is not the purpose of Flan-T5. However, if you use Flan-T5-XL they are less likely to be wrong (come back to this notebook with an IPU-POD16 to see the difference!).

What can I use Flan-T5 for?

Flan-T5 has been fine-tuned on thousands of different tasks across hundreds of datasets. So no matter what your task might be, it’s worth seeing if Flan-T5 can meet your requirements. Here we will demonstrate a few of the common ones:

Sentiment Analysis

sentiment_analysis = ( 
    "Review: It gets too hot, the battery only can last 4 hours. Sentiment: Negative\n" 
    "Review: Nice looking phone. Sentiment: Positive\n" 
    "Review: Sometimes it freezes and you have to close all the open pages and then reopen where you were. Sentiment: Negative\n" 
    "Review: Wasn't that impressed, went back to my old phone. Sentiment:" 
) 

flan_t5(sentiment_analysis)[0]["generated_text"]

Negative
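Few-shot prompts like this one follow a regular pattern, so they can be generated from labelled examples. The helper below is a minimal sketch. `build_few_shot_prompt` is a hypothetical name, and its output reproduces the sentiment_analysis string used above.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    input_label: str = "Review",
    output_label: str = "Sentiment",
) -> str:
    """Format labelled examples plus an unlabelled query as one prompt."""
    lines = [
        f"{input_label}: {text} {output_label}: {label}"
        for text, label in examples
    ]
    # The final line leaves the label empty for the model to complete.
    lines.append(f"{input_label}: {query} {output_label}:")
    return "\n".join(lines)


examples = [
    ("It gets too hot, the battery only can last 4 hours.", "Negative"),
    ("Nice looking phone.", "Positive"),
    (
        "Sometimes it freezes and you have to close all the open pages "
        "and then reopen where you were.",
        "Negative",
    ),
]
prompt = build_few_shot_prompt(
    examples, "Wasn't that impressed, went back to my old phone."
)
print(prompt)
```

The resulting string can be passed to the pipeline in the same way as the hand-written prompt.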

Advanced Named Entity Recognition

The following snippets are adapted from the Wikipedia pages corresponding to each mentioned company.

advanced_ner = """Microsoft Corporation is a company that makes computer software and video games. Bill Gates and Paul Allen founded the company in 1975 
[Company]: Microsoft, [Founded]: 1975, [Founders]: Bill Gates, Paul Allen 
 
Amazon.com, Inc., known as Amazon , is an American online business and cloud computing company. It was founded on July 5, 1994 by Jeff Bezos 
[Company]: Amazon, [Founded]: 1994, [Founders]: Jeff Bezos 
 
Apple Inc. is a multinational company that makes personal computers, mobile devices, and software. Apple was started in 1976 by Steve Jobs and Steve Wozniak.""" 

flan_t5(advanced_ner)[0]["generated_text"]

[Company]: Apple, [Founded]: 1976, [Founders]: Steve Jobs, Steve Wozniak

Question Answering

The following snippet came from the SQuAD dataset.

context = 'Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24-10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi\'s Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.'
question = "Which NFL team represented the AFC at Super Bowl 50?"
# The correct answer is Denver Broncos
flan_t5(f"{context} {question}")[0]['generated_text']

Denver Broncos

Intent Classification

intent_classification = """[Text]: I really need to get a gym membership, I'm exhausted. 
[Intent]: get gym membership 
 
[Text]: What do I need to make a carbonara? 
[Intent]: cook carbonara 
 
[Text]: I need all these documents sorted and filed by Monday. 
[Intent]:""" 

flan_t5([intent_classification])[0]["generated_text"]

file documents

Summarization

The following snippets came from the XSum dataset.

summarization=""" 
Document: Firstsource Solutions said new staff will be based at its Cardiff Bay site which already employs about 800 people. 
The 300 new jobs include sales and customer service roles working in both inbound and outbound departments. 
The company's sales vice president Kathryn Chivers said: "Firstsource Solutions is delighted to be able to continue to bring new employment to Cardiff." 
Summary: Hundreds of new jobs have been announced for a Cardiff call centre. 
 
Document: The visitors raced into a three-goal first-half lead at Hampden. 
Weatherson opened the scoring with an unstoppable 15th-minute free-kick, and he made it 2-0 in the 27th minute. 
Matt Flynn made it 3-0 six minutes later with a fine finish. 
Queen's pulled a consolation goal back in stoppage time through John Carter. 
Summary: Peter Weatherson netted a brace as Annan recorded only their second win in eight matches. 
 
Document: Officers searched properties in the Waterfront Park and Colonsay View areas of the city on Wednesday. 
Detectives said three firearms, ammunition and a five-figure sum of money were recovered. 
A 26-year-old man who was arrested and charged appeared at Edinburgh Sheriff Court on Thursday. 
Summary: 
""" 
flan_t5(summarization)[0]["generated_text"]

A man has been arrested after a firearm was found in a property in Edinburgh.

Text Classification

text_classification_1 = """A return ticket is better value than a single. 
topic: travel cost 

You can start from the basic stitches, and go from there. 
topic: learning knitting 

The desk which I bought yesterday is very big. 
topic: furniture size 

George Washington was president of the United States from 1789 to 1797. 
topic:""" 

flan_t5(text_classification_1)[0]["generated_text"]

George Washington presidency

text_classification_2 = """FLAN-T5 was released in the paper Scaling Instruction-Finetuned Language Models - it is an enhanced version of T5 that has been finetuned in a mixture of tasks. 
keywords: released, enhanced, finetuned 

The IPU, or Intelligence Processing Unit, is a highly flexible, easy-to-use parallel processor designed from the ground up for AI workloads. 
keywords: processor, AI 

Paperspace is the platform for AI developers. providing the speed and scale needed to take AI models from concept to production. 
keywords:""" 

flan_t5(text_classification_2)[0]["generated_text"]

paperspace, AI, scale

Why would I move up to Flan-T5-XL?

As we saw earlier, when looking at the results from the paper, Flan-T5-XL is roughly 40% (on average) better than Flan-T5-Large across its validation tasks. Therefore when deciding if Flan-T5-XL is worth the cost for you, ask yourself the following questions:

Does my data need greater linguistic understanding for the task to be performed?
Is my task too complicated for a model as small as Flan-T5-Large and too easy for a model as large as GPT-3?
Does my task require longer output sequences that Flan-T5-XL is needed to generate?

To demonstrate, let us now look at an example of a task where the answer to all of the above questions is yes. Let’s say you have a customer service AI that you use to answer basic questions in order to reduce the workload of your customer service personnel. This needs:

Strong linguistic ability to both parse and generate medium-sized chunks of text.
An LLM that is able to learn well from context, but doesn’t have all of human history embedded in its parameters.
The ability to produce multiple-sentence responses, but not much longer than this.

Looking at the code below, we see some context about Graphcore provided in the input, as well as a primer for a conversational response from the model. As you can see from the example, Flan-T5-XL was able to understand the information provided in the context and provide useful and natural answers to the questions it was asked.

from IPython.display import clear_output 
 
class ChatBot: 
    def __init__(self, model, context) -> None: 
        self.model = model 
        self.initial_context = context 
        self.context = self.initial_context 
        self.user, self.persona = [x.split(":")[0] for x in context.split("\n")[-2:]] 
 
    def ask(self, question): 
        question += "." if question[-1] not in [".", "?", "!"] else "" 
        x = f"{self.context}\n{self.user}: {question}\n{self.persona}: " 
        # print(f"\n{x}\n") 
        y = self.model(x) 
        response = y[0]["generated_text"] 
        self.context = f"{x}{response}" 
        return response 
 
    def session(self): 
        print("Starting session", flush=True) 
        prompt = input() 
        while prompt != "": 
            if prompt == "reset": 
                clear_output() 
                print("Starting session", flush=True) 
                self.context = self.initial_context 
                prompt = input() 
            print(f"{self.user.title()}: {prompt}", flush=True) 
            answer = self.ask(prompt) 
            print(f"{self.persona.title()}: {answer}", flush=True) 
            prompt = input() 
        print("Ending session", flush=True)

context = f"""This is a conversation between a [customer] and a [virtual assistant].
The [virtual assistant] works at Graphcore. Here is some information about Graphcore:
- Graphcore is located in Bristol. 
- Graphcore invented the intelligence processing unit (IPU). It is purpose built for AI applications. 
- The currently available IPU models are: Classic IPU, Bow IPU, C600. 
- IPUs are available on: Paperspace, Gcore Cloud and Graphcloud. 
 
[virtual assistant]: Hello, welcome to Graphcore, how can I help you today? 
[customer]: I'd like to ask some questions about your company. 
[virtual assistant]: Ok, I can help you with that.""" 
chatbot = ChatBot(flan_t5, context) 
chatbot.session()

Starting session 
[Customer]: What is an IPU? 
[Virtual Assistant]: The Intelligence Processing Unit (IPU) is a computer chip that is used to process artificial intelligence. 
[Customer]: Who makes it? 
[Virtual Assistant]: Graphcore is the manufacturer of the IPU. 
[Customer]: Can I use them? 
[Virtual Assistant]: Yes, I'm sure you can. 
[Customer]: Where? 
[Virtual Assistant]: The IPU is available on Paperspace, Gcore and Graphcloud. 
Ending session

flan_t5.model.detachFromDevice()

Conclusion

In summary, the answers to the questions we posed in the introduction are:

How good is Flan-T5, really?

A: Twice as good as T5 and on par with GPT-3 according to the MMLU benchmark.

How do I run Flan-T5 on IPUs?

A: Change one import and add one keyword argument to your pipeline instantiation.

What can I use Flan-T5 for?

A: Given its wide variety of fine-tuned tasks, almost anything.

Why would I move up to Flan-T5-XL?

A: For an approximately 40% performance increase over Flan-T5-Large, enabling more demanding tasks.

If you’d like to learn more about how we got T5 to work properly in Float16, see our technical blog on the subject.

You can also try other variations of T5 on IPUs:

If you’d like to continue exploring NLP on the IPU, take a look at our GPT-J Fine-Tuning blog and corresponding notebook.

Small language models, high performance: DeBERTa and the Future of NLU

Graphcore — Tue, 23 May 2023 17:07:55 GMT

DeBERTa is an efficient language model, optimised for Natural Language Understanding (NLU). Try for free on Paperspace Gradient Notebooks powered by IPUs.

Author: Luke Markham, Field Application Engineer at Graphcore

Large Language Models (LLMs) such as OpenAI’s GPTs, Google’s Bard, and Meta’s LLaMA have popularised the concept of Natural Language Processing in AI and are now finding a growing number of commercial uses.

While much attention has been focused on the generative capabilities of such models, many NLP applications require Natural Language Understanding (NLU), rather than generation.

NLU is used in chatbots and virtual assistants, enabling them to understand user queries and navigate conversation flow. It also plays a critical role in search engines, where it helps to retrieve relevant information based on user queries.

The healthcare industry is increasingly deploying NLU to extract information from patient records and assist doctors in making more accurate diagnoses.

Perhaps because certain high-profile LLMs have demonstrated broad capabilities, some users are turning to them for NLU applications, but this may prove to be computational overkill.

In this article, we’ll explore how smaller models such as Microsoft’s DeBERTa can achieve surprising performance on NLU tasks.

Beyond text-based Interfaces

The usefulness of NLP-powered systems has advanced hugely in recent years. However, there are limitations to primarily text-based interfaces such as chatbots and virtual assistants:

Communicating by text alone can be challenging when dealing with complex information, such as medical diagnoses or financial advice, which may require visual aids such as diagrams, images, graphs, or maps.

Conveying emotion and tone through text is also difficult and can lead to misunderstandings or misinterpretations, particularly in customer service applications.

Finally, there is the issue of cognitive overload, which occurs when users are presented with too much text at once, leading to confusion and frustration.

To address these problems, NLP applications can incorporate other forms of media, such as images, graphs, and maps, into their UI/UX design.

NLU models play a critical role in this process by creating the structured data formats required for these designs.

For example, a weather app could use a chatbot interface that also incorporates graphs and maps to convey information more effectively, with NLU models extracting relevant information from user input and converting it into a structured format.

Cost-Efficiency of Smaller Models

Large, complex LLMs like GPT-3/4 and T5 aren’t always the most efficient for these sorts of tasks. While the simplicity of setting them up can be seductive, they are often computationally expensive which, of course, translates into being financially expensive.

Using smaller models like DeBERTa can lead to significant savings while maintaining high levels of accuracy. In many cases, these smaller models can even outperform larger models on specific tasks.

Because smaller models require less computational power to train and use, they can be faster and more accessible. The smaller size of these models also allows them to be deployed on smaller devices, making them ideal for edge computing and other resource-constrained environments.

DeBERTa

One of the most popular Natural Language Understanding architectures is DeBERTa, a transformer-based model that achieves state-of-the-art results in a variety of NLU tasks, including question answering, natural language inference, and sentiment analysis.

DeBERTa is a more efficient variant of the popular language model BERT, specifically designed for Natural Language Understanding tasks. It addresses some of BERT’s limitations, such as the inability to model long-range dependencies and the lack of robustness to noisy text.

DeBERTa outperforms BERT across the board and exceeds the NLU performance of the majority of larger and more recent language models.

One reason for DeBERTa’s success is its novel architecture, which allows for better attention across the input sequence through techniques such as attention factors and relative position bias. This helps DeBERTa achieve high accuracy with fewer parameters.

It is believed that on many NLU tasks — such as SQuAD — bidirectional encoders, as adopted by DeBERTa, considerably outperform the left-to-right decoders used in the GPT models [1]

On benchmark datasets such as SuperGLUE, DeBERTa also outperforms larger, more complex models such as GPT-3 and T5, while using a fraction of the number of parameters.

Try out DeBERTa-Base inference for yourself for free by launching the Paperspace Gradient Notebook, powered by Graphcore IPUs.

[1] Mike Lewis et al., BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Small language models, high performance: DeBERTa and the Future of NLU was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

Packed BERT: How to accelerate fine-tuning and inference for NLP tasks with packing

Arsalan Uddin — Wed, 17 May 2023 16:02:52 GMT

Learn how to achieve 6x faster BERT fine-tuning and up to 12x faster batched inference with packing using a classification example

Image by author.

As large language models (LLMs) like the GPT (Generative pre-trained transformer) family capture the public imagination, the practicality of smaller Transformer models like BERT should not be underestimated. While language understanding tasks may work well with models like GPT, it is usually faster and cheaper to use smaller models like BERT, which give you the same or better performance, because they are trained specifically for focused language understanding tasks, while using much less compute and energy.

At Graphcore, we have recently worked on advancing a method called packing for fine-tuning and inference to optimise natural language processing (NLP) models at the application level. This blog post will explain the concept of packing for fine-tuning NLP tasks, and show you how to use it with easy-to-use utilities for Hugging Face on IPUs.

A Graphcore customer, Pienso, uses this technique in their production text analytics and insight platform, which you can read more about here.

Why do we want to speed up narrow tasks?

Packing is particularly beneficial for creating solutions that are online-capable, meaning they:

Scale well
Handle large incoming loads of data well
Consume less compute with no performance degradation
Reduce the overall time of fine-tuning/inference tasks

The technique is throughput-focused and aims to eliminate as much computational waste as possible. It works well with increasing scale, creating large effective increases in batch size with negligible overhead, which makes it useful for production as well as research tasks.

As an example, implementing packing for multi-label classification with the GoEmotions dataset showed a 6x speedup for fine-tuning and a 9x speedup for inference workloads, bringing model processing to near real-time speeds!

Note that packing is not specific to BERT and is in theory applicable to any model which processes data on a token-by-token basis with no or minimal cross-token interaction. It can potentially also be applied to genomics and protein folding models, and other transformer models. It is worth noting, however, that its applicability is dependent on the structure of the dataset used, as described in the next section.

This implementation for fine-tuning and inference tasks was inspired by and builds on the work done to create Packed BERT for pre-training.

What is packing?

Put simply, packing as a concept for LLM inputs means to concatenate multiple tokenized sequences together into a single input, which we can call a ‘pack’.

A surprising number of datasets have length distributions heavily skewed towards shorter sequences, while transformer models receive fixed-size inputs. This is often handled by simply padding the part of the input that isn’t used by the sequence with unused values.

Packing sequences together works to eliminate padding, exploiting the unused space to mitigate computational waste, while maintaining the benefits of a model represented as a static graph with constant-sized inputs. It also means more sequences are being processed per batch, with multiple sequences (within a pack) being processed in parallel on a token level. This effectively increases batch size, with minimal overhead, and brings with it massive throughput benefits.
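To make the padding waste concrete, here is a minimal sketch that compares the fraction of real tokens with and without packing. The sequence lengths are made-up illustrative values, not taken from any real dataset.

```python
# Hypothetical token counts for six short sequences (illustrative only).
max_seq_length = 256
sequence_lengths = [12, 30, 45, 8, 100, 21]

# Unpacked: every sequence occupies one full-length, mostly padded input.
real_tokens = sum(sequence_lengths)
unpacked_slots = len(sequence_lengths) * max_seq_length
unpacked_efficiency = real_tokens / unpacked_slots

# Packed: all six sequences fit into a single input, since their
# combined length does not exceed max_seq_length.
packed_slots = max_seq_length
packed_efficiency = real_tokens / packed_slots

print(f"unpacked: {unpacked_efficiency:.1%} real tokens")
print(f"packed:   {packed_efficiency:.1%} real tokens")
```

In this toy case the model processes one input instead of six, which is where the throughput benefit comes from.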

Dataset length distributions

The time-optimisation that packing provides relies on datasets with sequence length distributions which are skewed to the shorter side of the maximum sequence length. It works by fitting multiple sequences into one sample, essentially treating each sample as a batch within a batch.

For training, increasing effective batch size in this way means further hyperparameter tuning is needed. For inference, since hyperparameter consideration is not needed, even higher throughput can be theoretically achieved.

The code walkthrough in this article demonstrates a 9x speed-up for inference and 6x speedup for training.

For perspective, we undertook a brief analysis of the dataset characteristics of some of the most popular fine-tuning text classification datasets on Hugging Face Hub, based on number of downloads. The pool contains 10 of the most popular datasets, with most subsets included, barring a couple which were prohibitively large relative to others. In total, this makes up 39 datasets.

Each dataset was tokenized with an arbitrarily chosen short maximum sequence length of 256, and only the list of sequence lengths for each sample was extracted. These sequence lengths were used to create a length distribution histogram from 0 to 256 with a bin size of 5:
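As a rough sketch of how such a histogram could be built from a list of tokenized lengths, the snippet below bins lengths in steps of 5 up to 256. The helper name length_histogram and the sample lengths are hypothetical, and this is not the code used to produce the figure.

```python
from collections import Counter

def length_histogram(sequence_lengths, max_len=256, bin_size=5):
    """Count sequences per bin; bin i covers lengths [i*bin_size, (i+1)*bin_size)."""
    n_bins = max_len // bin_size + 1  # final bin holds lengths up to max_len
    counts = Counter(min(length, max_len) // bin_size for length in sequence_lengths)
    return [counts.get(i, 0) for i in range(n_bins)]

# Hypothetical tokenized lengths (truncated to 256 by the tokenizer).
lengths = [3, 4, 7, 12, 12, 14, 40, 256]
hist = length_histogram(lengths)
print(hist[:10])
```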

Sequence length histograms for 10 of the most popular Hugging Face Hub datasets (39 subsets in total) for sequence classification. Image by author.

Observing this histogram, it is apparent that there is a strong skew towards shorter length sequences in the majority of the datasets, with a small number of components at longer lengths, constituted mostly by datasets which focus on long contexts.

Why does packing work with language models like BERT?

BERT is an encoder heavy natural language model commonly used for language analysis and prediction tasks. Text samples are passed to BERT after being tokenized (mapped to a word-specific integer value that corresponds to the model vocabulary). For BERT, each sequence of tokens is processed in parallel, on a token-by-token basis, with positional, syntactic and semantic information about the token encoded within the sample’s embeddings, which are learned over each sequence through the Transformer multi-headed self-attention mechanism.

This is incredibly useful: it means that, in theory, individual token information can be analysed with respect to the sequence it is part of, without interference from other sequences passed within the input. This behaviour is already exhibited by transformers in datasets which contain more than one sentence being part of a single input sequence.

A key element of transformers is the attention mask, which allows the self attention to concentrate its context oriented token-specific learning to a particular part of the sequence.

This allows us to pack an input with multiple sequences to achieve the desired throughput benefit. In practice, there are a few tweaks to make to the model, particularly at the output stage, in order to be able to perform classification and prediction tasks with packed inputs. Take a look at the diagram below for a straightforward visualisation of how we can pack multiple sequences to mitigate computational waste from padding and speed up the model, as compared to BERT without packing:

By specifying a sequence specific attention mask inside a single input with multiple sequences, we can classify multiple sequences in one input! Image by author.

There are essentially 3 stages to packing, particularly for fine-tuning:

An algorithm that is as fast as possible at packing as many sequences as possible together, as close to optimally as possible.
Adjusting the model’s input to interpret a single input as multiple sequences rather than a single sequence and padding using the attention mask.
Adjusting the model’s output to unravel the packed input back into separate sequences for final cross-token calculations, such as the loss.

The packing algorithm

Three algorithms are outlined in the original packing implementation for pre-training:

Non-negative least squares histogram packing (NNLSHP)
Shortest-pack-first histogram packing (SPFHP)
Longest-pack-first histogram packing (LPFHP)

These algorithms use the sequence length histogram to try to optimally create packs of sequences with varying lengths to minimise the total number of packs (inputs to the model).

In previous implementations for large dataset pre-training, the most optimal possible configuration of lengths was achieved using NNLSHP. This algorithm is explained neatly in the original article:

“The tricky part was the strategy matrix. Each column has a maximum sum of three and encodes which sequences get packed together to exactly match the desired total length; 512 in our case. The rows encode each of the potential combinations to reach a length the total length. The strategy vector x is what we were looking for, which describes how often we choose whichever of the 20k combinations. Interestingly, only around 600 combinations were selected at the end. To get an exact solution, the strategy counts in x would have to be positive integers, but we realised that an approximate rounded solution with just non-negative x was sufficient. For an approximate solution, a simple out-of-the-box solver could be used to get a result within 30 seconds.”

https://medium.com/media/a9f44d95d67771f2fe3e58fd197cdadf/href

The disadvantages of this approach are that it:

Increases significantly in complexity for a sequences-per-pack value greater than 3.
Takes up to 30 seconds to solve for at most 3 sequences per pack. This is reasonable for pre-training, but for online-capable tasks, and for fine-tuning and inference on smaller datasets, it can be far too long, especially when entire training runs can be completed in a matter of seconds.

Skewed distributions in small fine-tuning datasets allow packing many more sequences per pack, which is infeasible with NNLSHP, so let’s take a look at a simpler and more adaptive algorithm: SPFHP.

https://medium.com/media/56d00e2f0c3aa972406a6d7ff8aad22a/href

SPFHP scales very well to an increasing number of sequences per pack. It operates on a histogram of lengths sorted from longest to shortest, looks at each sequence in turn, and checks whether it fits into any existing pack. If it fits into exactly one pack, it is placed in that pack without considering any future sequences. If it fits into several, it is placed into the pack with the shortest total sequence length. If it fits into none, a new pack is started.

It is more appropriate for small to medium sized datasets and its complexity is not increased by increasing the number of packs. It solves for any given dataset in almost constant time, taking under 0.02 seconds for up to 16 million samples. The complexity of SPFHP increases with sequence length rather than dataset size or sequences-per-pack, so it remains relatively constant for different dataset sizes!
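The greedy idea behind SPFHP can be sketched in a few lines. Note that this is a simplified illustration that loops over individual sequences rather than over histogram counts, so it does not have the near-constant runtime of the real implementation. The function name spfhp_sketch and the example lengths are hypothetical.

```python
def spfhp_sketch(sequence_lengths, max_seq_length, max_seq_per_pack):
    """Greedy shortest-pack-first packing; returns packs as lists of lengths."""
    packs = []
    # Visit sequences from longest to shortest.
    for length in sorted(sequence_lengths, reverse=True):
        # Packs that still have room for this sequence.
        candidates = [
            pack for pack in packs
            if sum(pack) + length <= max_seq_length and len(pack) < max_seq_per_pack
        ]
        if candidates:
            # Place the sequence in the pack with the shortest total length.
            min(candidates, key=sum).append(length)
        else:
            # Nothing fits, so open a new pack.
            packs.append([length])
    return packs

packs = spfhp_sketch(
    [120, 60, 40, 40, 30, 200, 10], max_seq_length=256, max_seq_per_pack=6
)
print(packs)  # [[200, 40, 10], [120, 60, 40, 30]]
```

Here seven sequences are reduced to two packs of 250 tokens each, so only 12 of the 512 available positions are padding.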

LPFHP is a shortest-to-longest variant of SPFHP, splitting counts to get more ideal fits. In some cases, it can be useful to approach the task longest-pack-first, offering a slightly more optimal fit.

The packing utilities created for the Packed BERT Hugging Face notebooks allow either SPFHP or LPFHP to be used, to mitigate the potential preprocessing bottleneck of using NNLSHP.

The original code for all packing algorithms is available in the blog’s code.

Implementing packing for BERT fine-tuning with Hugging Face

Have a go at using packing to speed up BERT for multi-label sequence classification yourself: In this section, we’ll look at how to easily use packing with BERT using Hugging Face on the GoEmotions dataset.

You can follow this walkthrough with some visualisation and further explanation by running it in a Paperspace Gradient notebook — all of the code used in this article is available in the multi-label text classification notebook.

The in-depth walkthrough available on Github explains the internal functionality of the high-level functions that make packing so easy to use in Hugging Face. Full explanations of these and the intrinsic details that you might require if you wish to apply packing to your own model/task are provided in that walkthrough.

The GoEmotions dataset is ideal to try fine-tuning for packing as it has an extremely strong sequence length skew towards shorter sequences, and will provide large throughput increases, as the sequence length distribution for it shows:

Sequence length distribution for GoEmotions dataset. Image by author.

About the dataset: GoEmotions is a multi-label sentiment analysis dataset, containing approximately 58000 carefully curated comments labeled for 28 categories of emotion, including a ‘neutral’ category. We will train for all labels using one-hot encoding to represent multi-label outputs.

This dataset is easily downloadable during this walkthrough using the datasets library from Hugging Face, which lets you download the processed and ready-to-use dataset from the Hugging Face Hub, so there is no need to download and set up the dataset yourself.

Setting up your environment

First, set up your environment by installing all of the required pip packages for this walkthrough. If you are trying this walkthrough in Paperspace, ensure you are in the Hugging Face Transformers on IPU (Optimum Graphcore) environment, where you can access the packing utilities from the packed-bert folder. The notebooks for packing already have the utilities accessible, but if you are trying this in a separate personal environment, be sure to clone the repository with:

git clone git@github.com:huggingface/optimum-graphcore.git

and navigate to notebooks/packed-bert to be able to use the models, utils and pipeline functionalities.

pip install git+https://github.com/huggingface/optimum-graphcore.git
pip install scikit-learn
pip install datasets
pip install evaluate
pip install tokenizers
pip install matplotlib
pip install scipy
pip install huggingface_hub

We will first import the packages we need as well as the packing specific model for sequence classification, and the inference pipeline.

import os
import transformers
import optimum.graphcore
import torch
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoConfig
from huggingface_hub import notebook_login
from transformers import AutoTokenizer
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments
from models.modeling_bert_packed import PipelinedPackedBertForSequenceClassification
from pipeline.packed_bert import PackedBertTextClassificationPipeline
from utils.packing.dataset_creator import PackedDatasetCreator

Next, we define some generic parameters for the walkthrough:

task = "go_emotions"
model_checkpoint = "bert-base-uncased"
ipu_config_name = "Graphcore/bert-base-uncased"
micro_batch_size = 2
gradient_accumulation_steps = 39
device_iterations = 32
max_seq_length = 256
num_labels = 28
pod_type = os.getenv("GRAPHCORE_POD_TYPE", "pod4")
executable_cache_dir = os.getenv("POPLAR_EXECUTABLE_CACHE_DIR", "./exe_cache/") + "/packed_bert_slseqcls/"

The task is go_emotions, and the checkpoint is the default bert-base-uncased, which we will use as a base to fine-tune the model for go_emotions. This model will run on Graphcore IPUs, and to take advantage of the parallelism available on an IPU, we need to define some IPU-specific configurations. From the perspective of fine-tuning, these effectively create a larger batch size for the model.

max_seq_length is the maximum length all input sequences are fixed to, and all sequences below this length will be padded to this length. Given the small size of the sequences in the GoEmotions dataset, we can reduce the model maximum input size to max_seq_length = 256.

IPU Parallelism: We are using both data parallelism and pipeline parallelism (see this tutorial for more). Therefore the global batch size, which is the actual number of samples used for the weight update, is determined using four factors:

global_batch_size = micro_batch_size * gradient_accumulation_steps * device_iterations * replication_factor

We define the gradient accumulation steps, device iterations and micro-batch size. The replication factor (which replicates the model over multiple sets of IPUs) is set to 1 by default, and this walkthrough will use 4 IPUs for a single replica by default. The ipu_config is a special configuration available within all checkpoints in the Graphcore space on the Hugging Face Hub. It contains the defaults for these IPU-specific parameters for the base checkpoint, and must be passed to an IPU-specific model to instantiate it.
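Plugging the walkthrough’s parameters into the formula above gives the following. The last line is an upper bound we derive ourselves, assuming every packed sample holds the maximum of 6 sequences used later in this walkthrough.

```python
# Values from the parameters defined earlier in this walkthrough.
micro_batch_size = 2
gradient_accumulation_steps = 39
device_iterations = 32
replication_factor = 1  # default: a single replica

global_batch_size = (
    micro_batch_size
    * gradient_accumulation_steps
    * device_iterations
    * replication_factor
)
print(global_batch_size)  # 2496

# With packing, each sample (pack) can hold several sequences.
# Assumption: at most 6 sequences per pack, as configured later on.
max_seq_per_pack = 6
max_sequences_per_global_batch = global_batch_size * max_seq_per_pack
print(max_sequences_per_global_batch)  # 14976
```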

Getting and preparing the dataset

To retrieve the dataset, we can simply use load_dataset from the Datasets library, which will retrieve it from the Hugging Face Hub and load it into your script.

We also want to initialise the metric used for validation. Since this is a multi-label task with one-hot encoded outputs, we will use the ROC-AUC metric, a common performance measure for multi-class, multi-label classification problems. We can load this from Hugging Face’s Evaluate library.

To preprocess the dataset and turn our strings of sentences into integer tokens that correspond to the vocabulary interpreted by BERT, we also need to initialise a model tokenizer. This will convert individual words/sub-words into tokens.

This is easily done using the AutoTokenizer from the Transformers library. The default pre-trained BERT checkpoint contains a pre-configured tokenizer, so we can load the tokenizer with the pre-trained embeddings and vocabulary directly for our model using from_pretrained.

dataset = load_dataset(task)
metric = evaluate.load("roc_auc", "multilabel")
 
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

The map function available in the Datasets library can be used to tokenize the dataset.

First, we create a simple function which applies the tokenizer to one text sample, for which we:

Define the maximum sequence length
Set truncation to True: This will truncate any sequences which are larger than the maximum sequence length.
Note that for packing, we do not set padding to be true at this stage. We will pad the dataset when we pack it, but packing requires its inputs to be unpadded initially.

Second, we need to encode our labels in the one-hot format. As the dataset allows multiple labels to be true for one input, to keep a constant label dimension, we need to encode the label column in the dataset as a binary array rather than dynamic-length lists of integers representing the label classes. Since there are 28 classes, an example would look like this:

Original labels: [3, 27] 
One hot encoded labels: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

Here, 3 and 27 are valid labels for a given input. In the one-hot encoded format, index 3 and index 27 are set to 1 to represent this, so the encoded array keeps a fixed length of 28 however many labels apply.

We wrap the call to tokenize one sample in the first function, and wrap the call to convert one set of labels to a one-hot encoded format in the second function. Then we use the map function to iteratively apply preprocessing to every sample, using internal batching to speed it up, set with the batched argument.

# Tokenising function for one sample
# (the GoEmotions dataset on the Hugging Face Hub stores the input string in the 'text' column)
def preprocess_function(examples):
    return tokenizer(
        examples['text'], truncation=True, max_length=max_seq_length)

#Label ID to one-hot-encoded conversion for one sample
def id_to_N_hot(example):
    indexes = example['labels']
    label = np.zeros((num_labels,), dtype=np.float16)
    for idx in indexes:
        label[idx] = 1
    example['labels'] = label
    return example

# Call to map the data
encoded_dataset = dataset.map(id_to_N_hot)
encoded_dataset = encoded_dataset.map(preprocess_function, batched=True)

Packing the dataset

Now that we have a preprocessed and tokenized dataset, we can pack the dataset. To summarise what we need to do for packing, there are four steps to ensuring we can use a packed dataset for a model:

Create a histogram of the sequence lengths of the dataset.
Generate a set of ‘strategies’ for the dataset using one of the state-of-the-art packing algorithms, which map out the order and indices of the sequences that need to be packed together.
Use this strategy to create the actual dataset, concatenating the tokenized features together for each column in the dataset, including the labels.
Finally, pass these new columns into a custom PyTorch dataset, ready to be passed to the dataloader!

What is a strategy? The ‘mapping of the order of sequences’ uses the histogram of sequence lengths to group together length values that best fit into a given maximum length. The set of strategies for a dataset contains multiple lists of lengths, covering all of the sequences in the dataset, for instance:

The first strategy in the set of strategies may be [120,40,40,56], indicating that one sequence of length 120, two sequences of length 40 and one sequence of length 56 should be placed in the given order to form one pack with a total length of 256. There may be multiple sequences with these lengths, so when forming one pack, we can mark a sequence as being ‘used’ to ensure sequence retrieval is not repeated.
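The ‘mark as used’ idea can be sketched as follows. This is an illustrative stand-in rather than the utility’s internal code: the take_pack helper and the example lengths are hypothetical, and each dataset index is consumed by removing it from a pool of unused indices.

```python
from collections import defaultdict

# Hypothetical tokenized lengths, indexed by position in the dataset.
sequence_lengths = [40, 120, 56, 40, 40, 120]

# Pool of unused dataset indices, grouped by sequence length.
unused = defaultdict(list)
for index, length in enumerate(sequence_lengths):
    unused[length].append(index)

def take_pack(strategy):
    """Pop one unused dataset index for each length in the strategy.

    Raises IndexError if no unused sequence of a requested length remains.
    """
    return [unused[length].pop(0) for length in strategy]

first_pack = take_pack([120, 40, 40, 56])
print(first_pack)    # [1, 0, 3, 2]
print(dict(unused))  # {40: [4], 120: [5], 56: []}
```

Because indices are removed as they are taken, a second strategy requesting a length-40 sequence would receive index 4 rather than repeating index 0 or 3.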

The above steps have been simplified through the easy-to-use utils.packing available in the Graphcore Hugging Face repo. We can simply generate the packed dataset after the usual tokenization and preprocessing by passing all necessary packing configuration to the PackedDatasetCreator class, and generate the ready-to-use dataset with .create().

This completes the first three steps above and also creates the PyTorch dataset.

First, we define some essential packing parameters:

max_seq_per_pack = 6
num_labels = 28
problem_type = 'multi_label_classification'

The max_seq_per_pack is the maximum number of sequences that can be concatenated into one input, called a ‘pack’. This affects the overall batch size of the data at the classification stage, and as a result, too large a value for fine-tuning may require more extensive hyperparameter tuning to get the best results possible.

We have decided to keep the default value at 6 for the current examples, meaning a theoretical speed-up of up to 6 times the speed of an unpacked dataset. For inference, this can be set higher, to get even more throughput for large batched inference workloads.

The PackedDatasetCreator provides quite a few options to modify the process according to the dataset:

Adjustable maximum sequence length.
Adjustable maximum number of sequences per pack (max_seq_per_pack).
The packing algorithm to use (one of LPFHP or SPFHP).
A custom label key to allow the class to access the labels in the dataset, as these are not tokenized.
Currently, the creator class supports single and multi-label classification, and question answering. The type of task is defined using the problem_type, one of single_label_classification, multi_label_classification or question answering.
For different tasks, you can set training=True, validation=True or inference=True.

The PackedDatasetCreator class also has some other features specifically for inference, such as pad_to_global_batch_size. This is useful for batched inference on a large number of samples when we do not want to lose any samples while creating data iterators. It applies ‘vertical’ padding to the dataset, adding filler rows to bring the dataset size up to a value divisible by the global batch size, which allows the largest possible batch sizes to be used without any loss of data.
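The arithmetic behind this kind of ‘vertical’ padding is simple. The sketch below only illustrates the idea: filler_rows_needed is a hypothetical helper, not part of the packing utilities, and the pack count and batch size are made-up values.

```python
def filler_rows_needed(num_packs, global_batch_size):
    """Rows to append so that num_packs becomes a multiple of global_batch_size."""
    return (-num_packs) % global_batch_size

# Hypothetical values for illustration.
num_packs = 927
global_batch_size = 64

filler = filler_rows_needed(num_packs, global_batch_size)
padded_size = num_packs + filler
print(filler, padded_size)  # 33 960
```

Without the filler rows, a data iterator that drops incomplete batches would discard the final 31 packs in this example.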

Once the packer has been created (which generates the histogram and packing strategy, the 1st and 2nd steps), we can simply call .create(). This performs the 3rd and 4th steps and returns fully packed versions of the initial tokenized, unpacked datasets.

train_data_packer = PackedDatasetCreator(
    tokenized_dataset = encoded_dataset['train'],
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    training = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP',
    # The one-hot labels are stored in the 'labels' column of the dataset
    custom_label_key = 'labels'
)

val_data_packer = PackedDatasetCreator(
    tokenized_dataset = encoded_dataset['validation'],
    max_sequence_length = max_seq_length,
    max_sequences_per_pack = max_seq_per_pack,
    validation = True,
    num_labels = num_labels,
    problem_type = problem_type,
    algorithm = 'SPFHP',
    custom_label_key = 'labels'
)

packed_train_dataset = train_data_packer.create()
packed_val_dataset = val_data_packer.create()

We can then observe the output of the dataset creation, which will show us what packing has actually accomplished here:

Packing efficiency (fraction of real tokens): 42.5296
 Speed-up theoretical limit: 13.3547
 Achieved speed-up over un-packed dataset: 5.67971
 Runtime: Packed 43410 sequences in 0.001 seconds
 Average packing factor: 5.6797069213659555
Packing efficiency (fraction of real tokens): 43.7226
 Speed-up theoretical limit: 13.3873
 Achieved speed-up over un-packed dataset: 5.85329
 Runtime: Packed 5426 sequences in 0.001 seconds
 Average packing factor: 5.853290183387271

Packed dataset creation time: 1.9252s
Packed dataset creation time: 0.1407s

The output shows a theoretical 5.68 times speed-up over the unpacked training dataset. The algorithm is fast, completing the strategy for all 43410 training sequences in 0.001 seconds. The full process of creating the training dataset takes about two seconds, an effectively negligible overhead that is more than offset by the training speed-up.
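One way to read these numbers is to work out how many packs the model actually has to process. The sketch below assumes that the reported average packing factor is simply the number of sequences divided by the number of packs, which is our reading of the log rather than something stated by the utility.

```python
# Figures copied from the output above.
train_sequences = 43410
train_packing_factor = 5.6797069213659555
val_sequences = 5426
val_packing_factor = 5.853290183387271

# Assumption: packing factor = sequences / packs.
train_packs = round(train_sequences / train_packing_factor)
val_packs = round(val_sequences / val_packing_factor)

print(train_packs)  # 7643
print(val_packs)    # 927
```

Under that assumption, the 43410 training sequences are presented to the model as 7643 inputs, which is where the roughly 5.68 times reduction in work comes from.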

Observing the packed dataset

Let’s have a look at what the columns in the packed dataset look like, to dig a little deeper into what the PackedDatasetCreator has done with the original dataset.

Within the function, it expects the default column names generated by the tokenizer, specifically:

input_ids
attention_mask
labels (for training, validation)
token_type_ids (optional)

It also generates some extra columns:

position_ids is a column denoting the position of individual tokens within their respective sequence inside a pack.
example_ids (for inference) are integer indices corresponding to the position of a sample within the input data, and are used to re-order inference outputs into the same order as the inputs. This is required because packing shuffles the data once as a necessary side-effect of fitting sequences together as tightly as possible.

The following diagram is a simplified visualisation of what actually happens when the dataset is created, using the strategy generated by the algorithm (also showing the position_ids for each input):

How the transformer model’s inputs change when used for packing. Image by author.

Inside the packed dataset creator: The middle box in the diagram indicates how the sequences to concatenate are chosen using the previously mentioned strategy. The dataset is used to form a stacked list (‘Sorted sequences’), where the first row in the list is the sorted lengths of a sequence, and the second row is the corresponding indices of the sequence in the dataset — from which the relevant indices of sequences to pack are obtained (these are also equivalent to the stored example_ids).

This is used to retrieve one sequence with a length corresponding to the strategy, and then the specific index of that sequence in the dataset is nullified, so the same sequence cannot be chosen again.

Changes to the inputs after packing: Packing the dataset involves concatenating inputs together, but the formation of some inputs may change to enable processing multiple sequences at once — as in the diagram above:

A custom attention_mask is generated: It contains a unique index for each sequence of the pack and 0 for the remaining padding tokens. This tells the model’s self-attention mechanism to retrieve context for a single token from only the sequence relevant to the token. As in the above example, something like [1,1,1,1,0,0,0,0,0,0], once packed with other sequences, turns into [1,1,1,1,2,2,3,0,1,2,3] to indicate the specific attention for each sequence in the pack.
The CLS tokens of each sequence must be moved to the end of the pack for classification tasks — the BERT Pooler can then easily extract a fixed set of global representations from the end of the sequence rather than retrieve them from intermittent and dynamically changing positions in the input. For token-based or prediction tasks, this is not necessary as this is only needed if the Pooler is required for the task.
The position_ids of a pack contain the concatenated position_ids of each sequence, i.e. the position of each token within its own sequence.
labels and token_type_ids are also packed (concatenated) to correspond to the input_ids pack. Note that while other columns are padded with 0s, labels are padded with a value of -100. This takes advantage of certain loss functions in PyTorch automatically ignoring indices with the value -100. It also reduces indexing confusion between tasks which may use 0 as a label (such as in the one-hot encoded case).
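The changes listed above can be illustrated with a small sketch that reproduces the packed attention mask from the example. The pack_masks helper is hypothetical, and the position_ids layout (CLS tokens at position 0, other tokens counted from 1, padding set to 0) is an assumption made for illustration rather than a description of the utility’s exact output.

```python
def pack_masks(token_counts, body_length):
    """Build attention_mask and position_ids for one pack.

    token_counts: number of non-CLS tokens in each sequence of the pack.
    body_length:  slots available before the CLS tokens at the end.
    """
    if sum(token_counts) > body_length:
        raise ValueError("sequences do not fit into the pack")

    attention_mask, position_ids = [], []
    for seq_index, count in enumerate(token_counts, start=1):
        attention_mask += [seq_index] * count
        # Assumption: position 0 belongs to CLS, so tokens start at 1.
        position_ids += list(range(1, count + 1))

    # Pad the unused part of the body with zeros.
    padding = body_length - len(attention_mask)
    attention_mask += [0] * padding
    position_ids += [0] * padding

    # One CLS slot per sequence at the end of the pack.
    attention_mask += list(range(1, len(token_counts) + 1))
    position_ids += [0] * len(token_counts)
    return attention_mask, position_ids

mask, positions = pack_masks([4, 2, 1], body_length=8)
print(mask)       # [1, 1, 1, 1, 2, 2, 3, 0, 1, 2, 3]
print(positions)  # [1, 2, 3, 4, 1, 2, 1, 0, 0, 0, 0]
```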

For this task, multi-label classification with BERT uses binary cross-entropy loss with logits, and will not ignore the indices. As a result, the provided model class is set up to manually ignore these indices using the -100 value.

For more in-depth information on the implementation of the PackedDatasetCreator and how you might go about applying it to different fine-tuning tasks, have a look at the Deeper dive into fine-tuning with packing.

Prepare the model for fine-tuning

First, let’s define a function which can compute metrics during validation for us. This function is automatically called by Hugging Face Optimum’s Trainer class, passing the model predictions and corresponding labels. Here, we have to apply some simple postprocessing to ensure we ignore the unused padding samples created when packing. A few key things to note here:

Using a mask, all labels set to -100, as described previously, are ignored when computing the metric.
Similarly, for returned logits, we must ensure these padded indices are ignored. We can use an element-wise comparison to create a boolean mask of valid and invalid labels. We can then use this mask to slice the logits and labels so that the metric is computed only on valid indices.
We apply the softmax function to retrieve a probability distribution from our logits. This further isolates high-scoring classes for multi-label classification, and this distribution is expected by the metric.

The predictions and labels are passed to the metric we initialised using the Evaluate library, which will return an accuracy percentage for the validation samples in the dataset:

from scipy.special import softmax

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    
    labels = labels.reshape(-1, labels.shape[-1])
    predictions = predictions.reshape(-1, predictions.shape[-1])
    
    # Remove the padding labels
    mask = (labels != -100)[:,0]
    
    labels = labels[mask,:]
    predictions = predictions[mask,:]
    pred_scores = softmax(predictions.astype("float32"), axis=1)    

    auc = metric.compute(
        prediction_scores=pred_scores, references=labels, multi_class="ovr"
    )["roc_auc"]

    return {"roc_auc": auc}
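
To see what the padding mask in this function does, here is a toy version in plain Python. It is a sketch under assumptions: the helper names softmax and drop_padding are hypothetical, lists stand in for the NumPy arrays the Trainer actually passes, and the metric call itself is omitted.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def drop_padding(predictions, labels, pad_value=-100):
    """Keep only rows whose first label is not the padding value,
    mirroring the (labels != -100)[:, 0] row mask."""
    kept = [(p, l) for p, l in zip(predictions, labels) if l[0] != pad_value]
    scores = [softmax(p) for p, _ in kept]
    kept_labels = [l for _, l in kept]
    return scores, kept_labels


logits = [[2.0, 0.0], [0.0, 0.0], [5.0, 1.0]]
labels = [[1, 0], [0, 1], [-100, -100]]  # last row is an unused pack slot
scores, kept = drop_padding(logits, labels)
print(len(scores))  # 2
print(scores[1])    # [0.5, 0.5]
```

Only the two real samples reach the metric; the row created by packing is dropped before any score is computed.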

Model modifications for packing

Next, we can move on to instantiating our model. Using the packing utilities, we can use PipelinedPackedBertForSequenceClassification, a modified version of Transformers BertForSequenceClassification which inherits IPU pipeline parallelism and implements some small modifications at the output and input of the model to make it compatible with our packed dataset. The internal forward and backward pass of BertForSequenceClassification is not changed.

Recall the second and third essential stages of packing for fine-tuning: modifying the model inputs to receive the packed sequences, and modifying the model outputs to unpack the sequences for the output calculations. We can summarise the key changes made to the model here:

1. Attention mask (Input stage):

Recall the increasing integer attention mask we generated using the PackedDatasetCreator. BERT will not interpret this integer representation in the way we need, so instead we create a binary extended attention mask by reshaping and transposing the values in the increasing integer attention mask and using the order represented by the integers.

This represents the relevant attention mask for each token in the input sequence, defining the relevant context for each. This is done inside the Packed BERT model head, just before passing the inputs into the Transformers BERT forward pass.

We don’t pass such a mask directly because it would make the dataset orders of magnitude larger, and would make it difficult for the dataloader to infer the batch dimension from the inputs.

Converting the attention mask to a 3D attention mask at the input stage to the BERT model forward pass. Image by author.
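
As a rough illustration of this conversion for a single pack, the sketch below expands an increasing integer mask into a binary, block-diagonal mask. It is not the library's implementation: the function name expand_attention_mask is hypothetical, the equality comparison is an assumed equivalent of the reshape-and-transpose step described above, and the handling of the relocated CLS tokens is left out.

```python
def expand_attention_mask(packed_mask):
    """Turn an increasing integer mask into a binary (seq_len x seq_len) mask.

    Token i may attend to token j only when both carry the same non-zero
    sequence index, which gives a block-diagonal pattern.
    """
    return [
        [1 if (query != 0 and query == key) else 0 for key in packed_mask]
        for query in packed_mask
    ]


# Two sequences (lengths 2 and 3) followed by one padding position.
for row in expand_attention_mask([1, 1, 2, 2, 2, 0]):
    print(row)
# [1, 1, 0, 0, 0, 0]
# [1, 1, 0, 0, 0, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 0, 1, 1, 1, 0]
# [0, 0, 0, 0, 0, 0]
```

A pack of length L needs L values in the integer form but L x L values in the expanded form, which is why the expansion happens inside the model rather than in the dataset.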

2. BERT Pooler (Output stage):

For classification tasks only, a slight modification needs to be made to the BERT Pooler — instead of extracting output embeddings from the first token position, extract output embeddings from the last N token positions, where N = max_seq_per_pack .

3. Unpacking (Output stage):

For non-classification tasks, if hidden states are used to directly obtain token-specific logits, these must be ‘unpacked’ using the positional information available in the inputs. It is possible to stack the logits for as many sequences as are present, and multiply with a sparse mask corresponding to the sequence positions — this will allow the loss to treat the inputs as a larger batch size of individual sequences.

4. Masking (Output stage):

If you are using a loss function that does not support ignore_index or similar (e.g. binary losses with logits), unused intermittent logit/label values must be kept from contaminating the loss. The unused indices can be found from the label indices set to -100 and used to create a sparse mask. This mask can either set the logits at these indices to 0 or be applied to the unreduced loss values, which are then averaged.
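
The second masking option (mask the per-element loss, then average) can be sketched as follows. This is a minimal plain-Python sketch, not the code in modeling_packed_bert.py: the function name masked_bce_with_logits is hypothetical, and averaging over the valid entries only is an assumption about how the mean is taken.

```python
import math


def masked_bce_with_logits(logits, labels, pad_value=-100):
    """Mean binary cross-entropy over entries whose label is not pad_value."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            if y == pad_value:
                continue  # unused pack slot: contributes nothing to the loss
            # Numerically stable form of
            # -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))]
            total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count if count else 0.0


logits = [[0.0, 0.0], [3.0, -3.0]]
labels = [[1, 0], [-100, -100]]  # second row is padding created by packing
print(masked_bce_with_logits(logits, labels))  # ln(2), about 0.693
```

Whatever logits the padded row produces, the loss stays the same, which is the property the packed model needs.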

Examples and further explanations of these changes can be read in the deeper dive walkthrough, and the source code for the currently available tasks is present in the Graphcore Hugging Face (Optimum) repository, under notebooks/packed_bert/models/modeling_packed_bert.py

To prepare the model for training, we need to initialise our BERT config from the pre-trained checkpoint with some of the model modifications we have made. For packing, we must specify max_sequences_per_pack, num_labels and problem_type as these are essential to our changes in the model. Then, we can simply call from_pretrained on the model to inherit all of the default configurations from the pre-trained checkpoint plus the few configurations we have added. To speed things up, and at minimal detriment to the performance, we can train the model in half precision (FP16).

config = AutoConfig.from_pretrained(model_checkpoint)
config.max_sequences_per_pack = max_seq_per_pack
config.num_labels = num_labels
config.problem_type = problem_type
model = PipelinedPackedBertForSequenceClassification.from_pretrained(
    model_checkpoint, config=config).train().half()

We have inherited most of the config from the pre-trained checkpoint, adding some new configuration options specific to this use case:

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    ...
    "27": "LABEL_27"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_10": 10,
    ...
    "LABEL_9": 9
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "max_sequences_per_pack": 6,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "multi_label_classification",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

At the beginning of the walkthrough, we set some default IPU configurations, as well as an executable cache directory (this is useful as it lets you skip compilation after the first time the model is compiled). These can now be passed to the Optimum Graphcore IPUConfig to simplify passing these to the model.

ipu_config = IPUConfig.from_pretrained(
    ipu_config_name,
    executable_cache_dir = executable_cache_dir,
    gradient_accumulation_steps=gradient_accumulation_steps,
    device_iterations = device_iterations,
    replication_factor=1,
    inference_device_iterations = 64,
    inference_replication_factor = 1
)

To train the model, we create a trainer using the IPUTrainer class which handles model compilation on IPUs, training and evaluation. The class works just like the Hugging Face Trainer class, but with the additional IPUConfig passed in.

We can also specify some additional training arguments using the IPUTrainingArguments class, which will be passed to the trainer. This is really useful for modifying training hyperparameters easily, including parameters like number of epochs to train, the batch size per device, the learning rate, warm-up ratio and some Dataloader options such as drop_last as well.

Let’s now instantiate our training with the defined IPUTrainingArguments.

from transformers import default_data_collator

args = IPUTrainingArguments(
    "./"+f"{model_checkpoint}-{task}",
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=4,
    num_train_epochs=5,
    learning_rate=2e-4,
    adam_epsilon=1e-6,
    loss_scaling=16.0,
    warmup_ratio=0.1,
    weight_decay=0,
    lr_scheduler_type = "cosine",
    metric_for_best_model=metric_name,
    dataloader_drop_last=True,
    logging_steps=1,
    pod_type=pod_type,
    gradient_accumulation_steps=gradient_accumulation_steps
)

trainer = IPUTrainer(
    model,
    ipu_config,
    args,
    train_dataset=packed_train_dataset,
    eval_dataset=packed_val_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics
)
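
The arguments above pair warmup_ratio=0.1 with a cosine scheduler. The sketch below shows the usual shape of such a schedule: a linear warm-up followed by a half-cosine decay to zero. It is a sketch under assumptions: the function name is hypothetical, rounding the warm-up length up with ceil is assumed, and whether IPUTrainer applies exactly this formula is not confirmed here. The value total_steps=15 is chosen for illustration.

```python
import math


def cosine_lr_with_warmup(step, total_steps, base_lr, warmup_ratio):
    """Learning rate after `step` optimizer steps: linear warm-up, cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (1, 2, 3, 15):
    lr = cosine_lr_with_warmup(step, total_steps=15, base_lr=2e-4, warmup_ratio=0.1)
    print(step, lr)
```

With these inputs the rate is 1e-4 after the first step, peaks at 2e-4 after the second and decays to 0 at the final step.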

That’s all the setup done. All that’s left is to train the model!

A note on hyperparameter tuning

When packing and increasing the number of sequences, you might find that your model isn’t quite converging at the same rate as without packing. This is because increasing the maximum number of sequences has a similar effect on learning as significantly increasing the batch size: more samples are computed in one pass, so hyperparameters must be adjusted according to that increase in effective batch size.

For the examples provided here and in the Hugging Face notebooks, we did not undertake extensive hyperparameter tuning, but nevertheless achieved convergence simply by incrementally increasing only the initial learning rate by approximately the same rate the effective batch size was increased.

Other parameters are generally kept the same. If you change the packing parameters for a specific use case (akin to modifying the batch size), you may want to undertake some hyperparameter tuning to get the best results.
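
The learning-rate adjustment described in this note amounts to scaling the initial rate in proportion to the effective batch size. A minimal sketch, assuming a simple linear rule; the function name and all numbers below are hypothetical and are not the values used to tune the notebook:

```python
def scale_learning_rate(base_lr, base_batch_size, effective_batch_size):
    """Scale the initial learning rate in proportion to the effective batch size."""
    return base_lr * effective_batch_size / base_batch_size


# Hypothetical numbers: an unpacked run tuned at lr=4e-5 with 32 samples per
# step, then packed with an average of 5 sequences per pack.
packing_factor = 5
unpacked_batch = 32
packed_lr = scale_learning_rate(4e-5, unpacked_batch, unpacked_batch * packing_factor)
print(packed_lr)  # roughly 2e-4
```

This only gives a starting point; as noted above, a specific use case may still benefit from further tuning.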

Train the model

With the Hugging Face Optimum library, you can do this in one line, requiring no complicated training loops or implementations of backpropagation. We can simply initiate the training process by calling .train() on our instantiated trainer. This will start iterating through the data and training with the defined hyperparameters and for the epochs defined with the given batch configurations:

trainer.train()

Then we can save the model locally:

trainer.save_model("./"+f"{model_checkpoint}-{task}")

or push to the Hugging Face hub:

trainer.push_to_hub()

Observing the training output
The default IPUTrainer doesn’t take into account the number of samples inside of an input, because it expects each input to have one sample, but we can calculate the actual throughput by having a look at the output of the IPUTrainer training:

Packed - BERT

***** Running training *****
  Num examples = 7643
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2496
  Gradient Accumulation steps = 39
  Total optimization steps = 15
{'loss': 0.7289, 'learning_rate': 0.0001, 'epoch': 0.33}                                      
{'loss': 0.192, 'learning_rate': 0.0002, 'epoch': 0.67}                                       
{'loss': 0.143, 'learning_rate': 0.0001970941817426052, 'epoch': 1.0}                         
{'loss': 0.169, 'learning_rate': 0.000188545602565321, 'epoch': 1.33}                         
{'loss': 0.1223, 'learning_rate': 0.00017485107481711012, 'epoch': 1.67}                      
{'loss': 0.1726, 'learning_rate': 0.00015680647467311557, 'epoch': 2.0}                       
{'loss': 0.1644, 'learning_rate': 0.00013546048870425356, 'epoch': 2.33}                      
{'loss': 0.1231, 'learning_rate': 0.0001120536680255323, 'epoch': 2.67}                       
{'loss': 0.1311, 'learning_rate': 8.79463319744677e-05, 'epoch': 3.0}                         
{'loss': 0.1501, 'learning_rate': 6.453951129574644e-05, 'epoch': 3.33}                       
{'loss': 0.2156, 'learning_rate': 4.3193525326884435e-05, 'epoch': 3.67}                      
{'loss': 0.1241, 'learning_rate': 2.514892518288988e-05, 'epoch': 4.0}                        
{'loss': 0.142, 'learning_rate': 1.1454397434679021e-05, 'epoch': 4.33}                       
{'loss': 0.1344, 'learning_rate': 2.905818257394799e-06, 'epoch': 4.67}                       
{'loss': 0.1233, 'learning_rate': 0.0, 'epoch': 5.0}                                          
100%|█████████████████████████████████████████████████████████| 15/15 [00:54<00:00,  3.62s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)
{
 'train_runtime': 54.571, 
 'train_samples_per_second': 686.079, 
 'train_steps_per_second': 0.275, 
 'train_loss': 0.1890590712428093, 
 'epoch': 5.0
}

While the train_samples_per_second is around 686 samples/s, the total number of examples is reported as 7643. This is due to packing: each example counted by the trainer is a pack, whereas the actual number of samples in the GoEmotions training set is 43410. Each pack therefore holds 43410/7643 ≈ 5.68 sequences on average, so the actual training throughput can be calculated as 686*5.68, or approximately 3896 samples/s!
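
The same arithmetic, written out with the unrounded figures from the training output (the variable names are only for illustration):

```python
packed_examples = 7643         # "Num examples" reported for the packed run
original_samples = 43410       # sequences in the GoEmotions training set
reported_throughput = 686.079  # train_samples_per_second, which counts packs

# Average number of sequences held by each pack.
packing_factor = original_samples / packed_examples
# Each "sample" the trainer counts is a pack, so scale by the packing factor.
actual_throughput = reported_throughput * packing_factor

print(f"{packing_factor:.2f} sequences per pack")
print(f"{actual_throughput:.0f} samples/s")
```

Using unrounded inputs gives a value just under 3897 samples/s, in line with the rounded estimate above.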

To quantitatively demonstrate the advantage, you can run training under the same conditions, with equivalent batch size, for the same dataset but without packing, resulting in the training output below:

Unpacked - BERT

***** Running training *****
  Num examples = 43410
  Num Epochs = 5
  Instantaneous batch size per device = 2
  Total train batch size (w. parallel, distributed & accumulation) = 2496
  Gradient Accumulation steps = 39
  Total optimization steps = 85

100% 
85/85 [06:12<00:00,  4.38s/it]
{'loss': 0.7563, 'learning_rate': 2.2222222222222223e-05, 'epoch': 0.06}                                       
{'loss': 0.4773, 'learning_rate': 4.4444444444444447e-05, 'epoch': 0.12}                                       
{'loss': 0.1841, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.18}                                        
{'loss': 0.1415, 'learning_rate': 8.888888888888889e-05, 'epoch': 0.24}                                        
{'loss': 0.119, 'learning_rate': 0.00011111111111111112, 'epoch': 0.29}                                        
{'loss': 0.2073, 'learning_rate': 0.00013333333333333334, 'epoch': 0.35}                                       
{'loss': 0.1138, 'learning_rate': 0.00015555555555555556, 'epoch': 0.41}                                       
{'loss': 0.178, 'learning_rate': 0.00017777777777777779, 'epoch': 0.47}                                        
{'loss': 0.2422, 'learning_rate': 0.0002, 'epoch': 0.53}                                                       
{'loss': 0.0984, 'learning_rate': 0.0001999145758387301, 'epoch': 0.59}                                        
{'loss': 0.142, 'learning_rate': 0.000199658449300667, 'epoch': 0.65}                                          
{'loss': 0.145, 'learning_rate': 0.0001992320579737045, 'epoch': 0.71}                                         
{'loss': 0.0774, 'learning_rate': 0.00019863613034027224, 'epoch': 0.76}                                       
{'loss': 0.1322, 'learning_rate': 0.00019787168453273544, 'epoch': 0.82}                                       
{'loss': 0.2188, 'learning_rate': 0.00019694002659393305, 'epoch': 0.88}                                       
{'loss': 0.1536, 'learning_rate': 0.0001958427482458253, 'epoch': 0.94}                                        
{'loss': 0.1007, 'learning_rate': 0.00019458172417006347, 'epoch': 1.0}                                        
{'loss': 0.0267, 'learning_rate': 0.0001931591088051279, 'epoch': 1.06}                                        
{'loss': 0.0623, 'learning_rate': 0.00019157733266550575, 'epoch': 1.12}                                       
{'loss': 0.1069, 'learning_rate': 0.0001898390981891979, 'epoch': 1.18}                                        
{'loss': 0.0558, 'learning_rate': 0.0001879473751206489, 'epoch': 1.24}                                        
{'loss': 0.0684, 'learning_rate': 0.00018590539543698854, 'epoch': 1.29}                                       
{'loss': 0.0535, 'learning_rate': 0.00018371664782625287, 'epoch': 1.35}                                       
{'loss': 0.2245, 'learning_rate': 0.0001813848717270195, 'epoch': 1.41}                                        
{'loss': 0.0355, 'learning_rate': 0.00017891405093963938, 'epoch': 1.47}                                       
{'loss': 0.0364, 'learning_rate': 0.00017630840681998066, 'epoch': 1.53}                                       
{'loss': 0.1504, 'learning_rate': 0.00017357239106731317, 'epoch': 1.59}                                       
{'loss': 0.0346, 'learning_rate': 0.00017071067811865476, 'epoch': 1.65}                                       
{'loss': 0.0334, 'learning_rate': 0.00016772815716257412, 'epoch': 1.71}                                       
{'loss': 0.0862, 'learning_rate': 0.00016462992378609407, 'epoch': 1.76}                                       
{'loss': 0.0844, 'learning_rate': 0.0001614212712689668, 'epoch': 1.82}                                        
{'loss': 0.1204, 'learning_rate': 0.00015810768154019385, 'epoch': 1.88}                                       
{'loss': 0.1338, 'learning_rate': 0.00015469481581224272, 'epoch': 1.94}                                       
{'loss': 0.1765, 'learning_rate': 0.00015118850490896012, 'epoch': 2.0}                                        
{'loss': 0.2133, 'learning_rate': 0.00014759473930370736, 'epoch': 2.06}                                       
{'loss': 0.0235, 'learning_rate': 0.00014391965888473703, 'epoch': 2.12}                                       
{'loss': 0.1029, 'learning_rate': 0.00014016954246529696, 'epoch': 2.18}                                       
{'loss': 0.0134, 'learning_rate': 0.00013635079705638298, 'epoch': 2.24}                                       
{'loss': 0.0334, 'learning_rate': 0.00013246994692046836, 'epoch': 2.29}                                       
{'loss': 0.0686, 'learning_rate': 0.00012853362242491053, 'epoch': 2.35}                                       
{'loss': 0.1879, 'learning_rate': 0.00012454854871407994, 'epoch': 2.41}                                       
{'loss': 0.0515, 'learning_rate': 0.00012052153421956342, 'epoch': 2.47}                                       
{'loss': 0.0536, 'learning_rate': 0.00011645945902807341, 'epoch': 2.53}                                       
{'loss': 0.0595, 'learning_rate': 0.00011236926312693479, 'epoch': 2.59}                                       
{'loss': 0.0714, 'learning_rate': 0.00010825793454723325, 'epoch': 2.65}                                       
{'loss': 0.1455, 'learning_rate': 0.00010413249742488131, 'epoch': 2.71}                                       
{'loss': 0.0655, 'learning_rate': 0.0001, 'epoch': 2.76}                                                       
{'loss': 0.0498, 'learning_rate': 9.586750257511867e-05, 'epoch': 2.82}                                        
{'loss': 0.0635, 'learning_rate': 9.174206545276677e-05, 'epoch': 2.88}                                        
{'loss': 0.075, 'learning_rate': 8.763073687306524e-05, 'epoch': 2.94}                                         
{'loss': 0.079, 'learning_rate': 8.35405409719266e-05, 'epoch': 3.0}                                           
{'loss': 0.0366, 'learning_rate': 7.947846578043659e-05, 'epoch': 3.06}                                        
{'loss': 0.1427, 'learning_rate': 7.54514512859201e-05, 'epoch': 3.12}                                         
{'loss': 0.0612, 'learning_rate': 7.146637757508949e-05, 'epoch': 3.18}                                        
{'loss': 0.0401, 'learning_rate': 6.753005307953167e-05, 'epoch': 3.24}                                        
{'loss': 0.0065, 'learning_rate': 6.3649202943617e-05, 'epoch': 3.29}                                          
{'loss': 0.0302, 'learning_rate': 5.983045753470308e-05, 'epoch': 3.35}                                        
{'loss': 0.0399, 'learning_rate': 5.608034111526298e-05, 'epoch': 3.41}                                        
{'loss': 0.0241, 'learning_rate': 5.240526069629265e-05, 'epoch': 3.47}                                        
{'loss': 0.0732, 'learning_rate': 4.8811495091039926e-05, 'epoch': 3.53}                                       
{'loss': 0.0867, 'learning_rate': 4.530518418775733e-05, 'epoch': 3.59}                                        
{'loss': 0.0454, 'learning_rate': 4.189231845980618e-05, 'epoch': 3.65}                                        
{'loss': 0.0172, 'learning_rate': 3.857872873103322e-05, 'epoch': 3.71}                                        
{'loss': 0.0124, 'learning_rate': 3.53700762139059e-05, 'epoch': 3.76}                                         
{'loss': 0.0254, 'learning_rate': 3.227184283742591e-05, 'epoch': 3.82}                                        
{'loss': 0.0799, 'learning_rate': 2.9289321881345254e-05, 'epoch': 3.88}                                       
{'loss': 0.0466, 'learning_rate': 2.6427608932686843e-05, 'epoch': 3.94}                                       
{'loss': 0.0185, 'learning_rate': 2.3691593180019366e-05, 'epoch': 4.0}                                        
{'loss': 0.0042, 'learning_rate': 2.1085949060360654e-05, 'epoch': 4.06}                                       
{'loss': 0.012, 'learning_rate': 1.861512827298051e-05, 'epoch': 4.12}                                         
{'loss': 0.0279, 'learning_rate': 1.6283352173747145e-05, 'epoch': 4.18}                                       
{'loss': 0.045, 'learning_rate': 1.4094604563011472e-05, 'epoch': 4.24}                                        
{'loss': 0.0575, 'learning_rate': 1.2052624879351104e-05, 'epoch': 4.29}                                       
{'loss': 0.0196, 'learning_rate': 1.0160901810802115e-05, 'epoch': 4.35}                                       
{'loss': 0.0118, 'learning_rate': 8.422667334494249e-06, 'epoch': 4.41}                                        
{'loss': 0.0071, 'learning_rate': 6.840891194872112e-06, 'epoch': 4.47}                                        
{'loss': 0.0276, 'learning_rate': 5.418275829936537e-06, 'epoch': 4.53}                                        
{'loss': 0.0353, 'learning_rate': 4.1572517541747294e-06, 'epoch': 4.59}                                       
{'loss': 0.1348, 'learning_rate': 3.059973406066963e-06, 'epoch': 4.65}                                        
{'loss': 0.0679, 'learning_rate': 2.128315467264552e-06, 'epoch': 4.71}                                        
{'loss': 0.0589, 'learning_rate': 1.3638696597277679e-06, 'epoch': 4.76}                                       
{'loss': 0.0682, 'learning_rate': 7.679420262954984e-07, 'epoch': 4.82}                                        
{'loss': 0.0627, 'learning_rate': 3.415506993330153e-07, 'epoch': 4.88}                                        
{'loss': 0.088, 'learning_rate': 8.542416126989805e-08, 'epoch': 4.94}                                         
{'loss': 0.0182, 'learning_rate': 0.0, 'epoch': 5.0}


Training completed. Do not forget to share your model on huggingface.co/models =)

{
 'train_runtime': 372.393, 
 'train_samples_per_second': 569.721, 
 'train_steps_per_second': 0.228, 
 'train_loss': 0.09256008372587317, 
 'epoch': 5.0
}

Notice that the number of examples trained changes from 7643 to 43410 — because sequences are now being processed one-by-one. Observing the results:

Total time and throughput comparison between packed and unpacked BERT. Image by author.

The benefits of packing are evident — it offers a huge throughput and overall time benefit for fine-tuning.
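
To put a number on that benefit, the two training summaries can be compared directly. The figures below are copied from the packed and unpacked outputs; the dictionary layout and variable names are only for illustration.

```python
packed = {"runtime_s": 54.571, "packs_per_s": 686.079, "examples": 7643}
unpacked = {"runtime_s": 372.393, "samples_per_s": 569.721, "examples": 43410}

# Wall-clock speed-up for five epochs over the same data.
runtime_speedup = unpacked["runtime_s"] / packed["runtime_s"]

# Convert packs/s to sequences/s before comparing throughput.
packed_samples_per_s = (
    packed["packs_per_s"] * unpacked["examples"] / packed["examples"]
)
throughput_speedup = packed_samples_per_s / unpacked["samples_per_s"]

print(f"runtime speed-up:    {runtime_speedup:.1f}x")
print(f"throughput speed-up: {throughput_speedup:.1f}x")
```

Both ratios come out at roughly 6.8x for this run; other datasets with a different distribution of sequence lengths will give different ratios.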

Evaluate the model

We can then easily perform evaluation on the validation dataset in the same way:

trainer.evaluate()

and observe the outputs, with an evaluation ROC AUC of approximately 0.84 showing we have successfully trained our model.

***** Running Evaluation *****
  Num examples = 927
  Batch size = 256

100% 4/4 [00:00<00:00, 36.42it/s]

{'roc_auc': 0.836179971362336}

High speed inference with Hugging Face

Packing can also be used for high speed batched inference, covering the workflow from fine-tuning a model to deploying it for live inference.

A fully abstracted inference pipeline with built-in optimisation can be used for this, with the source code and module available in the Hugging Face Graphcore repository. The custom PackedBertTextClassificationPipeline is ideal for optimisation of larger inference loads for high throughput.

The following example uses a limit of up to 12 sequences per pack (i.e., a theoretical speed up of 12x) to demonstrate higher potential throughput. As mentioned before, higher packing factors for inference do not require extra effort or performance cost to implement, as they do with training.

By default, the inference pipeline retains the order of the data, so that outputs can easily be matched back to inputs when inferring on live loads of data.

The required arguments are:

Model checkpoint (model)
Cache storage directory (executable_cache_dir)
Maximum sequence length (max_seq_length)
Problem type (problem_type)

Some optional arguments include:

The sequences per pack (max_seq_per_pack, defaults to 6)
Label categories (label_categories, defaults to indices)
Micro batch size (micro_batch_size, defaults to 1)
Pre-trained tokenizer name (pretrained_tokenizer), needed if the tokenizer isn’t present in the saved model; this otherwise defaults to bert-base-uncased.

First, let’s outline the class names corresponding to the label categories:

#Define each of the class names in category order for inference labelling
class_names = [
    "admiration",
    "amusement",
    "anger",
    "annoyance",
    "approval",
    "caring",
    "confusion",
    "curiosity",
    "desire",
    "disappointment",
    "disapproval",
    "disgust",
    "embarrassment",
    "excitement",
    "fear",
    "gratitude",
    "grief",
    "joy",
    "love",
    "nervousness",
    "optimism",
    "pride",
    "realization",
    "relief",
    "remorse",
    "sadness",
    "surprise",
    "neutral",
]

Then, we can instantiate the pipeline with the needed arguments.

Note that passing an IPU config for inference is not strictly necessary: by default, the pipeline inherits the configuration from your saved checkpoint. However, to further demonstrate the advantages of parallelism, using 4 IPUs in a data-parallel fashion, we pass a boosted configuration to the pipeline in this example:

#Path to saved trained model checkpoint
model = "./"+f"{model_checkpoint}-{task}" 

inference_boosted_ipu_config = IPUConfig.from_pretrained(model, 
    inference_device_iterations=32,
    inference_replication_factor=4,
    ipus_per_replica=1,
    layers_per_ipu=[12]
)
#Instantiate the pipeline with all required options
pipeline = PackedBertTextClassificationPipeline(
    model = model,
    executable_cache_dir = executable_cache_dir,
    problem_type='multi_label_classification',
    max_seq_per_pack=12,
    max_seq_length=max_seq_length,
    ipu_config=inference_boosted_ipu_config,
    micro_batch_size=8,
    label_categories=class_names
)

The above lines are all that is needed to set up the pipeline. Next, as an example of a large amount of data, we pass the raw text column for the entire GoEmotions training dataset directly into the pipeline for it to perform inference on:

preds = pipeline.predict(dataset['train']['text'])

Packing efficiency (fraction of real tokens): 68.4612                                         
 Speed-up theoretical limit: 13.3547
 Achieved speed-up over un-packed dataset: 9.14280
 Runtime: Packed 43410 sequences in 0.001 seconds
 Average packing factor: 9.142796967144061
Packed dataset creation time: 1.6288s

We can print the inference output to observe the contents of the returned predictions:

print(f"Number of predictions: {len(preds['predictions'])}")
print(f"Preprocessing time: {preds['preprocessing_time']}s")
print(f"Postprocessing time: {preds['postprocessing_time']}s")
print(f"Throughput: {preds['throughput']} samples/s")

Number of predictions: 43410
Preprocessing time: 6.723727464675903s
Postprocessing time: 0.1985154151916504s
Throughput: 49017.46503352953 samples/s

The output for inference shows an inference throughput of 49017 samples per second, with IPU acceleration and a packing factor of 9.1. Compared to the unpacked version, this is approximately 9.1x faster for inference. Recall that for datasets with different dataset skews, varying improvements in throughput will be observed.

Let’s look at a random output to see what the pipeline returns.

print(f"Input:{dataset['train']['text'][16]}")
print(f"Output:{preds['predictions'][16]}")

Input:Thank you friend
Output:
  {'admiration': 0.008711744, 
   'amusement': 0.0030106984, 
   'anger': 0.0032300074, 
   'annoyance': 0.0037541997, 
   'approval': 0.0056799226, 
   'caring': 0.0048773102, 
   'confusion': 0.0027520012, 
   'curiosity': 0.003843228, 
   'desire': 0.0020692206, 
   'disappointment': 0.0025953841, 
   'disapproval': 0.00305811, 
   'disgust': 0.0009967086,
   'embarrassment': 0.0015079721, 
   'excitement': 0.0029756227, 
   'fear': 0.0018840467, 
   'gratitude': 0.90438044, 
   'grief': 0.001129419, 
   'joy': 0.0033982676, 
   'love': 0.0040671937, 
   'nervousness': 0.0016305067, 
   'optimism': 0.005975805, 
   'pride': 0.0019212064, 
   'realization': 0.0032426496, 
   'relief': 0.0016178179, 
   'remorse': 0.0053986902, 
   'sadness': 0.002555146, 
   'surprise': 0.003431616, 
   'neutral': 0.010305118
  }

The output lists the probabilities for all of the classes. For this input, the most probable class is, as expected, gratitude with a score of 0.904.
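
Since each prediction is a dictionary of class scores, picking out the most probable class, or every class above a cut-off, takes only a couple of lines. In this sketch the helper name top_labels and the 0.5 threshold are assumptions for illustration, not part of the pipeline.

```python
def top_labels(prediction, threshold=0.5):
    """Return (best_label, labels_at_or_above_threshold) for one prediction dict."""
    best = max(prediction, key=prediction.get)
    above = [name for name, score in prediction.items() if score >= threshold]
    return best, above


# A few of the scores from the output above (the rest are omitted for brevity).
example = {
    "admiration": 0.008711744,
    "gratitude": 0.90438044,
    "neutral": 0.010305118,
}
best, above = top_labels(example)
print(best, round(example[best], 3))  # gratitude 0.904
print(above)                          # ['gratitude']
```

For multi-label problems the thresholded list is usually the more useful of the two, since several emotions can apply to one input.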

In summary, using packing for fine-tuning and inference provides evident advantages. The optimisation does not use extra hardware or memory to increase application throughput. Instead, it successfully ‘recycles’ computational waste created by the excess padding of datasets, making it especially time-efficient while fully maintaining model performance.

Try it yourself for free

Try our PackedBERT notebooks for free in the cloud with Paperspace:

For a more in-depth look at implementing packing yourself for different datasets and tasks, try our deep-dive fine-tuning notebook.

Other useful resources

Delve into the original development of packing for BERT pre-training with the work that made this implementation possible through the:

Packed BERT: How to accelerate fine-tuning and inference for NLP tasks with packing was originally published in Graphcore on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dolly 2.0 — open source language model with ChatGPT-like interactivity

Graphcore — Wed, 26 Apr 2023 11:31:52 GMT

Dolly 2.0 is an open source large language model suitable for commercial use: learn more about how to use Dolly for AI applications on IPUs.

Author: Alex McKinney

Dolly 2.0, an open-source large language model (LLM) that delivers ChatGPT-like instruction-following interactivity, is now available to run as a Paperspace Gradient Notebook, powered by Graphcore IPUs.

The 12 billion parameter model was created by Databricks and is based on EleutherAI’s Pythia.

The trained weights, source code, and dataset for Dolly 2.0 have been released under an open-source and commercial-use license, making it the first truly open, instruction fine-tuned LLM. Prior models are subject to more stringent licensing, making them unusable for commercial applications.

Training LLMs for human-computer interaction

Attempting to elicit answers from LLMs that haven’t been appropriately fine-tuned requires prompt engineering to produce consistently useful responses, an experience that can be frustrating for users. This is because base LLMs are trained to simply predict the next token, which does not necessarily correlate with a good or even correct response.

Fine-tuning on an instruction-following dataset makes the model more suited to human interaction: simply ask the model a question and get a response back. This makes it ideal for Q&A applications.

LLMs and commercial restrictions

Dolly 2.0’s predecessor Dolly 1.0 was trained using the Stanford Alpaca dataset. Alpaca, in turn, uses some outputs from OpenAI’s ChatGPT. As a result, Dolly 1.0 was bound by ChatGPT’s licence restrictions regarding commercial use — preventing Dolly users from building products and services around the model.

Dolly 1.0 is not alone in this respect. Similar limitations affect many recently released instruction-following LLMs, including Koala, GPT4All, and Vicuna.

Generating an original dataset

To address these problems, the Databricks team needed to generate a corpus of training data, written by humans, of a similar size to the 13,000 prompt-response pairs used by OpenAI to train InstructGPT, a sibling model to ChatGPT.

The company turned to its 5,000 employees, gamifying the process of creating training data by running a contest to write the instruction and response pairs. Prizes were offered for the top 20 labelers, across seven specific dataset tasks. Using this approach, combined with a competitive leaderboard, Databricks managed to create a dataset with more than 15,000 instruction-response pairs.

The dataset tasks encompass the following (from Databricks’ blog):

Open Q&A: For instance, “Why do people like comedy movies?” or “What is the capital of France?” In some cases, there’s not a correct answer, and in others, it requires drawing on knowledge of the world at large.

Closed Q&A: These are questions that can be answered using only the information contained in a passage of reference text. For instance, given a paragraph from Wikipedia on the atom, one might ask, “What is the ratio between protons and neutrons in the nucleus?”

Extract information from Wikipedia: Here an annotator would copy a paragraph from Wikipedia and extract entities or other factual information such as weights or measurements from the passage.

Summarize information from Wikipedia: For this, annotators provided a passage from Wikipedia and were asked to distil it to a short summary.

Brainstorming: This task asked for open-ended ideation and an associated list of possible options. For instance, “What are some fun activities I can do with my friends this weekend?”.

Classification: For this task, annotators were asked to make judgments about class membership (e.g. are the items in a list animals, minerals or vegetables) or to judge the properties of a short passage of text, such as the sentiment of a movie review.

Creative writing: This task would include things like writing a poem or a love letter.
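
The seven task types map naturally onto a labelled record structure. The sketch below assumes each example carries `instruction`, `context`, `response` and `category` fields; the field names, category labels and sample records are illustrative assumptions rather than entries copied from the Databricks dataset.

```python
from collections import Counter

# Illustrative records only; these are not taken from the real dataset.
records = [
    {"instruction": "What is the capital of France?", "context": "",
     "response": "Paris.", "category": "open_qa"},
    {"instruction": "What is the ratio between protons and neutrons in the nucleus?",
     "context": "<reference paragraph about the atom>",
     "response": "<answer drawn only from the reference paragraph>",
     "category": "closed_qa"},
    {"instruction": "What are some fun activities I can do with my friends this weekend?",
     "context": "", "response": "<list of ideas>", "category": "brainstorming"},
]


def needs_context(record: dict) -> bool:
    """Closed Q&A, extraction and summarization depend on a reference passage."""
    return record["category"] in {"closed_qa", "information_extraction", "summarization"}


counts = Counter(r["category"] for r in records)
with_context = [r["instruction"] for r in records if needs_context(r)]
print(counts)
print(with_context)
```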

The resulting dataset, on which Dolly 2.0 was fine-tuned, is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License, which allows anyone to use it, modify it, and create a commercial application using it.

It is worth noting that Dolly 2.0 is still under active development, so expect to see further versions with better performance in the near future.

Getting Started: How to use Dolly 2.0 on IPUs

You can try out Dolly 2.0 for free, using a Paperspace Gradient Notebook, powered by Graphcore IPUs.

The notebook takes you through the process of downloading the model weights, creating an inference pipeline, and querying Dolly 2.0 with instructions and questions.

Dolly 2.0 fits in our Paperspace free tier environment, using a Graphcore POD4 system. We plan to expand this for faster inference on POD16 systems in due course. Follow the Paperspace Gradient link below to begin.

Dolly is fun to interact with and could easily be fine-tuned to have a particular personality as it answers user questions. As is, it could be used with an appropriate safety filter for real-world information or just as a fun on-brand character.

If you are interested in fine-tuning Dolly 2.0 on IPUs with your own datasets, please let us know using this form and we’ll help you get started.

How to use OpenAI’s Whisper for Speech Recognition

Graphcore — Tue, 25 Apr 2023 09:36:59 GMT

Tutorial: how to run Whisper for speech recognition on Graphcore IPUs

Author: Scott Griffiths, AIE Engineering Manager at Graphcore

Whisper is an exciting new language model that takes a novel approach to speech recognition. It produces high quality results, even from low quality audio, and is extremely adaptable to a diverse range of voices and languages, without the need for fine-tuning.

Whisper is open source and, with a range of model sizes available, can be an efficient solution for many speech-to-text applications, including translation, smart personal assistants, vehicle voice control systems, customer service operations, and more.

In this blog, we will explore what makes Whisper different to other speech recognition models and we will show you how to get started with the Hugging Face implementation of Whisper Tiny using a pre-built Paperspace Gradient Notebook, running on Graphcore IPUs.

What is so clever about Whisper?

Whisper’s creators at OpenAI set out to solve several fundamental challenges that have faced Automatic Speech Recognition (ASR) up until now:

Talk isn’t cheap

Many ASR models rely on very high quality, labelled, audio/text data to perform supervised learning. Unfortunately, this ‘gold standard’ of training data is in short supply. Models trained this way are capable of producing good speech recognition results under ideal conditions. However, because of their limited exposure to different training examples, they tend not to generalise well, can struggle with low quality real-world audio, and typically need additional voice fine-tuning to prepare them for specific use-cases.

The obvious way to improve such models would be to train them on more data, but the shortage of high quality datasets led AI practitioners to look in the opposite direction, developing ASR models with unsupervised learning, using vast amounts of unlabelled audio.

Models created this way are able to achieve very high quality representation of speech — but require subsequent fine-tuning to prepare them for specific ASR tasks. As well as entailing extra work, the fine-tuning process used in speech recognition has been shown to throw up some overfitting issues that can limit model generalizability.

Whisper’s creators described this problem as “a crucial weakness which limits their usefulness and robustness” and set out to design an ASR model that worked “out of the box”.

How ‘weak’ training data makes us stronger

The Whisper solution starts with the same high quality, labelled audio datasets and augments them with much larger ‘weakly supervised datasets’ (such as video captions). This approach was partly influenced by research in computer vision that showed larger, weakly supervised datasets can actually improve the robustness and generalisation of models.

A number of techniques were used to detect and remove the lowest quality data, such as video transcriptions that had been generated by other ASR technologies, because of the risk of transferring their limitations into Whisper.

Ultimately, 680,000 hours of labelled audio data was used to train Whisper, far more than previous, supervised models. Almost a fifth of the training data was non-English, spanning 96 languages. The dataset also included 125,000 hours of foreign language-to-English translations.

The multi-task transformer

Whisper takes a classic encoder-decoder transformer architecture and applies it to audio/text pairs, using encodings generated from the audio to enable next token prediction on the text component.

Crucially, Whisper includes special tokens in the decoder that direct it to perform different language tasks, such as [transcribe] or [translate].
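
As a rough illustration of how task tokens steer the decoder, the sketch below assembles the token sequence that precedes the generated text. The token spellings (`<|startoftranscript|>`, `<|transcribe|>`, `<|translate|>`, `<|notimestamps|>`) follow the convention used in the Whisper release; `decoder_prompt` is a hypothetical helper written for this post, not a library function.

```python
def decoder_prompt(language: str, task: str, timestamps: bool = False) -> list[str]:
    """Build the special-token prefix that tells the decoder what to do."""
    if task not in {"transcribe", "translate"}:
        raise ValueError(f"unsupported task: {task}")
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")
    return tokens


# Same audio, two different tasks: only the prefix changes.
transcribe_prefix = decoder_prompt("fr", "transcribe")
translate_prefix = decoder_prompt("fr", "translate")
print(transcribe_prefix)
print(translate_prefix)
```

Because the task is expressed as ordinary tokens in the decoder input, a single set of weights can serve both transcription and translation.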

This approach differs from many ASR models which use a variety of subsystems for different aspects of the speech-to-text process, such as voice activity detection, identifying different speakers, and normalizing the text format. Such architectures require additional resource to coordinate the complex interplay of sub-systems.

A sequence-to-sequence Transformer model is trained on many different speech processing tasks, including multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. (Source)

Performance

The performance of ASR models is typically measured by Word Error Rate (WER).
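
WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation in plain Python; published benchmarks usually normalise the text first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```

Note that WER can exceed 1.0 when the output contains many inserted words, since the denominator is the reference length only.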

Whisper is competitive with state-of-the-art commercial and open-source ASR systems in long-form transcription. The distribution of word error rates from six ASR systems on seven long-form datasets are compared, where the input lengths range from a few minutes to a few hours. The boxes show the quartiles of per-example WERs, and the per-dataset aggregate WERs are annotated on each box. Our model outperforms the best open source model (NVIDIA STT) on all datasets, and in most cases, commercial ASR systems as well. (Source)

As illustrated in the plot above, Whisper’s WER for long-form transcription is comparable to proprietary Automatic Speech Recognition (ASR) systems, trained on ‘gold-standard’ datasets. However, because of its larger corpus of training data and the use of ‘weakly labelled’ examples, Whisper proved much more robust when measured using other benchmarks, without the need for fine-tuning.

Whisper outperforms a SOTA wav2vec model on various datasets (Source)

It is worth noting that the Word Error Rate for different languages varies significantly. The chart below shows performance variance for Whisper Large v2 using the Fleurs dataset. In this instance, Spanish and Italian perform even better than English, while Zimbabwean Shona and Pakistani Sindhi fare worst, out-of-the-box. Performance for Whisper Tiny may be different.

The figure above shows a WER (Word Error Rate) breakdown by languages of the Fleurs dataset using the large-v2 model (Source)

Using Whisper on Graphcore IPUs

Whisper on IPUs

Developers can use a pretrained Whisper Tiny (39M parameters) for inference on Graphcore IPUs via a Paperspace Gradient Notebook.

Users can get started with a six-hour free trial, or upgrade to a paid tier if they need more.

Other versions of Whisper are available for IPU and if you want to find out more, contact us via this form.

Running Whisper in a Paperspace Gradient Notebook on IPUs is simple.

We will be using Hugging Face’s IPU-optimized transformers library, optimum-graphcore.

You will also need a few other libraries to manipulate sound files, tokenise the inputs for the model and plot our results.

Following this guide, you will be able to transcribe audio content in very little time.

%%capture
%pip install optimum-graphcore==0.6.0
%pip install soundfile==0.12.1 librosa==0.10.0.post2 tokenizers==0.12.1
%pip install matplotlib==3.7.1
%matplotlib inline

Next, let’s import the parts of the libraries that we will use today.

# Generic imports
from datasets import load_dataset
import matplotlib
import librosa
import IPython
import random

# IPU specific imports
from optimum.graphcore import IPUConfig
from optimum.graphcore.modeling_utils import to_pipelined

# HF related imports
from transformers import WhisperProcessor, WhisperForConditionalGeneration

We can now choose the model to use and its configuration. Here we are going for Whisper tiny.en, which allows for the fastest execution speed whilst also delivering great transcription quality, as it is specialised in a single language, English. We are going to use two IPUs to run this model: on the first we place the encoder side of the Transformer model and on the second the decoder. By calling .half() on our model, we enable the use of fp16 precision, resulting in near double the throughput of fp32.

model_spec = "openai/whisper-tiny.en"

# Instantiate processor and model
processor = WhisperProcessor.from_pretrained(model_spec)
model = WhisperForConditionalGeneration.from_pretrained(model_spec)

# Adapt whisper to run on the IPU
ipu_config = IPUConfig(ipus_per_replica=2)
pipelined_model = to_pipelined(model, ipu_config)
pipelined_model = pipelined_model.parallelize(for_generation=True).half()
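
To see what half precision trades away, the sketch below uses Python's standard `struct` module to compare the storage size of fp32 and fp16 values and the rounding that each format introduces. It only illustrates the number formats; it says nothing about how the IPU executes the model.

```python
import struct


def round_trip(value: float, fmt: str) -> float:
    """Store a Python float in the given binary format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


fp32_bytes = struct.calcsize("<f")  # IEEE 754 single precision
fp16_bytes = struct.calcsize("<e")  # IEEE 754 half precision

value = 0.1
fp32_error = abs(round_trip(value, "<f") - value)
fp16_error = abs(round_trip(value, "<e") - value)

print(f"fp32: {fp32_bytes} bytes per value, rounding error {fp32_error:.2e}")
print(f"fp16: {fp16_bytes} bytes per value, rounding error {fp16_error:.2e}")
```

Each fp16 value takes half the bytes of an fp32 value, so there is half as much data to store and move, which is consistent with the near-double throughput mentioned above. The cost is coarser rounding.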

The model is ready, now let’s get some audio data to transcribe. We are using the well-known librispeech dataset, which contains pairs of audio data with corresponding transcriptions. If you are using your own audio and need to convert it into a file format recognised by Whisper, we would suggest using a free application such as FFmpeg.

# Load the dataset and read an example soundfile
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
test_sample = ds[2]
sample_rate = test_sample['audio']['sampling_rate']
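
If you are bringing your own recording, one option is to convert it to a mono 16 kHz WAV file with FFmpeg before loading it. The sketch below only builds the command line. The file names are placeholders, and the 16 kHz target is an assumption based on the sampling rate Whisper's feature extractor expects, so compare it with the `sample_rate` value read above.

```python
import shlex


def ffmpeg_convert_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Return an FFmpeg command that resamples `src` to mono audio at `sample_rate` Hz."""
    return [
        "ffmpeg",
        "-i", src,                # input file (any format FFmpeg can decode)
        "-ar", str(sample_rate),  # output sampling rate in Hz
        "-ac", "1",               # downmix to a single channel
        dst,
    ]


command = ffmpeg_convert_command("my_recording.mp3", "my_recording.wav")
print(shlex.join(command))
# To run it: subprocess.run(command, check=True), with FFmpeg installed.
```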

Then, we create a function that calls our Whisper model and apply it to the audio data. After this, we print the generated transcription to the console, observe that it matches the ground truth that we got from librispeech, and visually inspect the audio we just transcribed.

def transcribe(data, rate):
    input_features = processor(data, return_tensors="pt", sampling_rate=rate).input_features.half()

    # This triggers a compilation the first time around (unless a precompiled model is available)
    sample_output = pipelined_model.generate(input_features, max_length=448, min_length=3)
    transcription = processor.batch_decode(sample_output, skip_special_tokens=True)[0]
    return transcription

test_transcription = transcribe(test_sample["audio"]["array"], sample_rate)
print(f"Expected: {test_sample['text']}\n")
print(f"Transcribed: {test_transcription}")
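
Reference transcripts and model output often differ in casing and punctuation even when the words agree. The sketch below is one simple way to compare them on words alone. The two example strings are made up for illustration, and `normalise` is a hypothetical helper, far simpler than the text normalisers used in published evaluations.

```python
import string


def normalise(text: str) -> list[str]:
    """Lower-case the text, drop ASCII punctuation and split into words."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


expected = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG"       # made-up reference
transcribed = " The quick brown fox jumps over the lazy dog."  # made-up model output

words_match = normalise(expected) == normalise(transcribed)
print(f"Words match after normalisation: {words_match}")
```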

Image by author.

Hopefully you agree that it is quick and easy to get started transcribing with Whisper on Graphcore IPUs.

Of course, Whisper has many more talents, such as translation and multilingual transcription.

If you want to take your Whisper usage on IPUs further, or explore larger versions of the model, please feel free to contact us.
