Part 4 of 8
Running powerful NLP models with the HuggingFace `transformers` library.
Welcome to the third lesson of the course. Let's quickly recap our progress so far.
Next we will use our first LLM. We'll start with a Natural Language Processing (NLP) model provided by the HuggingFace team.
First, let's set up our notebook to be fully interactive and easy to use. We can do this with a couple of "magic functions" built into Jupyter.
Specifically, we use the `autoreload` and `matplotlib` magic functions. The cell below shows them in action:
#| classes: code-alone
# best practice notebook magic
%load_ext autoreload
%autoreload 2
%matplotlib inline
Let's take a look at what these magic functions do.
`autoreload` dynamically reloads code libraries, even as they're changing under the hood. That means we do not have to restart the notebook after every change. We can instead code and experiment on the fly.

`%matplotlib inline` automatically displays any plots below the code cell that created them. The plots are also saved in the notebook itself, which is perfect for our blog posts.
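As a quick, hypothetical illustration of `%matplotlib inline` (this plot is not part of the lesson's code, and the values are arbitrary), running a cell like the one below renders the figure directly under the cell:

# a tiny plot to confirm inline display works (arbitrary values)
import matplotlib.pyplot as plt
plt.plot([0, 1, 2, 3], [0, 1, 4, 9])
plt.title("inline plot check")
plt.show()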
All of our notebooks going forward will start with these magic functions.
Let's start with the "hello, world!" of NLP: sentiment analysis.
:::: callout-note The code and examples below are based on the official HuggingFace tutorial, reworked to better suit the course. ::::
Imagine that we're selling some product, and we've gathered a bunch of reviews from a large group of users to find out what people are saying. The bad reviews will point out where our product needs improving. The positive reviews will show what we're doing right.
Figuring out the tone of a statement (positive vs. negative) is an area of NLP known as sentiment analysis.
Going through each review would give us a ton of insight about our product, but it would also take intense, manual effort. Enter Machine Learning to the rescue! An NLP model can automatically analyze and classify the reviews in bulk.
Let's take a look at the HuggingFace NLP model that we'll run. At a high level, the model is built around three key pieces:

- `Config` file.
- `Preprocessor` file.
- `Model` file(s).

The HuggingFace API has a handy, high-level `pipeline` that wraps up all three objects for us.
:::: callout-important
Before going forward, make sure that the `llm-env` environment from the first lesson is active. This environment has the HuggingFace libraries used below.
::::
The code below uses the `transformers` library to build a Sentiment Analysis `pipeline`.
# load in the pipeline object from HuggingFace
from transformers import pipeline

# create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")
Since we didn't specify a model, you can see in the output above that HuggingFace picked a distilbert model for us by default.
We will learn more about what exactly `distilbert` is and how it works later on. For now, think of it as a useful NLP genie who can look at a sentence and tell us whether it has a positive or negative tone.
Next, let's find out what the model thinks about the sentence: "HuggingFace pipelines are awesome!"
# sentiment analysis on a simple, example sentence
example_sentence = "HuggingFace pipelines are awesome!"
classifier(example_sentence)
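The classifier returns a Python list with one result dictionary per input. The exact score will vary slightly across `transformers` versions, but the output looks roughly like this (the score value below is illustrative, not a recorded result):

# illustrative output format; your exact score may differ
# [{'label': 'POSITIVE', 'score': 0.9998}]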
Not bad. We see a strongly confident score for the `POSITIVE` label, as expected.
We can also pass many sentences at once, which starts to show the bulk processing power of these models. Let's process four sentences at once: three positive ones, and a clearly negative one.
# many sentences at once, in a python list
many_sentences = [
    "HuggingFace pipelines are awesome!",
    "I hope you're enjoying this course so far",
    "Hopefully the material is clear and useful",
    "I don't like this course so far",
]

# process many sentences at once
results = classifier(many_sentences)

# check the tone of each sentence
for result in results:
    print(f"label: {result['label']}, score: {round(result['score'], 4)}")
Congrats! You've now run a HuggingFace pipeline and used it to analyze the tone of a few sentences. Next, let's take a closer look at the pipeline object.
`pipeline`
Under the hood, a pipeline handles three key HuggingFace NLP pieces: Config, Preprocessor, and Model.
To better understand each piece, let's take one small step down the ladder of abstraction and build our own simple pipeline.
We will use the same `distilbert` model from before. First we need the three key pieces mentioned above. Thankfully, we can import each of these pieces from the `transformers` library.
The `config` class is a simple map with the options and configurations of a model. It has the key-value pairs that define a model's architecture and hyperparameters.
# config for the model
from transformers import DistilBertConfig
The `preprocessor` object in this case is a `Tokenizer`. Tokenizers convert strings and characters into special tensor inputs for the LLM.
:::: callout-note Correctly pre-processing inputs is one of the most important and error-prone steps in using ML models. In other words, it's good to offload to a class that's already been tested and debugged. ::::
# input preprocessor to tokenize strings
from transformers import DistilBertTokenizer
The `model` class holds the weights and parameters for the actual LLM. It's the "meat and bones" of the setup, so to speak.
# the text classifier model
from transformers import DistilBertForSequenceClassification
We need to know a model's full, proper name in order to load it from HuggingFace. Its name is how we find the model on the HuggingFace Model Hub. Once we know the full name, there is a handy `from_pretrained()` function that will automatically find and download the pieces for us.
In this case, the distilbert model's full name is: `distilbert-base-uncased-finetuned-sst-2-english`.
#| classes: code-alone
# sentiment analysis model name
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
In the code below we can now load each of the three NLP pieces for this model.
#| classes: code-alone
# create the config
config = DistilBertConfig.from_pretrained(model_name)
# create the input tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
# create the model
model = DistilBertForSequenceClassification.from_pretrained(model_name)
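To get a quick feel for the `config` object's key-value pairs, we can peek at a couple of its standard attributes. `num_labels` and `id2label` are standard fields on HuggingFace config objects; the values in the comments are what this sentiment checkpoint is expected to report:

# a peek at two standard config attributes
print(config.num_labels)  # 2: a binary (negative/positive) classifier
print(config.id2label)    # {0: 'NEGATIVE', 1: 'POSITIVE'}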
Next we will compose these three pieces together to mimic the original `pipeline` example.
`simple_pipeline`
First, we create a `preprocess` function to turn a given `text` string into the proper, tokenized inputs that an LLM expects.
#| classes: code-alone
def preprocess(text: str):
    """
    Sends `text` through the model's tokenizer.

    The tokenizer turns words and characters into proper inputs for an NLP model.
    """
    tokenized_inputs = tokenizer(text, return_tensors='pt')
    return tokenized_inputs
Let's test this preprocessing function on the example sentence from earlier.
# manually preprocessing the example sentence: "HuggingFace pipelines are awesome!"
preprocess(example_sentence)
It turned the input string into numerical token IDs for the LLM. We'll break down what exactly this output means later on in the course. For now, think of it as sanitizing and formatting the text into a format that the LLM has been trained to work with.
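To make the output above a bit less mysterious, we can map the token IDs back to the string pieces they stand for. `convert_ids_to_tokens` is a standard method on HuggingFace tokenizers; the pieces shown in the comment are only a rough sketch and may differ slightly:

# map the numerical token IDs back to their string pieces
token_inputs = preprocess(example_sentence)
print(tokenizer.convert_ids_to_tokens(token_inputs['input_ids'][0]))
# roughly: ['[CLS]', 'hugging', '##face', 'pipeline', '##s', 'are', 'awesome', '!', '[SEP]']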
Next up, let's make our own `forward` function that runs the LLM on preprocessed inputs.
#| classes: code-alone
def forward(text: str):
    """
    First we preprocess the `text` into tokens.

    Then we send the `tokenized_inputs` to the model.
    """
    tokenized_inputs = preprocess(text)
    outputs = model(**tokenized_inputs)
    return outputs
Let's check what this outputs for our running example sentence.
outputs = forward(example_sentence); outputs
You'll see a lot going on in the `SequenceClassifierOutput` above. To be honest, this is where the original `pipeline` does most of the heavy lifting for us. It takes the raw, detailed output from an LLM and converts it into a more human-readable format.
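The piece we care about is the `logits` tensor: one raw, unnormalized score per label. Its shape is (number of input sentences, number of labels), which we can confirm directly:

# the raw scores: one row per input, one column per label
outputs.logits.shape  # torch.Size([1, 2])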
We'll mimic this heavy lifting by using the `Config` class and the model outputs to find out whether the sentence is positive or negative.
#| classes: code-alone
def process_outputs(outs):
    """
    Converts the raw model outputs into a human-readable result.

    Steps:
    1. Grab the raw "scores" from the model for the Positive and Negative labels.
    2. Find out which score is the highest (aka the model's decision).
    3. Use the `config` object to find the class label for the highest score.
    4. Turn the raw score into a human-readable probability value.
    5. Return the predicted label with its probability.
    """
    # 1. grab the raw "scores" from the model for the Positive and Negative labels
    logits = outs.logits
    # 2. find the strongest label score, aka the model's decision
    pred_idx = logits.argmax(1).item()
    # 3. use the `config` object to find the class label
    pred_label = config.id2label[pred_idx]
    # 4. calculate the human-readable probability for the score
    pred_score = logits.softmax(-1)[:, pred_idx].item()
    # 5. return the label and score in a dictionary
    return {
        'label': pred_label,
        'score': pred_score,
    }
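As a sanity check on step 4: softmax simply exponentiates the logits and normalizes them so they sum to 1. Here's a tiny standalone example with made-up numbers:

# softmax demo on made-up logits for [NEGATIVE, POSITIVE]
import torch
fake_logits = torch.tensor([[-2.0, 3.0]])
print(fake_logits.softmax(-1))  # tensor([[0.0067, 0.9933]])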
We can now put together a `simple_pipeline`, and check how it compares to the original `pipeline`.
def simple_pipeline(text):
    """
    Putting the NLP pieces and functions together into a pipeline.
    """
    # get the model's raw output
    model_outs = forward(text)
    # convert the raw outputs into a human-readable result
    predictions = process_outputs(model_outs)
    return predictions
Calling the `simple_pipeline` on the example sentence, drumroll please...
# running our simple pipeline on the example text
simple_pipeline(example_sentence)
And just like that, we took a small peek under the `pipeline` hood and built our own simple, working version.

One pain point: we had to know the full, proper names of the different `DistilBert*` pieces to import the Config, Preprocessor, and Model. This gets overwhelming fast given the flood of LLM models released almost daily. Thankfully, HuggingFace has a great solution to this problem: the `Auto` classes.
`Auto` classes

With `Auto` classes, we don't have to know the exact, proper names of the LLM's objects to import them. We only need the model's full name on the hub:
# viewing our distilbert model's name
model_name
Run the cell below to import the Auto classes. Then we'll use them with the model name to create an even cleaner `simple_pipeline`.
#| classes: code-alone
# importing the Auto classes
from transformers import AutoConfig
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
Next we create the three key NLP pieces with the Auto classes.
#| classes: code-alone
# building the pieces with `Auto` classes
config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
We can now use these pieces to build a `SentimentPipeline` class that's cleaner than before and can handle any `model_name`:
class SentimentPipeline:
    def __init__(self, model_name: str):
        """
        Simple Sentiment Analysis pipeline.
        """
        self.model_name = model_name
        self.config = AutoConfig.from_pretrained(self.model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_name)

    def preprocess(self, text: str):
        """
        Sends `text` through the LLM's tokenizer.

        The tokenizer turns words and characters into special inputs for the LLM.
        """
        tokenized_inputs = self.tokenizer(text, return_tensors='pt')
        return tokenized_inputs

    def forward(self, text: str):
        """
        First we preprocess the `text` into tokens.

        Then we send the `token_inputs` to the model.
        """
        token_inputs = self.preprocess(text)
        outputs = self.model(**token_inputs)
        return outputs

    def process_outputs(self, outs):
        """
        Here we mimic the post-processing that HuggingFace automatically does in its `pipeline`.
        """
        # grab the raw scores from the model for Positive and Negative labels
        logits = outs.logits
        # find the strongest label score, aka the model's decision
        pred_idx = logits.argmax(1).item()
        # use the `config` object to find the actual class label
        pred_label = self.config.id2label[pred_idx]
        # calculate the human-readable probability score for this class
        pred_score = logits.softmax(-1)[:, pred_idx].item()
        # return the predicted label and its score
        return {
            'label': pred_label,
            'score': pred_score,
        }

    def __call__(self, text: str):
        """
        Overriding the call method to easily and intuitively call the pipeline.
        """
        model_outs = self.forward(text)
        preds = self.process_outputs(model_outs)
        return preds
`SentimentPipeline`
Let's leverage both the new class and a different model to show the power of the Auto classes. For fun, let's use a BERT model that was trained specifically on tweets. The model's full name is `finiteautomata/bertweet-base-sentiment-analysis`.
#| classes: code-alone
# using a different model
new_model_name = 'finiteautomata/bertweet-base-sentiment-analysis'
# creating a new sentiment pipeline
simple_pipeline = SentimentPipeline(new_model_name)
Now let's run it on our handy example sentence.
# calling our new, flexible pipeline
simple_pipeline(example_sentence)
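One subtle detail: this tweet model defines its own label names, which our `SentimentPipeline` picks up automatically through the config's `id2label` mapping. We can verify this directly (the labels in the comment are what this model is expected to report, e.g. negative/neutral/positive):

# the tweet model brings its own label names via its config
print(simple_pipeline.config.id2label)  # e.g. {0: 'NEG', 1: 'NEU', 2: 'POS'}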
Congrats! You've now built a flexible pipeline for Sentiment Analysis that can leverage most NLP models on the HuggingFace hub.
This notebook went through the basics of using a HuggingFace pipeline to run sentiment analysis on a few sentences. We then looked under the hood at the pipeline's three key pieces: Config, Preprocessor, and Model.
Lastly, we built our own `simple_pipeline` from scratch to see how the pieces fit together.
The goal of this notebook was twofold. First, we wanted to gain hands-on experience with the `transformers` API from HuggingFace. It's an incredibly powerful library that lets us do what used to be difficult, research-level NLP tasks in a few lines of code.
Second, we wanted to get some familiarity with downloading models. The model weights that we downloaded from HuggingFace are the same ones that we will be fine-tuning, quantizing, and deploying on our devices throughout the course.
There are two appendixes below. The first one gives a handy way of counting the number of weights in a model. The second one goes into more detail about how to interactively debug and analyze code in a Jupyter notebook.
The following code snippet counts the number of trainable parameters in a model. It's a question that comes up often when working with LLMs, and having a quick way to find a model's rough size comes in handy.
#| classes: code-alone
def count_parameters(model):
    """
    Counts the number of trainable parameters in a `model`.
    """
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
Here we use it to count the number of parameters in the distilbert model from above.
# view the number of parameters in the last model used
f"Number of trainable params: {count_parameters(model):,}"
`classifier`, notebook style.

What is the `classifier` object, exactly? Jupyter has many powerful ways of inspecting and analyzing code.
One of the simplest ways of checking an object is to call it by itself in a code cell, as shown below.
# show the contents of the `classifier` object
classifier
We can see the `classifier` is a `TextClassificationPipeline`. This makes sense: we fed it an input sentence and asked it to classify the statement as positive vs. negative.
There is also a tab-autocomplete feature to find the members and methods of an object. For example, to look up everything in `classifier`, hit tab after adding a `.`.
Uncomment the cells below and hit the tab key to test the auto-complete feature.
## tab after the `.` to auto-complete all variables/methods
# classifier.
Let's say you vaguely remember the name of a variable or function, for example the `forward()` method. In that case, you can type the first few letters and hit tab to auto-complete the full set of options:
## tab after the `.for` to auto-complete the rest of the options
# classifier.for
`?` and `??`
Lastly, we can interrogate an object in Jupyter for more information. If we tag a single `?` after an object, we'll get its basic documentation (docstring). Note that we omit the output here to keep the notebook from getting too busy.
#| output: false
## the power of asking questions
classifier?
If we tag on two question marks, `??`, then we get the full source code of the object:
#| output: false
## really curious about classifier
classifier??
Both `?` and `??` are excellent, quick ways to look under the hood of any object in Jupyter.
The `classifier` function

Let's take a look at the function that does the heavy lifting for our sentiment analysis task: `forward()`.
# looking at what actually runs the inputs
classifier.forward
What does this function actually do? Let's find out.
# source code of the forward function
classifier.forward??
We can see that it automatically handles whether we're running a TensorFlow (`tf`) or PyTorch (`pt`) model. Then, it makes sure the tensors are on the correct device. Lastly, it calls another function, `_forward()`, on the prepared inputs.
We can follow the rabbit hole as far down as needed. Let's take a look at the source of `_forward`.
# going deeper
classifier._forward??
Ah, we can see it calls the `model` of the classifier. This is the `distilbert` model we saw earlier! Now we can peek under the hood at the actual Transformer LLM.
# the distilbert sentiment analysis model
classifier.model
We will break down the different pieces of this model later on in the course.

The important takeaway for now is that this shows the main structure of most Transformer LLMs. Other models are mostly incremental changes on this foundation.