Summary
Hugging Face is a leading AI company and a prominent player in the field of natural language processing (NLP). It is best known for its open-source contributions, especially the transformers library, which provides access to a wide range of pre-trained models for a variety of NLP tasks. These models include state-of-the-art architectures such as BERT, GPT, T5, and many others, which have been pre-trained on large datasets and can be fine-tuned for specific tasks with minimal effort.
The transformers library is designed to make it easier for researchers and developers to implement powerful NLP solutions by providing easy-to-use tools for both fine-tuning and inference. Fine-tuning refers to the process of adapting a pre-trained model to a specific application, such as sentiment analysis, text classification, or summarization. Hugging Face's library abstracts much of the complexity of this process, allowing users to focus on their specific task rather than on the intricacies of training deep learning models.
Additionally, Hugging Face has created an ecosystem around their library, providing access to datasets, model hubs, and other tools that help accelerate the development of AI applications. This includes a model hub where users can find and share pre-trained models for a variety of tasks. Hugging Face has also expanded into areas such as speech recognition and computer vision, making their tools applicable to more than just text-based NLP tasks.
This notebook provides an introduction to Hugging Face. The Python functions and data files needed to run it are available via this link.
To locate a specific model on Hugging Face, we can use the model filter options:
task: Filters models based on the specified task.
sort: Determines the sorting criteria for the models.
direction: Specifies the sorting order; -1 for descending order.
limit: Restricts the number of models returned.
#!pip install huggingface_hub==0.23.5
from huggingface_hub import HfApi
api = HfApi()
D:\Learning\MyWebsite\FinalGithub\AlreadyPublihsed\blogs\HuggingFace\env_hugg_face\lib\site-packages\tqdm\auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm
## shows list of available models
#api = HfApi()
#list(api.list_models())
from huggingface_hub import ModelFilter
models = api.list_models(
    filter=ModelFilter(task="summarization"),
    sort="downloads",
    direction=-1,
    limit=2
)
modelList = list(models)
modelList
D:\Learning\MyWebsite\FinalGithub\AlreadyPublihsed\blogs\HuggingFace\env_hugg_face\lib\site-packages\huggingface_hub\utils\endpoint_helpers.py:247: FutureWarning: 'ModelFilter' is deprecated and will be removed in huggingface_hub>=0.24. Please pass the filter parameters as keyword arguments directly to the `list_models` method. warnings.warn(
[ModelInfo(id='google-t5/t5-small', author='google-t5', sha='df1b051c49625cf57a3d0d8d3863ed4d13564fe4', created_at=datetime.datetime(2022, 3, 2, 23, 29, 4, tzinfo=datetime.timezone.utc), last_modified=datetime.datetime(2023, 6, 30, 2, 31, 26, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=None, downloads=7478752, likes=365, library_name='transformers', tags=['transformers', 'pytorch', 'tf', 'jax', 'rust', 'onnx', 'safetensors', 't5', 'text2text-generation', 'summarization', 'translation', 'en', 'fr', 'ro', 'de', 'multilingual', 'dataset:c4', 'arxiv:1805.12471', 'arxiv:1708.00055', 'arxiv:1704.05426', 'arxiv:1606.05250', 'arxiv:1808.09121', 'arxiv:1810.12885', 'arxiv:1905.10044', 'arxiv:1910.09700', 'license:apache-2.0', 'autotrain_compatible', 'text-generation-inference', 'endpoints_compatible', 'region:us'], pipeline_tag='translation', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=[RepoSibling(rfilename='.gitattributes', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='README.md', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='flax_model.msgpack', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='generation_config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='model.safetensors', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_model.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_model_merged.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_model_merged_quantized.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_model_quantized.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_with_past_model.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/decoder_with_past_model_quantized.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/encoder_model.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='onnx/encoder_model_quantized.onnx', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='pytorch_model.bin', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='rust_model.ot', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='spiece.model', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tf_model.h5', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tokenizer.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tokenizer_config.json', size=None, blob_id=None, lfs=None)], spaces=None, safetensors=None), ModelInfo(id='facebook/bart-large-cnn', author='facebook', sha='37f520fa929c961707657b28798b30c003dd100b', created_at=datetime.datetime(2022, 3, 2, 23, 29, 5, tzinfo=datetime.timezone.utc), last_modified=datetime.datetime(2024, 2, 13, 18, 2, 5, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=None, downloads=5060118, likes=1205, library_name='transformers', tags=['transformers', 'pytorch', 'tf', 'jax', 'rust', 'safetensors', 'bart', 'text2text-generation', 'summarization', 'en', 'dataset:cnn_dailymail', 'arxiv:1910.13461', 'license:mit', 'model-index', 'autotrain_compatible', 'endpoints_compatible', 'region:us'], pipeline_tag='summarization', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=[RepoSibling(rfilename='.gitattributes', size=None, blob_id=None, 
lfs=None), RepoSibling(rfilename='README.md', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='flax_model.msgpack', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='generation_config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='generation_config_for_summarization.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='merges.txt', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='model.safetensors', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='pytorch_model.bin', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='rust_model.ot', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tf_model.h5', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tokenizer.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='vocab.json', size=None, blob_id=None, lfs=None)], spaces=None, safetensors=None)]
The AutoModel class from the transformers library is a convenient wrapper that automatically selects and imports the appropriate model architecture based on a given model ID.
# Import AutoModel
from transformers import AutoModel
#pip install torch torchvision torchaudio
Specify the model ID for text summarization:
modelId = 'google-t5/t5-small'
model = AutoModel.from_pretrained(modelId)
# Save the model to a local directory
model.save_pretrained(save_directory=f"models/{modelId}")
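The saved weights can later be reloaded from that local directory. A minimal sketch, reusing the models/{modelId} path from above:
# Reload the model from the local directory it was saved to
local_model = AutoModel.from_pretrained(f"models/{modelId}")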
Transformer Components
Transformers are advanced neural network models designed to learn context and sequence understanding effectively. The three primary components of a transformer model are:
By executing the command pip install datasets, you gain access to the Hugging Face Datasets library, which provides various functionalities for working with datasets. For detailed documentation, refer to the Hugging Face Datasets guide.
#!pip install datasets
#!pip install fsspec==2023.9.2
#!pip install -U datasets
from datasets import load_dataset_builder
data_builder = load_dataset_builder("imdb")
data_builder.info.description
''
data_builder.info.features
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['neg', 'pos'], id=None)}
Downloading a Dataset
from datasets import load_dataset
data = load_dataset("imdb_urdu_reviews")
Split parameter
data = load_dataset("imdb", split="train")
Configuration parameter
#data = load_dataset("wikipedia", "20220301.en")
from datasets import load_dataset
ds = load_dataset("fka/awesome-chatgpt-prompts")
ds_train = load_dataset("fka/awesome-chatgpt-prompts", split="train")
ds_train
Dataset({ features: ['act', 'prompt'], num_rows: 170 })
print(ds_train.shape)
(170, 2)
import pprint
pprint.pprint(ds_train[1])
{'act': 'SEO Prompt', 'prompt': 'Using WebPilot, create an outline for an article that will be ' "2,000 words on the keyword 'Best SEO prompts' based on the top 10 " 'results from Google. Include every relevant heading possible. Keep ' 'the keyword density of the headings high. For each section of the ' 'outline, include the word count. Include FAQs section in the ' 'outline too, based on people also ask section from Google for the ' 'keyword. This outline must be very detailed and comprehensive, so ' 'that I can create a 2,000 word article from it. Generate a long ' 'list of LSI and NLP keywords related to my keyword. Also include ' 'any other words related to the keyword. Give me a list of 3 ' 'relevant external links to include and the recommended anchor ' 'text. Make sure they’re not competing articles. Split the outline ' 'into part 1 and part 2.'}
Benefits of Datasets
The datasets library is designed for enhanced usability in machine learning workflows.
There are two approaches to loading Hugging Face models:
The Auto classes in the transformers library provide general-purpose interfaces for various components (models, tokenizers, and configurations), making them flexible and convenient for machine learning workflows. These classes offer direct control, making them ideal for customizing and optimizing machine learning tasks.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import BertModel
# Load pretrained BERT-base model with 12 encoder layers and 110M parameters
model_BERT_base = BertModel.from_pretrained('bert-base-uncased')
AutoTokenizer
Tokenizers are used to preprocess text-based input data for models. It is recommended to use the tokenizer associated with the model to ensure that the input is processed in the same manner as during the model's training. This is particularly important when using Auto classes. The pipeline functions handle much of this process automatically.
To retrieve the appropriate tokenizer for a model, use AutoTokenizer. As with other Auto classes, you can load it by passing the model name to from_pretrained. Here's how you can save it as tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
This ensures that the tokenizer is compatible with the model you're working with.
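As a quick illustration (the sample sentence here is an arbitrary assumption), the tokenizer can be called directly on text to produce the input IDs the model expects:
# Tokenize a sample sentence and inspect the resulting tokens
encoded = tokenizer("Hugging Face makes NLP easier!")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))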
The pipeline module provides a high-level abstraction compared to the more direct approach using Auto classes. It is imported from the transformers library and encapsulates all the task-specific steps required for inference, that is, making predictions on data. The pipeline is ideal for quickly applying pre-trained models to common machine learning tasks, especially when you're just getting started.
In other words, a pipeline is a wrapper that encapsulates the task-specific workflow for each machine learning task supported by the module, including:
SummarizationPipeline
TextClassificationPipeline
AudioClassificationPipeline
ImageSegmentationPipeline
QuestionAnsweringPipeline
These task-specific pipelines utilize Auto classes behind the scenes. They automatically download the appropriate models and apply the relevant processing, such as using tokenizers, based on the model name provided to the pipeline function.
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("I love Hugging Face!")
result
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english). Using a pipeline without specifying a model name and revision in production is not recommended.
[{'label': 'POSITIVE', 'score': 0.9998641014099121}]
Key Difference
from transformers import (
TextClassificationPipeline,
SummarizationPipeline,
ImageSegmentationPipeline,
AudioClassificationPipeline
)
models = api.list_models(
    filter=ModelFilter(task="question-answering"),
    sort="downloads",
    direction=-1,
    limit=2
)
modelList = list(models)
modelList[0]
D:\Learning\MyWebsite\FinalGithub\AlreadyPublihsed\blogs\HuggingFace\env_hugg_face\lib\site-packages\huggingface_hub\utils\endpoint_helpers.py:247: FutureWarning: 'ModelFilter' is deprecated and will be removed in huggingface_hub>=0.24. Please pass the filter parameters as keyword arguments directly to the `list_models` method. warnings.warn(
ModelInfo(id='deepset/roberta-base-squad2', author='deepset', sha='adc3b06f79f797d1c575d5479d6f5efe54a9e3b4', created_at=datetime.datetime(2022, 3, 2, 23, 29, 5, tzinfo=datetime.timezone.utc), last_modified=datetime.datetime(2024, 9, 24, 15, 48, 47, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=None, downloads=1405258, likes=809, library_name='transformers', tags=['transformers', 'pytorch', 'tf', 'jax', 'rust', 'safetensors', 'roberta', 'question-answering', 'en', 'dataset:squad_v2', 'base_model:FacebookAI/roberta-base', 'base_model:finetune:FacebookAI/roberta-base', 'license:cc-by-4.0', 'model-index', 'endpoints_compatible', 'region:us'], pipeline_tag='question-answering', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=[RepoSibling(rfilename='.gitattributes', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='README.md', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='flax_model.msgpack', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='merges.txt', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='model.safetensors', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='pytorch_model.bin', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='rust_model.ot', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='special_tokens_map.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tf_model.h5', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tokenizer_config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='vocab.json', size=None, blob_id=None, lfs=None)], spaces=None, safetensors=None)
models = api.list_models(
    filter=ModelFilter(task="question-answering"),
    sort="likes",
    direction=-1,
    limit=2
)
modelList = list(models)
modelList[0]
ModelInfo(id='deepset/roberta-base-squad2', author='deepset', sha='adc3b06f79f797d1c575d5479d6f5efe54a9e3b4', created_at=datetime.datetime(2022, 3, 2, 23, 29, 5, tzinfo=datetime.timezone.utc), last_modified=datetime.datetime(2024, 9, 24, 15, 48, 47, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=None, downloads=1405258, likes=809, library_name='transformers', tags=['transformers', 'pytorch', 'tf', 'jax', 'rust', 'safetensors', 'roberta', 'question-answering', 'en', 'dataset:squad_v2', 'base_model:FacebookAI/roberta-base', 'base_model:finetune:FacebookAI/roberta-base', 'license:cc-by-4.0', 'model-index', 'endpoints_compatible', 'region:us'], pipeline_tag='question-answering', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, siblings=[RepoSibling(rfilename='.gitattributes', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='README.md', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='flax_model.msgpack', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='merges.txt', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='model.safetensors', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='pytorch_model.bin', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='rust_model.ot', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='special_tokens_map.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tf_model.h5', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='tokenizer_config.json', size=None, blob_id=None, lfs=None), RepoSibling(rfilename='vocab.json', size=None, blob_id=None, lfs=None)], spaces=None, safetensors=None)
We can specify the task, the model, or both when creating a pipeline, as shown below:
from transformers import BertForMaskedLM, pipeline
nlp = pipeline(task="fill-mask",
               model='bert-base-cased')
print(type(nlp.model))
preds = nlp(f"If you don’t know how to swim, you will {nlp.tokenizer.mask_token} in this lake.")
for p in preds:
    print(f"Token:{p['token_str']}. Score: {100*p['score']:,.2f}%")
BertForMaskedLM has generative capabilities, as `prepare_inputs_for_generation` is explicitly overwritten. However, it doesn't directly inherit from `GenerationMixin`. From 👉v4.50👈 onwards, `PreTrainedModel` will NOT inherit from `GenerationMixin`, and this model will lose the ability to call `generate` and other related functions. - If you're using `trust_remote_code=True`, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes - If you are the owner of the model architecture code, please modify your model class such that it inherits from `GenerationMixin` (after `PreTrainedModel`, otherwise you'll get an exception). - If you are not the owner of the model architecture class, please contact the model code owner to update it. Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight'] - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
<class 'transformers.models.bert.modeling_bert.BertForMaskedLM'> Token:drown. Score: 72.56% Token:die. Score: 23.95% Token:be. Score: 0.63% Token:drowned. Score: 0.45% Token:fall. Score: 0.39%
If only one of these is defined, the pipeline module will use the default for the other, based on what is specified for the task or the model in the Hub.
nlp = pipeline(task="fill-mask")
No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base). Using a pipeline without specifying a model name and revision in production is not recommended. Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight'] - This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
pipeline with Auto classes
The pipeline function can also utilize Auto Classes for the model, tokenizer, configuration, and more, providing additional flexibility if needed. When only the model is passed to the pipeline, it will automatically infer the task and the required tokenizer to complete the pipeline.
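For instance, a minimal sketch passing only a model identifier (reusing the sentiment checkpoint seen earlier); the pipeline infers the task and tokenizer from the model's configuration on the Hub:
# Passing only a model id; the task is inferred from the model's config
sentiment = pipeline(model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
sentiment("Hugging Face pipelines are convenient!")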
from transformers import DistilBertForSequenceClassification
model_BERT_base = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
#my_pipeline = pipeline(model=model_BERT_base)
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
from transformers import DistilBertTokenizerFast
sequence_clf_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
pipe = pipeline("text-classification", model=sequence_clf_model, tokenizer=tokenizer)
pipe('Please add Here We Go by Dispatch to my road trip playlist')
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[{'label': 'LABEL_0', 'score': 0.529563844203949}]
Clean Tokens with Tokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
print(tokenizer.backend_tokenizer.normalizer.normalize_str('Hey yo, how aré yoü?'))
hey yo, how are you?
Lastly, there are model-specific classes, like GPT2Tokenizer, designed for individual model tokenizers. These need to be imported from the transformers library, similar to AutoTokenizer. The tokenizer can then be used with the .tokenize() method, where the input text is passed to the function. The GPT2Tokenizer splits the input on whitespace and adds a special "Ġ" character to denote a preceding whitespace.
from transformers import GPT2Tokenizer
input = "Hey yo, how aré yoü?"
gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt_tokens = gpt_tokenizer.tokenize(text=input)
gpt_tokens
['Hey', 'Ġyo', ',', 'Ġhow', 'Ġar', 'é', 'Ġyo', 'ü', '?']
In the output above, the "Ġ" character marks tokens that were preceded by a whitespace in the original input.
Images consist of small units called pixels, each representing a point in the image and containing information about its color or grayscale intensity. The number of pixels depends on the image's resolution. Just as words form the foundation of text analysis, the information within pixels serves as the basis for image analysis, processing, and machine learning tasks.
Pre-processing is crucial when using images for machine learning tasks. Two common pre-processing operations are cropping and resizing:
Cropping
Involves removing unwanted portions of the original image, focusing only on the relevant areas.
Resizing
Adjusts the dimensions of the image, either enlarging or shrinking it to specific height and width. However, this process should be done carefully, as it can impact the resolution.
To perform specific processing steps on an image, use the image_transforms utilities module in the transformers library.
from transformers import image_transforms
An image can be loaded into Python using Image.open() from the Pillow library. We won't cover this package in detail here, as images will be pre-loaded in the exercises.
from PIL import Image
original_image = Image.open("./data/image_1.jpg")
original_image
Image transformations usually take place on a NumPy array representation of the image. To convert an image to a NumPy array, pass the original image to the np.array() function and save it as an image array.
import numpy as np
image_array = np.array(original_image)
image_array
array([[[ 0, 82, 140], [ 0, 82, 140], [ 1, 83, 141], ..., [ 84, 157, 202], [ 92, 154, 205], [ 73, 125, 183]], [[ 0, 82, 140], [ 0, 82, 140], [ 1, 83, 141], ..., [ 58, 145, 188], [ 69, 145, 194], [ 61, 127, 179]], [[ 0, 82, 140], [ 0, 82, 140], [ 1, 83, 141], ..., [ 35, 138, 179], [ 49, 140, 185], [ 58, 136, 184]], ..., [[208, 206, 59], [205, 199, 63], [200, 191, 70], ..., [ 42, 55, 0], [ 32, 44, 0], [ 24, 34, 0]], [[164, 182, 0], [163, 178, 0], [159, 169, 0], ..., [ 58, 74, 9], [ 43, 57, 6], [ 35, 47, 11]], [[155, 176, 11], [160, 178, 18], [157, 172, 17], ..., [ 46, 62, 17], [ 30, 43, 15], [ 21, 31, 20]]], dtype=uint8)
Cropping an image
To crop an image, use the .center_crop() method from image_transforms. This method requires two parameters: the image in NumPy array format and the target size for the crop, which is a rectangle defined by height and width to extract from the center of the image, for instance a 1000 by 1000 square. If we wanted to classify the image based on eye color, we could use the center crop method to return a new image focusing only on the center of the original.
image_cropped = image_transforms.center_crop(
    image=image_array,
    size=(500, 500)
)
image_cropped
array([[[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]], ..., [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0], [0, 0, 0], ..., [0, 0, 0], [0, 0, 0], [0, 0, 0]]], dtype=uint8)
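Resizing works in a similar way. A minimal sketch, assuming the image_transforms.resize utility and an arbitrary target size of 224 by 224:
from transformers import image_transforms
# Resize the NumPy image array to 224x224 (height, width); note this changes resolution
image_resized = image_transforms.resize(
    image=image_array,
    size=(224, 224)
)
image_resized.shape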
from transformers import pipeline
classifier = pipeline(task="image-classification",
                      model="google/vit-base-patch16-224")
Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.
Let's take the previous image as an example. The instantiated classifier pipeline can accept an image in three forms: an HTTP link (as a string), the local path to the image (as a string), or an image loaded using the Pillow (PIL) Python library. When using PIL, first import the Image class, then use the Image.open() method, passing in the local path to the image file (e.g., a JPEG). The advantage of PIL images is that the library provides standard functions for working with images, such as loading and saving. If there are multiple images, you can pass a list of links, local paths, or PIL images.
from PIL import Image
original_image = Image.open("./data/image_2.jpg")
original_image
classifier(original_image)
[{'label': 'reflex camera', 'score': 0.5019135475158691}, {'label': 'binoculars, field glasses, opera glasses', 'score': 0.43400973081588745}, {'label': 'lens cap, lens cover', 'score': 0.03254600986838341}, {'label': 'tripod', 'score': 0.008797748945653439}, {'label': "loupe, jeweler's loupe", 'score': 0.0027222465723752975}]
# top_k option limits number of labels to return
results = classifier(original_image, top_k=2)
print(results[0]['label'])
reflex camera
results
[{'label': 'reflex camera', 'score': 0.5019135475158691}, {'label': 'binoculars, field glasses, opera glasses', 'score': 0.43400973081588745}]
Document Question and Answering (Document Q&A) is a machine learning task that involves answering questions based on the content of a given document or text passage. For example, this could involve a memo or any other text-based document. This task requires both a document and a question as input. The document can be image-based or text-based, such as a research paper, contract, user manual, or something similar. The question is a text string that asks something specific about the document's content, like "What are the action steps?" The answer is then generated by analyzing the document's content, and it can be either a direct quote or a paraphrased response.
For example, if we have a memo and want to quickly understand its content, we could use a document question answering pipeline. When performing inference with the pipeline, the necessary processing steps are handled behind the scenes. To instantiate a pipeline for Document Question and Answering, pass "document-question-answering" as the task parameter. We'll save this pipeline as doc_que_anw.
For a document QA pipeline, the input is slightly different: both the document and the question need to be provided. Using the doc_que_anw pipeline object, pass in the document in the form of an image, saved here as doc_image, and the question to ask, such as "What is this memo about?", saved as question_text.
from transformers import pipeline
#pip install sentencepiece
#pip install protobuf
doc_que_anw = pipeline(
    task="document-question-answering",
    model="naver-clova-ix/donut-base-finetuned-docvqa")
Config of the encoder: <class 'transformers.models.donut.modeling_donut_swin.DonutSwinModel'> is overwritten by shared encoder config: DonutSwinConfig { "attention_probs_dropout_prob": 0.0, "depths": [ 2, 2, 14, 2 ], "drop_path_rate": 0.1, "embed_dim": 128, "hidden_act": "gelu", "hidden_dropout_prob": 0.0, "hidden_size": 1024, "image_size": [ 2560, 1920 ], "initializer_range": 0.02, "layer_norm_eps": 1e-05, "mlp_ratio": 4.0, "model_type": "donut-swin", "num_channels": 3, "num_heads": [ 4, 8, 16, 32 ], "num_layers": 4, "patch_size": 4, "path_norm": true, "qkv_bias": true, "transformers_version": "4.46.3", "use_absolute_embeddings": false, "window_size": 10 } Config of the decoder: <class 'transformers.models.mbart.modeling_mbart.MBartForCausalLM'> is overwritten by shared decoder config: MBartConfig { "activation_dropout": 0.0, "activation_function": "gelu", "add_cross_attention": true, "add_final_layer_norm": true, "attention_dropout": 0.0, "bos_token_id": 0, "classifier_dropout": 0.0, "d_model": 1024, "decoder_attention_heads": 16, "decoder_ffn_dim": 4096, "decoder_layerdrop": 0.0, "decoder_layers": 4, "dropout": 0.1, "encoder_attention_heads": 16, "encoder_ffn_dim": 4096, "encoder_layerdrop": 0.0, "encoder_layers": 12, "eos_token_id": 2, "forced_eos_token_id": 2, "init_std": 0.02, "is_decoder": true, "is_encoder_decoder": false, "max_position_embeddings": 128, "model_type": "mbart", "num_hidden_layers": 12, "pad_token_id": 1, "scale_embedding": true, "transformers_version": "4.46.3", "use_cache": true, "vocab_size": 57532 }
The result is a dictionary containing: score, the probability of the answer; answer, the answer to the question; start, the index within the document of the first word of the answer; end, the index of the last word; and words, the indices within the document of each word in the answer. For example, a score of 0.789 with the answer "distribution" would indicate a 79 percent probability that the memo is about distribution, according to the model. Setting the parameter max_answer_len controls the maximum number of words an answer can have, which helps keep model responses concise.
pic_ = Image.open("./data/image_3.jpg")
pic_
doc_image = "./data/image_3.jpg"
question_text = "what is this image?"
result = doc_que_anw(doc_image, question_text, max_answer_len=25)
result
D:\Learning\MyWebsite\FinalGithub\AlreadyPublihsed\blogs\HuggingFace\env_hugg_face\lib\site-packages\transformers\generation\utils.py:1375: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation. warnings.warn( MBartModel is using MBartSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True` or `layer_head_mask` not None. Falling back to the manual attention implementation, but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.
[{'answer': 'mil'}]
The answer is wrong. The Donut model is fine-tuned for document images, so it struggles with a natural photo like this; for such images, Visual Question Answering is a better fit.
Visual Question Answering (VQA) is similar to Document Question and Answering, but instead of using images of documents, it involves using actual images or videos of objects. For example, consider an image of elephants. Like Document Q&A, the pipeline task requires both an image and a related question, such as "What type of animal is in this picture?". This process is also similar to image classification, but in VQA, instead of simply relying on class labeling, we are asking the model a specific question about the content of the image.
visual_que_anw = pipeline(
    task="visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa"
)
Now let's see whether visual_que_anw can correctly identify the parrot image:
result = visual_que_anw(
    image="./data/image_3.jpg",
    question="what is this image?")
result
[{'score': 0.9652839303016663, 'answer': 'parrot'}, {'score': 0.43784281611442566, 'answer': 'bird'}, {'score': 0.16257372498512268, 'answer': 'parrots'}, {'score': 0.0582440085709095, 'answer': 'birds'}, {'score': 0.025064218789339066, 'answer': 'zoo'}]
Another example:
image_1 = Image.open("./data/image_4.jpg")
image_1
result = visual_que_anw(
    image="./data/image_4.jpg",
    question="what is status of person, happy, frown, sad?")
result
[{'score': 0.61009681224823, 'answer': 'happy'}, {'score': 0.3348560035228729, 'answer': 'happiness'}, {'score': 0.0971321165561676, 'answer': 'smiling'}, {'score': 0.0769975483417511, 'answer': 'smile'}, {'score': 0.038463376462459564, 'answer': 'laughing'}]
Information Retrieval
Information retrieval is widely used across various industries, such as customer support (to find solutions to customer issues), legal compliance (to locate relevant regulatory information), and database searches. These tasks also offer significant benefits for individuals who are visually impaired.
Preprocessing for multi-modal tasks
Q&A tasks are multi-modal, meaning they involve multiple types of data, such as images and text. To ensure accurate performance, each data type should be processed with the appropriate methods, such as tokenizing the text inputs and resizing the images.
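As an illustration, here is a minimal sketch of multi-modal preprocessing using the ViltProcessor that pairs with the VQA model used above; the image path and question are assumptions for the example:
from PIL import Image
from transformers import ViltProcessor

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
image = Image.open("./data/image_4.jpg")
question = "Is the person smiling?"
# The processor tokenizes the question and resizes/normalizes the image in one call
encoding = processor(images=image, text=question, return_tensors="pt")
print(encoding.keys())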
It's time to use Hugging Face with audio data, specifically audio classification and automatic speech recognition.
What is audio data?
Audio data refers to sound waves, which are continuous signals characterized by their length and amplitude. To represent audio digitally, these signals are converted into discrete values, creating a digital representation of the sound wave. This process involves sampling, a critical step that transforms continuous sound waves into discrete data, enabling machine learning algorithms to effectively process and analyze audio.
A higher sampling rate improves the resolution or quality of the digital representation by capturing more frequent samples. Speech models are commonly trained at a standard rate of 16kHz, with the specific sampling rate typically detailed in the model card.
To maintain consistency, all audio observations in our dataset must have the same sampling rate. Resampling ensures uniformity across audio files, which can enhance both consistency and computational efficiency.
Resampling Using Hugging Face
To resample audio files with Hugging Face, start by importing the Audio module from the datasets library, which facilitates extracting audio data from audio files. Suppose you are working with a dataset of songs. Use the cast_column method to modify a specific column in the dataset, in this case to change the sampling rate.
The method requires two parameters: the name of the column to cast, here "audio", and an instance of the Audio class, where you can define the desired sampling_rate parameter (e.g., 16,000 Hz). In most datasets, you can find the sampling rate of an audio file by accessing the relevant metadata: use the index of the file (e.g., 0), followed by the "audio" column, and then "sampling_rate".
#pip install librosa
#pip install soundfile
from datasets import Audio, Dataset
dataset = Dataset.from_dict({"audio": ['./data/sample_file.wav']})
songs = dataset.cast_column("audio", Audio(sampling_rate=16_000))
songs
Dataset({ features: ['audio'], num_rows: 1 })
import pprint
pprint.pprint(songs[0])
{'audio': {'array': array([-1.35140567e-07, 2.25905501e-06, -2.25680424e-06, ..., 6.85696193e-07, -4.86526346e-07, -4.92819936e-08]), 'path': './data/sample_file.wav', 'sampling_rate': 16000}}
Filtering Audio Files by Length
Filtering audio files based on their length is a common preprocessing step that improves dataset consistency and computational efficiency.
To filter a dataset by audio file length, you first need a method to calculate the duration. The librosa library provides the get_duration() function, which computes the duration of an audio file when given its path. For example, the file paths can be accessed from the "path" column in a dataset of songs.
Steps:
1. Calculate duration: use a for loop to iterate through the dataset, calling get_duration() for each file path, and collect the results in a list such as durations.
2. Add duration as a column: use the add_column() method, specifying the column name (e.g., "duration") and passing the list.
3. Filter the dataset: use the .filter() method to filter rows based on the new "duration" column, keeping rows where the duration (d) is less than 10 seconds. Rows meeting this condition are retained, while others are excluded.
This process streamlines the dataset for efficient model training and inference.
import librosa
duration = []
for row in songs["audio"]:
    duration.append(librosa.get_duration(path=row['path']))
songs = songs.add_column("duration", duration)
#songs = songs.filter(
# lambda d: d < 5.0, input_columns=["duration"]
#)
songs
Dataset({ features: ['audio', 'duration'], num_rows: 1 })
What is audio classification?
Now that we’ve covered processing, let’s explore audio classification. This task involves assigning one or more labels to audio clips based on their content. Common applications include tasks like language identification, such as determining the language spoken in an audio clip.
Using Hugging Face pipelines
To set up a pipeline for audio classification, specify "audio-classification" as the task parameter and provide an appropriate model. The models available in the Hugging Face model hub under audio classification are often fine-tuned for specific sub-tasks, such as classifying audio by music genre.
from transformers import pipeline
audio_classifier = pipeline(task="audio-classification",
                            model="superb/wav2vec2-base-superb-ks")
D:\Learning\MyWebsite\FinalGithub\AlreadyPublihsed\blogs\HuggingFace\env_hugg_face\lib\site-packages\transformers\configuration_utils.py:306: UserWarning: Passing `gradient_checkpointing` to a config initialization is deprecated and will be removed in v5 Transformers. Using `model.gradient_checkpointing_enable()` instead, or if you are using the `Trainer` API, pass `gradient_checkpointing=True` in your `TrainingArguments`. warnings.warn(
genreClassifier = pipeline(task="audio-classification", trust_remote_code=True,
                           model="mtg-upf/discogs-maest-30s-pw-73e-ts")
Audio arrays can be accessed using the audio key in each row of the dataset, followed by the array key. To predict the label for an audio file, pass the audio array to the classifier. The output will be a list of dictionaries, each containing a score and a label. The score represents the probability of the label, which, in this case, could indicate a specific type of music.
songs['audio'][0]['array']
array([-1.35140567e-07, 2.25905501e-06, -2.25680424e-06, ..., 6.85696193e-07, -4.86526346e-07, -4.92819936e-08])
audio = songs['audio'][0]['array']
prediction = genreClassifier(audio)
print(prediction)
[{'score': 0.1338157057762146, 'label': 'Electronic---Ambient'}, {'score': 0.041763655841350555, 'label': 'Rock---Math Rock'}, {'score': 0.03641388565301895, 'label': 'Rock---Post Rock'}, {'score': 0.03486892580986023, 'label': 'Rock---Acoustic'}, {'score': 0.026921866461634636, 'label': 'Electronic---Downtempo'}]
Automatic Speech Recognition (ASR) involves transcribing audio recordings of spoken language into text.
Use Cases of ASR:
Digital Assistants: transcribing spoken commands for voice-controlled assistants.
Customer Service: transcribing customer calls for routing, analysis, and record keeping.
Accessibility: generating captions and transcripts for people who are deaf or hard of hearing.
ASR is a versatile technology with a broad range of applications across industries, enhancing efficiency, accessibility, and user experience.
Popular ASR Models on Hugging Face
Wav2Vec by Meta ("facebook/wav2vec2-base-960h")
Whisper by OpenAI ("openai/whisper-base")
The simplest way to create a pipeline for ASR is by setting the task to "automatic-speech-recognition", which defaults to Meta's Wav2Vec model. We'll use this pipeline to transcribe the audio file sample_file.wav. The pipeline can accept input in one of three formats: a string pointing to a local file or public URL, a numpy array representing the audio in its digital form, or a dictionary containing the sampling rate and raw audio data.
transcriber = pipeline(task="automatic-speech-recognition",
                       model="facebook/wav2vec2-base-960h")
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at facebook/wav2vec2-base-960h and are newly initialized: ['wav2vec2.masked_spec_embed'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
#pip install ffmpeg-python
# Path to audio file
transcriber("sample_file.wav")
# Numpy array (e.g., the raw audio array loaded from a dataset)
transcriber(numpy_audio_array)
# Dictionary with the sampling rate and the raw audio array
transcriber({"sampling_rate": 16_000, "raw": numpy_audio_array})
Resampling is needed here to ensure the input matches the sampling rate the model expects. The audio information within a dataset is typically stored in a dictionary named "audio", under the key "array". We pass the audio array into the transcriber to predict the sentence; in this example, the predicted text is "what game do you want to play". We can then compute the word error rate (WER) between the prediction and a reference transcript to evaluate the transcription quality.
# requires: pip install evaluate jiwer
from evaluate import load

sampling_rate = 16_000
dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
audio_input = dataset[0]['audio']['array']
prediction = transcriber(audio_input)['text']

# Compute the WER between the prediction and a reference transcript
# (reference is assumed to hold the ground-truth text for this clip)
wer = load("wer")
wer_score = wer.compute(predictions=[prediction],
                        references=[reference])
print(wer_score)
Fine-tuning optimizes a pre-trained model's performance for a specific task or dataset by making targeted adjustments. A pre-trained model is built using a large dataset from a particular domain and is designed to perform well on general tasks, like text classification. Think of it as tuning a car to perform better on specific terrains or to brake more effectively in wet conditions.
Why Fine-Tune a Pre-Trained Model?
Task Adaptation:
Fine-tuning allows the model to specialize in a specific task or domain, improving its effectiveness for particular use cases.
Reduced Training Time:
Leveraging the knowledge of a pre-trained model significantly reduces the time and computational resources required compared to training a model from scratch on a large dataset.
Steps for Fine-Tuning:
Let's walk through the steps of fine-tuning the "bert-base-cased" model, transitioning it from general English text prediction to specializing in classifying news article topics.
Choosing the Model
The first step in fine-tuning is selecting the model to use. We download the latest version of the chosen model using Hugging Face's Auto classes. For text classification tasks, the AutoModelForSequenceClassification class is ideal. Use the .from_pretrained() method with the model name to load it, and save the instance as model.
from transformers import AutoModelForSequenceClassification
model_name = "bert-base-cased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The next step is preparing the dataset. Tokenizing the input data is crucial to ensure optimal results. This can be accomplished using the AutoTokenizer class from the transformers library. Steps for tokenization:
1. Load the tokenizer associated with the model using the .from_pretrained() method.
2. Apply the tokenizer to the dataset with the .map() method, which applies a specified function to each row of the dataset.
This process converts the text data into the format required by the model for fine-tuning.
# Prepare the dataset
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
import pandas as pd
# load training data set
df_qa = pd.read_csv('./data/train-squad.csv')
from datasets import Dataset
qa_dataset = Dataset.from_pandas(df_qa)
qa_dataset
Dataset({ features: ['Unnamed: 0', 'context', 'question', 'id', 'answer_start', 'text'], num_rows: 86821 })
def preprocess(data):
    return tokenizer(data['question'], data['text'], truncation=True, padding=True, max_length=512)

qa_dataset = qa_dataset.map(preprocess, batched=True)
This explanation covers the process of fine-tuning a model using Hugging Face's Trainer and how to leverage it for your machine learning task. Here's a detailed breakdown:
Trainer Module
Hugging Face provides the Trainer module to simplify and automate the training loop. Fine-tuning a pre-trained model (like BERT or GPT) involves adapting it to a specific task (e.g., news article classification), and the Trainer handles much of the heavy lifting, such as managing the training loop, gradients, and evaluation, rather than requiring you to write that loop by hand.
Training Arguments (TrainingArguments)
Training arguments are defined with the TrainingArguments module. These arguments configure how the model will be trained, including output_dir, which specifies where the results of the fine-tuning will be saved (e.g., trained model weights and logs). In our case, this directory is ./results.
Training (train): the model learns from the training dataset (here called train), where it is exposed to labeled examples that help it understand the relationships between inputs (e.g., news articles) and outputs (e.g., labels like "science").
Evaluation (test): after each training step, the model is evaluated on a test dataset (here called test). This helps determine how well the model generalizes to unseen data and ensures it is not overfitting.
The Trainer therefore expects several inputs: the training dataset (train), which helps the model learn patterns, and the evaluation dataset (test), which checks how well the model generalizes to unseen data.
When you call trainer.train(), the training loop begins and the model is updated based on the training dataset. Once training is complete, you'll have a fine-tuned model, which can be saved with the .save_model() method by specifying a local path (e.g., ./results) so it can later be loaded for inference.
Example Walkthrough
Create a Trainer object and pass in your fine-tuning model, training arguments, training dataset (train), and evaluation dataset (test).
Call trainer.train() to start the training loop. The model learns from the train dataset and is evaluated on the test dataset.
Save the fine-tuned model with trainer.save_model().
Summary
The Trainer class automates the process of fine-tuning models by handling the training loop, evaluation, and saving of results to the specified output directory (output_dir). This simplifies working with pre-trained models, making it easier to apply them to specific tasks without having to manually handle the underlying training mechanics.
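Putting this together, here is a minimal sketch of the workflow described above; the tokenized train and test splits, the hyperparameter values, and the variable names are assumptions for illustration:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",            # where checkpoints and logs are saved
    num_train_epochs=3,                # assumed number of passes over the data
    per_device_train_batch_size=16,    # assumed batch size
)

trainer = Trainer(
    model=model,                       # the AutoModelForSequenceClassification loaded earlier
    args=training_args,
    train_dataset=train,               # assumed tokenized training split
    eval_dataset=test,                 # assumed tokenized evaluation split
)

trainer.train()                        # run the training loop
trainer.save_model("./results")        # save the fine-tuned model for later inference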