A practical guide to multilingual large language model (RoBERTa) classification
This step-by-step tutorial provides an accessible introduction to customizing (fine-tuning) a pre-trained multilingual language model (RoBERTa) for text classification tasks. It demonstrates how to use the model’s existing knowledge to classify text accurately, even with a small set of labeled examples. The input consists of JSON files containing text documents and their corresponding labels for training, validation, and testing. The tutorial uses specialized models for English, German, and French and falls back to XLM-RoBERTa for over 100 additional languages.
Relevant References for Further Reading:
- Unsupervised Cross-lingual Representation Learning at Scale - https://arxiv.org/pdf/1911.02116
- RoBERTa: A Robustly Optimized BERT Pretraining Approach - https://arxiv.org/pdf/1907.11692
- CamemBERT: a Tasty French Language Model - https://arxiv.org/pdf/1911.03894
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models - https://aclanthology.org/2022.naacl-main.293.pdf
- Sharpness-Aware Minimization for Efficiently Improving Generalization - https://arxiv.org/pdf/2010.01412
Learning Objectives
This tutorial has the following learning objectives:
- Learning how to work with large language models (RoBERTa)
- Customizing (fine-tuning) a large language model for a text classification task in any language (100+ languages supported)
- Low-resource learning (with only a few hundred examples) using the SAM optimizer
Target Audience
- Social scientists with a basic prior understanding of large language models who want to learn how to use them
- Social scientists with expertise in large language models who are interested in fine-tuning for multiple languages from only a few examples
- Computer scientists interested in learning how large language models are used for social text classification
- Advanced NLP researchers and professors looking for tutorials that can help their students learn new topics
Prerequisites
Use this tutorial preferably in Google Colab, as the setup relies on the packages pre-installed in the Colab environment.
Environment Setup
Run the cells below:
#!pip install transformers
!wget https://raw.githubusercontent.com/davda54/sam/main/sam.py
--2025-10-10 20:22:33-- https://raw.githubusercontent.com/davda54/sam/main/sam.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2484 (2.4K) [text/plain]
Saving to: ‘sam.py.1’
sam.py.1 100%[===================>] 2.43K --.-KB/s in 0s
2025-10-10 20:22:33 (56.8 MB/s) - ‘sam.py.1’ saved [2484/2484]
import json
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import get_cosine_schedule_with_warmup
from torch.utils.data import DataLoader
from sam import SAM
import shutil
import torch
## Utils
def convert_in_output_size(labels, mapping):
    """
    Returns a multi-hot tensor (one entry per possible label) for multi-label data.
    """
    label_resized = []
    for l in labels:
        # Multi-label targets must be floats for the BCE-based loss that
        # transformers uses for the 'multi_label_classification' problem type.
        tmp_l = torch.tensor([1.0 if k in l else 0.0 for k in mapping])
        label_resized.append(tmp_l)
    label_resized = torch.stack(label_resized, dim=0)
    return label_resized
def convert_labels(labels, mapping):
    """
    Converts label strings into label ids (single-label) or multi-hot vectors (multi-label).
    """
    if isinstance(labels[0], list):
        return convert_in_output_size(labels, mapping)
    return torch.tensor([mapping[l] for l in labels])
def flatten_list(list_to_flatten):
"""
Returns one list from a list of lists.
"""
return [x for xs in list_to_flatten for x in xs]
def infer_output_size(data):
"""
Returns the number of possible labels and the possible labels.
"""
labels = data['Labels']
if isinstance(labels[0], list):
labels = flatten_list(labels)
labels = set(labels)
return len(labels), labels
def generate_dataloader(text, y, batch_size, workers=1, shuffle=True):
    """
    Returns a dataloader that yields batches of (input_ids, attention_mask, label).
    Shuffling should stay on for training and be turned off for evaluation so that
    predictions remain aligned with the original label order.
    """
    attention_mask = text['attention_mask']
    input_ids = text['input_ids']
    dataset = list(zip(input_ids, attention_mask, y))
    dataloader = DataLoader(dataset, shuffle=shuffle, batch_size=batch_size, num_workers=workers)
    return dataloader
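The next cell is optional; it demonstrates the helper functions above on a tiny, made-up example (the toy texts, labels, and mapping are illustrative only) so you can see what they return:
# Optional sanity check of the helpers, using made-up toy data.
toy_data = {'Text': ['I like apples.', 'I dislike bananas.'],
            'Labels': ['positive', 'negative']}
n_labels, label_set = infer_output_size(toy_data)
print(n_labels, label_set)                               # 2 {'positive', 'negative'} (set order may vary)
toy_mapping = {'negative': 0, 'positive': 1}
print(convert_labels(toy_data['Labels'], toy_mapping))   # tensor([1, 0])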
Tutorial Content
1) Introduction
We will start with the most pressing questions:
- What exactly is text classification?
- What is a pre-trained language model?
- What even is fine-tuning?
We will answer all questions in the following text.
In text classification we try to assign a property to a text.
For example, suppose we are interested in classifying texts that are about fruits. We could easily find a dictionary of all fruits (e.g., ‘Apple’, ‘Banana’, ‘Pear’), and every time we recognize such a word in a text we would conclude that the text is about fruits, right? However, this is not true all the time: “Apple designed the new pencil pro.” is not about the fruit ‘Apple’, although our dictionary approach would label it as such. Furthermore, a dictionary only works for the language it was written in, while this tutorial aims to be helpful for 100+ languages. So the context of the word matters (more on this later). Classification is obviously transferable to more than just fruits: people classify the sentiment of a text, the stance towards an entity expressed in a text, the topic of a text, the emotion expressed in a text, and much more.
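To make the limitation concrete, here is a minimal sketch of the dictionary approach described above (the word list and the helper function are illustrative and not part of the tutorial's pipeline):
# A naive keyword ("dictionary") classifier: it flags a text as fruit-related
# whenever any fruit word occurs, ignoring the context of the word.
fruit_dictionary = {'apple', 'banana', 'pear'}
def is_about_fruit(text):
    words = {w.strip('.,!?').lower() for w in text.split()}
    return len(words & fruit_dictionary) > 0
print(is_about_fruit("Yesterday I ate an apple."))           # True (correct)
print(is_about_fruit("Apple designed the new pencil pro."))  # True (wrong: the company, not the fruit)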
2) Data Preparation
Let’s talk data: in order to make this script work, you have to save three dictionaries with the following structure in the files train.json, val.json, and test.json:
{'Text': [list of texts],
 'Labels': [list of labels]}
Each text document in the data should have a corresponding label such that:
length([list of texts]) == length([list of labels])
Example:
{'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
 'Labels': ['about_fruit', 'not_about_fruit']}
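If your texts and labels are not in this format yet, a minimal sketch like the following (with made-up example data) shows how to write such a dictionary to one of the expected JSON files:
import json
# Hypothetical example data; replace the texts and labels with your own.
my_train = {'Text': ['Yesterday i ate an apple.', 'Yesterday I crashed my Apple.'],
            'Labels': ['about_fruit', 'not_about_fruit']}
with open('./train.json', 'w') as f:  # repeat analogously for './val.json' and './test.json'
    json.dump(my_train, f)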
The train data is used to teach the model, the val data is used to check whether the model learns from the train data correctly, and the test data is used to prove the capabilities of the final version of the model on unseen examples.
You can classify text into one category (single label classification) or several categories per text (multilabel classification).
Is your data ready? Then let's start.
try:
with open("./train.json") as f:
train = json.load(f)
with open("./val.json") as f:
val = json.load(f)
with open("./test.json") as f:
test = json.load(f)
except FileNotFoundError:
    # Dummy values used when no data files are provided
    train = {'Text': ["An Apple is a Fruit!", "An Apple is not a Fruit!", "An Apple has no seeds."]*64,
             'Labels': ['is_correct', 'is_incorrect', 'is_incorrect']*64}
    val = {'Text': ["An Apple is not a Fruit!", "An Apple is a Fruit!", "An Apple has no seeds."]*64,
           'Labels': ['is_incorrect', 'is_correct', 'is_incorrect']*64}
    test = {'Text': ["An Apple is not a Fruit!", "An Apple has no seeds.", "An Apple is a Fruit."]*64,
            'Labels': ['is_incorrect', 'is_incorrect', 'is_correct']*64}
assert len(train['Text']) == len(train['Labels']), "Number of texts does not match number of labels for train data!"
assert len(val['Text']) == len(val['Labels']), "Number of texts does not match number of labels for val data!"
assert len(test['Text']) == len(test['Labels']), "Number of texts does not match number of labels for test data!"
print(f"We loaded the train data with {len(train['Text'])} texts and {len(train['Labels'])} labels,")
print(f"the validation data with {len(val['Text'])} texts and {len(val['Labels'])} labels")
print(f"and the test data with {len(test['Text'])} texts and {len(test['Labels'])} labels.")
We loaded the train data with 192 texts and 192 labels,
the validation data with 192 texts and 192 labels
and the test data with 192 texts and 192 labels.
Great, the data is ready!
Understanding the Data
You have to answer some questions about your data.
Finding a Language-Specific Language Model
In which language is your text data written?
language = 'english' # language to use e.g., 'english', 'german', 'french'
print(f"MMhhh interesting your data is written in {language}. Let's load a fitting PLM!")
if language == 'english':
model_name = "roberta-base"
elif language == 'german':
model_name = "benjamin/roberta-base-wechsel-german"
elif language == 'french':
model_name = "camembert-base"
else:
print(f"Seems like we have no model available for {language}.")
print("We will load a multilingual language model. It knows text from 100 languages.")
model_name = 'xlm-roberta-base'
print(f"We loaded {model_name} for {language}.")
MMhhh interesting your data is written in english. Let's load a fitting PLM!
We loaded roberta-base for english.
Ok, now that we have talked about the language of your data, you might be interested in what the model_name stands for.
These are pre-trained language models ready to be used with your specific language. These language models have already learned to understand language by solving a huge cloze text, i.e., by filling in masked words in text written in the particular language.
This cloze text is constructed from Wikipedia or other huge text datasets.
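You can watch this cloze-test ability (masked language modeling) in action with the Hugging Face fill-mask pipeline; the following optional sketch (with a made-up example sentence) asks roberta-base to fill in a blank:
from transformers import pipeline
# Ask the pre-trained model to fill in the blank; RoBERTa uses "<mask>" as its blank token.
fill_mask = pipeline("fill-mask", model="roberta-base")
for prediction in fill_mask("An apple is a delicious <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))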
Now that we understand what text classification and pre-trained language models are, we can talk about the last question: What is fine-tuning?
As you might imagine, solving a cloze text over large parts of the internet makes you knowledgeable, but not an expert in a field. We now want to transform our pre-trained language model into an expert for your task.
3) Defining the Classification Task
Choosing vs. Deciding
Before starting, it’s essential to clarify what type of classification task you want to perform. We distinguish between two main tasks:
1. Choosing (Single-Label Classification)
- Example: What is your favorite fruit?
- You select one correct label from a list of possible options.
2. Deciding (Multi-Label Classification)
- Example: Do you like apples?
- You evaluate each label independently and decide whether it applies.
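The two task types correspond to two different formats of the 'Labels' field in your JSON files, as the following illustrative sketch (with made-up labels) shows:
# 'choosing' (single-label): exactly one label string per text
single_label_data = {'Text': ['I love pears.', 'Bananas are awful.'],
                     'Labels': ['positive', 'negative']}
# 'deciding' (multi-label): a list of labels per text, possibly empty or with several entries
multi_label_data = {'Text': ['I love pears and apples.', 'I only like bananas.'],
                    'Labels': [['pear', 'apple'], ['banana']]}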
In our dataset, we have single labels, so we will insert 'choosing'
as the classification type. Here’s the code for defining your task:
# Prompt the user to select the classification type
decision_type = 'choosing' # classify by 'choosing' or 'deciding'?
# Validate the input
assert decision_type in ['choosing', 'deciding'], "Invalid input! Please enter 'choosing' or 'deciding'."
# Ensure the labels align with the selected classification type
if decision_type == 'deciding':
assert isinstance(train['Labels'][0], list), (
"For 'deciding', labels should be a list (e.g., ['apple', 'banana'])."
)
else:
assert not isinstance(train['Labels'][0], list), (
"For 'choosing', each label should be a single value (e.g., 'apple')."
)
Choosing the Loss and Decision Functions
Based on your decision_type, we will now choose the correct loss function and the decision function. The loss function tells the model how well it achieved your task. The decision function tells us how to convert the model's raw outputs (logits) into actual decisions.
losses = {'deciding': "multi_label_classification",    # transformers uses BCEWithLogitsLoss for this problem type
          'choosing': "single_label_classification"}   # transformers uses CrossEntropyLoss for this problem type
decisions = {'deciding': lambda x: torch.where(x > 0, 1, 0),
'choosing': lambda x: torch.argmax(x, dim=1)}
objective = losses[decision_type]
decision_function = decisions[decision_type]
Fixing the Number of Possible Answers:
The next question we need to clarify is: How many different labels are possible for your task?
Example:
Choose your favorite fruit from this list:
poss_labels = ['Banana', 'Apple', 'Pear', 'Peach']
model_output = [0.4, 0.5, -0.1, 0.7]
Or decide for each fruit whether you like it:
poss_labels = ['Banana', 'Apple', 'Pear', 'Peach']
model_output = [0.4, 0.5, -0.1, 0.7]
In both cases we have 4 possible labels. In the first case we choose the fruit where the model signals the biggest agreement (‘Peach’); in the second case we decide for each fruit whether we like it by accepting everything above 0. Therefore, our example output tells us that we like ‘Banana’, ‘Apple’, and ‘Peach’.
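To make this concrete, here is a small optional sketch that applies the two decision functions defined above to the example model output (the fruit labels and logits are the illustrative numbers from the text):
# Illustrative logits for the four fruits from the example above (one row = one text).
poss_labels = ['Banana', 'Apple', 'Pear', 'Peach']
model_output = torch.tensor([[0.4, 0.5, -0.1, 0.7]])
# 'choosing': pick the single label with the highest score -> 'Peach'
chosen = decisions['choosing'](model_output)
print([poss_labels[i] for i in chosen])                         # ['Peach']
# 'deciding': accept every label with a score above 0 -> 'Banana', 'Apple', 'Peach'
decided = decisions['deciding'](model_output)
print([l for l, keep in zip(poss_labels, decided[0]) if keep])  # ['Banana', 'Apple', 'Peach']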
Let’s find out how many labels are possible in our data:
# Define the number of possible labels
output_size = "2"  # Number of labels
# Ensure the value is a valid natural number before converting it
assert output_size.isdigit(), "Output_size needs to be a natural number."
output_size = int(output_size)
# Infer the number of labels and the possible labels from the datasets
inferred_output_size_train, possible_labels_train = infer_output_size(train)
inferred_output_size_val, possible_labels_val = infer_output_size(val)
inferred_output_size_test, possible_labels_test = infer_output_size(test)
# Determine the maximum number of inferred labels
inferred_output_size = max(inferred_output_size_train, inferred_output_size_val, inferred_output_size_test)
# Warn the user if the datasets have inconsistent labels
if not (possible_labels_train == possible_labels_val == possible_labels_test):
    print(
        f"Warning: Train contains the labels {possible_labels_train}, "
        f"Val contains the labels {possible_labels_val}, and "
        f"Test contains the labels {possible_labels_test}. "
        "This inconsistency is not recommended. All datasets should ideally contain the same labels."
    )
# Ensure the output size matches the inferred number of labels
assert output_size == inferred_output_size, (
    f"We inferred {inferred_output_size} labels with the following possible labels: {possible_labels_train}."
)
assert possible_labels_train == possible_labels_val == possible_labels_test, (
"Make sure that train, val, and test labels are equal!"
)
# Create mapping dictionaries for labels and IDs
id2label = {i: k for i, k in enumerate(possible_labels_train)}
label2id = {k: i for i, k in enumerate(possible_labels_train)}
print(f"Your task distinguishes {output_size} different labels. These are: {possible_labels_train}")
Your task distinguishes 2 different labels. These are: {'is_incorrect', 'is_correct'}
4) Setting up the Model
Fantastic!
We are close. We clarified the language, the objective and number of possible answers.
Now, let’s load the necessary model and tokenizer.
The tokenizer translates the language into a model-specific vocabulary that the model can process efficiently.
model_config = {'pretrained_model_name_or_path': model_name,
'num_labels': output_size,
'problem_type': objective,
'id2label': id2label,
'label2id': label2id}
model = AutoModelForSequenceClassification.from_pretrained(**model_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
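The warning above simply tells us that the classification head on top of RoBERTa is newly initialized; this is exactly the part that fine-tuning will train. To get a feeling for what the tokenizer does, you can run a short optional sketch like the following on one of the dummy sentences:
# Optional: look at how one sentence is translated into the model vocabulary.
encoded = tokenizer("An Apple is a Fruit!", return_tensors='pt')
print(encoded['input_ids'])                                               # ids in the model vocabulary
print(encoded['attention_mask'])                                          # 1 = real token, 0 = padding
print(tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist()))  # the subword tokens themselves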
Training Specific Settings
Now we need to specify some training-specific parameters.
Don’t worry if you’re unsure about what to change—the preset values should work just fine for most cases.
A Quick Intuition on Training
When training a model, you provide it with some example data points (in your case, the training data). From this data, the model learns helpful patterns that explain the correlation between input and output.
To make the most of the data:
- We feed small portions (batches) to the model at a time, controlled by the batch_size.
- The model uses these batches to infer patterns but commits to those patterns cautiously, guided by the learning rate (lr).
- To ensure the model retains knowledge from pre-training, we use a warm-up rate, which helps the model transition smoothly without forgetting its pre-trained knowledge.
Key Parameters
Let’s go through each parameter one by one:
- batch_size: Determines how many examples we show to the model before deducing rules to improve classification.
- learning_rate (lr): Controls how strongly the model commits to patterns it recognizes within each batch.
- num_epochs: Specifies how many times the model sees all the training data (e.g., 3 times).
- warm_up_rate: Indicates the portion of training during which the model makes smaller adjustments.
- device: Refers to the hardware used for training. If you have a GPU, your training will be much faster.
batch_size = 64
lr = 1e-4
num_epochs = 1
warm_up_rate = 0.1
num_training_steps = (len(train['Labels'])//batch_size)*num_epochs
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
For now we only need the train set to teach the model and the val set to decide whether we taught our model well. Let's first translate our texts into the model's vocabulary.
# Padding and tokenizing the text, i.e., translating it into the model vocabulary
train_text = tokenizer([m for m in train['Text']], truncation=True, padding='longest', return_tensors='pt')
val_text = tokenizer([m for m in val['Text']], truncation=True, padding='longest', return_tensors='pt')
test_text = tokenizer([m for m in test['Text']], truncation=True, padding='longest', return_tensors='pt')
# Using the label2id mapping to convert the label strings into label ids
train_y = convert_labels(train['Labels'], label2id)
val_y = convert_labels(val['Labels'], label2id)
test_y = convert_labels(test['Labels'], label2id)
# Retrieve Dataloaders for fast iteration over the data
train_dataloader = generate_dataloader(train_text, train_y, batch_size)
# Keep the original order for evaluation so predictions stay aligned with the labels
val_dataloader = generate_dataloader(val_text, val_y, batch_size, shuffle=False)
test_dataloader = generate_dataloader(test_text, test_y, batch_size, shuffle=False)
The learning of patterns and the adaptation of the model are handled by the optimizer. In our case it is a special optimizer, SAM (Sharpness-Aware Minimization), which helps the model generalize from few examples by preferring flat minima of the loss instead of sharp ones. If you are really interested, you can read more about it in the SAM paper linked in the references above. The scheduler adapts the learning rate according to the warm_up_rate.
# Initialize optimizer and scheduler
optimizer = SAM(model.parameters(), torch.optim.Adam, lr=lr, adaptive=True)
scheduler = get_cosine_schedule_with_warmup(optimizer = optimizer,
num_warmup_steps = num_training_steps*warm_up_rate,
num_training_steps = num_training_steps,
last_epoch = -1)
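If you are curious what the warm-up followed by the cosine decay looks like, the following optional sketch builds a throwaway scheduler on a dummy parameter (so the state of the real optimizer stays untouched) and prints the learning rate at every step; for the dummy data (three training steps) it should print the same values that appear as learning_rate in the training log below:
# Optional: inspect the learning-rate schedule without touching the real optimizer.
dummy_param = torch.nn.Parameter(torch.zeros(1))
dummy_optimizer = torch.optim.Adam([dummy_param], lr=lr)
dummy_scheduler = get_cosine_schedule_with_warmup(dummy_optimizer,
                                                  num_warmup_steps = num_training_steps*warm_up_rate,
                                                  num_training_steps = num_training_steps)
for step in range(num_training_steps):
    dummy_optimizer.step()
    dummy_scheduler.step()
    print(f"Step {step}: learning rate {dummy_scheduler.get_last_lr()[0]:.2e}")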
5) Fine-tuning (Training)
Let’s start the training process!
During fine-tuning, we show the training data to the model and adjust its parameters to optimize performance for the task.
Here’s what happens:
- The model learns patterns in the data to perform the classification task.
- After each epoch (a complete pass through the training dataset), we test the model on the validation set to monitor progress.
- The best-performing model is saved during the training process.
Once training is complete, you will have a well-trained model ready for use.
# Memory and system check before training
import psutil
import gc
# Training loop with memory-management safeguards
import os
# Set tokenizers parallelism to avoid fork warnings
os.environ["TOKENIZERS_PARALLELISM"] = "false"
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
# Check system memory
memory = psutil.virtual_memory()
print(f"System Memory: {memory.total / 1024**3:.2f} GB total, {memory.available / 1024**3:.2f} GB available")
# Check GPU memory if available
if torch.cuda.is_available():
gpu_memory = torch.cuda.get_device_properties(0).total_memory
gpu_memory_allocated = torch.cuda.memory_allocated(0)
gpu_memory_reserved = torch.cuda.memory_reserved(0)
print(f"GPU Memory: {gpu_memory / 1024**3:.2f} GB total")
print(f"GPU Memory Allocated: {gpu_memory_allocated / 1024**3:.2f} GB")
print(f"GPU Memory Reserved: {gpu_memory_reserved / 1024**3:.2f} GB")
# Clear any existing GPU cache
torch.cuda.empty_cache()
gc.collect()
print(f"Device: {device}")
else:
print("No GPU available - using CPU")
# Check model size
model_size = sum(p.numel() for p in model.parameters())
print(f"Model has {model_size:,} parameters")
# Recommend batch size based on available memory
if torch.cuda.is_available():
available_gpu_memory = gpu_memory - gpu_memory_reserved
if available_gpu_memory < 4 * 1024**3: # Less than 4GB available
recommended_batch_size = 8
elif available_gpu_memory < 8 * 1024**3: # Less than 8GB available
recommended_batch_size = 16
else:
recommended_batch_size = 32
print(f"Recommended batch size: {recommended_batch_size}")
if batch_size > recommended_batch_size:
print(f"Warning: Current batch size ({batch_size}) may be too large. Consider reducing to {recommended_batch_size}")
System Memory: 7.76 GB total, 5.06 GB available
No GPU available - using CPU
Model has 124,647,170 parameters
best_loss = float('inf')
best_epoch = 0
already_trained = 0
best_model_path = ''
should_delete = True
# Move model to device
model.to(device)
for epoch in range(num_epochs): # Repeat num_epochs times
model.train()
for batch_idx, batch in enumerate(train_dataloader): # Train the model on the batch
try:
input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
# First forward pass
output = model(input_ids, attention_mask, labels=y)
loss = output.loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# SAM optimizer first step
optimizer.first_step(zero_grad=True)
# Second forward pass (required by SAM)
output2 = model(input_ids, attention_mask, labels=y)
loss2 = output2.loss
loss2.backward()
# SAM optimizer second step
optimizer.second_step(zero_grad=True)
# Update learning rate AFTER optimizer steps
scheduler.step()
print(f"Train: Epoch {epoch}, Train step {already_trained+batch_idx}, Loss {loss.item():.4f}, learning_rate {scheduler.get_last_lr()[0]:.2e}", flush=True)
# Clear cache periodically to prevent memory buildup
if batch_idx % 5 == 0 and torch.cuda.is_available():
torch.cuda.empty_cache()
except RuntimeError as e:
if "out of memory" in str(e).lower():
print(f"OOM Error at batch {batch_idx}. Trying to recover...")
torch.cuda.empty_cache()
gc.collect()
# Try with smaller effective batch size
if batch_size > 8:
batch_size = batch_size // 2
print(f"Reducing batch size to {batch_size}")
                    train_dataloader = generate_dataloader(train_text, train_y, batch_size)
                    val_dataloader = generate_dataloader(val_text, val_y, batch_size, shuffle=False)
break
else:
raise e
else:
raise e
already_trained += batch_idx
# Validation phase
model.eval()
val_loss = []
with torch.no_grad():
for batch_idx, batch in enumerate(val_dataloader): # Validate the current state of the model on the validation data
input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
val_output = model(input_ids, attention_mask, labels=y)
val_loss.append(val_output.loss)
val_loss = torch.mean(torch.stack(val_loss))
print(f"Validation: Epoch {epoch}, Train step {already_trained}, Loss {val_loss.item():.4f}, old best/epoch {str(best_loss)[1:6]}/{best_epoch}", flush=True)
if val_loss < best_loss: # Save the model if the val_loss is the best loss we have seen so far
best_loss = val_loss.item()
best_epoch = epoch
if should_delete and best_model_path and os.path.exists(best_model_path):
shutil.rmtree(best_model_path)
best_model_path = f"./my_model_epoch_{best_epoch}_val_loss_{str(val_loss.item())[1:6]}"
        model.save_pretrained(best_model_path)
print(f"**** END EPOCH {epoch} ****")
# Clean up memory after each epoch
if torch.cuda.is_available():
torch.cuda.empty_cache()
gc.collect()
print(f"**** FINISHED TRAINING FOR N={num_epochs} ****")
print(f"BEST EPOCH: {best_epoch}")
print(f"BEST LOSS: {best_loss}")
Train: Epoch 0, Train step 0, Loss 0.7092, learning_rate 8.43e-05
/srv/conda/envs/notebook/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:192: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn(
Train: Epoch 0, Train step 1, Loss 0.7057, learning_rate 3.02e-05
Train: Epoch 0, Train step 2, Loss 0.6496, learning_rate 0.00e+00
Validation: Epoch 0, Train step 2, Loss 0.5889, old best/epoch nf/0
**** END EPOCH 0 ****
**** FINISHED TRAINING FOR N=1 ****
BEST EPOCH: 0
BEST LOSS: 0.5889086127281189
The training is finished; now we can load the best model.
best_model = AutoModelForSequenceClassification.from_pretrained(best_model_path)
6) Evaluation
Finally, with the loaded model we can now predict results for the unseen test set to understand the model's performance in more detail.
best_model.to(device)
best_model.eval()
y_pred = []
with torch.no_grad():
    for batch_idx, batch in enumerate(test_dataloader):
        input_ids, attention_mask, y = batch[0].to(device), batch[1].to(device), batch[2].to(device)
        # Use the best checkpoint we just loaded for the final predictions
        y_pred.append(best_model(input_ids, attention_mask, labels=y).logits)
y_pred = torch.cat(y_pred, dim=0)
y_pred = decision_function(y_pred)
from sklearn.metrics import classification_report
print(classification_report(test_y, y_pred, target_names=label2id.keys(), zero_division=True))
              precision    recall  f1-score   support

is_incorrect       0.67      1.00      0.80       128
  is_correct       1.00      0.00      0.00        64

    accuracy                           0.67       192
   macro avg       0.83      0.50      0.40       192
weighted avg       0.78      0.67      0.53       192
Results
The classification report shows us four metrics: the precision, the recall, the f1-score, and the accuracy. Additionally, the report displays two different average aggregations: the macro avg and the weighted avg.
The precision tells us “When we predict a label, is it the correct label?”.
The recall tells us “How many instances of a class do we find?”.
The f1-score is the harmonic mean of the precision and the recall.
The accuracy tells us “How many of our predictions are correct?”.
The macro avg averages the f1-score over the classes; it tells us “How well do we classify if all classes occur equally often?”.
The weighted avg averages the f1-score weighted by class size; it tells us “How well do we classify the complete label set?”.
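As a sanity check, you can recompute these numbers by hand from the counts implied by the report (the report above shows that the model predicted ‘is_incorrect’ for all 192 test texts, which is the assumption behind the counts below):
# All 192 predictions are 'is_incorrect'; 128 test texts truly are 'is_incorrect', 64 are 'is_correct'.
tp = 128   # 'is_incorrect' texts correctly predicted as 'is_incorrect'
fp = 64    # 'is_correct' texts wrongly predicted as 'is_incorrect'
fn = 0     # 'is_incorrect' texts we missed
precision = tp / (tp + fp)                          # 0.67: "When we predict 'is_incorrect', is it correct?"
recall = tp / (tp + fn)                             # 1.00: "How many 'is_incorrect' texts do we find?"
f1 = 2 * precision * recall / (precision + recall)  # 0.80: harmonic mean of precision and recall
accuracy = tp / 192                                 # 0.67: fraction of all predictions that are correct
print(precision, recall, f1, accuracy)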
Analysis
We can see that we have two classes, ‘is_correct’ and ‘is_incorrect’. In the test set, 64 instances belong to the class ‘is_correct’ and 128 to the class ‘is_incorrect’. Our model does not learn to predict the class ‘is_correct’ at all, but we find every instance of the class ‘is_incorrect’; you can see this from the fact that its recall is 1.00. Its precision is 0.67, as one third of the instances we predict to be ‘is_incorrect’ are actually instances of the class ‘is_correct’.
Contact Details
For questions or feedback, contact Stephan Linzbach via Stephan.Linzbach@gesis.org.