Building an End-to-End Sentiment Analysis Pipeline with Scikit-LLM
Machine Learning Mastery

Introduction

Traditional machine learning pipelines for predictive tasks like text classification usually rely on extracting structured, numerical features from raw text - for instance, TF-IDF frequencies or token embeddings - to feed into classical models such as logistic regression, ensembles, or support vector machines.

With the rise of large language models (LLMs), the rules of the game have somewhat changed: it is now possible to leverage zero-shot or few-shot reasoning on existing, pre-trained models for language tasks as part of a machine learning framework.

Scikit-LLM is a Python library that makes this practical: it bridges the gap between classical machine learning and modern LLM API calls by exposing LLMs through scikit-learn-style estimators. In this article, we will use Scikit-LLM alongside Groq backend models to build an end-to-end pipeline for sentiment analysis (a domain-specific form of text classification), achieving reasonably fast inference with open-source models. From preprocessing to inference, we will work with a large, realistically sized dataset: the IMDB movie reviews dataset.

Prerequisites, Setup, and Obtaining the Dataset

To make the code shown in this tutorial work, you'll need to have installed the Scikit-LLM library:

pip install scikit-llm

Once installed, the first step is to configure the API credentials. In other words, we need to "connect" Scikit-LLM to an endpoint, namely an LLM API provider such as Groq. Register on Groq and generate an API key at https://console.groq.com/keys, then copy and paste it into the code below:

from skllm.config import SKLLMConfig

# 1. Point Scikit-LLM to Groq's OpenAI-compatible endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-GOES-HERE")

By default, Scikit-LLM sends its requests to OpenAI's API. The set_gpt_url function overrides that base URL, so here the library's internal requests go to Groq's OpenAI-compatible endpoint instead: https://api.groq.com/openai/v1.
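
Pasting the key directly into the script makes it easy to leak through version control. One possible alternative, sketched below, reads it from an environment variable. The helper name load_api_key and the variable name GROQ_API_KEY are assumptions made for this illustration, not something Scikit-LLM requires.

```python
import os


def load_api_key(env_var: str = "GROQ_API_KEY") -> str:
    """Return the API key stored in `env_var`, or fail with a clear message.

    Both the helper and the default variable name are hypothetical choices
    for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key


# Usage (requires scikit-llm; left commented so the helper runs on its own):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")
# SKLLMConfig.set_openai_key(load_api_key())
```

With this in place, the key stays in your shell or deployment environment rather than in the source file.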

The next stage of the process is importing the IMDB Movie Reviews dataset, which has about 50K instances, and preparing it for the sentiment analysis pipeline we will build. Each instance is a text review labeled with a sentiment, either positive or negative. This makes it a binary classification problem, which could also be solved with classical models like logistic regression. For convenience, we read a CSV version of the dataset from a publicly available GitHub repository:

import pandas as pd
from sklearn.model_selection import train_test_split

# Fetching a large, realistic-sized dataset (IMDB Movie Reviews - 50,000 rows)
# We will read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
print("Downloading dataset...")
df = pd.read_csv(url)
print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free-tier API, sending 50,000 requests
# will likely trigger quota limits. Thus, we will use 500 rows for demonstrating our pipeline execution.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: that's perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]

# Labels are 'positive' or 'negative'
# Splitting into training (for initializing zero-shot labels) and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
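
The split above is purely random, so the class ratio in a small test set can drift from the ratio in the sample. As an optional variation (not part of the original pipeline), train_test_split accepts a stratify argument that preserves the label proportions in both splits. The sketch below uses made-up toy data rather than the IMDB reviews:

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in for the sampled reviews: 30 negative and 20 positive labels
reviews = [f"review {i}" for i in range(50)]
labels = ["negative"] * 30 + ["positive"] * 20

# stratify=labels keeps the 60/40 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    reviews, labels, test_size=0.2, random_state=42, stratify=labels
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

In the tutorial's code, the equivalent change would be passing stratify=y to the existing train_test_split call.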

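Note that we sampled only 500 rows (400 for fitting and 100 for testing) for demonstration purposes: every review classified at inference time requires a call to the API, so running on the full dataset would take a long time and would likely exceed free-tier quotas. You can freely change the sample size, n=500, to suit your own needs.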
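
If you do increase the sample size, one way to stay within a request quota is to run predictions in chunks with a pause in between. The helper below is a hypothetical sketch: predict_in_batches, batch_size, and pause_seconds are names invented here, and suitable values depend on the limits of your own API plan, which this article does not specify.

```python
import time
from typing import Callable, List, Sequence


def predict_in_batches(
    predict_fn: Callable[[Sequence[str]], Sequence[str]],
    texts: Sequence[str],
    batch_size: int = 20,
    pause_seconds: float = 0.0,
) -> List[str]:
    """Call predict_fn on consecutive chunks of texts, pausing between chunks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        results.extend(predict_fn(batch))
        # Sleep only when more batches remain
        if pause_seconds > 0 and start + batch_size < len(texts):
            time.sleep(pause_seconds)
    return results
```

Any fitted scikit-learn style estimator could be plugged in, for example predict_in_batches(model.predict, list(texts), batch_size=20, pause_seconds=5), where model is whatever fitted object you want to throttle.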

Building the Sentiment Analysis Pipeline

Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model setup or training, inference, and evaluation.

For a predictive, text-based scenario like ours, preprocessing typically entails cleaning and normalizing the text. Scikit-learn provides an elegant class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:

from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and stripping whitespace."""
    series = pd.Series(texts).astype(str)
    # Remove HTML tags (anything of the form <...>)
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    # Remove extra spaces
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()

# Wrapping the cleaning function to enable its use inside a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)
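
To see what the cleaner does before wiring it into a pipeline, it can be applied to a couple of strings directly. The snippet below repeats the function so it runs on its own; the two sample strings are made up for illustration and are not taken from the IMDB dataset.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def clean_text_data(texts):
    """Same cleaning logic as above: drop HTML tags, collapse whitespace."""
    series = pd.Series(texts).astype(str)
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()


text_cleaner = FunctionTransformer(clean_text_data)

# Made-up sample strings containing tags, newlines, and repeated spaces
samples = [
    "  Great movie!<br /><br />Loved it.  ",
    "Too   long,\n<i>way</i> too slow.",
]
cleaned_samples = text_cleaner.transform(samples)
print(cleaned_samples)
```

Each tag is replaced by a space rather than removed outright, so words on either side of a line break tag do not get glued together.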

Next, we combine this preprocessing object with a model instance to create the Pipeline. Once defined, the pipeline orchestrates the whole process: it prepares the data and passes it to the model at both the "training" and inference stages. We put "training" in quotes because no weights are updated: we are using a pre-trained model served by Groq for zero-shot classification, so fitting the model only tells it which classification labels to use.

from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # The "custom_url::" prefix sends requests to the URL set via set_gpt_url;
    # llama-3.1-8b-instant is a Llama 3.1 8B model served by Groq
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For Zero-Shot classification, fit() doesn't train the LLM.
# It simply registers the unique labels present in 'y_train' (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)
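
Because every call to the real classifier costs API requests, it can help to check the pipeline wiring offline first. The sketch below swaps in a hypothetical stand-in, KeywordStubClassifier, which is not part of Scikit-LLM and uses a crude keyword rule instead of an LLM. It only illustrates the idea that fit() records the candidate labels while predict() does the actual work.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


class KeywordStubClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical offline stand-in for an LLM classifier (illustration only)."""

    def fit(self, X, y):
        # As with zero-shot classification, fit() only records the labels
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        # Crude keyword rule in place of an LLM call
        return ["negative" if "bad" in text.lower() else "positive" for text in X]


def strip_texts(texts):
    """Minimal cleaning step used only for this smoke test."""
    return [text.strip() for text in texts]


stub_pipeline = Pipeline([
    ("cleaner", FunctionTransformer(strip_texts)),
    ("classifier", KeywordStubClassifier()),
])

stub_pipeline.fit(["good film", "bad film"], ["positive", "negative"])
stub_predictions = stub_pipeline.predict(["  A bad sequel ", "Lovely cast"])
print(stub_pipeline.named_steps["classifier"].classes_, stub_predictions)
```

Once the steps run end to end with the stub, replacing it with the LLM-backed classifier changes only the final pipeline step.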

Once we have run the pipeline to "fit" the model, we use it once more for inference. Both steps use familiar scikit-learn syntax. Besides evaluating the model pipeline's performance, we also display a few example predictions:

from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")

# Run predictions through the pipeline
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on the realistic data
print("\n--- Classification Report ---")
print(classification_report(y_test, predictions))

# Display a few side-by-side examples
print("\n--- Sample Predictions ---")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate review for display purposes
    short_review = review[:100] + "..."
    print(f"Review: {short_review}")
    print(f"Actual: {actual} | Predicted: {predicted}\n")

Here's the detailed output - execution of the above code may take a few minutes to complete:

--- Classification Report ---
              precision    recall  f1-score   support

    negative       0.95      0.97      0.96        60
    positive       0.95      0.93      0.94        40

    accuracy                           0.95       100
   macro avg       0.95      0.95      0.95       100
weighted avg       0.95      0.95      0.95       100

--- Sample Predictions ---
Review: I saw mommy...well, she wasn't exactly kissing Santa Clause; he has his hand on her thigh and wicked...
Actual: negative | Predicted: negative

Review: This entry is certainly interesting for series fans (like myself), but yet it is mostly incomprehens...
Actual: negative | Predicted: negative

Review: Ingrid Bergman (Cleo Dulaine) has never been so beautiful. Gary Cooper as "Cleent" so perfectly cast...
Actual: positive | Predicted: positive

With 95% accuracy on the 100 held-out reviews, and without any task-specific training, our pipeline is doing a solid job at classifying sentiment in reviews. Well done!
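
As a quick sanity check on how to read the report, the F1 column is the harmonic mean of precision and recall, and the weighted average weights each class by its support. The sketch below recomputes both from the rounded figures printed above, so small rounding differences from the true values are possible.

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values copied from the classification report
f1_negative = f1_score_from(0.95, 0.97)
f1_positive = f1_score_from(0.95, 0.93)

# Support-weighted average over the 60 negative and 40 positive test reviews
weighted_f1 = (60 * f1_negative + 40 * f1_positive) / 100

print(round(f1_negative, 2), round(f1_positive, 2), round(weighted_f1, 2))
```

The recomputed values agree with the f1-score column and the weighted avg row of the report.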

Wrapping Up

This article walked you through defining an end-to-end pipeline for sentiment classification using Scikit-LLM and freely available, pre-trained LLMs from API endpoints like Groq. This is a versatile approach to using classic scikit-learn syntax in novel, LLM-driven machine learning applications.
