Machine Learning Mastery 3h ago

Multimodal Browser AI with Transformers.js for Images and Speech

In this article, you will learn how to build three multimodal AI capabilities (image classification, image captioning, and speech transcription) that run entirely in the browser using Transformers.js, with no server, no API key, and no data leaving the user's device.

Topics we will cover include:

How to set up and run image classification and image captioning pipelines using Vision Transformer models in the browser.
How to implement browser-based speech transcription using OpenAI's Whisper architecture via the Web Audio API.
How to combine all three pipelines into a single multimodal media analyzer that loads models in parallel and presents results in a unified dashboard.

Introduction

Most browser AI tutorials cover text because it is a natural starting point, but the applications people actually want to build are rarely text-only. Users take photos, record voice notes, upload screenshots. The data is multimodal and the AI should be too.

Transformers.js handles this natively. It supports computer vision (image classification, object detection, segmentation), audio (automatic speech recognition, audio classification, text-to-speech), and multimodal tasks, all running locally in the browser, with no server, no API key, and no data leaving the user's device.

This tutorial builds three capabilities in sequence: image classification, image captioning, and speech transcription. Each is a self-contained HTML file you can open in a browser. The final section combines all three into a single multimodal media analyzer.

What You Need

A modern browser: Chrome 109+, Edge 109+, or Firefox 90+. These versions support ES modules and WebAssembly, both of which Transformers.js requires.
A local web server: Browser security policies block ES module imports from file:// URLs - opening the HTML files directly by double-clicking will not work. You need to serve them over HTTP. You do not need Node.js, npm, or any build tools. The CDN import handles the library.

Starting a Local Server

Pick whichever option matches what you already have installed:

# Python -- pre-installed on macOS and most Linux systems
python3 -m http.server 8080

# Node.js
npx serve .

# VS Code -- install the Live Server extension, then right-click any HTML file
# and choose "Open with Live Server"

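Once the server is running, open the address it prints in your browser. With the Python command above, that is http://localhost:8080. npx serve and Live Server pick their own ports (3000 and 5500 by default), so use whichever URL they report.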

Project Structure

Create one folder for the project. Each task gets its own HTML file:

multimodal-demo/
├── image-classifier.html
├── image-captioner.html
├── speech-transcriber.html
└── media-analyzer.html

Models and Download Sizes

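Every model downloads once on the first run and is then cached by the browser. Subsequent loads read the model files from that cache instead of the network, so they start far faster and the models themselves no longer need a connection. Here is what to expect on the first run: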

Task	Model	Pipeline task	First-run download
Image Classification	`Xenova/vit-base-patch16-224`	`image-classification`	~88 MB
Image Captioning	`Xenova/vit-gpt2-image-captioning`	`image-to-text`	~246 MB
Speech Transcription	`Xenova/whisper-tiny.en`	`automatic-speech-recognition`	~78 MB

The combined app loads all three, roughly 410 MB in total (88 + 246 + 78) on first run. With downloads this large, a progress indicator for each model is non-negotiable UX.
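The totals are easy to verify with a few lines of arithmetic. The sketch below is illustrative only: the `MODELS` array restates the table above, and `totalDownloadMb` is a hypothetical helper, not part of Transformers.js.

```javascript
// Approximate first-run download sizes, copied from the table above.
const MODELS = [
  { task: 'image-classification', model: 'Xenova/vit-base-patch16-224', sizeMb: 88 },
  { task: 'image-to-text', model: 'Xenova/vit-gpt2-image-captioning', sizeMb: 246 },
  { task: 'automatic-speech-recognition', model: 'Xenova/whisper-tiny.en', sizeMb: 78 },
];

// Hypothetical helper: sum the sizes of whichever models a page loads.
function totalDownloadMb(models) {
  return models.reduce((sum, m) => sum + m.sizeMb, 0);
}

console.log(totalDownloadMb(MODELS));             // 412
console.log(totalDownloadMb(MODELS.slice(0, 2))); // 334
```

The second figure is what a page loading only the classifier and the captioner should expect to download.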

Task 1: Image Classification

Image classification assigns labels from a fixed set to an input image. The model used here is ViT-Base/16, a Vision Transformer trained by Google on ImageNet-21k and fine-tuned on ImageNet-1k, converted to ONNX format for browser use. It classifies images into 1,000 ImageNet categories and returns a ranked list with confidence scores.
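Conceptually, the ranked list comes from the model's raw per-class scores (logits): a softmax turns them into probabilities, and only the highest few are kept. The pipeline handles this internally, so the sketch below is just an illustration of the idea, using three made-up logits and hypothetical helper names.

```javascript
// Convert raw logits to probabilities that sum to 1.
function softmax(logits) {
  const max = Math.max(...logits); // subtract the max for numerical stability
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Pair each probability with its label and keep the k highest.
function topK(labels, logits, k) {
  return softmax(logits)
    .map((score, i) => ({ label: labels[i], score }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Made-up values for illustration; a real ImageNet-1k head outputs 1,000 logits.
const labels = ['golden retriever', 'Labrador retriever', 'Sussex spaniel'];
const ranked = topK(labels, [4.0, 1.0, 0.5], 2);
console.log(ranked.map((r) => r.label)); // [ 'golden retriever', 'Labrador retriever' ]
```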

What the output looks like:

// Output from classifier(imageUrl)
[
  { label: 'golden retriever', score: 0.9421 },
  { label: 'Labrador retriever', score: 0.0312 },
  { label: 'Sussex spaniel', score: 0.0098 },
  // ... top_k results total
]

Each object has a label string (the ImageNet class name) and a score float between 0 and 1. By default, the pipeline returns 5 results. Set top_k in the call to get more or fewer.
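Because the model always ranks its known classes, a low top score can be a hint that the image does not match any of them well. A minimal sketch of handling that, assuming results in the shape shown above; the `summarize` helper and the 0.5 cutoff are illustrative choices, not library defaults.

```javascript
// Hypothetical helper: turn pipeline-style results into display strings
// and flag predictions whose top score falls below a chosen cutoff.
function summarize(results, minScore = 0.5) {
  if (results.length === 0) return { confident: false, lines: [] };
  const sorted = [...results].sort((a, b) => b.score - a.score);
  return {
    confident: sorted[0].score >= minScore,
    lines: sorted.map(({ label, score }) => `${label}: ${(score * 100).toFixed(1)}%`),
  };
}

const sample = [
  { label: 'golden retriever', score: 0.9421 },
  { label: 'Labrador retriever', score: 0.0312 },
  { label: 'Sussex spaniel', score: 0.0098 },
];
const summary = summarize(sample);
console.log(summary.confident); // true
console.log(summary.lines[0]);  // golden retriever: 94.2%
```

In a page, the `confident` flag could decide whether to show the top label outright or a softer "not sure" message.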

Full Working Demo

Save the code below as image-classifier.html in your project folder, then open it through your local server (for example, http://localhost:8080/image-classifier.html).

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Image Classifier</title>
<style>
  * { box-sizing: border-box; margin: 0; padding: 0; }
  body { font-family: system-ui, sans-serif; max-width: 700px; margin: 2rem auto; padding: 0 1rem; background: #f8fafc; color: #1e293b; }
  h1 { margin-bottom: 0.25rem; font-size: 1.4rem; }
  .subtitle { color: #64748b; font-size: 0.9rem; margin-bottom: 1.5rem; }
  #status { font-size: 0.85rem; color: #64748b; margin-bottom: 1rem; }
  .upload-area { border: 2px dashed #cbd5e1; border-radius: 8px; padding: 2rem; text-align: center; cursor: pointer; background: white; transition: border-color 0.2s; }
  .upload-area:hover { border-color: #2563eb; }
  .upload-area input { display: none; }
  #preview { margin-top: 1rem; max-width: 100%; border-radius: 8px; display: none; }
  .result-row { display: flex; align-items: center; gap: 0.75rem; margin-top: 0.6rem; }
  .result-label { min-width: 200px; font-size: 0.9rem; }
  .bar-bg { flex: 1; background: #e2e8f0; border-radius: 4px; height: 16px; }
  .bar-fill { background: #2563eb; height: 100%; border-radius: 4px; transition: width 0.4s ease; }
  .result-score { min-width: 48px; text-align: right; font-size: 0.85rem; color: #475569; }
  #results { margin-top: 1.25rem; }
  #results h3 { font-size: 0.95rem; color: #374151; margin-bottom: 0.5rem; }
</style>
</head>
<body>
  <h1>Image Classifier</h1>
  <p class="subtitle">Upload any image - ViT classifies it into ImageNet categories. Runs entirely in your browser.</p>
  <div id="status">Downloading model (~88 MB on first run)...</div>
  <div id="drop-zone" class="upload-area">
    <p>Click to upload or drag an image here</p>
    <p style="font-size:0.8rem;color:#94a3b8;margin-top:0.5rem;">JPG, PNG, WebP, GIF supported</p>
    <input type="file" id="file-input" accept="image/*">
  </div>
  <img id="preview" alt="Preview">
  <div id="results"></div>
  <script type="module">
    import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2';

    const statusEl = document.getElementById('status');
    const dropZone = document.getElementById('drop-zone');
    const fileInput = document.getElementById('file-input');
    const preview = document.getElementById('preview');
    const resultsEl = document.getElementById('results');

    // ── Load the image classification pipeline ──────────────────────────
    // Task: 'image-classification'
    // Model: Xenova/vit-base-patch16-224 -- ViT pretrained on ImageNet-21k, fine-tuned on ImageNet-1k
    // dtype: 'q8' -- 8-bit quantized for smaller download, good accuracy
    let classifier;
    pipeline('image-classification', 'Xenova/vit-base-patch16-224', {
      dtype: 'q8',
      progress_callback: (p) => {
        if (p.status === 'progress') {
          statusEl.textContent = `Downloading model: ${Math.round(p.progress ?? 0)}%`;
        }
      }
    }).then(pipe => {
      classifier = pipe;
      statusEl.textContent = 'Model ready. Upload an image to classify it.';
      dropZone.style.borderColor = '#22c55e';
    }).catch(err => {
      statusEl.textContent = `Error loading model: ${err.message}`;
    });

    // ── Classify an image from a data URL ───────────────────────────────
    async function classifyImage(dataUrl) {
      statusEl.textContent = 'Classifying...';
      resultsEl.innerHTML = '';
      try {
        // Pass the data URL directly -- the pipeline handles image decoding
        // top_k: 5 returns the 5 highest-scoring ImageNet labels
        const results = await classifier(dataUrl, { top_k: 5 });
        statusEl.textContent = 'Done.';

        // Build a bar chart of results
        let html = '<h3>Top predictions</h3>';
        results.forEach(({ label, score }) => {
          const pct = (score * 100).toFixed(1);
          const bar = (score * 100).toFixed(0);
          html += `
            <div class="result-row">
              <span class="result-label">${label}</span>
              <div class="bar-bg"><div class="bar-fill" style="width:${bar}%"></div></div>
              <span class="result-score">${pct}%</span>
            </div>`;
        });
        resultsEl.innerHTML = html;
      } catch (err) {
        statusEl.textContent = `Classification error: ${err.message}`;
      }
    }

    // ── File handling ────────────────────────────────────────────────────
    function handleFile(file) {
      if (!file || !file.type.startsWith('image/')) return;
      const reader = new FileReader();
      reader.onload = (e) => {
        const dataUrl = e.target.result;
        // Show the image preview
        preview.src = dataUrl;
        preview.style.display = 'block';
        // Classify only if the model has finished loading
        if (classifier) {
          classifyImage(dataUrl);
        } else {
          statusEl.textContent = 'Model still loading -- please wait a moment and try again.';
        }
      };
      // Read as a base64 data URL -- works as pipeline input
      reader.readAsDataURL(file);
    }

    // Click to browse
    dropZone.addEventListener('click', () => fileInput.click());
    fileInput.addEventListener('change', (e) => handleFile(e.target.files[0]));

    // Drag and drop
    dropZone.addEventListener('dragover', (e) => {
      e.preventDefault();
      dropZone.style.borderColor = '#2563eb';
    });
    dropZone.addEventListener('dragleave', () => {
      dropZone.style.borderColor = '#cbd5e1';
    });
    dropZone.addEventListener('drop', (e) => {
      e.preventDefault();
      dropZone.style.borderColor = '#cbd5e1';
      handleFile(e.dataTransfer.files[0]);
    });
  </script>
</body>
</html>

What this code does:

The pipeline() call starts downloading the model immediately when the page opens.
The progress callback updates the status text so the user can see the download progressing.
Once classifier is assigned, the drop zone border turns green as a visual cue.
When a file is dropped or selected, FileReader converts it to a base64 data URL, which the pipeline accepts directly as image input - no manual preprocessing needed.
The classifier returns an array of { label, score } objects, which the rendering loop converts into a horizontal bar chart. The top_k: 5 option limits results to the five most likely classes.
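One caveat about the progress callback: a model is usually split across several files, and the callback reports on each file separately, so the displayed percentage can jump around as files start and finish. A minimal sketch of combining those events into one figure, assuming each progress event carries `file`, `loaded`, and `total` fields (treat this as an assumption and check it against the library version you pin); `createProgressTracker` is a hypothetical helper.

```javascript
// Hypothetical helper: aggregate per-file progress events into one percentage.
function createProgressTracker() {
  const files = new Map(); // file name -> { loaded, total }
  return function update(event) {
    if (event.status === 'progress' && event.total > 0) {
      files.set(event.file, { loaded: event.loaded, total: event.total });
    }
    let loaded = 0;
    let total = 0;
    for (const f of files.values()) {
      loaded += f.loaded;
      total += f.total;
    }
    return total === 0 ? 0 : Math.round((loaded / total) * 100);
  };
}

// Simulated events; in the page, pass each event from progress_callback.
const track = createProgressTracker();
track({ status: 'progress', file: 'config.json', loaded: 100, total: 100 });
const pct = track({ status: 'progress', file: 'model.onnx', loaded: 450, total: 900 });
console.log(pct); // 55
```

The total only includes files seen so far, so the figure can still dip when a large new file is announced. It is a smoother signal than the per-file number, not an exact one.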

Task 2: Image Captioning

Image captioning generates a natural language sentence describing what is in an image. It is meaningfully different from classification: instead of picking from 1,000 fixed labels, the model generates free-form text. Compare "A golden retriever running through a field of tall grass" with just "golden retriever": the caption is more descriptive and more flexible, at the cost of a larger model.

The model used here is Xenova/vit-gpt2-image-captioning, a Vision Transformer encoder that reads the image paired with a GPT-2 decoder that generates the caption. The ONNX version weighs in at 246 MB, noticeably larger than the classifier, because the generative decoder is a full language model.
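The encoder-decoder split can be pictured as a loop: the encoder runs once on the image, then the decoder is called repeatedly, each time choosing the next token given the image features and the tokens generated so far. The toy sketch below uses a stub in place of a real decoder, so every name and value in it is made up for illustration. It shows greedy decoding only, and the generation settings the real pipeline uses may differ.

```javascript
// Greedy decoding: repeatedly pick the highest-scoring next token until
// the end-of-sequence token appears or maxTokens is reached.
function greedyDecode(nextTokenScores, imageFeatures, { eos, maxTokens }) {
  const tokens = [];
  while (tokens.length < maxTokens) {
    const scores = nextTokenScores(imageFeatures, tokens); // { token: score }
    const best = Object.keys(scores).reduce((a, b) => (scores[b] > scores[a] ? b : a));
    if (best === eos) break;
    tokens.push(best);
  }
  return tokens.join(' ');
}

// Stub decoder: replays a fixed caption and ignores the image features.
// A real GPT-2 decoder would score every token in its vocabulary.
const script = ['a', 'dog', 'on', 'grass', '<eos>'];
function stubDecoder(_features, tokensSoFar) {
  const target = script[tokensSoFar.length];
  return Object.fromEntries(script.map((t) => [t, t === target ? 1 : 0]));
}

const caption = greedyDecode(stubDecoder, [0.1, 0.2], { eos: '<eos>', maxTokens: 10 });
console.log(caption); // a dog on grass
```

The repeated decoder calls are also why captioning takes noticeably longer per image than classification, which needs a single forward pass.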

What the output looks like:

// Output from captioner(imageUrl)
[{ generated_text: 'a dog is playing on a tennis court' }]

The output is an array with one object containing a generated_text string. It is always an array even for a single image, because the pipeline supports batching.
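Since the result is always an array, reading it defensively avoids a TypeError if generation ever returns nothing. A small sketch, assuming the output shape shown above; `extractCaption` and its fallback string are hypothetical.

```javascript
// Hypothetical helper: pull the caption text out of an image-to-text result.
function extractCaption(output, fallback = '(no caption generated)') {
  const text = output?.[0]?.generated_text?.trim();
  if (!text) return fallback;
  // Capitalize the first letter for display.
  return text[0].toUpperCase() + text.slice(1);
}

console.log(extractCaption([{ generated_text: 'a dog is playing on a tennis court' }]));
// A dog is playing on a tennis court
console.log(extractCaption([])); // (no caption generated)
```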

Full Working Demo

Save this file as image-captioner.html and open it through your local server, the same way as the classifier demo.

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Image Captioner</title>
<style>
  * { box-sizing: border-box; margin: 0; padding: 0; }
  body { font-family: system-ui, sans-serif; max-width: 700px; margin: 2rem auto; padding: 0 1rem; background: #f8fafc; color: #1e293b; }
  h1 { margin-bottom: 0.25rem; font-size: 1.4rem; }
  .subtitle { color: #64748b; font-size: 0.9rem; margin-bottom: 1.5rem; }
  #status { font-size: 0.85rem; color: #64748b; margin-bottom: 1rem; }
  .upload-area { border: 2px dashed #cbd5e1; border-radius: 8px; padding: 2rem; text-align: center; cursor: pointer; background: white; }
  .upload-area input { display: none; }
  #preview { margin-top: 1rem; max-width: 100%; border-radius: 8px; display: none; }
  .comparison { display: grid; grid-template-columns: 1fr 1fr; gap: 1rem; margin-top: 1.25rem; }
  .result-card { background: white; border: 1px solid #e2e8f0; border-radius: 8px; padding: 1rem; }
  .result-card h3 { font-size: 0.8rem; text-transform: uppercase; letter-spacing: 0.05em; color: #64748b; margin-bottom: 0.5rem; }
  .caption-text { font-size: 1rem; color: #1e293b; line-height: 1.5; font-style: italic; }
  .label-list { list-style: none; }
  .label-list li { font-size: 0.9rem; padding: 0.2rem 0; border-bottom: 1px solid #f1f5f9; display: flex; justify-content: space-between; }
  @media (max-width: 500px) { .comparison { grid-template-columns: 1fr; } }
</style>
</head>
<body>
  <h1>Image Captioner</h1>
  <p class="subtitle">Generates a natural language description of any image. Runs classification and captioning in parallel for comparison.</p>
  <div id="status">Downloading models (~334 MB on first run)...</div>
  <div id="drop-zone" class="upload-area">
    <p>Click to upload or drag an image here</p>
    <input type="file" id="file-input" accept="image/*">
  </div>
  <img id="preview" alt="Preview">
  <div class="comparison" id="comparison">
    <div class="result-card">
      <h3>Classification (fixed labels)</h3>
      <ul id="label-list" class="label-list"></ul>
    </div>
    <div class="result-card">
      <h3>Caption (generated text)</h3>
      <p id="caption-text" class="caption-text">--</p>
    </div>
  </div>
  <script type="module">
    import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2';

    const statusEl = document.getElementById('status');
    const dropZone = document.getElementById('drop-zone');
    const fileInput = document.getElementById('file-input');
    const preview = document.getElementById('preview');
    const comparisonEl = document.getElementById('comparison');
    const labelListEl = document.getElementById('label-list');
    const captionEl = document.getElementById('caption-text');

    // ── Load both pipelines in parallel ─────────────────────────────────
    // Using Promise.all starts both downloads simultaneously,
    // which is faster than loading them one after the other.
    let classifier, captioner;
    // Track each download separately so the two progress callbacks
    // do not overwrite each other's status text.
    const percent = { classifier: 0, captioner: 0 };
    const onProgress = (name) => (p) => {
      if (p.status === 'progress') {
        percent[name] = Math.round(p.progress ?? 0);
        statusEl.textContent =
          `Downloading classifier: ${percent.classifier}% | captioner: ${percent.captioner}%`;
      }
    };
    Promise.all([
      pipeline('image-classification', 'Xenova/vit-base-patch16-224', {
        dtype: 'q8',
        progress_callback: onProgress('classifier')
      }),
      pipeline('image-to-text', 'Xenova/vit-gpt2-image-captioning', {
        dtype: 'q8',
        progress_callback: onProgress('captioner')
      })
    ]).then(([classifierPipe, captionerPipe]) => {
      classifier = classifierPipe;
      captioner = captionerPipe;
      statusEl.textContent = 'Both models ready. Upload an image.';
      dropZone.style.borderColor = '#22c55e';
    }).catch(err => {
      statusEl.textContent = `Error loading models: ${err.message}`;
    });

    // ── Analyze an image ────────────────────────────────────────────────
    async function analyzeImage(dataUrl) {
      statusEl.textContent = 'Analyzing...';
      labelListEl.innerHTML = '';
      captionEl.textContent = 'Generating...';
      try {
        const [classResults, captionResults] = await Promise.all([
          classifier(dataUrl, { top_k: 3 }),
          captioner(dataUrl)
        ]);
        statusEl.textContent = 'Done.';
        // Show classification labels
        let labelsHtml = '';
        classResults.forEach(({ label, score }) => {
          labelsHtml += `<li><span>${label}</span><span>${(score * 100).toFixed(1)}%</span></li>`;
        });
        labelListEl.innerHTML = labelsHtml;
        // Show caption
        captionEl.textContent = captionResults[0].generated_text;
      } catch (err) {
        statusEl.textContent = `Analysis error: ${err.message}`;
      }
    }

    // ── File handling ────────────────────────────────────────────────────
    function handleFile(file) {
      if (!file || !file.type.startsWith('image/')) return;
      const reader = new FileReader();
      reader.onload = (e) => {
        const dataUrl = e.target.result;
        preview.src = dataUrl;
        preview.style.display = 'block';
        if (classifier && captioner) {
          analyzeImage(dataUrl);
        } else {
          statusEl.textContent = 'Models still loading -- please wait.';
        }
      };
      reader.readAsDataURL(file);
    }

    dropZone.addEventListener('click', () => fileInput.click());
    fileInput.addEventListener('change', (e) => handleFile(e.target.files[0]));
    dropZone.addEventListener('dragover', (e) => e.preventDefault());
    dropZone.addEventListener('drop', (e) => {
      e.preventDefault();
      handleFile(e.dataTransfer.files[0]);
    });
  </script>
</body>
</html>
