Building Semantic Search with Transformers.js and Sentence Embeddings
Machine Learning Mastery Grade 10 3d ago


In this article, you will learn how sentence embeddings work and how to build a fully client-side semantic search engine using Transformers.js, with no server, no API key, and no backend infrastructure required.

Topics we will cover include:

- How sentence embeddings and cosine similarity form the foundation of semantic search.
- How to generate and cache embeddings using the Transformers.js feature-extraction pipeline, including batching and Web Worker offloading.
- How to build a complete, reusable SemanticSearch class and persist its index across page loads.

Introduction

You’ve probably shipped this bug before, where a user types “affordable laptop” into your search bar and gets zero results. But you know the database has dozens of laptop articles. They’re just all titled “budget notebook.” The words are different. The meaning is identical. Keyword search treats both as unrelated strings.

This isn’t an edge case. It’s the core limitation of keyword matching: it compares characters, not concepts. It doesn’t know that “cancel” and “return” describe related actions, that “broken” and “defective” mean the same thing, or that “I can’t log in” and “account access issue” are the same problem phrased two different ways.

Semantic search fixes this by comparing meaning. And with Transformers.js, you can build it entirely in the browser with no server, no API key, and no backend infrastructure. This tutorial walks through the full pipeline: how sentence embeddings work, how to generate them, how cosine similarity scores relevance, and how to wire it all into a working knowledge base search application.

What Sentence Embeddings Actually Are

A transformer model cannot process raw text. Before any computation happens, a sentence needs to become numbers. Embeddings are the result of that conversion: a sentence represented as a list of floating-point values called a vector. The key property isn’t just that sentences become numbers.
It’s that sentences with similar meaning become vectors that are geometrically close to each other in the same vector space. The model used throughout this tutorial, sentence-transformers/all-MiniLM-L6-v2, maps every sentence to a point in a 384-dimensional vector space. The model was fine-tuned on over 1 billion sentence pairs specifically to learn this geometric property. “I need to cancel my order” and “How do I return a product?” end up close together. “The weather is beautiful today” ends up far from both.

The 384 dimensions aren’t human-readable. You can’t look at dimension 47 and say what it encodes. What matters for search is not any individual dimension but the distance between two vectors. Short distance means similar meaning. Large distance means unrelated.

Pooling and Normalization

The raw transformer model outputs one vector per token; every word and subword in a sentence gets its own vector. For semantic search, you need one vector per sentence. Mean pooling handles this by averaging all token vectors, weighted by the attention mask, so padding tokens don’t contribute. Normalization then scales the result to unit length (magnitude = 1), which simplifies the similarity calculation covered in the cosine similarity section.

In Transformers.js, both happen automatically when you pass { pooling: 'mean', normalize: true } to the pipeline call. Without these options, you get token-level embeddings, which are useful for tasks like named entity recognition, but not for sentence-level search.

The Feature-Extraction Pipeline

The feature-extraction task differs from most other Transformers.js pipelines. Tasks like text-classification or question-answering return human-readable outputs: labels, scores, strings. feature-extraction returns the raw vector representations that the model computed internally. You’re working one level lower, getting the numbers that all higher-level tasks are built on top of.
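The pooling and normalization steps described in this section can be sketched in plain JavaScript. This is a minimal illustration on toy 2-dimensional vectors, not the library's internal implementation: meanPool and l2Normalize are hypothetical helper names, and real token vectors from this model have 384 dimensions.

```javascript
// Illustrative only: meanPool and l2Normalize are hypothetical helpers,
// not part of the Transformers.js API.

/**
 * Average the token vectors, skipping positions whose attention mask is 0.
 */
function meanPool(tokenVectors, attentionMask) {
  const dim = tokenVectors[0].length;
  const sum = new Array(dim).fill(0);
  let count = 0;
  for (let t = 0; t < tokenVectors.length; t++) {
    if (attentionMask[t] === 0) continue; // padding token: ignore it
    for (let d = 0; d < dim; d++) sum[d] += tokenVectors[t][d];
    count++;
  }
  return sum.map((v) => v / Math.max(count, 1));
}

/**
 * Scale a vector to unit length (L2 norm = 1).
 */
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((acc, v) => acc + v * v, 0));
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Toy input: three token vectors in 2 dimensions; the last one is padding.
const tokenVectors = [[3, 0], [0, 4], [100, 100]];
const attentionMask = [1, 1, 0];

const pooled = meanPool(tokenVectors, attentionMask); // average of the two real tokens
const sentenceVector = l2Normalize(pooled);           // same direction, length 1
console.log(pooled, sentenceVector);
```

The padding vector [100, 100] has no effect on the result because its mask entry is 0, which is why the mask matters when a batch contains sentences of different lengths.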
```javascript
import { pipeline } from 'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.2';

// Load the feature-extraction pipeline
// Xenova/all-MiniLM-L6-v2 is the ONNX-converted version of
// sentence-transformers/all-MiniLM-L6-v2 -- same model weights, browser-compatible format
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { dtype: 'q8' } // 8-bit quantization: smaller download (~23 MB), good accuracy
);

// Embed a single sentence
// pooling: 'mean' -- averages all token vectors into one sentence vector
// normalize: true -- scales the result to unit length (needed for cosine similarity)
const output = await extractor('I need help with my order', {
  pooling: 'mean',
  normalize: true
});

console.log(output);
// Tensor {
//   dims: [1, 384],         // 1 sentence, 384 dimensions
//   type: 'float32',
//   data: Float32Array(384) // the actual numbers
// }

// Convert to a plain JavaScript array for use in your own code
const vector = output.tolist()[0]; // [0.045, 0.073, -0.012, ...] -- 384 numbers
console.log(`Vector length: ${vector.length}`); // 384
```

What this code does:

- pipeline() downloads and initializes the model on first run (the browser caches it after that, so subsequent page loads skip the download)
- You then call the extractor with a string and the two options that give you a single, normalized sentence vector
- The result is a Tensor object; calling .tolist()[0] converts it to a plain JavaScript array of 384 numbers you can work with directly

Understanding the Output Tensor

The Tensor object returned by feature-extraction has three fields worth knowing:

- dims is the shape [n_sentences, 384]. Pass one sentence and dims[0] is 1. Pass ten sentences in a batch and dims[0] is 10.
The second dimension is always 384 for this model.
- type is 'float32', meaning each of the 384 values is a 32-bit floating-point number.
- data is a Float32Array containing all the numbers in row-major order. For a batch of 3 sentences, this is a flat array of 3 × 384 = 1,152 numbers.

.tolist() converts the tensor to a nested JavaScript array, one inner array per sentence. output.tolist()[0] gives the vector for the first sentence as a plain array of 384 numbers.

Batching: Embed Multiple Sentences at Once

Passing an array of strings to the extractor processes all of them in a single model call. This is significantly faster than calling the pipeline once per sentence, because the transformer processes all inputs in parallel within one forward pass.

```javascript
// Embed multiple documents in one call -- always prefer this over looping
const sentences = [
  'How do I track my shipment?',
  'What is your return policy?',
  'How can I reset my password?',
  'Do you offer international delivery?'
];

const batchOutput = await extractor(sentences, {
  pooling: 'mean',
  normalize: true
});

// batchOutput.dims = [4, 384] -- 4 sentences, each with 384 dimensions
console.log(`Batch shape: [${batchOutput.dims}]`);

// Convert to array of arrays -- one 384-element array per sentence
const vectors = batchOutput.tolist();
console.log(`Number of vectors: ${vectors.length}`); // 4
console.log(`Each vector has: ${vectors[0].length} dimensions`); // 384
```

What this code does:

- Instead of four separate extractor() calls, one call handles all four sentences simultaneously
- The transformer architecture is optimized for batched input, so the time it takes to embed 10 sentences together is much closer to embedding 1 sentence than to embedding 10 individually

Batching is the most important performance decision in a semantic search system. When indexing a corpus of 50 documents, one batch call is far faster than 50 individual calls.
The difference compounds as your corpus grows.

Cosine Similarity: The Math Behind the Search

Once you have vectors for your documents and a vector for the search query, you need a way to measure how similar any two vectors are. That’s what cosine similarity does.

Cosine similarity measures the cosine of the angle between two vectors. A score of 1.0 means the vectors point in the same direction (identical meaning). A score of 0 means they’re completely unrelated. Because we used normalize: true when generating embeddings, both vectors already have unit length (magnitude = 1), which simplifies the formula considerably:

```
cosine_similarity(A, B) = (A · B) / (|A| × |B|)

Since normalize: true sets |A| = |B| = 1, this becomes:

cosine_similarity(A, B) = A · B = Σ(A[i] × B[i])
```

Just sum the element-wise products of the two vectors. That number is the cosine similarity.

For sentence embeddings with mean pooling and normalization, practical scores fall roughly in these ranges:

| Score Range | Interpretation |
|---|---|
| 0.90 to 1.00 | Near-identical meaning |
| 0.70 to 0.90 | Strong semantic match |
| 0.50 to 0.70 | Related topic, different angle |
| 0.30 to 0.50 | Loose connection |
| Below 0.30 | Likely unrelated |

Here’s the implementation:

```javascript
/**
 * Compute cosine similarity between two normalized vectors.
 *
 * This is just the dot product because normalize: true ensures
 * both vectors already have unit length, making the denominator 1.
 *
 * @param {number[]|Float32Array} vecA - First normalized embedding vector
 * @param {number[]|Float32Array} vecB - Second normalized embedding vector
 * @returns {number} Similarity score between -1 and 1 (typically 0 to 1 for sentences)
 */
function cosineSimilarity(vecA, vecB) {
  if (vecA.length !== vecB.length) {
    throw new Error(`Vector length mismatch: ${vecA.length} vs ${vecB.length}`);
  }

  // Sum the element-wise products of the two vectors
  let dotProduct = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
  }
  return dotProduct;
}
```

[...]

```javascript
// [...] doc.text);

// Single batch call embeds all documents at once -- much faster than looping
const output = await this.extractor(texts, {
  pooling: 'mean',
  normalize: true
});

// Convert the tensor to an array of 384-element arrays, one per document
const vectors = output.tolist();

// Attach each vector to its original document object
// The spread (...doc) preserves
```
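As a self-contained illustration of how the dot-product score from the cosine similarity section turns into search results, here is a minimal sketch that ranks a few documents against a query vector. The dot and rankBySimilarity helpers and the hand-made 3-dimensional unit vectors are assumptions made for this example. Real embeddings from all-MiniLM-L6-v2 have 384 dimensions, and this is not necessarily how the article's SemanticSearch class ranks results.

```javascript
// Illustrative sketch: dot and rankBySimilarity are hypothetical helpers.
// The vectors are hand-made 3-dimensional unit vectors, not real model output.

// Dot product; equals cosine similarity when both inputs have unit length.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Score every document against the query and keep the topK best matches.
function rankBySimilarity(queryVector, docs, topK = 3) {
  return docs
    .map((doc) => ({ text: doc.text, score: dot(queryVector, doc.vector) }))
    .sort((a, b) => b.score - a.score) // highest score first
    .slice(0, topK);
}

const docs = [
  { text: 'budget notebook', vector: [0.6, 0.8, 0] },
  { text: 'return policy', vector: [0, 0.6, 0.8] },
  { text: 'weather report', vector: [0, 0, 1] },
];

// Stands in for the embedding of a query such as "affordable laptop".
const queryVector = [0.8, 0.6, 0];

const results = rankBySimilarity(queryVector, docs, 2);
console.log(results);
```

Because every vector is already unit length, no division is needed at query time: scoring a corpus is one dot product per document followed by a sort.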
