What this project is

This project is a tech demo that showcases a semantic search engine running entirely in your browser. Semantic search is a modern, AI-powered search technique that matches on the contextual meaning and intent behind your query rather than on exact keywords. The underlying models use natural language processing (NLP) to capture grammar, intent, and the relationships between words. An embedding model converts text into vectors, and a similarity measure such as cosine similarity then finds the nearest neighbors in vector space to surface the best results.

For example, a query such as “why is my model great on training data but bad on new data?” can surface a document about overfitting (the term for this phenomenon) even though the query never uses the word “overfitting.”

How this project works

To start, I used Claude to generate a “corpus” (a.k.a. a knowledge base, i.e. a collection of documents) covering 42 useful and interesting machine learning concepts. This is the content library that you can search through in this demo.

When you load the demo for the first time, you must click “Load model & enable search”, which downloads an embedding model directly from HuggingFace’s CDN. Specifically, the demo uses an embedding model called all-MiniLM-L6-v2 (~25 MB). After the first download, the model is cached locally so you don’t have to keep re-downloading it. In this demo, every page refresh runs the model locally again and rebuilds the embeddings (although these could easily be cached too). Building the embeddings converts the corpus into vectors, which are held in memory until the next page reload.
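As a rough illustration of the embedding step described above, here is a minimal TypeScript sketch. It is not the demo's actual code: `CorpusDoc`, `Embedder`, `buildIndex`, and `fakeEmbed` are hypothetical names, and the embedder is passed in as a function so the sketch runs without downloading anything. The comment at the end shows how a real embedder might be wired up with transformers.js; the model id `Xenova/all-MiniLM-L6-v2` there is an assumption about which hosted copy of the model is used.

```typescript
// Hypothetical types for illustration; the demo's real data shapes may differ.
type CorpusDoc = { id: string; text: string };
type IndexedDoc = CorpusDoc & { vector: number[] };
type Embedder = (texts: string[]) => Promise<number[][]>;

// Embed every document once and keep the resulting vectors in memory.
async function buildIndex(docs: CorpusDoc[], embed: Embedder): Promise<IndexedDoc[]> {
  const vectors = await embed(docs.map((d) => d.text));
  if (vectors.length !== docs.length) {
    throw new Error("embedder returned a different number of vectors than documents");
  }
  return docs.map((doc, i) => ({ ...doc, vector: vectors[i] }));
}

// Stand-in embedder so the sketch runs offline. This is NOT a real model: it
// just counts characters and lowercase vowels to produce a 2-dimensional vector.
const fakeEmbed: Embedder = async (texts) =>
  texts.map((t) => [t.length, (t.match(/[aeiou]/g) ?? []).length]);

// With transformers.js, a real embedder might look like this (assumed wiring):
//   import { pipeline } from "@huggingface/transformers";
//   const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
//   const embed: Embedder = async (texts) =>
//     (await extractor(texts, { pooling: "mean", normalize: true })).tolist();
```

Because the index lives in an ordinary array, it disappears on reload, which matches the in-memory behavior described above. Persisting it (for example in IndexedDB) would be one way to avoid rebuilding it each time.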

When text is fed through this embedding model, the output is a 384-dimensional vector. This demo searches a small amount of text, which is why I use this popular (and small) embedding model. Larger models produce higher-dimensional embeddings (e.g. 512, 768, 1024, 1536, or 3072 dimensions), but those are overkill for a corpus this small.

How search works

When you enter a search query, the same model converts it into a 384-dimensional vector too. That vector is then compared against every document vector using cosine similarity (basically: which vectors point in the most similar direction). The closest ones rank highest, and that similarity is the colored match score you see on each result.
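The ranking step can be sketched in a few lines of TypeScript. This is a minimal illustration under assumptions rather than the demo's own code: `DocVector`, `cosineSimilarity`, and `rankBySimilarity` are hypothetical names, and the tiny 2-dimensional vectors in the usage example stand in for real 384-dimensional embeddings.

```typescript
// Hypothetical shape: a document id plus its precomputed embedding.
type DocVector = { id: string; vector: number[] };

// Cosine similarity: dot(a, b) / (|a| * |b|). Ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // Treat a zero-length vector as "no similarity" instead of dividing by zero.
  return denominator === 0 ? 0 : dot / denominator;
}

// Score every document against the query and return the best matches first.
function rankBySimilarity(query: number[], docs: DocVector[], topK = 5) {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

// Usage with toy vectors: "b" points the same way as the query, so it ranks first.
const toyResults = rankBySimilarity(
  [1, 0],
  [
    { id: "a", vector: [0, 1] },
    { id: "b", vector: [2, 0] },
    { id: "c", vector: [1, 1] },
  ],
  2,
);
console.log(toyResults.map((r) => r.id)); // [ 'b', 'c' ]
```

A brute-force scan like this is reasonable for a corpus of 42 documents. If the embeddings are already normalized to unit length, cosine similarity reduces to a plain dot product.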

What actually runs the model in your browser?

An embedding model is essentially a large chunk of complex matrix math, which in this case runs on-device (in the browser). Under the hood, transformers.js (an npm package from HuggingFace, specifically @huggingface/transformers) runs the embedding model locally using ONNX Runtime Web, which executes it with WebAssembly (WASM), a low-level instruction format that lets precompiled code run at near-native speed inside the browser’s sandbox (the model runs as compiled math on your CPU). Newer browsers also expose WebGPU, the modern successor to WebGL that lets code tap the GPU for general-purpose computation, and it is available here as an optional backend. WebGPU can dramatically accelerate larger models, but it is overkill for a model this small (WASM on the CPU is more than fast enough). To keep all of this from freezing the page, the entire model pipeline runs in a Web Worker (a separate background thread), so the interface stays responsive while the worker downloads the model, builds embeddings, and answers your queries.
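To make the Web Worker arrangement more concrete, here is a minimal sketch of the message handling that could sit inside such a worker. The message shapes (`WorkerRequest`, `WorkerResponse`), the `SearchBackend` interface, and `createMessageHandler` are all hypothetical and not taken from the demo. The heavy work is hidden behind an interface so the handler logic can be read (and run) on its own.

```typescript
// Hypothetical message protocol between the page and the worker.
type WorkerRequest = { type: "load" } | { type: "search"; query: string };
type WorkerResponse =
  | { type: "ready"; documentCount: number }
  | { type: "results"; ids: string[] }
  | { type: "error"; message: string };

// The expensive work (model download, embedding, ranking) hides behind this
// interface so the message handling can be shown in isolation.
interface SearchBackend {
  load(): Promise<number>; // resolves to the number of embedded documents
  search(query: string): Promise<string[]>; // resolves to ranked document ids
}

function createMessageHandler(backend: SearchBackend) {
  let loaded = false;
  return async (request: WorkerRequest): Promise<WorkerResponse> => {
    try {
      if (request.type === "load") {
        const documentCount = await backend.load();
        loaded = true;
        return { type: "ready", documentCount };
      }
      // Searching before the model is loaded is reported, not thrown.
      if (!loaded) {
        return { type: "error", message: "model not loaded yet" };
      }
      return { type: "results", ids: await backend.search(request.query) };
    } catch (err) {
      return { type: "error", message: err instanceof Error ? err.message : String(err) };
    }
  };
}

// Inside a worker file this might be wired up roughly as (assumed wiring):
//   const handle = createMessageHandler(realBackend);
//   self.onmessage = async (event) => self.postMessage(await handle(event.data));
// and on the page:
//   const worker = new Worker(new URL("./worker.ts", import.meta.url), { type: "module" });
//   worker.onmessage = (event) => render(event.data); // render() is hypothetical
//   worker.postMessage({ type: "search", query: "why does my model overfit?" });
```

Because the page and the worker only exchange small messages, the main thread never blocks on the matrix math. It posts a request and updates the interface when a response arrives.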

For this tech demo I am using two CDNs: the HuggingFace CDN (mentioned earlier in this post) and the jsDelivr CDN (which serves the ONNX Runtime Web + WASM binaries). Realistically, if you were serving semantic search in a production application, you would either run all of this on your own backend server or self-host the embedding model and the runtime from your own CDN (so that you own and control distribution), but for this demo the publicly available CDNs are fine. The benefit of this architecture is that my website itself remains lightweight and portable (it ships only the tiny transformers.js library code, which orchestrates everything); the embedding model and the runtime are downloaded from the two CDNs when needed and then executed in your browser.

How this relates to RAG

This project is a tech demo for semantic search, which is the retrieval half (the “R”) of RAG. Full RAG (Retrieval-Augmented Generation) would take those top results and feed them to an LLM to write a natural-language answer. This demo stops at retrieval: it finds and ranks the relevant documents but doesn’t generate prose. LLMs are typically orders of magnitude larger than a small embedding model like this one, and they generate language, whereas embedding models only encode the meaning of text as vectors, which was perfect for this small self-contained project.
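For readers curious what the missing “augmented generation” half might involve, here is a minimal sketch of the hand-off: the top retrieved documents are packed into a prompt that an LLM could then answer from. Nothing here is part of the demo. `RetrievedDoc`, `buildRagPrompt`, and the prompt wording are hypothetical, and the actual LLM call is deliberately left out.

```typescript
// Hypothetical shape of a search hit; assumes hits arrive sorted best-first.
type RetrievedDoc = { title: string; text: string; score: number };

// Pack the top hits into a numbered context block followed by the question.
function buildRagPrompt(question: string, hits: RetrievedDoc[], maxDocs = 3): string {
  const context = hits
    .slice(0, maxDocs)
    .map((hit, i) => `[${i + 1}] ${hit.title}\n${hit.text}`)
    .join("\n\n");
  return [
    "Answer the question using only the numbered context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}
```

The resulting string would be sent to whichever LLM is used. Retrieval quality matters here because the model can only ground its answer in the documents that made it into the context.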

I plan on exploring other AI/ML projects including RAG and small/local LLMs in the near future, but for now I wanted to put together a quick demo of semantic search since it’s a fun and useful concept that runs efficiently on modern hardware.

Thanks for reading!