
Transformers.js: language models and ML in the browser


An introduction to Transformers.js by Hugging Face, with an LLM demo that runs in the browser on CPU (WASM) or GPU (WebGPU), depending on browser capabilities.

Transformers.js is a JavaScript library by Hugging Face that lets you run machine learning models directly in the browser, with no server required. It is the JavaScript equivalent of the well-known Python transformers library by Hugging Face, with a similar API for web developers.

Why run models in the browser

Running inference on the client has clear advantages:

- No infrastructure cost: there is no server to pay for, since everything runs on the user's machine.
- Privacy: prompts and responses never leave the device; no data is sent anywhere.
- Offline use: once the model is downloaded and cached, it can run without a connection.

In short: zero dollars and completely private.

Transformers.js uses ONNX Runtime to execute models. Under the hood it can use:

- WebAssembly (WASM) to run on the CPU, available in virtually every modern browser.
- WebGPU to run on the GPU, where the browser supports it.

The library automatically picks the best available backend (WebGPU if available, otherwise WASM on CPU).
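The backend can also be requested explicitly through the `device` option of the pipeline API. A minimal sketch (the model name is illustrative; any model with ONNX weights on the Hub works):

```javascript
import { pipeline } from '@huggingface/transformers';

// Request WebGPU explicitly; use 'wasm' to stay on the CPU instead.
const classifier = await pipeline(
  'sentiment-analysis',
  'Xenova/distilbert-base-uncased-finetuned-sst-2-english',
  { device: 'webgpu' }
);

const result = await classifier('Browser-side inference is great!');
```

If the requested backend is unavailable in the current browser, loading fails rather than silently falling back, so feature-detecting `navigator.gpu` before choosing `'webgpu'` is a reasonable pattern.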

What you can do with Transformers.js

It supports many tasks and modalities:

- Text: classification, generation, summarization, translation, question answering.
- Vision: image classification, object detection, segmentation.
- Audio: speech recognition, text-to-speech.
- Multimodal: image captioning, zero-shot image classification.

The high-level API is based on the concept of a pipeline: you choose the task and optionally the model, and the library takes care of downloading and running it.

import { pipeline } from '@huggingface/transformers';

// Downloads a default model for the task on first use, then runs it locally.
const pipe = await pipeline('sentiment-analysis');
const result = await pipe('I love this article!');
// result: [{ label: 'POSITIVE', score: ... }]

Models are downloaded from the Hugging Face Hub and quantized versions (e.g. q4, q8) can be used to reduce size and memory requirements in constrained environments.
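Quantization is selected with the `dtype` option. A sketch of loading 8-bit weights (the model name is illustrative; which `dtype` values are available depends on the ONNX exports published for that model on the Hub):

```javascript
import { pipeline } from '@huggingface/transformers';

// Load 8-bit quantized weights to cut download size and memory use.
const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-135M-Instruct',
  { dtype: 'q8' } // e.g. 'q4' trades more quality for an even smaller download
);
```

Lower-bit variants like `q4` shrink the download further at some cost in output quality, which matters most for the smallest models.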

Example: LLM in the browser (CPU or GPU)

The following chat runs an LLM (SmolLM2) entirely in your browser using Transformers.js. The model is downloaded once and cached. Inference runs in a Web Worker so the UI stays responsive, responses are streamed token by token, and markdown is rendered in real time.

If your browser supports WebGPU, inference will be accelerated using the GPU and the Llama 3.2 1B model will be unlocked; otherwise WASM (CPU) will be used.
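The worker-plus-streaming setup described above can be sketched roughly as follows. This is browser-only code: `TextStreamer` and its `callback_function`/`skip_prompt` options are part of Transformers.js, while the model name, message shape, and message types posted back to the page are illustrative:

```javascript
// worker.js — runs off the main thread so the UI stays responsive
import { pipeline, TextStreamer } from '@huggingface/transformers';

const generator = await pipeline(
  'text-generation',
  'HuggingFaceTB/SmolLM2-360M-Instruct'
);

self.onmessage = async ({ data: messages }) => {
  // Forward each generated token to the page as soon as it is produced.
  const streamer = new TextStreamer(generator.tokenizer, {
    skip_prompt: true,
    callback_function: (token) => self.postMessage({ type: 'token', token }),
  });

  const output = await generator(messages, { max_new_tokens: 512, streamer });
  self.postMessage({ type: 'done', output });
};
```

On the page, `new Worker('worker.js', { type: 'module' })` posts the chat history and appends each streamed token to the rendered markdown as it arrives.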

(Interactive demo: select a model and download it to the browser to start chatting. Download: 182 MB · recommended RAM: ~512 MB.)

Conclusions

Although WebGPU is not yet available in all browsers, WASM provides a universal fallback that can run smaller models on CPU with acceptable results. As you can verify yourself in the chat above, WebGPU acceleration is already sufficient to hold fluid conversations with models up to 1B parameters.

The main drawback is model size: even quantized, they range from ~180 MB to over 1 GB. However, once downloaded they are stored in the browser cache, so subsequent visits load the model almost instantly without re-downloading.

This technology opens a promising path to integrate artificial intelligence into websites with no infrastructure cost, without sending data to third parties and without relying on paid APIs. Everything happens on the user’s device: zero dollars and total privacy.

The current version (v3) still has limitations with certain quantization formats on WebGPU. However, Transformers.js v4 is already available as a preview and brings a WebGPU runtime completely rewritten in C++ together with the ONNX Runtime team. It promises support for models over 8B parameters, up to 4x faster embedding models, direct GPU weight loading without going through WASM, and full offline operation. Once it stabilizes, the landscape of AI in the browser will take a significant leap forward.
