Montevive

ALIA's Tokenizer: Why Spanish, Catalan, Basque, and Galician Cost Almost Half the Tokens of Llama 3

erobles · Artificial Intelligence

What you don't see when using an LLM: it doesn't read your text, it reads tokens

When you ask ChatGPT, Claude, or your local LLM something, the model never sees the sentence you wrote. The first thing that happens is that a component called the tokenizer splits your text into pieces — tokens — and the model only receives numbers. All inference (and all cost) operates on those tokens, not on words.

The consequence, which is rarely discussed publicly: two models can need very different numbers of tokens to process exactly the same text. And models whose tokenizers were built with English in mind are much less efficient with other languages.

A few weeks ago we published ALIA quantized to NVFP4 so anyone could run it on an NVIDIA DGX Spark. While researching that, we discovered something that deserves its own article: ALIA's tokenizer is radically different from those of Llama 3, Mistral, or GPT, and in our tests on Iberian languages it is between 1.7 and 1.9 times more efficient than Llama 3's.

Since a paragraph won't convince anyone, we built a tool so you can see for yourself:

🌐 Try it live: labs.montevive.ai/alia-tokenizer-comparison/ — live comparison between ALIA, Llama 3, and Mistral tokenizers. Paste any text and watch in real time how each model chunks it. Runs 100% in your browser: no text leaves your device.

🎥 Video demo (3 min): a visual tour of the demo with examples in Spanish, Catalan, Basque, and Galician.

The numbers, in a table

We took an administrative paragraph in each of the four Iberian languages and ran it through all three tokenizers. The results are striking:

| Language | ALIA (tokens) | Llama 3 (tokens) | Mistral (tokens) | ALIA vs Llama 3 |
|---|---|---|---|---|
| Spanish (legal-administrative text) | 31 | 53 | 67 | 1.71× more efficient |
| Catalan (institutional text) | 34 | 62 | 78 | 1.82× more efficient |
| Basque (public services) | 42 | 81 | 102 | 1.93× more efficient |
| Galician (local administration) | 38 | 65 | 80 | 1.71× more efficient |
| English (equivalent text) | 47 | 41 | 50 | 0.87× (Llama 3 wins) |
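
The last column is simply the Llama 3 count divided by the ALIA count. A minimal sketch that recomputes it from the counts in the table (the dictionary layout and function name are ours, chosen for illustration):

```python
# Token counts copied from the table above: (ALIA, Llama 3, Mistral).
COUNTS = {
    "Spanish": (31, 53, 67),
    "Catalan": (34, 62, 78),
    "Basque": (42, 81, 102),
    "Galician": (38, 65, 80),
    "English": (47, 41, 50),
}


def efficiency_ratio(alia_tokens: int, other_tokens: int) -> float:
    """How many times as many tokens the other tokenizer needs compared with ALIA."""
    return other_tokens / alia_tokens


for language, (alia, llama3, mistral) in COUNTS.items():
    print(
        f"{language:<9} vs Llama 3: {efficiency_ratio(alia, llama3):.2f}x  "
        f"vs Mistral: {efficiency_ratio(alia, mistral):.2f}x"
    )
```

A ratio above 1 means ALIA needs fewer tokens; a ratio below 1, as in the English row against Llama 3, means the other tokenizer wins.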

The difference is most noticeable in administrative and territory-specific vocabulary — exactly the use case that matters most to a Spanish public administration:

  • Generalitat: 1 token in ALIA, 3 tokens in Llama 3 (Gen, eral, itat)
  • ayuntamiento: 1 token in ALIA, 3 tokens in Llama 3 (ay, untami, ento)
  • Cataluña: 1 token in ALIA, 3 tokens in Llama 3
  • Euskadi: 1 token in ALIA, 4 tokens in Llama 3
  • Xunta: 1 token in ALIA, 2 tokens in Llama 3
  • concejalía: 1 token in ALIA, 4 tokens in Llama 3

ALIA recognizes these pieces as atomic units. Llama 3 breaks them into meaningless fragments.
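
To make the effect of vocabulary coverage concrete, here is a toy sketch. It is not how either tokenizer really works: real BPE and SentencePiece models apply learned merges or probabilities, while this uses a simple greedy longest-match. The two vocabularies are hypothetical, hand-built from the fragments quoted above.

```python
def greedy_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split a word by repeatedly taking the longest vocabulary piece that matches."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No entry matched: fall back to one character so the loop always advances.
            pieces.append(word[start])
            start += 1
    return pieces


# Hypothetical vocabularies for illustration only.
SMALL_VOCAB = {"ay", "untami", "ento", "Gen", "eral", "itat"}
IBERIAN_VOCAB = SMALL_VOCAB | {"ayuntamiento", "Generalitat"}

for word in ("ayuntamiento", "Generalitat"):
    print(word, greedy_tokenize(word, SMALL_VOCAB), greedy_tokenize(word, IBERIAN_VOCAB))
```

The splitting rule is identical in both runs; only the vocabulary changes. That is the point: a word is encoded as one piece only if the vocabulary has an entry for it.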

Why does this matter? Four concrete reasons

1. Cost

The price of any LLM API is measured in tokens, not words. If your RAG pipeline processes 10 million words per month in Spanish, with an English-centric tokenizer you pay 70-90% more than you would with a tokenizer like ALIA's, assuming the same price per token. The difference applies to every part of the request: prompt + retrieved context + response, everything counts.

For a RAG system over the BOE (Spanish Official Gazette), municipal files, or healthcare documentation, where a single prompt can carry several thousand words of context, the annual bill can nearly double.
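
A back-of-the-envelope sketch of that bill. The 10 million words and the 53/31 ratio come from this article; the tokens-per-word baseline and the price are hypothetical placeholders, and the comparison assumes both models charge the same price per token.

```python
# Figures from the article: 10 million words per month, and the Spanish row of
# the table (31 tokens with ALIA, 53 with Llama 3 for the same paragraph).
WORDS_PER_MONTH = 10_000_000
LLAMA3_VS_ALIA = 53 / 31

# Assumptions for illustration only (not from the article or any price list).
ALIA_TOKENS_PER_WORD = 1.0
PRICE_PER_MILLION_TOKENS = 0.50  # hypothetical, in euros


def monthly_cost(words: int, tokens_per_word: float, price_per_million: float) -> float:
    """Monthly bill for a workload billed per token."""
    tokens = words * tokens_per_word
    return tokens / 1_000_000 * price_per_million


alia_cost = monthly_cost(WORDS_PER_MONTH, ALIA_TOKENS_PER_WORD, PRICE_PER_MILLION_TOKENS)
llama3_cost = monthly_cost(
    WORDS_PER_MONTH, ALIA_TOKENS_PER_WORD * LLAMA3_VS_ALIA, PRICE_PER_MILLION_TOKENS
)
extra = llama3_cost / alia_cost - 1

print(f"ALIA-style tokenizer: {alia_cost:.2f} EUR/month")
print(f"Llama 3 tokenizer:    {llama3_cost:.2f} EUR/month ({extra:.0%} more)")
```

The absolute amounts depend entirely on the placeholder values. The relative overhead does not: it is fixed by the token ratio alone.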

2. Speed

An LLM's generation speed is measured in tokens per second, not words per second. If your model generates at 50 tok/s and your tokenizer needs 1.7× as many tokens for the same paragraph, your user waits 1.7× longer for the same answer. For a conversational assistant in Spanish, that factor is the difference between "instant" and "noticeable latency."
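
As a quick worked example, take the Spanish paragraph from the table (31 tokens with ALIA, 53 with Llama 3) and the 50 tok/s figure above. The sketch assumes, as the paragraph does, that both models decode at the same rate, and it ignores prompt processing time.

```python
TOKENS_PER_SECOND = 50.0  # generation speed used in the text above


def generation_seconds(token_count: int, tokens_per_second: float) -> float:
    """Time to generate a given number of tokens at a constant decode rate."""
    return token_count / tokens_per_second


alia_seconds = generation_seconds(31, TOKENS_PER_SECOND)
llama3_seconds = generation_seconds(53, TOKENS_PER_SECOND)

print(f"ALIA:    {alia_seconds:.2f} s")
print(f"Llama 3: {llama3_seconds:.2f} s ({llama3_seconds / alia_seconds:.2f}x longer)")
```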

3. Context window

All LLMs have a context limit measured in tokens. With an efficient tokenizer, the same window fits roughly 1.7 times as much Spanish text:

  • Llama 3 with 8,192 tokens of context ≈ 6,000 words of Spanish
  • ALIA with 8,192 tokens of context ≈ 10,000 words of Spanish

For RAG over long documents (court rulings, administrative files, medical records) that's the difference between "I have to split the document into five pieces" and "it fits whole."
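
A rough sketch of that calculation, using the approximate figures above (6,000 and 10,000 words per 8,192 tokens) as tokens-per-word rates. The 9,000-word document is a hypothetical example, and the sketch ignores the room a real deployment must reserve for the prompt and the response.

```python
import math

CONTEXT_TOKENS = 8_192

# Tokens per word implied by the approximate figures in the list above.
TOKENS_PER_WORD = {
    "Llama 3": CONTEXT_TOKENS / 6_000,
    "ALIA": CONTEXT_TOKENS / 10_000,
}


def chunks_needed(
    document_words: int, tokens_per_word: float, context_tokens: int = CONTEXT_TOKENS
) -> int:
    """Number of context-sized pieces a document must be split into."""
    document_tokens = math.ceil(document_words * tokens_per_word)
    return math.ceil(document_tokens / context_tokens)


DOCUMENT_WORDS = 9_000  # hypothetical court ruling

for model, rate in TOKENS_PER_WORD.items():
    print(f"{model}: {chunks_needed(DOCUMENT_WORDS, rate)} chunk(s)")
```

With these numbers the document fits in a single ALIA window but has to be split in two for Llama 3.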

4. Quality

This is the most subtle but perhaps most important consequence. When a model chunks ayuntamiento into ay + untami + ento, its attention mechanism has to reconstruct the meaning from individually meaningless fragments. Each token "sees" the rest of the sentence worse, and the model spends capacity re-assembling words before it can even begin reasoning about them.

When ALIA sees ayuntamiento as a single token, that token can carry the complete semantics of the word from the start. The quality of Spanish responses tends to improve, not because the model is better in the abstract, but because the input is cleaner.

Why ALIA is like this: a vocabulary built from scratch

Most current open-source models inherit their tokenizer from Llama (128,000 tokens, tiktoken-style BPE) or from Mistral (32,000 tokens, SentencePiece). Those vocabularies were trained on English-dominated corpora. Words in Spanish, Catalan, Basque, or Galician weren't sufficiently represented to achieve efficient encoding, so they appear fragmented.

ALIA does something different: the Barcelona Supercomputing Center team trained a SentencePiece tokenizer from scratch on a multilingual Iberian corpus, with a vocabulary of 256,000 tokens — double that of Llama 3 and eight times that of Mistral. That extra size is spent on pieces useful for Iberian languages: institution names, legal-administrative vocabulary, Catalan verbal morphology, Basque suffixes, Galician roots.

The result is what the demo shows: each "natural" word from Iberian administration is encoded as a single piece, not as a puzzle of fragments.

What this means for your project

If you work with text in Spanish, Catalan, Basque, or Galician, and especially if you work with Spanish administrative, legal, or sector-specific text, using an LLM with an English-centric tokenizer is a silent efficiency loss that shows up in billing, latency, and response quality.

ALIA in NVFP4 gives you both things at once:

  1. The Iberian tokenizer: roughly 40% fewer tokens for the same text in Iberian languages (Llama 3 needs about 1.7× as many)
  2. A 40B parameter model trained on data representative of Iberian culture, now executable and adaptable on a €4,000 NVIDIA DGX Spark thanks to NVFP4 quantization

If you want to understand how to deploy and adapt ALIA to your organization's domain, we cover it in this other article.

Try it right now

The demo is at labs.montevive.ai/alia-tokenizer-comparison/. Paste a paragraph from your local official gazette, a Constitutional Court ruling, a company circular. You'll see all three tokenizers working in parallel, with token counts, efficiency ratios, and colored chips for each piece.

And it all runs 100% in your browser, with transformers.js: no text you paste leaves your device, not even for tokenization. Consistent with how we build things at Montevive.

More demos in the lab

This is the second demo at labs.montevive.ai. The first, also local-first and private, detects personally identifiable information (PII) in the text you paste, without servers and without sending data anywhere.

Same principle in both cases: what can be done locally without losing quality, should be done locally.

About ALIA: collaboration between BSC and Spanish research centers

ALIA is coordinated by the Barcelona Supercomputing Center (BSC-CNS), under the leadership of the Secretary of State for Digitalization and Artificial Intelligence (SEDIA) and driven by the Government of Spain.

The project builds upon ILENIA (Impulse of Languages in Artificial Intelligence), a consortium that integrates research centers specialized in language technologies for each co-official language:

Participating centers

Funding: Recovery, Transformation and Resilience Plan (NextGeneration EU), EuroHPC Joint Undertaking (European supercomputing consortium) and regional governments.

ALIA represents the convergence of these previous efforts into a single multilingual, sovereign infrastructure, trained on MareNostrum 5 (BSC, Barcelona).

Want to adapt ALIA to your organization?

At Montevive we help public administrations, regulated companies, and cooperatives deploy generative AI within their own infrastructure, keeping data in-house:

  • Domain fine-tuning on ALIA, adapted to your terminology (legal, healthcare, sector-specific)
  • Deployment on NVIDIA DGX Spark, Blackwell servers, or private cloud
  • Integration with your existing systems: APIs, RAG, specialized agents

📧 Contact: info@montevive.ai
🌐 More information: montevive.ai

ALIA belongs to everyone. And its tokenizer, on top of that, is built for us.