ALIA on NVIDIA DGX Spark: Adapt Spain's State LLM to Your Domain on a mini HPC

By erobles · Artificial Intelligence

What if you could adapt ALIA to your domain—not just deploy it—on a €4,000 box?

In January 2025 we wrote about ALIA, the multilingual language model from the Spanish Government developed by the Barcelona Supercomputing Center (BSC). It's a serious project: 40 billion parameters; native support for Spanish, Catalan, Basque, and Galician (which translates into a tokenizer almost 2× more efficient than English-centric models, as you can check live in our demo); and training data representative of Iberian languages and culture.

But there's a chasm between "having an open-source model" and "being able to truly customize it for my use case." Adapting a 40B parameter model—its internal vocabulary, its tone, its knowledge about your specific domain—has historically required a server cluster that most organizations don't have.

Today we're publishing the results of a month's work: ALIA-40b-instruct-2601 quantized to NVFP4 (the native format of new NVIDIA Blackwell GPUs). The model now fits comfortably in an NVIDIA DGX Spark—the new AI development workstation the size of a Mac mini, with 128 GB of unified memory—with plenty of room for local fine-tuning with LoRA/QLoRA. As a bonus, it also serves inference at ~10 tokens per second from the same box.

The model and all code are available right now on Hugging Face.

🎥 Video demo (5 min): we show the quantized model loading, responding in Spanish at about 10 tokens per second, and the memory headroom left free for fine-tuning.

Why this matters

ALIA is the most ambitious technological piece of Spain's public AI strategy. BSC and the Government have invested significant resources so Spain has a sovereign LLM: trained on European data, with an open license, and with quality comparable to commercial models for peninsular languages.

But a generic model—no matter how good—is rarely what you need in production. What your administration or company needs is ALIA speaking your specific language: the legal terminology of regional legislation, the healthcare vocabulary of your hospital, the incident corpus of your support department, the technical jargon of your sector.

This requires fine-tuning: taking the base model and training it a few more epochs on data representative of your domain. And this is where most projects get stuck:

  • Full fine-tuning of a 40B in BF16: needs ~400 GB of GPU memory between weights, gradients, optimizer states, and activations. Hardware: ~€80,000 server with 8× H100.
  • Cloud fine-tuning: saves you the hardware but forces you to send your private data to a US provider, which contradicts the sovereignty and GDPR motivation.
  • Aggressive generic 3-4 bit quantizations (Q3/Q4): particularly degrade quality for minority languages—precisely where ALIA is most valuable.
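
The ~400 GB figure in the first bullet is simple arithmetic. Here is a back-of-the-envelope sketch; the exact total depends on optimizer precision and activation memory (with FP32 Adam states it lands slightly above the cited figure, and an 8-bit optimizer cuts it roughly in half):

```python
def full_ft_bytes(n_params: float,
                  bytes_weights: float = 2,   # BF16 weights
                  bytes_grads: float = 2,     # BF16 gradients
                  bytes_optim: float = 8):    # Adam m + v, both in FP32
    """Rough memory estimate for full fine-tuning, excluding activations."""
    return n_params * (bytes_weights + bytes_grads + bytes_optim)

print(f"{full_ft_bytes(40e9) / 1e9:.0f} GB before activations")   # 480 GB
print(f"{full_ft_bytes(40e9, bytes_optim=1) / 1e9:.0f} GB with an 8-bit optimizer")
```

Either way, the order of magnitude is far beyond any single-GPU workstation, which is exactly the problem the rest of this article addresses.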

The consequence: many Spanish public administrations, cooperatives, and SMEs know ALIA exists, but can't find a realistic path to customize it on their own infrastructure.

NVIDIA DGX Spark + NVFP4: The opportunity

NVIDIA launched the DGX Spark in early 2026 as its compact AI development workstation: a GB10 GPU (Blackwell architecture), 128 GB of unified memory, and a price around €4,000. It's not designed to serve thousands of concurrent users—servers like the B100/H100 exist for that. It's designed so an AI development team has, on their desktop, a box where they can:

  • Fine-tune models up to ~70B with efficient techniques (LoRA, QLoRA, SFT, DPO)
  • Prototype agents and pipelines without depending on paid APIs
  • Iterate on your own data without taking it out of the organization
  • Serve local inference to validate results or for pilot deployments

The technical key enabling all this is NVFP4, the 4-bit floating-point format NVIDIA introduced with Blackwell. Unlike previous quantizations, NVFP4 is designed at the hardware level: Blackwell GPUs have specific tensor cores that multiply NVFP4 matrices at speeds comparable to FP8, with one-third the memory size. And crucially, it maintains quality much closer to the original BF16 than generic Q3/Q4 quantizations.
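
To make the block-scaled 4-bit idea concrete, here is a toy sketch of E2M1 quantization. This is illustrative only: the real NVFP4 format packs two values per byte, uses 16-element blocks, and stores each block scale in FP8 E4M3, none of which is reproduced here.

```python
import numpy as np

# Positive magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x: np.ndarray):
    """Quantize one block of values to signed E2M1 plus a shared scale.
    Real NVFP4 stores the scale in FP8 E4M3; here it stays a plain float."""
    scale = np.abs(x).max() / 6.0          # map the largest magnitude onto 6
    if scale == 0.0:
        scale = 1.0
    # nearest representable magnitude for each element
    idx = np.abs(np.abs(x) / scale - E2M1[:, None]).argmin(axis=0)
    return np.sign(x) * E2M1[idx], scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.array([0.1, -0.7, 2.3, -3.9, 0.0, 1.2, 0.45, -0.05])
q, s = quantize_block(x)
print(np.round(dequantize(q, s), 3).tolist())
```

The per-block scale is what keeps quality close to BF16: each small group of weights gets its own dynamic range instead of sharing one across the whole tensor.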

ALIA in NVFP4 occupies ~27 GB, versus 76 GB for the original. That means on a DGX Spark you have over 90 GB free—enough to do QLoRA fine-tuning comfortably: LoRA adapters + Adam optimizer states + activations + reasonable context, all in the same box, without touching the cloud.
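
Why LoRA adapters fit so comfortably in that free memory also comes down to arithmetic. A sketch with illustrative dimensions (the d_model, layer count, and rank below are assumptions for the example, not ALIA's actual configuration):

```python
def lora_params(d_model: int, n_layers: int, rank: int,
                targets_per_layer: int = 4) -> int:
    """Parameters added by LoRA: each adapted (d_model x d_model) matrix gains
    two low-rank factors, A (rank x d_model) and B (d_model x rank),
    i.e. 2 * rank * d_model extra parameters."""
    return n_layers * targets_per_layer * 2 * rank * d_model

# Illustrative dimensions only; not ALIA's actual configuration.
p = lora_params(d_model=8192, n_layers=48, rank=16)
trainable_gb = p * 12 / 1e9   # BF16 weights + grads plus FP32 Adam, ~12 bytes/param
print(f"{p / 1e6:.1f}M adapter params, ~{trainable_gb:.2f} GB of trainable state")
```

Tens of millions of trainable parameters, well under 1 GB of optimizer state: the remaining headroom goes to activations and context, which is why QLoRA on a single box is realistic.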

On paper everything checked out. In practice, getting here required finding and fixing a bug in llama.cpp that had been silently producing corrupt files for months for any Llama-family model quantized to NVFP4. The technical details are at the end of the article for those who want to look under the hood; first, what this means for you.

What we've published today

Four open artifacts, all under the Apache 2.0 license (the same as original ALIA):

1. GGUF for llama.cpp and Ollama: montevive/ALIA-40b-instruct-2601-NVFP4-GGUF—27 GB, ready for llama-cli or deployment with Ollama. Mainly useful for inference and quick validation.

2. Safetensors for vLLM, TensorRT-LLM, and fine-tuning: montevive/ALIA-40b-instruct-2601-NVFP4—the same model in compressed-tensors format, compatible with the HuggingFace ecosystem. This is the base for applying QLoRA or LoRA to customize ALIA to your domain (via peft, axolotl, unsloth, or trl).

3. The patch for llama.cpp: Pull Request #22611—open and awaiting review from the maintenance team. Once merged, any future NVFP4 GGUF of a Llama model will come out correct without needing to manually apply the patch.

4. Interactive ALIA tokenizer demo: labs.montevive.ai/alia-tokenizer-comparison/—live comparator between ALIA, Llama 3, and Mistral tokenizers, running 100% in your browser. Paste any text and instantly see why ALIA is ~1.7× more efficient with peninsular languages. We cover it in detail in this dedicated article.

Who can leverage this today

Spanish public administration

A DGX Spark + ALIA allows a municipality, provincial government, regional ministry, or state agency to:

  • Adapt ALIA to their internal corpus: legal jargon, historical records, specific regulations
  • Train specialized versions: one for citizen service in co-official languages, another for grant classification, another for meeting minutes summarization
  • Validate results locally before deploying to production
  • Keep all data within the organization throughout the development cycle

Companies with sovereignty or strict GDPR requirements

Regulated sectors (healthcare, banking, legal, defense) that need to adapt generative AI to their domain without sending data to the cloud. Hardware cost: €4,000 for the workstation, versus €30,000-80,000 for a fine-tuning server with H100s, versus €0 upfront but with your data continuously flowing to OpenAI/Anthropic.

Cooperatives, SMEs, and startups

Small teams wanting a specialized assistant for their product, technical support, or contracts. The DGX Spark pays for itself in a few months if your workload is continuous and data is sensitive.

Researchers and developers

If you work with Iberian LLMs, ALIA-40B in NVFP4 is probably the best available starting point in May 2026 for experimenting with multilingual fine-tuning on reasonable hardware.

Why Montevive

We've spent years working on applied AI with a focus on privacy, sovereignty, and efficiency. We don't sell hype or massive cloud models: we build the missing tools so truly useful AI works on your own infrastructure.

This ALIA quantization fits the same line as our recent work: making locally accessible what the market assumes has to live in the cloud.

Want to adapt ALIA to your organization?

If you work in a public administration, company, or cooperative that wants to use ALIA seriously—adapted to your domain, integrated into your systems, running on your infrastructure—we can help you do it:

  • Domain fine-tuning: we adapt ALIA to your terminology (legal, healthcare, technical, sector-specific) using LoRA/QLoRA on DGX Spark or Blackwell servers
  • Deployment consulting: we evaluate your use case, size hardware, design fine-tuning + inference architecture
  • Installation and configuration: NVIDIA DGX Spark, Blackwell servers, or private cloud infrastructure
  • Integration with your systems: APIs, RAG over internal documents, specialized agents, continuous evaluation

📧 Contact us: info@montevive.ai

🌐 More information: montevive.ai

ALIA belongs to everyone. Making it accessible to you—adaptable, executable, sovereign—is what we're building.


🔧 For the technical folks: how we did it

If you're interested in the engineering side of the story—what bug we found, how we diagnosed it, what numbers prove the patch works, and what inference performance our DGX Spark measures—keep reading. If not, thanks for making it this far.

The bug: when quantization produces random words

We applied the standard flow to produce the GGUF (the file format from llama.cpp):

  1. Quantize the BF16 model to NVFP4 with NVIDIA ModelOpt
  2. Convert the result to GGUF with convert_hf_to_gguf.py
  3. Run with llama-cli on the DGX Spark

Result: for the prompt "The capital of France is", the model responded with random tokens mixed with chat markers: Certainlyrics|assistant|assistant|...

Curiously, the same model quantized to an older format (Q5_K_M) worked perfectly. The problem was specific to NVFP4. And the logs showed no errors: the conversion "succeeded", the model "loaded", inference "ran". Except the output was garbage.

The investigation

After several iterations (including writing an NVFP4 CUDA kernel by hand to rule out runtime issues), we found the source by reading the converter code line by line.

Llama family models (to which ALIA belongs) require a specific row permutation in the q_proj and k_proj matrices (the "query" and "key" projections of the attention mechanism) so GGML's RoPE convention matches Hugging Face's. That permutation is correctly applied in the standard BF16 path, inside the LlamaModel.modify_tensors function.

But the new NVFP4 path bypasses that function. It writes the quantized weights directly into the GGUF through _repack_nvfp4, without going through the permutation. Result: the attention heads come out shuffled, the model "works" technically but generates pure noise.
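
The permutation itself is compact. Here is a simplified sketch of what LlamaModel.permute in convert_hf_to_gguf.py does (the real function also handles grouped-query attention via a separate n_head_kv; this version assumes equal query and key head counts):

```python
import numpy as np

def permute_qk(weights: np.ndarray, n_head: int) -> np.ndarray:
    """Row permutation llama.cpp's converter applies to q_proj/k_proj so that
    GGML's interleaved RoPE layout matches Hugging Face's half-split layout.
    Simplified from LlamaModel.permute in convert_hf_to_gguf.py."""
    return (weights
            .reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))

# Toy weight matrix: 8 rows = 2 heads x 4 head dimensions, 3 input columns.
w = np.arange(24.0).reshape(8, 3)
print(permute_qk(w, n_head=2)[:, 0].tolist())  # rows reordered to 0,2,1,3, 4,6,5,7
```

It only reorders rows within each head, which is exactly why skipping it produces a model that loads cleanly but computes attention with shuffled head dimensions.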

This bug affects any Llama-architecture model quantized to NVFP4 with llama.cpp's standard converter: Llama 1/2/3, Mistral, Qwen2, ALIA, and all their derivatives. It is likely that others have unknowingly produced corrupt GGUFs, because the broken models still appear to load correctly.

The solution: 16 lines and an upstream contribution

The patch applies the same permutation to NVFP4 matrices inside _repack_nvfp4. It's 16 lines that intentionally duplicate the logic from LlamaModel.permute to maintain separation between ModelBase (parent class) and LlamaModel (child class).

To validate it, we reproduced the bug on a small model (TinyLlama-1.1B) and measured perplexity—the standard metric for detecting LLM degradation:

| Build | Perplexity (wikitext-2) |
|---|---|
| BF16 reference | 44.0 |
| NVFP4 GGUF (master, without patch) | 4419 (total garbage) |
| NVFP4 GGUF with our patch | 43.9 |

Perplexity goes from 4419 to 43.9—essentially matching the BF16 reference. The same is confirmed qualitatively with ALIA-40B: coherent responses in Spanish, Catalan, Basque, and Galician with the patch; garbage without it.
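
Perplexity is just the exponential of the mean per-token negative log-likelihood, which is why it detects this kind of corruption so reliably. A minimal sketch (the 3.8 nats/token figure is a round number chosen to land near the BF16 reference, not a measured value):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# A healthy model: ~3.8 nats/token lands near the BF16 reference of 44.
print(round(perplexity([3.8] * 100), 1))          # 44.7
# Shuffled attention heads make predictions near-uniform over a large
# vocabulary, pushing NLL toward log(vocab): perplexity explodes.
print(round(perplexity([math.log(4419)] * 100)))  # 4419
```

A two-orders-of-magnitude jump like 44 to 4419 cannot come from quantization noise; it signals that the model's predictions carry essentially no information.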

We've sent the patch upstream to llama.cpp so the rest of the world inherits it automatically: ggml-org/llama.cpp#22611.

Inference performance on DGX Spark

While the primary use case for DGX Spark is development and fine-tuning, we did measure inference to validate model quality. On the DGX Spark (GB10 Blackwell, 128 GB unified memory, aarch64), greedy generation with Spanish prompts:

| Backend | Format | Generation |
|---|---|---|
| llama.cpp (GPU) | NVFP4 | 10.2 tok/s |
| llama.cpp (GPU) | Q8_0 (reference) | 5.4 tok/s |
| vLLM (GPU) | NVFP4 | 8.7 tok/s |
| llama.cpp (CPU, ARM NEON) | Q8_0 | 2.7 tok/s |

NVFP4 on Blackwell is ~1.9× faster in inference than Q8_0, and occupies 33% less on disk and memory. The number we care most about, however, is the 90+ GB free that NVFP4 leaves on a DGX Spark—space for gradients, activations, optimizer states, and context during QLoRA fine-tuning.
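
The ratios above reduce to simple arithmetic. Note that the Q8_0 file size used below is our own rough assumption consistent with the stated 33% figure, not a measurement reported in the tables:

```python
# Throughput and size ratios from the benchmark numbers above.
print(round(10.2 / 5.4, 2))                  # ~1.9x: NVFP4 vs Q8_0 generation speed

# Assumed Q8_0 size (~40.5 GB for a 40B model); not a measured figure.
nvfp4_gb, q8_gb = 27, 40.5
print(round(100 * (1 - nvfp4_gb / q8_gb)))   # ~33% smaller on disk and in memory
print(128 - nvfp4_gb)                        # 101 GB of unified memory left free
```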

If you want to dive into absolute detail—the commit, the design rationale, the TinyLlama reproducer—everything is in the upstream PR #22611 and in the README of the Hugging Face repo.