ALIA on NVIDIA DGX Spark: Adapt Spain's State LLM to Your Domain on a mini HPC

By erobles · Artificial Intelligence

What if you could adapt ALIA to your domain—not just deploy it—on a €4,000 box?

In January 2025 we wrote about ALIA, the multilingual language model from the Spanish Government developed by the Barcelona Supercomputing Center (BSC). It's a serious project: 40 billion parameters; native support for Spanish, Catalan, Basque, and Galician (which translates into a tokenizer almost 2× more efficient than English-centric models, as you can check live in our demo); and training data representative of Iberian languages and culture.

But there's a chasm between "having an open-source model" and "being able to truly customize it for my use case." Adapting a 40B parameter model—its internal vocabulary, its tone, its knowledge about your specific domain—has historically required a server cluster that most organizations don't have.

Today we're publishing the results of a month's work: ALIA-40b-instruct-2601 quantized to NVFP4 (the native format of new NVIDIA Blackwell GPUs). The model now fits comfortably in an NVIDIA DGX Spark—the new AI development workstation the size of a Mac mini, with 128 GB of unified memory—with plenty of room for local fine-tuning with LoRA/QLoRA. As a bonus, it also serves inference at ~10 tokens per second from the same box.

The model and all code are available right now on Hugging Face.

🎥 Video demo (5 min): we show the quantized model loading, responding in Spanish at about 10 tokens per second, and the memory headroom left free for fine-tuning.

Why this matters

ALIA is the most ambitious technological piece of Spain's public AI strategy. BSC and the Government have invested significant resources so Spain has a sovereign LLM: trained on European data, with an open license, and with quality comparable to commercial models for peninsular languages.

But a generic model—no matter how good—is rarely what you need in production. What your administration or company needs is ALIA speaking your specific language: the legal terminology of regional legislation, the healthcare vocabulary of your hospital, the incident corpus of your support department, the technical jargon of your sector.

This requires fine-tuning: taking the base model and training it a few more epochs on data representative of your domain. And this is where most projects get stuck:

  • Full fine-tuning of a 40B in BF16: needs ~400 GB of GPU memory between weights, gradients, optimizer states, and activations. Hardware: ~€80,000 server with 8× H100.
  • Cloud fine-tuning: saves you the hardware but forces you to send your private data to a US provider, which contradicts the sovereignty and GDPR motivation.
  • Aggressive generic 3-4 bit quantizations (Q3/Q4): particularly degrade quality for minority languages—precisely where ALIA is most valuable.
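
The ~400 GB figure in the first bullet is simple arithmetic. Here is a back-of-the-envelope sketch; the exact total depends on optimizer precision and activation memory (with FP32 Adam states it lands slightly above the cited figure, and an 8-bit optimizer cuts it roughly in half):

```python
def full_ft_bytes(n_params: float,
                  bytes_weights: float = 2,   # BF16 weights
                  bytes_grads: float = 2,     # BF16 gradients
                  bytes_optim: float = 8):    # Adam m + v, both in FP32
    """Rough memory estimate for full fine-tuning, excluding activations."""
    return n_params * (bytes_weights + bytes_grads + bytes_optim)

print(f"{full_ft_bytes(40e9) / 1e9:.0f} GB before activations")   # 480 GB
print(f"{full_ft_bytes(40e9, bytes_optim=1) / 1e9:.0f} GB with an 8-bit optimizer")
```

Either way, the order of magnitude is far beyond any single-GPU workstation, which is exactly the problem the rest of this article addresses.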

The consequence: many Spanish public administrations, cooperatives, and SMEs know ALIA exists, but can't find a realistic path to customize it on their own infrastructure.

NVIDIA DGX Spark + NVFP4: The opportunity

NVIDIA launched the DGX Spark in early 2026 as its compact AI development workstation: a GB10 GPU (Blackwell architecture), 128 GB of unified memory, and a price around €4,000. It's not designed to serve thousands of concurrent users—servers like the B100/H100 exist for that. It's designed so an AI development team has, on their desktop, a box where they can:

  • Fine-tune models up to ~70B with efficient techniques (LoRA, QLoRA, SFT, DPO)
  • Prototype agents and pipelines without depending on paid APIs
  • Iterate on your own data without taking it out of the organization
  • Serve local inference to validate results or for pilot deployments

The technical key enabling all this is NVFP4, the 4-bit floating-point format NVIDIA introduced with Blackwell. Unlike previous quantizations, NVFP4 is designed at the hardware level: Blackwell GPUs have specific tensor cores that multiply NVFP4 matrices at speeds comparable to FP8, with one-third the memory size. And crucially, it maintains quality much closer to the original BF16 than generic Q3/Q4 quantizations.
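
To make the block-scaled 4-bit idea concrete, here is a toy sketch of E2M1 quantization. This is illustrative only: the real NVFP4 format packs two values per byte, uses 16-element blocks, and stores each block scale in FP8 E4M3, none of which is reproduced here.

```python
import numpy as np

# Positive magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block(x: np.ndarray):
    """Quantize one block of values to signed E2M1 plus a shared scale.
    Real NVFP4 stores the scale in FP8 E4M3; here it stays a plain float."""
    scale = np.abs(x).max() / 6.0          # map the largest magnitude onto 6
    if scale == 0.0:
        scale = 1.0
    # nearest representable magnitude for each element
    idx = np.abs(np.abs(x) / scale - E2M1[:, None]).argmin(axis=0)
    return np.sign(x) * E2M1[idx], scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.array([0.1, -0.7, 2.3, -3.9, 0.0, 1.2, 0.45, -0.05])
q, s = quantize_block(x)
print(np.round(dequantize(q, s), 3).tolist())
```

The per-block scale is what keeps quality close to BF16: each small group of weights gets its own dynamic range instead of sharing one across the whole tensor.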

ALIA in NVFP4 occupies ~27 GB, versus 76 GB for the original. That means on a DGX Spark you have over 90 GB free—enough to do QLoRA fine-tuning comfortably: LoRA adapters + Adam optimizer states + activations + reasonable context, all in the same box, without touching the cloud.
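
Why LoRA adapters fit so comfortably in that free memory also comes down to arithmetic. A sketch with illustrative dimensions (the d_model, layer count, and rank below are assumptions for the example, not ALIA's actual configuration):

```python
def lora_params(d_model: int, n_layers: int, rank: int,
                targets_per_layer: int = 4) -> int:
    """Parameters added by LoRA: each adapted (d_model x d_model) matrix gains
    two low-rank factors, A (rank x d_model) and B (d_model x rank),
    i.e. 2 * rank * d_model extra parameters."""
    return n_layers * targets_per_layer * 2 * rank * d_model

# Illustrative dimensions only; not ALIA's actual configuration.
p = lora_params(d_model=8192, n_layers=48, rank=16)
trainable_gb = p * 12 / 1e9   # BF16 weights + grads plus FP32 Adam, ~12 bytes/param
print(f"{p / 1e6:.1f}M adapter params, ~{trainable_gb:.2f} GB of trainable state")
```

Tens of millions of trainable parameters, well under 1 GB of optimizer state: the remaining headroom goes to activations and context, which is why QLoRA on a single box is realistic.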

On paper everything checked out. In practice, getting here required finding and fixing a bug in llama.cpp that had been silently producing corrupt files for months for any Llama-family model quantized to NVFP4. The technical details are at the end of the article for those who want to look under the hood; first, what this means for you.

What we've published today

Four open artifacts, all under the Apache 2.0 license (the same as original ALIA):

1. GGUF for llama.cpp and Ollama: montevive/ALIA-40b-instruct-2601-NVFP4-GGUF—27 GB, ready for llama-cli or deployment with Ollama. Mainly useful for inference and quick validation.

2. Safetensors for vLLM, TensorRT-LLM, and fine-tuning: montevive/ALIA-40b-instruct-2601-NVFP4—the same model in compressed-tensors format, compatible with the HuggingFace ecosystem. This is the base for applying QLoRA or LoRA to customize ALIA to your domain (via peft, axolotl, unsloth, or trl).

3. The patch for llama.cpp: Pull Request #22611—open and awaiting review from the maintenance team. Once merged, any future NVFP4 GGUF of a Llama model will come out correct without needing to manually apply the patch.

4. Interactive ALIA tokenizer demo: labs.montevive.ai/alia-tokenizer-comparison/—live comparator between ALIA, Llama 3, and Mistral tokenizers, running 100% in your browser. Paste any text and instantly see why ALIA is ~1.7× more efficient with peninsular languages. We cover it in detail in this dedicated article.

Who can leverage this today

Spanish public administration

A DGX Spark + ALIA allows a municipality, provincial government, regional ministry, or state agency to:

  • Adapt ALIA to their internal corpus: legal jargon, historical records, specific regulations
  • Train specialized versions: one for citizen service in co-official languages, another for grant classification, another for meeting minutes summarization
  • Validate results locally before deploying to production
  • Keep all data within the organization throughout the development cycle

Companies with sovereignty or strict GDPR requirements

Regulated sectors (healthcare, banking, legal, defense) that need to adapt generative AI to their domain without sending data to the cloud. Hardware cost: €4,000 for the workstation, versus €30,000-80,000 for a fine-tuning server with H100s, versus €0 upfront but with your data continuously flowing to OpenAI/Anthropic.

Cooperatives, SMEs, and startups

Small teams wanting a specialized assistant for their product, technical support, or contracts. The DGX Spark pays for itself in a few months if your workload is continuous and data is sensitive.

Researchers and developers

If you work with Iberian LLMs, ALIA-40B in NVFP4 is probably the best available starting point in May 2026 for experimenting with multilingual fine-tuning on reasonable hardware.

Why Montevive

We've spent years working on applied AI with a focus on privacy, sovereignty, and efficiency. We don't sell hype or massive cloud models: we build the missing tools so truly useful AI works on your own infrastructure.

This ALIA quantization fits the same line as our recent work: making locally accessible what the market assumes has to live in the cloud.

Want to adapt ALIA to your organization?

If you work in a public administration, company, or cooperative that wants to use ALIA seriously—adapted to your domain, integrated into your systems, running on your infrastructure—we can help you do it:

  • Domain fine-tuning: we adapt ALIA to your terminology (legal, healthcare, technical, sector-specific) using LoRA/QLoRA on DGX Spark or Blackwell servers
  • Deployment consulting: we evaluate your use case, size hardware, design fine-tuning + inference architecture
  • Installation and configuration: NVIDIA DGX Spark, Blackwell servers, or private cloud infrastructure
  • Integration with your systems: APIs, RAG over internal documents, specialized agents, continuous evaluation

📧 Contact us: info@montevive.ai

🌐 More information: montevive.ai

ALIA belongs to everyone. Making it accessible to you—adaptable, executable, sovereign—is what we're building.


🔧 For the technical folks: how we did it

If you're interested in the engineering side of the story—what bug we found, how we diagnosed it, what numbers prove the patch works, and what inference performance our DGX Spark measures—keep reading. If not, thanks for making it this far.

The bug: when quantization produces random words

We applied the standard flow to produce the GGUF (the file format from llama.cpp):

  1. Quantize the BF16 model to NVFP4 with NVIDIA ModelOpt
  2. Convert the result to GGUF with convert_hf_to_gguf.py
  3. Run with llama-cli on the DGX Spark

Result: for the prompt "The capital of France is", the model responded with random tokens mixed with chat markers: Certainlyrics|assistant|assistant|...

Curiously, the same model quantized to an older format (Q5_K_M) worked perfectly. The problem was specific to NVFP4. And the logs showed no errors: the conversion "succeeded", the model "loaded", inference "ran". Except the output was garbage.

The investigation

After several iterations (including writing an NVFP4 CUDA kernel by hand to rule out runtime issues), we found the source by reading the converter code line by line.

Llama family models (to which ALIA belongs) require a specific row permutation in the q_proj and k_proj matrices (the "query" and "key" projections of the attention mechanism) so GGML's RoPE convention matches Hugging Face's. That permutation is correctly applied in the standard BF16 path, inside the LlamaModel.modify_tensors function.

But the new NVFP4 path bypasses that function. It writes the quantized weights directly into the GGUF through _repack_nvfp4, without going through the permutation. Result: the attention heads come out shuffled, the model "works" technically but generates pure noise.
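
The permutation itself is compact. Here is a simplified sketch of what LlamaModel.permute in convert_hf_to_gguf.py does (the real function also handles grouped-query attention via a separate n_head_kv; this version assumes equal query and key head counts):

```python
import numpy as np

def permute_qk(weights: np.ndarray, n_head: int) -> np.ndarray:
    """Row permutation llama.cpp's converter applies to q_proj/k_proj so that
    GGML's interleaved RoPE layout matches Hugging Face's half-split layout.
    Simplified from LlamaModel.permute in convert_hf_to_gguf.py."""
    return (weights
            .reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
            .swapaxes(1, 2)
            .reshape(weights.shape))

# Toy weight matrix: 8 rows = 2 heads x 4 head dimensions, 3 input columns.
w = np.arange(24.0).reshape(8, 3)
print(permute_qk(w, n_head=2)[:, 0].tolist())  # rows reordered to 0,2,1,3, 4,6,5,7
```

It only reorders rows within each head, which is exactly why skipping it produces a model that loads cleanly but computes attention with shuffled head dimensions.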

This bug affects any Llama-architecture model quantized to NVFP4 with llama.cpp's standard converter: Llama 1/2/3, Mistral, Qwen2, ALIA, and all their derivatives. It is likely that others have unknowingly produced corrupt GGUFs, because the broken models still appear to load correctly.

The solution: 16 lines and an upstream contribution

The patch applies the same permutation to NVFP4 matrices inside _repack_nvfp4. It's 16 lines that intentionally duplicate the logic from LlamaModel.permute to maintain separation between ModelBase (parent class) and LlamaModel (child class).

To validate it, we reproduced the bug on a small model (TinyLlama-1.1B) and measured perplexity—the standard metric for detecting LLM degradation:

| Build | Perplexity (wikitext-2) |
|---|---|
| BF16 reference | 44.0 |
| NVFP4 GGUF (master, without patch) | 4419 (total garbage) |
| NVFP4 GGUF with our patch | 43.9 |

Perplexity goes from 4419 to 43.9—essentially matching the BF16 reference. The same is confirmed qualitatively with ALIA-40B: coherent responses in Spanish, Catalan, Basque, and Galician with the patch; garbage without it.
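
Perplexity is just the exponential of the mean per-token negative log-likelihood, which is why it detects this kind of corruption so reliably. A minimal sketch (the 3.8 nats/token figure is a round number chosen to land near the BF16 reference, not a measured value):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(nlls) / len(nlls))

# A healthy model: ~3.8 nats/token lands near the BF16 reference of 44.
print(round(perplexity([3.8] * 100), 1))          # 44.7
# Shuffled attention heads make predictions near-uniform over a large
# vocabulary, pushing NLL toward log(vocab): perplexity explodes.
print(round(perplexity([math.log(4419)] * 100)))  # 4419
```

A two-orders-of-magnitude jump like 44 to 4419 cannot come from quantization noise; it signals that the model's predictions carry essentially no information.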

We've sent the patch upstream to llama.cpp so the rest of the world inherits it automatically: ggml-org/llama.cpp#22611.

Inference performance on DGX Spark

While the primary use case for DGX Spark is development and fine-tuning, we did measure inference to validate model quality. On the DGX Spark (GB10 Blackwell, 128 GB unified memory, aarch64), greedy generation with Spanish prompts:

| Backend | Format | Generation |
|---|---|---|
| llama.cpp (GPU) | NVFP4 | 10.2 tok/s |
| llama.cpp (GPU) | Q8_0 (reference) | 5.4 tok/s |
| vLLM (GPU) | NVFP4 | 8.7 tok/s |
| llama.cpp (CPU, ARM NEON) | Q8_0 | 2.7 tok/s |

NVFP4 on Blackwell is ~1.9× faster in inference than Q8_0, and occupies 33% less on disk and memory. The number we care most about, however, is the 90+ GB free that NVFP4 leaves on a DGX Spark—space for gradients, activations, optimizer states, and context during QLoRA fine-tuning.
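
The ratios above reduce to simple arithmetic. Note that the Q8_0 file size used below is our own rough assumption consistent with the stated 33% figure, not a measurement reported in the tables:

```python
# Throughput and size ratios from the benchmark numbers above.
print(round(10.2 / 5.4, 2))                  # ~1.9x: NVFP4 vs Q8_0 generation speed

# Assumed Q8_0 size (~40.5 GB for a 40B model); not a measured figure.
nvfp4_gb, q8_gb = 27, 40.5
print(round(100 * (1 - nvfp4_gb / q8_gb)))   # ~33% smaller on disk and in memory
print(128 - nvfp4_gb)                        # 101 GB of unified memory left free
```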

If you want to dive into absolute detail—the commit, the design rationale, the TinyLlama reproducer—everything is in the upstream PR #22611 and in the README of the Hugging Face repo.