All Posts
Models
Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)
Pleias trained a 600-million-parameter specialized model for RATP to detect and interpret safety signals in Paris subway users' messages. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP's sovereign infrastructure on Scaleway.
Open Source
Pleias and NVIDIA release Nemotron-Personas-Belgium
Today at VivaTech, Pleias, in collaboration with NVIDIA, is releasing Nemotron-Personas-Belgium, a statistically grounded synthetic persona dataset covering the Belgian population at the level of regions, language communities, and communes. It is the second European dataset in the Nemotron Personas series, following Nemotron-Personas-France, announced in March 2026.
Models
EU AI: the fables we told ourselves
The most powerful models Europe lost access to this year happen to be called the Fable series. It is the kind of coincidence you cannot improve on, because the suspension did not create a European vulnerability so much as expose a fable Europe had been telling itself for the better part of the post-ChatGPT boom: that it did not need to build the substrate of artificial intelligence, only to use it well. Own the application layer, the story went, and let others burn the capital underneath. When the layer underneath was switched off from Washington, the story switched off with it.
training data
Synth Beta: Frontier Data Efficiency
AI models often lack specialized knowledge because training data over-represents common online content. Synth is a tool that builds better training data from your own documents to fix this.
models
CommonLingua
CommonLingua is a small, open-source model that identifies which of 334 languages a text is written in. It is designed to improve multilingual data pipelines, especially for underrepresented languages like those in Africa.
training data
What the Community Built with SYNTH
Three independent research papers used SYNTH as a training foundation, all confirming it works reliably for reasoning tasks. Community projects show synthetic training data can match or beat larger models at a fraction of the cost.
training data
French-Science-Commons: The Largest Open Corpus for Sciences in French
Pleias released French-Science-Commons, the largest open corpus of French scientific documents, containing over 1.2 million texts. It aims to make francophone research more discoverable in search engines and AI systems.
training data
Pleias and NVIDIA release Nemotron-Personas-France
Pleias and NVIDIA released a dataset of realistic French synthetic personas to help train AI without real sensitive data. It helps regulated industries like healthcare and banking simulate documents without privacy or legal barriers.
training data
Common Corpus Goes Global
Common Corpus expanded from a Europe/US-focused dataset to a global one, now over 2.2 trillion tokens with 53% from non-Western countries. The update adds significant content from China, Japan, Korea, Brazil, India, Africa, and Southeast Asia.
training data
Synthetic pretraining, or Why You Should Plan For Greatness
AI companies are now training models on large amounts of artificially generated data, not just scraped web content. This shift means data design has become a core part of how AI models are built.
use case
Building Offline-First AI for Community Health Workers in West Africa
Researchers built an AI medical assistant that works without internet for health workers in rural West Africa. It runs on old Android phones and handles local languages to give workers quick treatment guidance.
models
SpineDAO & Pleias Partnership In Specialised AI For Spine Care
Pleias and SpineDAO are partnering to build AI systems that safely scale expert spine care for back pain, the world's leading cause of disability. The project tests whether small, structured-reasoning models can outperform large generic LLMs in high-stakes clinical settings.
RAG
Meet Stratum: an AI-native data layer that speeds up agentic AI
Pleias launched Stratum, a data layer that converts messy enterprise content into clean, agent-ready datasets. It handles document processing, privacy protection, and smart indexing for AI workflows.
use case
Stratum: Dealing with Personal Identifiable Information
Pleias built a tool that automatically finds and redacts personal information from messy, real-world documents. It handles handwriting, scans, and multiple languages in a single pass.
RAG
Fully Local Specialised SLMs for Supporting Organisations Fighting Against Conflict-Related Sexual Violence
Pleias built an offline AI assistant to give survivors of conflict-related sexual violence quick access to legal information. It uses small local models to answer plain-language questions from verified legal guides, without needing internet.
training data
SYNTH, a practical pipeline for synthetic data in the e-commerce
Pleias released SYNTH, a large synthetic dataset built from 50,000+ open web pages to fix messy e-commerce AI data. They also launched two small language models designed for fast, reliable retrieval in real business applications.
training data
SYNTH: the new data frontier
SYNTH is a synthetic dataset built from Wikipedia to teach AI models reasoning skills more efficiently than standard web data. Two small models trained on it achieved top benchmark results using 10-50x less data than comparable models.
models
Actual LLM agents are coming
True LLM agents can independently plan, remember, and act across long tasks without predefined paths. Recent models like OpenAI's Deep Research and Claude 3.7 Sonnet show this is now possible, unlike older workflow-based systems.
models
The Model Is the Product
AI models are becoming the actual product, not just the technology powering other apps. Specialized training and falling costs are pushing companies to build complete solutions, not sell raw AI access.
open source
What is open source AI?
Open source AI ranges from fully open models with training data to "open weights" models that only share parameters. Commercial labs often claim to be "open source" while hiding training data details, unlike truly open models from groups like EleutherAI or Allen AI.
models
Seek No More: European Answer to the Global Problem
Pleias trained competitive AI models cheaply using only open, rights-clear data with a tiny team. Their small models surprisingly excel at answering questions and citing sources across European languages.
models
Train Green, Train Strong
OpenAI's new inference scaling approach uses far more compute, raising serious energy and CO2 concerns. The AI industry's lack of transparency makes it hard to measure or debate the true environmental cost.
RAG
Reasoning for Retrieval Augmented Generation
Pleias used attention scores from a small multilingual model to verify which sources a RAG system actually relies on during generation. This internal signal proved more accurate than the model's own text, catching citation errors the output itself missed.
models
Will tokenizers disappear?
Meta's BLT paper proposes replacing tokenizers with entropy-based "patches" that allocate more compute to surprising text. It matches Llama's performance and excels on noisy text, but tokenizers aren't truly gone, just redesigned.
models
Mid-Training Is All You Need?
"Mid-training" is a new AI term for extra training done between initial model building and final fine-tuning. Companies like OpenAI use it to add capabilities to models without starting over.