Blogs, News & Use Cases

Blogs, partnership announcements, product updates, event recaps, press coverage, and customer stories

Pleias and Nvidia release Nemotron-Personas-France
Featured
training data
Mar 17, 2026

Pleias and Nvidia released a dataset of realistic French synthetic personas to help train AI without real sensitive data. It helps regulated industries like healthcare and banking simulate documents without privacy or legal barriers.

Pleias • 4 min read

All Posts

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Pleias trained a specialized 600-million-parameter model for RATP to detect and interpret safety signals in messages from Paris subway users. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP’s sovereign infrastructure on Scaleway.

Pleias • Jun 17, 2026 • 4 min read
Models, use case

Pleias and NVIDIA release Nemotron-Personas-Belgium

Today at VivaTech, Pleias, in collaboration with NVIDIA, is releasing Nemotron-Personas-Belgium, a statistically grounded synthetic persona dataset covering the Belgian population at the level of regions, language communities, and communes. It is the second European dataset in the Nemotron Personas series, following Nemotron-Personas-France, announced in March 2026.

Pleias • Jun 17, 2026 • 4 min read
Open Source

EU AI: the fables we told ourselves

The most powerful models Europe lost access to this year happen to be called the Fable series. It is the kind of coincidence you cannot improve on, because the suspension did not create a European vulnerability so much as expose a fable Europe had been telling itself for the better part of the post-ChatGPT boom: that it did not need to build the substrate of artificial intelligence, only to use it well. Own the application layer, the story went, and let others burn the capital underneath. When the layer underneath was switched off from Washington, the story switched off with it.

Anastasia Stasenko & Pierre-Carl Langlais • Jun 14, 2026 • 13 min read
Models

Synth Beta: Frontier Data Efficiency

AI models often lack specialized knowledge because training data over-represents common online content. Synth is a tool that builds better training data from your own documents to fix this.

Pleias • Apr 29, 2026 • 2 min read
training data, models, evaluation

CommonLingua

CommonLingua is a small, open-source model that identifies which of 334 languages a text is written in. It is designed to improve multilingual data pipelines, especially for underrepresented languages like those in Africa.

Pleias • Apr 28, 2026 • 5 min read
models, open source, benchmarks

What the Community Built with SYNTH

Three independent research papers used SYNTH as a training foundation, all confirming it works reliably for reasoning tasks. Community projects show synthetic training data can match or beat larger models at a fraction of the cost.

Pleias • Apr 02, 2026 • 5 min read
training data, open source, benchmarks

French-Science-Commons: The Largest Open Corpus for Sciences in French

Pleias released French-Science-Commons, the largest open corpus of French scientific documents, containing over 1.2 million texts. It aims to make francophone research more discoverable in search engines and AI systems.

Pleias • Mar 19, 2026 • 3 min read
training data, open source

Pleias and Nvidia release Nemotron-Personas-France

Pleias and Nvidia released a dataset of realistic French synthetic personas to help train AI without real sensitive data. It helps regulated industries like healthcare and banking simulate documents without privacy or legal barriers.

Pleias • Mar 17, 2026 • 4 min read
training data, models, open source

Common Corpus Goes Global

Common Corpus expanded from a Europe/US-focused dataset to a global one, now over 2.2 trillion tokens with 53% from non-Western countries. The update adds significant content from China, Japan, Korea, Brazil, India, Africa, and Southeast Asia.

Pleias • Feb 19, 2026 • 4 min read
training data, open source

Synthetic pretraining, or Why You Should Plan For Greatness

AI companies are now training models on large amounts of artificially generated data, not just scraped web content. This shift means data design has become a core part of how AI models are built.

Pleias • Feb 02, 2026 • 20 min read
training data, models

Building Offline-First AI for Community Health Workers in West Africa

Researchers built an AI medical assistant that works without internet for health workers in rural West Africa. It runs on old Android phones and handles local languages to give workers quick treatment guidance.

Pleias • Jan 22, 2026 • 4 min read
use case, models

SpineDAO & Pleias Partnership In Specialised AI For Spine Care

Pleias and SpineDAO are partnering to build AI systems that safely scale expert spine care for back pain, the world's leading cause of disability. The project tests whether small, structured-reasoning models can outperform large generic LLMs in high-stakes clinical settings.

Pleias • Dec 08, 2025 • 4 min read
models, use case

Meet Stratum: an AI-native data layer that speeds up agentic AI

Pleias launched Stratum, a data layer that converts messy enterprise content into clean, agent-ready datasets. It handles document processing, privacy protection, and smart indexing for AI workflows.

Pleias • Nov 27, 2025 • 3 min read
RAG, training data

Stratum: Dealing with Personal Identifiable Information

Pleias built a tool that automatically finds and redacts personal information from messy, real-world documents. It handles handwriting, scans, and multiple languages in a single pass.

Pleias • Nov 14, 2025 • 7 min read
use case, models

Fully Local Specialised SLMs for Supporting Organisations Fighting Against Conflict-Related Sexual Violence

Pleias built an offline AI assistant to give survivors of conflict-related sexual violence quick access to legal information. It uses small local models to answer plain-language questions from verified legal guides, without needing internet.

Pleias • Nov 13, 2025 • 2 min read
RAG, use case, models

SYNTH, a practical pipeline for synthetic data in the e-commerce

Pleias released SYNTH, a large synthetic dataset built from 50,000+ open web pages to fix messy e-commerce AI data. They also launched two small language models designed for fast, reliable retrieval in real business applications.

Pleias • Nov 12, 2025 • 4 min read
training data, models, use case

SYNTH: the new data frontier

SYNTH is a synthetic dataset built from Wikipedia to teach AI models reasoning skills more efficiently than standard web data. Two small models trained on it achieved top benchmark results using 10-50x less data than comparable models.

Pleias • Nov 10, 2025 • 9 min read
training data, models, benchmarks

Actual LLM agents are coming

True LLM agents can independently plan, remember, and act across long tasks without predefined paths. Recent models like OpenAI's Deep Research and Claude 3.7 Sonnet show this is now possible, unlike older workflow-based systems.

Pleias • Mar 13, 2025 • 15 min read
models

The Model Is the Product

AI models are becoming the actual product, not just the technology powering other apps. Specialized training and falling costs are pushing companies to build complete solutions, not sell raw AI access.

Pleias • Mar 03, 2025 • 10 min read
models

What is open source AI?

Open source AI ranges from fully open models with training data to "open weights" models that only share parameters. Commercial labs often claim to be "open source" while hiding training data details, unlike truly open models from groups like EleutherAI or Allen AI.

Pleias • Feb 13, 2025 • 4 min read
open source, models, training data

Seek No More: European Answer to the Global Problem

Pleias trained competitive AI models cheaply using only open, rights-clear data with a tiny team. Their small models surprisingly excel at answering questions and citing sources across European languages.

Pleias • Jan 28, 2025 • 2 min read
models, training data, open source

Train Green, Train Strong

OpenAI's new inference scaling approach uses far more compute, raising serious energy and CO2 concerns. The AI industry's lack of transparency makes it hard to measure or debate the true environmental cost.

Pleias • Jan 21, 2025 • 2 min read
models

Reasoning for Retrieval Augmented Generation

Pleias used attention scores from a small multilingual model to verify which sources a RAG system actually relies on during generation. This internal signal proved more accurate than the model's own text, catching citation errors the output itself missed.

Pleias • Jan 02, 2024 • 1 min read
RAG, models

Will tokenizers disappear?

Meta's BLT paper proposes replacing tokenizers with entropy-based "patches" that allocate more compute to surprising text. It matches Llama's performance and excels on noisy text, but tokenizers aren't truly gone — just redesigned.

Pleias • Dec 28, 2023 • 2 min read
models, benchmarks

Mid-Training Is All You Need?

"Mid-training" is a new AI term for extra training done between initial model building and final fine-tuning. Companies like OpenAI use it to add capabilities to models without starting over.

Pleias • Dec 26, 2023 • 15 min read
models, training data