Pleias

All Posts

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Pleias trained a 600-million-parameter specialized model for RATP to detect and interpret safety signals in messages from Paris subway users. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP’s sovereign infrastructure on Scaleway.

Pleias • Jun 17, 2026 • 4 min read

Models, use case

Open Source

Pleias and NVIDIA release Nemotron-Personas-Belgium

Today at VivaTech, Pleias, in collaboration with NVIDIA, is releasing Nemotron-Personas-Belgium, a statistically grounded synthetic persona dataset covering the Belgian population at the level of regions, language communities, and communes. It is the second European dataset in the Nemotron Personas series, following Nemotron-Personas-France, announced in March 2026.

Pleias • Jun 17, 2026 • 4 min read

Open Source

Models

EU AI: the fables we told ourselves

The most powerful models Europe lost access to this year happen to be called the Fable series. It is the kind of coincidence you cannot improve on, because the suspension did not create a European vulnerability so much as expose a fable Europe had been telling itself for the better part of the post-ChatGPT boom: that it did not need to build the substrate of artificial intelligence, only to use it well. Own the application layer, the story went, and let others burn the capital underneath. When the layer underneath was switched off from Washington, the story switched off with it.

Anastasia Stasenko & Pierre-Carl Langlais • Jun 14, 2026 • 13 min read

Models

training data

Synth Beta: Frontier Data Efficiency

AI models often lack specialized knowledge because training data over-represents common online content. Synth is a tool that builds better training data from your own documents to fix this.

Pleias • Apr 29, 2026 • 2 min read

training data, models, evaluation

models

CommonLingua

CommonLingua is a small, open-source model that identifies which of 334 languages a text is written in. It is designed to improve multilingual data pipelines, especially for underrepresented languages such as those spoken in Africa.

Pleias • Apr 28, 2026 • 5 min read

models, open source, benchmarks

training data

What the Community Built with SYNTH

Three independent research papers used SYNTH as a training foundation, all confirming it works reliably for reasoning tasks. Community projects show synthetic training data can match or beat larger models at a fraction of the cost.

Pleias • Apr 02, 2026 • 5 min read

training data, open source, benchmarks

training data

French-Science-Commons: The Largest Open Corpus for Sciences in French

Pleias released French-Science-Commons, the largest open corpus of French scientific documents, containing over 1.2 million texts. It aims to make francophone research more discoverable in search engines and AI systems.

Pleias • Mar 19, 2026 • 3 min read

training data, open source

training data

Pleias and Nvidia release Nemotron-Personas-France

Pleias and Nvidia released a dataset of realistic French synthetic personas to help train AI without real sensitive data. It helps regulated industries like healthcare and banking simulate documents without privacy or legal barriers.

Pleias • Mar 17, 2026 • 4 min read

training data, models, open source

training data

Common Corpus Goes Global

Common Corpus expanded from a Europe/US-focused dataset to a global one, now over 2.2 trillion tokens with 53% from non-Western countries. The update adds significant content from China, Japan, Korea, Brazil, India, Africa, and Southeast Asia.

Pleias • Feb 19, 2026 • 4 min read

training data, open source

training data

Synthetic pretraining, or Why You Should Plan For Greatness

AI companies are now training models on large amounts of artificially generated data, not just scraped web content. This shift means data design has become a core part of how AI models are built.

Pleias • Feb 02, 2026 • 20 min read

training data, models

use case

Building Offline-First AI for Community Health Workers in West Africa

Researchers built an AI medical assistant that works without internet for health workers in rural West Africa. It runs on old Android phones and handles local languages to give workers quick treatment guidance.

Pleias • Jan 22, 2026 • 4 min read

use case, models

models

SpineDAO & Pleias Partnership In Specialised AI For Spine Care

Pleias and SpineDAO are partnering to build AI systems that safely scale expert spine care for back pain, the world's leading cause of disability. The project tests whether small, structured-reasoning models can outperform large generic LLMs in high-stakes clinical settings.

Pleias • Dec 08, 2025 • 4 min read

models, use case

RAG

Meet Stratum: an AI-native data layer that speeds up agentic AI

Pleias launched Stratum, a data layer that converts messy enterprise content into clean, agent-ready datasets. It handles document processing, privacy protection, and smart indexing for AI workflows.

Pleias • Nov 27, 2025 • 3 min read

RAG, training data

use case

Stratum: Dealing with Personal Identifiable Information

Pleias built a tool that automatically finds and redacts personal information from messy, real-world documents. It handles handwriting, scans, and multiple languages in a single pass.

Pleias • Nov 14, 2025 • 7 min read

use case, models

RAG

Fully Local Specialised SLMs for Supporting Organisations Fighting Against Conflict-Related Sexual Violence

Pleias built an offline AI assistant to give survivors of conflict-related sexual violence quick access to legal information. It uses small local models to answer plain-language questions from verified legal guides, without needing internet.

Pleias • Nov 13, 2025 • 2 min read

RAG, use case, models

training data

SYNTH, a practical pipeline for synthetic data in the e-commerce

Pleias released SYNTH, a large synthetic dataset built from 50,000+ open web pages to fix messy e-commerce AI data. They also launched two small language models designed for fast, reliable retrieval in real business applications.

Pleias • Nov 12, 2025 • 4 min read

training data, models, use case

training data

SYNTH: the new data frontier

SYNTH is a synthetic dataset built from Wikipedia to teach AI models reasoning skills more efficiently than standard web data. Two small models trained on it achieved top benchmark results using 10-50x less data than comparable models.

Pleias • Nov 10, 2025 • 9 min read

training data, models, benchmarks

models

Actual LLM agents are coming

True LLM agents can independently plan, remember, and act across long tasks without predefined paths. Recent models like OpenAI's Deep Research and Claude 3.7 Sonnet show this is now possible, unlike older workflow-based systems.

Pleias • Mar 13, 2025 • 15 min read

models

The Model Is the Product

AI models are becoming the actual product, not just the technology powering other apps. Specialized training and falling costs are pushing companies to build complete solutions, not sell raw AI access.

Pleias • Mar 03, 2025 • 10 min read

models

open source

What is open source AI?

Open source AI ranges from fully open models with training data to "open weights" models that only share parameters. Commercial labs often claim to be "open source" while hiding training data details, unlike truly open models from groups like EleutherAI or Allen AI.

Pleias • Feb 13, 2025 • 4 min read

open source, models, training data

models

Seek No More: European Answer to the Global Problem

Pleias trained competitive AI models cheaply using only open, rights-clear data with a tiny team. Their small models surprisingly excel at answering questions and citing sources across European languages.

Pleias • Jan 28, 2025 • 2 min read

models, training data, open source

models

Train Green, Train Strong

OpenAI's new inference scaling approach uses far more compute, raising serious energy and CO2 concerns. The AI industry's lack of transparency makes it hard to measure or debate the true environmental cost.

Pleias • Jan 21, 2025 • 2 min read

models

RAG

Reasoning for Retrieval Augmented Generation

Pleias used attention scores from a small multilingual model to verify which sources a RAG system actually relies on during generation. This internal signal proved more accurate than the model's own text, catching citation errors the output itself missed.

Pleias • Jan 02, 2024 • 1 min read

RAG, models

models

Will tokenizers disappear?

Meta's BLT paper proposes replacing tokenizers with entropy-based "patches" that allocate more compute to surprising text. It matches Llama's performance and excels on noisy text, but tokenizers aren't truly gone — just redesigned.

Pleias • Dec 28, 2023 • 2 min read

models, benchmarks

models

Mid-Training Is All You Need?

"Mid-training" is a new AI term for extra training done between initial model building and final fine-tuning. Companies like OpenAI use it to add capabilities to models without starting over.

Pleias • Dec 26, 2023 • 15 min read

models, training data
