We build the data layer that makes your AI outperform

Partners & Ecosystem

Nvidia
Mozilla
AI Alliance
Wikimedia Foundation
thws
Scaleway

Our Products

Synth

Synthetic Data for AI agents

We simulate expert-level reasoning and domain-specific processes to generate training data that builds true specialization into your models. Synth handles the cold start problem, covers the long tail of edge cases through engineered simulations, and runs entirely on-premise to keep sensitive data under your control.

Learn more

Common Corpus

Fully Open Data for AI

We curate and structure the world's largest rights-cleared and provenance-based dataset for LLMs - government records, legal archives, scientific literature, multilingual sources - so you can plug it directly into your models, RAG pipelines, and MCP servers.

Learn more

Stratum

AI-Native Tooling for Agents

Turn your messy, siloed documents into a single, structured, compliant data asset for your agentic AI workflows. Your AI systems and agents get not only the right information but also rich, trustworthy context, which means higher accuracy across a wider range of processes. A built-in privacy firewall handles PII before anything leaves the secure zone. Deployable fully on-premise.

Learn more

Recent Blogs

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Pleias trained a 600-million-parameter specialized model for RATP to detect and interpret safety signals in Parisian subway users’ messages. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP’s sovereign infrastructure on Scaleway.

Pleias
Read post

Pleias and NVIDIA release Nemotron-Personas-Belgium

Today at VivaTech, Pleias, in collaboration with NVIDIA, is releasing Nemotron-Personas-Belgium, a statistically grounded synthetic persona dataset covering the Belgian population at the level of regions, language communities, and communes. It is the second European dataset in the Nemotron Personas series, following Nemotron-Personas-France, announced in March 2026.

Pleias
Read post

EU AI: the fables we told ourselves

The most powerful models Europe lost access to this year happen to be called the Fable series. It is the kind of coincidence you cannot improve on, because the suspension did not create a European vulnerability so much as expose a fable Europe had been telling itself for the better part of the post-ChatGPT boom: that it did not need to build the substrate of artificial intelligence, only to use it well. Own the application layer, the story went, and let others burn the capital underneath. When the layer underneath was switched off from Washington, the story switched off with it.

Anastasia Stasenko & Pierre-Carl Langlais
Read post
See More Blogs

Use Cases

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Pleias trained a 600-million-parameter specialized model for RATP to detect and interpret safety signals in Parisian subway users’ messages. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP’s sovereign infrastructure on Scaleway.

Read case study

Building Offline-First AI for Community Health Workers in West Africa

Researchers built an AI medical assistant that works without internet for health workers in rural West Africa. It runs on old Android phones and handles local languages to give workers quick treatment guidance.

Read case study

SpineDAO & Pleias Partnership In Specialised AI For Spine Care

Pleias and SpineDAO are partnering to build AI systems that safely scale expert spine care for back pain, the world's leading cause of disability. The project tests whether small, structured-reasoning models can outperform large generic LLMs in high-stakes clinical settings.

Read case study
More Use Cases

Research Highlights

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training.

ICLR Oral
Training Data
Open Source

Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts.

Evaluation

Model in Distress: Sentiment Analysis on French Synthetic Social Media

Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.

ACL Findings
Models
Explore All Research
For Enterprise Teams
The data layer behind models that beat larger ones at a fraction of the cost - built from your own knowledge, kept compliant, and deployable at enterprise scale.
Book a demo
For Researchers & Developers
Build with our open datasets and highly efficient data tooling - trained transparently, documented end to end, and yours to fine-tune, audit, and ship however you want.
Explore on HuggingFace