Pleias

Our Products

Synth

Synthetic Data for AI agents

We simulate expert-level reasoning and domain-specific processes to generate training data that builds true specialization into your models. Synth handles the cold start problem, covers the long tail of edge cases through engineered simulations, and runs entirely on-premise to keep sensitive data under your control.

Learn more

Common Corpus

Fully Open Data for AI

We curate and structure the world's largest rights-cleared and provenance-based dataset for LLMs: government records, legal archives, scientific literature, and multilingual sources. You can plug it directly into your models, RAG pipelines, and MCP servers.

Learn more

Stratum

AI-Native Tooling for Agents

Turn your messy, siloed documents into a single, structured, compliant data asset for your agentic AI workflows. Your AI systems and agents get not only the right information but also rich, trustworthy context, which means higher accuracy across a wider range of processes. A built-in privacy firewall handles PII before anything leaves the secure zone. Deployable fully on-premise.

Learn more

Recent Blogs

Blog

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Pleias trained a 600-million-parameter specialized model for RATP to detect and interpret safety signals in Paris subway users’ messages. It was built with a fully synthetic training pipeline and designed for on-premise deployment. After only three months of development, the model, which beats closed models 200 times its size, is now in production on RATP’s sovereign infrastructure on Scaleway.

Pleias

Read post

Blog

Pleias and NVIDIA release Nemotron-Personas-Belgium

Today at VivaTech, Pleias, in collaboration with NVIDIA, is releasing Nemotron-Personas-Belgium, a statistically grounded synthetic persona dataset covering the Belgian population at the level of regions, language communities, and communes. It is the second European dataset in the Nemotron Personas series, following Nemotron-Personas-France, announced in March 2026.

Pleias

Read post

Blog

EU AI: the fables we told ourselves

The most powerful models Europe lost access to this year happen to be called the Fable series. It is the kind of coincidence you cannot improve on, because the suspension did not create a European vulnerability so much as expose a fable Europe had been telling itself for the better part of the post-ChatGPT boom: that it did not need to build the substrate of artificial intelligence, only to use it well. Own the application layer, the story went, and let others burn the capital underneath. When the layer underneath was switched off from Washington, the story switched off with it.

Anastasia Stasenko & Pierre-Carl Langlais

Read post

See More Blogs

Use Cases

Use Case

Sillon: A Specialised 600M Model for Parisian Subway Operator (RATP)

Read case study

Use Case

Building Offline-First AI for Community Health Workers in West Africa

Researchers built an AI medical assistant that works without internet for health workers in rural West Africa. It runs on old Android phones and handles local languages to give workers quick treatment guidance.

Read case study

Use Case

SpineDAO & Pleias Partnership In Specialised AI For Spine Care

Pleias and SpineDAO are partnering to build AI systems that safely scale expert spine care for back pain, the world's leading cause of disability. The project tests whether small, structured-reasoning models can outperform large generic LLMs in high-stakes clinical settings.

Read case study

More Use Cases

Research Highlights

ICLR Oral

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Large Language Models (LLMs) are pre-trained on large amounts of data from different sources and domains. Such datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models. This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training.

Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation

The closure of Perspective API at the end of 2026 discards what has functioned as the de facto standard for automated toxicity measurement in NLP, CSS, and LLM evaluation research. We document the structural dependence that the communities built on this single proprietary tool and discuss how this dependence caused epistemic problems that have affected - and will likely continue to affect - collective research efforts.

Evaluation

ACL Findings

Model in Distress: Sentiment Analysis on French Synthetic Social Media

Automated analysis of customer feedback on social media is hindered by three challenges: the high cost of annotated training data, the scarcity of evaluation sets, especially in multilingual settings, and privacy concerns that prevent data sharing and reproducibility. We address these issues by developing a generalizable synthetic data generation pipeline applied to a case study on customer distress detection in French public transportation. Our approach utilizes backtranslation with fine-tuned models to generate 1.7 million synthetic tweets from a small seed corpus, complemented by synthetic reasoning traces. We train 600M-parameter reasoners with English and French reasoning that achieve 77-79% accuracy on human-annotated evaluation data, matching or exceeding SOTA proprietary LLMs and specialized encoders. Beyond reducing annotation costs, our pipeline preserves privacy by eliminating the exposure of sensitive user data. Our methodology can be adopted for other use cases and languages.

ACL Findings

Models

Explore All Research

For Enterprise Teams

The data layer behind models that beat larger ones at a fraction of the cost - built from your own knowledge, kept compliant, and deployable at enterprise scale.

Book a demo

For Researchers & Developers

Build with our open datasets and highly efficient data tooling - trained transparently, documented end to end, and yours to fine-tune, audit, and ship however you want.

Explore on HuggingFace

We build the data layer that makes your AI outperform

