Pleias: a family of fully open small AI language models
I haven’t had a chance to use it yet, but this is more like it! Local models that are not only lighter in terms of environmental impact, but are also trained on permissively-licensed data.
Training large language models required copyrighted data until it did not. Today we release Pleias 1.0 models, a family of fully open small language models. Pleias 1.0 includes three base models: 350M, 1.2B, and 3B parameters. It also features two specialized models for knowledge retrieval, Pleias-Pico (350M parameters) and Pleias-Nano (1.2B parameters), with unprecedented performance for their size on multilingual Retrieval-Augmented Generation.
These represent the first ever models trained exclusively on open data, meaning data that are either non-copyrighted or published under a permissive license. These are the first fully EU AI Act compliant models. In fact, Pleias sets a new standard for safety and openness.
Our models are:
- multilingual, offering strong support for multiple European languages
- safe, showing the lowest scores on toxicity benchmarks
- performant for key tasks, such as knowledge retrieval
- able to run efficiently on consumer-grade hardware locally (CPU-only, without quantisation)
[…]
We are moving away from the standard format of web archives. Instead, we use Common Corpus, our new dataset composed of uncopyrighted and permissively licensed data. To create this dataset, we had to develop an extensive range of tools to collect, generate, and process pretraining data.
Source: Hugging Face
Image: David Man & Tristan Ferne
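If you want to try the “CPU-only, without quantisation” claim yourself, here is a minimal sketch of loading one of the Pleias models locally with Hugging Face transformers. The repository id `PleIAs/Pleias-Pico` is my assumption about how the 350M model is published on the Hub; check the Pleias collection for the exact identifiers before running it.

```python
# Minimal sketch: running a small Pleias model locally on CPU via Hugging Face transformers.
# The repo id below is an assumption; substitute the actual identifier from the Pleias collection.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-Pico"  # assumed Hub id for the 350M retrieval-oriented model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # full precision, no quantisation

prompt = "What is Common Corpus?"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding of a short continuation; a 350M-parameter model is small enough
# that this should run in reasonable time on a laptop CPU.
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

On my reading of the announcement, the appeal is exactly this: no GPU, no quantisation step, just a small model pulled from the Hub and run as-is.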