👋 Welcome to Occiglot!

Occiglot is an ongoing research collective for open-source language models for and by Europe. We strongly believe in transparent research and exchange of ideas. If you are working on topics relevant to European LLMs or seek to contribute to Occiglot, please contact us or join our Discord server. We are actively seeking collaborations!

Tokenizer Evaluation on European Languages

Intro The tokenizer is a vital component of any LLM, encoding text into sequences of tokens drawn from a pre-defined vocabulary. However, the tokenizer is built separately from the LLM itself and undergoes a separate training phase with its own training data. Consequently, the tokenizers of most commercial models are heavily optimized for English text and perform unevenly on non-English languages. Since Occiglot is building LLMs for non-English languages based on existing models and tokenizers, we need to gain a thorough understanding of their inherent performance on the languages we aim to support....

A polyglot language model for the Occident.

Announcing Occiglot: Polyglot Language Models for the Occident

Mission Statement Recent advancements in transformer-based language models have demonstrated the potentially disruptive impact of this technology. Unfortunately, the high cost and specialized skill sets required to train Large Language Models (LLMs) leave the field dominated by a handful of big tech companies and deep tech startups, making core European values such as linguistic diversity, multilingualism, and cultural richness an afterthought of economically driven decisions. Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty....

Technical Report

TBA