Mozhi (மொழி) is a small, from-scratch language model for Tamil — and an open commons to grow it. The model learns from one thing: original Tamil, written by people who know it. Start a contributor profile, add your Tamil, and watch your words become part of the corpus. Built in the open by Kriyaetive — small enough to run on hardware you own.
Tamil deserves models built for Tamil. English-first models shred Tamil into 3–8 tokens per word — a tax on cost, speed, and context. Mozhi builds the representation around Tamil's own unit, the அக்ஷரம் (akshara), making models that are cheaper, faster, and small enough to run on a phone.
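To make the akshara idea concrete, here is a minimal sketch of akshara-style segmentation in Python. It is an illustration under stated assumptions, not Mozhi's actual tokenizer: the function name `split_aksharas` is hypothetical, and the rule used (attach every Unicode combining mark to the base character before it) is a simplification that treats conjuncts such as க்ஷ as two units.

```python
import unicodedata

def split_aksharas(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Assumption: an akshara is approximated as one base character plus any
    trailing Unicode marks (vowel signs, pulli). Conjunct handling is omitted.
    """
    # NFC composes two-part vowel signs (e.g. U+0BC6 + U+0BBE) into one code point.
    text = unicodedata.normalize("NFC", text)
    units: list[str] = []
    for ch in text:
        # Tamil vowel signs and the pulli are in the Unicode mark categories Mn/Mc.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

word = "தமிழ்"
print(len(word), "code points ->", split_aksharas(word))
```

For தமிழ் this groups five code points into three units (த, மி, ழ்), which is the kind of saving the tokenizer comparison on this page measures at corpus scale.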
Akshara tokenization cuts tokens per word by ~36% compared with character-level tokenization (365,637 vs. 571,134 tokens on the Thirukkural corpus), so the same context window reaches further.
A language model is a compressor. Each Tamil-aware step lowers its bits-per-character.
Small and efficient means private, offline, in-browser inference — no server required.
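A rough sense of what "small" means for on-device use can be had from back-of-envelope arithmetic on the 3.2M-parameter figure quoted on this page. The storage formats below are assumptions chosen for illustration; the page does not say which format Mozhi ships in, and the numbers cover weights only, not activations or runtime overhead.

```python
PARAMS = 3_200_000  # parameter count quoted on this page

# Bytes per weight for common storage formats (assumed, not Mozhi's stated format).
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}

def weight_megabytes(params: int, bytes_per_weight: int) -> float:
    """Size of the weights alone, in megabytes (10**6 bytes)."""
    return params * bytes_per_weight / 1e6

for fmt, width in BYTES_PER_WEIGHT.items():
    print(f"{fmt}: {weight_megabytes(PARAMS, width):.1f} MB")
```

Under these assumptions the weights come to roughly 13 MB at 32-bit precision and a few megabytes when quantized, which is in the range of an ordinary web page download.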
Identical model and settings, three tokenizers, on the Thirukkural corpus. Lower bits / char is better — it's the model compressing Tamil more tightly.
| Tokenizer | Vocab | Tokens | chars / token | bits / char |
|---|---|---|---|---|
| char | 64 | 571,134 | 1.00 | 2.35 |
| akshara | 258 | 365,637 | 1.56 | 2.23 |
| BPE | 512 | 264,055 | 2.16 | 2.07 |
char → akshara → BPE · each step is more Tamil-aware and compresses tighter.
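The derived columns in the table can be reproduced from the token counts alone. The sketch below does that, and also shows one common way to convert a per-token loss into bits per character. The function names are hypothetical, and the conversion assumes the loss is a mean cross-entropy in nats per token, which the page does not state.

```python
import math

# Token counts from the table above (Thirukkural corpus).
TOKENS = {"char": 571_134, "akshara": 365_637, "BPE": 264_055}
TOTAL_CHARS = TOKENS["char"]  # the char tokenizer emits one token per character

def chars_per_token(total_chars: int, tokens: int) -> float:
    """Average number of characters covered by one token."""
    return total_chars / tokens

def token_reduction(baseline_tokens: int, tokens: int) -> float:
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1 - tokens / baseline_tokens

def bits_per_char(loss_nats_per_token: float, tokens: int, total_chars: int) -> float:
    """Convert a mean per-token cross-entropy (nats) to bits per character."""
    return loss_nats_per_token / math.log(2) * tokens / total_chars

for name, n in TOKENS.items():
    print(name,
          round(chars_per_token(TOTAL_CHARS, n), 2),
          f"{token_reduction(TOTAL_CHARS, n):.1%}")
```

Normalizing by characters rather than tokens is what makes the three rows comparable: a tokenizer with longer tokens has a higher loss per token, but it needs fewer tokens to cover the same text.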
Mozhi is early: today it's a 3.2M-parameter model at 2.07 bits per character (the BPE result on the Thirukkural corpus). It gets better the way every language model does, with more clean, original text, and for Mozhi that means Tamil. So instead of an anonymous box, you start a contributor profile: pick a handle, name your dialect, and add Tamil. Each piece is saved to your profile, counted toward the corpus, and released under CC0 so it can train open models for everyone. Then open your dashboard and watch your total grow.
Your profile lives on this device: no sign-up, no password. Everything you add is released under CC0 1.0 (public domain). First cohort: Tamil writers and poets.
Sequencing is the point: we grow the corpus before we grow the model.
Mozhi belongs to the people who write Tamil into it. Every contribution is CC0 — public domain — so the corpus and the model stay open for anyone to use, study, or rebuild. You keep credit: contributions are tied to your handle, your totals are yours to see, and the writers who feed the model are the ones who shape it. This isn't a social network. It's a shared workbench for Tamil, with just enough identity to see who built what.
Your handle, your dialect, your running total — original Tamil written in your own voice.
The text you add, the corpus it joins, the Apache-2.0 model it trains — all public, all in the open.
Characters added, passages written, your share of the corpus — on a dashboard that runs on your own device.