Mozhi · an open Tamil language model by Kriyaetive.ai

01 · Why

Indian languages deserve models built for them, and we start with Tamil. English-first models shred Tamil into 3 to 8 tokens per word, a tax on cost, speed, and context. Mozhi builds the representation around Tamil's own unit, the அக்ஷரம் (akshara), making models that are cheaper, faster, and small enough to run on a phone.

FIG 0.3

Fewer tokens

Akshara tokenization cuts the token count by ~36% versus character-level tokenization (571,134 → 365,637 tokens on the Thirukkural corpus), so the same context window reaches further.
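
As a rough illustration of what grouping text by akshara could look like, the sketch below attaches each Tamil vowel sign or pulli to the letter before it. The function name `split_aksharas` and the grouping rule are assumptions made for illustration; Mozhi's actual tokenizer is not described on this page and may differ, for example in how it treats conjuncts such as க்ஷ.

```python
# Hypothetical sketch, not Mozhi's actual tokenizer.
# Assumption: an akshara is a base letter plus any vowel sign or pulli after it.

# Tamil dependent signs: vowel signs and pulli (U+0BBE..U+0BCD), AU length mark (U+0BD7).
TAMIL_MARKS = set(range(0x0BBE, 0x0BCE)) | {0x0BD7}


def split_aksharas(text: str) -> list[str]:
    """Group each dependent sign with the character that precedes it."""
    units: list[str] = []
    for ch in text:
        if ord(ch) in TAMIL_MARKS and units:
            units[-1] += ch   # attach the sign to the previous unit
        else:
            units.append(ch)  # start a new unit
    return units


word = "தமிழ்"  # "Tamil": 5 code points
units = split_aksharas(word)
print(len(word), "code points ->", len(units), "units:", units)  # 3 units: த, மி, ழ்
```

A character-level tokenizer would spend five tokens on this word; grouping by akshara spends three.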

FIG 0.4

Better bits / char

A language model is a compressor. Each Tamil-aware step lowers its bits-per-character.
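
To make the metric concrete, here is a minimal sketch of how bits per character can be computed from the probabilities a model assigns to each token. The function name `bits_per_char` and the toy numbers are illustrative assumptions, not Mozhi's evaluation code.

```python
import math


def bits_per_char(token_probs: list[float], n_chars: int) -> float:
    """Total code length in bits, divided by the number of characters covered.

    token_probs holds the probability the model gave to each correct token.
    """
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / n_chars


# Toy example: 4 tokens covering 8 characters, each predicted with probability 0.25.
# Each token costs 2 bits, so 8 bits over 8 characters gives 1.0 bits per character.
print(bits_per_char([0.25] * 4, 8))
```

Dividing by characters rather than tokens is what lets tokenizers with different vocabularies be compared on the same scale.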

FIG 0.5

Runs on-device

Small and efficient means private, offline, in-browser inference, no server required.

02 · Results

Tamil-aware tokens compress better

Identical model and settings, three tokenizers, all on the Thirukkural corpus. Lower bits / char is better: it means the model compresses Tamil more tightly.

Tokenizer	Vocab	Tokens	chars / token	bits / char
char	64	571,134	1.00	2.35
akshara	258	365,637	1.56	2.23
BPE	512	264,055	2.16	2.07

char → akshara → BPE · each step is more Tamil-aware and compresses tighter.
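
The derived columns can be re-checked from the token counts alone. The sketch below assumes the char tokenizer emits exactly one token per character, so its token count doubles as the corpus length; the helper names are illustrative.

```python
# Figures copied from the results table: (vocab size, token count, bits per char).
RESULTS = {
    "char": (64, 571_134, 2.35),
    "akshara": (258, 365_637, 2.23),
    "BPE": (512, 264_055, 2.07),
}

# Assumption: one char token per character, so this is the corpus length.
N_CHARS = RESULTS["char"][1]


def chars_per_token(name: str) -> float:
    return N_CHARS / RESULTS[name][1]


def token_reduction(name: str) -> float:
    """Fraction of tokens saved relative to the char tokenizer."""
    return 1 - RESULTS[name][1] / N_CHARS


def bits_per_token(name: str) -> float:
    return RESULTS[name][2] * chars_per_token(name)


for name in RESULTS:
    print(f"{name:8s} chars/token={chars_per_token(name):.2f} "
          f"saved={token_reduction(name):.0%} bits/token={bits_per_token(name):.2f}")
```

Bits per token rise as tokens get longer, which is why the table compares tokenizers on bits per character instead.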

03 · Teach Mozhi · உங்கள் தமிழைக் கற்றுக்கொடுங்கள்

Teach Mozhi your Tamil

Mozhi is early. Today it's a 3.2M-parameter model at 2.07 bits per character. It gets better the way every language model does: more clean, original Tamil. So instead of an anonymous box, you start a contributor profile: pick a handle, name your dialect, and add Tamil. Each piece is saved to your profile, counted toward the corpus, and released under CC0 so it can train open models for everyone. Then open your dashboard and watch your total grow.

Your profile lives on this device: no sign-up, no password. Everything you add is released under CC0 1.0 (public domain). First cohort: Tamil writers and poets.

Sequencing is the point: we grow the corpus before we grow the model.

04 · A commons, not a platform

Mozhi belongs to the people who write Tamil into it. Every contribution is CC0, public domain, so the corpus and the model stay open for anyone to use, study, or rebuild. You keep credit: contributions are tied to your handle, your totals are yours to see, and the writers who feed the model are the ones who shape it. This isn't a social network. It's a shared workbench for Tamil, with just enough identity to see who built what.

01 · PROFILE

Contribute from a profile

Your handle, your dialect, your running total, original Tamil written in your own voice.

02 · CC0

Open all the way down

The text you add, the corpus it joins, the Apache-2.0 model it trains: all public, all in the open.

03 · IMPACT

See your impact

Characters added, passages written, your share of the corpus, on a dashboard that runs on your own device.

05 · How it works

The community and the model improve each other

FIG 0.6 data engine · a self-reinforcing loop

Community contributes Tamil → open corpus grows → efficient model improves → runs on-device → and back.