v0.1.0 · Apache-2.0

An open தமிழ் language commons

Mozhi (மொழி) is a small, from-scratch language model for Tamil — and an open commons to grow it. The model learns from one thing: original Tamil, written by people who know it. Start a contributor profile, add your Tamil, and watch your words become part of the corpus. Built in the open by Kriyaetive — small enough to run on hardware you own.

FIG 0.1 — architecture · the forward pass
Input · 3 aksharas (த · மி · ழ்) → embed · d = 256 → transformer ×6 (attention + MLP) → softmax · next சொல் (word)
2.07 bits / character · 3.2M parameters · 3 tokenizers · 100% open source
FIG 0.2 — tokenization · fewer tokens, longer reach
Word: தமிழ் · tokens per tokenizer
char: த · ம · ி · ழ · ் (5 tokens)
akshara: த · மி · ழ் (3 tokens)
bpe: தமிழ் (2 tokens)
01 · Why

Tamil deserves models built for Tamil. English-first models shred Tamil into 3–8 tokens per word — a tax on cost, speed, and context. Mozhi builds the representation around Tamil's own unit, the அக்ஷரம் (akshara), making models that are cheaper, faster, and small enough to run on a phone.
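To make the akshara unit concrete, here is a minimal sketch of one way to segment Tamil text: attach every Unicode combining mark (vowel signs and the pulli) to the base character before it. The function name `split_aksharas` is hypothetical, and this is a simplification rather than Mozhi's actual tokenizer; in particular it does not merge conjuncts such as க்ஷ.

```python
import unicodedata


def split_aksharas(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it.

    Assumption: an akshara is approximated as "base + trailing marks".
    Conjunct clusters are NOT merged in this sketch.
    """
    units: list[str] = []
    for ch in text:
        # Tamil vowel signs and the pulli (virama) have Unicode
        # general categories Mc or Mn, both starting with "M".
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# The word தமிழ் written as explicit code points to avoid input ambiguity.
word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"
aksharas = split_aksharas(word)
print(len(word), "code points ->", len(aksharas), "aksharas:", aksharas)
```

On this word the sketch yields three units from five code points, matching the char and akshara counts shown in FIG 0.2.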

FIG 0.3

Fewer tokens

Akshara tokenization cuts the token count by ~36% relative to character-level tokenization (571,134 → 365,637 tokens on the Thirukkural corpus), so the same context window reaches further.

FIG 0.4

Better bits / char

A language model is a compressor. Each Tamil-aware step lowers its bits-per-character: 2.35 with char tokens, 2.23 with aksharas, 2.07 with BPE.
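As a rough illustration of the metric, bits per character can be computed by summing the code length of each predicted token, -log2(p), and dividing by the number of characters those tokens cover. The function name and the toy probabilities below are illustrative assumptions, not Mozhi's evaluation code or real model outputs.

```python
import math


def bits_per_char(token_probs: list[float], n_chars: int) -> float:
    """Total code length in bits divided by the characters covered.

    token_probs: probability the model assigned to each correct token.
    n_chars: number of characters spanned by those tokens.
    """
    total_bits = -sum(math.log2(p) for p in token_probs)
    return total_bits / n_chars


# Toy numbers (hypothetical): three tokens covering five characters.
probs = [0.5, 0.25, 0.25]
print(bits_per_char(probs, n_chars=5))  # (1 + 2 + 2) bits / 5 chars = 1.0
```

Dividing by characters rather than tokens is what makes the score comparable across tokenizers that split the same text into different numbers of tokens.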

FIG 0.5

Runs on-device

Small and efficient means private, offline, in-browser inference — no server required.

02 · Results

Tamil-aware tokens compress better

Identical model and settings, three tokenizers, on the Thirukkural corpus. Lower bits / char is better — it's the model compressing Tamil more tightly.

Tokenizer   Vocab   Tokens    chars / token   bits / char
char           64   571,134   1.00            2.35
akshara       258   365,637   1.56            2.23
BPE           512   264,055   2.16            2.07

char → akshara → BPE · each step is more Tamil-aware and compresses tighter.
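The columns of the results table are related by simple arithmetic, which the sketch below works through. It assumes the char tokenizer emits exactly one token per character (consistent with its 1.00 chars / token), so its token count doubles as the corpus length in characters. The helper name `summarize` and the derived bits-per-token figure are illustrative additions, not numbers reported by the project.

```python
# Token counts and bits / char copied from the results table.
rows = {
    "char": {"tokens": 571_134, "bits_per_char": 2.35},
    "akshara": {"tokens": 365_637, "bits_per_char": 2.23},
    "BPE": {"tokens": 264_055, "bits_per_char": 2.07},
}

# Assumption: one char token per character, so this is the corpus size.
n_chars = rows["char"]["tokens"]


def summarize(name: str) -> tuple[float, float, float]:
    """Return (chars per token, % fewer tokens than char, bits per token)."""
    row = rows[name]
    chars_per_token = n_chars / row["tokens"]
    reduction_pct = 100 * (1 - row["tokens"] / n_chars)
    # Each token carries more characters, so it also carries more bits.
    bits_per_token = row["bits_per_char"] * chars_per_token
    return chars_per_token, reduction_pct, bits_per_token


for name in rows:
    cpt, red, bpt = summarize(name)
    print(f"{name:8s} chars/token={cpt:.2f}  fewer tokens={red:.1f}%  bits/token={bpt:.2f}")
```

Under that assumption the computed chars / token values round to the table's 1.56 and 2.16, and the akshara row comes out at roughly 36% fewer tokens than char.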

03 · Teach Mozhi · உங்கள் தமிழைக் கற்றுக்கொடுங்கள்

Teach Mozhi your Tamil

Mozhi is early: today it is a 3.2M-parameter model at 2.07 bits per character. It gets better the way every language model does, with more clean, original Tamil. So instead of pasting text into an anonymous submission box, you start a contributor profile: pick a handle, name your dialect, and add Tamil. Each piece is saved to your profile, counted toward the corpus, and released under CC0 so it can train open models for everyone. Then open your dashboard and watch your total grow.

Your profile lives on this device: no sign-up, no password. Everything you add is released under CC0 1.0 (public domain). First cohort: Tamil writers and poets. Sequencing is the point: we grow the corpus before we grow the model.

04 · A commons, not a platform

Mozhi belongs to the people who write Tamil into it. Every contribution is CC0 — public domain — so the corpus and the model stay open for anyone to use, study, or rebuild. You keep credit: contributions are tied to your handle, your totals are yours to see, and the writers who feed the model are the ones who shape it. This isn't a social network. It's a shared workbench for Tamil, with just enough identity to see who built what.

01 · PROFILE

Contribute from a profile

Your handle, your dialect, your running total — original Tamil written in your own voice.

02 · CC0

Open all the way down

The text you add, the corpus it joins, the Apache-2.0 model it trains — all public, all in the open.

03 · IMPACT

See your impact

Characters added, passages written, your share of the corpus — on a dashboard that runs on your own device.

05 · How it works

The community and the model improve each other

FIG 0.6 — data engine · a self-reinforcing loop
Diagram: the Kriyaetive Mozhi data-engine loop, a four-stage cycle (01 Community, 02 Corpus, 03 Model, 04 On-device) with a single signal travelling clockwise.
Community contributes Tamil → open corpus grows → efficient model improves → runs on-device → and back.