Reddit - r/MachineLearning 1h ago

I built an open, from-scratch MT pipeline + parallel corpus for Tunisian Darija (Arabizi). It's an early baseline, and I'm growing it into a curated community corpus [P]

I'm an 18-year-old independent student from Tunisia. I built, and now lead, an open, from-scratch machine-translation pipeline and parallel corpus for Tunisian Darija. I'm sharing it for feedback.

Why

Tunisian Darija, written in Arabizi (Latin letters + numerals like 3/7/9/5 for Arabic phonemes), has almost no open NLP resources. Existing Arabic tools route it through MSA and mishandle the orthography. To the best of my knowledge there was no open parallel corpus or from-scratch baseline for it.

What I built (all open)

Arabizi-aware SentencePiece BPE tokenizer (3/7/9/5 as protected symbols), shared 16k vocab.
~15.6M-param encoder–decoder Transformer, from scratch (no pretrained LM): transfer-learned from cleaned Moroccan Darija, then fine-tuned on hand-crafted Tunisian pairs.
Full cleaning / training / eval pipeline.
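
To make "protected symbols" concrete, here is a minimal stdlib-only sketch of one way to keep the Arabizi digits out of BPE merges: split them into their own pieces before the tokenizer sees the text. This is an illustration, not the project's actual code. The function name `isolate_protected` and the rule "a digit counts as a letter only when it touches a Latin letter" are assumptions; in SentencePiece itself the usual route would be to list the digits as user-defined symbols at training time.

```python
import re

# Assumption: the four digits the post lists as protected symbols.
PROTECTED = "3795"

# Treat a digit as an Arabizi letter only when it sits next to a Latin letter,
# so the 7 in "sa7bi" is isolated while a standalone number like "2023" is not.
_LETTER_DIGIT = re.compile(
    rf"(?<=[A-Za-z])[{PROTECTED}]|[{PROTECTED}](?=[A-Za-z])"
)

def isolate_protected(text: str) -> list[str]:
    """Split on whitespace, then give each letter-digit its own piece."""
    pieces = []
    for word in text.split():
        marked = _LETTER_DIGIT.sub(lambda m: f" {m.group(0)} ", word)
        pieces.extend(marked.split())
    return pieces

print(isolate_protected("sa7bi"))          # digit between letters is split out
print(isolate_protected("3asslema 2023"))  # the year stays one piece
```

The trade-off in this sketch is that real numbers glued to letters (for example "5kg") would also be split, so a production rule would likely need a whitelist or extra context.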

Honest results & limitations

v1 BLEU is 3.89 on a small locked test set - low, and I'll be upfront about it. The corpus is ~553 hand-crafted pairs, so data is the bottleneck, not architecture. I treat 3.89 as a first honest baseline to beat as the corpus grows.
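
For anyone who wants to sanity-check a score like this by hand, below is a minimal sketch of corpus-level BLEU (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty). It assumes whitespace tokenization, one reference per sentence, and no smoothing, so it will not necessarily reproduce the 3.89 above; the post does not say which scorer or tokenization was used, and a standard tool such as sacreBLEU is the better choice for reported numbers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale.

    Assumptions: whitespace tokens, one reference per hypothesis, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

print(corpus_bleu(["a b c d"], ["a b c d e"]))
```

Because the counts are pooled over the whole test set, a small test set means a handful of 4-gram matches can move the score noticeably, which is one more reason to read a low early number as a baseline rather than a verdict.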

Where I'm taking it

I'm expanding this into a larger, ethically-collected Darija corpus that I curate and validate - consent-documented field collection, every pair provenance-tagged. I'm looking for contributors to help grow it, with every contribution reviewed to keep quality and consent standards.
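
As a rough sketch of what "provenance-tagged" could look like per pair, here is a hypothetical record type with a small validation check. Every field name here (`contributor_id`, `consent_ref`, `collection_method`, `reviewed`) is an assumption for illustration and not the project's real schema, and the target language is left generic because the post does not state it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelPair:
    """Hypothetical schema for one provenance-tagged sentence pair."""
    source: str             # Tunisian Darija written in Arabizi
    target: str             # translation (target language not assumed here)
    contributor_id: str     # pseudonymous ID, not a real name
    consent_ref: str        # pointer to the documented consent record
    collection_method: str  # e.g. "field" or "hand-crafted"
    reviewed: bool = False  # set to True once a curator has checked the pair

def validation_errors(pair: ParallelPair) -> list[str]:
    """Return the reasons a pair should not enter the corpus (empty if none)."""
    errors = []
    if not pair.source.strip():
        errors.append("empty source")
    if not pair.target.strip():
        errors.append("empty target")
    if not pair.consent_ref.strip():
        errors.append("missing consent reference")
    if not pair.reviewed:
        errors.append("not yet reviewed")
    return errors

pair = ParallelPair("sa7bi", "my friend", "c-001", "", "field")
print(validation_errors(pair))
```

Keeping consent and review status on the record itself means a single filter over the corpus can exclude anything that has not cleared both checks.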

Looking for

Technical feedback/critique, and anyone interested in contributing data or collaborating on low-resource / dialectal Arabic MT.
