I built an open, from-scratch MT pipeline + parallel corpus for Tunisian Darija (Arabizi): an early baseline that I'm growing into a curated community corpus [P]
I'm an 18-year-old independent student from Tunisia. I built and lead an open, from-scratch machine-translation pipeline and parallel corpus for Tunisian Darija, and I'm sharing it here for feedback.
Why
Tunisian Darija, written in Arabizi (Latin letters + numerals like 3/7/9/5 for Arabic phonemes), has almost no open NLP resources. Existing Arabic tools route it through Modern Standard Arabic (MSA) and mishandle the orthography. To the best of my knowledge, there was no open parallel corpus or from-scratch baseline for it.
What I built (all open)
- Arabizi-aware SentencePiece BPE tokenizer (3/7/9/5 as protected symbols), shared 16k vocab.
- ~15.6M-param encoder-decoder Transformer, from scratch (no pretrained LM): transfer-learned from cleaned Moroccan Darija, then fine-tuned on hand-crafted Tunisian pairs.
- Full cleaning / training / eval pipeline.
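To make the tokenizer bullet concrete, here is a minimal sketch of one way the digits could be protected, assuming SentencePiece's `user_defined_symbols` option is the mechanism (the repo may do it differently). The file name `corpus.txt`, the prefix `darija_bpe`, and both function names are hypothetical.

```python
# Arabizi digits that stand for Arabic phonemes (taken from the post).
ARABIZI_DIGITS = ["3", "7", "9", "5"]


def sentencepiece_train_args(corpus_path, model_prefix, vocab_size=16000):
    """Build keyword arguments for SentencePieceTrainer.train.

    Assumption: listing the digits as user-defined symbols, which
    SentencePiece always keeps as single, unsplittable pieces.
    """
    return {
        "input": corpus_path,            # one sentence per line, both languages
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,        # shared 16k vocabulary
        "model_type": "bpe",
        "user_defined_symbols": list(ARABIZI_DIGITS),
    }


def train_tokenizer(corpus_path="corpus.txt", model_prefix="darija_bpe"):
    """Train the tokenizer; requires `pip install sentencepiece`."""
    import sentencepiece as spm  # imported lazily so the helper above stays dependency-free

    spm.SentencePieceTrainer.train(**sentencepiece_train_args(corpus_path, model_prefix))
```

One trade-off of this approach: a user-defined symbol is always emitted as its own piece, so a digit never merges with neighbouring letters into a longer subword. Whether that helps or hurts translation quality is an empirical question.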
Honest results & limitations
v1 scores 3.89 BLEU on a small locked test set. That is low, and I want to be upfront about it. The corpus is ~553 hand-crafted pairs, so I believe data is the bottleneck, not architecture. I treat 3.89 as a first honest baseline to beat as the corpus grows.
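For readers less familiar with the metric, here is a simplified sketch of corpus-level BLEU. It assumes whitespace tokenization, one reference per sentence, and no smoothing, so it will not reproduce the 3.89 figure or the output of a standard tool such as sacreBLEU. The function names are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any order with zero matches gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Because the score is a geometric mean over 1- to 4-gram precisions, a system that gets many individual words right but few longer phrases is penalized heavily, which is typical of models trained on very little data.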
Where I'm taking it
I'm expanding this into a larger, ethically-collected Darija corpus that I curate and validate - consent-documented field collection, every pair provenance-tagged. I'm looking for contributors to help grow it, with every contribution reviewed to keep quality and consent standards.
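As an illustration of what a provenance-tagged pair might look like, here is a hypothetical record layout with a small review check. Every field name, the `source` values, and the rule that field-collected pairs need a consent reference are assumptions made for this sketch, not the actual dataset schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusPair:
    darija: str            # Arabizi source sentence
    english: str           # English translation
    source: str            # hypothetical values: "hand-crafted" or "field-collection"
    consent_ref: str = ""  # pointer to a stored consent record, if any
    reviewed_by: str = ""  # reviewer who validated the pair


def validation_errors(pair):
    """Return a list of problems; an empty list means the pair passes review."""
    errors = []
    if not pair.darija.strip():
        errors.append("empty Darija side")
    if not pair.english.strip():
        errors.append("empty English side")
    if pair.source not in {"hand-crafted", "field-collection"}:
        errors.append("unknown source: " + pair.source)
    # Assumed rule: anything collected from other people needs documented consent.
    if pair.source == "field-collection" and not pair.consent_ref:
        errors.append("field-collected pair is missing a consent reference")
    if not pair.reviewed_by:
        errors.append("pair has not been reviewed")
    return errors
```

A contribution would only be merged when `validation_errors` returns an empty list, which keeps the consent and review requirements machine-checkable rather than a matter of convention.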
Looking for
Technical feedback/critique, and anyone interested in contributing data or collaborating on low-resource / dialectal Arabic MT.
Links
- GitHub repo: https://github.com/Dhiadev-tn/darija-translator
- Hugging Face dataset: https://huggingface.co/datasets/Dhiadev-tn/tunisian-darija-english
- Hugging Face model: https://huggingface.co/Dhiadev-tn/darija-translator