Teaching a grader the difference between pаypаl and paypal
Look at these two strings: paypal and pаypаl. In most fonts they render the same. The second one has two Cyrillic а characters standing in for Latin a. A person can't reliably tell them apart on sight; a plain == comparison can, but only because it compares code points rather than glyphs, and it has no idea the two strings look alike.
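To make the code-point difference concrete, here is a minimal Python sketch. The helper name diff_code_points is made up for illustration and is not part of the grader; it only uses the standard unicodedata module.

```python
import unicodedata

latin = "paypal"
spoof = "p\u0430yp\u0430l"  # U+0430 is CYRILLIC SMALL LETTER A


def diff_code_points(a: str, b: str) -> list[tuple[int, str, str]]:
    """Return (index, name_in_a, name_in_b) for each position where a and b differ.

    Hypothetical helper: assumes both strings have the same number of code points.
    """
    return [
        (i, unicodedata.name(x), unicodedata.name(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


print(latin == spoof)  # the comparison is on code points, so this is False
for index, left, right in diff_code_points(latin, spoof):
    print(index, left, "vs", right)
```

The equality check already distinguishes the two strings. What it can't do is say that they are visually confusable, which is the part a grader has to judge.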
That pair is one of 72 test cases in something I finished today: a grader for the kind of text bug that's obvious once you see it and invisible until then. It's built against Prime Intellect's Environments Hub, which collects RL environments and evals that AI labs train and test models against. This one grades text correctness specifically - nothing about the pretty stuff, just: is this string handled right.
I've spent about a year finding these bugs by hand, in real repos, as pull requests. 115 of mine are merged upstream as of this morning (misskey, strapi, MUI, Vue Router, Wails, Tencent's tdesign, and a long tail of smaller ones - zero self-merged, github.com/greymoth-jp if you want to check). Most of what I found reduces to a short list of repeating shapes: an Enter key that submits a form mid-IME-conversion, a .length check that splits a kanji in half, a locale file that silently drifts out of date behind the English source.
I keep the CJK/Unicode ones in a public corpus, cjk-failure-corpus - 97 of them now, each linked to a real PR or issue, not written from memory.
At some point the question stopped being "can I find one more of these" and started being "can a program judge whether an answer is correct, the way I've been judging them by hand." That's what a grader is. Building one turned out to be a different skill from finding bugs - but the same underlying judgment, made explicit and checkable.
Three families, 72 cases, no stored answers
tokenization-length (26 cases)
Given a string, report its length three different ways: grapheme clusters (what a person sees), Unicode code points, and UTF-16 code units, and flag whether a naive count would get it wrong.
Take 𠮷野家, the name Yoshinoya prints on its own storefront sign. Its first kanji, 𠮷, is one code point and two UTF-16 units. A plain .length in JavaScript reports 2 for that single character, and 4 for a name a human reads as three characters.
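Two of the three counts can be reproduced with the Python standard library alone. This is a rough sketch with made-up function names, not the grader's code; grapheme clusters are left out because they need a third-party segmentation library, and "naive" here is assumed to mean a JavaScript-style count of UTF-16 units.

```python
def code_points(text: str) -> int:
    # Python strings are sequences of code points, so len() counts them directly.
    return len(text)


def utf16_units(text: str) -> int:
    # Each UTF-16 code unit is two bytes; "-le" avoids a byte-order mark in the output.
    return len(text.encode("utf-16-le")) // 2


def naive_count_is_wrong(text: str) -> bool:
    # Assumption: the naive count is the UTF-16 unit count (what .length returns in JS).
    # It disagrees with the code point count whenever a supplementary character appears.
    return utf16_units(text) != code_points(text)


for sample in ("𠮷", "𠮷野家", "野家"):
    print(sample, code_points(sample), utf16_units(sample), naive_count_is_wrong(sample))
```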
encoding-injection (30 cases - the paypal one lives here)
Decide whether a string hides something: a homoglyph swap; an invisible character splicing a token in two; a bidirectional-override character (U+202E, the mechanism behind Trojan Source, publicly tracked as CVE-2021-42574) that can make a file actually named txt.exe display as exe.txt; or fullwidth characters that compatibility normalization (NFKC) folds into something dangerous (ｄｅｌｅｔｅ normalizes to delete).
Every positive case is paired with a negative one - a legitimate Japanese or Korean string, or a real emoji sequence - so a detector that just flags every non-ASCII string scores no better than chance.
rendering-output (16 cases)
Text that's fine going in and corrupted coming out: UTF-8 misread as Latin-1, which leaves C1 control characters (U+0080 to U+009F) that never occur in real text, or a Private-Use-Area code point that most default font stacks render as a blank box.
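Both symptoms can be checked mechanically. The sketch below uses invented helper names and is only one way such a check could be written. Note that not every misdecoded string contains a C1 control (café becomes Ã©, which has none), so the C1 test catches a large class of mojibake, not all of it.

```python
import unicodedata


def misdecode_as_latin1(text: str) -> str:
    # Simulate the bug: correct UTF-8 bytes read back with the wrong codec.
    return text.encode("utf-8").decode("latin-1")


def has_c1_controls(text: str) -> bool:
    # C1 controls occupy U+0080-U+009F.
    return any(0x80 <= ord(ch) <= 0x9F for ch in text)


def has_private_use(text: str) -> bool:
    # General category "Co" marks Private Use code points such as U+E000.
    return any(unicodedata.category(ch) == "Co" for ch in text)


original = "日本語"
garbled = misdecode_as_latin1(original)
print(has_c1_controls(original), has_c1_controls(garbled))
print(has_private_use("icon \ue000"))
```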
How it works
The part I care about more than the bug list: nothing in the grader is a stored answer key. Every oracle is re-derived at grading time, in Python - len() for code points, the byte length of .encode('utf-16-le') divided by two for UTF-16 units, the grapheme package for clusters, a small reference scanner for the injection checks.
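The "re-derive, don't store" idea can be sketched in a few lines. Everything here is an assumption for illustration: the function names, the answer format, and the fraction-correct scoring are not taken from the actual grader. The grapheme import is optional so the sketch still runs where the package isn't installed.

```python
def derive_lengths(text: str) -> dict[str, int]:
    """Recompute the expected counts from the input itself; nothing is looked up."""
    oracle = {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }
    try:
        import grapheme  # third-party package; skipped if unavailable
        oracle["graphemes"] = grapheme.length(text)
    except ImportError:
        pass
    return oracle


def grade(text: str, answer: dict[str, int]) -> float:
    # Hypothetical scoring: the fraction of re-derived fields the answer matches.
    oracle = derive_lengths(text)
    correct = sum(1 for key, value in oracle.items() if answer.get(key) == value)
    return correct / len(oracle)


careful = {"code_points": 3, "utf16_units": 4, "graphemes": 3}
naive = {"code_points": 4, "utf16_units": 4, "graphemes": 4}  # .length used for everything
print(grade("𠮷野家", careful))
print(grade("𠮷野家", naive))
```

Because the expected values come from the input at grading time, adding a test case means adding a string, not a string plus a hand-typed answer that can go stale.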
A correct answer scores 1.0 across every class. A naive baseline - roughly the level of check most real code actually ships - scores 0.43 on average. That gap is the training signal.
What was left out
I deliberately left two things out. The source corpus has a couple of checks (pangu spacing, budoux line-breaking) whose ground truth comes from a JavaScript library. Porting that to Python would mean trusting a second implementation to stay byte-identical to the first forever, which breaks the entire "re-derive, don't store" premise - so I cut them instead of faking the confidence.
Two Indic-conjunct cases got cut for the same reason: the Python grapheme library predates the current Unicode rule for that script (UAX #29 GB9c) and would confidently grade them wrong. Scoping something out because you can't verify it honestly is a different move from shipping it and hoping nobody checks.
What I don't know yet
Whether this earns anything is untested. Prime Intellect runs a funded program for environment submissions - real money, open-tier bounties in the low hundreds and an approval-gated tier in the low thousands - but every credited contributor listed so far is an org (Arcee AI, Hud.so, Groq), not a solo account with no prior relationship. Publishing is one command. Getting paid, or even reviewed, by people who've never seen my name before is the actual experiment here, not this post.
What I do know
The corpus that took a year of manual, unglamorous bug-hunting to build turned out to be exactly the training data a grader like this needs. That wasn't the plan when I started filing PRs into strangers' repos for no clearer reason than "this is wrong and I can fix it." It's the kind of connection you only notice once you've done enough of the boring version by hand.