DEV Community 2h ago

What 'quality-tested' actually means for a library of 394 AI skills

A skill ships stable only if it clears two bars

Every skill carries a status. To reach stable - the only status the library promotes - it has to clear both bars of a seven-dimension evaluation:

Overall mean ≥ 4.0/5 across coherence, relevance, accuracy, completeness, usefulness, format-fit, and Editorial Naturalness.
Editorial Naturalness ≥ 4.0 as a hard floor - a skill can ace the other six and still fail here.

Editorial Naturalness scores the output against observable AI tells (lexical, structural, tonal, genre). It's the dimension that stops competent-sounding slop from shipping. The library mean across all stable skills is 4.38.
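As a rough illustration, the two bars reduce to a few lines of Python. This is a minimal sketch, not the library's code: the dimension names and thresholds come from the text above, while the dict layout and the `is_stable` function name are assumptions.

```python
from statistics import mean

# Dimension names and thresholds are taken from the article.
# The dict layout and the function name are assumptions for this sketch.
DIMENSIONS = (
    "coherence", "relevance", "accuracy", "completeness",
    "usefulness", "format_fit", "editorial_naturalness",
)
MEAN_BAR = 4.0
NATURALNESS_FLOOR = 4.0


def is_stable(scores: dict[str, float]) -> bool:
    """Return True only if both bars are cleared."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    overall = mean(scores[d] for d in DIMENSIONS)
    return overall >= MEAN_BAR and scores["editorial_naturalness"] >= NATURALNESS_FLOOR


# Aces six dimensions, but naturalness sits below the floor.
fluent_but_synthetic = {
    "coherence": 5, "relevance": 5, "accuracy": 5, "completeness": 5,
    "usefulness": 5, "format_fit": 5, "editorial_naturalness": 3.5,
}
print(is_stable(fluent_but_synthetic))  # False: the mean is about 4.79, but 3.5 < 4.0
```

The floor is checked separately from the mean on purpose: averaging alone would let six strong scores hide one weak one.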

The whole framework - dimensions, thresholds, the banned-phrase list - is in the repo, not a marketing page.
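To make "lexical tell" concrete, here is a small sketch of a banned-phrase scan. The three phrases are placeholders invented for this example, not the library's list (that one is in the repo), and `find_lexical_tells` is a hypothetical name.

```python
import re

# Placeholder phrases for illustration only; the library's real list lives in its repo.
BANNED_PHRASES = ("delve into", "in today's fast-paced world", "it's worth noting")


def find_lexical_tells(text: str, phrases=BANNED_PHRASES) -> list[str]:
    """Return the banned phrases that occur in text, ignoring case."""
    hits = []
    for phrase in phrases:
        # Word boundaries keep a phrase from matching inside a longer word.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(phrase)
    return hits


draft = "It's worth noting that the council will delve into the budget on Tuesday."
print(find_lexical_tells(draft))
```

A scan like this only covers the lexical part of the dimension; structural, tonal, and genre tells need the graded rubric.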

Two stages, because prose isn't code

Code eval is binary: it runs or it doesn't. Prose has no green checkmark, so the library tests in two layers.

First, binary assertions catch the mechanical failures - did it produce the required sections, did it refuse to fabricate a quote with no source. Across thousands of these the pass rate is high, and the few "failures" were skills correctly refusing to invent content on deliberately thin inputs - the behaviour you want.
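The two checks named above could look roughly like this. It is a sketch under stated assumptions: both function names are hypothetical, headings are assumed to sit on their own line, and a quote counts as sourced only if it appears verbatim in the input.

```python
import re


def has_required_sections(output: str, required: list[str]) -> bool:
    """Pass only if every required heading appears on its own line."""
    lines = {line.strip().lower() for line in output.splitlines()}
    return all(name.lower() in lines for name in required)


def quotes_are_sourced(output: str, source_text: str) -> bool:
    """Pass only if every double-quoted span in the output appears verbatim in the source."""
    quoted = re.findall(r'"([^"]+)"', output)
    return all(q in source_text for q in quoted)


# A deliberately thin input: no quotes to draw on.
source = "The council met on Tuesday."

honest = "Summary\nThe council met on Tuesday. The source contains no attributable quote."
invented = 'Summary\n"We are thrilled," said the mayor.'

print(has_required_sections(honest, ["Summary"]))  # True
print(quotes_are_sourced(honest, source))          # True: nothing quoted, nothing invented
print(quotes_are_sourced(invented, source))        # False: the quote is not in the source
```

Each check returns a plain pass or fail, which is what lets thousands of them run without a judge in the loop.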

Second, the graded rubric above handles the judgment calls binary checks can't.

Where it's soft - said plainly

The graded scoring uses a model as judge, and models are generous: they tend to like fluent text, including fluent AI text. So the scores are treated as a filter, not a verdict. Three things keep them honest:

The bar is set high (≥ 4.0 with the naturalness floor), so borderline output doesn't pass.
The rubric is anchored on observable tells, not taste, so two runs roughly agree.
The worked example in every skill lets you check the output yourself, by eye, in seconds.
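"Roughly agree" can be checked rather than assumed. The sketch below compares two judge runs dimension by dimension. The 0.5-point tolerance and the `runs_agree` name are assumptions for illustration, not values from the library.

```python
def runs_agree(run_a: dict[str, float], run_b: dict[str, float],
               tolerance: float = 0.5) -> bool:
    """Return True if every dimension differs by at most `tolerance` between two runs."""
    if run_a.keys() != run_b.keys():
        raise ValueError("runs must score the same dimensions")
    return all(abs(run_a[d] - run_b[d]) <= tolerance for d in run_a)


# Two dimensions shown for brevity; a real run would carry all seven.
first = {"coherence": 4.5, "editorial_naturalness": 4.0}
second = {"coherence": 4.0, "editorial_naturalness": 4.5}
drifted = {"coherence": 4.5, "editorial_naturalness": 3.0}

print(runs_agree(first, second))   # True: both gaps are 0.5
print(runs_agree(first, drifted))  # False: naturalness moved by a full point
```

A large gap between runs is a sign that the rubric anchor for that dimension is too loose, not that one run is right.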

None of this guarantees that every output is perfect. It's a documented, repeatable bar that's a lot higher than "we tried it once."

Why bother for free skills

Because the audience is media professionals, and they spot generic writing instantly. A skill library for people who notice bad writing has to be testable on exactly that axis, or the whole premise collapses.

The eval framework isn't a credential - it's the thing that makes "doesn't sound like AI" a claim you can check instead of a vibe. Open the repo, open any skill, read its example, and judge for yourself. That's the test that matters.

→ github.com/ur-grue/autopunk-media-skills
