DEV Community
My video generation pipeline that built itself
Let me show you something cool. This two-minute video was built by Claude Code from a single prompt. Okay, one prompt and about thirty follow-ups. And then twenty more after Claude Code fumbled a git command and wiped out half of my video-editing material (don't ask).

But it's still pretty cool, because I didn't do a single thing by hand. No image editor, no video timeline, no audio software, no clicking around in a tool. It was just a conversation between me and Claude. And all the tools it used along the way (for generating the images, synthesizing the voice, editing the video, the glue code that ties it together) Claude built for itself.

This is not the greatest video in the world, but I think it does its job, explaining the rules of my side project, quite well.

The two of us made it: me and Simona, my heavily customized Claude Code setup. I made the directorial calls ("those highly detailed images don't work, try a chalkboard instead") and she did everything else.

"Everything else" is a kit of skills, small, self-contained tools Simona reaches for the way you'd reach for an app:

- Image generation: OpenAI's gpt-image-2 and Google's Nano Banana 2 (gemini-3.1-flash-image), for every still in the video.
- Image-to-video: turning a still into a few seconds of motion. A separate skill per model: Seedance 2.0 (the workhorse here), Google's Veo 3, Kling 3, and LTX-2.3.
- Voice: one skill per model, with ElevenLabs for the final narration (priciest, best), Google's Gemini TTS for cheap drafts, and Kokoro running locally for free dry runs.
- ffmpeg: the editing layer under all of it, covering the cuts, the zooms, the crossfades, the audio mix.
- Director: the meta-skill that ties the others together. It knows the whole pipeline, so a single high-level ask can fan out into image, voice, video, and edit steps in order.

It's way simpler than it sounds. The rest of this post is how we built it.
The whole thing is reproducible from the project's WORKLOG.md alone, which is over 900 lines and contains every prompt, every model call, every cost, and every fix. I'll quote from it where it makes the story sharper.

The pipeline that emerged

It grew out of one trivial request, one skill at a time.

When I started, I had nothing. No pipeline, no skills, no plan, just a Claude Code session and a few images sitting in a folder. My first ask was almost trivial: take these images, show them one after another, and play a voice reading the narration over the top. That one request is what kicked everything off, because to pull it off Simona needed two things she didn't have yet: a way to make the voice, and a way to stitch it all into a video.

Voice first

So the first skill we built was voice. I found a text-to-speech API, pasted its documentation straight into the session, and told her: the key's already in the environment, read this, get me one line of spoken audio. She fumbled for a minute, hit a wrong parameter or two, and then a WAV came back. The moment it worked, we froze that path into a skill (a little directory with a SKILL.md explaining how and when to use it, and a small Python CLI wrapping the call) so she'd never have to rediscover it. That recipe (paste the docs, make one successful call, write down the path that worked) became how every later skill got built.

Wiring up the skill was the easy part. Picking the actual voice was the surprisingly hard one. Apparently, describing a voice in words is not easy. "Deep, warm, a little sinister, older British man" gets you a dozen different readings, none of them the one in your head. So I went through ElevenLabs' voice library by ear instead, and landed on George, a British storyteller voice. He wasn't a werewolf host out of the box, so we pitched him down about 15% and ran him through a hall-echo filter, and suddenly he sounded like something with too many teeth narrating from the far end of a stone corridor.
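The post does not quote the command behind the pitch-down and echo, so the following is only a sketch of one common way to get that effect with ffmpeg's stock audio filters. The function names, the 44.1 kHz sample rate, and the specific echo delays and decays are assumptions chosen for illustration, not values from the project.

```python
def werewolf_voice_filter(pitch_factor: float = 0.85, sample_rate: int = 44100) -> str:
    """Build an ffmpeg audio filter chain: lower the pitch, then add a short echo.

    pitch_factor=0.85 means "about 15% lower". sample_rate is assumed to be
    the sample rate of the input file (hypothetical default: 44100 Hz).
    """
    if not 0.5 <= pitch_factor <= 2.0:
        # Keeps the compensating atempo value inside the 0.5..2.0 range.
        raise ValueError("pitch_factor must be between 0.5 and 2.0")
    # asetrate replays the same samples at a lower rate: pitch and speed both drop.
    lowered_rate = round(sample_rate * pitch_factor)
    # atempo speeds playback back up so the duration matches the original,
    # leaving only the pitch change.
    tempo = 1 / pitch_factor
    return (
        f"asetrate={lowered_rate},"
        f"aresample={sample_rate},"
        f"atempo={tempo:.4f},"
        # aecho=in_gain:out_gain:delays(ms):decays; two taps, values are guesses.
        "aecho=0.8:0.9:60|120:0.4|0.25"
    )


def voice_command(src: str, dst: str, pitch_factor: float = 0.85) -> list[str]:
    """Return the argv for ffmpeg; pass it to subprocess.run to execute."""
    return ["ffmpeg", "-y", "-i", src, "-af", werewolf_voice_filter(pitch_factor), dst]


if __name__ == "__main__":
    # File names are placeholders.
    print(" ".join(voice_command("george_raw.wav", "george_werewolf.wav")))
```

Building the argv as a list rather than a shell string avoids having to quote the `|` characters in the echo parameters.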
That's the narrator you hear across the whole video. I suspect using a real actor or singer as the reference would work even better.

ffmpeg: describe the edit, get the command

Then came assembly, and this is where Simona showed me something I didn't expect. I asked how she'd put the images and the audio together, and she just... wrote an ffmpeg command. It turns out an LLM is very good at ffmpeg, that famously cryptic tool with a thousand flags no human remembers. You don't write the command; you describe the edit, and she produces the invocation. She even, unprompted, started adding a slow zoom into each still (the Ken Burns effect) because an image held still for four seconds looks dead. I liked it. It was the beginning of the static image effects library.

When I say "hold on this image and slowly zoom in," what actually runs is this:

    ffmpeg -i doors.png -vf "zoompan=z='1+(1.4-1)*on/(frames-1)':d=100:\
    x='iw/2-iw/zoom/2':y='ih/2-ih/zoom/2':s=3840x2160:fps=25,\
    scale=1920:1080:flags=lanczos" -frames:v 100 scene.mp4

No way I could write this manually. I'd have to go read about zoompan, work out why the zoom is expressed as a per-frame fraction of the total frame count, puzzle through the x/y centering algebra, and then discover the hard way that you have to render at 4K and downscale with lanczos or the slow zoom develops a visible jitter.

Or take mixing the narration in over a bed of ambient sound, with each voice line dropped at its own timestamp:

    ffmpeg ... -filter_complex \
    "[1:a]adelay=300|300[a1];[2:a]adelay=4500|4500[a2];[3:a]adelay=10000|10000[a3];\
    [0:a][a1][a2][a3]amix=inputs=4:duration=first:normalize=0[out]" ...

That normalize=0 at the very end is the kind of detail that costs a human an hour and a forum thread to learn: leave it off and amix quietly divides every track's volume by the number of inputs, so your carefully recorded narration comes out faint and you have no idea why.
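Both commands quoted above are really string templates with a few numbers plugged in, which is roughly what a skill's Python wrapper would have to do. Here is a minimal sketch of such builders. The function names are hypothetical, and it assumes that `frames` in the quoted zoom expression is a placeholder for the total frame count (100 in that example) that gets substituted before ffmpeg sees it.

```python
def ken_burns_filter(frames: int = 100, end_zoom: float = 1.4, fps: int = 25) -> str:
    """Build the zoompan filter for a slow centered zoom from 1.0 to end_zoom.

    The zoom grows linearly with the output frame number `on`, reaching
    end_zoom on the last frame. Rendering at 3840x2160 and downscaling with
    lanczos follows the anti-jitter advice in the post.
    """
    if frames < 2:
        raise ValueError("need at least 2 frames to interpolate a zoom")
    zoom_expr = f"1+({end_zoom}-1)*on/({frames}-1)"
    return (
        f"zoompan=z='{zoom_expr}':d={frames}:"
        "x='iw/2-iw/zoom/2':y='ih/2-ih/zoom/2':s=3840x2160:"
        f"fps={fps},scale=1920:1080:flags=lanczos"
    )


def narration_mix_filter(delays_ms: list[int]) -> str:
    """Build a filter_complex that lays voice lines over an ambient bed.

    Input 0 is the ambient track; inputs 1..N are voice lines, each delayed
    to its own start time in milliseconds (one value per stereo channel).
    """
    if not delays_ms:
        raise ValueError("need at least one voice line")
    stages = []
    labels = []
    for index, delay in enumerate(delays_ms, start=1):
        stages.append(f"[{index}:a]adelay={delay}|{delay}[a{index}]")
        labels.append(f"[a{index}]")
    total_inputs = len(delays_ms) + 1  # voice lines plus the ambient bed
    joined = "".join(labels)
    # normalize=0 keeps amix from scaling every input down.
    stages.append(
        f"[0:a]{joined}amix=inputs={total_inputs}:duration=first:normalize=0[out]"
    )
    return ";".join(stages)


if __name__ == "__main__":
    print(ken_burns_filter())
    print(narration_mix_filter([300, 4500, 10000]))
```

Calling `narration_mix_filter([300, 4500, 10000])` reproduces the filter graph quoted above, and adding a fourth voice line only means adding a fourth timestamp.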
Simona either already knows it or learns it once, the hard way, and then writes it into the skill so neither of us ever trips on it again. We froze the whole approach into an ffmpeg skill, the editing layer everything else now sits on top of.

Images, then motion

That gave me a working slideshow, and once I had it, the appetite grew. Hunting down images by hand felt silly when I could generate exactly the shot I wanted, so we built an image generation skill the same way: paste the provider's docs, get one good image back, freeze the path. The library of effects (Ken Burns in any direction, crossfades, slow scrolls for tall images, animated highlights drawn over a live UI) grew one request at a time. I'd ask for something new, she'd try a few versions, and we kept whatever looked right. Nobody planned that effect library. It accreted.

Then static frames stopped being enough. I wanted real motion in the hero moments (the cloaked figure pulling back its hood, the mansion doors swinging open) and that meant AI-generated video.

This is where money stops being a rounding error. A generated image costs a few cents; five seconds of generated video costs anywhere from thirty cents to three dollars depending on the model. So the entire shape of the video is, underneath, an economics decision. If I'd generated the whole two minutes as AI video it would have cost a fortune. Instead the cheap slideshows carry most of the runtime, and I spend real money on generated motion only for the handful of shots that actually earn it. Slideshow for the rules; generated video for the hood reveal.

Finding a video model I could live with took longer than anything else, because this corner of the market is a mess. I started on Google's Veo: gorgeous, and brutal on the wallet at about three dollars for a single short clip.
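To put rough numbers on that economics decision, here is a small back-of-the-envelope sketch using only the prices quoted above (thirty cents to three dollars per five-second clip) and the two-minute runtime. The function names and the count of four hero shots are hypothetical, and the sketch assumes one take per clip, so retries would multiply the totals.

```python
import math

# Prices quoted in the post, in US cents per five-second generated clip.
CHEAPEST_CLIP_CENTS = 30
PRICIEST_CLIP_CENTS = 300
CLIP_SECONDS = 5


def clips_needed(runtime_seconds: int, clip_seconds: int = CLIP_SECONDS) -> int:
    """Number of fixed-length clips required to cover the runtime."""
    return math.ceil(runtime_seconds / clip_seconds)


def generated_video_cost_cents(runtime_seconds: int, cents_per_clip: int) -> int:
    """Cost of covering the whole runtime with generated video, one take each."""
    return clips_needed(runtime_seconds) * cents_per_clip


def format_dollars(cents: int) -> str:
    """Render an integer number of cents as a dollar amount."""
    return f"${cents // 100}.{cents % 100:02d}"


if __name__ == "__main__":
    runtime = 120  # the two-minute video
    low = generated_video_cost_cents(runtime, CHEAPEST_CLIP_CENTS)
    high = generated_video_cost_cents(runtime, PRICIEST_CLIP_CENTS)
    print(f"All generated video: {format_dollars(low)} to {format_dollars(high)}")

    hero_shots = 4  # hypothetical count, not a figure from the post
    hero_only = hero_shots * PRICIEST_CLIP_CENTS
    print(f"Only {hero_shots} hero shots at the top price: {format_dollars(hero_only)}")
```

Working in integer cents sidesteps floating-point rounding in the totals.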
Then I moved to Kling, a Chinese model that ran roughly a dollar for five seconds and was good enough for a lot of shots (I tried Wan too, in the same bracket, and didn't keep it). I also tried LTX, which is probably the best open-source video model out there right now and is available through an official API for something like thirty to fifty cents per five-second clip; it has no audio at all, but that makes it perfect for cheap dry runs. And "official" is doing a lot of work in that sentence, because for most of these models there is no first-party API: you go through third-party platforms with their own strange credit systems and pricing, and finding one that's reliable and not a rip-off took real time. The one I settled on as my workhorse is Seedance 2.0, which is the king of the hill at the moment.

Keeping one consistent voice across the generated clips and the slideshows was a challenge until I discovered reference-to-video models. Instead of handing the model a single still and a prompt, you give it several reference images, a sample of the voice you want, and a prompt describing how the whole thing should move and speak. This gave me consistency: the character stays the same character from shot to shot, and he speaks in the same voice that carries the slideshow narration. Pick one voice, use it for the spoken slides and feed it as the reference to the video model, and the seams between a generated clip and a static section stop announcing themselves. The whole thing feels like one narrator walking you through one world.

Skills as a scar collection

And every time we hit a wall, the fix went back into the skill. A voice model that choked on em-dashes near names, a zoom that jittered at high resolution, an image endpoint that quietly ignored a parameter: each one became a documented gotcha in its SKILL.md so she'd never walk into it twice. The skills are basically a scar collection.

The strange part is how little I actually look inside these skills.
I almost never open the files. I just ask her to revisit and tidy them every so often, and when one has grown into a sprawling mess I have her refactor it. Eventually I wired that up as a Claude Code hook so she does the housekeeping on her own schedule instead of waiting for me to remember — though that only earns its keep once a skill has gotten bi