I Used an AI Agent to Make a Product Video. The Cost Was $0, But There's a Catch.
The Plan: Smooth Sailing and a Reference Point
Getting started with OpenMontage was a breeze. Aside from installing FFmpeg, a well-documented prerequisite, the make setup command handled all Python and Node dependencies cleanly.
The journey began not with a blank prompt, but with a reference video I chose for its polish and recency: Linear's "Introducing Linear Agent." OpenMontage's video analyzer ingested the YouTube URL and returned something far more insightful than a mere transcript. It produced a five-aspect cinematographic breakdown (subject, motion, scene, framing, camera) and even inferred why the style worked, noting the "high-contrast dark palette... to make plain text look incredibly premium."
This single prompt took me from a URL to a creative brief with a full estimate of API costs, complete with two distinct concepts grounded in research on the local-first ecosystem. The agent was off to a flying start.
The Build: An Agent at the Helm
The agent returned with a STORYBOARD.md, a detailed 75-second timeline mapping script beats to assets and motion design. It looked solid, so I gave it the green light.
A key insight emerged mid-build: OpenMontage isn't an agent itself. As the README states, "Your AI coding assistant IS the orchestrator." The project is a powerful toolkit of pipelines, tools, and skills designed to be wielded by an external agent, which in my case was the Antigravity CLI (agy). This reframed the experiment: I wasn't testing a monolithic product, but how well its tools and instructions could steer my chosen agent.
The first render attempt was a mixed bag. The agent correctly re-timed the video from 75s to 53s, accounting for the faster-than-estimated narration from the local Piper TTS model. But it also hit a wall. The storyboard had vaguely described a "dedicated subtitle track generated via our transcriber." The result? The entire script was dumped on-screen at once, complete with raw transcriber tokens like [_BEG_] and awkward word splits ("CR DT" for "CRDT"). It was a perfect lesson in agentic workflows: the vaguest line in the plan is precisely where the build will break.
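For illustration, here is roughly what a more specific caption step could look like. This is a minimal sketch, not OpenMontage's code: it assumes the transcriber yields (text, start, end) tuples per word, the token pattern is a guess generalized from [_BEG_], and the glossary used to rejoin split words is hypothetical.

```python
import re

# Assumption: special tokens look like [_BEG_] or [_TT_150_]. The real token
# set depends on the transcriber and is not documented in this post.
SPECIAL_TOKEN = re.compile(r"\[_[A-Z]+(?:_\d+)?_\]")


def clean_words(words):
    """Drop special tokens and empty entries from (text, start, end) tuples."""
    cleaned = []
    for text, start, end in words:
        text = SPECIAL_TOKEN.sub("", text).strip()
        if text:
            cleaned.append((text, start, end))
    return cleaned


def merge_splits(words, glossary):
    """Rejoin adjacent fragments (e.g. "CR" + "DT") when the result is a known term."""
    merged, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and words[i][0] + words[i + 1][0] in glossary:
            joined = words[i][0] + words[i + 1][0]
            merged.append((joined, words[i][1], words[i + 1][2]))
            i += 2
        else:
            merged.append(words[i])
            i += 1
    return merged


def chunk_captions(words, max_chars=32):
    """Group timed words into short captions instead of one on-screen block."""
    captions, current = [], []

    def flush():
        text = " ".join(w[0] for w in current)
        captions.append((text, current[0][1], current[-1][2]))

    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            flush()
            current = []
        current.append(word)
    if current:
        flush()
    return captions
```

Each caption carries the start time of its first word and the end time of its last, so a renderer can show one short line at a time. Spelling out even this much in the storyboard would have left the agent far less room to improvise.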
First Look: Art School Project, Not Polished Product
After stripping the broken caption track, I watched the first complete version. The verdict was immediate: it was cohesive and on-theme, but it looked more like an art school video project than a professional explainer. The problems were substantive:
- Missing Information: The video mentioned CRDTs (conflict-free replicated data types), the core technical concept, but never explained or diagrammed them. A GCP billing issue earlier in the process had blocked access to Google's Imagen, and the fallback visuals never filled this explanatory gap.
- Sync Issues: On-screen text and bullet points were frequently out of sync with the narration, sometimes appearing seconds early or transposed.
- Amateur Aesthetics: The typography felt like a PowerPoint slide, and the b-roll clips, while individually fine, were repeated randomly.
The agent had nailed the vibe but whiffed on the substance. It had assembled atmosphere where I needed explanation.
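The sync problem in particular is the kind of thing that can be checked mechanically before rendering. The sketch below is my own illustration, not an OpenMontage feature: it assumes narration words arrive as (text, start, end) tuples and that each on-screen cue is a hypothetical (label, trigger phrase, shown_at) tuple.

```python
def find_phrase_start(words, phrase):
    """Return the start time of the first spoken occurrence of phrase, or None."""
    target = phrase.lower().split()
    if not target:
        return None
    # Normalize narration words so trailing punctuation does not block a match.
    tokens = [w[0].lower().strip(".,!?") for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i][1]
    return None


def cue_drift(cues, words, tolerance=0.5):
    """Report cues that appear more than `tolerance` seconds off their phrase.

    Returns (label, drift) pairs: negative drift means the text shows early,
    None means the trigger phrase was never spoken.
    """
    report = []
    for label, phrase, shown_at in cues:
        spoken_at = find_phrase_start(words, phrase)
        if spoken_at is None:
            report.append((label, None))
        elif abs(shown_at - spoken_at) > tolerance:
            report.append((label, round(shown_at - spoken_at, 2)))
    return report
```

A report like this gives the agent a concrete list of cues to move, rather than a vague instruction to "fix the timing."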
Iteration and a Quality Plateau
I tasked the agent with a targeted revision: fix the timing and add a diagram explaining CRDTs. It successfully corrected the sync issues and generated a Mermaid flowchart for the diagram. This was a definite improvement, but it also highlighted the agent's limits.
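For readers wondering what the diagram needed to convey, the idea fits in a few lines of code. This is a minimal sketch of one of the simplest CRDTs, a grow-only counter; it is my own illustration and does not come from the video or from OpenMontage.

```python
class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> highest count seen for that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent, so
        # replicas converge regardless of the order or number of merges.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())


# Two devices edit offline, then sync in either order.
phone, laptop = GCounter("phone"), GCounter("laptop")
phone.increment(2)
laptop.increment(3)
phone.merge(laptop)
laptop.merge(phone)
```

After the two merges both replicas report the same total, with no central server deciding who wins. That convergence property is what the video needed to show.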
The Mermaid diagram, while technically correct, had the aesthetic of a corporate IT presentation, which is the wrong register entirely for a polished product video. The output had hit a quality plateau. We could iterate on the details, but the fundamental feel remained amateurish. This was the stopping point.
Friction, Surprises, and the Real Cost
The process also revealed several fascinating and instructive points of friction. At one point, the agent silently stalled, seemingly stuck in a long thought process. It turned out to be waiting for a hidden permission prompt, a reminder to check an agent's underlying processes. More surprisingly, the agent autonomously patched OpenMontage's source code to fix a bug in how it loaded API keys from the .env file.
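I didn't dig into what exactly the patch changed, so the following is not OpenMontage's loader. It is a minimal standard-library sketch of the job such a loader has to do, with the usual trouble spots (comments, quotes, an "export" prefix, stray whitespace) handled explicitly. The function names are hypothetical.

```python
import os


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key] = value
    return values


def load_env(path=".env", override=False):
    """Copy parsed values into os.environ; existing variables win by default."""
    with open(path, encoding="utf-8") as fh:
        values = parse_env(fh.read())
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values
```

This sketch deliberately does not handle inline comments after a value or multi-line values. Whether a key already set in the shell should win over the file is a design choice; here the shell wins unless override=True.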
The most critical lesson, however, was about cost. The final bill for the video was effectively $0. The handful of calls to Google's Imagen API fell within the free tier. But getting there wasn't free. A single GOOGLE_API_KEY doesn't unlock all of Google's services; the key for Gemini doesn't work for Cloud Text-to-Speech or Imagen out of the box. Unlocking Imagen required enabling billing on my Google Cloud project, which involved a ten-minute detour to set up a new billing account with a $10 minimum prepay.
This is the crucial asterisk. The marginal cost of making the video was effectively zero, but the floor to enter was a $10 prepay and a bureaucratic setup process. The pocket-change promise is real, but the on-ramp isn't free.
The Verdict: A Powerful Tool with a Ceiling
So, would I use OpenMontage again? Absolutely. It delivered on its core promise, orchestrating a complex production pipeline from a simple prompt for virtually no cost. The economic advantage over cloud video tools is staggering. For someone coming from the world of manual video editing, the ability to iterate on a script or timing with a single prompt feels like a superpower.
But the output has a ceiling. The final product never felt professional. A key workflow gap is the inability to easily review visual assets before they're baked into the final composition. For a truly polished result, I'd need a more hands-on approach, using the agent for heavy lifting but manually guiding the script, visual selection, and final composition.
OpenMontage proves that agent-driven video production is not only possible but incredibly cost-effective. It can get you 80% of the way there for 1% of the cost. Closing that final 20% gap, however, still requires a human hand at the wheel.