DEV Community

I Tested China's Top 4 AI Models for My Side Hustle - Here's What Won

The Billable Hours Math I Did First

Before running any actual prompts, I calculated what I was spending per typical project. Most of my client work falls into three buckets:

  • Long-form blog content (~3,000 output tokens per piece)
  • Code refactoring and generation (~1,500 tokens per task)
  • Translation and localization (~2,000 tokens per document)

Running that through GPT-4o at $10.00/M output tokens:

  • Blog: 3,000 tokens × $10/M = $0.03
  • Code: 1,500 tokens × $10/M = $0.015
  • Translation: 2,000 tokens × $10/M = $0.02

That's roughly $0.065 for one task of each type. Across 100 rounds of that a month, it comes to $6.50 - which doesn't sound bad until you remember that longer documents and image analysis were eating context and pushing my real bill way higher. Once context and vision calls were counted, my actual cost per project was closer to $0.30-$0.50, and I run maybe 200-300 tasks a month on the side hustle, on top of contract work.

Switching even half of those to a $0.25/M model would drop my all-in cost to roughly $0.025 per task. Multiply by 150 tasks: $3.75 instead of $30. That's the difference between buying a new domain name and renewing my Adobe subscription. The math made sense. Now I just had to figure out which model wouldn't tank my quality.
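The per-task arithmetic above is easy to script. Here's a minimal sketch that covers output tokens only, so it won't reproduce the real-bill figures that include context and vision calls. The function and dictionary names are made up for illustration; the token counts and prices are the ones quoted above.

```python
# Output-token cost only; input/context and vision charges are not modeled.
TASK_OUTPUT_TOKENS = {
    "blog": 3_000,
    "code": 1_500,
    "translation": 2_000,
}


def output_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of `tokens` output tokens at a given $/M price."""
    return tokens * price_per_million / 1_000_000


def cost_table(price_per_million: float) -> dict[str, float]:
    """Output cost per task type at a single $/M price."""
    return {
        task: output_cost(tokens, price_per_million)
        for task, tokens in TASK_OUTPUT_TOKENS.items()
    }


if __name__ == "__main__":
    for label, price in [("GPT-4o", 10.00), ("$0.25/M model", 0.25)]:
        table = cost_table(price)
        print(label, {task: round(cost, 6) for task, cost in table.items()})
```

Swapping in a different price or token estimate is a one-line change, which makes it easy to rerun the comparison whenever a provider updates its pricing.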

DeepSeek: My New Default

I started with DeepSeek because it had the loudest buzz and the pricing that seemed almost too good to be true.

Models I tested:

  • V4 Flash at $0.25/M - the daily driver
  • V3.2 at $0.38/M - newer architecture
  • V4 Pro at $0.78/M - when I needed production polish
  • R1 at $2.50/M - the reasoning model for hairy logic problems
  • Coder at $0.25/M - code-specialized

V4 Flash at a quarter per million tokens is genuinely absurd. I threw my usual suite of blog rewrites and code explanations at it, and the quality was indistinguishable from GPT-4o for about 90% of what I needed.

Where it shines: code generation (it scored top-tier on HumanEval and MBPP from what I'd read, and my practical tests confirmed it), speed (around 60 tokens per second, which is the fastest of the four), and English-language work. My American clients couldn't tell the difference.

The weaknesses are real though. There's no native vision support, so anything image-related goes elsewhere. And on Chinese-language tasks, GLM and Kimi both edged it out in my blind tests. I had a client localization project where the tone was slightly off, and switching to GLM fixed it immediately.

For pure billable hour efficiency on English work? DeepSeek V4 Flash is now my go-to. Saving $0.30-$0.40 per task across hundreds of tasks adds up to actual rent money.

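from openai import OpenAI

# Global API exposes an OpenAI-compatible endpoint, so the standard OpenAI SDK works.
client = OpenAI(
    api_key="ga_xxxxxxxxxxxx",  # placeholder; substitute your own key
    base_url="https://global-apis.com/v1",
)

# Send a single-turn request to DeepSeek V4 Flash.
response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "user", "content": "Rewrite this product description for a SaaS landing page, punchy and under 80 words"}
    ],
)
print(response.choices[0].message.content)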

Qwen: The One With Everything

If DeepSeek is my main squeeze, Qwen is the Swiss Army knife I keep in my other pocket. Alibaba has been absolutely cranking out model variants, and the lineup reflects it.

Models I tested:

  • Qwen3-8B at $0.01/M - penny-per-million absurdity
  • Qwen3-32B at $0.28/M - my workhorse general model
  • Qwen3-Coder-30B at $0.35/M - solid code generation
  • Qwen3-VL-32B at $0.52/M - vision tasks
  • Qwen3-Omni-30B at $0.52/M - multimodal magic
  • Qwen3.5-397B at $2.34/M - when I need enterprise-grade reasoning

That Qwen3-8B at one cent per million output tokens? I genuinely thought it was a typo. It isn't. That model handles classification, short summaries, and simple extraction tasks for less than the cost of a rounding error. For a freelancer doing high-volume repetitive tasks, this is the dream.

The vision and omni-modal support is what really sealed it for me. When a client sends me product photos and asks for alt-text or marketing copy, Qwen3-VL handles it. When I need to extract structured data from screenshots, Qwen3-Omni does the job. No other model family gives me this kind of breadth.
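For the alt-text case, the request needs to carry the image alongside the prompt. Here's a rough sketch of how that message could be built, assuming the endpoint accepts the standard OpenAI-style `image_url` content part with a base64 data URL. I haven't confirmed that for every provider, and both the helper name and the model ID in the usage comment are guesses.

```python
import base64


def build_alt_text_messages(image_bytes: bytes, mime_type: str = "image/jpeg") -> list[dict]:
    """Build an OpenAI-style chat message carrying one inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime_type};base64,{encoded}"
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Write concise alt-text for this product photo."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }
    ]


# Usage (not executed here; the model ID is an assumption, check the provider's model list):
# with open("product.jpg", "rb") as f:
#     messages = build_alt_text_messages(f.read())
# response = client.chat.completions.create(model="Qwen/Qwen3-VL-32B", messages=messages)
```

Inlining the image as a data URL avoids having to host client photos anywhere public, at the cost of a larger request body.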

Weaknesses: the naming is genuinely confusing. Qwen3.5, Qwen3.6, Qwen3-Coder-30B, Qwen3-VL-32B - I'm constantly second-guessing which one I'm picking. And on English-only creative work, DeepSeek feels slightly more natural. Also, some of the mid-tier models are priced aggressively, but a couple (looking at you, Qwen3.6-35B at $1/M) feel overpriced for what you get.

Still, if I had to pick ONE provider for everything, it'd be Qwen.

# Switching to Qwen3-32B for general tasks: reuse the client from earlier and change only the model name
response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted lists efficiently"}
    ]
)
print(response.choices[0].message.content)
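Since the only thing that changes between providers is the model string, the call can be wrapped once and reused. A small sketch; the `ask` helper is my own name, not something from the SDK, and it expects a client configured like the one shown earlier.

```python
def ask(client, model: str, prompt: str) -> str:
    """Send one user prompt to `model` and return the reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Usage with an already-configured OpenAI-compatible client:
# print(ask(client, "deepseek-v4-flash", "Explain this regex: ^\\d{3}-\\d{4}$"))
# print(ask(client, "Qwen/Qwen3-32B", "Explain this regex: ^\\d{3}-\\d{4}$"))
```

Running the same prompt through two models this way is also a cheap way to do side-by-side quality checks.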

Kimi: The Premium Brain

Kimi is the one I reach for when the problem actually requires thinking. Moonshot AI built this thing specifically for reasoning tasks, and it shows.

Models I tested:

  • K2.5 at $3.00/M - their flagship
  • K2.6 at $3.50/M - top-tier reasoning

Yeah, Kimi is expensive. $3.00-$3.50/M puts it in GPT-4 territory. So why bother? Because when I'm debugging a gnarly algorithmic problem for a client, or working through a multi-step business logic refactor, Kimi consistently outperforms everything else. It's slower - three stars for speed in my testing - but the reasoning quality is noticeably better.

I had a project involving a complex state machine migration where Kimi got the architecture right on the first try, while DeepSeek and Qwen needed two or three iterations to nail it.

For my Chinese-language clients, Kimi is also excellent - five stars, tied with GLM at the top. The fluency on technical Mandarin content is genuinely impressive.

The catch is that Kimi doesn't have a budget tier. There's no "Kimi Mini" or "Kimi Lite." It's all premium pricing, all the time. So I use it sparingly - maybe 10-15% of my monthly tasks are Kimi-routed. For those tasks, the higher cost pays for itself in fewer iteration cycles and fewer billable hours wasted on bad outputs.

GLM: The Chinese-Language Champion

Zhipu AI's GLM lineup is the one that surprised me most. I'd heard of it but never really tested it until this exercise.

Models I tested:

  • GLM-4-9B at $0.01/M - same absurd pricing as Qwen's smallest
  • GLM-4.6V at modest pricing - their vision model
  • GLM-5 at $1.92/M - the flagship

GLM-5 at $1.92/M is priced aggressively against Kimi's $3.00/M, and in my testing on Chinese-language creative writing and technical content, GLM-5 actually edged out Kimi for tonal accuracy. When a Beijing-based client asked me to rewrite their app's onboarding flow in natural, idiomatic Mandarin, GLM-5 nailed it on the first shot. Kimi was close but slightly stiffer.

GLM-4-9B at one cent per million? That's the same eye-popping pricing as Qwen3-8B. For high-volume Chinese text processing - extracting data from Chinese documents, classification, short summaries - this thing is unbeatable on cost.

The weaknesses: code generation is its weakest area (three stars in my tests), and the speed is decent but not DeepSeek-fast. But for Chinese content, especially anything culturally nuanced, GLM is my first call.

My Actual Routing Setup Now

After three weeks of real client work routed through these models, here's how I split tasks to keep my cost per billable hour down:

  • 70% of tasks → DeepSeek V4 Flash ($0.25/M) - English blog content, code explanations, general Q&A, simple refactoring
  • 20% of tasks → Qwen3-32B ($0.28/M) - when the task benefits from Qwen's broader world knowledge; image work goes to Qwen3-VL-32B ($0.52/M) instead
  • 5% of tasks → Kimi K2.5 ($3.00/M) - complex reasoning, algorithmic work, multi-step logic
  • 5% of tasks → GLM-5 ($1.92/M) - Chinese-language client work, localization, cultural nuance
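The split above boils down to a lookup from task type to model. A minimal sketch, with caveats: the task labels are my own shorthand, and only the DeepSeek and Qwen3-32B model IDs appear in the code earlier in this post. The other IDs are hypothetical placeholders to check against the provider's model list.

```python
# Task type -> model ID. Labels are illustrative shorthand, not an official taxonomy.
ROUTES = {
    "english_content": "deepseek-v4-flash",
    "code_explanation": "deepseek-v4-flash",
    "simple_refactor": "deepseek-v4-flash",
    "general_knowledge": "Qwen/Qwen3-32B",
    "vision": "Qwen/Qwen3-VL-32B",   # hypothetical ID
    "reasoning": "kimi-k2.5",        # hypothetical ID
    "chinese_content": "glm-5",      # hypothetical ID
}

# Anything unclassified falls back to the cheap daily driver.
DEFAULT_MODEL = "deepseek-v4-flash"


def pick_model(task_type: str) -> str:
    """Return the model ID for a task type, defaulting to the cheapest workhorse."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("english_content", "reasoning", "something_else"):
        print(task, "->", pick_model(task))
```

Defaulting to the cheapest model means a mislabeled task costs almost nothing, and it can always be rerun on a pricier model if the output falls short.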

My monthly API bill dropped from $84 to about $18-$22. That's $60+ back in my pocket every month. For a freelancer running a side hustle, that's literally the difference between this being a fun hobby and being a real business.
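As a rough sanity check on that mix, here's the blended output price it implies. This assumes every task emits about the same number of output tokens, which is a simplification, and it ignores input and vision charges, so it's not a reconstruction of the bill itself.

```python
# (label, share of tasks, output price in $/M) taken from the routing split above.
ROUTING_MIX = [
    ("DeepSeek V4 Flash", 0.70, 0.25),
    ("Qwen3-32B", 0.20, 0.28),
    ("Kimi K2.5", 0.05, 3.00),
    ("GLM-5", 0.05, 1.92),
]


def blended_price(mix) -> float:
    """Share-weighted average output price in $/M; shares must sum to 1."""
    total_share = sum(share for _, share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError(f"shares sum to {total_share}, expected 1.0")
    return sum(share * price for _, share, price in mix)


if __name__ == "__main__":
    print(round(blended_price(ROUTING_MIX), 3))  # prints 0.477
```

Notice that the two premium models account for only 10% of tasks but a bit over half of the blended price, which is why keeping them on a short leash matters.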

The Real Talk: Which One Should You Pick?

If you're billing clients and watching every dollar, here's my honest breakdown:

  • Just need a cheap English workhorse? DeepSeek V4 Flash. Done. Move on.
  • Need vision and multimodal in your workflow? Qwen. Nothing else comes close in this price range.
  • Doing serious reasoning work? Kimi K2.5. Pay the premium, save the billable hours.
  • Chinese content at scale? GLM-5 for premium, GLM-4-9B for high-volume cheap work.
  • Want one provider for everything? Qwen, no contest.

How I'm Routing All of It

The thing that made this whole experiment viable was Global API. Instead of juggling four different API keys, four different dashboards, and four different billing systems, I'm using a single OpenAI-compatible endpoint at https://global-apis.com/v1. One key, one invoice, one place to track spending across all four providers. For a freelancer, that's not just convenient - it's the only reason I can keep four model families in rotation at all.
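With everything going through one client, it's also easy to keep a local tally of spend per model next to the invoice. A sketch under some assumptions: the `SpendLedger` class is my own invention, the prices are the output rates quoted in this post, and it assumes responses carry the standard OpenAI-style `usage.completion_tokens` field, which an OpenAI-compatible endpoint should return but may not always.

```python
from collections import defaultdict

# Output prices in $/M from this post; extend with whatever models you route to.
OUTPUT_PRICE_PER_M = {
    "deepseek-v4-flash": 0.25,
    "Qwen/Qwen3-32B": 0.28,
}


class SpendLedger:
    """Accumulates output tokens per model and converts them to dollars."""

    def __init__(self, prices: dict[str, float]):
        self.prices = prices
        self.output_tokens = defaultdict(int)

    def record(self, model: str, completion_tokens: int) -> None:
        """Add one response's output tokens; unknown models are rejected."""
        if model not in self.prices:
            raise KeyError(f"no price configured for {model!r}")
        self.output_tokens[model] += completion_tokens

    def cost_by_model(self) -> dict[str, float]:
        return {
            model: tokens * self.prices[model] / 1_000_000
            for model, tokens in self.output_tokens.items()
        }

    def total(self) -> float:
        return sum(self.cost_by_model().values())


# Usage after a real call (usage may be None if the provider omits it):
# if response.usage is not None:
#     ledger.record("deepseek-v4-flash", response.usage.completion_tokens)
```

This only counts output tokens, so treat it as a lower bound and reconcile against the actual invoice.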

If you're running your own side hustle or juggling client work and want to cut your API costs without cutting quality, the math is clear: the Chinese models are ready, and they're absurdly cheap.
