Reddit - r/MachineLearning

We'll benchmark an open-weight LLM on any GPU you choose - drop your model + hardware and we'll run it. [D]

We run HexGrid Cloud, a platform for deploying open-source models on GPUs, and we're heads-down optimizing our serving/deployment layer. To pressure-test it we're benchmarking real models under real concurrency - and instead of guessing, we'd rather run what you actually want to see.

Models Available for Benchmarking

  • Nemotron-3 Super 120B-A12B (only NVFP4)
  • Nemotron-3 Nano 30B A3B
  • Qwen-3.6 27B
  • Llama 3.3 70B Instruct
  • Gemma-4 31B
  • Devstral-Small-2-24B-Instruct-2512
  • ?? (suggest a model)

We're focused on chat/instruct models for now (that's what most of our users deploy), so pick one from the list above - or suggest another open-weight chat model that fits on a single H200 (141GB).

Hardware & Quant Choices

GPU (up to H200 for this round):

  • RTX PRO 6000
  • L40S
  • H100
  • H200

Quant: FP8 / AWQ / BF16

Context length: 8K, 32K, 64K, 128K
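
If you're unsure whether a model + quant + context combo fits on one card, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: AWQ is treated as plain 4-bit (0.5 bytes per parameter, ignoring scales and zero-points), the KV cache is assumed to be stored in 16-bit, and activation/runtime overhead is left as a parameter. The function names and the layer/head numbers in the demo are illustrative placeholders, not values from any model card.

```python
# Approximate bytes per parameter for each format.
# Assumption: AWQ is treated as 4-bit, ignoring scale/zero-point overhead.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "AWQ": 0.5}


def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return params_billion * BYTES_PER_PARAM[quant]


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache size in GB for `tokens` cached tokens.

    The leading factor of 2 accounts for storing both K and V per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9


def fits(params_billion: float, quant: str, gpu_gb: float,
         kv_gb: float = 0.0, overhead_gb: float = 0.0) -> bool:
    """True if weights + KV cache + overhead fit in the given GPU memory."""
    return weight_gb(params_billion, quant) + kv_gb + overhead_gb <= gpu_gb


if __name__ == "__main__":
    H200_GB = 141  # single H200, as stated in the post
    # Illustrative architecture numbers for a dense 70B model (assumed).
    kv = kv_cache_gb(tokens=32 * 1024, layers=80, kv_heads=8, head_dim=128)
    for quant in ("BF16", "FP8", "AWQ"):
        w = weight_gb(70, quant)
        ok = fits(70, quant, H200_GB, kv_gb=kv)
        print(f"{quant}: weights ~{w:.0f} GB, KV ~{kv:.1f} GB, fits={ok}")
```

The takeaway is that a 70B model in BF16 needs about 140 GB for weights alone, so on a 141GB card there is essentially no room left for KV cache, while FP8 or 4-bit leaves headroom for longer contexts and more concurrent requests.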

What You Want Measured

  • Max throughput?
  • Single-stream speed?
  • Long-context prefill?

We'll run the top picks and post full results - tokens/sec, TTFT, TPOT, throughput under concurrency, and cost-per-million-tokens - config and flags included so it's reproducible. Let us know in the comments.
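
For anyone who wants to sanity-check the numbers, here is a minimal sketch of one common way these metrics are derived from per-token timestamps. Definitions vary between benchmarking tools, so treat this as an illustration, not necessarily the exact method behind the posted results. `RequestTrace`, the function names, and the hourly GPU price in the demo are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent, in seconds
    token_times: List[float]  # arrival time of each output token, in seconds


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay between sending and the first output token."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first one."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def aggregate_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens/sec across all concurrent requests (wall-clock window)."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_sec: float) -> float:
    """USD per one million output tokens at a sustained throughput."""
    return gpu_hourly_usd / (tokens_per_sec * 3600) * 1e6


if __name__ == "__main__":
    traces = [
        RequestTrace(0.0, [0.5, 1.0, 1.5, 2.0]),
        RequestTrace(0.0, [1.0, 2.0, 3.0, 4.0]),
    ]
    tps = aggregate_throughput(traces)
    print(f"TTFT: {ttft(traces[0]):.2f}s, TPOT: {tpot(traces[0]):.2f}s")
    print(f"Throughput: {tps:.2f} tok/s")
    # Hypothetical GPU price, for illustration only.
    print(f"Cost: ${cost_per_million_tokens(3.60, 1000):.2f} per 1M tokens")
```

Single-stream speed corresponds roughly to 1 / TPOT for one request at a time, while max throughput is the aggregate figure under concurrency, which is why the two can point to different "best" configs.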
