DEV Community 1h ago

Jetson Nano: Ollama & Optimal Quantization

I am delighted to announce that a user reported a malfunction, which sent me down the rabbit hole of fixing it. Messing around locally is one thing, but building a tolerable app for end users on top of a ‘local’ AI brings other considerations. The fix led to interesting findings about hardware limitations and gave me a high-level understanding of quantization and why it matters. I also explain how to creatively get around those limitations.

The app is called Flippycard, and it helps you study from custom content you upload. Wanna see it? It's here: Flippycard

Problems

Ollama was stuck behind Cloudflare in a local loop.
Ollama needed a config file to be accessible outside of systemctl.
Ollama was running painfully slowly and had to be set to use the GPU, not the CPU.
I had to build Ollama from source and tell it to use the GPU.

Implementations

Make sure Ollama goes through the Cloudflare tunnel.
Create a configuration file so that Ollama is reachable externally.
Build from source to save resources; Docker is too resource-intensive for the Jetson Nano to stay performant.
Increase speed by switching from CPU to GPU - you need to be explicit about these things.
Keep a watchful eye on system resources to make sure the Nano is not at risk of crashing.
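One lightweight way to keep that watchful eye is to read the kernel's own memory report. The sketch below is only an illustration, not what the app actually runs: it parses `/proc/meminfo`-style text, and the 10% threshold and function names are assumptions picked for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kibibytes}."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Fraction of total RAM the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


def is_low_on_memory(info, min_fraction=0.10):
    """True when available RAM drops below the (assumed) safety threshold."""
    return available_fraction(info) < min_fraction


def read_meminfo(path="/proc/meminfo"):
    """Read the live values on a Linux box such as the Jetson."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Calling `is_low_on_memory(read_meminfo())` before kicking off a heavy job is one cheap guard. On a Jetson the CPU and GPU share the same physical RAM, so this single number covers both.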

I also added a Notes section to my new techdocs page to start documenting implementations.

I built from source to spare resources on the Jetson Orin Nano; the build took about 30 minutes. Pushing my 8 GB of RAM to its limit was a very valid concern, so I need to keep everything as lightweight as possible. As much as a Docker container appealed to me, it is slim pickin’s on a budget.

Building from Source Steps

Step 1: Install CUDA toolkit + cmake with:

sudo apt install -y cuda-toolkit-13 cmake

Step 2: Install Go (arm64):

curl -LO https://go.dev/dl/go1.24.4.linux-arm64.tar.gz && sudo tar -C /usr/local -xzf go1.24.4.linux-arm64.tar.gz && rm go1.24.4.linux-arm64.tar.gz

Step 3: Add Go and CUDA to PATH:

echo 'export PATH=$PATH:/usr/local/go/bin:/usr/local/cuda-13/bin' >> ~/.bashrc && export PATH=$PATH:/usr/local/go/bin:/usr/local/cuda-13/bin

Step 4: Clone and build Ollama with sm_87:

git clone https://github.com/ollama/ollama /home/anna/ollama-src && cd /home/anna/ollama-src && CUDA_ARCHITECTURES=87 cmake

This whole process took roughly 30 minutes.

Efficiency

I wanted to share this process easily, so I connected the Gmail MCP server to Claude, which lets me email a summary of the process to myself. Maximizing efficiency on the Jetson Nano was a good choice.

I really don't want to be managing a bunch of documentation from my server, so I've created a new place to document my journeys on my new techdocs page. I built this while Ollama was building.

With resource constraints in mind, I asked Claude to also build me a new page for all my tech findings, but only if doing so wouldn't push my tiny server to its limit. It was essential that Claude knew about the build already in progress. It analyzed the situation, gave me real feedback, and told me what could be done, since my current install was already pushing the limits of the hardware. That insight kept me from crashing the machine while an important process was running. Since everything was already running, I was able to add the new routes with a few lines. Of course I realize I could have done this on another computer, but it’s fun to push the limits of hardware.

Quantization - New Topic (for me)

“Quantization is the process of reducing the precision of a digital signal, typically from a higher-precision format to a lower-precision format.” - Brian Clark, IBM

After I finally got everything to work, a single request took 13 minutes and 20 seconds using the Q8_0 version. This is where I learned about quantization. I then tried another variant, Q4_K_M, to improve the results.
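To make the idea concrete, here is a deliberately simplified sketch of symmetric block quantization: a block of float weights is stored as one scale plus small integers. It is not the actual Q8_0 or Q4_K_M layout (the K-quants in particular are more elaborate); the function names and the sample weights are made up for illustration.

```python
def quantize_block(values, bits):
    """Store a block of floats as one scale plus signed integer codes."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0    # avoid dividing by zero
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the scale and integer codes."""
    return [scale * c for c in codes]


def max_error(values, bits):
    """Largest absolute difference after a quantize/dequantize round trip."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.12, -0.5, 0.33, 0.9, -0.77, 0.05, 0.0, -0.21]
print("8-bit max error:", max_error(weights, 8))
print("4-bit max error:", max_error(weights, 4))
```

Fewer bits means a smaller file and less memory, but each weight lands further from its original value. That is the trade-off behind both the speedup and the reliability issues described here.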

Benchmarks

Since the bottleneck was "model doesn't fit in available GPU memory," we tested a smaller quantization of the exact same model (llama3.2:1b), rather than switching to a different, weaker model family.

Metric	Q8_0 (original)	Q4_K_M (new)
Model file size	1.5 GB	808 MB
GPU layers loaded	3–9 of 17	17 of 17 (100%)
Generation speed	~1.2–1.35 tok/s	~30.7 tok/s
Real 971-token test	13m 20s	~35–45s

That's roughly a 25x speedup, because the entire model now fits on the GPU instead of mostly running on the slow CPU path. Amazing! But what's the catch?
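For anyone wanting to reproduce a tok/s figure like the ones in the table, Ollama's `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds). The sketch below shows one way to turn those into a speed; it is not the script used for the numbers above. The host assumes Ollama's default port, and the model tag is whatever you pulled.

```python
import json
import urllib.request


def tokens_per_second(response):
    """Generation speed from Ollama's eval_count and eval_duration (ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model, prompt, host="http://localhost:11434"):
    """Send one non-streaming request and report its generation speed."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode()
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Dividing the table's own figures gives 30.7 / 1.35 ≈ 22.7 and 30.7 / 1.2 ≈ 25.6, which is where "roughly 25x" comes from.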

Issues

Q4_K_M is fast but produced malformed output sometimes. Before switching, I ran 6 back-to-back test generations with Q4_K_M to check reliability, since lower-precision quantization can be less consistent. Here's what happened:

2 of 6: perfectly valid JSON, correct structure
1 of 6: valid JSON, but used a slightly different field name than expected
3 of 6: malformed JSON (e.g., a mismatched bracket) that would have crashed the app's parser outright

That's roughly a 50–67% failure rate per attempt (3 of 6 malformed, or 4 of 6 if the wrong field name also counts as a failure). I cannot knowingly ship that, even with a massive speed improvement.
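A tally like the one above can be automated with a small classifier. This is a sketch under assumptions: the `question`/`answer` schema is hypothetical (the app's real field names aren't shown here), and the three labels simply mirror the three outcomes in the list.

```python
import json

# Hypothetical schema; the app's real field names may differ.
EXPECTED_FIELDS = {"question", "answer"}


def classify(raw):
    """Label a model reply as 'ok', 'wrong_fields', or 'malformed'."""
    try:
        card = json.loads(raw)
    except json.JSONDecodeError:
        return "malformed"
    if not isinstance(card, dict) or set(card) != EXPECTED_FIELDS:
        return "wrong_fields"
    return "ok"


def failure_rate(labels, strict=True):
    """Share of failed replies; strict mode also counts wrong field names."""
    bad = {"malformed", "wrong_fields"} if strict else {"malformed"}
    return sum(label in bad for label in labels) / len(labels)
```

Fed the six results above (2 ok, 1 wrong field name, 3 malformed), `failure_rate` gives 0.5 in lenient mode and about 0.67 in strict mode.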

Handling the Error-Prone Behavior of Q4

Rather than giving up on the faster model, I added automatic retry logic to the app itself. If the model's response is malformed, the app now silently tries again, up to 3 times, before showing an error.
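The retry idea can be sketched in a few lines. This is not the app's actual code: `generate` stands in for whatever function calls the model, `GenerationFailed` is a made-up exception name, and "up to 3 times" is read here as three attempts in total.

```python
import json


class GenerationFailed(Exception):
    """Raised when every attempt returned unparseable output."""


def generate_with_retry(generate, max_attempts=3):
    """Call generate() until its output parses as JSON or attempts run out.

    Returns (parsed_json, attempts_used).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = generate()
        try:
            return json.loads(raw), attempt
        except json.JSONDecodeError as error:
            last_error = error  # remember why this attempt failed, then retry
    raise GenerationFailed(
        f"no valid JSON after {max_attempts} attempts"
    ) from last_error
```

The caller only sees an error once all attempts are exhausted, which matches the "silent" behaviour described above. Returning the attempt count also makes it easy to log how often retries are really needed.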

Each Q4_K_M attempt takes only about 30–45s, so even a worst case of 3 attempts (roughly 1.5–2.25 minutes) is still far faster than a single guaranteed-slow Q8_0 request at 13m 20s, and the retries push the effective success rate up to roughly 85–95%. This handles the potential parsing errors gracefully. Since Q4_K_M is about 25x faster, the client won’t really feel it.
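The success-rate estimate can be sanity-checked with a little arithmetic. The sketch assumes each attempt fails independently with the same probability, which a small 6-run sample can't really confirm, so treat the numbers as ballpark figures.

```python
def success_within(attempts, failure_rate):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - failure_rate ** attempts


def worst_case_seconds(attempts, seconds_per_attempt):
    """Total wait if every attempt is needed and each takes the slow end."""
    return attempts * seconds_per_attempt


print(success_within(3, 0.5))      # per-attempt failure rate of 50%
print(success_within(3, 2 / 3))    # per-attempt failure rate of about 67%
print(worst_case_seconds(3, 45))   # seconds, versus 800 for one Q8_0 request
```

At a 50% per-attempt failure rate, three attempts succeed 87.5% of the time, which sits inside the 85–95% range. If the wrong-field-name case counts as a failure too (about 67%), three attempts land closer to 70%, so it helps to make the parser tolerant of that case or to allow a fourth attempt.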

Side Note

A month after setting it up, I noticed my Jetson had the “super” abilities. Don’t do what I did. Check for super abilities first. It’s a free download. XD
