Reddit - r/MachineLearning

Find the best open-source OCR models in one place at Papers with Code [P]

Overview

I've created an overview of the most important OCR benchmarks, along with the top open models and links to their papers and code: https://paperswithcode.co/tasks/ocr.

This week, new OCR models were released by Baidu and Mistral. Baidu released Unlimited OCR, a 3B-parameter model that introduces a key innovation called Reference Sliding Window Attention (R-SWA) and builds on top of DeepSeek OCR. Mistral released OCR 4, which is available via an API.

What is OCR?

OCR, or Optical Character Recognition, is the task of converting PDFs and scanned documents into machine-readable text. There's a lot of interest in this task because it lets companies ingest all of their data for agentic use cases. AI agents love Markdown, so it can be valuable to turn all those messy PDF documents into a standardized, machine-readable format. This enables use cases like agentic RAG (retrieval-augmented generation), which powers chatbots, both internally and for external customer support.
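To make the Markdown point a bit more concrete, here's a minimal sketch of one thing you might do with OCR output before feeding it to a RAG pipeline: split it into one chunk per heading. The function name `split_markdown_by_heading` and the sample text are made up for illustration. This isn't taken from any of the OCR models mentioned here, and a real pipeline would also need to handle things like `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown text into chunks, one per heading section.

    Hypothetical helper for illustration only. Text that appears before
    the first heading is kept as a chunk whose title is None.
    """
    chunks = []
    title = None
    lines = []

    def flush():
        # Store the section collected so far, skipping an empty preamble.
        body = "\n".join(lines).strip()
        if title is not None or body:
            chunks.append({"title": title, "text": body})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section.
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)

    flush()  # Close the final section.
    return chunks


if __name__ == "__main__":
    sample = "Intro text\n\n# Invoice\nTotal: 42 EUR\n\n## Notes\nPaid in full.\n"
    for chunk in split_markdown_by_heading(sample):
        print(chunk)
```

Each chunk keeps its heading as a title, which is handy as metadata when you embed and retrieve the text later.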

Navigating the Landscape

With a large number of OCR releases on Hugging Face over the last few months, it may be hard to know which one to use. Hence, I've built this page, which lists the major OCR benchmarks, along with the top-performing models and links to their code. The page lives on Papers with Code, the website I maintain (it's a revival of the old website, which was taken down).

Recommended Benchmarks

The top recommended benchmarks are:

  • OlmOCRBench, created by Ai2
  • OmniDocBench, created by Shanghai AI Laboratory

Current Top Recommendations

  • Chandra OCR 2 by Datalab
  • Mistral OCR v4

The former is openly available, so you can either self-host it or use Datalab's serverless API.

Let me know which other tasks you'd like to see major benchmarks for!

Cheers,
Niels
open-source @ HF
