Multimodal Language Models for Optical Character Recognition

A curated directory of multimodal language models and OCR engines, covering document AI, vision-language understanding, and optical character recognition for agentic and document-processing workflows.

  1. MinerU Transforms complex documents like PDFs and Office docs into LLM-ready markdown/JSON for your agentic workflows.
  2. Boost.ai Transform customer service and internal support with a no-code AI agent platform that streamlines support, boosts revenue, and improves customer satisfaction.
  3. GLM-OCR A multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. Introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization.
  4. PaddleOCR An OCR toolkit with proven performance in real-world applications; it already powers popular open-source projects such as Umi-OCR, OmniParser, MinerU, and RAGFlow.
  5. DeepSeek OCR A two-stage transformer-based document AI that compresses page images into compact vision tokens before decoding them with a high-capacity mixture-of-experts language model. Stage 1 merges a windowed SAM vision transformer with a dense CLIP-Large encoder and a 16× convolutional compressor; Stage 2 uses the DeepSeek-3B-MoE decoder (~570M active parameters per token) to reconstruct text, HTML, and figure annotations with minimal loss.
  6. Mistral OCR Digitize PDFs, scans, DOCX, PPTX, and handwritten sources. Extract to structured JSON with custom templates, parse forms, classify documents, and process images down to charts, signatures, and fine print.
  7. Qwen3.7-Plus Delivers improved OCR across more languages and in complex scenes.
  8. Gemini 3.1 Flash-Lite A low-latency, cost-effective multimodal model optimized for high-frequency, lightweight tasks. Supports text, image, video, audio, and PDF inputs, and is designed for high-volume agentic workflows, simple data extraction, and applications where latency and API cost are the primary constraints.
  9. Tesseract OCR Tesseract 4 adds a new neural net (LSTM) based OCR engine focused on line recognition, while still supporting the legacy Tesseract 3 engine that works by recognizing character patterns.
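To make the DeepSeek OCR entry's 16× compression concrete, here is a rough back-of-the-envelope sketch of how many vision tokens reach the decoder for a square page image. The 16-pixel patch size and the function name `vision_token_count` are assumptions for illustration; only the 16× compression factor comes from the description above.

```python
def vision_token_count(width, height, patch_size=16, compression=16):
    """Estimate vision tokens left after patchifying and compressing an image.

    patch_size=16 is an assumed value, not stated in the directory entry;
    compression=16 is the 16x convolutional compressor from the entry.
    """
    if width % patch_size or height % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    # Number of patch tokens produced by the vision transformer front end.
    patches = (width // patch_size) * (height // patch_size)
    # The compressor reduces the token count by a fixed factor.
    return patches // compression


for side in (512, 640, 1024, 1280):
    print(f"{side}x{side} px -> {vision_token_count(side, side)} vision tokens")
# 512x512 px -> 64 vision tokens
# 640x640 px -> 100 vision tokens
# 1024x1024 px -> 256 vision tokens
# 1280x1280 px -> 400 vision tokens
```

Under these assumptions, a 1024×1024 page shrinks from 4096 patch tokens to 256 vision tokens before decoding, which is what keeps the decoder's input short.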
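For the Tesseract entry, the choice between the LSTM engine and the legacy engine is made with the command-line `--oem` (OCR engine mode) option. The sketch below only builds the command and does not run it; the helper name `build_tesseract_command` and the file name `scan.png` are hypothetical, and it assumes a Tesseract 4 or later binary named `tesseract` is on the PATH.

```python
# OCR engine modes accepted by Tesseract 4+ via --oem.
OEM_MODES = {
    "legacy": 0,    # legacy character-pattern engine only
    "lstm": 1,      # neural net (LSTM) engine only
    "combined": 2,  # legacy and LSTM together
    "default": 3,   # whatever the installed data supports
}


def build_tesseract_command(image_path, engine="lstm", lang="eng"):
    """Return an argument list for the tesseract CLI (hypothetical helper)."""
    if engine not in OEM_MODES:
        raise ValueError(f"unknown engine {engine!r}; expected one of {sorted(OEM_MODES)}")
    # "stdout" as the output base makes tesseract print the text instead of writing a file.
    return ["tesseract", image_path, "stdout", "-l", lang, "--oem", str(OEM_MODES[engine])]


cmd = build_tesseract_command("scan.png", engine="lstm")
print(" ".join(cmd))
# tesseract scan.png stdout -l eng --oem 1
```

The list can be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` to capture the recognized text. The legacy mode only works when the installed language data still includes the legacy model, so treat `--oem 0` as dependent on the local installation.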