Mistral AI Launches API For LLM-Based OCR Of Multimodal Documents

Now available on Mistral’s la Plateforme SaaS, Mistral OCR is designed to digitize complex documents that interleave text with images, tables, mathematical expressions, and advanced layouts. According to the company, this makes it particularly suitable for digitizing scientific research, historical documents and artifacts, user manuals, and more.

Mistral OCR uses Mistral LLMs to interpret the content extracted from a document. This helps it understand the document's context and the relationships between its elements, which makes it suitable for RAG systems that take multimodal documents as input.
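For RAG use, OCR output usually has to be split into retrievable chunks before indexing. As a minimal sketch, assuming each page of an OCR response is a dict with "index" and "markdown" fields, the hypothetical helper below splits every page at markdown headings and tags each chunk with its page index. Heading-based splitting is just one possible chunking strategy, not something Mistral prescribes.

```python
import re


def markdown_chunks(pages: list[dict]) -> list[dict]:
    """Split each page's markdown at headings into RAG-ready chunks.

    Hypothetical helper: `pages` is assumed to be the "pages" list of an
    OCR response, each item carrying an "index" and a "markdown" field.
    """
    chunks = []
    for page in pages:
        # Zero-width split just before any line starting with a heading (#..######)
        sections = re.split(r"(?m)^(?=#{1,6} )", page.get("markdown", ""))
        for section in sections:
            text = section.strip()
            if text:  # skip empty pieces, e.g. before a leading heading
                chunks.append({"page": page.get("index"), "text": text})
    return chunks
```

Keeping the page index on each chunk lets a retrieval system cite where in the source document an answer came from.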

According to the company’s own benchmarking, Mistral OCR outperforms other leading OCR solutions, including Google Document AI, Azure OCR, Gemini 1.5 and 2.0, and GPT-4o.

Mistral AI says that, unlike other models, Mistral OCR comprehends each element of a document (media, text, tables, equations) with unprecedented accuracy and cognition. It takes images and PDFs as input and extracts their content as ordered, interleaved text and images.
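For image input, a local file can be passed inline rather than uploaded. The sketch below assumes, as illustrated in Mistral's documentation, that the OCR endpoint accepts a document of the form {"type": "image_url", "image_url": <URL>} where the URL may be a base64 data URL; image_document is a hypothetical helper name, and the commented usage assumes an authenticated client like the one in the upload snippet.

```python
import base64
import mimetypes


def image_document(data: bytes, filename: str) -> dict:
    """Wrap raw image bytes as an OCR input document (hypothetical helper).

    Assumption: the OCR endpoint accepts {"type": "image_url", "image_url": <URL>},
    where <URL> may be a base64 data URL.
    """
    # Infer the MIME type from the file extension (e.g. .png -> image/png)
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"unrecognized image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return {"type": "image_url", "image_url": f"data:{mime};base64,{encoded}"}


# Usage (requires an authenticated client):
# from pathlib import Path
# scan = Path("scan.png")
# response = client.ocr.process(
#     model="mistral-ocr-latest",
#     document=image_document(scan.read_bytes(), scan.name),
# )
```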

Mistral AI maintains that its OCR API is the only one that extracts embedded images from documents along with text. The extracted text and images are exported as markdown. Structured output in other formats, such as JSON, is also supported, making it possible to chain OCR output into more complex workflows, which can be useful when building agents.

When it comes to multilingual support, Mistral AI emphasizes that its solution can parse, understand, and transcribe thousands of scripts, fonts, and languages.

Mistral OCR already powers le Chat, Mistral’s LLM-based chat assistant, and will soon be available for on-premises deployment. According to the company, it can process up to 2,000 pages per minute on a single node.

To use the Mistral OCR API from Python, install the mistralai package, which handles authentication and gives access to all the capabilities of the Mistral API. To process a local file, you first need to upload it, as shown in the following snippet:

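```python
import json
import os
from pathlib import Path

from mistralai import DocumentURLChunk, Mistral

# Create an authenticated client (API key read from the environment)
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Local PDF to process (path is illustrative)
pdf_file = Path("document.pdf")

# Upload PDF file to Mistral's OCR service
uploaded_file = client.files.upload(
    file={
        "file_name": pdf_file.stem,
        "content": pdf_file.read_bytes(),
    },
    purpose="ocr",
)

# Get a short-lived signed URL for the uploaded file
signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)

# Process PDF with OCR, including embedded images as base64
pdf_response = client.ocr.process(
    document=DocumentURLChunk(document_url=signed_url.url),
    model="mistral-ocr-latest",
    include_image_base64=True,
)

# Convert the response to a plain Python dict via its JSON form
response_dict = json.loads(pdf_response.model_dump_json())
```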
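The per-page markdown in response_dict can then be stitched into a single markdown document with the extracted images inlined. This sketch assumes the response layout used in Mistral's published examples: a "pages" list whose items carry "markdown" text and an "images" list with "id" and "image_base64" fields, where each image_base64 value is already a data URL and the markdown refers to images as ![id](id). combined_markdown is a hypothetical helper name.

```python
def combined_markdown(response_dict: dict) -> str:
    """Join per-page markdown, inlining extracted images as data URLs.

    Assumed layout: {"pages": [{"markdown": str,
                                "images": [{"id": str, "image_base64": str | None}]}]}
    """
    pages = []
    for page in response_dict.get("pages", []):
        text = page.get("markdown", "")
        for image in page.get("images", []):
            data = image.get("image_base64")
            if not data:  # no payload, e.g. when include_image_base64 was not set
                continue
            # Replace the ![id](id) reference with one pointing at the image data
            text = text.replace(
                f"![{image['id']}]({image['id']})",
                f"![{image['id']}]({data})",
            )
        pages.append(text)
    # Separate pages with a blank line
    return "\n\n".join(pages)
```

The result can be written to disk with, for example, Path("document.md").write_text(combined_markdown(response_dict)).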

The API is currently limited to files no larger than 50 MB and no longer than 1,000 pages. Pricing is set at 1,000 pages per US dollar, or 2,000 pages per dollar when using batch OCR.
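Before submitting a document, it can help to check it against these limits and estimate the cost. The sketch below hard-codes the announced figures, which may change; it reads 50 MB conservatively as 50,000,000 bytes, and it takes the page count as an argument because counting PDF pages needs a separate library. check_limits and estimate_cost_usd are hypothetical helper names.

```python
# Announced limits and rates (may change); 50 MB read conservatively as 50,000,000 bytes
MAX_FILE_BYTES = 50_000_000
MAX_PAGES = 1_000
PAGES_PER_USD = 1_000        # standard OCR
BATCH_PAGES_PER_USD = 2_000  # batch OCR


def check_limits(size_bytes: int, page_count: int) -> None:
    """Raise ValueError if a file exceeds the announced API limits."""
    if size_bytes > MAX_FILE_BYTES:
        raise ValueError(f"file is {size_bytes} bytes; limit is {MAX_FILE_BYTES}")
    if page_count > MAX_PAGES:
        raise ValueError(f"file has {page_count} pages; limit is {MAX_PAGES}")


def estimate_cost_usd(page_count: int, batch: bool = False) -> float:
    """Estimate the OCR cost in US dollars at the announced per-page rates."""
    rate = BATCH_PAGES_PER_USD if batch else PAGES_PER_USD
    return page_count / rate
```

At these rates, a 250-page PDF would cost about $0.25 with standard OCR, or about $0.125 in batch mode.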
