Nvidia Ingest is a new microservice for processing document content and extracting metadata into a well-defined JSON schema. Ingest can process PDF, Word, and PowerPoint documents and extract structured information from tables, charts, images, and text using optical character recognition.
To use Nvidia Ingest, you provide it with a JSON job description of the payload to ingest. You can then retrieve the results as a JSON dictionary with metadata for all extracted objects, processing annotations, and timing/trace information.
Nvidia has not provided figures about Ingest performance but says it is scalable and can use multiple processing methods to improve accuracy or increase throughput. For PDF documents, Ingest can use pdfium, Unstructured.io, or Adobe Content Extraction Services.
For example, using nv-ingest-cli, the command line tool used to interact with Nvidia Ingest, you specify how to process a document using the --task argument, which includes an extract_method option:
nv-ingest-cli
...
--task='extract:{"document_type": "pdf", "extract_method": "pdfium", "extract_text": true, "extract_images": true, "extract_tables": true, "extract_tables_method": "yolox"}'
...
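Because the --task value is a task name followed by a JSON object, hand-written quoting is easy to get wrong. The following is a minimal sketch, assuming you drive the CLI from Python. The helper name build_task_arg is hypothetical and not part of nv-ingest-cli; the option keys are the ones shown in the command above.

```python
import json
import shlex


def build_task_arg(task_name: str, options: dict) -> str:
    """Return a --task argument of the form --task=<name>:<json>.

    Hypothetical helper for illustration, not part of nv-ingest-cli.
    """
    return f"--task={task_name}:{json.dumps(options)}"


# Option keys and values copied from the example command.
extract_options = {
    "document_type": "pdf",
    "extract_method": "pdfium",
    "extract_text": True,
    "extract_images": True,
    "extract_tables": True,
    "extract_tables_method": "yolox",
}

task_arg = build_task_arg("extract", extract_options)
# shlex.quote adds the shell quoting needed to paste the argument into a terminal.
print(shlex.quote(task_arg))
```

The printed argument wraps the whole flag in single quotes rather than only the value, which the shell treats the same way.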
Nvidia explicitly states that Ingest does not run a static pipeline, that is, a fixed sequence of operations applied to every document in the payload. Instead, you choose which pre- or post-processing transformations to run, including text splitting and chunking, filtering, embedding generation, and image offloading, by passing multiple --task arguments in the same nv-ingest-cli execution. For example, you can add a dedup (de-duplication) step by using:
nv-ingest-cli
...
--task='extract:{...}'
--task='dedup:{"content_type": "image", "filter": true}'
...
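Repeating --task is easier to manage when the tasks are kept as data. The following is a sketch, again assuming a Python wrapper around the CLI. The helper task_args is hypothetical, the extract options are abbreviated, and the other arguments the command needs are left out here just as they are elided above.

```python
import json

# Tasks in the order their flags should appear; options come from the article's examples.
tasks = [
    ("extract", {"document_type": "pdf", "extract_method": "pdfium", "extract_text": True}),
    ("dedup", {"content_type": "image", "filter": True}),
]


def task_args(task_list):
    """Return one --task=<name>:<json> argument per task (hypothetical helper)."""
    return [f"--task={name}:{json.dumps(options)}" for name, options in task_list]


args = task_args(tasks)
for arg in args:
    print(arg)
```

If the list is passed to the tool as part of an argument vector rather than through a shell, no additional quoting is needed.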
The tool can be used on a single document specified with the --doc argument or on a set of documents simultaneously by providing a JSON-formatted dictionary describing the batch payload.
All extracted data are stored in an output directory containing a subdirectory for each document type, e.g., image, text, structured, etc. Each ingested document generates a JSON metadata file with the extracted content; source metadata including source name, location, type, etc.; and content metadata. Content metadata includes both general and type-specific content metadata. For example, for images, you get the image type, any caption, the location, size, and so on; for text, you get a summary, a list of keywords, the language, etc.; for tables, you get the format, location, the content as text, any caption or title, etc.
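One way to inspect a run is to count the metadata files per content type. The sketch below relies only on the layout described above, one subdirectory per type. The .json file extension is an assumption, and summarize_output is a hypothetical helper.

```python
from pathlib import Path


def summarize_output(output_dir):
    """Count metadata files in each content-type subdirectory of output_dir."""
    counts = {}
    for subdir in sorted(Path(output_dir).iterdir()):
        if subdir.is_dir():
            # Assumption: metadata files carry a .json extension.
            counts[subdir.name] = len(list(subdir.glob("*.json")))
    return counts
```

Calling summarize_output on the output directory returns a dictionary that maps each subdirectory name to the number of metadata files it holds.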
Nvidia Ingest requires a number of supporting services, both from Nvidia and open-source projects, including redis, yolox, otel-collector for OpenTelemetry, prometheus, grafana, and more. They are packaged as a Docker Compose application to make deployment easier. It also requires support for CUDA and the Nvidia Container Toolkit and a minimum of two H100 or A100 GPUs with at least 80GB of memory.
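The GPU requirement can be checked before deployment. The following is a sketch, assuming the input is the text printed by nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits, which lists one "name, memory in MiB" line per GPU. The function name, the reading of 80GB as a per-GPU decimal figure, and matching the model by substring are all assumptions made for illustration.

```python
# Thresholds taken from the stated requirement. Assumption: 80 GB is read as
# decimal gigabytes per GPU and converted to MiB, the unit nvidia-smi reports.
MIN_GPUS = 2
MIN_MEMORY_MIB = 80 * 1000**3 // 1024**2
SUPPORTED_MODELS = ("H100", "A100")


def meets_gpu_requirement(smi_output: str) -> bool:
    """Return True if enough supported GPUs with enough memory are listed."""
    qualifying = 0
    for line in smi_output.strip().splitlines():
        name, _, memory = line.rpartition(",")
        if not memory.strip().isdigit():
            continue  # skip lines that are not "name, memory" pairs
        if any(m in name for m in SUPPORTED_MODELS) and int(memory) >= MIN_MEMORY_MIB:
            qualifying += 1
    return qualifying >= MIN_GPUS
```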