IBM's Docling hits 42,000 GitHub stars for structured document extraction in RAG pipelines

Docling, IBM Research's open-source document parser, converts complex enterprise PDFs into structured JSON for retrieval-augmented generation (RAG) systems. The library has reached 42,000 GitHub stars and 1.5M monthly PyPI downloads by preserving tables, hierarchies, and layout, which addresses the accuracy gaps that traditional text extractors create in RAG pipelines.

IBM's Docling Gains Traction for Enterprise RAG Pipelines

Docling, IBM Research's open-source document processing library, has gained significant adoption for retrieval-augmented generation implementations. The tool addresses a persistent problem: traditional PDF extractors mangle complex layouts, turning structured tables into unusable text strings.

The numbers suggest real usage. 42,000 GitHub stars. 1.5 million monthly PyPI downloads. 2,400 organizations implementing it. IBM donated the project to the Linux Foundation's AI & Data Foundation in April 2025. It's now embedded in IBM Granite, Red Hat InstructLab, Watsonx.ai, and OpenSearch pipelines.

What It Actually Does

Docling converts PDFs, DOCX, HTML, images, and audio into structured JSON or Markdown while preserving document semantics. Tables stay tables. Section hierarchies remain intact. Multi-column layouts are read column by column rather than straight across the page.
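As a rough illustration of that conversion step, the sketch below follows Docling's published quickstart pattern as best understood here (a DocumentConverter whose result exposes Markdown and dictionary exports). Treat the exact calls as an assumption to check against the version you install. The wrapper name convert_document and the file name report.pdf are hypothetical.

```python
def convert_document(source: str) -> dict:
    """Convert one file path or URL into Markdown and a JSON-like dict.

    Assumes the `docling` package is installed; the calls below mirror its
    quickstart and should be verified against your installed version.
    """
    # Imported lazily so this module still loads where Docling is absent.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(source)
    doc = result.document
    return {
        "markdown": doc.export_to_markdown(),  # headings and tables as Markdown
        "json": doc.export_to_dict(),          # structured form, ready for json.dumps
    }


# Hypothetical usage:
# outputs = convert_document("report.pdf")
# print(outputs["markdown"][:500])
```

Keeping both exports side by side is a convenience for inspection only. A pipeline would normally pick whichever form its indexer consumes.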

The library uses models trained on 81,000 labeled pages for layout analysis. That training shows in table extraction accuracy, an area where basic PDF parsers typically fail. It handles financial reports, technical specifications, and other documents where structure matters.

IBM has processed 2.1 million Common Crawl PDFs with it. They're planning to run it across 1.8 billion documents for Granite multimodal training.

The RAG Pipeline Context

RAG systems retrieve relevant document chunks to provide context for LLM responses. The quality of chunk boundaries directly affects answer accuracy. Split a table mid-row, and your system can't answer "What was Q3 revenue?" reliably.
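The effect of chunk boundaries is easy to reproduce with a toy example. The sketch below is not Docling code. It compares a naive fixed-size splitter with a splitter that emits one chunk per document element, using a made-up three-element document and an arbitrary 60-character chunk size.

```python
from typing import List, Tuple

# Hypothetical pre-parsed document: (element kind, text) pairs.
ELEMENTS: List[Tuple[str, str]] = [
    ("paragraph", "Quarterly results are summarized below."),
    ("table", "Quarter | Revenue\nQ1 | 10.2\nQ2 | 11.0\nQ3 | 12.4\nQ4 | 13.1"),
    ("paragraph", "Revenue grew in every quarter."),
]


def fixed_size_chunks(text: str, size: int) -> List[str]:
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def structural_chunks(elements: List[Tuple[str, str]]) -> List[str]:
    """Emit one chunk per element, so a table is never split."""
    return [text for _kind, text in elements]


def chunks_with_full_table(chunks: List[str]) -> List[str]:
    """Chunks that hold both the table header and the Q3 row."""
    return [c for c in chunks if "Quarter | Revenue" in c and "Q3 | 12.4" in c]


flat = "\n".join(text for _kind, text in ELEMENTS)
naive = fixed_size_chunks(flat, 60)
structured = structural_chunks(ELEMENTS)

print(len(chunks_with_full_table(naive)))       # 0: header and Q3 row land in different chunks
print(len(chunks_with_full_table(structured)))  # 1: the table survives as a unit
```

With the naive split, a retriever can return the Q3 row without the header that says the number is revenue, which is the failure mode described above.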

Docling chunks by semantic units - complete sections, full tables, intact paragraphs with their headers. Each chunk includes metadata: section hierarchy, page number, content type, document position. That metadata enables filtering ("search only tables") or prioritization ("boost executive summary matches").
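To show how such metadata can be used at query time, here is a minimal sketch with a hypothetical Chunk record. The field names, the keyword-count scoring, and the 2x boost are illustrative assumptions, not Docling's schema or any retriever's actual ranking.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Chunk:
    text: str
    section_path: Tuple[str, ...]  # heading trail, outermost first
    page: int
    content_type: str              # e.g. "table" or "paragraph"
    position: int                  # order within the document


def only_type(chunks: Sequence[Chunk], content_type: str) -> List[Chunk]:
    """Metadata filter, e.g. 'search only tables'."""
    return [c for c in chunks if c.content_type == content_type]


def score(chunk: Chunk, terms: Sequence[str],
          boost_section: str = "Executive Summary", boost: float = 2.0) -> float:
    """Toy relevance: count matching terms, then boost a preferred section."""
    text = chunk.text.lower()
    base = float(sum(1 for t in terms if t.lower() in text))
    return base * boost if boost_section in chunk.section_path else base


def rank(chunks: Sequence[Chunk], terms: Sequence[str]) -> List[Chunk]:
    """Sort chunks by descending score; ties keep document order."""
    return sorted(chunks, key=lambda c: score(c, terms), reverse=True)


CHUNKS = [
    Chunk("Revenue recognition follows the policy in note 2.", ("Notes",), 9, "paragraph", 0),
    Chunk("Quarter | Revenue\nQ3 | 12.4", ("Financials", "Quarterly Results"), 4, "table", 1),
    Chunk("Revenue rose 8% year over year.", ("Executive Summary",), 1, "paragraph", 2),
]

tables = only_type(CHUNKS, "table")
best = rank(CHUNKS, ["revenue"])[0]
print(tables[0].page)      # 4
print(best.section_path)   # ('Executive Summary',)
```

In a real system the filter would usually be pushed down into the vector store's metadata query, and the boost would be combined with an embedding similarity score rather than a keyword count.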

A Pathway integration added real-time multimodal RAG capabilities, though their documentation notes you may need additional token splitters for long passages.
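One way to handle that caveat is a secondary splitter that only touches chunks exceeding a token budget. The sketch below is an assumption-laden stand-in: it counts whitespace-separated words rather than model tokens, and the function name and parameters are hypothetical, not part of Pathway or Docling.

```python
from typing import List


def split_by_token_budget(text: str, max_tokens: int, overlap: int = 0) -> List[str]:
    """Split `text` into windows of at most `max_tokens` whitespace tokens.

    Consecutive windows share `overlap` tokens. Text already within the
    budget is returned unchanged as a single piece.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    tokens = text.split()
    if not tokens:
        return []
    if len(tokens) <= max_tokens:
        return [text]

    step = max_tokens - overlap
    pieces: List[str] = []
    for start in range(0, len(tokens), step):
        pieces.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the text
    return pieces


print(split_by_token_budget("a b c d e f g", max_tokens=3, overlap=1))
# ['a b c', 'c d e', 'e f g']
```

A production version would count tokens with the embedding model's own tokenizer, and would likely avoid applying this to tables, since splitting a table by token count reintroduces the mid-row breaks that structure-aware chunking is meant to prevent.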

Alternative Approaches

The market offers options: PyMuPDF for speed, Unstructured for format variety, LlamaParse for LLM-powered extraction, and MarkItDown, recently released by Microsoft Research, for lightweight document conversion.

Docling's differentiator is hierarchical structure preservation at scale. Whether that matters depends on your documents. For simple text PDFs, simpler tools work fine. For financial reports, technical documentation, or multi-format processing pipelines, the structured output appears to justify the additional complexity.

IBM Research presented a PyData Global 2025 tutorial on RAG integration. The code is Apache-licensed. Whether it becomes infrastructure or gets replaced by the next approach remains to be seen.