- Published on
Text extraction from documents with Kreuzberg
- Authors
- Arunabh Bora (@arunabh223)
How does Kreuzberg help?
Kreuzberg is a lightweight Python library for extracting text from documents. It provides a unified asynchronous interface for handling multiple document formats. Beyond its Python packages, it needs only two system-level dependencies: Tesseract and Pandoc. All processing is done locally, with no API calls, which makes it well suited to serverless functions and Dockerized applications.
The official repository can be found here: Kreuzberg GitHub repository
Kreuzberg under the hood
Kreuzberg is equipped with a powerful suite of tools for handling diverse document types. For PDF processing, it uses pdfium2 for searchable PDFs and Tesseract OCR for scanned content. Its document conversion capabilities include Pandoc for various document and markup formats, python-pptx for PowerPoint files, html-to-markdown for converting HTML content, and calamine for Excel spreadsheets with multi-sheet support. On the text processing side, Kreuzberg offers smart encoding detection, along with seamless handling of both Markdown and plain text files.
As a result, it can handle many file formats, including pdf, docx, pptx, rtf, pub, html, txt, xlsx, and jpg.
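To make that division of labour concrete, here is a small sketch that maps a file extension to the tool described above. This is my own illustration, not Kreuzberg's actual dispatch code: the `BACKENDS` table and the `pick_backend` helper are hypothetical names, and the real library may route files differently (for PDFs it also has to decide whether the file is searchable or scanned).

```python
from pathlib import Path

# Hypothetical lookup table: file extension -> tool named in the text above.
# Illustration only; this is not how Kreuzberg is implemented internally.
BACKENDS = {
    ".pdf": "pdfium2 or Tesseract OCR",
    ".pptx": "python-pptx",
    ".html": "html-to-markdown",
    ".xlsx": "calamine",
    ".docx": "Pandoc",
    ".rtf": "Pandoc",
    ".md": "plain text reader",
    ".txt": "plain text reader",
}


def pick_backend(path: str) -> str:
    """Return the tool that would handle this file in the sketch."""
    # Compare extensions case-insensitively so "SLIDES.PPTX" also matches.
    suffix = Path(path).suffix.lower()
    return BACKENDS.get(suffix, "unsupported")
```

The point of the table is only to show why a single entry point can cover so many formats: the caller passes a path, and the choice of tool stays hidden behind it.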
The library offers functions for both single-file and batch processing, each in synchronous and asynchronous variants.
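A minimal sketch of how these variants might be called is shown below. The function names `extract_file`, `batch_extract_file`, and `extract_file_sync` are the ones I found in the Kreuzberg README at the time of writing; treat them as assumptions and check them against the version you install. The wrapper functions themselves are hypothetical helpers of mine.

```python
import asyncio


async def extract_one(path: str) -> str:
    """Extract a single document asynchronously and return its text."""
    # Imported lazily so this module loads even without Kreuzberg installed.
    from kreuzberg import extract_file  # assumed API name

    result = await extract_file(path)
    return result.content


async def extract_many(paths: list[str]) -> list[str]:
    """Extract several documents with one batch call."""
    from kreuzberg import batch_extract_file  # assumed API name

    results = await batch_extract_file(paths)
    return [result.content for result in results]


def extract_one_sync(path: str) -> str:
    """Blocking variant for scripts that do not use asyncio."""
    from kreuzberg import extract_file_sync  # assumed API name

    return extract_file_sync(path).content
```

From a regular script, the async helpers can be driven with something like `asyncio.run(extract_one("report.pdf"))`, where `report.pdf` is a placeholder path.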
All extraction functions return an ExtractionResult object, or a list of such objects for the batch functions. The ExtractionResult object has the following attributes:
- content: The extracted text (str)
- mime_type: Output format ("text/plain" or "text/markdown" for Pandoc conversions)
- metadata: A metadata dictionary. Currently this dictionary is only populated when documents are extracted with Pandoc.
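As a small illustration of how these attributes can be used, the sketch below picks an output file name from `mime_type`. The `ExtractionResult` class here is a stand-in I defined with the three attributes listed above so the example runs on its own; it is not the library's class, and `output_name` is a hypothetical helper.

```python
from pathlib import Path
from typing import NamedTuple


class ExtractionResult(NamedTuple):
    """Stand-in with the three documented attributes (not the library class)."""

    content: str
    mime_type: str
    metadata: dict


def output_name(source: str, result: ExtractionResult) -> str:
    """Choose .md for Markdown output and .txt for everything else."""
    suffix = ".md" if result.mime_type == "text/markdown" else ".txt"
    return Path(source).with_suffix(suffix).name


# Example: a Pandoc-style conversion yields Markdown, so it is saved as .md.
converted = ExtractionResult("# Title", "text/markdown", {})
print(output_name("report.docx", converted))
```

Branching on `mime_type` rather than on the input extension keeps the saved file consistent with what was actually returned.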
Observations
In both the single and batch extraction tests, Kreuzberg successfully extracted content from PDF files. Its performance on scanned documents was impressive, handling OCR well.
However, I noticed limitations when extracting tables. The tool flattened the table structure, losing the row and column organization. This made it difficult to analyze tabular data, especially in documents where preserving the layout was essential for proper interpretation.
Despite this, Kreuzberg is a solid choice for local, API-free document extraction — especially if your primary goal is text extraction rather than structured data.