Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.
- Parxy - A PDF parser gateway that exposes different parsers through a unified API.
- Docling - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
- SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
- Filimoa/open-parse - Improved file parsing for LLMs.
- VikParuchuri/surya - OCR, layout analysis, reading order, table recognition in 90+ languages.
- UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
- huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
- Reducto - Document Ingestion API.
- adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
- lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
- PaddlePaddle/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
- allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
- opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
- smalot/pdfparser - A standalone PHP library that provides various tools to extract data from a PDF file.
- Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
- PyMuPDF4LLM - Aims to make it easier to extract PDF content in the format you need for LLM and RAG environments. It supports Markdown extraction as well as LlamaIndex document output.
- CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
- climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.
- shipsaas/docking - Shared microservice that handles document template management and PDF rendering and export.
- WeasyPrint - Generate PDFs using HTML and CSS.
- qpdf/qpdf - A content-preserving PDF document transformer.
- Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
- unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
- PdfRest - PDF API to create, shrink, and compress.
- Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
- Smallpdf - Set of tools to extract and manipulate PDF content.
- typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
- Vexlio - Tool to create diagrams and export in SVG or PDF.
- renamed.to - AI-powered tool that renames files based on their content, available as a web app, from the command line, and for integration within your application.
- veraPDF - Verify compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).
- mozilla/pdf.js - PDF Reader in JavaScript.
- agentcooper/react-pdf-highlighter - Set of React components for PDF annotation.
- Sioyek - PDF viewer with a focus on technical books and research papers (desktop app).
- tpn/pdfs - Technically oriented PDF collection (papers, specs, decks, manuals, etc.).
- pdf-association/pdf-corpora - An index of PDF-centric corpora.
- DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
- gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
- DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.
See Contributing for details.