Awesome PDF

Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.

Parsers, OCR and extraction

Parxy - A gateway that exposes multiple PDF parsers through a unified API.
Docling - Simplifies document processing by parsing diverse formats, including advanced PDF understanding, and integrating with the gen AI ecosystem.
SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
Filimoa/open-parse - Improved file parsing for LLMs.
VikParuchuri/surya - OCR, layout analysis, reading order, table recognition in 90+ languages.
UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
Reducto - Document Ingestion API.
adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
PaddlePaddle/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
smalot/pdfparser - A standalone PHP library that provides various tools to extract data from PDF files.
Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
PyMuPDF4LLM - Makes it easier to extract PDF content in the format needed for LLM and RAG environments. Supports Markdown extraction as well as LlamaIndex document output.
CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR engines and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.

Creation and production

shipsaas/docking - Shared microservice that handles document template management and PDF rendering/export.
WeasyPrint - Generates PDFs from HTML and CSS.
qpdf/qpdf - A content-preserving PDF document transformer.
Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
PdfRest - PDF API to create, shrink, and compress PDFs.
Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
Smallpdf - Set of tools to extract and manipulate PDF content.
typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
Vexlio - Tool to create diagrams and export in SVG or PDF.
renamed.to - AI-powered tool that renames files based on their content, available as a web app, as a command-line tool, and for integration within your application.
veraPDF - Verifies compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).

Readers and viewers

mozilla/pdf.js - PDF Reader in JavaScript.
agentcooper/react-pdf-highlighter - Set of React components for PDF annotation.
Sioyek - PDF viewer with a focus on technical books and research papers (desktop app).

Datasets

tpn/pdfs - Technically oriented PDF collection (papers, specs, decks, manuals, etc.).
pdf-association/pdf-corpora - An index of PDF-centric corpora.
DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.

Contributing

See Contributing for details.
