OneOffTech/awesome-pdf

Awesome PDF

Portable Document Format (PDF) is a file format for presenting documents independently of software, hardware, or operating systems.

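Contents

  • Parsers, OCR and extraction
  • Creation and production
  • Readers and viewers
  • Datasets
  • Contributing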

Parsers, OCR and extraction

  • Parxy - A PDF parser gateway that exposes different parsers through a unified API.
  • Docling - Simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem.
  • SmolDocling - A multimodal Image-Text-to-Text model for efficient document conversion, compatible with Docling.
  • Filimoa/open-parse - Improved file parsing for LLMs.
  • VikParuchuri/surya - OCR, layout analysis, reading order, table recognition in 90+ languages.
  • UniModal4Reasoning/StructEqTable-Deploy - A high-efficiency open-source toolkit for the table-to-LaTeX task.
  • huridocs/pdf-document-layout-analysis - A Docker-based service for analyzing PDF document layouts, enabling segmentation and classification of elements like text, titles, images, and tables.
  • Reducto - Document Ingestion API.
  • adithya-s-k/omniparse - A platform that ingests and parses unstructured data into structured data optimized for GenAI applications.
  • lumina-ai-inc/chunkr - Vision-model-based PDF chunking.
  • lumina-ai-inc/PaddleOCR - Lightweight multilingual OCR toolkit supporting 80+ languages, built on PaddlePaddle.
  • allenai/olmocr - Toolkit for linearizing PDFs for LLM datasets/training.
  • opendatalab/PDF-Extract-Kit - A comprehensive toolkit for high-quality PDF content extraction.
  • smalot/pdfparser - A standalone PHP library that provides various tools to extract data from PDF files.
  • Unstructured-IO/unstructured - Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
  • PyMuPDF4LLM - Makes it easier to extract PDF content in the format you need for LLM and RAG environments. Supports Markdown extraction as well as LlamaIndex document output.
  • CatchTheTornado/pdf-extract-api - Document (PDF) extraction and parsing API using state-of-the-art OCR and Ollama-supported models. Anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.
  • climatepolicyradar/navigator-document-parser - Parsing PDFs and websites containing laws and policies.

Creation and production

  • shipsaas/docking - Shared microservice that handles document template management and PDF rendering and export.
  • WeasyPrint - Generate PDFs using HTML and CSS.
  • qpdf/qpdf - A content-preserving PDF document transformer.
  • Stirling-Tools/Stirling-PDF - A locally hosted web-based PDF manipulation tool using Docker that supports splitting, merging, converting, reorganizing, compressing, and more.
  • unjs/unpdf - Utilities to work with PDFs in Node.js, browser and workers.
  • PdfRest - PDF API to create, shrink, and compress PDF files.
  • Gotenberg - A Docker-powered stateless API for creating PDF files from templates in various formats, e.g., HTML, Markdown, Word, Excel.
  • Smallpdf - Set of tools to extract and manipulate PDF content.
  • typst/typst - A new markup-based typesetting system that is powerful and easy to learn.
  • Vexlio - Tool to create diagrams and export them as SVG or PDF.
  • renamed.to - AI-powered tool that renames files based on their content, available as a web app, on the command line, and for integration within your application.
  • veraPDF - Verify compliance with the PDF/A and PDF/UA specifications (via Open Preservation Foundation).

Readers and viewers

Datasets

  • tpn/pdfs - Technically oriented PDF collection (papers, specs, decks, manuals, etc.).
  • pdf-association/pdf-corpora - An index of PDF-centric corpora.
  • DS4SD/DocLayNet - A large human-annotated dataset for document-layout analysis.
  • gipplab/pdf-benchmark - A benchmark of PDF information extraction tools using a multi-task and multi-domain evaluation framework for academic documents.
  • DocBank Dataset - A large-scale dataset built with weak supervision, enabling models to integrate textual and layout information for downstream tasks.

Contributing

See Contributing for details.

About

A curated list of amazing libraries, services, and resources for working with PDF files.