feat: Add support for legacy .doc file format #1624

Open
yuchengpersonal wants to merge 1 commit into microsoft:main from yuchengpersonal:feat/add-doc-support

@yuchengpersonal

Summary

This PR adds support for converting legacy Microsoft Word (.doc) files (Word 97-2003 format) to Markdown, addressing Issue #23.

Changes

  • New Converter: Added _doc_converter.py with DocConverter class
  • Dual Backend Support:
    • Primary: textract (Python package)
    • Fallback: antiword (system command, if available)
  • Integration: Registered DocConverter in MarkItDown class
  • Dependencies: Added [doc] optional dependency group with textract
  • Documentation: Updated README.md to reflect .doc support
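
The dual-backend behaviour listed above can be sketched roughly as follows. This is not the PR's actual _doc_converter.py; the names extract_doc_text, MissingBackendError and the two helper functions are hypothetical and exist only for illustration. The sketch assumes textract.process() for the primary path and the antiword command on PATH for the fallback.

```python
import shutil
import subprocess
from typing import Callable, Optional, Sequence


class MissingBackendError(RuntimeError):
    """Hypothetical error raised when no .doc backend could produce text."""


def _extract_with_textract(path: str) -> str:
    # Optional dependency: an ImportError here simply means "try the next backend".
    import textract

    return textract.process(path).decode("utf-8", errors="replace")


def _extract_with_antiword(path: str) -> str:
    # System command: only usable when the executable is on PATH.
    if shutil.which("antiword") is None:
        raise FileNotFoundError("antiword executable not found on PATH")
    done = subprocess.run(["antiword", path], capture_output=True, check=True)
    return done.stdout.decode("utf-8", errors="replace")


def extract_doc_text(
    path: str,
    backends: Optional[Sequence[Callable[[str], str]]] = None,
) -> str:
    """Try each backend in order and return the first successful result."""
    if backends is None:
        backends = (_extract_with_textract, _extract_with_antiword)
    errors = []
    for backend in backends:
        try:
            return backend(path)
        except Exception as exc:  # broad on purpose: each backend fails differently
            errors.append(f"{backend.__name__}: {exc}")
    raise MissingBackendError(
        "No .doc backend succeeded. Tried: " + "; ".join(errors)
    )
```

Collecting every backend's failure reason into the final exception is one way to get the "clear error messages when dependencies are missing" behaviour the PR describes.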

Usage

Install the optional [doc] dependency group:

    pip install markitdown[doc]
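
With the [doc] extra installed, a conversion would presumably go through the usual MarkItDown entry point. The sketch below assumes this PR's DocConverter is registered so that convert() accepts a .doc path. The wrapper doc_to_markdown and its converter parameter are hypothetical, added only so the example can be exercised without the library present.

```python
from typing import Any, Optional


def doc_to_markdown(path: str, converter: Optional[Any] = None) -> str:
    """Convert a .doc file to Markdown text (illustrative wrapper, not part of the PR)."""
    if converter is None:
        # Imported lazily so the sketch loads even when markitdown is not installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    # MarkItDown.convert() returns a result object whose text_content holds the Markdown.
    result = converter.convert(path)
    return result.text_content
```

Calling doc_to_markdown("report.doc") and printing the return value would show the converted text; report.doc is a placeholder file name.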

Technical Details

  • Handles Word 97-2003 binary format (.doc)
  • MIME type detection: application/msword
  • Graceful fallback between textract and antiword
  • Proper cleanup of temporary files
  • Clear error messages when dependencies are missing
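
Two of the points above, format detection and temporary-file cleanup, can be illustrated with standard-library code alone. This is a sketch under assumptions, not the PR's implementation: looks_like_ole2 and run_on_temp_copy are hypothetical names, and checking the file signature is shown as one possible complement to the application/msword MIME type. Word 97-2003 files are OLE2 compound files, which start with the eight bytes D0 CF 11 E0 A1 B1 1A E1.

```python
import io
import os
import tempfile
from typing import BinaryIO, Callable

# Signature of an OLE2 compound file. Note that legacy .xls and .ppt files
# share this signature, so a match alone does not prove the file is a .doc.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(stream: BinaryIO) -> bool:
    """Peek at the first bytes of a seekable stream without consuming it."""
    pos = stream.tell()
    try:
        return stream.read(len(OLE2_MAGIC)) == OLE2_MAGIC
    finally:
        stream.seek(pos)  # restore the position for later readers


def run_on_temp_copy(stream: BinaryIO, extractor: Callable[[str], object]) -> object:
    """Write the stream to a temporary .doc file, run extractor on its path, clean up."""
    fd, tmp_path = tempfile.mkstemp(suffix=".doc")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(stream.read())
        # Path-based backends (a library call or a system command) can now read the file.
        return extractor(tmp_path)
    finally:
        # Remove the temporary file whether or not the extractor raised.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
```

Doing the cleanup in a finally block means the temporary file is removed even when the extractor raises.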

Fixes #23

- Add DocConverter class to convert Word 97-2003 (.doc) files
- Support both textract (Python) and antiword (system) backends
- Register converter in MarkItDown class
- Add [doc] optional dependency in pyproject.toml
- Update README.md to document .doc support

Fixes microsoft#23

Co-Authored-By: yuchengpersonal <yuchengpersonal@users.noreply.github.com>
@sahabatmotortelukmakmur-cell

C#
//#1624 (comment)

@yuchengpersonal
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Support for .doc extensions