An automated Azure-based document processing pipeline that leverages Azure Document Intelligence and OpenAI LLM for intelligent document classification and analysis.
This project implements a serverless document processing system that:
- Ingests PDF documents from Azure Blob Storage or http post
- Extracts content using Azure Document Intelligence (prebuilt-layout model)
- Classifies documents using Azure OpenAI LLM (GPT-4)
- Publishes results to Microsoft Farbic Lakehouse using Azure Event Hub
The system is built as an Azure Function that automatically triggers on new document uploads or http requets, enabling a fully automated, scalable processing workflow.
- Azure Functions - Serverless compute for document processing orchestration
- Azure Blob Storage - Document ingestion and storage
- Azure Document Intelligence - OCR and document layout analysis
- Azure OpenAI - LLM-based document classification
- Azure Event Hub - Event streaming and results publishing
Document Upload β Blob Trigger/http trigger β Document Intelligence β LLM Classification β Event Hub
(Input) (Function) (Content Extraction) (Document Type) (Output)
DocumentProcessingSystem/
βββ function_app.py # Azure Function entry point with blob trigger
βββ run_DocumentIntelligence.py # Document Intelligence API integration
βββ run_LLMClasscification.py # Azure OpenAI classification logic
βββ run_FabricEventHub.py # Event Hub publisher
βββ code_testing.ipynb # Testing and development notebook
βββ requirements.txt # Python dependencies
βββ host.json # Azure Functions configuration
βββ local.settings.json # Local environment settings
βββ README.md # This file
- Python 3.9+
- Azure Functions Core Tools
- Azure CLI
- An Azure subscription with:
- Storage Account (Blob Storage)
- Document Intelligence resource
- Azure OpenAI resource
- Event Hub namespace
-
Clone the repository
git clone <repository-url> cd DocumentProcessingSystem
-
Create a Python virtual environment
python -m venv .venv .venv\Scripts\activate # On Windows
-
Install dependencies
pip install -r requirements.txt
-
Configure environment variables
Create or update
local.settings.jsonwith your Azure credentials:{ "IsEncrypted": false, "Values": { "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;...", "Eventhub_endpoint": "your-eventhub-namespace.servicebus.windows.net", "Eventhub_name": "your-eventhub-name", "docintelligenceendpoint": "https://your-region.api.cognitive.microsoft.com/", "docintelligencekey": "your-document-intelligence-key", "openai_endpoint": "https://your-resource.openai.azure.com/", "openai_key": "your-openai-key", "rgdocumentprocessinb772_STORAGE": "your-blob-storage-connection-string" } }
-
Start the Azure Functions runtime
func start
-
Upload a test document
Upload a PDF to the
document-processing-dropzone/Input/blob container -
Monitor execution
Check the function logs in the terminal for processing status
- Model:
prebuilt-layout(for general document layout analysis) - API Version:
2024-11-30 - Poll Interval: 2 seconds (configurable in
function_app.py) - Max Wait: 60 seconds
Supported document classifications (defined in run_LLMClasscification.py):
- Medical Aid / Medical Scheme Certificate
- Employee Tax Certificate
- Retirement Annuity Certificate
- Investment Income Certificate
- Medical Expenses
- Travel Log Book
- Other
- Model:
gpt-4.1 - API Version:
2024-12-01-preview - Provider: Azure OpenAI
| Package | Purpose |
|---|---|
azure-functions |
Azure Functions SDK |
azure-storage-blob |
Blob Storage integration |
azure-eventhub |
Event Hub integration (For Fabric Lakehouse storage) |
azure-identity |
Azure authentication |
requests |
HTTP client for Document Intelligence API |
openai |
Azure OpenAI SDK |
cryptography |
Encryption utilities |
- Monitors the
document-processing-dropzone/Input/container - Automatically triggers on PDF upload
- Monitors the http endpoint for the function app
- Post requests with APIKey to the endpoint will trigger the process e.g. http://localhost:7071/api/func_document_processing?code===
- Pdf file must form part of the binary body
- Converts PDF to base64 encoding
- Posts to Document Intelligence API for layout analysis
- Polls for completion (up to 60 seconds)
- Extracts structured content and metadata
- Processes extracted text content
- Uses Azure OpenAI to classify document type
- Generates confidence scores and reasoning
- Packages results with document metadata
- Publishes to Event Hub for downstream processing
- Enables real-time data consumption and analytics
Use code_testing.ipynb for:
- Unit testing individual components
- Testing API endpoints
- Debugging extraction and classification logic
- Manual workflow validation
-
Create Azure Functions resource
az functionapp create --resource-group <rg-name> \ --consumption-plan-location <region> \ --runtime python --runtime-version 3.11 \ --functions-version 4 \ --name <function-app-name>
-
Deploy the function
func azure functionapp publish <function-app-name>
-
Configure application settings
az functionapp config appsettings set \ --name <function-app-name> \ --resource-group <rg-name> \ --settings <setting-key>=<setting-value>
- Azure Functions integrated logging
- Document Intelligence API response tracking
- Event Hub message publishing verification
- Application Insights integration (optional)
- β Embeddings for large documents not yet implemented
- β No chunking strategy for documents exceeding 2 MB item limits
- Implement document chunking for large files
- Add vector embeddings for semantic search
- Migrate to Document Intelligence Python SDK
- Implement dead-letter handling for failed documents
- Add comprehensive error tracking and alerting
- Use Azure Key Vault for sensitive credentials (recommended)
- Enable managed identities for Azure service authentication
- Restrict blob container access with appropriate RBAC
- Validate input documents before processing
- Monitor and audit Event Hub consumers
Stored in local.settings.json
| Variable | Description |
|---|---|
Eventhub_endpoint |
Event Hub namespace endpoint |
Eventhub_name |
Event Hub instance name |
docintelligenceendpoint |
Document Intelligence API endpoint |
docintelligencekey |
Document Intelligence API key |
openai_endpoint |
Azure OpenAI API endpoint |
openai_key |
Azure OpenAI API key |
AzureWebJobsStorage |
Blob Storage connection string |
rgdocumentprocessinb772_STORAGE |
Blob Storage connection for function trigger |
For issues or questions, please open an issue in the repository or contact the Andrew Schleiss.