-
Notifications
You must be signed in to change notification settings - Fork 24
Open
Description
Problem Statement
In morango/sync/stream/serialize.py, there are comments indicating that Django's bulk_update performs poorly. The current implementation falls back to individual save() calls for store and counter updates, which is inefficient for large datasets.
Current Implementation Issues
The WriteSink.consume() method currently:
- Uses
bulk_createwithignore_conflicts=Truefor initial creation - Falls back to individual
save()calls for updates (lines 315-317, 333-335) - Has TODO comments asking "bulk_update performs poorly-- is there a better way?"
Research Findings
Django bulk_update Performance
- Django's
bulk_updategenerates largeCASE(WHEN(...))statements - Performance degrades quadratically with dataset size
- For 100,000 records: ~647 seconds with Django's bulk_update
Alternative Approaches
1. django-fast-update Library
The django-fast-update library demonstrates significant performance improvements:
Key Features:
- Uses
UPDATE FROM VALUESSQL pattern - 8x+ faster than Django's bulk_update
- Linear performance scaling vs quadratic
- Supports SQLite 3.15+, PostgreSQL, MariaDB 10.2+, MySQL 5.7+
- Falls back gracefully for unsupported databases
Performance Comparison:
| Method | 100k records | Notes |
|---|---|---|
| Django bulk_update | ~647s | Quadratic scaling |
| django-fast-update | ~80s | Linear scaling |
| copy_update (PG only) | ~40s | Best for large datasets |
2. SQL VALUES Approach
UPDATE target_table
SET field1 = source.field1, field2 = source.field2
FROM (VALUES
(id1, val1a, val1b),
(id2, val2a, val2b)
) AS source(id, field1, field2)
WHERE target_table.id = source.id3. TEMP TABLE Approach
-- Create temp table with update data
CREATE TEMP TABLE temp_updates AS
SELECT id, new_field1, new_field2 FROM source_data;
-- Perform join-based update
UPDATE target_table
SET field1 = temp_updates.new_field1,
field2 = temp_updates.new_field2
FROM temp_updates
WHERE target_table.id = temp_updates.id;Proposed Solution
Create a vendored utility module morango.utils.bulk_update that:
1. Implements VALUES-based approach:
def fast_bulk_update(model_class, objects, fields, batch_size=500):
"""
Perform bulk update using VALUES approach.
Falls back to Django's bulk_update if VALUES not supported.
"""
if not objects:
return
# Try VALUES approach first
try:
return _values_based_update(model_class, objects, fields, batch_size)
except (NotImplementedError, DatabaseError):
# Fall back to Django's implementation
model_class.objects.bulk_update(objects, fields, batch_size=batch_size)2. Database-Specific Implementations:
- SQLite/PostgreSQL: VALUES approach
- Other databases: Graceful fallback to Django's bulk_update
3. Batch Processing:
- Process in configurable batch sizes (default 500)
- Handle SQLite variable limits automatically
- Memory-efficient streaming for large datasets
Implementation Plan
-
Phase 1 - Research & Prototyping
- Create proof-of-concept VALUES implementation
- Benchmark against current approach
- Test with SQLite and PostgreSQL
-
Phase 2 - Integration
- Replace individual
save()calls inWriteSink.consume() - Add configuration for batch size
- Maintain backward compatibility
- Replace individual
-
Phase 3 - Testing & Optimization
- Performance testing with realistic datasets
- Edge case handling (empty datasets, single records)
- Database compatibility testing
Expected Benefits
- Performance: 5-10x improvement for bulk updates
- Scalability: Linear vs quadratic complexity
- Maintainability: Cleaner code without individual save() loops
- Compatibility: Works with existing Django models
Open Questions
- Should we implement both VALUES and TEMP TABLE approaches?
- What batch size works best for our typical workloads?
- Are there any Morango-specific constraints we need to consider?
- Should this be a separate utility or integrated directly into the sync pipeline?
References
- Streaming sync serialization #287 (comment)
- django-fast-update
- Django bulk_update PR #9606
- django-bulk-load
- django-pg-bulk-update
Context
Branch: release-v0.9.x
AI usage
This issue was written by AI after it did some research, at my prompting.
Reactions are currently unavailable