Investigation: Django bulk_update performance issues #291

@bjester

Description

Problem Statement

In morango/sync/stream/serialize.py, there are comments indicating that Django's bulk_update performs poorly. The current implementation falls back to individual save() calls for store and counter updates, which is inefficient for large datasets.

Current Implementation Issues

The WriteSink.consume() method currently:

  1. Uses bulk_create with ignore_conflicts=True for initial creation
  2. Falls back to individual save() calls for updates (lines 315-317, 333-335)
  3. Has TODO comments asking "bulk_update performs poorly-- is there a better way?"

Research Findings

Django bulk_update Performance

  • Django's bulk_update generates a single large `CASE WHEN ... THEN ...` statement per updated field
  • Performance degrades quadratically with dataset size
  • For 100,000 records: ~647 seconds with Django's bulk_update
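To make the scaling problem concrete, the following sketch builds the rough shape of the statement Django emits (illustrative only: Django parameterizes values and splits work into batches, but the one-`WHEN`-arm-per-row structure is the same, so statement size grows with row count):

```python
def case_when_update_sql(table, field, pairs):
    """Build the rough shape of the SQL Django's bulk_update emits:
    one CASE WHEN arm per updated row, so the statement grows with
    the number of rows being updated.
    """
    arms = " ".join(
        "WHEN id = {} THEN {!r}".format(pk, value) for pk, value in pairs
    )
    ids = ", ".join(str(pk) for pk, _ in pairs)
    return "UPDATE {} SET {} = CASE {} END WHERE id IN ({})".format(
        table, field, arms, ids
    )

sql = case_when_update_sql("store", "serialized", [(1, "a"), (2, "b")])
```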

Alternative Approaches

1. django-fast-update Library

The django-fast-update library demonstrates significant performance improvements:

Key Features:

  • Uses UPDATE FROM VALUES SQL pattern
  • 8x+ faster than Django's bulk_update
  • Linear performance scaling vs quadratic
  • Supports SQLite 3.15+, PostgreSQL, MariaDB 10.2+, MySQL 5.7+
  • Falls back gracefully for unsupported databases

Performance Comparison:

| Method                | 100k records | Notes                    |
| --------------------- | ------------ | ------------------------ |
| Django bulk_update    | ~647s        | Quadratic scaling        |
| django-fast-update    | ~80s         | Linear scaling           |
| copy_update (PG only) | ~40s         | Best for large datasets  |

2. SQL VALUES Approach

```sql
UPDATE target_table
SET field1 = source.field1, field2 = source.field2
FROM (VALUES
    (id1, val1a, val1b),
    (id2, val2a, val2b)
) AS source(id, field1, field2)
WHERE target_table.id = source.id
```
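A runnable demonstration of this pattern using the standard-library sqlite3 module (note the `UPDATE ... FROM` syntax requires SQLite 3.33+, and SQLite names VALUES-derived columns `column1`, `column2`, ... rather than accepting a column alias list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?)",
    [(1, "old", "old"), (2, "old", "old"), (3, "keep", "keep")],
)

# One (?, ?, ?) placeholder tuple per updated row, values flattened into params.
rows = [(1, "val1a", "val1b"), (2, "val2a", "val2b")]
placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
params = [v for row in rows for v in row]
conn.execute(
    "UPDATE target SET field1 = src.column2, field2 = src.column3 "
    "FROM (VALUES {}) AS src WHERE target.id = src.column1".format(placeholders),
    params,
)
updated = conn.execute("SELECT id, field1 FROM target ORDER BY id").fetchall()
```

Row 3 is untouched because it has no matching entry in the VALUES list.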

3. TEMP TABLE Approach

```sql
-- Create temp table with update data
CREATE TEMP TABLE temp_updates AS
SELECT id, new_field1, new_field2 FROM source_data;

-- Perform join-based update
UPDATE target_table
SET field1 = temp_updates.new_field1,
    field2 = temp_updates.new_field2
FROM temp_updates
WHERE target_table.id = temp_updates.id;
```
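The temp-table pattern can also be written in a form that works on every SQLite version, using a correlated subquery instead of `UPDATE ... FROM` (which needs SQLite 3.33+). A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?)",
    [(1, "old"), (2, "old"), (3, "keep")],
)

# Stage the new values in a temp table, then join-update against it.
conn.execute("CREATE TEMP TABLE temp_updates (id INTEGER PRIMARY KEY, new_field1 TEXT)")
conn.executemany("INSERT INTO temp_updates VALUES (?, ?)", [(1, "new-a"), (2, "new-b")])

# Correlated-subquery form: portable across all SQLite versions.
conn.execute(
    "UPDATE target SET field1 = "
    "(SELECT new_field1 FROM temp_updates WHERE temp_updates.id = target.id) "
    "WHERE id IN (SELECT id FROM temp_updates)"
)
rows = conn.execute("SELECT id, field1 FROM target ORDER BY id").fetchall()
```

The `WHERE id IN (...)` clause matters: without it, rows with no match in `temp_updates` would be set to NULL.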

Proposed Solution

Create a vendored utility module morango.utils.bulk_update that:

1. Implements VALUES-based approach:

```python
from django.db import DatabaseError


def fast_bulk_update(model_class, objects, fields, batch_size=500):
    """
    Perform a bulk update using the VALUES approach.
    Falls back to Django's bulk_update if VALUES is not supported.
    """
    if not objects:
        return

    # Try the VALUES approach first
    try:
        return _values_based_update(model_class, objects, fields, batch_size)
    except (NotImplementedError, DatabaseError):
        # Fall back to Django's implementation
        model_class.objects.bulk_update(objects, fields, batch_size=batch_size)
```
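The `_values_based_update` helper above is hypothetical; its core would be generating the parameterized `UPDATE ... FROM (VALUES ...)` statement. A Django-independent sketch of that SQL generation (PostgreSQL-style `%s` placeholders and column aliasing assumed; SQLite would instead address VALUES columns as `column1`, `column2`, ...):

```python
def build_values_update_sql(table, pk_column, fields, row_count):
    """Build a parameterized UPDATE ... FROM (VALUES ...) statement.

    Emits one placeholder tuple per row; callers flatten each row's
    values (pk first, then fields) into the parameter list.
    """
    columns = [pk_column] + list(fields)
    placeholders = ", ".join(
        ["({})".format(", ".join(["%s"] * len(columns)))] * row_count
    )
    assignments = ", ".join("{0} = src.{0}".format(f) for f in fields)
    return (
        "UPDATE {table} SET {assignments} "
        "FROM (VALUES {placeholders}) AS src ({cols}) "
        "WHERE {table}.{pk} = src.{pk}"
    ).format(
        table=table,
        assignments=assignments,
        placeholders=placeholders,
        cols=", ".join(columns),
        pk=pk_column,
    )

sql = build_values_update_sql("store", "id", ["field1", "field2"], row_count=2)
```

Unlike the CASE WHEN form, the statement size here grows only with the placeholder list, and the database can resolve the join by primary key.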

2. Database-Specific Implementations:

  • SQLite/PostgreSQL: VALUES approach
  • Other databases: Graceful fallback to Django's bulk_update

3. Batch Processing:

  • Process in configurable batch sizes (default 500)
  • Handle SQLite variable limits automatically
  • Memory-efficient streaming for large datasets
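The batching logic could look like the following sketch, which caps each batch so the total number of bound parameters stays under the backend's limit (SQLite's historic default for `SQLITE_MAX_VARIABLE_NUMBER` is 999; it was raised to 32766 in SQLite 3.32):

```python
def batched(objects, fields, batch_size=500, max_variables=999):
    """Yield batches sized so total bound parameters per statement
    stay under the database's variable limit.

    Each row binds one value per updated field plus its primary key.
    """
    vars_per_row = len(fields) + 1
    rows_per_batch = min(batch_size, max(1, max_variables // vars_per_row))
    for start in range(0, len(objects), rows_per_batch):
        yield objects[start:start + rows_per_batch]

batches = list(batched(list(range(2000)), fields=["f1", "f2"], batch_size=500))
```

With two fields plus the primary key (3 variables per row) and a 999-variable limit, batches top out at 333 rows even though `batch_size` allows 500.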

Implementation Plan

  1. Phase 1 - Research & Prototyping

    • Create proof-of-concept VALUES implementation
    • Benchmark against current approach
    • Test with SQLite and PostgreSQL
  2. Phase 2 - Integration

    • Replace individual save() calls in WriteSink.consume()
    • Add configuration for batch size
    • Maintain backward compatibility
  3. Phase 3 - Testing & Optimization

    • Performance testing with realistic datasets
    • Edge case handling (empty datasets, single records)
    • Database compatibility testing
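For Phase 1, a minimal benchmark harness (a sketch, not the actual Morango sync code) can compare the current one-statement-per-row pattern against a single prepared statement on an in-memory SQLite database:

```python
import sqlite3
import time


def bench(strategy, n=1000):
    """Time one update strategy over n rows on an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO store VALUES (?, ?)", [(i, "old") for i in range(n)]
    )
    updates = [("new", i) for i in range(n)]
    start = time.perf_counter()
    strategy(conn, updates)
    conn.commit()
    elapsed = time.perf_counter() - start
    changed = conn.execute(
        "SELECT COUNT(*) FROM store WHERE val = 'new'"
    ).fetchone()[0]
    return elapsed, changed


def one_by_one(conn, updates):
    # Mirrors the current individual-save() fallback: one statement per row.
    for val, pk in updates:
        conn.execute("UPDATE store SET val = ? WHERE id = ?", (val, pk))


def prepared_many(conn, updates):
    # Single prepared statement executed over all rows.
    conn.executemany("UPDATE store SET val = ? WHERE id = ?", updates)


t1, c1 = bench(one_by_one)
t2, c2 = bench(prepared_many)
```

Real benchmarks should use representative Morango store records and run against both SQLite and PostgreSQL, per the plan above.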

Expected Benefits

  • Performance: 5-10x improvement for bulk updates
  • Scalability: Linear vs quadratic complexity
  • Maintainability: Cleaner code without individual save() loops
  • Compatibility: Works with existing Django models

Open Questions

  1. Should we implement both VALUES and TEMP TABLE approaches?
  2. What batch size works best for our typical workloads?
  3. Are there any Morango-specific constraints we need to consider?
  4. Should this be a separate utility or integrated directly into the sync pipeline?

Context

Branch: release-v0.9.x

AI usage

This issue was written by AI after it did some research, at my prompting.
