Investigation: Django bulk_update performance issues #291

@bjester

Description

Problem Statement

In morango/sync/stream/serialize.py, there are comments indicating that Django's bulk_update performs poorly. The current implementation falls back to individual save() calls for store and counter updates, which is inefficient for large datasets.

Current Implementation Issues

The WriteSink.consume() method currently:

  1. Uses bulk_create with ignore_conflicts=True for initial creation
  2. Falls back to individual save() calls for updates (lines 315-317, 333-335)
  3. Has TODO comments asking "bulk_update performs poorly-- is there a better way?"

Research Findings

Django bulk_update Performance

  • Django's bulk_update generates a single large `CASE WHEN ... THEN ...` statement per updated field
  • Performance degrades quadratically with dataset size
  • For 100,000 records: ~647 seconds with Django's bulk_update
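To make the scaling problem concrete, the following sketch builds the rough shape of the statement Django emits (illustrative only: Django parameterizes values and splits work into batches, but the one-`WHEN`-arm-per-row structure is the same, so statement size grows with row count):

```python
def case_when_update_sql(table, field, pairs):
    """Build the rough shape of the SQL Django's bulk_update emits:
    one CASE WHEN arm per updated row, so the statement grows with
    the number of rows being updated.
    """
    arms = " ".join(
        "WHEN id = {} THEN {!r}".format(pk, value) for pk, value in pairs
    )
    ids = ", ".join(str(pk) for pk, _ in pairs)
    return "UPDATE {} SET {} = CASE {} END WHERE id IN ({})".format(
        table, field, arms, ids
    )

sql = case_when_update_sql("store", "serialized", [(1, "a"), (2, "b")])
```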

Alternative Approaches

1. django-fast-update Library

The django-fast-update library demonstrates significant performance improvements:

Key Features:

  • Uses UPDATE FROM VALUES SQL pattern
  • 8x+ faster than Django's bulk_update
  • Linear performance scaling vs quadratic
  • Supports SQLite 3.15+, PostgreSQL, MariaDB 10.2+, MySQL 5.7+
  • Falls back gracefully for unsupported databases

Performance Comparison:

| Method                | 100k records | Notes                    |
| --------------------- | ------------ | ------------------------ |
| Django bulk_update    | ~647s        | Quadratic scaling        |
| django-fast-update    | ~80s         | Linear scaling           |
| copy_update (PG only) | ~40s         | Best for large datasets  |

2. SQL VALUES Approach

```sql
UPDATE target_table
SET field1 = source.field1, field2 = source.field2
FROM (VALUES
    (id1, val1a, val1b),
    (id2, val2a, val2b)
) AS source(id, field1, field2)
WHERE target_table.id = source.id
```
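A runnable demonstration of this pattern using the standard-library sqlite3 module (note the `UPDATE ... FROM` syntax requires SQLite 3.33+, and SQLite names VALUES-derived columns `column1`, `column2`, ... rather than accepting a column alias list):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?)",
    [(1, "old", "old"), (2, "old", "old"), (3, "keep", "keep")],
)

# One (?, ?, ?) placeholder tuple per updated row, values flattened into params.
rows = [(1, "val1a", "val1b"), (2, "val2a", "val2b")]
placeholders = ", ".join(["(?, ?, ?)"] * len(rows))
params = [v for row in rows for v in row]
conn.execute(
    "UPDATE target SET field1 = src.column2, field2 = src.column3 "
    "FROM (VALUES {}) AS src WHERE target.id = src.column1".format(placeholders),
    params,
)
updated = conn.execute("SELECT id, field1 FROM target ORDER BY id").fetchall()
```

Row 3 is untouched because it has no matching entry in the VALUES list.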

3. TEMP TABLE Approach

```sql
-- Create temp table with update data
CREATE TEMP TABLE temp_updates AS
SELECT id, new_field1, new_field2 FROM source_data;

-- Perform join-based update
UPDATE target_table
SET field1 = temp_updates.new_field1,
    field2 = temp_updates.new_field2
FROM temp_updates
WHERE target_table.id = temp_updates.id;
```
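The temp-table pattern can also be written in a form that works on every SQLite version, using a correlated subquery instead of `UPDATE ... FROM` (which needs SQLite 3.33+). A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, field1 TEXT)")
conn.executemany(
    "INSERT INTO target VALUES (?, ?)",
    [(1, "old"), (2, "old"), (3, "keep")],
)

# Stage the new values in a temp table, then join-update against it.
conn.execute("CREATE TEMP TABLE temp_updates (id INTEGER PRIMARY KEY, new_field1 TEXT)")
conn.executemany("INSERT INTO temp_updates VALUES (?, ?)", [(1, "new-a"), (2, "new-b")])

# Correlated-subquery form: portable across all SQLite versions.
conn.execute(
    "UPDATE target SET field1 = "
    "(SELECT new_field1 FROM temp_updates WHERE temp_updates.id = target.id) "
    "WHERE id IN (SELECT id FROM temp_updates)"
)
rows = conn.execute("SELECT id, field1 FROM target ORDER BY id").fetchall()
```

The `WHERE id IN (...)` clause matters: without it, rows with no match in `temp_updates` would be set to NULL.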

Proposed Solution

Create a vendored utility module morango.utils.bulk_update that:

1. Implements VALUES-based approach:

```python
from django.db import DatabaseError


def fast_bulk_update(model_class, objects, fields, batch_size=500):
    """
    Perform a bulk update using the VALUES approach.
    Falls back to Django's bulk_update if VALUES is not supported.
    """
    if not objects:
        return

    # Try the VALUES approach first
    try:
        return _values_based_update(model_class, objects, fields, batch_size)
    except (NotImplementedError, DatabaseError):
        # Fall back to Django's implementation
        model_class.objects.bulk_update(objects, fields, batch_size=batch_size)
```
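The `_values_based_update` helper above is hypothetical; its core would be generating the parameterized `UPDATE ... FROM (VALUES ...)` statement. A Django-independent sketch of that SQL generation (PostgreSQL-style `%s` placeholders and column aliasing assumed; SQLite would instead address VALUES columns as `column1`, `column2`, ...):

```python
def build_values_update_sql(table, pk_column, fields, row_count):
    """Build a parameterized UPDATE ... FROM (VALUES ...) statement.

    Emits one placeholder tuple per row; callers flatten each row's
    values (pk first, then fields) into the parameter list.
    """
    columns = [pk_column] + list(fields)
    placeholders = ", ".join(
        ["({})".format(", ".join(["%s"] * len(columns)))] * row_count
    )
    assignments = ", ".join("{0} = src.{0}".format(f) for f in fields)
    return (
        "UPDATE {table} SET {assignments} "
        "FROM (VALUES {placeholders}) AS src ({cols}) "
        "WHERE {table}.{pk} = src.{pk}"
    ).format(
        table=table,
        assignments=assignments,
        placeholders=placeholders,
        cols=", ".join(columns),
        pk=pk_column,
    )

sql = build_values_update_sql("store", "id", ["field1", "field2"], row_count=2)
```

Unlike the CASE WHEN form, the statement size here grows only with the placeholder list, and the database can resolve the join by primary key.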

2. Database-Specific Implementations:

  • SQLite/PostgreSQL: VALUES approach
  • Other databases: Graceful fallback to Django's bulk_update

3. Batch Processing:

  • Process in configurable batch sizes (default 500)
  • Handle SQLite variable limits automatically
  • Memory-efficient streaming for large datasets
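The batching logic could look like the following sketch, which caps each batch so the total number of bound parameters stays under the backend's limit (SQLite's historic default for `SQLITE_MAX_VARIABLE_NUMBER` is 999; it was raised to 32766 in SQLite 3.32):

```python
def batched(objects, fields, batch_size=500, max_variables=999):
    """Yield batches sized so total bound parameters per statement
    stay under the database's variable limit.

    Each row binds one value per updated field plus its primary key.
    """
    vars_per_row = len(fields) + 1
    rows_per_batch = min(batch_size, max(1, max_variables // vars_per_row))
    for start in range(0, len(objects), rows_per_batch):
        yield objects[start:start + rows_per_batch]

batches = list(batched(list(range(2000)), fields=["f1", "f2"], batch_size=500))
```

With two fields plus the primary key (3 variables per row) and a 999-variable limit, batches top out at 333 rows even though `batch_size` allows 500.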

Implementation Plan

  1. Phase 1 - Research & Prototyping

    • Create proof-of-concept VALUES implementation
    • Benchmark against current approach
    • Test with SQLite and PostgreSQL
  2. Phase 2 - Integration

    • Replace individual save() calls in WriteSink.consume()
    • Add configuration for batch size
    • Maintain backward compatibility
  3. Phase 3 - Testing & Optimization

    • Performance testing with realistic datasets
    • Edge case handling (empty datasets, single records)
    • Database compatibility testing
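For Phase 1, a minimal benchmark harness (a sketch, not the actual Morango sync code) can compare the current one-statement-per-row pattern against a single prepared statement on an in-memory SQLite database:

```python
import sqlite3
import time


def bench(strategy, n=1000):
    """Time one update strategy over n rows on an in-memory SQLite DB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany(
        "INSERT INTO store VALUES (?, ?)", [(i, "old") for i in range(n)]
    )
    updates = [("new", i) for i in range(n)]
    start = time.perf_counter()
    strategy(conn, updates)
    conn.commit()
    elapsed = time.perf_counter() - start
    changed = conn.execute(
        "SELECT COUNT(*) FROM store WHERE val = 'new'"
    ).fetchone()[0]
    return elapsed, changed


def one_by_one(conn, updates):
    # Mirrors the current individual-save() fallback: one statement per row.
    for val, pk in updates:
        conn.execute("UPDATE store SET val = ? WHERE id = ?", (val, pk))


def prepared_many(conn, updates):
    # Single prepared statement executed over all rows.
    conn.executemany("UPDATE store SET val = ? WHERE id = ?", updates)


t1, c1 = bench(one_by_one)
t2, c2 = bench(prepared_many)
```

Real benchmarks should use representative Morango store records and run against both SQLite and PostgreSQL, per the plan above.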

Expected Benefits

  • Performance: 5-10x improvement for bulk updates
  • Scalability: Linear vs quadratic complexity
  • Maintainability: Cleaner code without individual save() loops
  • Compatibility: Works with existing Django models

Open Questions

  1. Should we implement both VALUES and TEMP TABLE approaches?
  2. What batch size works best for our typical workloads?
  3. Are there any Morango-specific constraints we need to consider?
  4. Should this be a separate utility or integrated directly into the sync pipeline?

Context

Branch: release-v0.9.x

AI usage

This issue was written by AI after it did some research, at my prompting.
