
Streaming sync serialization #287

Open

bjester wants to merge 9 commits into learningequality:release-v0.9.x from bjester:streaming-sync

Conversation

@bjester (Member) commented Feb 13, 2026

Summary

  • This is step one in a complete revitalization of the sync pipeline
  • Adds new stream utilities for managing the processing of sync data in a streaming fashion
    • I looked at several external libraries, but finding a combination that was simple yet still supported Python 3.6 was a real challenge. The closest I found was streamz, which I unfortunately opted against because it uses tornado
  • Refactors the _serialize_into_store logic into individual classes built upon the foundational stream utilities, which makes it much easier to unit test
  • Reorganizes some dependent code into locations for shared access and to avoid circular references
  • Adds typing-extensions for backported future typing features
  • Updates MorangoProfileController to use the sync_filter kwarg instead of filter; it always bothered me that it shadowed the built-in
  • Adds unit tests for the new stream utilities and the converted serialization code; the serialization process as a whole now has pretty good coverage
  • Replaces usage of _serialize_into_store with the new serialize_into_store streaming replacement
  • The new approach does not use bulk_update, as Django was observed to spend excessive time in it
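For context, this is a minimal sketch of what generator-based stream utilities with chunking can look like; the names `chunk` and `map_stream` are purely illustrative and not necessarily the classes added in this PR:

```python
# Hypothetical sketch of a generator-based stream pipeline with chunking;
# the actual utility names in this PR may differ.
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def chunk(source: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `source`."""
    it = iter(source)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def map_stream(fn: Callable[[T], U], source: Iterable[T]) -> Iterator[U]:
    """Lazily apply `fn` to each item, keeping memory usage flat."""
    for item in source:
        yield fn(item)


# Example: serialize records in chunks of 3 without materializing the whole
# dataset at once, mirroring the memory savings seen in the benchmarks below.
records = range(10)
serialized = map_stream(str, records)
batches = list(chunk(serialized, 3))
```

Because everything downstream of the source is a generator, only one chunk of items is alive at a time, which is the key property for keeping peak memory flat on large datasets.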

Improvements

The changes were evaluated by installing the local version into Kolibri. A dedicated command was created within Kolibri to run solely the serialization step, and then the performance of that command was benchmarked.

Further investigation will be required to determine how to reduce the increased duration.
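As an aside, peak-memory and duration figures like those below can be captured with the standard library alone; this is a hypothetical sketch, not the dedicated Kolibri benchmark command used for these numbers:

```python
# Minimal sketch of measuring peak memory and wall-clock duration for a
# single step; the actual Kolibri benchmark command is not shown in this PR.
import time
import tracemalloc


def benchmark(fn, *args, **kwargs):
    """Run `fn` once, returning (result, peak_bytes, seconds)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak, duration


result, peak, duration = benchmark(sum, range(1_000_000))
```

Note that `tracemalloc` only tracks Python-level allocations, so numbers from a tool that measures RSS (as the memory graphs here appear to) will generally be higher.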

Case 1: existing large dataset

Kolibri was launched with a pre-existing database containing data for about 18,000 users.

| Version | # users | Memory Graph | Peak Mem | Duration |
| --- | --- | --- | --- | --- |
| Before | 18k | (screenshot) | 325.7 MB | 12.49 sec |
| After | 18k | (screenshot) | 93.5 MB | 39.50 sec |

Case 2: artificial 500 users

Kolibri's `generateuserdata` command was used to generate data for 500 users, which is the maximum the command currently supports.

| Version | # users | Memory Graph | Peak Mem | Duration |
| --- | --- | --- | --- | --- |
| Before | 500 | (screenshot) | 217.9 MB | 5.99 sec |
| After | 500 | (screenshot) | 87.5 MB | 12.21 sec |

Case 3: large dataset reduced -- 1000 users

Since the `generateuserdata` command can currently only generate up to 500 users, the existing large dataset was trimmed down to 1000 users. After manually deleting the other users, a no-op `kolibri manage` command was executed to trigger Kolibri's FK integrity check, which deletes the broken records. Note that this probably takes longer due to the deletions, which provides additional insight into the process, even though the deletion handling itself has not really changed.

| Version | # users | Memory Graph | Peak Mem | Duration |
| --- | --- | --- | --- | --- |
| Before | 1000 | (screenshot) | 308.1 MB | 44.54 sec |
| After | 1000 | (screenshot) | 87.9 MB | 23.63 sec |

Case 4: large dataset reduced -- 5000 users

Again, the existing large dataset was trimmed down, this time to 5000 users. The same deletion behavior applies as in Case 3.

| Version | # users | Memory Graph | Peak Mem | Duration |
| --- | --- | --- | --- | --- |
| Before | 5000 | (screenshot) | 339.7 MB | 55.88 sec |
| After | 5000 | (screenshot) | 92.3 MB | 28.92 sec |

How AI was used

  • To look for stream libraries
  • Multiple models/providers were used to prototype the stream utilities
  • To verify and correct type hinting
  • To add comments, which were edited afterwards
  • To create tests for streaming utilities (simplistic)
  • To bootstrap tests for the serialization stream utils, heavily refactored by me
  • To generate documentation

TODO

  • Have tests been written for the new code?
  • Has documentation been written/updated?
  • New dependencies (if any) added to requirements file

Reviewer guidance

  • Install the branch locally into Kolibri and perform some syncs with another local Kolibri instance

Issues addressed

Closes #192

@rtibbles rtibbles self-assigned this Feb 24, 2026
@bjester bjester marked this pull request as ready for review February 25, 2026 22:14
@rtibbles (Member) left a comment

Implementation makes sense to me, and I can follow the mapping from existing operation code to the new stream architecture. The minimal changes to the existing operations tests give confidence against regressions.

The only thing I got hung up on was the names of the abstract base classes!

```python
stores_to_update.append(created_store)

if stores_to_update:
    # TODO: bulk_update performs poorly -- is there a better way?
```
@rtibbles (Member):

This library claims an 8x speed-up over bulk_update, but it also doesn't seem to be hugely well maintained, so it might be more useful as inspiration than as a dependency! https://github.com/netzkolchose/django-fast-update

@bjester (Member Author):

Thanks! Yeah, something like updating from a temp table should work better and bypass a lot of the CASE handling Django does with bulk_update.
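To illustrate the temp-table idea with plain sqlite3 (the table and column names here are invented; Morango would route this through Django's database connection):

```python
# Sketch of updating many rows via a temp table instead of bulk_update's
# single large CASE expression. Table/column names are invented for the demo.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, serialized TEXT)")
conn.executemany(
    "INSERT INTO store (id, serialized) VALUES (?, ?)",
    [(1, "old"), (2, "old"), (3, "old")],
)

# Stage the new values in a temp table, then update with a single
# correlated subquery rather than one CASE branch per row.
conn.execute("CREATE TEMP TABLE updates (id INTEGER PRIMARY KEY, serialized TEXT)")
conn.executemany(
    "INSERT INTO updates (id, serialized) VALUES (?, ?)",
    [(1, "new-1"), (3, "new-3")],
)
conn.execute(
    "UPDATE store SET serialized = "
    "(SELECT serialized FROM updates WHERE updates.id = store.id) "
    "WHERE id IN (SELECT id FROM updates)"
)
rows = conn.execute("SELECT id, serialized FROM store ORDER BY id").fetchall()
```

The staged insert can use `executemany` batching, and the final UPDATE is a single fixed-size statement regardless of how many rows change, which is where the win over a giant CASE expression comes from.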

```python
"djangorestframework>3.10",
"django-ipware==4.0.2",
"requests",
"typing-extensions==4.1.1",
```
@rtibbles (Member):

I assume this was purposeful, but flagging that this is precisely the same version of typing-extensions that Kolibri bundles (although it's still not quite clear to me what requires it, as it's not a direct dependency).

@bjester (Member Author):

It's used for the Literal typing here. This version is the latest to support Python 3.6. Perhaps this requirement definition could be improved so that it ensures we're using the latest version compatible with each Python version instead?
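For reference, a small sketch of the backported Literal usage; the function and its parameter below are hypothetical, not Morango's actual API:

```python
# Prefer the typing-extensions backport (required on Python 3.6), falling
# back to the stdlib name that exists on Python 3.8+.
try:
    from typing_extensions import Literal
except ImportError:
    from typing import Literal


def set_sync_direction(direction: Literal["push", "pull"]) -> str:
    # A type checker will reject any value other than "push" or "pull";
    # at runtime the annotation has no effect.
    return "syncing: {}".format(direction)


message = set_sync_direction("push")
```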

@rtibbles (Member):

Yes, although testing it against the version it will be bundled with in Kolibri does have its merits, so I think this is fine!

@bjester bjester force-pushed the streaming-sync branch 2 times, most recently from ee40836 to 68f27cd on March 19, 2026 at 14:44

Development

Successfully merging this pull request may close these issues.

Add chunking to serializing models into Store
