Streaming sync serialization #287
bjester wants to merge 9 commits into learningequality:release-v0.9.x from
Conversation
rtibbles left a comment
Implementation makes sense to me, and I can follow the mapping from existing operation code to the new stream architecture. The minimal changes to the existing operations tests give confidence against regressions.
The only thing I got hung up on was the names of the abstract base classes!
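The stream architecture referred to above can be sketched, very roughly, as small composable stages; decomposing a monolithic serialize step this way is what makes each piece unit-testable in isolation. All class and field names below are hypothetical illustrations, not Morango's actual API:

```python
# Hypothetical sketch of a stage-based stream pipeline.
# Each stage is a small class with one responsibility, so it can be
# unit tested on its own with plain lists of dicts.
class Stage:
    def process(self, items):
        raise NotImplementedError


class FilterDirty(Stage):
    def process(self, items):
        # Pass through only records flagged as needing serialization.
        return (item for item in items if item.get("dirty"))


class ClearDirtyBit(Stage):
    def process(self, items):
        # Emit a copy of each record with its dirty flag cleared.
        for item in items:
            yield dict(item, dirty=False)


def run_pipeline(items, stages):
    # Chain the stages lazily; nothing runs until the result is consumed.
    for stage in stages:
        items = stage.process(items)
    return list(items)
```

Because every stage consumes and produces plain iterables, a test can feed a two-element list into a single stage and assert on the output without any database setup.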
```python
stores_to_update.append(created_store)

if stores_to_update:
    # TODO: bulk_update performs poorly-- is there a better way?
```
This library claims 8x speed up over bulk_update - but also doesn't seem to be hugely well maintained, so might be useful to look at for inspiration rather than usage! https://github.com/netzkolchose/django-fast-update
Thanks! Yeah something like updating from a temp table should work better and bypass a lot of CASE handling Django does with bulk_update
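As a rough sketch of the temp-table idea: load the new values into a temporary table and issue a single correlated `UPDATE`, instead of the large `CASE`/`WHEN` statement Django's `bulk_update` generates. Plain `sqlite3` here with illustrative table and column names, not the actual Morango schema:

```python
# Hedged sketch: update-from-temp-table as an alternative to bulk_update.
# Table/column names are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (id INTEGER PRIMARY KEY, serialized TEXT)")
conn.executemany(
    "INSERT INTO store VALUES (?, ?)",
    [(1, "old"), (2, "old"), (3, "old")],
)

# Stage the new values in a temp table...
updates = [(1, "new-1"), (3, "new-3")]
conn.execute(
    "CREATE TEMP TABLE store_updates (id INTEGER PRIMARY KEY, serialized TEXT)"
)
conn.executemany("INSERT INTO store_updates VALUES (?, ?)", updates)

# ...then one correlated UPDATE replaces N CASE branches.
conn.execute(
    "UPDATE store SET serialized = ("
    " SELECT u.serialized FROM store_updates u WHERE u.id = store.id)"
    " WHERE id IN (SELECT id FROM store_updates)"
)
```

The correlated-subquery form works on old SQLite versions; SQLite 3.33+ also supports the terser `UPDATE ... FROM` join syntax.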
| "djangorestframework>3.10", | ||
| "django-ipware==4.0.2", | ||
| "requests", | ||
| "typing-extensions==4.1.1", |
I assume this was purposeful, but flagging that this is precisely the same version of typing-extensions that Kolibri bundles (although it's still not quite clear to me what requires it, as it's not a direct dependency).
It's used for the Literal typing here. This version is the latest to support Python 3.6. It's possible this requirement definition could be improved so it ensures we're using the latest version available for a given Python version instead?
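For example, PEP 508 environment markers could express "latest version per Python version" in a single requirements list. This is a hypothetical sketch, not the PR's actual requirements, and the 3.7 boundary assumes 4.1.1 is the last release supporting Python 3.6:

```python
# Hypothetical sketch: environment markers pin the last 3.6-compatible
# release, while newer Pythons are free to take the latest version.
install_requires = [
    "typing-extensions==4.1.1; python_version < '3.7'",
    "typing-extensions>=4.1.1; python_version >= '3.7'",
]
```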
Yes - although testing it against the version it will be bundled with in Kolibri does have its merits, so I think this is fine!
Summary
- Considered streamz, which I unfortunately opted against because it uses tornado
- Refactored _serialize_into_store logic into individual classes built upon foundational stream utilities -- so much better for unit testing!
- Added typing-extensions for backported future typing features
- Updated MorangoProfileController to use the sync_filter kwarg instead of filter -- always bothered me it shadowed the built-in
- Replaced _serialize_into_store with a new serialize_into_store streaming replacement
- Avoided bulk_update as Django was observed to spend excessive time with it

Improvements
The changes were evaluated by installing the local version into Kolibri. A dedicated command was created within Kolibri to run solely the serialization step, and then the performance of that command was benchmarked.
Further investigation will be required to determine how to reduce the increased duration.
Case 1: existing large dataset
Kolibri was launched with a pre-existing database containing data for about 18,000 users.
Case 2: artificial 500 users
Kolibri's generateuserdata command was used to generate data for 500 users, which is the maximum the command currently supports.
Case 3: large dataset reduced -- 1000 users
Since the generateuserdata command currently can only generate up to 500 users, the existing large dataset was trimmed down to 1000 users. After manually deleting the other users, kolibri manage was executed (no-op) to trigger Kolibri's FK integrity check, which deletes the broken records. Note, this probably takes longer due to the deletions, which provides additional insights into the process, even though the deletion processing has not really changed.
Case 4: large dataset reduced -- 5000 users
Again, the existing large dataset was trimmed down, this time to 5000 users. The same situation applies with regard to deletion behavior as in Case 3.
How AI was used
TODO
Reviewer guidance
Issues addressed
Closes #192