feat: compact vocabulary — single-allocation id→token store for BPE #2011
ArthurZucker wants to merge 2 commits into `main`
Conversation
Replace `vocab_r: AHashMap<u32, String>` in the BPE model with
`CompactVocab`: all token strings are concatenated into one `Vec<u8>`
buffer and indexed by a dense `Vec<u32>` offset array.
Reverse lookup (id → token) is now two array reads instead of a
hash-table probe, with better cache locality for sequential decoding.
- `CompactVocab::get(id)` — O(1), zero allocation, no pointer chasing
- Serialize/deserialize as the same `{"token": id}` JSON format
- Sparse ids (gaps) supported via empty-slice sentinel
- `OrderedVocabIter` no longer needed for BPE serialization
Benchmark results — compact-vocab vs main

Reading
- Single-threaded encode and no-cache paths show a consistent 2–3% improvement.
- The cached batch result (+7.5%) is the outlier and is suspicious: the no-cache batch goes the other way (−2.9%).
- Train throughput is flat (±1%), as expected.

Summary
The change is at worst neutral and shows a measurable win on the hot single-threaded decode path. The forward map (`vocab: AHashMap<String, u32>`) is unchanged.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
/benchmark
Python Benchmark results
Commit:
Rust Benchmark results
Commit:
Summary
- Replaces `vocab_r: AHashMap<u32, String>` in `BPE` with `CompactVocab`
- Token strings are concatenated into a single `Vec<u8>` buffer, indexed by a `Vec<u32>` offset array
- `id_to_token` is now two array reads instead of a hash-table probe
- Serialization keeps the `{"token": id}` JSON — no format change

Benchmark
Results on compact-vocab branch — baseline (main) bench running, will update with delta.
Notes
- Only BPE uses `CompactVocab`; WordLevel and WordPiece still use `AHashMap<u32, String>`
- The forward map (`vocab: AHashMap<String, u32>`) is unchanged; a follow-up could eliminate string duplication there too using a custom hasher over slices of the data buffer