
feat: compact vocabulary — single-allocation id→token store for BPE #2011

Open
ArthurZucker wants to merge 2 commits into main from compact-vocab

Conversation

@ArthurZucker
Collaborator

Summary

  • Replaces `vocab_r: AHashMap<u32, String>` in BPE with `CompactVocab`
  • All token strings concatenated into one `Vec<u8>` buffer, indexed by a `Vec<u32>` offset array
  • `id_to_token` is now two array reads instead of a hash-table probe
  • Serializes/deserializes as the same `{"token": id}` JSON — no format change
  • Sparse ids (gaps) handled via an empty-slice sentinel

Benchmark

Results on compact-vocab branch — baseline (main) bench running, will update with delta.

| Benchmark | Time | Throughput |
| --- | --- | --- |
| GPT2 encode | 1.079 s | 5.74 MiB/s |
| GPT2 encode batch | 833 ms | 7.43 MiB/s |
| GPT2 encode, no cache | 1.277 s | 4.85 MiB/s |
| GPT2 encode batch, no cache | 234 ms | 26.4 MiB/s |
| Train small | 26.2 ms | 277 KiB/s |
| Train large | 791 ms | 7.82 MiB/s |

Notes

  • Draft — only BPE uses `CompactVocab`; WordLevel and WordPiece still use `AHashMap<u32, String>`
  • Forward map (`vocab: AHashMap<String, u32>`) is unchanged; a follow-up could eliminate string duplication there too using a custom hasher over slices of the data buffer

Replace `vocab_r: AHashMap<u32, String>` in the BPE model with
`CompactVocab`: all token strings are concatenated into one `Vec<u8>`
buffer and indexed by a dense `Vec<u32>` offset array.

Reverse lookup (id → token) is now two array reads instead of a
hash-table probe, with better cache locality for sequential decoding.

- `CompactVocab::get(id)` — O(1), zero allocation, no pointer chasing
- Serialize/deserialize as the same `{"token": id}` JSON format
- Sparse ids (gaps) supported via empty-slice sentinel
- `OrderedVocabIter` no longer needed for BPE serialization
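In sketch form, the layout described above looks roughly like this (names, field layout, and the `from_pairs` constructor are illustrative, not the exact code in this PR):

```rust
// Compact id -> token store: all token bytes live in one buffer, and
// offsets[id]..offsets[id + 1] delimits token `id`'s slice.
// Unused (gap) ids are zero-length slices.
pub struct CompactVocab {
    data: Vec<u8>,     // all token strings, concatenated
    offsets: Vec<u32>, // len == max_id + 2; token i is data[offsets[i]..offsets[i+1]]
}

impl CompactVocab {
    /// Build from (token, id) pairs; ids may be sparse.
    pub fn from_pairs(pairs: &[(&str, u32)]) -> Self {
        let max_id = pairs.iter().map(|&(_, id)| id).max().unwrap_or(0) as usize;
        let mut by_id: Vec<&str> = vec![""; max_id + 1];
        for &(tok, id) in pairs {
            by_id[id as usize] = tok;
        }
        let mut data = Vec::new();
        let mut offsets = Vec::with_capacity(max_id + 2);
        offsets.push(0);
        for tok in by_id {
            data.extend_from_slice(tok.as_bytes());
            offsets.push(data.len() as u32);
        }
        CompactVocab { data, offsets }
    }

    /// O(1) reverse lookup: two array reads, no hashing, no allocation.
    /// Returns None for out-of-range ids and for gaps (empty-slice sentinel).
    pub fn get(&self, id: u32) -> Option<&str> {
        let i = id as usize;
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        if start == end {
            return None;
        }
        std::str::from_utf8(&self.data[start..end]).ok()
    }
}
```

One consequence of the empty-slice sentinel, visible in the sketch: a token that is genuinely the empty string is indistinguishable from a gap, which is fine for BPE where tokens are non-empty.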
@ArthurZucker
Collaborator Author

Benchmark results — compact-vocab vs main

Ran `cargo bench --bench bpe_benchmark` independently on both branches (same machine, sequential runs).

| Benchmark | main | compact-vocab | Δ |
| --- | --- | --- | --- |
| GPT2 encode | 1.102 s | 1.079 s | −2.1% |
| GPT2 encode batch | 775 ms | 833 ms | +7.5% |
| GPT2 encode, no cache | 1.312 s | 1.277 s | −2.7% |
| GPT2 encode batch, no cache | 241 ms | 234 ms | −2.9% |
| Train small | 25.9 ms | 26.2 ms | +1.2% |
| Train large | 784 ms | 791 ms | +0.9% |

Reading

Single-threaded encode and the no-cache paths show a steady 2–3% improvement, consistent with `vocab_r` lookups hitting the compact buffer instead of chasing hash-table pointers.

The cached batch result (+7.5%) is the outlier and is suspicious: the no-cache batch, which calls `vocab_r` more heavily, moves the other way (−2.9%). This is likely thermal noise between the two runs rather than a real regression; a second run with criterion's `--save-baseline` / `--load-baseline` flags on the same binary would settle it.

Train throughput is flat (±1%), as expected — `vocab_r` is rarely touched during training.

Summary

The change is at worst neutral and shows a measurable win on the hot single-threaded decode path. The forward map (`vocab: AHashMap<String, u32>`) is the next candidate if we want bigger gains.
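The forward-map follow-up could borrow slices of the same compact buffer instead of owning duplicate `String` keys. A minimal sketch of that idea (the span layout and function name are assumptions, and it uses the default hasher rather than the custom slice hasher the summary suggests):

```rust
use std::collections::HashMap;

// Sketch: build a token -> id map whose keys borrow the compact buffer,
// so each token's bytes exist exactly once in memory. spans[id] is a
// (start, end) pair into `data`; real code would derive these from the
// CompactVocab offset array. Zero-length spans (gaps) are skipped.
fn build_forward<'a>(data: &'a [u8], spans: &[(usize, usize)]) -> HashMap<&'a [u8], u32> {
    let mut map = HashMap::with_capacity(spans.len());
    for (id, &(start, end)) in spans.iter().enumerate() {
        if start != end {
            // Key is a borrowed slice of `data`: no String duplication.
            map.insert(&data[start..end], id as u32);
        }
    }
    map
}
```

The trade-off is a lifetime tie between the forward map and the buffer, which is why the summary's custom-hasher approach (hashing spans that index into the buffer) may be the more practical route inside the model struct.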

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator Author

/benchmark

@github-actions

Python Benchmark results

Commit: 7ded22bff1c832b2202fc89ecb3e14f558b19693

| Benchmark | Baseline (ms) | This run (ms) | Δ |
| --- | --- | --- | --- |
| test_async_encode_batch | 1305.2 | 1271.8 | -2.6% |
| test_async_encode_batch_fast | 1054.6 | 1019.7 | -3.3% |
| test_decode_batch | 2.4 | 2.2 | -7.9% |
| test_encode | 2545.9 | 2461.7 | -3.3% |
| test_encode_batch | 1301.0 | 1276.6 | -1.9% |
| test_encode_batch_multithreaded | 1289.6 | 1252.4 | -2.9% |
| test_encode_fast | 1043.3 | 1020.2 | -2.2% |
| test_from_file_albert | 45.4 | 40.5 | -10.8% |
| test_from_file_llama3 | 408.7 | 396.6 | -2.9% |
| test_from_file_roberta | 76.1 | 66.9 | -12.1% |
| test_from_str_llama3 | 389.0 | 370.6 | -4.7% |
| test_to_str_llama3 | 107.2 | 68.0 | -36.5% |
| test_train_bpe_small | 16.2 | 16.3 | +0.6% |

@github-actions

Rust Benchmark results

Commit: 7ded22bff1c832b2202fc89ecb3e14f558b19693

| Benchmark | Baseline (ns/iter) | This run (ns/iter) | Δ |
| --- | --- | --- | --- |
| bpe-gpt2/encode | 1815016018 | 1802834354 | 0% |
| bpe-gpt2/encode-batch | 883721924 | 849979190 | -3% |
| bpe-gpt2/encode-batch-no-cache | 1024733230 | 998831888 | -2% |
| bpe-gpt2/encode-no-cache | 2345818394 | 2377053988 | +1% |
| llama3/concurrent-4t | 76814529 | 50898491 | -33% |
| llama3/encode | 1754898015 | 1793877596 | +2% |
| llama3/encode-batch | 867783684 | 848523552 | -2% |
| llama3/encode-char-offsets | 1067309310 | 1052536465 | -1% |
| llama3/encode-fast | 1672139715 | 1727118766 | +3% |
| serialization/bpe-from-file-gpt2 | 47651117 | 46570285 | -2% |
| serialization/deserialize-llama3 | 405279321 | 404986170 | 0% |
| serialization/deserialize-roberta | 74238789 | 73957908 | 0% |
| serialization/from-file-albert | 36663177 | 36320704 | 0% |
| serialization/from-file-llama3 | 371594895 | 378331389 | +1% |
| serialization/from-file-roberta | 62753817 | 63528225 | +1% |
| serialization/save-llama3 | 109097437 | 73693961 | -32% |
| train/bpe-small | 17622182 | 17160729 | -2% |
