
feat: compact vocabulary — single-allocation id→token store for BPE #2011

Open
ArthurZucker wants to merge 2 commits into main from compact-vocab

Conversation

@ArthurZucker
Collaborator

Summary

  • Replaces `vocab_r: AHashMap<u32, String>` in BPE with `CompactVocab`
  • All token strings concatenated into one `Vec<u8>` buffer, indexed by a `Vec<u32>` offset array
  • `id_to_token` is now two array reads instead of a hash-table probe
  • Serializes/deserializes as the same `{"token": id}` JSON — no format change
  • Sparse ids (gaps) handled via an empty-slice sentinel

Benchmark

Results on compact-vocab branch — baseline (main) bench running, will update with delta.

| Benchmark | Time | Throughput |
| --- | --- | --- |
| GPT2 encode | 1.079 s | 5.74 MiB/s |
| GPT2 encode batch | 833 ms | 7.43 MiB/s |
| GPT2 encode, no cache | 1.277 s | 4.85 MiB/s |
| GPT2 encode batch, no cache | 234 ms | 26.4 MiB/s |
| Train small | 26.2 ms | 277 KiB/s |
| Train large | 791 ms | 7.82 MiB/s |

Notes

  • Draft — only BPE uses `CompactVocab`; WordLevel and WordPiece still use `AHashMap<u32, String>`
  • Forward map (`vocab: AHashMap<String, u32>`) is unchanged; a follow-up could eliminate string duplication there too using a custom hasher over slices of the data buffer

Replace `vocab_r: AHashMap<u32, String>` in the BPE model with
`CompactVocab`: all token strings are concatenated into one `Vec<u8>`
buffer and indexed by a dense `Vec<u32>` offset array.

Reverse lookup (id → token) is now two array reads instead of a
hash-table probe, with better cache locality for sequential decoding.

- `CompactVocab::get(id)` — O(1), zero allocation, no pointer chasing
- Serialize/deserialize as the same `{"token": id}` JSON format
- Sparse ids (gaps) supported via empty-slice sentinel
- `OrderedVocabIter` no longer needed for BPE serialization
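In sketch form, the layout described above looks roughly like this (names, field layout, and the `from_pairs` constructor are illustrative, not the exact code in this PR):

```rust
// Compact id -> token store: all token bytes live in one buffer, and
// offsets[id]..offsets[id + 1] delimits token `id`'s slice.
// Unused (gap) ids are zero-length slices.
pub struct CompactVocab {
    data: Vec<u8>,     // all token strings, concatenated
    offsets: Vec<u32>, // len == max_id + 2; token i is data[offsets[i]..offsets[i+1]]
}

impl CompactVocab {
    /// Build from (token, id) pairs; ids may be sparse.
    pub fn from_pairs(pairs: &[(&str, u32)]) -> Self {
        let max_id = pairs.iter().map(|&(_, id)| id).max().unwrap_or(0) as usize;
        let mut by_id: Vec<&str> = vec![""; max_id + 1];
        for &(tok, id) in pairs {
            by_id[id as usize] = tok;
        }
        let mut data = Vec::new();
        let mut offsets = Vec::with_capacity(max_id + 2);
        offsets.push(0);
        for tok in by_id {
            data.extend_from_slice(tok.as_bytes());
            offsets.push(data.len() as u32);
        }
        CompactVocab { data, offsets }
    }

    /// O(1) reverse lookup: two array reads, no hashing, no allocation.
    /// Returns None for out-of-range ids and for gaps (empty-slice sentinel).
    pub fn get(&self, id: u32) -> Option<&str> {
        let i = id as usize;
        let start = *self.offsets.get(i)? as usize;
        let end = *self.offsets.get(i + 1)? as usize;
        if start == end {
            return None;
        }
        std::str::from_utf8(&self.data[start..end]).ok()
    }
}
```

One consequence of the empty-slice sentinel, visible in the sketch: a token that is genuinely the empty string is indistinguishable from a gap, which is fine for BPE where tokens are non-empty.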
@ArthurZucker
Collaborator Author

Benchmark results — compact-vocab vs main

Ran `cargo bench --bench bpe_benchmark` independently on both branches (same machine, sequential runs).

| Benchmark | main | compact-vocab | Δ |
| --- | --- | --- | --- |
| GPT2 encode | 1.102 s | 1.079 s | −2.1% |
| GPT2 encode batch | 775 ms | 833 ms | +7.5% |
| GPT2 encode, no cache | 1.312 s | 1.277 s | −2.7% |
| GPT2 encode batch, no cache | 241 ms | 234 ms | −2.9% |
| Train small | 25.9 ms | 26.2 ms | +1.2% |
| Train large | 784 ms | 791 ms | +0.9% |

Reading

Single-threaded encode and the no-cache paths show a steady 2–3% improvement, consistent with `vocab_r` lookups hitting the compact buffer instead of chasing hash-table pointers.

The cached batch result (+7.5%) is the outlier and is suspicious: the no-cache batch, which calls `vocab_r` more heavily, moves the other way (−2.9%). This is likely thermal noise between the two runs rather than a real regression; a second run with criterion's `--save-baseline` / `--load-baseline` flags on the same binary would settle it.

Train throughput is flat (±1%), as expected — `vocab_r` is rarely touched during training.

Summary

The change is at worst neutral and shows a measurable win on the hot single-threaded decode path. The forward map (`vocab: AHashMap<String, u32>`) is the next candidate if we want bigger gains.
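The forward-map follow-up could borrow slices of the same compact buffer instead of owning duplicate `String` keys. A minimal sketch of that idea (the span layout and function name are assumptions, and it uses the default hasher rather than the custom slice hasher the summary suggests):

```rust
use std::collections::HashMap;

// Sketch: build a token -> id map whose keys borrow the compact buffer,
// so each token's bytes exist exactly once in memory. spans[id] is a
// (start, end) pair into `data`; real code would derive these from the
// CompactVocab offset array. Zero-length spans (gaps) are skipped.
fn build_forward<'a>(data: &'a [u8], spans: &[(usize, usize)]) -> HashMap<&'a [u8], u32> {
    let mut map = HashMap::with_capacity(spans.len());
    for (id, &(start, end)) in spans.iter().enumerate() {
        if start != end {
            // Key is a borrowed slice of `data`: no String duplication.
            map.insert(&data[start..end], id as u32);
        }
    }
    map
}
```

The trade-off is a lifetime tie between the forward map and the buffer, which is why the summary's custom-hasher approach (hashing spans that index into the buffer) may be the more practical route inside the model struct.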

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator Author

/benchmark

@github-actions

Python Benchmark results

Commit: 7ded22bff1c832b2202fc89ecb3e14f558b19693

| Benchmark | Baseline (ms) | This run (ms) | Δ |
| --- | --- | --- | --- |
| test_async_encode_batch | 1305.2 | 1271.8 | -2.6% |
| test_async_encode_batch_fast | 1054.6 | 1019.7 | -3.3% |
| test_decode_batch | 2.4 | 2.2 | -7.9% |
| test_encode | 2545.9 | 2461.7 | -3.3% |
| test_encode_batch | 1301.0 | 1276.6 | -1.9% |
| test_encode_batch_multithreaded | 1289.6 | 1252.4 | -2.9% |
| test_encode_fast | 1043.3 | 1020.2 | -2.2% |
| test_from_file_albert | 45.4 | 40.5 | -10.8% |
| test_from_file_llama3 | 408.7 | 396.6 | -2.9% |
| test_from_file_roberta | 76.1 | 66.9 | -12.1% |
| test_from_str_llama3 | 389.0 | 370.6 | -4.7% |
| test_to_str_llama3 | 107.2 | 68.0 | -36.5% |
| test_train_bpe_small | 16.2 | 16.3 | +0.6% |

@github-actions

Rust Benchmark results

Commit: 7ded22bff1c832b2202fc89ecb3e14f558b19693

| Benchmark | Baseline (ns/iter) | This run (ns/iter) | Δ |
| --- | --- | --- | --- |
| bpe-gpt2/encode | 1815016018 | 1802834354 | 0% |
| bpe-gpt2/encode-batch | 883721924 | 849979190 | -3% |
| bpe-gpt2/encode-batch-no-cache | 1024733230 | 998831888 | -2% |
| bpe-gpt2/encode-no-cache | 2345818394 | 2377053988 | +1% |
| llama3/concurrent-4t | 76814529 | 50898491 | -33% |
| llama3/encode | 1754898015 | 1793877596 | +2% |
| llama3/encode-batch | 867783684 | 848523552 | -2% |
| llama3/encode-char-offsets | 1067309310 | 1052536465 | -1% |
| llama3/encode-fast | 1672139715 | 1727118766 | +3% |
| serialization/bpe-from-file-gpt2 | 47651117 | 46570285 | -2% |
| serialization/deserialize-llama3 | 405279321 | 404986170 | 0% |
| serialization/deserialize-roberta | 74238789 | 73957908 | 0% |
| serialization/from-file-albert | 36663177 | 36320704 | 0% |
| serialization/from-file-llama3 | 371594895 | 378331389 | +1% |
| serialization/from-file-roberta | 62753817 | 63528225 | +1% |
| serialization/save-llama3 | 109097437 | 73693961 | -32% |
| train/bpe-small | 17622182 | 17160729 | -2% |
