WIP: MDEV-38975 Phase 1 — Promote wide VARCHAR to BLOB for HEAP internal temp tables #4812
Draft
arcivanov wants to merge 5 commits into MariaDB:10.11
Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables by storing blob data in variable-length continuation record chains within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences of `recbuffer`-sized records. Each run stores a 10-byte header (`next_cont` pointer + `run_rec_count`) in the first record; inner records (rec 1..N-1) have no flags byte — full `recbuffer` payload. Runs are linked via `next_cont` pointers. Individual runs are capped at 65,535 records (`uint16` format limit); larger blobs are automatically split into multiple runs.

**Zero-copy reads**: single-run blobs return pointers directly into `HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:

- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk chain, reassemble into `blob_buff`

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; it is used by `heap_update()` to refresh the caller's record buffer after freeing old chains, preventing dangling pointers.

**Free list scavenging**: on insert, the free list is walked read-only (peek), tracking contiguous groups in descending address order (LIFO). Qualifying groups (>= `min_run_records`) are unlinked and used. The first non-qualifying group terminates the scan — remaining data is allocated from the block tail. The free list is never disturbed when no qualifying group is found.

**Record counting**: the new `HP_SHARE::total_records` tracks all physical records (primary + continuation). `HP_SHARE::records` remains logical (primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read `run_rec_count` from rec 0 and skip entire continuation runs at once.
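The run-sizing arithmetic above can be sketched as a small helper. This is illustrative only: `HP_CONT_HEADER_SIZE` and `recbuffer` come from the description, but `records_for_run` is a hypothetical function, not a MariaDB symbol.

```c
#include <stddef.h>

/* Run header occupies the first bytes of rec 0 of each run:
   next_cont pointer + run_rec_count (per the description above). */
#define HP_CONT_HEADER_SIZE 10

/* Physical records needed to hold data_len blob bytes in one run:
   rec 0 loses HP_CONT_HEADER_SIZE bytes to the run header, inner
   records (rec 1..N-1) carry a full recbuffer payload. */
static size_t records_for_run(size_t data_len, size_t recbuffer)
{
  if (data_len + HP_CONT_HEADER_SIZE <= recbuffer)
    return 1;                 /* whole blob fits in the header record */
  size_t rest= data_len - (recbuffer - HP_CONT_HEADER_SIZE);
  return 1 + (rest + recbuffer - 1) / recbuffer;
}
```

With `recbuffer=16`, a 32-byte blob comes out at 3 records, matching the example used in the free-list threshold fix below.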
**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`, and `hp_make_key()` updated to handle `HA_BLOB_PART` key segments — reading actual blob data via pointer dereference or chain materialization.

**SQL layer**:

- `choose_engine()` no longer rejects HEAP for blob tables (replaced the `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`).
- `remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`.
- `ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for DISTINCT deduplication support.
- Fixed undefined behavior in `test_if_cheaper_ordering()` where `select_limit/fanout` could overflow to infinity — capped at `HA_POS_ERROR`.

https://jira.mariadb.org/browse/MDEV-38975
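The `test_if_cheaper_ordering()` overflow cap can be illustrated with a minimal sketch. The constant and function names here are stand-ins, not the actual server code: dividing a huge row estimate by a tiny fanout can push a `double` to `+inf`, which then poisons every subsequent cost comparison.

```c
#include <math.h>
#include <limits.h>

/* Stand-in for HA_POS_ERROR, the "maximum/unknown" row-count sentinel. */
static const double HA_POS_ERROR_D= (double) ULLONG_MAX;

/* Keep the estimated row count finite: clamp infinities and anything
   above the sentinel back down to HA_POS_ERROR. */
static double capped_refkey_rows(double select_limit, double fanout)
{
  double rows= select_limit / fanout;
  if (!isfinite(rows) || rows > HA_POS_ERROR_D)
    rows= HA_POS_ERROR_D;
  return rows;
}
```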
The free-list allocator's minimum contiguous run threshold (`min_run_records`) could exceed the total records a small blob actually needs, making free-list reuse impossible on narrow tables. For example, with `recbuffer=16` the 128-byte floor produced `min_run_records=8`, but a 32-byte blob only needs 3 records. Any contiguous free-list group of 3 would be rejected, forcing unnecessary tail allocation. Cap both `min_run_bytes` at `data_len` and `min_run_records` at `total_records_needed` so small blobs can reuse free-list slots when a sufficient contiguous group exists.
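A minimal sketch of the capped computation, using the names from the description (`min_run_bytes`, `min_run_records`, `total_records_needed`) inside a hypothetical helper:

```c
#include <stddef.h>

/* Apply both caps described above: min_run_bytes never exceeds the
   blob's data_len, and min_run_records never exceeds the records the
   blob actually needs. Illustrative code, not the MariaDB source. */
static size_t capped_min_run_records(size_t min_run_bytes, size_t data_len,
                                     size_t recbuffer,
                                     size_t total_records_needed)
{
  if (min_run_bytes > data_len)
    min_run_bytes= data_len;               /* cap floor at blob size */
  size_t min_run_records= (min_run_bytes + recbuffer - 1) / recbuffer;
  if (min_run_records > total_records_needed)
    min_run_records= total_records_needed; /* cap at records needed */
  return min_run_records ? min_run_records : 1;
}
```

With the example's values (`recbuffer=16`, 128-byte floor, 32-byte blob needing 3 records) this yields 2 instead of 8, so a contiguous free-list group of 3 now qualifies for reuse.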
… in hash chain traversal

`hp_search()`, `hp_search_next()`, `hp_delete_key()`, and `find_unique_row()` walk hash chains calling `hp_key_cmp()` or `hp_rec_key_cmp()` for every entry. For blob key segments, each comparison triggers `hp_materialize_one_blob()`, which reassembles blob data from continuation chain records. Since each `HASH_INFO` already stores `hash_of_key`, compare it against the search key's hash before the full key comparison. When hashes differ the keys are guaranteed different, skipping the expensive materialization. This pattern already existed in `hp_write_key()` for duplicate detection but was missing from the four read/delete paths. `HP_INFO::last_hash_of_key` is added so `hp_search_next()` can reuse the hash computed by `hp_search()` without recomputing it.
Rebuild HEAP index key from `record[0]` when the index has blob key segments, because `Field_blob::new_key_field()` returns `Field_varstring` (2B length + inline data) while HEAP's `hp_hashnr`/`hp_key_cmp` expect `hp_make_key` format (4B length + data pointer). Precompute `HP_KEYDEF::has_blob_seg` flag during table creation to avoid per-call loop through key segments.
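The two key-image layouts can be contrasted in a small sketch. The helper names are hypothetical, and a little-endian host is assumed for the raw `memcpy` of the length fields.

```c
#include <stdint.h>
#include <string.h>

/* Convert one key segment from the Field_varstring layout
   (2-byte length + inline data) to the hp_make_key layout
   (4-byte length + pointer to the data). Illustrative only. */
static size_t varstring_seg_to_heap_seg(const unsigned char *src,
                                        unsigned char *dst)
{
  uint16_t len16;
  memcpy(&len16, src, 2);                 /* 2-byte inline length */
  uint32_t len32= len16;
  const unsigned char *data= src + 2;     /* data follows inline */
  memcpy(dst, &len32, 4);                 /* 4-byte length */
  memcpy(dst + 4, &data, sizeof(data));   /* pointer, not a copy */
  return 4 + sizeof(data);
}

/* Self-check: round-trip a 2-byte payload through the conversion. */
static int seg_roundtrip_ok(void)
{
  static const unsigned char src[4]= { 2, 0, 'h', 'i' };
  unsigned char dst[4 + sizeof(void *)];
  varstring_seg_to_heap_seg(src, dst);
  uint32_t len32;
  const unsigned char *p;
  memcpy(&len32, dst, 4);
  memcpy(&p, dst + 4, sizeof(p));
  return len32 == 2 && memcmp(p, "hi", 2) == 0;
}
```

Because the formats disagree in both length width and data representation, a key built by `Field_blob::new_key_field()` cannot be hashed or compared directly by HEAP, hence the rebuild from `record[0]`.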
**VARCHAR-to-BLOB promotion for HEAP engine** (`HEAP_CONVERT_IF_BIGGER_TO_BLOB = 32`):
Promote VARCHAR fields wider than 32 bytes to BLOB when creating HEAP
internal temporary tables. This avoids wasting memory on fixed-width
rows (HEAP stores VARCHAR at declared maximum width). The promotion is
transparent: the SQL layer sees VARCHAR, only the HEAP storage uses
blob continuation chains internally.
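A minimal sketch of the promotion decision, assuming a hypothetical predicate over the field's declared character length and its charset's maximum bytes per character (`mbmaxlen`):

```c
#include <stdbool.h>

/* Threshold from the description: declared octet widths above this
   are promoted to BLOB for HEAP internal temp tables. */
#define HEAP_CONVERT_IF_BIGGER_TO_BLOB 32

/* Compare against the octet width (chars * max bytes per char), so
   the rule fires for wide fields under any charset. Illustrative,
   not the actual varstring_type_handler() code. */
static bool promote_varchar_to_blob(unsigned int char_length,
                                    unsigned int mbmaxlen,
                                    bool creating_heap_tmp_table)
{
  return creating_heap_tmp_table &&
         char_length * mbmaxlen > HEAP_CONVERT_IF_BIGGER_TO_BLOB;
}
```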
**Key changes:**
1. `sql/sql_type.cc` — `varstring_type_handler()`: thread-local
`creating_heap_tmp_table` flag triggers `blob_type_handler()` for
fields exceeding the HEAP threshold
2. `sql/sql_select.cc` — `finalize()`:
- Set `key_part_flag` from `field->key_part_flag()` in GROUP BY key setup
(was unconditionally 0, masking blob key handling)
- `key_field_length` override for `new_key_field()` when blob
`key_length()` returns 0
- Skip packed rows and ensure non-zero first byte for HEAP tables
- Skip null-bits helper key part for HEAP (handles NULLs per-segment)
3. `storage/heap/ha_heap.cc`:
- `rebuild_blob_key()`: materializes blob data from continuation
chains back into `record[0]` for correct hash rebuilds
- `heap_prepare_hp_create_info()`: blob key segment setup with
`key_part->length` widening to `max_data_length()` for DISTINCT path;
uses `field->key_part_flag()` instead of `key_part->key_part_flag`
to avoid corruption from uninitialized flags in SJ weedout and
expression cache paths
4. `storage/heap/hp_hash.c`:
- Blob-aware `hp_hashnr()`, `hp_rec_hashnr()`, `hp_key_cmp()`,
`hp_rec_key_cmp()`, `hp_make_key()` with proper PAD SPACE handling
- Hash pre-check (`hash_of_key`) to skip expensive blob
materialization for non-matching hash chain entries
- VARCHAR `hp_make_key()` rewrite: always writes 2-byte length prefix
regardless of `bit_start` (fixes key format mismatch for promoted
TINYBLOB with `pack_length=1`)
5. `storage/heap/hp_create.c`:
- `HA_BLOB_PART` validation against `blob_descs` array (strips
spurious blob flags from non-blob fields)
- Blob segment `bit_start`/`length` normalization
- Continuation header size enforcement for blob tables
6. `storage/heap/hp_blob.c` (new):
- Blob continuation chain read/write/free/materialize operations
- Run-based storage with configurable slot reuse threshold
7. `sql/sql_expression_cache.cc` — disable expression cache for HEAP
tables with blob fields (key format incompatibility)
8. `sql/item_sum.cc`, `sql/item_func.cc` — blob-aware overflow-to-disk
and FULLTEXT engine swap
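The `hp_make_key()` VARCHAR fix in item 4 above can be sketched as follows (illustrative helper; little-endian host assumed for the raw length `memcpy`):

```c
#include <stdint.h>
#include <string.h>

/* Build a VARCHAR key segment: always emit a 2-byte length prefix,
   even when the field stores its length in a single byte
   (length_bytes == 1, as for a promoted TINYBLOB with pack_length=1).
   Not the actual hp_make_key() code. */
static size_t make_varchar_key_seg(unsigned char *key,
                                   const unsigned char *field_ptr,
                                   unsigned int length_bytes /* bit_start */)
{
  uint16_t len= field_ptr[0];
  if (length_bytes == 2)
    len|= (uint16_t) ((uint16_t) field_ptr[1] << 8);
  memcpy(key, &len, 2);                      /* always a 2-byte prefix */
  memcpy(key + 2, field_ptr + length_bytes, len);
  return 2 + len;
}

/* Self-check: 1-byte and 2-byte length encodings yield the same key. */
static int key_prefix_demo_ok(void)
{
  static const unsigned char one_byte[4]= { 3, 'a', 'b', 'c' };
  static const unsigned char two_byte[5]= { 3, 0, 'a', 'b', 'c' };
  unsigned char k1[8], k2[8];
  return make_varchar_key_seg(k1, one_byte, 1) == 5 &&
         make_varchar_key_seg(k2, two_byte, 2) == 5 &&
         memcmp(k1, k2, 5) == 0;
}
```

Normalizing on a 2-byte prefix keeps key images identical regardless of the field's own length-byte count, which is what removes the format mismatch for promoted TINYBLOB columns.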
**Test coverage:**
- `heap.heap_blob`, `heap.heap_blob_ops`, `heap.heap_blob_groupby`,
`heap.heap_geometry`, `heap.blob_dedup` — HEAP blob functionality
- `main.sj_mat_debug`, `main.blob_sj_test` — SJ materialization
- Unit tests in `hp_test_hash-t` and `hp_test_rebuild_blob_key-t`
covering blob hash/compare, key rebuild, key_part_flag corruption
Summary
Depends on: branch `MDEV-38975` (HEAP BLOB/TEXT support) — this PR is a single commit on top of that branch and requires it to be merged first.

This change has the exact same semantics and side effects as the existing `CONVERT_IF_BIGGER_TO_BLOB` (512 chars) promotion, just at a lower HEAP-specific threshold of 32 bytes (not characters). The threshold is in bytes because HEAP waste is in bytes: the same `VARCHAR(100)` wastes 100 bytes in latin1, 300 in utf8mb3, and 400 in utf8mb4. A 32-byte floor ensures promotion helps at any encoding.

Why: HEAP uses fixed-width rows, so every VARCHAR allocates its full declared `octet_length` per row regardless of actual data. For I_S tables like `COLUMNS` (12,560 bytes/row baseline), promoting VARCHARs wider than 32 bytes to BLOBs collapses the primary record to 312 bytes and stores actual data in continuation chains, reducing per-row memory by ~75%.

Promotion paths:

- `add_schema_fields()`
- `creating_heap_tmp_table` flag in `varstring_type_handler()`

Side effects (identical to `CONVERT_IF_BIGGER_TO_BLOB`):

- `CREATE TABLE ... AS SELECT` from promoted temp tables produces BLOB columns where VARCHAR was expected. This is a pre-existing limitation of `CONVERT_IF_BIGGER_TO_BLOB` type leakage into permanent tables, not new behavior introduced here. Fixing this across all thresholds has a large blast radius and should be a separate MDEV.
- `MYSQL_TYPE_BLOB` instead of `MYSQL_TYPE_VARCHAR` for promoted columns.
- `DESC` output shows `text`/`tinytext`/`mediumtext`/`longtext` instead of `varchar(N)` for promoted columns (cosmetic, result files re-recorded).
- `innodb.innodb-mdev-7408` test adapted to use explicit `CREATE TABLE` + `INSERT ... SELECT` instead of `CREATE TABLE ... AS SELECT` to avoid an InnoDB stopword type validation failure.

Test plan:

- `innodb.innodb-mdev-7408` passes
- `heap.*` suite passes (12/12)
- `innodb_fts.*` suite passes (98/98)
- Materialization suites pass (`subselect_mat`, `subselect_sj_mat`, `sj_mat_debug`, etc.)