WIP: MDEV-38975 Phase 1 — Promote wide VARCHAR to BLOB for HEAP internal temp tables#4812

Draft
arcivanov wants to merge 5 commits into MariaDB:10.11 from arcivanov:MDEV-38975-Phase1-varchar-blob-promotion
Conversation

@arcivanov
Contributor

Summary

Depends on: branch MDEV-38975 (HEAP BLOB/TEXT support) — this PR is a single commit on top of that branch and requires it to be merged first.

This change has the exact same semantics and side effects as the existing CONVERT_IF_BIGGER_TO_BLOB (512 chars) promotion, just at a lower HEAP-specific threshold of 32 bytes (not characters). The threshold is in bytes because HEAP waste is in bytes: the same VARCHAR(100) wastes 100 bytes in latin1, 300 in utf8mb3, and 400 in utf8mb4. A 32-byte floor ensures promotion helps at any encoding.

Why: HEAP uses fixed-width rows, so every VARCHAR allocates its full declared octet_length per row regardless of actual data. For I_S tables like COLUMNS (12,560 bytes/row baseline), promoting VARCHARs wider than 32 bytes to BLOBs collapses the primary record to 312 bytes and stores actual data in continuation chains, reducing per-row memory by ~75%.
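The byte-based decision above can be sketched as follows. The constant name `HEAP_CONVERT_IF_BIGGER_TO_BLOB` is from this PR; the two helpers and the `mbmaxlen` parameter are illustrative, not server code:

```c
#include <stdbool.h>

/* HEAP_CONVERT_IF_BIGGER_TO_BLOB is the threshold named in this PR;
   the helpers below are illustrative sketches, not server code. */
enum { HEAP_CONVERT_IF_BIGGER_TO_BLOB = 32 };

/* A VARCHAR's octet length: declared character count times the
   charset's maximum bytes per character (mbmaxlen). */
unsigned varchar_octet_length(unsigned chars, unsigned mbmaxlen)
{
  return chars * mbmaxlen;
}

/* The promotion decision is made in bytes, so the same VARCHAR(100)
   is promoted under latin1 (100 B), utf8mb3 (300 B), and utf8mb4
   (400 B) alike. */
bool promote_to_blob(unsigned chars, unsigned mbmaxlen)
{
  return varchar_octet_length(chars, mbmaxlen) >
         HEAP_CONVERT_IF_BIGGER_TO_BLOB;
}
```

A VARCHAR(8) stays unpromoted even in utf8mb4 (32 bytes, not above the floor), which is the intent of a byte threshold rather than a character one.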

Promotion paths:

  • I_S temp tables via add_schema_fields()
  • Expression-path temp tables via creating_heap_tmp_table flag in varstring_type_handler()

Side effects (identical to CONVERT_IF_BIGGER_TO_BLOB):

  • CREATE TABLE ... AS SELECT from promoted temp tables produces BLOB columns where VARCHAR was expected. This is a pre-existing limitation of CONVERT_IF_BIGGER_TO_BLOB (type leakage into permanent tables), not new behavior introduced here. Fixing it across all thresholds has a large blast radius and should be a separate MDEV.
  • Result set metadata shows MYSQL_TYPE_BLOB instead of MYSQL_TYPE_VARCHAR for promoted columns.
  • Sysschema DESC output shows text/tinytext/mediumtext/longtext instead of varchar(N) for promoted columns (cosmetic, result files re-recorded).
  • innodb.innodb-mdev-7408 test adapted to use explicit CREATE TABLE + INSERT ... SELECT instead of CREATE TABLE ... AS SELECT to avoid InnoDB stopword type validation failure.

Test plan

  • All sysschema tests pass (89/89)
  • innodb.innodb-mdev-7408 passes
  • heap.* suite passes (12/12)
  • innodb_fts.* suite passes (98/98)
  • Subselect tests pass (subselect_mat, subselect_sj_mat, sj_mat_debug, etc.)
  • Full CI run

Allow BLOB/TEXT/JSON/GEOMETRY columns in MEMORY (HEAP) engine tables
by storing blob data in variable-length continuation record chains
within the existing `HP_BLOCK` structure.

**Continuation runs**: blob data is split across contiguous sequences
of `recbuffer`-sized records. Each run stores a 10-byte header
(`next_cont` pointer + `run_rec_count`) in the first record; inner
records (rec 1..N-1) have no flags byte — full `recbuffer` payload.
Runs are linked via `next_cont` pointers. Individual runs are capped
at 65,535 records (`uint16` format limit); larger blobs are
automatically split into multiple runs.
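The sizing rule above can be sketched as a helper. `HP_CONT_HEADER_SIZE = 10` and the 65,535-record cap are from this description; `records_for_run` itself is an illustrative assumption about the layout (header in record 0, full payload in inner records):

```c
/* Run sizing sketch: the first record loses HP_CONT_HEADER_SIZE bytes
   to the run header; inner records carry a full recbuffer payload. */
enum { HP_CONT_HEADER_SIZE = 10, HP_MAX_RUN_RECORDS = 65535 };

unsigned records_for_run(unsigned data_len, unsigned recbuffer)
{
  unsigned first_payload = recbuffer - HP_CONT_HEADER_SIZE;
  if (data_len <= first_payload)
    return 1;
  /* Ceiling division for the remaining full-payload records. */
  return 1 + (data_len - first_payload + recbuffer - 1) / recbuffer;
}
```

With `recbuffer = 128`, a 118-byte blob fits in one record; 119 bytes spills into a second.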

**Zero-copy reads**: single-run blobs return pointers directly into
`HP_BLOCK` records, avoiding `blob_buff` reassembly entirely:
- Case A (`run_rec_count == 1`): return `chain + HP_CONT_HEADER_SIZE`
- Case B (`HP_ROW_CONT_ZEROCOPY` flag): return `chain + recbuffer`
- Case C (multi-run): walk chain, reassemble into `blob_buff`

`HP_INFO::has_zerocopy_blobs` tracks zero-copy state; used by
`heap_update()` to refresh the caller's record buffer after freeing
old chains, preventing dangling pointers.
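The three-way dispatch can be sketched as below. The case A/B offsets are from the list above; the dispatcher, the flag value `0x01`, and the demo helper are assumptions for illustration, not the actual `hp_blob.c` code:

```c
enum { HP_CONT_HEADER_SIZE = 10 };
#define HP_ROW_CONT_ZEROCOPY 0x01   /* assumed flag value */

typedef const unsigned char *(*reassemble_fn)(const unsigned char *chain);

/* Dispatch over the three read cases; only case C copies data. */
const unsigned char *blob_read_ptr(const unsigned char *chain,
                                   unsigned run_rec_count, unsigned flags,
                                   unsigned recbuffer,
                                   reassemble_fn reassemble)
{
  if (run_rec_count == 1)
    return chain + HP_CONT_HEADER_SIZE; /* case A: data after header */
  if (flags & HP_ROW_CONT_ZEROCOPY)
    return chain + recbuffer;           /* case B: zero-copy run */
  return reassemble(chain);             /* case C: copy to blob_buff */
}

/* Tiny demo reporting which case a read takes ('A', 'B', or 'C'). */
char demo_read_case(unsigned run_rec_count, unsigned flags)
{
  static const unsigned char chain[256];
  const unsigned char *p = blob_read_ptr(chain, run_rec_count, flags,
                                         64, 0);
  if (p == chain + HP_CONT_HEADER_SIZE) return 'A';
  if (p == chain + 64) return 'B';
  return 'C';
}
```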

**Free list scavenging**: on insert, the free list is walked read-only
(peek) tracking contiguous groups in descending address order (LIFO).
Qualifying groups (>= `min_run_records`) are unlinked and used. The
first non-qualifying group terminates the scan — remaining data is
allocated from the block tail. The free list is never disturbed when
no qualifying group is found.
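The peek-scan can be sketched as follows. Assumptions for illustration: the free list is flattened into an array of record addresses in descending order, slots are `recbuffer` bytes apart, and `min_run` is the qualifying group size; the real code walks linked free-list entries instead:

```c
/* Returns the index where a qualifying contiguous group starts, or -1
   once a non-qualifying group is seen (leaving the list undisturbed). */
int find_qualifying_group(const unsigned long *free_list, int n,
                          unsigned recbuffer, int min_run)
{
  int i = 0;
  while (i < n)
  {
    int j = i + 1;
    /* Extend the group while addresses stay contiguous (descending). */
    while (j < n && free_list[j - 1] - free_list[j] == recbuffer)
      j++;
    if (j - i < min_run)
      break;       /* first non-qualifying group terminates the scan */
    return i;      /* qualifying group: unlink and reuse these slots */
  }
  return -1;       /* fall back to block-tail allocation */
}
```

In the real allocator a qualifying group is unlinked and the scan can continue; the sketch returns the first hit to keep the shape visible.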

**Record counting**: new `HP_SHARE::total_records` tracks all physical
records (primary + continuation). `HP_SHARE::records` remains logical
(primary-only) to preserve linear hash bucket mapping correctness.

**Scan/check batch-skip**: `heap_scan()` and `heap_check_heap()` read
`run_rec_count` from rec 0 and skip entire continuation runs at once.

**Hash functions**: `hp_rec_hashnr()`, `hp_rec_key_cmp()`, `hp_key_cmp()`,
`hp_make_key()` updated to handle `HA_BLOB_PART` key segments — reading
actual blob data via pointer dereference or chain materialization.

**SQL layer**: `choose_engine()` no longer rejects HEAP for blob tables
(replaced `blob_fields` check with `reclength > HA_MAX_REC_LENGTH`).
`remove_duplicates()` routes HEAP+blob to `remove_dup_with_compare()`.
`ha_heap::remember_rnd_pos()` / `restart_rnd_next()` implemented for
DISTINCT deduplication support. Fixed undefined behavior in
`test_if_cheaper_ordering()` where `select_limit/fanout` could overflow
to infinity — capped at `HA_POS_ERROR`.
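The overflow guard can be sketched as below. `HA_POS_ERROR` is the server's sentinel; the `ha_rows` typedef and `safe_limit_div` helper are simplified stand-ins, not the actual `test_if_cheaper_ordering()` code:

```c
typedef unsigned long long ha_rows;     /* simplified stand-in */
#define HA_POS_ERROR (~(ha_rows) 0)     /* sentinel, as in the server */

/* Dividing a large row estimate by a tiny fanout can overflow to
   infinity, and casting an out-of-range double to an integer type is
   undefined behavior. The negated check below also catches NaN. */
ha_rows safe_limit_div(double select_limit, double fanout)
{
  double v = select_limit / fanout;
  if (!(v < (double) HA_POS_ERROR))
    return HA_POS_ERROR;                /* cap instead of a UB cast */
  return (ha_rows) v;
}
```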

https://jira.mariadb.org/browse/MDEV-38975

The free-list allocator's minimum contiguous run threshold
(`min_run_records`) could exceed the total records a small blob
actually needs, making free-list reuse impossible on narrow tables.

For example, with `recbuffer=16` the 128-byte floor produced
`min_run_records=8`, but a 32-byte blob only needs 3 records.
Any contiguous free-list group of 3 would be rejected, forcing
unnecessary tail allocation.

Cap both `min_run_bytes` at `data_len` and `min_run_records` at
`total_records_needed` so small blobs can reuse free-list slots
when a sufficient contiguous group exists.
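The capped computation can be sketched as below. `min_run_bytes`, `min_run_records`, and `total_records_needed` are the quantities named above, and the 128-byte floor and 10-byte continuation header come from this patch series; the helpers themselves are illustrative:

```c
enum { HP_CONT_HEADER_SIZE = 10, MIN_RUN_BYTES_FLOOR = 128 };

/* Physical records a blob of data_len bytes needs (header in rec 0). */
unsigned total_records_needed(unsigned data_len, unsigned recbuffer)
{
  unsigned first_payload = recbuffer - HP_CONT_HEADER_SIZE;
  if (data_len <= first_payload)
    return 1;
  return 1 + (data_len - first_payload + recbuffer - 1) / recbuffer;
}

unsigned capped_min_run_records(unsigned data_len, unsigned recbuffer)
{
  unsigned min_run_bytes = MIN_RUN_BYTES_FLOOR;
  if (min_run_bytes > data_len)
    min_run_bytes = data_len;               /* cap at data_len */
  unsigned min_run_records = (min_run_bytes + recbuffer - 1) / recbuffer;
  unsigned needed = total_records_needed(data_len, recbuffer);
  if (min_run_records > needed)
    min_run_records = needed;               /* cap at records needed */
  return min_run_records;
}
```

With `recbuffer = 16`, the uncapped floor gives `min_run_records = 8`; under this sketch a 32-byte blob (3 records total) caps down to 2, so a contiguous free-list group of 3 now qualifies for reuse.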

… in hash chain traversal

`hp_search()`, `hp_search_next()`, `hp_delete_key()`, and
`find_unique_row()` walk hash chains calling `hp_key_cmp()` or
`hp_rec_key_cmp()` for every entry. For blob key segments, each
comparison triggers `hp_materialize_one_blob()` which reassembles
blob data from continuation chain records.

Since each `HASH_INFO` already stores `hash_of_key`, compare it
against the search key's hash before the full key comparison. When
hashes differ the keys are guaranteed different, skipping the
expensive materialization. This pattern already existed in
`hp_write_key()` for duplicate detection but was missing from the
four read/delete paths.

`HP_INFO::last_hash_of_key` is added so `hp_search_next()` can
reuse the hash computed by `hp_search()` without recomputing it.
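The pre-check pattern can be sketched as follows. `hash_of_key` is stored per `HASH_INFO` entry in the real code; here the chain is flattened into arrays and `key_eq[i]` stands in for the expensive `hp_key_cmp()`/materialization step:

```c
/* Walk a hash chain, skipping full key comparison (and thus blob
   materialization) whenever the stored hash differs from the search
   hash. Counts the expensive comparisons actually performed. */
int chain_search(const unsigned long *hashes, const int *key_eq, int n,
                 unsigned long search_hash, int *expensive_cmps)
{
  for (int i = 0; i < n; i++)
  {
    if (hashes[i] != search_hash)
      continue;               /* hashes differ => keys differ: skip */
    (*expensive_cmps)++;      /* only now materialize and compare */
    if (key_eq[i])
      return i;
  }
  return -1;
}

/* Demo: three chain entries with hashes {7, 9, 9}; searching for
   hash 9 performs two full comparisons instead of three. */
int demo_expensive_cmps(void)
{
  unsigned long hashes[] = {7, 9, 9};
  int key_eq[] = {0, 0, 1};
  int cmps = 0;
  chain_search(hashes, key_eq, 3, 9, &cmps);
  return cmps;
}
```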

Rebuild HEAP index key from `record[0]` when the index has blob key
segments, because `Field_blob::new_key_field()` returns `Field_varstring`
(2B length + inline data) while HEAP's `hp_hashnr`/`hp_key_cmp` expect
`hp_make_key` format (4B length + data pointer).

Precompute `HP_KEYDEF::has_blob_seg` flag during table creation to avoid
per-call loop through key segments.
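The precompute can be sketched as below. `HA_BLOB_PART` and `has_blob_seg` are the names from this commit; the flag value and struct layouts are simplified assumptions for illustration:

```c
#include <stdbool.h>

#define HA_BLOB_PART 1024   /* assumed flag value for illustration */

typedef struct { unsigned flag; } HA_KEYSEG;
typedef struct
{
  HA_KEYSEG *seg;
  unsigned   keysegs;
  bool       has_blob_seg;  /* precomputed once at table creation */
} HP_KEYDEF;

/* Scan the key's segments once so hash/compare hot paths test a
   single bool instead of looping every call. */
void keydef_init_blob_flag(HP_KEYDEF *keydef)
{
  keydef->has_blob_seg = false;
  for (unsigned i = 0; i < keydef->keysegs; i++)
    if (keydef->seg[i].flag & HA_BLOB_PART)
    {
      keydef->has_blob_seg = true;
      break;
    }
}

/* Demo: a three-segment key whose middle segment is a blob part. */
bool demo_has_blob(void)
{
  HA_KEYSEG segs[3] = {{0}, {HA_BLOB_PART}, {0}};
  HP_KEYDEF kd = { segs, 3, false };
  keydef_init_blob_flag(&kd);
  return kd.has_blob_seg;
}
```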

**VARCHAR-to-BLOB promotion for HEAP engine** (`HEAP_CONVERT_IF_BIGGER_TO_BLOB = 32`):

Promote VARCHAR fields wider than 32 bytes to BLOB when creating HEAP
internal temporary tables. This avoids wasting memory on fixed-width
rows (HEAP stores VARCHAR at declared maximum width). The promotion is
transparent: the SQL layer sees VARCHAR, only the HEAP storage uses
blob continuation chains internally.

**Key changes:**

1. `sql/sql_type.cc` — `varstring_type_handler()`: thread-local
   `creating_heap_tmp_table` flag triggers `blob_type_handler()` for
   fields exceeding the HEAP threshold

2. `sql/sql_select.cc` — `finalize()`:
   - Set `key_part_flag` from `field->key_part_flag()` in GROUP BY key setup
     (was unconditionally 0, masking blob key handling)
   - `key_field_length` override for `new_key_field()` when blob
     `key_length()` returns 0
   - Skip packed rows and ensure non-zero first byte for HEAP tables
   - Skip null-bits helper key part for HEAP (handles NULLs per-segment)

3. `storage/heap/ha_heap.cc`:
   - `rebuild_blob_key()`: materializes blob data from continuation
     chains back into `record[0]` for correct hash rebuilds
   - `heap_prepare_hp_create_info()`: blob key segment setup with
     `key_part->length` widening to `max_data_length()` for DISTINCT path;
     uses `field->key_part_flag()` instead of `key_part->key_part_flag`
     to avoid corruption from uninitialized flags in SJ weedout and
     expression cache paths

4. `storage/heap/hp_hash.c`:
   - Blob-aware `hp_hashnr()`, `hp_rec_hashnr()`, `hp_key_cmp()`,
     `hp_rec_key_cmp()`, `hp_make_key()` with proper PAD SPACE handling
   - Hash pre-check (`hash_of_key`) to skip expensive blob
     materialization for non-matching hash chain entries
   - VARCHAR `hp_make_key()` rewrite: always writes 2-byte length prefix
     regardless of `bit_start` (fixes key format mismatch for promoted
     TINYBLOB with `pack_length=1`)

5. `storage/heap/hp_create.c`:
   - `HA_BLOB_PART` validation against `blob_descs` array (strips
     spurious blob flags from non-blob fields)
   - Blob segment `bit_start`/`length` normalization
   - Continuation header size enforcement for blob tables

6. `storage/heap/hp_blob.c` (new):
   - Blob continuation chain read/write/free/materialize operations
   - Run-based storage with configurable slot reuse threshold

7. `sql/sql_expression_cache.cc` — disable expression cache for HEAP
   tables with blob fields (key format incompatibility)

8. `sql/item_sum.cc`, `sql/item_func.cc` — blob-aware overflow-to-disk
   and FULLTEXT engine swap

**Test coverage:**
- `heap.heap_blob`, `heap.heap_blob_ops`, `heap.heap_blob_groupby`,
  `heap.heap_geometry`, `heap.blob_dedup` — HEAP blob functionality
- `main.sj_mat_debug`, `main.blob_sj_test` — SJ materialization
- Unit tests in `hp_test_hash-t` and `hp_test_rebuild_blob_key-t`
  covering blob hash/compare, key rebuild, key_part_flag corruption
@CLAassistant

CLAassistant commented Mar 15, 2026

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
10 out of 17 committers have signed the CLA.

✅ ayurchen
✅ dbart
✅ janlindstrom
✅ sjaakola
✅ ParadoxV5
✅ arcivanov
✅ gkodinov
✅ hadeer-r
✅ mariadb-TafzeelShams
✅ nadaelsayed11
❌ Alexey Botchkov
❌ dr-m
❌ vuvova
❌ mariadb-poojalamba
❌ midenok
❌ plampio
❌ Thirunarayanan


Alexey Botchkov seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.

@grooverdan grooverdan changed the base branch from main to 10.11 March 16, 2026 03:28
@grooverdan grooverdan requested a review from montywi March 16, 2026 03:29
@gkodinov gkodinov added the External Contribution All PRs from entities outside of MariaDB Foundation, Corporation, Codership agreements. label Mar 16, 2026