MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, …#4405
MDEV-37949: Implement innodb_log_archive, innodb_log_recovery_start, innodb_log_recovery_target, …#4405
Conversation
|
|
4a9a384 to
9d14e2c
Compare
1ab3880 to
254bfaa
Compare
b9728c2 to
091c2b3
Compare
1112487 to
998c03a
Compare
4d2e8ad to
99959b5
Compare
| } | ||
| sql_print_error("[FATAL] InnoDB: Failed to create %s of %" PRIu64 | ||
| " bytes", path.c_str(), resize_target); | ||
| abort(); |
There was a problem hiding this comment.
Please check this crash :-
# 2026-03-15T18:38:27 [1183625] | #0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
# 2026-03-15T18:38:27 [1183625] | #1 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
# 2026-03-15T18:38:27 [1183625] | #2 __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
# 2026-03-15T18:38:27 [1183625] | #3 0x0000721e7924526e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
# 2026-03-15T18:38:27 [1183625] | #4 0x0000721e792288ff in __GI_abort () at ./stdlib/abort.c:79
# 2026-03-15T18:38:27 [1183625] | #5 0x0000565800824e4f in log_t::archive_new_write (this=this@entry=0x56580204cd00 <log_sys>, buf=0x721e60c00200 "\003", buf@entry=0x721e60c00000 "\001\020\303", length=1536, length@entry=2048, offset=offset@entry=134217216) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1239
# 2026-03-15T18:38:27 [1183625] | #6 0x00005658008256f6 in log_t::write_buf<(log_t::resizing_and_latch)1> (this=0x56580204cd00 <log_sys>) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1672
# 2026-03-15T18:38:27 [1183625] | #7 log_writer () at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1819
# 2026-03-15T18:38:27 [1183625] | #8 0x00005658008227cb in log_write_up_to (lsn=lsn@entry=138401032, durable=durable@entry=true, callback=<optimized out>, callback@entry=0x721e786665f0) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1778
# 2026-03-15T18:38:27 [1183625] | #9 0x00005658009586ba in trx_flush_log_if_needed (lsn=138401032, trx=trx@entry=0x721e63202300) at /data/Server/MDEV-37949B/storage/innobase/trx/trx0trx.cc:1280
# 2026-03-15T18:38:27 [1183625] | #10 0x000056580095c381 in trx_t::commit_in_memory (mtr=0x721e78666680, this=0x721e63202300) at /data/Server/MDEV-37949B/storage/innobase/trx/trx0trx.cc:1486
# 2026-03-15T18:38:27 [1183625] | #11 trx_t::commit_persist (this=this@entry=0x721e63202300) at /data/Server/MDEV-37949B/storage/innobase/trx/trx0trx.cc:1601
# 2026-03-15T18:38:27 [1183625] | #12 0x000056580095c4bc in trx_t::commit (this=this@entry=0x721e63202300) at /data/Server/MDEV-37949B/storage/innobase/trx/trx0trx.cc:1610
# 2026-03-15T18:38:27 [1183625] | #13 0x00005658009fc06a in dict_stats_save (table=table@entry=0x6a1e44c4cb00, index_id=index_id@entry=0) at /data/Server/MDEV-37949B/storage/innobase/dict/dict0stats.cc:3046
# 2026-03-15T18:38:27 [1183625] | #14 0x000056580079ba0e in ha_innobase::statistics_init (this=this@entry=0x6a1e44cb0980, table=table@entry=0x6a1e44c4cb00, recalc=recalc@entry=true) at /data/Server/MDEV-37949B/storage/innobase/handler/ha_innodb.cc:5783
# 2026-03-15T18:38:27 [1183625] | #15 0x00005658007b7b00 in ha_innobase::reload_statistics (this=0x6a1e44cb0980) at /data/Server/MDEV-37949B/storage/innobase/handler/ha_innodb.cc:1541
# 2026-03-15T18:38:27 [1183625] | #16 0x000056580079bb13 in innodb_notify_tabledef_changed (handler=<optimized out>) at /data/Server/MDEV-37949B/storage/innobase/handler/ha_innodb.cc:1555
# 2026-03-15T18:38:27 [1183625] | #17 0x0000565800390d7a in notify_tabledef_changed (table_list=table_list@entry=0x6a1e4441b2e0) at /data/Server/MDEV-37949B/sql/sql_table.cc:7794
# 2026-03-15T18:38:27 [1183625] | #18 0x00005658003a4eec in mysql_inplace_alter_table (thd=thd@entry=0x6a1e44002c98, table_list=0x6a1e4441b2e0, table=table@entry=0x6a1e44c4c0b8, altered_table=altered_table@entry=0x721e786680e0, ha_alter_info=ha_alter_info@entry=0x721e78668030, target_mdl_request=target_mdl_request@entry=0x721e78668910, ddl_log_state=0x721e78667ff0, trigger_param=0x721e786684d0, alter_ctx=0x721e78669560, partial_alter=@0x721e78667f6e: false, start_alter_id=@0x721e78667f78: 0, if_exists=false) at /data/Server/MDEV-37949B/sql/sql_table.cc:8214
# 2026-03-15T18:38:27 [1183625] | #19 0x00005658003a8e82 in mysql_alter_table (thd=thd@entry=0x6a1e44002c98, new_db=new_db@entry=0x6a1e44007840, new_name=<optimized out>, create_info=create_info@entry=0x721e7866a400, table_list=<optimized out>, table_list@entry=0x6a1e4441b2e0, recreate_info=recreate_info@entry=0x721e7866a250, alter_info=<optimized out>, order_num=<optimized out>, order=<optimized out>, ignore=<optimized out>, if_exists=<optimized out>) at /data/Server/MDEV-37949B/sql/sql_table.cc:11621
# 2026-03-15T18:38:27 [1183625] | #20 0x0000565800419027 in Sql_cmd_alter_table::execute (this=<optimized out>, thd=0x6a1e44002c98) at /data/Server/MDEV-37949B/sql/sql_alter.cc:701
# 2026-03-15T18:38:27 [1183625] | #21 0x00005658002e2735 in mysql_execute_command (thd=thd@entry=0x6a1e44002c98, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /data/Server/MDEV-37949B/sql/sql_parse.cc:5916
# 2026-03-15T18:38:27 [1183625] | #22 0x00005658002e32c1 in mysql_parse (thd=thd@entry=0x6a1e44002c98, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x721e7866b400) at /data/Server/MDEV-37949B/sql/sql_parse.cc:7945
# 2026-03-15T18:38:27 [1183625] | #23 0x00005658002e49a4 in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x6a1e44002c98, packet=packet@entry=0x6a1e4400c189 "ALTER TABLE test.extra_t2 DROP COLUMN col2 /* E_R checkDatabaseIntegrity QNO 7 CON_ID 8 */ ", packet_length=packet_length@entry=91, blocking=blocking@entry=true) at /data/Server/MDEV-37949B/sql/sql_parse.cc:1925
# 2026-03-15T18:38:27 [1183625] | #24 0x00005658002e5ef1 in do_command (thd=thd@entry=0x6a1e44002c98, blocking=blocking@entry=true) at /data/Server/MDEV-37949B/sql/sql_parse.cc:1433
# 2026-03-15T18:38:27 [1183625] | #25 0x0000565800411d05 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x565803fba5f8, put_in_cache=put_in_cache@entry=true) at /data/Server/MDEV-37949B/sql/sql_connect.cc:1497
# 2026-03-15T18:38:27 [1183625] | #26 0x0000565800411f11 in handle_one_connection (arg=0x565803fba5f8) at /data/Server/MDEV-37949B/sql/sql_connect.cc:1409
# 2026-03-15T18:38:27 [1183625] | #27 0x0000721e7929ca94 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:447
# 2026-03-15T18:38:27 [1183625] | #28 0x0000721e79329a34 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(rr) frame 5
#5 0x0000565800824e4f in log_t::archive_new_write (this=this@entry=0x56580204cd00 <log_sys>, buf=0x721e60c00200 "\003", buf@entry=0x721e60c00000 "\001", <incomplete sequence \303>, length=1536,
length@entry=2048, offset=offset@entry=134217216) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1239
1239 abort();
(rr) info args
this = 0x56580204cd00 <log_sys>
buf = 0x721e60c00200 "\003"
length = 1536
offset = 134217216
(rr) info locals
__PRETTY_FUNCTION__ = "void log_t::archive_new_write(const byte*, size_t, lsn_t)"
path = "ib_00000000083fd000.log"
success = false
file = <optimized out>
(rr) x/64bx buf
0x721e60c00200: 0x03 0x00 0x3d 0x30 0x12 0x03 0x2e 0x38
0x721e60c00208: 0x03 0x02 0xa0 0x32 0x00 0x00 0x00 0x2e
0x721e60c00210: 0x34 0x03 0x2e 0x19 0x06 0xb3 0x2a 0x20
0x721e60c00218: 0x72 0x34 0x03 0x2e 0x43 0x02 0x34 0x03
0x721e60c00220: 0x2e 0x3f 0x03 0xa1 0x02 0xb2 0x13 0x56
0x721e60c00228: 0xc3 0x1e 0x04 0xff 0xd3 0x02 0x06 0x0b
0x721e60c00230: 0x34 0x03 0x2e 0x49 0x01 0xb7 0x00 0x00
0x721e60c00238: 0x00 0x00 0x2e 0x00 0x2c 0xd3 0x00 0x06
(rr) print length
$1 = 1536
(rr) print offset
$2 = 134217216
(rr) info registers
Missing register name
(rr) list
1234 os_file_close(file);
1235 IF_WIN(DeleteFile(path.c_str()), unlink(path.c_str()));
1236 }
1237 sql_print_error("[FATAL] InnoDB: Failed to create %s of %" PRIu64
1238 " bytes", path.c_str(), resize_target);
1239 abort();
1240 }
RR trace is present on sdp :-
/data/results/1773602901/TBR-2397-MDEV-37949
There was a problem hiding this comment.
This is not the first time you file this documented behaviour as a bug. Quoting the first comment in MDEV-37949:
What to do if space allocation fails
InnoDB recovery will read records from one or more log files. Only after the log-based recovery is completed, may InnoDB start writing log records, just like under normal operation.
It is safest to block any InnoDB writes when we fail to allocate space for a log file when innodb_log_archive=ON, and simplest to crash the server. This is very disruptive, but a situation that InnoDB would be guaranteed to recover from.
If we silently executedSET GLOBAL innodb_log_archive=OFF, that could ruin a "perfect streak" and prevent point-in-time recovery. I do not want to introduce a difficult-to-handle regression, such as the open bug MDEV-27803 which involves a failure to allocate space for the binlog.
There was a problem hiding this comment.
We have errno=EEXIST rather than ENOSPC here. That is, a log file by that name ib_00000000083fd000.log already existed. The server that was killed during workload had actually created such a file but not written anything to it:
(rr) when
Completed event: 1000230
(rr) backtrace
Thread 5 (Thread 1186009.1186110 (mariadbd)):
#0 0x0000000070000002 in syscall_traced ()
#1 0x00007ff6db96d018 in _raw_syscall () at /home/ubuntu/rr/src/preload/raw_syscall.S:120
#2 0x00007ff6db966909 in traced_raw_syscall (call=0x7ff6d83fffa0) at /home/ubuntu/rr/src/preload/syscallbuf.c:379
#3 0x00007ff6db96a0bc in sys_futex (call=<optimized out>) at /home/ubuntu/rr/src/preload/syscallbuf.c:2085
#4 syscall_hook_internal (call=call@entry=0x7ff6d83fffa0) at /home/ubuntu/rr/src/preload/syscallbuf.c:4203
#5 0x00007ff6db96cda4 in syscall_hook (call=0x7ff6d83fffa0) at /home/ubuntu/rr/src/preload/syscallbuf.c:4387
#6 syscall_hook (call=0x7ff6d83fffa0) at /home/ubuntu/rr/src/preload/syscallbuf.c:4371
#7 0x00007ff6db966323 in _syscall_hook_trampoline () at /home/ubuntu/rr/src/preload/syscall_hook.S:308
#8 0x00007ff6db96638d in __morestack () at /home/ubuntu/rr/src/preload/syscall_hook.S:443
#9 0x00007ff6db9663a9 in _syscall_hook_trampoline_48_3d_00_f0_ff_ff () at /home/ubuntu/rr/src/preload/syscall_hook.S:462
#10 0x00007ff6db11c689 in fallocate64 (fd=fd@entry=0x5d, mode=mode@entry=0x0, offset=offset@entry=0x0, len=len@entry=0x80000000) at ../sysdeps/unix/sysv/linux/fallocate64.c:27
#11 0x00005b2346705af4 in os_file_set_size (name=0x77f6b803baf0 "ib_00000000083fd000.log", file=file@entry=0x5d, size=0x80000000, is_sparse=is_sparse@entry=0x0) at /data/Server/MDEV-37949B/storage/innobase/os/os0file.cc:1557
#12 0x00005b23466bcdd5 in log_t::archive_new_write (this=this@entry=0x5b2347ee4d00 <log_sys>, buf=0x77f6be40a800 "\275", buf@entry=0x77f6be400000 "?\200\006\032\300\024\203G-0", length=0x5aa00, length@entry=0x65200, offset=offset@entry=0x7ff5800) at /usr/include/c++/13/bits/basic_string.h:222
#13 0x00005b23466bd6f6 in log_t::write_buf<(log_t::resizing_and_latch)1> (this=0x5b2347ee4d00 <log_sys>) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1672
#14 log_writer () at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1819
#15 0x00005b23466ba7cb in log_write_up_to (lsn=0x8457923, lsn@entry=0x8426f3c, durable=durable@entry=0x1, callback=<optimized out>, callback@entry=0x0) at /data/Server/MDEV-37949B/storage/innobase/log/log0log.cc:1778
#16 0x00005b23468564dc in buf_page_t::flush (this=this@entry=0x77f6c0009c00, space=space@entry=0x77f6b801fac0) at /data/Server/MDEV-37949B/storage/innobase/buf/buf0flu.cc:855
#17 0x00005b2346858de3 in buf_do_flush_list_batch (max_n=max_n@entry=0x60, lsn=lsn@entry=0xffffffffffffffff) at /data/Server/MDEV-37949B/storage/innobase/buf/buf0flu.cc:1538
#18 0x00005b234685917b in buf_flush_list_holding_mutex (max_n=0x60, lsn=lsn@entry=0xffffffffffffffff) at /data/Server/MDEV-37949B/storage/innobase/buf/buf0flu.cc:1595
#19 0x00005b234685a36a in buf_flush_page_cleaner () at /data/Server/MDEV-37949B/storage/innobase/buf/buf0flu.cc:2875
The simulated fallocate(2) system call is failing several times with EINTR, and os_file_set_size() is retrying, all the time until the SIGKILL arrives. It looks like the file size would have remained at 0 bytes. I confirmed that with the following command:
tar tvJf /data/results/1773602901/TBR-2397-MDEV-37949/archive.tar.xz|grep ib_00000000083fd000.log-rw-rw---- saahil/dev 0 2026-03-15 18:37 ./1/fbackup/data/ib_00000000083fd000.log
-rw-rw---- saahil/dev 0 2026-03-15 18:37 ./1/data/ib_00000000083fd000.log
We must allow the last log file to be overwritten, in case the server had previously been killed before the file had been extended to a valid size.
There was a problem hiding this comment.
I think that the proper way to recover from this is that when the last archive file is below the minimum size (4 MiB), on the first write that goes beyond the current archive file, we will extend the invalid file to the current innodb_log_file_size. I can reproduce this failure with the following test:
diff --git a/mysql-test/suite/innodb/t/log_archive.test b/mysql-test/suite/innodb/t/log_archive.test
index 3f1781e74ef..d36b8c30453 100644
--- a/mysql-test/suite/innodb/t/log_archive.test
+++ b/mysql-test/suite/innodb/t/log_archive.test
@@ -195,6 +195,9 @@ print FILE $FILE_MODIFY;
close(FILE) or die "Unable to close $file\n";
EOF
+--write_file $DATADIR/ib_ffffffffffc00000.log
+EOF
+
--let $restart_parameters= --innodb-log-recovery-start=0 --innodb-log-buffer-size=5m
--source include/start_mysqld.inc
2026-03-16 11:15:52 0 [Warning] InnoDB: ignoring .//ib_ffffffffffc00000.log
2026-03-16 11:15:52 0 [ERROR] [FATAL] InnoDB: Failed to create ib_ffffffffffc00000.log of 4194304 bytes
There was a problem hiding this comment.
It turns out that my test for this is crashing on FreeBSD. I will analyze and fix that tomorrow.
|
|
||
| fil_flush_file_spaces(); | ||
|
|
||
| if (UNIV_UNLIKELY(recv_sys.rpo != 0)) |
There was a problem hiding this comment.
can we check this one at the start of function?
There was a problem hiding this comment.
Can you document the complete lifecycle of recv_sys.rpo? When is it set, when is it cleared, and what operations are forbidden in this mode?
There was a problem hiding this comment.
RPO is an abbreviation of "recovery point objective". The related parameter is read-only:
static MYSQL_SYSVAR_UINT64_T(log_recovery_target, recv_sys.rpo,
PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
"recovery point objective (end LSN; 0=unlimited)",
nullptr, nullptr, 0, 0, std::numeric_limits<uint64_t>::max(), 0);The purpose of this parameter is to make the server read-only, with the only exception that the log will be applied to the data files.
Description
MDEV-37949: Implement
innodb_log_archiveThe InnoDB write-ahead log file in the old
innodb_log_archive=OFFformat is namedib_logfile0, pre-allocated toinnodb_log_file_sizeand written as a ring buffer. This is good for write performance and space management, but unsuitable for arbitrary point-in-time recovery or for facilitating efficient incremental backup.innodb_log_archive=ON: A new format where InnoDB will create and preallocate filesib_%016x.log, instead of writing a circular fileib_logfile0. Each file will be pre-allocated toinnodb_log_file_size(between 4M and 4G; we impose a stricter upper limit of 4 GiB forinnodb_log_archive=ON). Once a log fills up, we will create and pre-allocate another log file, to which log records will be written. Upon the completion of the first log checkpoint in a recently created log file, the old log file will be marked read-only, signaling that there will be no further writes to that file, and that the file may safely be moved to long-term storage.The file name includes the log sequence number (LSN) at file offset 12288 (
log_t::START_OFFSET). Limiting the file size to 4 GiB allows us to identify each checkpoint by storing a 32-bit big-endian offset into the optionalFILE_MODIFYand the mandatoryFILE_CHECKPOINTrecords, between 12288 and the end of the file.The
innodb_encrypt_logformat is identified by storing the encryption information at the start of the log file. The first 32-bit value will be 1, which is an invalid checkpoint offset. Eachinnodb_log_archive=ONlog must use the same encryption parameters. Changinginnodb_encrypt_logor related parameters is only possible by settinginnodb_log_archive=OFFand restarting the server, which will permanently lose the history of the archived log.The maximum number of log checkpoints that the
innodb_log_archive=ONfile header can represent is limited to 12288/4=3072 when usinginnodb_encrypt_log=OFF. If we run out of slots in a log file, each subsequently completed checkpoint in that log file will overwrite the last slot in the checkpoint header, until we switch to the next log.innodb_log_recovery_start: The checkpoint LSN to start recovery from. This will be useful when recovering from an archived log. This is useful for restoring an incremental backup (applying InnoDB log files that were copied since the previous restore).innodb_log_recovery_target: The requested LSN to end recovery at. When this is set, all persistent InnoDB tables will be read-only, and no writes to the log are allowed. The intended purpose of this setting is to prepare an incremental backup, as well as to allow data retrieval as of a particular logical point of time.Setting
innodb_log_recovery_target>0 is much like settinginnodb_read_only=ON, with the exception that the data files may be written to by crash recovery, and locking reads will conflict with any incomplete transactions as necessary, and all transaction isolation levels will work normally (not hard-wired toREAD UNCOMMITTED).The status variable
innodb_lsn_archivedwill reflect the LSN since when a complete InnoDB log archive is available. Its initial value will be that of the new parameterinnodb_log_archive_start. If that variable is 0 (the default), theinnodb_lsn_archivedwill be recovered from the available log files. Ifinnodb_log_archive=OFF,innodb_lsn_archivedwill be adjusted to the latest checkpoint every time a log checkpoint is executed. Ifinnodb_log_archive=ON, the valueshould not change.
Statements like
will take effect as soon as possible, possibly after a log checkpoint has been completed. The log file will be renamed between
ib_logfile0andib_%016x.logas appropriate.When
innodb_log_archive=ON, the settingSET GLOBAL innodb_log_file_sizewill affect subsequently created log files when the file that is being currently written is running out. If we are switching log files exactly at the same time, then a somewhat misleading error message "innodb_log_file_size change is already in progress" will be issued.When
innodb_read_only=ON,innodb_log_recovery_targetwill be set to the current LSN. This ensures that it suffices to check only one of these variables when blocking writes to persistent tables.See the commit message for a more detailed description of the changed data structures and functions.
Release Notes
See the previous section.
How can this PR be tested?
The
mariabackupsuite must be skipped when settinginnodb_log_archive=ON, because themariadb-backuptool will only support the oldinnodb_log_archive=OFFformat (copying fromib_logfile0).Unfortunately, all
--suite=encryptiontests that useinnodb_encrypt_logmust be skipped when usinginnodb_log_archive. This is because the server would have to be reinitialized; we do not allow changing the format of an archived log on startup (such as adding or removing encryption). This combination is covered by the testinnodb.log_file_size_online,encrypted.no_checkpoint_prepare.inc: A new file, to prepare for subsequent inclusion ofno_checkpoint_end.inc. We will invoke the server to parse the log and to determine the latest checkpoint.A number of tests that would fail when the parameter
innodb_log_recovery_start=12288is present, which is forcingrecovery to start from the beginning of the history (the database creation), have been adjusted with
explicit
--innodb-log-recovery-start=0to override that:innodb_log_recovery_start=0.innodb_undo_tablespaceswould fail in crash recovery, because crash recovery assumes that the undo tablespace ID that is available from the undo* files corresponds with the start of the log. This is an unforunate design bug which we cannot fix easily.Basing the PR against the correct MariaDB version
mainbranch.This is a new feature, but for now based on the 11.4 branch so that any unrelated errors that may be found during testing can be fixed rather quickly. Merges to the
mainbranch may be blocked for weeks at a time.PR quality check