Skip to content

Fix resumable binlog position after interrupt#168

Merged
fjordan merged 5 commits intomasterfrom
feature/fix-resuming-binlog-position
Apr 14, 2020
Merged

Fix resumable binlog position after interrupt#168
fjordan merged 5 commits intomasterfrom
feature/fix-resuming-binlog-position

Conversation

@fjordan
Copy link
Copy Markdown
Contributor

@fjordan fjordan commented Apr 8, 2020

Alternative to #160 (and also makes use of the same added test with slight modifications).

As described in the other PR, attempting to resume Ghostferry from an invalid binlog position will result in an error as the preceding TableMapEvent is required to properly translate RowsEvents.

This adds a resumable position and interface to the DMLEvent that the InlineVerifier and BinlogWriter persist during interruption to be reused during resume.

The last resumable position is stored as a member of the BinlogStreamer and is periodically updated as the BinlogStreamer sees new resumable events (GTIDEvents and XIDEvents). This position is then sent along with the corresponding DMLEvent. The resumable position is made available on the DMLEvents using a new interface method - ResumableBinlogPosition.

/cc @Shopify/pods

@fjordan fjordan force-pushed the feature/fix-resuming-binlog-position branch 2 times, most recently from 2f562cb to b3cd656 Compare April 8, 2020 20:36
Comment thread binlog_streamer.go
Copy link
Copy Markdown
Contributor

@hkdsun hkdsun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love how simply you solved this ❤️ couple comments and then good-to-go I think

And excellent test case by @kolbitsch-lastline, thank you for the contribution!

Comment thread binlog_streamer.go Outdated
}

dmlEvs, err := NewBinlogDMLEvents(table, ev, pos)
s.lastResumableBinlogPosition.Name = s.lastStreamedBinlogPosition.Name
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, what? I don't think this should be legal 😄 jokes aside, mutating any binlog position sounds like a recipe for bugs down the line.. it's a bit like randomizing the coordinates you put in a GPS.. you might end up in middle of nowhere!

Would a test fail if I removed this line?

Copy link
Copy Markdown
Contributor Author

@fjordan fjordan Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't recommend removing it 😄.

Instead of just duplicating all the places where lastStreamedBinlogPosition is set to also set this value, we just reuse the value like is already being done for the pos value in line 262.

Do you think it makes more sense to put it closer to the pos assignment? I don't think it would make much difference.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see what's going on. This could use some cleanup, I'll PR something based off your branch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up moving this assignment to the RotateEvent and in the initial connect further up. That allows us to remove it from this line here.

Comment thread binlog_streamer.go
// so there's no way to handle this for us.
continue
case *replication.XIDEvent, *replication.GTIDEvent:
s.lastResumableBinlogPosition.Pos = uint32(ev.Header.LogPos)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document in code comments what got us to this? 🙏

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can do 👍

@fjordan fjordan force-pushed the feature/fix-resuming-binlog-position branch from 2bb8be2 to 2df4712 Compare April 13, 2020 19:53
@fjordan fjordan force-pushed the feature/fix-resuming-binlog-position branch from 2df4712 to 38f9c93 Compare April 13, 2020 19:54
Copy link
Copy Markdown
Contributor

@hkdsun hkdsun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's get Shuhao's +1 and 🚢

Comment thread binlog_streamer.go
//
// As a result, the following case will set the last resumable position for
// interruption to EITHER the start (if using GTIDs) or the end of the
// last transaction
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succinct explanation. Thank you!

Comment thread binlog_streamer.go
binlogStreamer *replication.BinlogStreamer
lastStreamedBinlogPosition mysql.Position
lastResumableBinlogPosition mysql.Position
targetBinlogPosition mysql.Position
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to do this now, but this was brought up the other day: Since we will eventually start streaming binlog from the target for verification, maybe we should rename this to "stopAtBinlogPosition", or something to that effect.

I just wanted to mention this so we don't forget about it.


ghostferry.on_status(Ghostferry::Status::AFTER_BINLOG_APPLY) do
# while we are emitting events in the loop above, try to inject a shutdown
# signal, hoping to interrupt between applying an INSERT and receiving the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this test will be known to racy? Will it make CI more flaky?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has been consistently passing in my experience and hasn’t been flaky


logrus.SetLevel(logrus.DebugLevel)
if os.Getenv("CI") == "true" {
logrus.SetLevel(logrus.ErrorLevel)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remind me what this is for? I don't recall anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log output from the binlog events is seemingly causing CI to truncate the output. This just lowers the logging level unless needed for the tests

Comment thread binlog_streamer.go
case *replication.RotateEvent:
// This event is needed because we need to update the last successful
// binlog position.
s.lastResumableBinlogPosition.Pos = uint32(e.Position)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we need to set the .Name here as well? This is a RotateEvent so the file should change, right? That would get rid of the slightly cryptic s.lastResumableBinlogPosition.Name line below?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on our conversation, we decided to set the name here and in the ConnectBinlogStreamerToMysqlFrom above 👍

@fjordan fjordan force-pushed the feature/fix-resuming-binlog-position branch from c2a79fd to d2dc8cc Compare April 13, 2020 22:18
Comment thread binlog_streamer.go
// This event is needed because we need to update the last successful
// binlog position.
s.lastResumableBinlogPosition.Pos = uint32(e.Position)
s.lastResumableBinlogPosition.Name = string(e.NextLogName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, I thought we agreed we can't resume from a RotateEvent's position?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place where MySQL replication gives us the file name of the binlog. This value is therefore cached until the next Rotate event, which would give us the next file name. So we have to set this filename here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, why can't we use a separate currentFilename variable to cache the current filename?

the current filename needs a clear separation from the resumable position.

@fjordan fjordan merged commit 509aeda into master Apr 14, 2020
@fjordan fjordan deleted the feature/fix-resuming-binlog-position branch April 14, 2020 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants