mirror of https://git.openafs.org/openafs.git (synced 2025-01-31 05:27:44 +00:00)
ubik: Try to detect VOTE_Beacon errors
Currently the way ubik dbsites vote for each other is via the "return value" of the Beacon VOTE RPC. Since this is really an Rx abort, this can easily collide with actual errors on the wire, such as rxkad errors.

Try to detect these by detecting vote times that are very different than the current timestamp (more than an hour in the future or past), and treat it like a network error.

If we do not do this, a single site reporting an error can cause us to never reach quorum, since we calculate our sync site expiration based on the oldest 'yes' vote, which for most known Rx aborts will be far in the past.

Reviewed-on: http://gerrit.openafs.org/8486
Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
(cherry picked from commit 4d4668b1618a2bd5b94ed4620464787f42d11cab)

Change-Id: Iaca12506a35e924631754b638f99cb12faa84479
Reviewed-on: http://gerrit.openafs.org/8946
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
Reviewed-by: Simon Wilkinson <simonxwilkinson@gmail.com>
Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
parent 538fdd3860
commit a369628d2c
@@ -428,6 +428,22 @@ ubeacon_Interact(void *dummy)
     ts = servers[multi_i];
     ts->lastBeaconSent = temp;
     code = multi_error;
+
+    if (code > 0 && ((code < temp && code < temp - 3600) ||
+                     (code > temp && code > temp + 3600))) {
+        /* if we reached here, supposedly the remote host voted
+         * for us based on a computation from over an hour ago in
+         * the past, or over an hour in the future. this is
+         * unlikely; what actually probably happened is that the
+         * call generated some error and was aborted. this can
+         * happen due to errors with the rx security class in play
+         * (rxkad, rxgk, etc). treat the host as if we got a
+         * timeout, since this is not a valid vote. */
+        ubik_print("assuming distant vote time %d from %s is an error; marking host down\n",
+                   (int)code, afs_inet_ntoa_r(ts->addr[0], hoststr));
+        code = -1;
+    }
+
     /* note that the vote time (the return code) represents the time
      * the vote was computed, *not* the time the vote expires. We compute
      * the latter down below if we got enough votes to go with */
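To make the effect of this hunk easier to see in isolation, here is a minimal standalone sketch of the same window check, plus a simplified "oldest yes vote" scan. It is not code from the OpenAFS tree: filter_vote_time, oldest_yes_vote and VOTE_WINDOW are hypothetical names, plain long stands in for the real integer types, and the quorum bookkeeping is reduced to taking the smallest positive return code.

```c
#include <stddef.h>

/* Window taken from the hunk: one hour either side of the local clock. */
#define VOTE_WINDOW 3600L

/* Hypothetical helper (not in the OpenAFS tree): returns -1 when a positive
 * return code is too far from `now` to be a believable vote time, and the
 * code unchanged otherwise. Same outcome as the condition in the hunk. */
static long
filter_vote_time(long code, long now)
{
    if (code > 0 && (code < now - VOTE_WINDOW || code > now + VOTE_WINDOW))
        return -1;
    return code;
}

/* Hypothetical simplification of the quorum bookkeeping: the oldest 'yes'
 * vote is the smallest positive code seen, starting from `now`. When
 * `filter` is nonzero, implausible vote times are discarded first. */
static long
oldest_yes_vote(const long *codes, size_t n, long now, int filter)
{
    long oldest = now;
    size_t i;

    for (i = 0; i < n; i++) {
        long c = filter ? filter_vote_time(codes[i], now) : codes[i];
        if (c > 0 && c < oldest)
            oldest = c;
    }
    return oldest;
}
```

As an illustration, take an arbitrary local clock of 1359590400 and return codes { now - 2, 19270409, now - 1 }, where 19270409 is assumed here as a stand-in for an abort code in the rxkad error range. Without the filter, the oldest 'yes' vote comes out as 19270409, a timestamp in 1970, so any expiration derived from it is already long past. With the filter, that code becomes -1 and the oldest 'yes' vote is now - 2.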