ubik: Try to detect VOTE_Beacon errors

Currently the way ubik dbsites vote for each other is via the "return
value" of the Beacon VOTE RPC. Since this is really an Rx abort, this
can easily collide with actual errors on the wire, such as rxkad
errors.

Try to detect these by looking for vote times that differ greatly from
the current timestamp (more than an hour in the future or past), and
treat such votes like network errors.

If we do not do this, a single site reporting an error can cause us to
never reach quorum, since we calculate our sync site expiration based
on the oldest 'yes' vote, which for most known Rx aborts will be far
in the past.

Reviewed-on: http://gerrit.openafs.org/8486
Reviewed-by: Jeffrey Altman <jaltman@your-file-system.com>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
(cherry picked from commit 4d4668b1618a2bd5b94ed4620464787f42d11cab)

Change-Id: Iaca12506a35e924631754b638f99cb12faa84479
Reviewed-on: http://gerrit.openafs.org/8946
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Derrick Brashear <shadow@your-file-system.com>
Reviewed-by: Simon Wilkinson <simonxwilkinson@gmail.com>
Reviewed-by: Stephan Wiesand <stephan.wiesand@desy.de>
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Andrew Deason 2012-11-20 14:18:47 -06:00 committed by Stephan Wiesand
parent 538fdd3860
commit a369628d2c


@@ -428,6 +428,22 @@ ubeacon_Interact(void *dummy)
    ts = servers[multi_i];
    ts->lastBeaconSent = temp;
    code = multi_error;
    if (code > 0 && ((code < temp && code < temp - 3600) ||
                     (code > temp && code > temp + 3600))) {
        /* if we reached here, supposedly the remote host voted
         * for us based on a computation from over an hour ago in
         * the past, or over an hour in the future. this is
         * unlikely; what actually probably happened is that the
         * call generated some error and was aborted. this can
         * happen due to errors with the rx security class in play
         * (rxkad, rxgk, etc). treat the host as if we got a
         * timeout, since this is not a valid vote. */
        ubik_print("assuming distant vote time %d from %s is an error; marking host down\n",
                   (int)code, afs_inet_ntoa_r(ts->addr[0], hoststr));
        code = -1;
    }
    /* note that the vote time (the return code) represents the time
     * the vote was computed, *not* the time the vote expires. We compute
     * the latter down below if we got enough votes to go with */