mirror of
https://git.openafs.org/openafs.git
synced 2025-01-31 13:38:01 +00:00
LINUX: Block non-fatal signals when sleeping
Currently, when sleeping on LINUX, we either block all signals (afs_osi_Sleep), or do not block signals at all (afs_osi_SleepSig). The only caller of afs_osi_SleepSig() is afs_read(), with the intention that a user can interrupt the process while it's waiting for data from an unresponsive server. This can cause problems when afs_read() is called during a paging request. In this case, if we are interrupted by a signal, we'll return an error (EINTR) from afs_linux_readpage(), which causes a SIGBUS to be sent to the process if it's not already dying. If we're interrupted by a fatal signal, the process is already dying so this is fine. But if we're interrupted by a non-fatal signal with an installed signal handler, the SIGBUS almost always causes the process to die immediately (and maybe dump core), where it otherwise would have been fine. This can be very confusing to a user when it happens, since it's not immediately obvious that AFS was involved at all; the dumped core often just shows the SIGBUS generated during a mundane memory load or store. This situation is most easily seen when running golang out of /afs, but has also been seen with git and other programs. Anything that makes heavy use of signals while data is being fetched from /afs is likely to trigger the behavior. This problem in general may not be specific to Linux, since the relevant code path is in cross-platform code (afs_read -> afs_osi_SleepSig). But notably, this does not happen on Solaris[1]; other platforms may have their own quirks that prevent this from becoming a problem. To avoid this for Linux, block all signals except SIGKILL when we're sleeping via afs_osi_SleepSig(). This allows the process to be killed if needed, but prevents interruption from any non-fatal signal. Ideally we would put our process in TASK_KILLABLE state, using functions like wait_event_killable() or wait_event_state() instead of blocking signals. But these functions have evolved over time, making this approach complex or even impossible for various Linux versions in our current design. Future commits may improve this; for now, do the simpler fix and just block signals. We could theoretically still allow non-fatal signals to interrupt sleeps when called from a syscall like read(). But this is difficult with the current structure of our Linux integration (syscall i/o is implemented on top of the paging system, like most Linux filesystems), and other filesystems tend to not do this. Also, interrupting a stalled afs_read() in general doesn't currently work very well anyway. The only callers of afs_osi_SleepSig() look like this, to wait for a background fetch (BOP_FETCH) to complete: while (!code && tdc->mflags & DFFetchReq) { /* other locks, etc */ ReleaseReadLock(&tdc->lock); code = afs_osi_SleepSig(&tdc->validPos); ObtainReadLock(&tdc->lock); } A signal will cause afs_osi_SleepSig() to return, but then we must wait to get tdc->lock. The background BOP_FETCH operation will keep tdc->lock write-locked for the entire fetch from the fileserver (afs_CacheFetchProc()), so we won't be able to continue until the fetch completes. Future commits may improve this, but for now just avoid the unnecessary SIGBUS errors. [1] On Solaris, the equivalent code path (afs_GetOnePage()) does not go through afs_read() with noLock==0, but instead calls afs_GetDCache() to handle fetching data. It does call afs_ustrategy() to fill a page, which does technically call afs_read(), but with noLock==1, and so avoids the case where we submit a BOP_FETCH and wait for it. Fetching data from the network happens in the afs_GetDCache() call, and so does not use afs_osi_SleepSig(). Change-Id: Ic9c0c35bd1a2a8fd1901d91dae10cb7719194d25 Reviewed-on: https://gerrit.openafs.org/15637 Reviewed-by: Cheyenne Wills <cwills@sinenomine.net> Reviewed-by: Mark Vitale <mvitale@sinenomine.net> Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
This commit is contained in:
parent
4dfae3fe35
commit
addee42959
@ -132,7 +132,7 @@ afs_addevent(char *event)
|
|||||||
#define relevent(evp) ((evp)->refcount--)
|
#define relevent(evp) ((evp)->refcount--)
|
||||||
|
|
||||||
static int
|
static int
|
||||||
afs_linux_sleep(void *event, int aintok)
|
afs_linux_sleep(void *event, int killable)
|
||||||
{
|
{
|
||||||
struct afs_event *evp;
|
struct afs_event *evp;
|
||||||
int seq;
|
int seq;
|
||||||
@ -151,22 +151,22 @@ afs_linux_sleep(void *event, int aintok)
|
|||||||
|
|
||||||
AFS_GUNLOCK();
|
AFS_GUNLOCK();
|
||||||
|
|
||||||
if (!aintok) {
|
|
||||||
SIG_LOCK(current);
|
SIG_LOCK(current);
|
||||||
saved_set = current->blocked;
|
saved_set = current->blocked;
|
||||||
|
if (killable) {
|
||||||
|
siginitsetinv(¤t->blocked, sigmask(SIGKILL));
|
||||||
|
} else {
|
||||||
sigfillset(¤t->blocked);
|
sigfillset(¤t->blocked);
|
||||||
|
}
|
||||||
RECALC_SIGPENDING(current);
|
RECALC_SIGPENDING(current);
|
||||||
SIG_UNLOCK(current);
|
SIG_UNLOCK(current);
|
||||||
}
|
|
||||||
|
|
||||||
code = wait_event_freezable(evp->cond, seq != evp->seq);
|
code = wait_event_freezable(evp->cond, seq != evp->seq);
|
||||||
|
|
||||||
if (!aintok) {
|
|
||||||
SIG_LOCK(current);
|
SIG_LOCK(current);
|
||||||
current->blocked = saved_set;
|
current->blocked = saved_set;
|
||||||
RECALC_SIGPENDING(current);
|
RECALC_SIGPENDING(current);
|
||||||
SIG_UNLOCK(current);
|
SIG_UNLOCK(current);
|
||||||
}
|
|
||||||
|
|
||||||
AFS_GLOCK();
|
AFS_GLOCK();
|
||||||
|
|
||||||
|
Loading…
x
Reference in New Issue
Block a user