mirror of
https://git.openafs.org/openafs.git
synced 2025-01-31 05:27:44 +00:00
util: Clear owner when unlocking recursive mutex
A race condition exists where the pthread_recursive_mutex_t::owner that is maintained by AFS does not match the thread that is trying to unlock the mutex. This causes the AFS fileserver and ptserver to crash with an assertion failure while unlocking the grmutex.

We saw the race more often after our customer migrated their machines from Power8 to Power9 systems and increased the SMT value from 2 to 4.

    fileserver  Assertion failed! file keys.c, line 911.
    ptserver    Assertion failed! file userok.c, line 78.

File: keys.c

    889 int
    890 afsconf_GetKeyByTypes(struct afsconf_dir *dir, afsconf_keyType type,
    891                       int kvno, int subType,struct afsconf_typedKey **key)
    892 {
    893     int code = 0;
    894     struct subTypeList *subTypeEntry;
    895
    896     LOCK_GLOBAL_MUTEX;
    897 …
    910 out:
    911     UNLOCK_GLOBAL_MUTEX; <<<<
    912     return code;
    913 }

Consider the following situation, where cpu0 and cpu1 are the CPUs and T0 through T5 are timestamps:

T0: thread1 locks grmutex, performs some operations and unlocks it, leaving itself set as pthread_recursive_mutex_t::owner. Since we presently do not reset the owner on unlock, pthread_recursive_mutex_t::owner = thread1.

T1: thread0 starts on cpu0.

T2: thread1 starts on cpu1.

T3: thread0 tries to lock the AFS grmutex and acquires the corresponding pthread_mutex. Before thread0 updates pthread_recursive_mutex_t::owner, a context switch happens.

T3: thread1 on cpu1 tries to acquire grmutex and sees itself as the pthread_recursive_mutex_t::owner, possibly because the owner was not reset and has not been updated yet. So thread1 believes it is the owner and proceeds.

T4: thread0 updates pthread_recursive_mutex_t::owner; this time the value is also synced across the CPU caches.

T5: thread1 tries to unlock the grmutex and crashes because it is no longer the owner of the mutex.

Debugging:

We implemented a circular log to store certain values related to grmutex, which helped us debug this further.
    ({ \
        time_t t; \
        time(&t); \
        LOG_EVENT("%s: Unlocking TID %u: %s:%d owner %lu " \
                  "locked %d pthread_self %u times_inside %d\n", \
                  ctime(&t), (unsigned)grmutex.mut.__data.__owner,\
                  __func__ , __LINE__, \
                  grmutex.owner, grmutex.locked, (unsigned)pthread_self(), \
                  grmutex.times_inside); \
        opr_Verify(pthread_recursive_mutex_unlock(&grmutex)==0); \
    })

    $614 = "Mon Sep 11 19:35:34 2023\n: Locking TID 136896: afsconf_GetKeyByTypes:896 owner 140735030161776 locked 1 pthread_self 2305880432 times_inside 1\n\000 2\n",
    $615 = "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896: afsconf_IsLocalRealmMatch:602 owner 140735030161776 locked 1 pthread_self 1836773744 times_inside 2\n",
    $617 = "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896: afsconf_GetKeyByTypes:911 owner 140735030161776 locked 1 pthread_self 2305880432 times_inside 1\n\000\061\n",

Solution:

This problem was resolved by resetting pthread_recursive_mutex_t::owner in the global mutex unlock function.

Thanks to Todd DeSantis for helping with debugging, review and verification of this problem.

Signed-off-by: Indira Sawant <indira.sawant@ibm.com>
Reviewed-on: https://gerrit.openafs.org/15604
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
(cherry picked from commit e4fda3481dc9ec651377493afbc95bd40f4f1fb2)

Change-Id: I400892121d1b1f63adcd6848e774ede1c4ec5da9
Reviewed-on: https://gerrit.openafs.org/15609
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Mark Vitale <mvitale@sinenomine.net>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
parent 6edf9d350c
commit 1f63ffef47
@@ -74,6 +74,7 @@ pthread_recursive_mutex_unlock(pthread_recursive_mutex_t * mut)
         mut->times_inside--;
         if (mut->times_inside == 0) {
             mut->locked = 0;
+            mut->owner = 0;
             rc = pthread_mutex_unlock(&mut->mut);
         }
     } else {
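To make the effect of the one-line change easier to see outside the OpenAFS tree, here is a minimal sketch of a recursive mutex wrapper with the same shape as the hunk above. The names `rmutex_t`, `rmutex_lock` and `rmutex_unlock` are hypothetical and are not OpenAFS identifiers. Only the field names (`mut`, `owner`, `locked`, `times_inside`) and the body of the unlock path come from the diff; the lock path is an assumption reconstructed from the commit message. Like the code it imitates, the sketch reads `locked` and `owner` without holding the mutex, so it is an illustration of why a stale owner is dangerous, not a fully race-free design.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for pthread_recursive_mutex_t; field names follow the diff. */
typedef struct {
    pthread_mutex_t mut;   /* underlying non-recursive mutex */
    pthread_t owner;       /* thread currently holding the mutex, 0 when unowned */
    int locked;            /* nonzero while some thread holds the mutex */
    int times_inside;      /* recursion depth of the owner */
} rmutex_t;

/* Assumed lock path: not shown in the diff, reconstructed for illustration. */
int rmutex_lock(rmutex_t *m)
{
    int rc;

    /* Unsynchronized re-entry check. If owner still holds the ID of the
     * previous holder, that thread can wrongly take this branch while
     * another thread is in the middle of acquiring the mutex. */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside++;
        return 0;
    }

    rc = pthread_mutex_lock(&m->mut);
    if (rc == 0) {
        m->times_inside = 1;
        m->owner = pthread_self();
        m->locked = 1;
    }
    return rc;
}

/* Unlock path mirroring the patched hunk. */
int rmutex_unlock(rmutex_t *m)
{
    int rc = 0;

    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside--;
        if (m->times_inside == 0) {
            m->locked = 0;
            m->owner = 0;  /* the fix: do not leave a stale owner behind */
            rc = pthread_mutex_unlock(&m->mut);
        }
    } else {
        rc = -1;           /* caller is not the owner */
    }
    return rc;
}
```

A caller would initialize the structure with `{ PTHREAD_MUTEX_INITIALIZER, 0, 0, 0 }`, pair every `rmutex_lock` with one `rmutex_unlock`, and treat a nonzero return from `rmutex_unlock` as the condition that `opr_Verify` turns into the assertion failure described in the commit message. Assigning 0 to a `pthread_t` follows the diff; it compiles on platforms where `pthread_t` is an integer or pointer type, which is an assumption of this sketch.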