Mirror of https://git.openafs.org/openafs.git, synced 2025-01-18 23:10:58 +00:00
e4fda3481d
There is a race condition in which the pthread_recursive_mutex_t::owner maintained by AFS does not match the thread that is trying to unlock. This crashes the AFS fileserver and ptserver with an assertion failure while they are trying to unlock the grmutex.

We saw the race more often after our customer migrated their machines from Power8 to Power9 systems and increased the SMT value from 2 to 4.

    fileserver Assertion failed! file keys.c, line 911.
    ptserver Assertion failed! file userok.c, line 78.

File: keys.c

    889 int
    890 afsconf_GetKeyByTypes(struct afsconf_dir *dir, afsconf_keyType type,
    891                       int kvno, int subType, struct afsconf_typedKey **key)
    892 {
    893     int code = 0;
    894     struct subTypeList *subTypeEntry;
    895
    896     LOCK_GLOBAL_MUTEX;
    897     …
    910 out:
    911     UNLOCK_GLOBAL_MUTEX; <<<<
    912     return code;
    913 }

Consider the following situation, where cpu0 and cpu1 are the CPUs and the T values are timestamps:

T0: thread1 locks grmutex, performs some operations and unlocks it, leaving itself set as pthread_recursive_mutex_t::owner. Since we presently do not reset it, pthread_recursive_mutex_t::owner = thread1.

T1: thread0 starts on cpu0.

T2: thread1 starts on cpu1.

T3: thread0 tries to lock the AFS grmutex and acquires the corresponding pthread_mutex. Before thread0 updates pthread_recursive_mutex_t::owner, a context switch happens.

T3: thread1 on cpu1 tries to acquire grmutex and sees itself as the pthread_recursive_mutex_t::owner, possibly because the field was not reset and has not been updated yet. So thread1 believes it is the owner and proceeds.

T4: thread0 updates pthread_recursive_mutex_t::owner; this time the update is also synced across the CPU caches.

T5: thread1 tries to unlock the grmutex and crashes because it is now not the owner of the mutex.

Debugging:

We implemented a circular log to store certain values related to grmutex, which helped us debug this further.

    ({ \
        time_t t; \
        time(&t); \
        LOG_EVENT("%s: Unlocking TID %u: %s:%d owner %lu " \
                  "locked %d pthread_self %u times_inside %d\n", \
                  ctime(&t), (unsigned)grmutex.mut.__data.__owner,\
                  __func__ , __LINE__, \
                  grmutex.owner, grmutex.locked, (unsigned)pthread_self(), \
                  grmutex.times_inside); \
        opr_Verify(pthread_recursive_mutex_unlock(&grmutex)==0); \
    })

    $614 = "Mon Sep 11 19:35:34 2023\n: Locking TID 136896: afsconf_GetKeyByTypes:896 owner 140735030161776 locked 1 pthread_self 2305880432 times_inside 1\n\000 2\n",
    $615 = "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896: afsconf_IsLocalRealmMatch:602 owner 140735030161776 locked 1 pthread_self 1836773744 times_inside 2\n",
    $617 = "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896: afsconf_GetKeyByTypes:911 owner 140735030161776 locked 1 pthread_self 2305880432 times_inside 1\n\000\061\n",

Solution:

This problem was resolved by resetting pthread_recursive_mutex_t::owner in the global mutex unlock function.

Thanks to Todd DeSantis for helping with debugging, review and verification of this problem.

Change-Id: Ibe01518094388080a143e31c70ab7ce0ddfca702
Signed-off-by: Indira Sawant <indira.sawant@ibm.com>
Reviewed-on: https://gerrit.openafs.org/15604
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
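
The idea behind the fix in the commit message above (clear the recorded owner when the last nested unlock releases the lock) can be illustrated with a small sketch. This is not the OpenAFS code: the type rmutex_t and the functions rmutex_init, rmutex_lock and rmutex_unlock are hypothetical names chosen for illustration, and only the field names echo those mentioned in the commit message. It also assumes that an all-zero pthread_t never compares equal to a live thread ID, which holds on common Linux/glibc builds but is not guaranteed by POSIX, and it makes no claim about memory-ordering behaviour on any particular platform.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Hypothetical recursive mutex built on a plain pthread mutex.
 * Field names echo the commit message; this is NOT the OpenAFS type. */
typedef struct {
    pthread_mutex_t mut;        /* underlying non-recursive mutex */
    pthread_t owner;            /* meaningful only while locked != 0 */
    unsigned int locked;        /* 1 while some thread holds the lock */
    unsigned int times_inside;  /* recursion depth of the holder */
} rmutex_t;

static int
rmutex_init(rmutex_t *m)
{
    memset(&m->owner, 0, sizeof(m->owner));
    m->locked = 0;
    m->times_inside = 0;
    return pthread_mutex_init(&m->mut, NULL);
}

static int
rmutex_lock(rmutex_t *m)
{
    int rc;

    /* Recursive entry: trusted only if the recorded owner is this thread.
     * A stale owner left over from an earlier unlock would defeat this. */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside++;
        return 0;
    }
    rc = pthread_mutex_lock(&m->mut);
    if (rc == 0) {
        m->owner = pthread_self();
        m->times_inside = 1;
        m->locked = 1;
    }
    return rc;
}

static int
rmutex_unlock(rmutex_t *m)
{
    if (!m->locked || !pthread_equal(m->owner, pthread_self()))
        return -1;              /* caller does not hold the lock */

    assert(m->times_inside > 0);
    if (--m->times_inside > 0)
        return 0;               /* still inside a nested lock */

    m->locked = 0;
    /* Assumption: zeroed bytes stand for "no owner" (see lead-in). */
    memset(&m->owner, 0, sizeof(m->owner));
    return pthread_mutex_unlock(&m->mut);
}
```

In this sketch the owner is cleared before the underlying pthread mutex is released, so the write happens while the lock is still held. A thread that locked and unlocked earlier therefore no longer leaves its own ID behind in owner for a later recursive-entry check to match against.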

build-tools
doc
src
tests
.gitignore
.gitreview
.mailmap
.splintrc
acinclude.m4
CODING
configure-libafs.ac
configure.ac
CONTRIBUTING
INSTALL
libafsdep
LICENSE
Makefile-libafs.in
Makefile.in
NEWS
NTMakefile
README
README-WINDOWS
regen.sh

AFS is a distributed file system that enables users to share and access all of the files stored in a network of computers as easily as they access the files stored on their local machines. The file system is called distributed for this exact reason: files can reside on many different machines, but are available to users on every machine.

OpenAFS 1.0 was originally released by IBM under the terms of the IBM Public License 1.0 (IPL10). For details on IPL10 see the LICENSE file in this directory. The current OpenAFS distribution is licensed under a combination of the IPL10 and many other licenses as granted by the relevant copyright holders. The LICENSE file in this directory contains more details, though it is not a comprehensive statement.

See INSTALL for information about building and installing OpenAFS on various platforms.

See CODING for developer information and guidelines.

See NEWS for recent changes to OpenAFS.