Indira Sawant e4fda3481d util: Clear owner when unlocking recursive mutex
There is a race condition in which the pthread_recursive_mutex_t::owner
maintained by AFS does not match the thread that is trying to unlock the mutex.

This causes the AFS file server and ptserver to crash with an assertion failure
while trying to unlock the grmutex.

We saw the race more often when our customer migrated their machines from
Power8 to Power9 systems and increased the SMT value from 2 to 4.

fileserver        Assertion failed! file keys.c, line 911.
ptserver          Assertion failed! file userok.c, line 78.

File: keys.c

 889 int
 890 afsconf_GetKeyByTypes(struct afsconf_dir *dir, afsconf_keyType type,
 891                       int kvno, int subType,struct afsconf_typedKey **key)
 892 {
 893     int code = 0;
 894     struct subTypeList *subTypeEntry;
 895
 896     LOCK_GLOBAL_MUTEX;
 897
…
 910 out:
 911     UNLOCK_GLOBAL_MUTEX;   <<<<
 912     return code;
 913 }

Consider the following situation, where cpu0 and cpu1 are CPUs and T0 through
T5 are timestamps:

T0: thread1 locks grmutex, performs some operations, and unlocks it, leaving
itself set as pthread_recursive_mutex_t::owner. Since we presently do not reset
the owner on unlock, pthread_recursive_mutex_t::owner = thread1.
T1: thread0 starts on cpu0.
T2: thread1 starts on cpu1.
T3: thread0 tries to lock the AFS grmutex and acquires the corresponding
pthread_mutex. Before thread0 updates pthread_recursive_mutex_t::owner, a
context switch happens.
T3: thread1 on cpu1 tries to acquire grmutex and sees itself as the
pthread_recursive_mutex_t::owner, possibly because the owner was not reset and
has not been updated yet. So thread1 considers itself the owner and proceeds.
T4: thread0 updates pthread_recursive_mutex_t::owner; this time the update is
also synced across the CPU caches.
T5: thread1 tries to unlock the grmutex and crashes because it is no longer the
owner of the mutex.

Debugging:

We implemented a circular log to store certain values related to grmutex, which
helped us debug this further.

({  \
   time_t t; \
   time(&t); \
   LOG_EVENT("%s: Unlocking TID %u: %s:%d owner %lu " \
	     "locked %d  pthread_self %u times_inside %d\n", \
              ctime(&t), (unsigned)grmutex.mut.__data.__owner,\
	      __func__    , __LINE__, \
              grmutex.owner, grmutex.locked, (unsigned)pthread_self(), \
 	      grmutex.times_inside); \
   opr_Verify(pthread_recursive_mutex_unlock(&grmutex)==0); \
})

$614 =   "Mon Sep 11 19:35:34 2023\n: Locking TID 136896:
afsconf_GetKeyByTypes:896 owner 140735030161776 locked 1
pthread_self 2305880432 times_inside 1\n\000 2\n",

$615 =   "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896:
afsconf_IsLocalRealmMatch:602 owner 140735030161776 locked 1
pthread_self 1836773744 times_inside 2\n",

$617 =   "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896:
afsconf_GetKeyByTypes:911 owner 140735030161776 locked 1
pthread_self 2305880432 times_inside 1\n\000\061\n",
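
For illustration only, a circular log of the kind mentioned above could be
sketched as follows. This is not the code used during debugging: the
implementation of LOG_EVENT is not shown in this message, so log_event,
log_ring, LOG_SLOTS and LOG_WIDTH are hypothetical names and sizes. The sketch
keeps fixed-width text slots in memory so they can be inspected from a
debugger after a crash.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOG_SLOTS 1024          /* hypothetical number of entries kept */
#define LOG_WIDTH 256           /* hypothetical maximum entry length */

/* Fixed-width slots; the newest entry overwrites the oldest once full. */
char log_ring[LOG_SLOTS][LOG_WIDTH];
atomic_ulong log_next;          /* total number of entries written so far */

/*
 * Format one entry into the next slot. The atomic counter gives each caller
 * its own index; two writers can still share a slot if the ring wraps between
 * them, which this sketch accepts for a debugging aid.
 */
void
log_event(const char *fmt, ...)
{
    va_list ap;
    unsigned long idx = atomic_fetch_add(&log_next, 1) % LOG_SLOTS;

    va_start(ap, fmt);
    vsnprintf(log_ring[idx], LOG_WIDTH, fmt, ap);
    va_end(ap);
}

#define LOG_EVENT(...) log_event(__VA_ARGS__)
```

Keeping the ring in a plain global array means nothing is written to disk on
the locking path, and the entries remain available in a core file.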

Solution:

This problem was resolved by resetting pthread_recursive_mutex_t::owner in the
global mutex unlock function.
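
The idea can be sketched as below. This is a simplified illustration, not the
actual change: rmutex_t, rmutex_lock and rmutex_unlock are hypothetical names,
and only the field names (mut, owner, locked, times_inside) follow the debug
macro above. It also assumes that an all-zero pthread_t never compares equal
to a live thread, which POSIX does not guarantee.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for pthread_recursive_mutex_t. */
typedef struct {
    pthread_mutex_t mut;
    pthread_t owner;
    int locked;
    int times_inside;
} rmutex_t;

int
rmutex_lock(rmutex_t *m)
{
    int rc;

    /* Recursive entry: taken only if the caller is the recorded owner. */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside++;
        return 0;
    }
    rc = pthread_mutex_lock(&m->mut);
    if (rc == 0) {
        m->times_inside = 1;
        m->owner = pthread_self();
        m->locked = 1;
    }
    return rc;
}

int
rmutex_unlock(rmutex_t *m)
{
    if (!m->locked || !pthread_equal(m->owner, pthread_self()))
        return -1;              /* caller does not hold the mutex */
    if (--m->times_inside == 0) {
        m->locked = 0;
        /*
         * Clear the owner before releasing, so a previous holder cannot see
         * a stale copy of itself in rmutex_lock(). Assumption: an all-zero
         * pthread_t does not match any live thread.
         */
        memset(&m->owner, 0, sizeof(m->owner));
        return pthread_mutex_unlock(&m->mut);
    }
    return 0;
}
```

With the owner cleared, a thread that released the mutex earlier fails the
ownership check in rmutex_lock() and blocks in pthread_mutex_lock() instead of
proceeding as if it still held the lock.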

Thanks to Todd DeSantis for helping with debugging, review and verification of
this problem.

Change-Id: Ibe01518094388080a143e31c70ab7ce0ddfca702
Signed-off-by: Indira Sawant <indira.sawant@ibm.com>
Reviewed-on: https://gerrit.openafs.org/15604
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
2023-12-28 18:05:40 -05:00
build-tools build: package ltmain.sh in the libafs_tree 2022-07-21 10:36:03 -04:00
doc doc: Fix the AFS::ukernel man page title 2023-07-12 23:48:30 -04:00
src util: Clear owner when unlocking recursive mutex 2023-12-28 18:05:40 -05:00
tests util: Avoid bad ascii[1] in volutil_GetPartitionID 2023-09-17 00:46:03 -04:00
.gitignore Remove alpha_dux/alpha_osf references 2018-09-22 17:05:26 -04:00
.gitreview Add .gitreview 2018-02-04 15:34:55 -05:00
.mailmap git: add a mailmap file 2016-09-25 21:05:23 -04:00
.splintrc
acinclude.m4 dir: Introduce struct DirEntryFlex 2023-11-09 12:24:58 -05:00
CODING gcc: Avoid false positive use-after-free in crypto 2023-07-05 16:40:35 -04:00
configure-libafs.ac autoconf: Remove/update obsolete autoconf macros 2021-12-02 11:57:47 -05:00
configure.ac autoconf: Remove/update obsolete autoconf macros 2021-12-02 11:57:47 -05:00
CONTRIBUTING Correct our contributor's code of conduct 2020-09-04 10:01:28 -04:00
INSTALL configure: Add platform rs_aix73 2023-03-01 23:08:39 -05:00
libafsdep Move build support files into build-tools 2010-07-14 20:40:36 -07:00
LICENSE cf: Make local copy of ax_gcc_func_attribute.m4 2020-07-24 08:35:59 -04:00
Makefile-libafs.in Fix libafs_tree's cross-architecture support 2010-05-24 20:28:41 -07:00
Makefile.in build: clean up some more generated files 2023-12-23 14:59:17 -05:00
NEWS Update NEWS for OpenAFS 1.9.1 2021-03-18 21:48:27 -04:00
NTMakefile Remove rpctestlib 2021-06-10 12:59:53 -04:00
README Tweak grammar in README 2015-12-28 19:32:17 -05:00
README-WINDOWS Update windows build documentation 2013-07-02 15:14:09 -07:00
regen.sh Use autoconf-archive m4 from src/external 2020-05-08 11:30:36 -04:00

AFS is a distributed file system that enables users to share and
access all of the files stored in a network of computers as easily as
they access the files stored on their local machines. The file system is
called distributed for this exact reason: files can reside on many
different machines, but are available to users on every machine.

OpenAFS 1.0 was originally released by IBM under the terms of the
IBM Public License 1.0 (IPL10).  For details on IPL10 see the LICENSE
file in this directory.  The current OpenAFS distribution is licensed
under a combination of the IPL10 and many other licenses as granted by
the relevant copyright holders.  The LICENSE file in this directory
contains more details, though it is not a comprehensive statement.

See INSTALL for information about building and installing OpenAFS
on various platforms.

See CODING for developer information and guidelines.

See NEWS for recent changes to OpenAFS.