Indira Sawant 1f63ffef47 util: Clear owner when unlocking recursive mutex
Fix a race condition in which the pthread_recursive_mutex_t::owner maintained
by AFS does not match the thread that is trying to unlock the mutex.

This causes the AFS fileserver and ptserver to crash with an assertion failure
while trying to unlock the grmutex.

We saw the race more often when our customer migrated their machines from
Power8 to Power9 systems and increased the SMT value from 2 to 4.

fileserver        Assertion failed! file keys.c, line 911.
ptserver          Assertion failed! file userok.c, line 78.

File: keys.c

 889 int
 890 afsconf_GetKeyByTypes(struct afsconf_dir *dir, afsconf_keyType type,
 891                       int kvno, int subType,struct afsconf_typedKey **key)
 892 {
 893     int code = 0;
 894     struct subTypeList *subTypeEntry;
 895
 896     LOCK_GLOBAL_MUTEX;
 897
…
 910 out:
 911     UNLOCK_GLOBAL_MUTEX;   <<<<
 912     return code;
 913 }

Consider the following situation, where cpu0 and cpu1 are the CPUs and T0
through T5 are timestamps:

T0: thread1 locks grmutex, performs some operations, and unlocks it, having
set itself as pthread_recursive_mutex_t::owner. Since we currently do not
reset the owner on unlock, pthread_recursive_mutex_t::owner = thread1.
T1: thread0 starts on cpu0.
T2: thread1 starts on cpu1.
T3: thread0 tries to lock the AFS grmutex and acquires the corresponding
pthread_mutex. Before thread0 updates pthread_recursive_mutex_t::owner, a
context switch happens.
T3: thread1 on cpu1 tries to acquire grmutex and sees itself as the
pthread_recursive_mutex_t::owner, possibly because it was not reset and has not
been updated yet. So thread1 believes it is the owner and proceeds.
T4: thread0 updates pthread_recursive_mutex_t::owner; this time the update is
also synced across the CPU caches.
T5: thread1 tries to unlock the grmutex and crashes because it is no longer the
owner of the mutex.
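
The timeline above can be illustrated with a simplified sketch of a recursive
mutex that never clears its owner. This is not the OpenAFS source: the type
rec_mutex_sketch_t and the functions rec_lock_unfixed and rec_unlock_unfixed
are hypothetical names, and only the field names (mut, owner, locked,
times_inside) are taken from the grmutex fields printed in the debug log. The
sketch assumes the lock path has an unsynchronized "am I already the owner?"
check, which is the kind of check that a stale owner value can fool.

```c
#include <pthread.h>

/* Hypothetical stand-in for pthread_recursive_mutex_t (assumption). */
typedef struct {
    pthread_mutex_t mut;        /* underlying non-recursive mutex */
    pthread_t owner;            /* last thread recorded as owner */
    unsigned int locked;        /* nonzero while some thread holds mut */
    unsigned int times_inside;  /* recursion depth of the owner */
} rec_mutex_sketch_t;

/* Lock path WITHOUT the fix (illustrative only). */
int
rec_lock_unfixed(rec_mutex_sketch_t *m)
{
    int rc;

    /* Unsynchronized fast path: if this thread observes locked != 0 while
     * owner still holds its own stale ID, it wrongly concludes that it
     * already holds the mutex and skips pthread_mutex_lock(). */
    if (m->locked && pthread_equal(m->owner, pthread_self())) {
        m->times_inside++;
        return 0;
    }

    rc = pthread_mutex_lock(&m->mut);
    if (rc == 0) {
        /* Until these stores are visible to other CPUs, another thread
         * may still see the previous owner. */
        m->times_inside = 1;
        m->owner = pthread_self();
        m->locked = 1;
    }
    return rc;
}

/* Unlock path WITHOUT the fix: owner is left behind after the release. */
int
rec_unlock_unfixed(rec_mutex_sketch_t *m)
{
    if (!m->locked || !pthread_equal(m->owner, pthread_self()))
        return -1;              /* caller is not the recorded owner */

    if (--m->times_inside == 0) {
        m->locked = 0;
        return pthread_mutex_unlock(&m->mut);
    }
    return 0;
}
```

In this sketch, a thread that locked and unlocked the mutex earlier stays
recorded in owner, so the fast path in rec_lock_unfixed depends entirely on
the locked flag being observed correctly.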

Debugging:

We implemented a circular log to store certain values related to grmutex, which
helped us debug this further.

({  \
   time_t t; \
   time(&t); \
   LOG_EVENT("%s: Unlocking TID %u: %s:%d owner %lu " \
	     "locked %d  pthread_self %u times_inside %d\n", \
              ctime(&t), (unsigned)grmutex.mut.__data.__owner,\
	      __func__    , __LINE__, \
              grmutex.owner, grmutex.locked, (unsigned)pthread_self(), \
 	      grmutex.times_inside); \
   opr_Verify(pthread_recursive_mutex_unlock(&grmutex)==0); \
})

$614 =   "Mon Sep 11 19:35:34 2023\n: Locking TID 136896:
afsconf_GetKeyByTypes:896 owner 140735030161776 locked 1
pthread_self 2305880432 times_inside 1\n\000 2\n",

$615 =   "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896:
afsconf_IsLocalRealmMatch:602 owner 140735030161776 locked 1
pthread_self 1836773744 times_inside 2\n",

$617 =   "Mon Sep 11 19:35:34 2023\n: Unlocking TID 136896:
afsconf_GetKeyByTypes:911 owner 140735030161776 locked 1
pthread_self 2305880432 times_inside 1\n\000\061\n",
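
The implementation behind LOG_EVENT is not shown here. As a rough sketch of
what such a circular log could look like, the following keeps a fixed number
of fixed-size text slots in memory and overwrites the oldest entry once the
ring is full, so the most recent events can be inspected from a debugger after
a crash. All names (log_ring, log_next, log_event, LOG_SLOTS, LOG_SLOT_LEN) and
both sizes are assumptions made for illustration.

```c
#include <stdarg.h>
#include <stdatomic.h>
#include <stdio.h>

#define LOG_SLOTS    1024   /* hypothetical number of entries kept */
#define LOG_SLOT_LEN 256    /* hypothetical maximum bytes per entry */

char log_ring[LOG_SLOTS][LOG_SLOT_LEN];
atomic_uint log_next;       /* count of events logged so far */

/* Hypothetical backend for a LOG_EVENT-style macro: format the message into
 * the next slot, wrapping around so only the newest LOG_SLOTS entries stay. */
void
log_event(const char *fmt, ...)
{
    va_list ap;
    /* The atomic increment gives each caller its own slot index. */
    unsigned int slot = atomic_fetch_add(&log_next, 1) % LOG_SLOTS;

    va_start(ap, fmt);
    /* vsnprintf truncates and NUL-terminates; bytes left over from an older,
     * longer entry may remain after the terminator. */
    vsnprintf(log_ring[slot], LOG_SLOT_LEN, fmt, ap);
    va_end(ap);
}
```

Logging into memory rather than to a file keeps the cost per event small,
which matters when the goal is to observe a timing-sensitive race without
hiding it.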

Solution:

This problem was resolved by resetting pthread_recursive_mutex_t::owner in the
global mutex unlock function.
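
A minimal sketch of the idea, not the actual patch: when the outermost unlock
releases the mutex, the recorded owner is cleared first, so a later unlocked
check cannot match a thread that no longer holds it. The type
rec_mutex_sketch_t and the function rec_unlock_fixed are hypothetical names,
and zeroing the owner with memset is an assumption about how the reset might
be done; it relies on an all-zero pthread_t never comparing equal to a live
thread, which POSIX does not guarantee.

```c
#include <pthread.h>
#include <string.h>

/* Hypothetical stand-in for pthread_recursive_mutex_t (assumption). */
typedef struct {
    pthread_mutex_t mut;        /* underlying non-recursive mutex */
    pthread_t owner;            /* thread currently recorded as owner */
    unsigned int locked;        /* nonzero while some thread holds mut */
    unsigned int times_inside;  /* recursion depth of the owner */
} rec_mutex_sketch_t;

/* Unlock path WITH the owner reset (illustrative only). */
int
rec_unlock_fixed(rec_mutex_sketch_t *m)
{
    int rc = 0;

    if (!m->locked || !pthread_equal(m->owner, pthread_self()))
        return -1;              /* caller is not the recorded owner */

    if (--m->times_inside == 0) {
        m->locked = 0;
        /* The reset: forget the owner while mut is still held, so no
         * stale thread ID survives the release. */
        memset(&m->owner, 0, sizeof(m->owner));
        rc = pthread_mutex_unlock(&m->mut);
    }
    return rc;
}
```

Nested unlocks leave owner untouched; only the final unlock, which actually
releases the underlying mutex, clears it.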

Thanks to Todd DeSantis for helping with debugging, review and verification of
this problem.

Signed-off-by: Indira Sawant <indira.sawant@ibm.com>
Reviewed-on: https://gerrit.openafs.org/15604
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
(cherry picked from commit e4fda3481dc9ec651377493afbc95bd40f4f1fb2)

Change-Id: I400892121d1b1f63adcd6848e774ede1c4ec5da9
Reviewed-on: https://gerrit.openafs.org/15609
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Mark Vitale <mvitale@sinenomine.net>
Reviewed-by: Andrew Deason <adeason@sinenomine.net>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
2024-01-07 16:36:37 -05:00

AFS is a distributed file system that enables users to share and
access all of the files stored in a network of computers as easily as
they access the files stored on their local machines. The file system is
called distributed for this exact reason: files can reside on many
different machines, but are available to users on every machine.

OpenAFS 1.0 was originally released by IBM under the terms of the
IBM Public License 1.0 (IPL10).  For details on IPL10 see the LICENSE
file in this directory.  The current OpenAFS distribution is licensed
under a combination of the IPL10 and many other licenses as granted by
the relevant copyright holders.  The LICENSE file in this directory
contains more details, though it is not a comprehensive statement.

See INSTALL for information about building and installing OpenAFS
on various platforms.

See CODING for developer information and guidelines.

See NEWS for recent changes to OpenAFS.
