This file contains a threading analysis of Ubik prepared by Jeffrey Hutzelman and sent ot the openafs-devel@openafs.org mailing list in February 2011, archived at: https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html INTRODUCTION This document describes the structure of Ubik, with an eye toward thread-safety and concurrency. It begins by discussing the elements, usage, and locking considerations of major shared data structures. This is followed by a discussion of each major subsystem. The emphasis is on deficiencies in thread-safety and the changes needed to correct them. Most of this document focuses on the user-mode Ubik server code, which comprises the bulk of the Ubik library code and also of the complexity. A separate section near the end is dedicated to Ubik client code, which itself is intended primarily for use in user-mode applications. The OpenAFS cache manager includes its own mechanisms for accessing replicated services, including the Ubik-managed VLDB, and for tracking the servers that provide them. Occasionally, issues related to concurrency, performance, and modularity (particularly as regards support for supporting multiple independent databases in a single server process) are also discussed; however, these discussions are peripheral and are included only to record data obtained while examining the question of thread-safety and as a basis for future work. Similarly, throughout this document are detailed discussions of some parts of the code and certain data structures which, while not always directly relevant to thread-safety, were produced in the course of analyzing the Ubik library code. These are included to provide additional background and as a basis for future work in documentation of Ubik library internals and interfaces. MAJOR DATA STRUCTURES This section calls out the major shared data structures used in the Ubik server code and details what they are used for and how shared access to them is protected. Much of this text is probably suitable as a starting point for expanding on the documentation found in comments in ubik.p.h. struct ubik_dbase - database This structure contains nearly everything Ubik knows about an open database (for an explanation of "nearly", see the section below on globals), including database-wide parameters and state and a list of active transactions. The following structure elements are set only when the database structure is being initialized, after which they are never modified: + pathName - The base path used to construct database filenames + physical later method pointers (read, write, truncate, sync, stat, open, setlabel, getlabel, getnfiles), used to manipulate the physical database files on disk. These currently always point to the methods defined in phys.c, and in fact other modules contain knowledge of the internal implementation of that module. The primary synchronization mechanism used to protect this structure is the database lock (versionLock), which is manipulated via the DBHOLD() and DBRELE() macros. This lock protects the following structure elements, as well as a number of globals as described elsewhere in this document: + activeTrans - list of active transactions + version - database version (*) + tidCounter - transaction ID of latest transaction (*) + writeTidCounter - transaction ID of latest write transaction (*) + flags (*) + readers (*) These fields are currently protected by the database lock, except that the beacon thread does not hold that lock when examining them, or the global 'ubik_epochTime', which holds the corresponding epoch. The recommendation is for these fields, along with ubik_epochTime, to be additionally protected by a second lock, allowing the beacon thread to examine them without holding the database lock; see the section on the BEACON package for details. The condition variable 'version_cond' is used to signal to that the database version may have changed; it is broadcast in udisk_commit(), in SDISK_SendFile(), and from the recovery thread; it is monitored by ubik_WaitVersion(), which can be called by an application to wait for a database version change (this is not currently used in OpenAFS). This CV is associated with the database lock. When LWP is used, this condition is signalled on &ubik_dbase->version. The condition variable 'flags_cond' is used by udisk_end() to signal that DBWRITING flag has been cleared. This wakes threads waiting in ubik.c:BeginTrans() to begin a new transaction. This CV is associated with the database lock. When LWP is used, this condition is signalled on &ubik_dbase->flags. When LWP is used, signals are also sometimes sent on &urecovery_state and on &ubik_amSyncSite. However, nothing ever waits on these, and no corresponding condition variables exist when pthreads is used. The 'cachedVersion' field holds the database version represented by the application's cache; this is compared against the current database version to discover if the cache is up to date. Both cachedVersion and the application cache itself are protected by cache_lock, an AFS reader/writer lock. When the application wishes to use the cache, it calls ubik_CacheUpdate(), which verifies that the cache is current, updating it if necessary (under a write lock), and then returns with the application cache read-locked. The application can then rely on the cache's contents not to change until it ends the transaction by calling ubik_EndTrans() or ubik_AbortTrans(). When one of these functions is called, the read lock is dropped, and a write lock may be acquired temporarily to update both cachedVersion and the cache (in the case of committing a write transaction) or to clear cachedVersion, indicating the cache is invalid (in the case of aborting a transaction). In the case of ubik_EndTrans() on a write transaction, the cache is held write-locked until udisk_commit() has completed; this is also done in SDISK_Commit(), which insures that the contents of the on-disk database and page cache cannot change while an active transaction holds the cache lock. Note that both the recovery thread and SDISK_SendFile() may replace the database contents wholesale; however, these operations first abort all outstanding transactions. Prior to the introduction of ubik_BeginTransReadAnyWrite(), an application could choose to manage its own cache, updating it during write transactions and when, at the beginning of a read transaction, it was discovered to be out of date (due to a write done by another server). Under this model, the application uses Ubik transaction locks to protect not only the database but also its cache. However, the use of ubik_BeginTransReadAnyWrite() allows some transactions to proceed and read the database without acquiring any locks, which makes it unsafe to depend on Ubik transaction locks to protect the application cache. Applications which make use of this mode must set ubik_SyncWriterCacheProc to point to a procedure to be called to update the cache during ubik_EndTrans(), after udisk_commit() has succeeded and before the cache_lock is released. These applications must then refrain from ever updating the cache other than during that procedure or during the callback passed to ubik_CheckCache(). struct ubik_trans - active transaction This structure represents the state of an active transaction, including transaction metadata and state and a list of all changes made as part of the transaction. Transactions are organized into a linked list referenced by the corresponding database structure, protected by the database lock. Each transaction contains an upward pointer to the database, a transaction type, and a transaction ID, all of which are immutable for as long as the transaction exists. A transaction created via the REMOTE interface exists until ended by a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in urecovery_CheckTid() due to a transaction ID mismatch or creation of a new transaction (note that in this last case, cleanup will be deferred if SDISK_Lock() is blocked on this transaction). In these cases, ubik_currentTrans is cleared at the same time, under the database lock, and further REMOTE calls will fail gracefully until a new transaction is started. Thus, REMOTE transactions may be considered valid only for as long as the database lock is held. Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(), or ubik_BeginTransReadAnyWrite() exist until ended by a call to ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from within the UBIK package. Code which discovers a transaction by traversing the database's active transaction list may access the transaction only as long as the database lock is held, since once it is released, an application thread may end the transaction. Each transaction contains the following fields, which are protected by the database lock: + locktype - type of lock held by this transaction + minCommitTime - unused + seekFile - file number of database file pointer + seekPos - position of database file pointer + flags File writes are represented by 'iovec_info', an XDR vector structure (type iovec_wrt, a vector of struct ubik_iovec) containing the file, position, and length of each write, and 'iovec_data', an XDR opaque containing the corresponding data. These vectors and their contents are protected by the database lock. File truncations are represented by a linked list of struct ubik_trunc, each of which contains a file number and the new size of that file. This list, and the records contained in it, are protected by the database lock. While transactions are currently protected by the database lock, it may be desirable in the future to introduce a per-transaction lock, in order to facilitate increased concurrency. struct ubik_server - server status This structure is used to track the set of peer servers, including state related to elections and recovery. The global 'ubik_servers' points to a linked list of these structures, which is constructed during initialization. Once the list has been constructed, no entries are ever added or removed; in addition, the following fields are not changed after startup: + addr[0] - server's primary address + magic - true if this is the magic server + isClone - true if this server is a clone The following fields are used by the BEACON package to track beacons to this server and their results. Currently, they are sometimes accessed under the database lock, and sometimes without benefit of any lock. They should probably be protected by the same lock as globals private to the BEACON package. + lastVoteTime - last time this server voted yes for us + lastBeaconSent - last time a beacon was sent to this server + lastVote - true if its late vote response was yes + up - true if this server is believed to be up + beaconSinceDown - true if a beacon has successfully been sent to this server since the last time it was down The following fields are used by the RECOVERY package to track the state of each server. They are also examined and modified by the various ContactQuorum_* functions. Currently, they are sometimes accessed under the database lock, and sometimes without benefit of any lock. They should be fully protected by the database lock. + version - server's database version + currentDB - true if server's database is up-to-date The addr[] array is used to track all known addresses for the remote server; this list is used when attempting to contact a down server via an alternate address, and to identify the source of incoming server-to-server RPCs. The first element of this array is always the primary address and does not change after initialization; other elements are updated during initialization and when a remote server calls DISK_UpdateInterfaceAddr(). The system normally maintains an active Rx connection to the VOTE and DISK services on each server; pointers to these are kept in the 'vote_rxcid' and 'disk_rxcid' fields. Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not protected by any lock. Because these are shared among a number of subsystems and it is usually desirable to avoid blocking on the database lock when making RPCs, it is recommended to introduce a new lock (the "server address lock") to protect these fields and the globals 'ubikSecClass' and 'ubikSecIndex', which contain the security classes used for setting up such connections. The new lock would need to be held in the following cases: + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for the time it takes to call rx_GetConnection() or rx_NewCall() + For the duration of ubikGetPrimaryInterfaceAddr() + In ubeacon_Interact(), when building the set of connections to be used for multi_VOTE_Beacon(). + In updateUbikNetworkAddress(), first while constructing the set of connections to be used for multi_DISK_UpdateInterfaceAddr(), and then again each time a server's list of addresses is updated. However, the lock must NOT be held while waiting for the RPC to complete, to avoid deadlock when multiple servers are starting at the same time. + For the duration of SDISK_UpdateInterfaceAddr() + For the duration of ubeacon_ReinitServer(), with the optional exception of the call to afsconf_UpToDate(). + In src/ubik/recovery.c:DoProbe(), first while constructing the set of connections to be used for multi_DISK_Probe() and, if the probe is successful, again while destroying the old connections for the server under test and replacing them with new ones. See the section on the RECOVERY package for additional notes. GLOBAL VARIABLES The library has an unfortunate number of global variables which are not part of any data structure. Nonetheless, some of these are database-specific; see MULTI-DATABASE SUPPORT, below. Globals private to a particular subsystem are discussed in the description of that subsystem. This section discusses global variables which are shared and/or part of external interfaces. The following globals are used to convey configuration parameters from the application to the Ubik library, and are intended to be set before the first call to ubik_ServerInit(). They are not modified by Ubik, and are not protected by locks: + ubik_debugFlag - print debug messages (unused) + ubikPrimaryAddrOnly - use only primary interface + ubik_nBuffers - number of DISK package page cache buffers + ubik_SRXSecurityProc - callback to create server security class + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc + ubik_CRXSecurityProc - callback to create client security class + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc + ubik_SyncWriterCacheProc - callback to update application cache + ubik_CheckRXSecurityProc - callback to check if caller is OK + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc The following globals contain values which are computed during Ubik initialization and not modified thereafter. They are not protected by any lock, which is safe so long as suitable memory barriers exist across creation of threads and Rx services (see the notes on initialization in the UBIK package section). Note that some of these values are database-specific and properly belong in the ubik_dbase structure: + ubik_servers - linked list of servers (see above) + amIClone - true if this server is a clone + ubik_callPortal - UDP port used for server-to-server RPCs + ubik_quorum - number of servers to make a quorum + ubik_dbase - pointer to the (only) database (see above) + ubik_host[] - array of local addresses + ubik_sc[] - security classes for Ubik RPC services The following globals are currently protected by the database lock, though in most cases there are existing references which do not hold the lock. Later sections of this document contain recommendations in each case to establish new locks to protect these or to correct cases where they are accessed without the correct lock. + ubik_currentTrans - pointer to REMOTE's current transaction + ubik_epochTime - epoch for transaction IDs + urecovery_state - recovery state bits + ubik_dbVersion - sync site's database version + ubik_dbTid - sync site's transaction ID + ubikSecIndex - security class index for making Ubik RPCs + ubikSecClass - security class for making Ubik RPCs The global 'ubik_amSyncSite' belongs to the BEACON package, and should be protected by the same lock as that package's private globals. In fact, it should probably be made private; the only references outside BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved into ubeacon_Debug(). MAJOR SUBSYSTEMS This section describes each of the major Ubik server subsystems. For each subsystem, a brief overview is provided, along with a description of that subsystem's handling of synchronization (or lack thereof). Subsystems are described beginning with the lowest layer and working toward the highest. PHYS - Physical I/O This package is responsible for handling I/O to database files. It maintains a private, global array of MAXFDCACHE(4) fdcache structures, each representing an open file descriptor on a database file, either presently in use or idle and available for use. Entries are indexed by database file ID, and each contain a file descriptor number and a refcount. There is also a global 'inited' flag, and a buffer used to construct pathnames when opening database files. The static function uphys_open() takes a database pointer and fileid and attempts to open the corresponding file. On success, it returns a file descriptor; on failure, it returns the same as open(2) (with errno set). This prefers to reuse an already-open fd whose cache refcount is 0; failing that, it opens a new fd and attempts to enter it in the cache. It prefers first to use an unused cache entry, then to reclaim an idle one whose refcount is 0; if neither is possible, the newly-opened fd is returned without caching. Note that the refcount on an active entry is always exactly 1; the same entry cannot be referenced more than once. The function uphys_close() releases a file descriptor opened by uphys_open(). If the file descriptor was cached, the cache entry is dereferenced but the file is left open for future use; 0 is returned. If the file descriptor is not found in the cache, then the return value is that of an immediate call to close(2). This function also handles cleanup when a cached fd is closed after the file to which it refers is invalidated by uphys_invalidate(). This function is not static, but may as well be -- it is used only by other functions in the PHYS package. Functions in this package are called via method pointers in the database structure, which are set during initialization and not changed thereafter. These functions must be called with the database lock held; this protects the file descriptor cache and other globals mentioned above, and prevents conflicting access to database files. The requirement that functions in this package be called with the database lock held means that concurrent access to database files is not possible. This situation could be improved by the addition of a new global lock, which would be held for most of the duration of uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_ be held during actual file accesses. This would eliminate the requirement that calls to this package be made with the database lock held, allowing multiple calls to be active concurrently. However, it would still be necessary for higher layers to employ appropriate synchronization as needed to maintain database consistency. DISK - Logical Database I/O and Transaction Management This package is responsible for logical database file I/O, logging, and transaction management. It maintains a private cache of database file page buffers. This entire cache, including buffer space, is allocated and initialized the first time a transaction is started; the global initd flag keeps track of whether or not this has been done. The size of this cache defaults to 20 pages, but may be overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik services do this); this global is not protected by any lock, but is read exactly once, when the cache is initialized. Except for udisk_Debug() (see below), all udisk_* interfaces require a pointer either to the database or to an active transaction, and must be called with the database lock held. This protects the DISK package's accesses of the active transaction list and associated state variables as well as its calls into the physical access layer, including insuring that operations requiring multiple calls into that layer are performed atomically. In addition, the database lock protects the page cache hash table and LRU queue, the buffer control structures, buffer contents, the truncation record free list (freeTruncList), and global statistics. However, it may be desirable in the future to introduce a separate lock to protect these structures, in order to facilitate multiple database support. udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData These are internal interfaces for writing to the transaction log. They each construct a log record and then directly call the physical layer to append it to the log; the log file is _not_ cached by the page buffer cache, and these functions do not touch the cache. Each of these must be atomic with respect to other writes to the log file, which is accomplished by insuring they are called only with the database locked. DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew These are internal interfaces for accessing the page buffer cache. They must be called with the database lock held. DRead() and DNew() return a pointer to the page data buffer, not to the buffer control structure; this pointer is later passed to DRelease() to release the buffer. This mechanism depends on the fact that buffer control structures are allocated in the same order as the actual page data buffers. DRead() guarantees that a read transaction will always see a clean copy of the requested page, without any modifications made by an in-progress write transaction. However, this is true only if the next layer insures that once DRead() is called with write intent on a page, that no further calls to DRead() are made for that page until the buffer is written and DRelease() is called with the dirty flag. Presently, this is the case because udisk_read() and udisk_write(), which are the only callers of DRead(), are both called with the database lock held. udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit, udisk_abort, udisk_end, udisk_Invalidate These are external interfaces provided by the DISK package to higher layers. Presently, the locking requirements of these functions and the lower-layer functions they call are satisfied by the simple rule that these must always be called with the database lock. In the future, support for multiple databases and/or a greater level of concurrency may require relaxing this rule and/or acquiring additional locks within these functions. In udisk_commit(), when committing the first write transaction after becoming sync site, the database is relabelled. In this case, the UBIK_RECLABELDB recovery state bit should be set before propagating the label change to other servers, rather than after. This is because because ContactQuorum_DISK_Setversion() will release the database lock, at which point the recovery state may be cleared by urecovery_ResetState() running in another thread. It is important that a relabelling which occurs before recovery state is cleared not result in the UBIK_RECLABELDB recovery state bit being set after; otherwise, the server may fail to correctly relabel the database the next time it becomes sync site. udisk_Debug The udisk_Debug() interface is called as part of preparing the response to Ubik debugging requests. Unlike other udisk_* APIs, this interface is not called on a specific database and does not require the database lock to be held. It reports the version data of the single database identified by the global 'ubik_dbase' and traverses the buffer cache collecting statistics, both without benefit of holding any locks. This may produce inconsistent data but does not normally risk damaging data structures or accessing invalid or uninitialized memory (but see below), and avoiding locking may be of greater benefit than guaranteeing the return of consistent data. The global 'ubik_dbase' pointer is set when the first (only) database is initialized, and the buffer cache data structures are allocated the first time a transaction is started, either locally or via the REMOTE package. Once set, the pointers used by udisk_Debug() are never changed, and the data structures they point to remain valid and are never freed. The buffer cache is traversed by iterating over the array of buffer cache elements (this is done in other places as well), not by following a linked list or other data structure that may reordered, and so should not be dangerous. At a minimum, then, it should be arranged that udisk_Debug() cannot be called before the database structure and page cache have been initialized. This means that both of these actions must be taken before any incoming RPCs are accepted. Practically, this means that DInit() should be renamed udisk_init(), and called from ubik_ServerInitCommon() during initialization. Full correctness demands that udisk_Debug actually hold the database lock while copying the database version and examining the page cache. However, it may be desirable to sacrifice this correctness in order to reduce interaction between the debugging interfaces and the rest of the system. Helper functions: + unthread - remove transaction from activeTrans + Dlru, Dmru - move page to front/end of LRUQ + MatchBuffer - check if a buffer's dbase/file/page match + FixupBucket - move page to the right bucket + newslot - find and allocate an unused buffer + DedupBuffer - throw away stale copies after page is flushed + GetTrunc, PutTrunc - trunc record allocation + FindTrunc - find active trunc record for file + DoTruncs - apply truncations LOCK - Database Transaction Locking This package is responsible for managing database locks held as part of a transaction. It supports a single read/write lock, which is represented by the global variable 'rwlock', a struct Lock. The global 'rwlockinit' records whether Lock_Init() has been called on this lock, and is protected by the database lock. Properly, this lock should be treated as belonging to the database. ulock_getLock() expects to be called with the database lock held; it depends on this to protect access to fields within the transaction structure as well as to the global 'rwlockinit'. However, note that the database lock must be temporarily released while obtaining the read/write lock, and so ulock_getLock() should not be called in the middle of a critical section. The 'await' parameter may in theory be set to 0 to effect a non-blocking lock; however, this depends on (incorrect) knowledge of the internals of struct Lock. Fortunately, existing Ubik code always sets await=1. This feature should be fixed or removed. ulock_relLock() releases the read/write lock. It expects to be called with the database lock held. Unlock ulock_getLock(), this function does not release the database lock. ulock_Debug() reports the state of the read/write lock, for debugging purposes. It depends on knowledge of the internals of struct Lock, and accesses both the Lock structure and the global rwlockinit without benefit of holding any lock. It would probably be best to eliminate 'rwlockinit' in favor of always initializing the global 'rwlock' during startup; ulock_Debug() would then have no need to examine the flag. Further, ulock_Debug() should at a minimum use the appropriate locks when looking inside struct Lock; ideally, knowledge of that structure's internals should be abstracted into appropriate interfaces in src/lwp/lock.[ch]. VOTE - Voting Logic and VOTE_* RPCs This package is responsible for keeping track of who is currently sync site and deciding how to vote and whether this server should try to become sync site. It provides the VOTE RPC package, which is primarily responsible for processing vote requests from peer servers (VOTE_Beacon), but also provides debugging interfaces and an interface (VOTE_GetSyncSite) to allow clients to discover the current sync site. Voting decisions are made using state tracked in a set of global variables, listed below. Currently there is no locking protecting these variables; LWP-based Ubik depends on the cooperative nature of LWP and on the fact that routines which examine and manipulate them do not block. Introduction of a new "vote lock" is recommended to protect the following globals: + ubik_lastYesTime - last time VOTE_Beacon voted "yes" + lastYesHost - host voted for at ubik_lastYesTime + lastYesClaim - time lastYesHost started its voting cycle + lastYesState - whether lastYesHost claimed to be sync site + lowestHost - lowest host seen + lowestTime - last time lowestHost was updated/heard from + syncHost - sync host, or 0 if none known + syncTime - last time syncHost was updated + ubik_dbVersion - sync site's database version + ubik_dbTid - sync site's transaction ID The vote lock should be initialized in uvote_Init() and held for the remainder of that function and also during of uvote_ShouldIRun(), uvote_GetSyncSite(). It should also be held for the bulk of SVOTE_Beacon() -- specifically, it should be acquired just before beginning the lowest-host calculation, and released just before calling urecovery_CheckTid(), in the event of a yes vote or else just before returning. In addition, the vote lock would protect the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent this server's knowledge of the state of the database on the sync site. This variable is set in two places in the REMOTE package, which change the local and sync site database versions at the same time. It is also examined in one place in the REMOTE package, and compared to the database version in the RECOVERY package. It is imperative that when the local and remote versions are changed at the same time, these changes become visible to the RECOVERY package at the same time. Thus, both the changes and the comparison must be done under both locks. For this purpose, three new functions should be provided by the VOTE package: + A function to safely set 'ubik_dbVersion' + A function to compare 'ubik_dbVersion' to another value + A function to check whether this is sync site and compare 'ubik_dbVersion' to another value, without releasing the vote lock in between (to be used in recovery). Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of which are already made under the database lock, would be replaced with calls to the new accessors. The result of these changes will require in several places that the vote lock be acquired while the database lock is already held. Therefore, the vote lock MUST NOT be held while acquiring the database lock. SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit of the database lock, which must be held when calling that function. This is fairly simple to fix; however, the vote lock must be released first. One simple approach would be to move the call to urecovery_CheckTid() to after the point where the vote lock will be released, wrapping it in DBHOLD/DBRELE, and conditionalizing the whole thing on 'vote' being nonzero. Separately, it may be worth examining whether it is appropriate to call urecovery_CheckTid() only when voting yes, and not whenever a beacon is received from a server claiming to be sync site. As with the debug functions in other packages, SVOTE_Debug() and SVOTE_DebugOld() access global variables belong to this and other modules without benefit of the appropriate locks. There is also very little abstraction here. It is probably desirable to introduce Debug interfaces for the VOTE and RECOVERY packages, to be called from these functions. Full correctness requires... + Holding the vote lock while accessing the VOTE package globals described above. + Holding the database lock while accessing the database flags and tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state BEACON - elections, tracking whether this is sync site This package is responsible for running elections, collecting votes, and deciding whether and for how long this is the sync site. It also contains routines responsible for setting up the list of servers and acquiring alternate addresses for other servers during startup. For this purpose, it maintains a number of global variables and ubik_server structure fields. Currently, these variables and fields are sometimes accessed under the database lock, and sometimes without benefit of any lock. In order to allow this package to operate with a minimum of disruption due to database accesses, it is recommended that a new lock be established to protect these. It is particularly desirable to prevent loss of sync due to the potentially long-lived RPCs the RECOVERY package must make under the database lock when transferring full database copies between servers; for this reason, it is important that both the BEACON thread and SVOTE_Beacon() be able to operate without acquiring the database lock. The global variables and ubik_server structure fields to be protected by the new lock are the following: + ubik_amSyncSite - true if this is the sync site + syncSiteUntil - time until which this is the sync site + ts->lastVoteTime - last time this server voted yes for us + ts->lastBeaconSent - last time a beacon was sent to this server + ts->lastVote - true if its late vote response was yes + ts->up - true if this server is believed to be up + ts->beaconSinceDown - true if a beacon has successfully been sent to this server since the last time it was down The new lock would need to be held in the following cases: + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is examined until after the potential call to urecovery_ResetState(). + In ubeacon_Interact(), when building the list of servers to which to send beacons (to protect access to the 'up' flag). Note that the server address lock must also be held during this loop, and must be acquired _after_ the beacon lock. + In ubeacon_Interact(), when updating status associated with a server after receiving its beacon response. + In ubeacon_Interact(), when updating globals after an election. However, it must be released before calling urecovery_ResetState(). + In updateUbikNetworkAddress(), when marking a server down. + In the various ContactQuorum_* functions, when checking whether a server is up before calling it, and again when marking a server down after a call fails. + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag and 'lastBeaconSent' timestamp to determine whether it has recently gone down, requiring a brief hold until it either comes back up or times out. The following global variables are set during initialization and do not change thereafter: + nServers - number of voting servers + amIMagic - true if this server gets the extra half vote + ubik_singleServer - true if this is the only server Currently, all initialization of this module is done _after_ the Rx services have been created and an Rx server thread started. This is done to avoid a situation in which all servers are blocked in updateUbikNetworkAddress(), which is responsible for exchanging addresses with peer servers, while none of them is prepared to handled the DISK_UpdateInterfaceAddr call made by that function. When using LWP, this early initialization of Rx is not a problem, because the various initialization routines don't actually block until updateUbikNetworkAddress() waits for RPCs to complete. However, when using pthreads, it is a problem, because it means that incoming RPCs may be processed when initialization is not complete. To address this problem and insure orderly initialization, it is recommended to change the initialization sequence so that most work, including the bulk of BEACON package initialization, happens before Ubik's Rx services have been created. The only work that should be deferred until after that point is updateUbikNetworkAddress(), followed by starting the long-running threads belonging to the BEACON and RECOVERY packages. This requires updateUbikNetworkAddress() to be exported (and perhaps to undergo a name change). The major part of the work of this package is done in the function ubeacon_Interact(), which is the body of a long-running thread known as the beacon thread. This thread is responsible for periodically sending "beacons" to other servers to request votes (when the sending server is the sync site, these also include how long it will remain so, the database version, and the active transaction ID). Beacons are sent only from servers that are already sync site or believe they are the best candidate, as determined by uvote_ShouldIRun(). On the sync site, the beacon thread is responsible for insuring that new votes are collected frequently enough to avoid loss of quorum. This means it must not block for an extended time; therefore, it must not acquire the database lock, which in other threads may sometimes be held during long-running operations (most notably, database file transfers done under this lock by the recovery thread). Instead, a new lock is proposed (the "version lock"), which is used to allow the beacon thread to examine (but not modify) certain fields without holding the database lock. The version lock should be acquired _in addition to_ the database lock when modifying 'ubik_epochTime' or the database 'version', 'flags', 'tidCounter', and 'writeTidCounter' fields; such modifications occur at the following locations: + in ubik.c:BeginTrans(), when beginning a new transaction + in udisk_begin(), when setting DBWRITING + in udisk_commit(), when updating the database version at the end of a write transaction (note this includes the case of relabelling the database on the first write transaction after becoming sync site) + in udisk_end(), when clearing DBWRITING + in recovery.c:InitializeDB(), when setting the initial version of an empty database + in urecovery_Interact(), when labelling a new database the first time quorum is established + in urecovery_Interact(), after fetching the latest database from another server (whether successful or not; there are two cases) + in SDISK_SendFile(), when updating the database version both before and after receiving a new database from the sync site. Note the lock must _not_ be held during the transfer. + in SDISK_SetVersion() The version lock should be acquired by the beacon thread while examining these fields, shortly before calling multi_VOTE_Beacon(). Since it must release the lock before making that call, the database version will need to be explicitly copied into a local variable, as is already done with the current transaction ID. Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an additional problem, as it must be modified by the beacon thread. If it is desirable to continue to support this variant, then it becomes necessary for all accesses to the database flags to be protected by the version lock. Other UBIK_PAUSE code may not have been reviewed in depth and may pose additional problems. The function ubeacon_AmSyncSite() is called both from the beacon thread and from various other points to determine whether they are running on the sync site, and thus whether various operations are safe and appropriate. Calls to this function from outside the beacon thread (often, via urecovery_AllBetter()) must be made with the database lock held, and for operations modifying the database or relying on its contents to remain constant, this lock must not then be released until the operation is complete. The does not guarantee that this will remain sync site for the duration of the operation, but does insure that any operations done after or as a result of losing sync will in fact occur after the operation holding the lock. Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite() with the database lock held, since it must not block on that lock when running on the sync site (otherwise sync could be lost, as described above). So, that portion of ubeacon_AmSyncSite() other than the call to urecovery_ResetState() should be abstracted out into a new function, which presumably returns separate flags indicating both whether it is in fact running on the sync site (which becomes the return value of ubeacon_AmSyncSite()) and whether a call to urecovery_ResetState() is indicated. When the latter is set, the beacon thread can acquire the database lock (since, if this is not the sync site, blocking on this lock is not a problem) and call urecovery_ResetState(). This difference in behavior is acceptable because the required invariant is that when 'ubik_amSyncSite' is cleared, it cannot become true again until urecovery_state has also cleared. This insures that once a server discovers it is not sync site, urecovery_AllBetter() cannot succeed on the basis of being sync site until sync is regained and recovery has run, insuring the database is up to date. Since only the beacon thread ever sets 'ubik_amSyncSite' true, that thread can meet the invariant by calling urecovery_ResetState() before doing anything else, while other threads must continue to hold the beacon lock to prevent the beacon thread from setting 'ubik_amSyncSite'. ubeacon_Debug() reports the value of syncSiteUntil, without benefit of obtaining any locks. It should also report the value of 'ubik_amSyncSite', allowing that global to be made private to the BEACON package. Full correctness requires that the beacon lock be held while accessing these globals. RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay This package is responsible for tracking whether the local database is current and usable and for discovering when a server comes back up after having been marked down. On the sync site, it is also responsible for tracking the database versions on remote servers and for recovering from inconsistencies. To carry out these tasks, the RECOVERY package maintains a single global, 'urecovery_state'. This is not currently effectively protected by any lock; however, it should be protected by the database lock. This is particularly important because high-level routines that manipulate the database depend on a check against this field to determine that the database is up-to-date and safe to modify, and that it will remain so until the operation is complete. The main work of this package is done by urecovery_Interact(), which is the body of a long-running thread known as the recovery thread. This thread wakes up periodically to discover if any down servers have come back up and, on the sync site, to locate the latest available database, fetch it if necessary, and make sure it has been distributed to all servers. This work is done in several steps, taken on in sequence every few seconds: + The first part of urecovery_Interact() is to identify servers which are believed to be down, and determine whether any of them have come back up. This is done every 30 seconds, even when this is not the sync site. Currently, the database lock is held for the duration of this loop, released only in DoProbe() while actually waiting for the DISK_Probe() RPC to complete. This is mostly unnecessary; the server probe loop needs this lock only to examine the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit, and DoProbe() doesn't need it at all. However, the beacon lock is needed to examine and update the database 'up' flag, and DoProbe() needs to hold the server address lock when examining server addresses and, if the server comes back up, to replace connections. See the section on struct ubik_server for more details on this lock. + The next step is to determine whether this is the sync site, and update the UBIK_RECSYNCSITE bit appropriately. The database lock should be held for the duration of this step, as ubeacon_AmSyncSite() may end up calling urecovery_ResetState(), which requires that lock be held. If this is not the sync site, the cycle ends here. + Step three is to determine the location of the latest database. This is done by making DISK_GetVersion() calls to all active servers, keeping track of the newest version found and the server which reported it. Once this is done, the best version is compared to the local version, update the UBIK_RECHAVEDB bit (so that it is true if and only if the local version is the latest) set the UBIK_RECFOUNDDB bit to indicate the latest database has been located, and clear the UBIK_RECSENTDB bit to force any servers with out-of-date databases to be updated. In this section, it is not necessary to hold any locks while collecting versions on remote servers, though as before, it is necessary to hold the beacon lock while examining the server 'up' flag, and the address lock to obtain a reference on each connection as it is used (but this lock should not be held during the RPC!). Once all remote versions have been fetched, the database lock is needed while examining the local version and updating 'urecovery_state'. Note that write transactions may happen while versions are collected, resulting in some servers having newer versions than were detected. However, if this happens, the result is always consistent and ends with the sync site having the latest, canonical version: + If the actual latest version X.N before the write was in an epoch created by the current sync site, then that version must necessarily already be present on the sync site, since writes are always committed first on the sync site. Thus, when the sync site updates the version counter, it produces a new version X.N+1 which no other server can believe it already has; thus, there is no possibility of inconsistency. + If the actual latest version X.N before the write was in an epoch not created by the sync site, then it is possible that after quorum was established but before any writes happened, a new server came online which had a newer version X.N+1 (that was committed to less than a quorum of sites before a crash). If a write transaction begins before recovery detects this fact and fetches the version X.N+1 database, then the new write will start with the version X.N database, essentially creating a version fork. However, when this happens, the resulting database will be version Y.1 (with Y>X), because the first write on a new sync site always results in a new database. Thus, the version X.N+1 database will no longer be latest (because the local version, which is examined last, is Y.1), and will end up being discarded. This is no worse than if the server with the version X.N+1 database had not come up until completely after the transaction resulting in version Y.1. + Step four is to fetch the latest database, if the sync site does not already have it. Since this step replaces the database wholesale, there can be no active transactions while it runs, and all other database activity must be locked out for the duration of the transfer. This means the database lock must be held for this entire step, starting from the check that the UBIK_RECSYNCSITE and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing comment which claims that it is impossible for UBIK_RECFOUNDDB not to be set at this point is true only if the database is not released since checking or updating that bit in step three). In addition, the address lock must be held while setting up the call to be used for DISK_GetFile(). + Step five, again after verifying that this is still the sync site and that the UBIK_RECHAVEDB bit is set, is to distribute the new database to any servers which do not already have it. Before this can be done, if the database was newly-initialized (with label epoch 1), it is relabeled with version 2.1 before it is distributed to other sites. This allows readany transactions to run on such a database (though probably not successfully, since the database is empty); it is deferred until this point in recovery to insure that if any server in the quorum contains a valid database, that version is used rather than the empty one. For this step, the database lock should be held starting at the point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are tested. In the event the DBWRITING flag is set, the lock must be released while waiting for the active write transaction to end. In this case, once the wait is completed, it is necessary to again check urecovery_state to determine that this is still the sync site and have an up-to-date database before proceeding. It is probably simplest for urecovery_Interact() to acquire the database lock at the start of step 2 described above, release it for the duration of the version-collection part of step 3 and while waiting for write transactions to terminate in step 5, and otherwise hold it until the end of the cycle. If no work is needed, this means the lock will be held relatively briefly; if it is necessary to fetch and or send full database versions, these operations themselves must be done with the lock held, and there is little point in releasing it between them. However, it is acceptable to release the lock between any steps described above, provided that each time the database lock is released and reacquired, 'urecovery_state' is checked to verify that no regression has occurred. Note that outside of the recovery thread, the only changes that may occur to 'urecovery_state' are that it may be completely cleared by urecovery_ResetState(), and that the UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site. The function urecovery_AllBetter() is called under the database lock by routines which need to know whether there is a usable database. As noted in the discussion of ubeacon_AmSyncSite() above, callers must then continue to hold the database lock until any database operation is complete. Further, callers which intend to write to the database currently follow a call to urecovery_AllBetter() with a call to ubeacon_AmSyncSite(), to determine whether a write is permissible. Unfortunately, this is not safe, because it is possible (though admittedly unlikely) for urecovery_AllBetter() to succeed on the basis of being a non-sync-site with an up-to-date database, and then for a subsequent call to ubeacon_AmSyncSite() to return a true result (because the latter may block on acquiring the beacon lock, during which time this may become the sync site). To insure that no write operation is performed without an up-to-date, writeable database, urecovery_AllBetter() should be modified to indicate whether a write operation is permitted, which is the case only when UBIK_RECHAVEDB is tested and found to be set. The functions urecovery_AbortAll(), urecovery_CheckTid(), and urecovery_ResetState() all expect to be called with the database lock held. The function urecovery_LostServer() need not be called with any locks; it is merely a wrapper for ubeacon_ReinitServer(). The RECOVERY package is also responsible at startup for replaying the transaction log and initializing an empty database. These tasks are handled by urecovery_Initialize() and its helpers, ReplayLog() and InitializeDB(). Because the helpers use PHYS package functions and manipulate database structure fields, urecovery_Initialize() should acquire the database lock before calling them. REMOTE - remote database I/O (DISK) RPCs This package provides the DISK RPC package, which mediates remote access to the local database copy, when another server is sync site. It also includes DISK_UpdateInterfaceAddr(), which is used during initialization to verify consistency of configuration across servers, and DISK_Probe(), which is used by the RECOVERY module to determine whether the server is up. As the functions in this package are RPC implementations, they may be called at any time and are entirely responsible for their own locking. There can be at most one active remote write transaction at a time; this rule is enforced by this package, which maintains the global 'ubik_currentTrans' as a pointer to the active transaction. For the most part, the RPCs in this module use a common prologue, which includes acquiring the database lock, and do not release that lock until just before returning. However, the common prologue examines both 'ubik_currentTrans' and the transaction flags without benefit of the database lock. This can be corrected fairly easily by acquiring the database lock sooner; however, this requires referring to the database via the ubik_dbase global rather than via the 'dbase' pointer in 'ubik_currentTrans'. Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity check to ensure they are not running on the sync site, since these are called only from that portion of the recovery loop which runs only on the sync site. These calls to ubeacon_AmSyncSite() must be made with the database lock held, for the results to be valid. The same holds true for a commented-out check in SDISK_GetFile() (which really should remain that way -- there's no reason not to allow any authorized caller to do this at any time). Additionally, note that SDISK_Commit() acquires a write lock on the application cache, to insure that no transactions are running which depend on the application cache, page cache, or underlying database to change. The locking order requires that this lock be acquired before taking the database lock; thus, both locks must be acquired earlier. SDISK_SendFile() is really quite bad in its handling of the database lock under error conditions. Presently, an error condition may cause it to release the lock too early nor not at all. SDISK_UpdateInterfaceAddr() is called by peer servers as part of their startup process, to trade lists of additional addresses and insure sufficient agreement on configuration to allow the new server to start. It will need to acquire the lock protecting the addr[] array attached to each host. UBIK (high-level API) This package contains the high-level APIs exported by Ubik to applications, including initialization, the APIs for creating and using transactions, and various utilities. It also contains a number of internal utilities. Initialization Initialization of Ubik is handled in ubik_ServerInitCommon(), which is called by two API functions, ubik_ServerInit() and ubik_ServerInitByInfo(), which provide different interfaces for passing configuration information. This code handles initialization of the database, locks, and various globals, calls the initialization routines for each subsystem, makes sure Rx has been initialized, and creates Ubik's Rx services and threads. Most of this is done before any Ubik-specific threads are created and before Ubik's RPCs can be called, and so takes place without holding any lock. However, since Rx may have been initialized and started before Ubik, it is possible that some Rx worker threads may have already been created. In practice, locks internal to Rx should insure that any memory writes made before an Rx service is created become visible to any worker running on behalf of that service; nonetheless, this should not be depended on, and proper locking should be used during all phases of initialization. Unfortunately, the current code actually starts Rx services before initialization is complete, resulting in a number of possible races. The general THREAD SAFETY section below contains recommendations for reordering the startup process to eliminate this problem, while still avoiding potential deadlock if multiple servers start at once. ContactQuorum_* Nearly a third of ubik.c consists of various ContactQuorum_* functions, which are used to make a particular RPC to every server in the quorum. These used to be a single ContactQuorum() function, but have been split into multiple functions to regain type-safety. These retain a common prologue and epilogue, some small part of which has been abstracted out, but most of which remains duplicated in each of these functions. Indeed, the only part unique to each function is the actual RPC call and, in the case of ContactQuorum_DISK_WriteV(), code to fall back to multiple calls to DISK_Write() on a per-server basis. The code common to all ContactQuorum_* functions will need to acquire the beacon lock when checking whether a server is up (no calls are made to servers which are not marked up), and again when marking a server down. Note that when a server is marked down, the 'up' and 'beaconSinceDown' flags must be cleared at the same time, without releasing the beacon lock. In addition, the helper function Quorum_StartIO() must obtain the server address lock while it obtains a reference on the Rx connection that will be used. All of the ContactQuorum_* functions must be called with the database lock held; this is necessary to protect their access to each server's 'version' and 'currentDB' fields and to the local database version. However, these functions release the lock while actually making RPCs. This behavior is new -- ContactQuorum() in OpenAFS 1.4 did not release the database lock during RPCs -- but it is safe, because no caller of ContactQuorum_*() depends on state read under the database lock before the call to remain valid after. Transactions The main Ubik transaction API consists of ubik_BeginTrans(), ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(), ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(), ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and ubik_SetLock(). The ubik_Begin* functions require a database pointer; all others require an active transaction pointer. These functions are called directly by the application without any locks, and are responsible for their own locking. Note that many of these examine the transaction type before acquiring any locks; this is permissible because the transaction type is immutable once the transaction is created (but it requires that the call be made from the thread that created the transaction, or that the application have used some mechanism to insure an appropriate memory barrier exists; the same mechanism used to protect the transaction pointer should suffice). BeginTrans() contains the common code responsible for starting a new transaction; this is made available to applications as ubik_BeginTrans(), ubik_BeginTransReadAny(), and ubik_BeginTransReadAnyWrite(). If a write transaction is in progress, this function waits for it to complete. However, since doing so releases the database lock, it is necessary to call urecovery_AllBetter() again once the lock has been reacquired. Further, in the UBIK_PAUSE case, the DBVOTING flag may have been set, so this also must be retested (see the BEACON package section for more on the difficulties of UBIK_PAUSE and DBVOTING). Both ubik_Flush() and ubik_Write() examine and manipulate the transaction I/O vector without benefit of any lock. These fields ('iovec_info' and 'iovec_data') should be protected by the database lock. In the case of ubik_Write(), this may require releasing the lock before calling ubik_Flush(), or the introduction of a private ubik_Flush_r() which assumes the lock is already held (but note that in either case, the lock is released, because ubik_Flush() calls ContactQuorum_* functions). Utilities The functions ubik_GetVersion() and ubik_WaitVersion() provide the application with a way to discover the current database version and to wait for it to change. These interfaces are not currently used in OpenAFS. ubik_GetVersion() needs to acquire the database lock while copying the database version. The internal function ubikGetPrimaryInterfaceAddr() is used by Ubik RPCs to determine a peer server's primary address, given the IP address from which a call arrived. This needs to hold the server address lock, as described in the section on struct ubik_server. Application Cache The section on struct ubik_dbase describes the operation of the application cache. The function ubik_CheckCache() is provided to allow an application to insure the cache is up to date and obtain a read lock on the cache which lasts for the duration of the transaction. Note that ubik_AbortTrans() currently always invalidates the application cache by clearing ubik_dbase->cachedVersion, for which it requires a write lock on the cache_lock. This means that aborted transactions block waiting for completion of any transactions which hold the cache_lock, which may well include all outstanding transactions. In the case of read transactions, which cannot have modified the database, this is unnecessary and may significantly reduce concurrency. THREAD SAFETY This section discusses changes needed to insure thread safety. The initialization sequence in ubik_ServerInitCommon() should be cleaned up, to ensure that every subsystem is fully initialized before starting any background threads or accepting RPCs. The correct initialization sequence should look something like this: + Initialize error tables + Allocate and initialize the database structure + Initialize Rx + Initialize the various subsystems: + udisk_Init() (formerly disk.c:DInit) + ulock_Init() (new, pulled out of ulock_getLock) + uvote_Init() + urecovery_Initialize() + ubeacon_InitServerList* + Create Rx services + Start an Rx server thread + updateUbikNetworkAddress() (rename and export from beacon.c) + Start the beacon and recovery threads Initialization functions may need to be added for some subsystems which do not currently have them: + src/ubik/disk.c:DInit() should be exported and renamed udisk_Init(); it can then be called from ubik_ServerInitCommon() instead of from udisk_begin(). + The lock initialization currently done in ulock_getLock() should instead be done in a separate ulock_Init(), which should be called from ubik_ServerInitCommon() A new lock is needed to protect state belonging to the VOTE package, including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent the sync site's database state. See the VOTE package section for details on the use of the vote lock. A new lock is needed to protect state belonging to the BEACON package, including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several fields in the ubik_server structure. This lock also protects the per-server 'up' flag, which is shared among several modules. In addition, some beacon-specific fields are accessed directly by other modules. Thus, this lock must be visible outside the BEACON module, or else accessors must be provided for these items. See the BEACON package section for details on the use of the beacon lock. A new lock is needed to provide additional protection for the database version, transaction counter, and flags, so that these can safely be examined (but not updated) by the beacon thread without the need to acquire the database lock. See the BEACON package section for details on the use of the version lock. A new lock is needed to protect per-server lists of non-primary addresses, connections to the VOTE and DISK services on each server, and the client-mode security class objects used to set up these connections. See the section on struct ubik_server for details on the use of the server address lock. The required locking order is described by the following list. Any of these locks may be acquired singly; when acquiring multiple locks, they should be acquired in the listed order: + application cache lock (dbase->cache_lock) + database lock (DBHOLD/DBRELE) + beacon lock + vote lock + version lock + server address lock Some cleanup is required in various places where existing locking conventions are not properly obeyed: + SVOTE_Beacon() needs to hold the database lock when calling urecovery_CheckTid() + Management of the database lock in SDISK_SendFile() needs to be cleaned up so that it is always released at the proper time. + urecovery_Interact() is not consistent about holding the database lock when it examines and updates 'urecovery_state'. It also examines the local database version at least once without benefit of this lock. See the RECOVERY package section for details on which locks are needed when by urecovery_Interact(). + ubik_Flush() and ubik_Write() need to acquire the database lock before accessing the 'iovec_info' and 'iovec_data' fields. + ubik_GetVersion() needs to acquire the database lock before copying the database version. The various debugging RPCs in the VOTE packages, and the data collection routines they call in other packages, generally operate without benefit of any locks. This is probably desirable, as it avoids any possibility of debugging operations blocking on or interfering with the rest of the system, and allows debugging data to be collected even in the event of some unforeseen deadlock. It is also "safe", in that it does not risk damage to data structures or access to uninitialized memory, provided that data structure initialization is complete before these functions are called. However, it does mean that these RPCs may return data that is not entirely self-consistent. + Correctness requires various locks be held during parts of SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section. + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of the database lock. This usage is not "safe", as that transaction may be freed by another thread while SVOTE_Debug() is examining it. + SVOTE_XSDebug() should hold the server address lock while copying out server addresses, and other locks while copying per-server state. + udisk_Debug() should hold the database lock while examining the database version and collecting page cache statistics. + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the internals of struct Lock. Ideally, knowledge of these internals should be abstracted to src/lwp/lock.h. + ubeacon_Debug() should hold the beacon lock while accessing the value of 'syncSiteUntil' and, if the access is moved here from SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'. CONCURRENCY THOUGHTS This section collections information discussed previously about changes that may be required or desirable in order to improve concurrency. Note that none of these changes are needed to insure thread safety, provided the changes discussed in the previous section are made. + PHYS could be made more concurrent by maintaining a separate lock for the fdcache, avoiding the need to DBHOLD while calling these. However, higher-layer consistency probably requires that certain calls or groups of calls to this package be atomic and isolated. + Could DISK be made more concurrent by maintaining a separate lock for the page cache and or free truncation record list? + Could concurrency be improved by maintaining separate per-transaction locks? If so, what is the locking hierarchy with respect to database locks and the DISK and PHYS package locks? + Could concurrency be improved by maintaining more reader/writer state and/or separate write locks to insure consistency of database file access without requiring so much DBHOLD? + Alternately, could concurrency be improved by requiring that an active transaction be accessed only by the thread that created it? MULTI-DATABASE SUPPORT There isn't any. Some of the Ubik interfaces and even the organization of some internal data structures makes it appear as though multiple databases can be supported. However, this is not the case, for several reasons: + Only a single set of peer servers is tracked, via a global linked list of server structures. Each server structure includes state such as the server's database version and whether it is up-to-date, so it is not possible to track multiple databases even for a common set of servers. + The physical I/O package (phys.c) tracks cached open file descriptors without regard to what database a file is part of; thus, it cannot handle accessing multiple databases at the same time. + There are a variety of globals containing state which needs to be tracked on a per-database basis. + There are a variety of globals which are effectively protected by a per-database lock, or by no lock at all. + The various RPCs used to communicate between servers assume a single database and do not include a database identifier. Thus, in the presence of multiple databases, it would be impossible to determine, for example, for which database a DISK RPC was intended. The upshot of this is that considerable work would be needed to get Ubik into a condition which would allow a single process to maintain multiple independent Ubik databases at the same time. Note that this is orthogonal to the question of whether a single multi-file database is possible. OTHER DATA STRUCTURES Some other data structures are mentioned in ubik.p.h or elsewhere, but are of relatively little interest from the point of view of threading. They are described here primarily as an alternative to deleting descriptive text that was already written. struct ubik_hdr - on-disk database file header This structure represents the on-disk format of the Ubik database file header (label). It is used only for local variables in uphys_getlabel() and uphys_setlabel(), which read and write this header, respectively. struct ubik_stat - database file status This structure is used to convey the size and modification timestamp of a database file. It is used as the type of an output parameter of uphys_stat(), and for local variables in callers of that function. struct ubik_stats - global statistics This structure is used to contain "miscellaneous" global statistics. Currently that consists of a single counter which is incremented in one place (an exceptional condition in ubik_EndTrans) and is never read. The one increment is performed only when holding the database lock; thus, this structure may be treated the same as other globals protected by non-global locks. OTHER NOTES There appears to be nothing to insure that calls to disk.c:DRead() during a write transaction won't find and use a "clean" cached page when there is already a dirty page in the cache resulting from an earlier write as part of the same transaction. This means that if a transaction reads a page after writing it, it may read back the original data instead of the new, and if a transaction writes a page more than once, the later writes may corrupt the database. It seems likely that we are merely lucky in this regard, and this problem should be fixed. src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock' is in fact an AFS configuration directory pointer, suitable as an argument to afsconf_UpToDate(). This is inappropriate; 'ubik_CRXSecurityRock' is an opaque argument to (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it is. It may be worth examining whether it is appropriate for SVOTE_Beacon() to call urecovery_CheckTid() only when voting yes, and not whenever a beacon is received from a server claiming to be sync site. Recently, code has been introduced which attempts to insure that the recovery process does not truncate an old database file before a new one has been fully transferred, since this could, depending on the timing of a server failure, deprive the new quorum of a valid database. This so-called "new recovery" code violates the abstraction provided by the PHYS package, encoding specific knowledge of the underlying file store into the REMOTE and RECOVERY packages. This means that the backend provided by phys.c cannot be replaced with an alternative, since in that case recovery would do the wrong thing with the transferred database. The DBWRITING loop in urecovery_Interact() is not portable, as it depends the contents of the timeout after a call to select(2). This was fine for IOMGR_Select(), which has defined behavior, but not for select(2).