mirror of
https://git.openafs.org/openafs.git
synced 2025-01-22 17:00:15 +00:00
3cffaee790
In DAFS, replace uses of the VLockPartition_r partition-level locks with the appropriate VLockVolume*NB volume-level locks (and sometimes FSYNC_VerifyCheckout). This allows for greater parallelization of volserver attachment / volume creation, for volume operations to occur during salvages, and for multiple salvages on a single partition to occur simultaneously. More architectural details of volume-level locks can be found in the changes to doc/arch/dafs-overview.txt. Change-Id: I4e8ef4c864002d7e7c976691824c53dfa9cfaf91 Reviewed-on: http://gerrit.openafs.org/1406 Tested-by: Andrew Deason <adeason@sinenomine.net> Reviewed-by: Derrick Brashear <shadow@dementia.org> Tested-by: Derrick Brashear <shadow@dementia.org>
397 lines
19 KiB
Plaintext
The Demand-Attach FileServer (DAFS) has changed how many things on AFS
fileservers behave. The most sweeping changes are probably in the
volume package, but significant changes have also been made in the SYNC
protocol, the vnode package, salvaging, and a few miscellaneous bits in
the various fileserver processes.

This document serves as an overview for developers on how to deal with
these changes, and how to use the new mechanisms. For more specific
details, consult the relevant doxygen documentation, the code comments,
and/or the code itself.

- The salvageserver

The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in
DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and
salvages a volume group by fork()ing a child, and running the normal
salvager code (it enters vol-salvage.c by calling SalvageFileSys1).

Salvages that are initiated from a request to the salvageserver (called
'demand-salvages') occur automatically; whenever the fileserver (or
other tool) discovers that a volume needs salvaging, it will schedule a
salvage on the salvageserver without any intervention needed.

When scheduling a salvage, the vol id should be the id for the volume
group (the RW vol id). If the salvaging child discovers that it was
given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK
command, and will exit. This will tell the salvageserver that whenever
it receives a salvage request for that vol id, it should schedule a
salvage for the corresponding RW id instead.
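
The effect of a LINK command can be pictured with a small lookup table.
The following is only a toy sketch: every name prefixed toy_ is
hypothetical, and the real salvageserver keeps this information in its
own data structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 16

/* Hypothetical link record: a non-RW volume id and its RW parent. */
struct toy_link {
    unsigned int child_id;
    unsigned int rw_id;
};

struct toy_link toy_links[TOY_MAX_LINKS];
size_t toy_nlinks;

/* Record what a LINK-style message tells us.
 * Returns 0 on success, -1 if the table is full. */
int toy_link(unsigned int child_id, unsigned int rw_id)
{
    if (toy_nlinks >= TOY_MAX_LINKS)
        return -1;
    toy_links[toy_nlinks].child_id = child_id;
    toy_links[toy_nlinks].rw_id = rw_id;
    toy_nlinks++;
    return 0;
}

/* Resolve the id a salvage request should actually be scheduled for. */
unsigned int toy_salvage_target(unsigned int vol_id)
{
    size_t i;

    for (i = 0; i < toy_nlinks; i++) {
        if (toy_links[i].child_id == vol_id)
            return toy_links[i].rw_id;
    }
    return vol_id;  /* no link known: treat it as the RW id */
}
```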

- FSSYNC/SALVSYNC

The FSSYNC and SALVSYNC protocols are the protocols used for
interprocess communication between the various fileserver processes.
FSSYNC is used for querying the fileserver for volume metadata,
'checking out' volumes from the fileserver, and a few other things.
SALVSYNC is used to schedule and query salvages in the salvageserver.

FSSYNC existed prior to DAFS, but it encompasses a much larger set of
commands with the advent of DAFS. SALVSYNC is entirely new to DAFS.

-- SYNC

FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC.
SYNC isn't much of a protocol in itself; it just handles some
boilerplate for the messages passed back and forth, and some error
codes common to both FSSYNC and SALVSYNC.

SYNC is layered on top of TCP/IP, though we only use it to communicate
with the local host (usually via a unix domain socket). It does not
handle anything like authentication, authorization, or even things like
serialization. Although it uses network primitives for communication,
it's only useful for communication between processes on the same
machine, and that is all we use it for.

SYNC calls are basically RPCs, but very simple. The calls are always
synchronous, and each SYNC server can only handle one request at a time.
Thus, it is important for SYNC server handlers to return as quickly as
possible; hitting the network or disk to service a SYNC request should
be avoided as far as possible.

SYNC-related source files are src/vol/daemon_com.c and
src/vol/daemon_com.h.

-- FSSYNC

--- server

The FSSYNC server runs in the fileserver; source is in
src/vol/fssync-server.c.

As mentioned above, FSSYNC handlers should finish quickly when
servicing a request, so hitting the network or disk should be avoided.
In particular, you absolutely cannot make a SALVSYNC call inside an
FSSYNC handler; the SALVSYNC client wrapper routines actively prevent
this from happening, so even if you try to do such a thing, you will not
be allowed to. This prohibition is to prevent deadlock, since the
salvageserver could have made the FSSYNC request that you are servicing.

When a client makes a FSYNC_VOL_OFF or NEEDVOLUME request, the
fileserver offlines the volume if necessary, and keeps track that the
volume has been 'checked out'. A volume is left online if the checkout
mode indicates the volume cannot change (see VVolOpLeaveOnline_r).

Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or
DONE commands, no other program can check out the volume.

Other FSSYNC commands include abilities to query volume metadata and
stats, to force volumes to be attached or offline, and to update the
volume group cache. See doc/arch/fssync.txt for documentation on the
individual FSSYNC commands.
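
The check-out / check-in bookkeeping described above can be modelled
with a small table. This is a toy sketch only: the toy_ names and the
fixed-size table are assumptions made for illustration, and the real
fileserver tracks the pending operation on the volume itself.

```c
#include <assert.h>

#define TOY_MAX_VOLS 8

/* Hypothetical record of which program has a volume checked out. */
struct toy_checkout {
    unsigned int vol_id;
    int in_use;        /* nonzero while the volume is checked out */
    int program_type;  /* caller-defined tag for the client program */
};

struct toy_checkout toy_table[TOY_MAX_VOLS];

/* Try to check out vol_id. Returns 0 on success, or -1 if another
 * program already holds the volume or the table is full. */
int toy_checkout_vol(unsigned int vol_id, int program_type)
{
    int i, free_slot = -1;

    for (i = 0; i < TOY_MAX_VOLS; i++) {
        if (toy_table[i].in_use && toy_table[i].vol_id == vol_id)
            return -1;
        if (!toy_table[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;
    toy_table[free_slot].vol_id = vol_id;
    toy_table[free_slot].in_use = 1;
    toy_table[free_slot].program_type = program_type;
    return 0;
}

/* Check the volume back in. Returns 0 on success, or -1 if the volume
 * was not checked out. */
int toy_checkin_vol(unsigned int vol_id)
{
    int i;

    for (i = 0; i < TOY_MAX_VOLS; i++) {
        if (toy_table[i].in_use && toy_table[i].vol_id == vol_id) {
            toy_table[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}
```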

--- clients

FSSYNC clients are generally any OpenAFS process that runs on a
fileserver and tries to access volumes directly. The volserver,
salvageserver, and bosserver all qualify, as do (sometimes) some
utilities like vol-info or vol-bless. For issuing FSSYNC commands
directly, there is the debugging tool fssync-debug. FSSYNC client code
is in src/vol/fssync-client.c, but it's not very interesting.

Any program that wishes to directly access a volume on disk must check
out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the
volume doesn't change while the program is using it. If the program
determines that the volume is somehow inconsistent and should be
salvaged, it should send the FSSYNC command FORCE_ERROR with reason code
FSYNC_SALVAGE to the fileserver, which will take care of salvaging it.

-- SALVSYNC

The SALVSYNC server runs in the salvageserver; code is in
src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the
salvageserver run with the -client switch, and the salvageserver worker
children. If any other process notices that a volume needs salvaging, it
should issue a FORCE_ERROR FSSYNC command to the fileserver with the
FSYNC_SALVAGE reason code.

The SALVSYNC protocol is simpler than the FSSYNC protocol. The commands
are basically just to create, cancel, change, and query salvages. The
RAISEPRIO command increases the priority of a salvage job that hasn't
started yet, so volumes that are accessed more frequently will get
salvaged first. The LINK command is used by the salvageserver worker
children to inform the salvageserver parent that it tried to salvage a
readonly volume for which a read-write parent exists (in which case we
should just schedule a salvage for the parent read-write volume).

Note that canceling a salvage only applies to salvages that haven't run
yet; it takes a salvage job off of a queue, but it doesn't stop a
salvageserver worker child in the middle of a salvage.

- The volume package

-- refcounts

Before DAFS, the Volume struct just had one reference count, vp->nUsers.
With DAFS, we now have the notion of an internal/lightweight reference
count, and an external/heavyweight reference count. Lightweight refs are
acquired with VCreateReservation_r, and released with
VCancelReservation_r. Heavyweight refs are acquired as before, normally
with a GetVolume or AttachVolume variant, and released with VPutVolume.

Lightweight references are only acquired within the volume package; a vp
should not be given to e.g. the fileserver code with an extra
lightweight ref. A heavyweight ref is generally acquired for a vp that
will be given to some non-volume-package code; acquiring a heavyweight
ref guarantees that the volume header has been loaded.

Acquiring a lightweight ref just guarantees that the volume will not go
away or suddenly become unavailable after dropping VOL_LOCK. Certain
operations like detachment or scheduling a salvage only occur when all
of the heavy and lightweight refs go away; see VCancelReservation_r.

-- state machine

Instead of having a per-volume lock, each vp always has an associated
'state', that says what, if anything, is occurring to a volume at any
particular time; or if the volume is attached, offline, etc. To do the
basic equivalent of a lock -- that is, ensure that nobody else will
change the volume when we drop VOL_LOCK -- you can put the volume in
what is called an 'exclusive' state (see VIsExclusiveState).

When a volume is in an exclusive state, no thread should modify the
volume (or expect the vp data to stay the same), except the thread that
put it in that state. Whenever you manipulate a volume, you should make
sure it is not in an exclusive state; first call VCreateReservation_r to
make sure the volume doesn't go away, and then call
VWaitExclusiveState_r. When that returns, you are guaranteed to have a
vp that is in a non-exclusive state, and so can be manipulated. Call
VCancelReservation_r when done with it, to indicate you don't need it
anymore.

Look at the definition of the VolState enumeration to see all volume
states, and a brief explanation of them.
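
The reserve / wait / manipulate / release pattern from this section can
be sketched with a toy model. Everything prefixed toy_ below is a
hypothetical stand-in (a two-state enum, one global mutex playing the
role of VOL_LOCK, one condition variable per volume); it is not the
real volume package code, which has many more states and its own
bookkeeping.

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-ins; none of these are the real OpenAFS definitions. */
enum toy_state { TOY_ATTACHED, TOY_UPDATING /* the exclusive state */ };

struct toy_vol {
    enum toy_state state;
    int reservations;         /* lightweight refs */
    pthread_cond_t state_cv;  /* signalled on every state change */
};

/* Plays the role of VOL_LOCK in this sketch. */
pthread_mutex_t toy_vol_lock = PTHREAD_MUTEX_INITIALIZER;

int toy_is_exclusive(enum toy_state s)
{
    return s == TOY_UPDATING;
}

/* All toy_*_r functions must be called with toy_vol_lock held. */
void toy_create_reservation_r(struct toy_vol *vp)
{
    vp->reservations++;
}

void toy_cancel_reservation_r(struct toy_vol *vp)
{
    assert(vp->reservations > 0);
    vp->reservations--;
}

void toy_change_state_r(struct toy_vol *vp, enum toy_state s)
{
    vp->state = s;
    pthread_cond_broadcast(&vp->state_cv);
}

/* Sleep (releasing toy_vol_lock while asleep) until vp is no longer in
 * an exclusive state. */
void toy_wait_exclusive_state_r(struct toy_vol *vp)
{
    while (toy_is_exclusive(vp->state))
        pthread_cond_wait(&vp->state_cv, &toy_vol_lock);
}

/* The pattern from the text: reserve, wait, manipulate, release.
 * Returns 1 if the volume was in a non-exclusive state when touched. */
int toy_touch_volume(struct toy_vol *vp)
{
    int ok;

    pthread_mutex_lock(&toy_vol_lock);
    toy_create_reservation_r(vp);       /* vp cannot go away now */
    toy_wait_exclusive_state_r(vp);
    ok = !toy_is_exclusive(vp->state);  /* safe to manipulate vp here */
    toy_cancel_reservation_r(vp);
    pthread_mutex_unlock(&toy_vol_lock);
    return ok;
}
```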

-- VLRU

See: Most functions with VLRU in their name in src/vol/volume.c.

The VLRU is what dictates when volumes are detached after a certain
amount of inactivity. The design is pretty much a generational garbage
collection mechanism. There are 5 VLRU queues that a volume can be on
(VLRUQueueName in volume.h). 'Candidate' volumes haven't seen
activity in a while, and so are candidates to be detached. 'New' volumes
have seen activity only recently; 'mid' volumes have seen activity for
a while, and 'old' volumes have seen activity for a long while. 'Held'
volumes cannot be soft detached at all.

Volumes are moved from new->mid->old if they have had activity recently,
and are moved from old->mid->new->candidate if they have not had any
activity recently. The definition of 'recently' is configurable by the
-vlruthresh fileserver parameter; see VLRU_ComputeConstants for how the
thresholds are determined. Volumes start at 'new' on attachment, and if
any activity occurs when a volume is on 'candidate', it's moved to 'new'
immediately.

Volumes are generally promoted/demoted and soft-detached by
VLRU_ScannerThread, which runs every so often and moves volumes between
VLRU queues depending on their last access time and the various
thresholds (or soft-detaches them, in the case of the 'candidate'
queue). Soft-detaching just means the volume is taken offline and put
into the preattached state.
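
The promotion and demotion rules can be condensed into one step
function. This is a deliberately simplified sketch: the toy_ names are
hypothetical, and a single threshold stands in for the several
per-queue constants that the real code derives in
VLRU_ComputeConstants.

```c
#include <assert.h>

/* Toy queue names; the real ones are in VLRUQueueName in volume.h. */
enum toy_queue { TOY_CANDIDATE, TOY_NEW, TOY_MID, TOY_OLD, TOY_HELD };

/* One scanner pass over a single volume. idle is the time since the
 * last access; thresh is a single hypothetical 'recently' threshold.
 * Returns the queue the volume should be on afterwards. */
enum toy_queue toy_scan_step(enum toy_queue q, long idle, long thresh)
{
    int active = idle < thresh;

    switch (q) {
    case TOY_HELD:
        return TOY_HELD;  /* held volumes are left alone */
    case TOY_CANDIDATE:
        return active ? TOY_NEW : TOY_CANDIDATE;
    case TOY_NEW:
        return active ? TOY_MID : TOY_CANDIDATE;
    case TOY_MID:
        return active ? TOY_OLD : TOY_NEW;
    case TOY_OLD:
        return active ? TOY_OLD : TOY_MID;
    }
    return q;
}

/* Whether the scanner would soft-detach the volume on this pass:
 * only idle volumes on the candidate queue qualify. */
int toy_should_soft_detach(enum toy_queue q, long idle, long thresh)
{
    return q == TOY_CANDIDATE && idle >= thresh;
}
```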

--- DONT_SALVAGE

The dontSalvage flag in volume headers can be set to DONT_SALVAGE to
indicate that a volume probably doesn't need to be salvaged. Before
DAFS, volumes were placed on an 'UpdateList' which was periodically
scanned, and dontSalvage was set on volumes that hadn't been touched in
a while.

With DAFS and the VLRU additions, setting dontSalvage now happens when a
volume is demoted a VLRU generation, and no separate list is kept. So if
a volume has been idle enough to demote, and it hasn't been accessed in
SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU
scanner.

-- Vnode

Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h

The changes to the vnode package are largely similar to those in the
volume package. A Vnode is put into specific states, some of which
are exclusive and act like locks (see VnChangeState_r,
VnIsExclusiveState). Vnodes also have refcounts, incremented and
decremented with VnCreateReservation_r and VnCancelReservation_r like
you would expect. I/O should be done outside of any global locks; just
the vnode is 'locked' by being put in an exclusive state if necessary.

In addition to a state, vnodes also have a count of readers. When a
caller gets a vnode with a read lock, we of course must wait for the
vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number
of readers is incremented (VnBeginRead_r), but the vnode is kept in a
non-exclusive state (VN_STATE_READ).

When a caller gets a vnode with a write lock, we must wait not only for
the vnode to be in a nonexclusive state, but also for there to be no
readers (VnWaitQuiescent_r), so we can actually change it.

VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS
in VnLock is set vnp->writer to the current thread id for a write lock,
for some consistency checks later (read locks are actually no-ops).
Actual mutual exclusion in DAFS is done by the vnode state machine and
the reader count.
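
How the state and the reader count combine can be shown with a
single-threaded toy model. The toy_ names are hypothetical, and where
the real code waits (VnWaitExclusive_r, VnWaitQuiescent_r) this sketch
simply reports that the caller would have had to wait.

```c
#include <assert.h>

/* Toy vnode states; only TOY_VN_EXCLUSIVE is an exclusive state. */
enum toy_vn_state { TOY_VN_ONLINE, TOY_VN_READ, TOY_VN_EXCLUSIVE };

struct toy_vnode {
    enum toy_vn_state state;
    int readers;
};

/* Returns 1 if a reader may proceed now, or 0 if it would have to
 * wait for an exclusive state to end. */
int toy_begin_read(struct toy_vnode *vnp)
{
    if (vnp->state == TOY_VN_EXCLUSIVE)
        return 0;
    vnp->readers++;
    vnp->state = TOY_VN_READ;  /* still a non-exclusive state */
    return 1;
}

void toy_end_read(struct toy_vnode *vnp)
{
    assert(vnp->readers > 0);
    if (--vnp->readers == 0)
        vnp->state = TOY_VN_ONLINE;
}

/* Returns 1 if a writer may proceed now. A writer needs the vnode to
 * be quiescent: in a non-exclusive state and with no readers. */
int toy_begin_write(struct toy_vnode *vnp)
{
    if (vnp->state == TOY_VN_EXCLUSIVE || vnp->readers > 0)
        return 0;
    vnp->state = TOY_VN_EXCLUSIVE;
    return 1;
}

void toy_end_write(struct toy_vnode *vnp)
{
    assert(vnp->state == TOY_VN_EXCLUSIVE);
    vnp->state = TOY_VN_ONLINE;
}
```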

- viced state serialization

See src/tviced/serialize_state.* and ShutDownAndCore in
src/viced/viced.c

Before DAFS, whenever a fileserver restarted, it lost all information
about all clients, what callbacks they had, etc. So when a client with
existing callbacks contacted the fileserver, all callback information
needed to be reset, potentially causing a bunch of unnecessary traffic.
And of course, if the client did not contact the fileserver again, it
would never be sent callbacks that it should have been sent.

DAFS now has the ability to save the host and CB data to a file on
shutdown, and restore it when it starts up again. So when a fileserver
is restarted, the host and CB information should be effectively the same
as when it shut down, and a client may not even know that a fileserver
was restarted.

Getting this state information can be a little difficult, since the host
package data structures aren't necessarily always consistent, even after
H_LOCK is dropped. What we attempt to do is stop all of the background
threads early in the shutdown process (set fs_state.mode =
FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be
marked as 'tranquil'; see the fs_state struct) later on, before trying
to save state. This makes it a lot less likely for anything to be
modifying the host or CB structures by the time we try to save them.

- volume group cache

See: src/vol/vg_cache* and src/vol/vg_scan.c

The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC,
whenever the salvager code salvaged an individual volume, it would need
to read all of the volume headers on the partition to learn which
volumes are in the volume group it is salvaging, and therefore which
volumes to tell the fileserver to take offline. With demand-salvages,
this can make salvaging take a very long time, since reading in all
volume headers can take much more time than actually salvaging a single
volume group.

To prevent the need to scan the partition volume headers every single
time, the fileserver maintains a cache of which volumes are in what
volume groups. The cache is populated by scanning a partition's volume
headers; the scan is started in the background upon receiving the first
salvage request for a partition (VVGCache_scanStart_r,
_VVGC_scan_start).

After the VGC is populated, it is kept up to date with volumes being
created and deleted via the FSSYNC VG_ADD and VG_DEL
commands. These are called every time a volume header is created,
removed, or changed when using the volume header wrappers in vutil.c
(VCreateVolumeDiskHeader, VDestroyVolumeDiskHeader,
VWriteVolumeDiskHeader). These wrappers should always be used to
create/remove/modify vol headers, to ensure that the necessary FSSYNC
commands are called.

-- race prevention

In order to prevent races between volume changes and VGC partition scans
(that is, someone scans a header while it is being written and not yet
valid), updates to the VGC involving adding or modifying volume headers
should always be done under the 'partition header lock'. This is a
per-partition lock to conceptually lock the set of volume headers on
that partition. It is only read-held when something is writing to a
volume header, and it is write-held for something that is scanning the
partition for volume headers (the VGC or partition salvager). This is a
little counterintuitive, but it is what we want. We want multiple
headers to be written to at once, but if we are the VGC scanner, we want
to ensure nobody else is writing when we look at a header file.

Because the race described above is so rare, vol header scanners don't
actually hold the lock unless a problem is detected. So, what they do is
read a particular volume header without any lock, and if there is a
problem with it, they grab a write lock on the partition vol headers,
and try again. If it still has a problem, the header is just faulty; if
it's okay, then we avoided the race.

Note that destroying vol headers does not require any locks, since
unlink()s are atomic and don't cause any races for us here.
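
The inverted use of the partition header lock, and the optimistic
scan-then-retry, can be sketched as follows. This is a toy model that
uses a POSIX rwlock in place of the real partition header lock; all
toy_ names are hypothetical, and toy_flaky_reader only simulates a read
that races with a writer once.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

/* Toy 'partition header lock'; not the real OpenAFS lock object. */
pthread_rwlock_t toy_hdr_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Caller-supplied header operation: returns 0 on success, nonzero if
 * the header looked invalid (or could not be written). */
typedef int (*toy_hdr_fn)(void *rock);

/* Header writers hold the lock shared, so many can write at once. */
int toy_write_header(toy_hdr_fn do_write, void *rock)
{
    int code;

    pthread_rwlock_rdlock(&toy_hdr_lock);
    code = do_write(rock);
    pthread_rwlock_unlock(&toy_hdr_lock);
    return code;
}

/* Scanners read optimistically with no lock, and take the lock
 * exclusively only to retry when a header looks bad. */
int toy_scan_header(toy_hdr_fn read_hdr, void *rock)
{
    int code = read_hdr(rock);

    if (code == 0)
        return 0;  /* common case: no lock was needed */

    pthread_rwlock_wrlock(&toy_hdr_lock);  /* excludes all writers */
    code = read_hdr(rock);
    pthread_rwlock_unlock(&toy_hdr_lock);
    return code;   /* nonzero now means the header really is faulty */
}

/* Simulated reader for demonstration: fails on its first call (as if
 * it raced with a writer) and succeeds on every later call. rock must
 * point to an int call counter. */
int toy_flaky_reader(void *rock)
{
    int *calls = rock;

    return (*calls)++ == 0 ? -1 : 0;
}
```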

- partition and volume locking

Previously, whenever the volserver would attach a volume or the salvager
would salvage anything, the partition would be locked
(VLockPartition_r). This unnecessarily serializes part of most volserver
operations. It also makes it so only one salvage can run on a partition
at a time, and that a volserver operation cannot occur at the same time
as a salvage. With the addition of the VGC (previous section), the
salvager partition lock is unnecessary on namei, since the salvager does
not need to scan all volume headers.

Instead of the rather heavyweight partition lock, in DAFS we now lock
individual volumes. Locking an individual volume is done by locking a
certain byte in the file /vicepX/.volume.lock. To lock the volume with
ID 1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix,
LockFileEx on windows as of the time of this writing). To read-lock the
volume, acquire a read lock; to write-lock the volume, acquire a write
lock.

Due to the potentially very large number of volumes attached by the
fileserver at once, the fileserver does not keep volumes locked the
entire time they are attached (which would make volume locking
potentially very slow). Rather, it locks the volume before attaching,
and unlocks it when the volume has been attached. However, all other
programs are expected to acquire a volume lock for the entire duration
they interact with the volume. Whether a read or write lock is obtained
is determined by the attachment mode, and whether or not the volume in
question is an RW volume (see VVolLockType()).

These locks are all acquired non-blocking, so we can just fail if we
fail to acquire a lock. That is, an errant process holding a file-level
lock cannot cause any process to just hang, waiting for a lock.
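
On unix, the byte-range scheme described above maps directly onto POSIX
fcntl() record locks. The following is a minimal sketch of that idea,
not the VLockFile implementation: the toy_ function names are
hypothetical, fd is assumed to be an already-open descriptor for the
lock file (opened read-write), and off_t is assumed to be wide enough
to hold any volume id.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Apply one fcntl() record-lock request to the single byte at offset
 * vol_id. type is F_RDLCK, F_WRLCK, or F_UNLCK. */
static int toy_setlk(int fd, unsigned int vol_id, short type)
{
    struct flock fl;

    assert(fd >= 0);
    memset(&fl, 0, sizeof(fl));
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = (off_t)vol_id;  /* one byte per volume id */
    fl.l_len = 1;
    return fcntl(fd, F_SETLK, &fl);  /* F_SETLK fails instead of blocking */
}

/* Non-blocking volume lock. write_lock selects a write (exclusive)
 * lock over a read (shared) lock. Returns 0 on success, or -1 with
 * errno set (EACCES or EAGAIN when another process holds a
 * conflicting lock). */
int toy_lock_volume_nb(int fd, unsigned int vol_id, int write_lock)
{
    return toy_setlk(fd, vol_id, write_lock ? F_WRLCK : F_RDLCK);
}

/* Release whatever lock this process holds on the volume's byte. */
int toy_unlock_volume(int fd, unsigned int vol_id)
{
    return toy_setlk(fd, vol_id, F_UNLCK);
}
```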

-- re-reading volume headers

Since we cannot know whether a volume is writable or not until the
volume header is read, and we cannot atomically upgrade file-level
locks, part of attachment can now occur twice (see attach2 and
attach_volume_header). What occurs is we read the vol header, assuming
the volume is readonly (acquiring a read or write lock as necessary).
If, after reading the vol header, we discover that the volume is
writable and that means we need to acquire a write lock, we read the vol
header again while acquiring a write lock on the header.
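
The two-pass header read can be illustrated with a toy model. All names
here are hypothetical, the lock policy is reduced to a single flag (the
real decision is made by VVolLockType()), and the readonly guess is
assumed to always need only a read lock; the locks themselves are
represented by an enum rather than actually acquired.

```c
#include <assert.h>

enum toy_lock { TOY_LOCK_READ, TOY_LOCK_WRITE };

/* Toy on-disk volume: all we track is whether its header says RW. */
struct toy_disk_vol {
    int is_rw;
};

/* Hypothetical policy: which lock an attachment needs. rw_needs_write
 * stands in for 'this attachment mode write-locks RW volumes'. */
enum toy_lock toy_lock_type(int is_rw, int rw_needs_write)
{
    return (is_rw && rw_needs_write) ? TOY_LOCK_WRITE : TOY_LOCK_READ;
}

/* Read the header once under the lock a readonly volume would need,
 * and a second time under a write lock if the first read shows that
 * guess was too weak. Returns the lock finally held, and stores the
 * number of header reads in *nreads. */
enum toy_lock toy_attach_header(const struct toy_disk_vol *dv,
                                int rw_needs_write, int *nreads)
{
    enum toy_lock held = toy_lock_type(0, rw_needs_write);  /* guess RO */
    int is_rw;

    is_rw = dv->is_rw;  /* first header read, under 'held' */
    *nreads = 1;

    if (toy_lock_type(is_rw, rw_needs_write) != held) {
        /* The lock cannot be upgraded atomically, so the header may
         * have changed by the time the write lock is held: re-read. */
        held = TOY_LOCK_WRITE;
        is_rw = dv->is_rw;  /* second header read */
        *nreads = 2;
    }
    (void)is_rw;
    return held;
}
```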

-- verifying checkouts

Since the fileserver does not hold volume locks for the entire time a
volume is attached, there could have been a potential race between the
fileserver and other programs. Consider when a non-fileserver program
checks out a volume from the fileserver via FSSYNC, then locks the
volume. Before the program locked the volume, the fileserver could have
restarted and attached the volume. Since the fileserver releases the
volume lock after attachment, the fileserver and the other program could
both think they have control over the volume, which is a problem.

To prevent this, non-fileserver programs are expected to verify that
their volume is checked out after locking it (FSYNC_VerifyCheckout).
What this does is ask the fileserver for the current volume operation on
the specific volume, and verify that it matches how the program
checked out the volume.

For example, programType X checks out volume V from the fileserver, and
then locks it. We then ask the fileserver for the current volume
operation on volume V. If the programType on the vol operation does not
match (or the PID, or the checkout mode, or other things), we know the
fileserver must have restarted or something similar, and we do not have
the volume checked out like we thought we did.

If the program determines that the fileserver may have restarted, it
then must retry checking out and locking the volume (or return an
error).
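
The check out, lock, verify, retry sequence can be sketched as below.
This is only an illustration under stated assumptions: the toy_ structs
and callbacks are hypothetical, the comparison covers just the three
fields named in the text, and a real caller would also check the volume
back in on the failure paths.

```c
#include <assert.h>
#include <stddef.h>

/* The pending operation on a volume, reduced to the fields mentioned
 * in the text. */
struct toy_vol_op {
    int program_type;
    int pid;
    int mode;
};

/* Returns 1 if the fileserver's view (theirs, NULL if the volume has
 * no pending operation) matches our own checkout. */
int toy_checkout_matches(const struct toy_vol_op *ours,
                         const struct toy_vol_op *theirs)
{
    return theirs != NULL
        && ours->program_type == theirs->program_type
        && ours->pid == theirs->pid
        && ours->mode == theirs->mode;
}

/* Hypothetical callbacks supplied by the caller. */
struct toy_ops {
    int (*checkout)(void *rock);  /* 0 on success */
    int (*lock)(void *rock);      /* 0 on success, non-blocking */
    void (*unlock)(void *rock);
    int (*verify)(void *rock);    /* 1 if still checked out by us */
};

/* Check out, lock, then verify; retry a bounded number of times if
 * verification shows the checkout was lost. Returns 0 on success. */
int toy_acquire_volume(const struct toy_ops *ops, void *rock,
                       int max_tries)
{
    int i;

    for (i = 0; i < max_tries; i++) {
        if (ops->checkout(rock) != 0)
            return -1;
        if (ops->lock(rock) != 0)
            return -1;
        if (ops->verify(rock))
            return 0;
        ops->unlock(rock);  /* checkout was lost: start over */
    }
    return -1;
}
```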