The Demand-Attach FileServer (DAFS) has resulted in many changes to how
many things on AFS fileservers behave. The most sweeping changes are
probably in the volume package, but significant changes have also been
made in the SYNC protocol, the vnode package, salvaging, and a few
miscellaneous bits in the various fileserver processes.

This document serves as an overview for developers on how to deal with
these changes, and how to use the new mechanisms. For more specific
details, consult the relevant doxygen documentation, the code comments,
and/or the code itself.

- The salvageserver

The salvageserver (or 'salvaged') is a new OpenAFS fileserver process in
DAFS. This daemon accepts salvage requests via SALVSYNC (see below), and
salvages a volume group by fork()ing a child and running the normal
salvager code (it enters vol-salvage.c by calling SalvageFileSys1).

Salvages that are initiated from a request to the salvageserver (called
'demand-salvages') occur automatically; whenever the fileserver (or
other tool) discovers that a volume needs salvaging, it will schedule a
salvage on the salvageserver without any intervention needed.

When scheduling a salvage, the vol id should be the id for the volume
group (the RW vol id). If the salvaging child discovers that it was
given a non-RW vol id, it will send the salvageserver a SALVSYNC LINK
command, and will exit. This will tell the salvageserver that whenever
it receives a salvage request for that vol id, it should schedule a
salvage for the corresponding RW id instead.

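To make the flow concrete, here is a rough sketch of what scheduling a
demand salvage through SALVSYNC might look like from a client of the
salvageserver. The helper name SALVSYNC_SalvageVolume, the
SALVSYNC_SALVAGE command code, and the argument order are assumptions;
see the SALVSYNC client code under src/vol for the real interface.

    /* Sketch only: ask the salvageserver to salvage a volume group.
     * SALVSYNC_SalvageVolume(), SALVSYNC_SALVAGE, and the argument
     * order are assumed here; consult the SALVSYNC headers and client
     * code under src/vol for the real interface. */
    static int
    schedule_demand_salvage(VolumeId rwVolId, char *partName)
    {
        afs_int32 code;

        /* Always pass the RW (volume group) id; a worker child that is
         * handed a non-RW id will issue a SALVSYNC LINK command and
         * exit, as described above. */
        code = SALVSYNC_SalvageVolume(rwVolId, partName, SALVSYNC_SALVAGE,
                                      0 /* reason; see the real codes */,
                                      0 /* priority */,
                                      NULL /* no response buffer needed */);

        return (code == SYNC_OK) ? 0 : -1;
    }
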
- FSSYNC/SALVSYNC

The FSSYNC and SALVSYNC protocols are the protocols used for
interprocess communication between the various fileserver processes.
FSSYNC is used for querying the fileserver for volume metadata,
'checking out' volumes from the fileserver, and a few other things.
SALVSYNC is used to schedule and query salvages in the salvageserver.

FSSYNC existed prior to DAFS, but it encompasses a much larger set of
commands with the advent of DAFS. SALVSYNC is entirely new to DAFS.

-- SYNC

FSSYNC and SALVSYNC are both layered on top of a protocol called SYNC.
SYNC isn't much of a protocol in itself; it just handles some
boilerplate for the messages passed back and forth, and some error codes
common to both FSSYNC and SALVSYNC.

SYNC is layered on top of TCP/IP, though we only use it to communicate
with the local host (usually via a unix domain socket). It does not
handle anything like authentication, authorization, or even things like
serialization. Although it uses network primitives for communication,
it's only useful for communication between processes on the same
machine, and that is all we use it for.

SYNC calls are basically RPCs, but very simple. The calls are always
synchronous, and each SYNC server can only handle one request at a time.
Thus, it is important for SYNC server handlers to return as quickly as
possible; hitting the network or disk to service a SYNC request should
be avoided to the extent possible.

SYNC-related source files are src/vol/daemon_com.c and
src/vol/daemon_com.h.

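Schematically, a SYNC client call amounts to filling in a command
structure, sending it over the local socket, and blocking for the single
response. The sketch below assumes structures and a SYNC_ask() helper
roughly shaped like those in daemon_com.h; the field and function names
may differ in detail from the real code.

    /* Conceptual sketch of a SYNC client call.  SYNC_command,
     * SYNC_response, SYNC_ask(), and the header fields used here are
     * assumptions modeled on src/vol/daemon_com.[ch]. */
    static afs_int32
    sync_call(SYNC_client_state *state, int command, int reason)
    {
        SYNC_command com;
        SYNC_response res;

        memset(&com, 0, sizeof(com));
        memset(&res, 0, sizeof(res));

        com.hdr.command = command;  /* an FSSYNC or SALVSYNC opcode */
        com.hdr.reason = reason;    /* why we are making the request */

        /* Calls are synchronous: send the command and block until the
         * one response comes back from the (single-threaded) server. */
        if (SYNC_ask(state, &com, &res) != SYNC_OK)
            return SYNC_COM_ERROR;  /* transport-level failure */

        return res.hdr.response;    /* protocol-level response code */
    }
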
-- FSSYNC

--- server

The FSSYNC server runs in the fileserver; source is in
src/vol/fssync-server.c.

As mentioned above, FSSYNC handlers should finish quickly when
servicing a request, so hitting the network or disk should be avoided.
In particular, you absolutely cannot make a SALVSYNC call inside an
FSSYNC handler; the SALVSYNC client wrapper routines actively prevent
this from happening, so even if you try to do such a thing, you will not
be allowed to. This prohibition is to prevent deadlock, since the
salvageserver could have made the FSSYNC request that you are servicing.

When a client makes an FSYNC_VOL_OFF or NEEDVOLUME request, the
fileserver offlines the volume if necessary, and keeps track that the
volume has been 'checked out'. A volume is left online if the checkout
mode indicates the volume cannot change (see VVolOpLeaveOnline_r).

Until the volume has been 'checked in' with the ON, LEAVE_OFFLINE, or
DONE commands, no other program can check out the volume.

Other FSSYNC commands include abilities to query volume metadata and
stats, to force volumes to be attached or offline, and to update the
volume group cache. See doc/txt/fssync.txt for documentation on the
individual FSSYNC commands.

--- clients

FSSYNC clients are generally any OpenAFS process that runs on a
fileserver and tries to access volumes directly. The volserver,
salvageserver, and bosserver all qualify, as do (sometimes) some
utilities like vol-info or vol-bless. For issuing FSSYNC commands
directly, there is the debugging tool fssync-debug. FSSYNC client code
is in src/vol/fssync-client.c, but it's not very interesting.

Any program that wishes to directly access a volume on disk must check
out the volume via FSSYNC (NEEDVOLUME or OFF commands), to ensure the
volume doesn't change while the program is using it. If the program
determines that the volume is somehow inconsistent and should be
salvaged, it should send the FSSYNC command FORCE_ERROR with reason code
FSYNC_SALVAGE to the fileserver, which will take care of salvaging it.

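For example, a volume utility's interaction with the fileserver might
look roughly like the following. FSYNC_VolOp() and the FSYNC_VOL_* /
FSYNC_SALVAGE constants are the ones discussed above, but the exact
signatures and reason codes are assumptions, and volume_is_damaged() is
purely a hypothetical stand-in; see src/vol/fssync-client.c and the
fssync headers for the real interface.

    /* Sketch only: check a volume out from the fileserver, and ask for
     * a salvage if it turns out to be damaged.  FSYNC_VolOp() and the
     * FSYNC_* constants are assumed to match the fssync headers;
     * volume_is_damaged() is a hypothetical helper. */
    static int
    work_on_volume(VolumeId volid, char *partName)
    {
        SYNC_response res;

        memset(&res, 0, sizeof(res));

        /* Check the volume out so it cannot change underneath us. */
        if (FSYNC_VolOp(volid, partName, FSYNC_VOL_NEEDVOLUME,
                        V_DUMP /* checkout mode */, &res) != SYNC_OK)
            return -1;

        if (volume_is_damaged(volid, partName)) {
            /* Don't talk to the salvageserver ourselves; tell the
             * fileserver, and it will schedule the salvage. */
            FSYNC_VolOp(volid, partName, FSYNC_VOL_FORCE_ERROR,
                        FSYNC_SALVAGE, NULL);
            return -1;
        }

        /* ... access the volume on disk ... */

        /* Check the volume back in when done. */
        FSYNC_VolOp(volid, partName, FSYNC_VOL_ON, 0 /* reason */, &res);
        return 0;
    }
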
-- SALVSYNC

The SALVSYNC server runs in the salvageserver; code is in
src/vol/salvsync-server.c. SALVSYNC clients are just the fileserver, the
salvageserver run with the -client switch, and the salvageserver worker
children. If any other process notices that a volume needs salvaging, it
should issue a FORCE_ERROR FSSYNC command to the fileserver with the
FSYNC_SALVAGE reason code.

The SALVSYNC protocol is simpler than the FSSYNC protocol. The commands
are basically just to create, cancel, change, and query salvages. The
RAISEPRIO command increases the priority of a salvage job that hasn't
started yet, so volumes that are accessed more frequently will get
salvaged first. The LINK command is used by the salvageserver worker
children to inform the salvageserver parent that it was asked to salvage
a volume that is not the read-write volume of its volume group (in which
case the salvageserver should just schedule a salvage for the
corresponding read-write volume instead).

Note that canceling a salvage only applies to salvages that haven't run
yet: it just takes a salvage job off of a queue; it doesn't stop a
salvageserver worker child in the middle of a salvage.

- The volume package

-- refcounts

Before DAFS, the Volume struct just had one reference count, vp->nUsers.
With DAFS, we now have the notion of an internal/lightweight reference
count, and an external/heavyweight reference count. Lightweight refs are
acquired with VCreateReservation_r, and released with
VCancelReservation_r. Heavyweight refs are acquired as before, normally
with a GetVolume or AttachVolume variant, and released with VPutVolume.

Lightweight references are only acquired within the volume package; a vp
should not be given to e.g. the fileserver code with an extra
lightweight ref. A heavyweight ref is generally acquired for a vp that
will be given to some non-volume-package code; acquiring a heavyweight
ref guarantees that the volume header has been loaded.

Acquiring a lightweight ref just guarantees that the volume will not go
away or suddenly become unavailable after dropping VOL_LOCK. Certain
operations like detachment or scheduling a salvage only occur when all
of the heavy and lightweight refs go away; see VCancelReservation_r.

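For code outside the volume package, the usual pattern is therefore a
heavyweight reference held around any use of the volume; a minimal
sketch, assuming the common VGetVolume/VPutVolume signatures (see
src/vol/volume.c for the exact variants):

    /* Sketch: hold a heavyweight reference while using a volume from
     * non-volume-package code.  The VGetVolume() signature shown here
     * is an assumption; check volume.c for the exact variants. */
    static afs_int32
    use_volume(VolumeId volid)
    {
        Error ec = 0, client_ec = 0;
        Volume *vp;

        vp = VGetVolume(&ec, &client_ec, volid); /* heavyweight ref;
                                                  * header is loaded */
        if (!vp)
            return ec;  /* offline, in error, being salvaged, ... */

        /* ... use vp; the ref keeps the volume from being detached ... */

        VPutVolume(vp); /* drop the heavyweight ref */
        return 0;
    }
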
-- state machine

Instead of having a per-volume lock, each vp always has an associated
'state' that says what, if anything, is occurring to a volume at any
particular time; or if the volume is attached, offline, etc. To do the
basic equivalent of a lock -- that is, ensure that nobody else will
change the volume when we drop VOL_LOCK -- you can put the volume in
what is called an 'exclusive' state (see VIsExclusiveState).

When a volume is in an exclusive state, no thread should modify the
volume (or expect the vp data to stay the same), except the thread that
put it in that state. Whenever you manipulate a volume, you should make
sure it is not in an exclusive state; first call VCreateReservation_r to
make sure the volume doesn't go away, and then call
VWaitExclusiveState_r. When that returns, you are guaranteed to have a
vp that is in a non-exclusive state, and so can be manipulated. Call
VCancelReservation_r when done with it, to indicate you don't need it
anymore.

Look at the definition of the VolState enumeration to see all volume
states, and a brief explanation of them.

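Putting the refcount and state-machine pieces together, code inside the
volume package typically follows a pattern like this minimal sketch (it
assumes VOL_LOCK is held, as for most of the *_r functions):

    /* Sketch: safely examine/modify a vp from inside the volume
     * package.  Assumes VOL_LOCK is held on entry. */
    static void
    manipulate_volume_r(Volume *vp)
    {
        VCreateReservation_r(vp); /* lightweight ref: vp cannot go away */

        /* May block (dropping VOL_LOCK while waiting) until no other
         * thread holds the volume in an exclusive state. */
        VWaitExclusiveState_r(vp);

        /* vp is now in a non-exclusive state and can be manipulated.
         * If we are about to drop VOL_LOCK and must keep others from
         * changing the volume, put it in an exclusive state ourselves
         * here, and restore a non-exclusive state when finished. */

        VCancelReservation_r(vp); /* drop the ref; if it was the last
                                   * one, this may trigger detachment
                                   * or a pending salvage */
    }
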
-- VLRU

See: Most functions with VLRU in their name in src/vol/volume.c.

The VLRU is what dictates when volumes are detached after a certain
amount of inactivity. The design is pretty much a generational garbage
collection mechanism. There are 5 queues that a volume can be on in the
VLRU (VLRUQueueName in volume.h). 'Candidate' volumes haven't seen
activity in a while, and so are candidates to be detached. 'New' volumes
have seen activity only recently; 'mid' volumes have seen activity for
awhile, and 'old' volumes have seen activity for a long while. 'Held'
volumes cannot be soft detached at all.

Volumes are moved from new->mid->old if they have had activity recently,
and are moved from old->mid->new->candidate if they have not had any
activity recently. The definition of 'recently' is configurable with the
-vlruthresh fileserver parameter; see VLRU_ComputeConstants for how the
thresholds are determined. Volumes start at 'new' on attachment, and if
any activity occurs while a volume is on 'candidate', it is moved to
'new' immediately.

Volumes are generally promoted/demoted and soft-detached by
VLRU_ScannerThread, which runs every so often and moves volumes between
VLRU queues depending on their last access time and the various
thresholds (or soft-detaches them, in the case of the 'candidate'
queue). Soft-detaching just means the volume is taken offline and put
into the preattached state.

--- DONT_SALVAGE

The dontSalvage flag in volume headers can be set to DONT_SALVAGE to
indicate that a volume probably doesn't need to be salvaged. Before
DAFS, volumes were placed on an 'UpdateList' which was periodically
scanned, and dontSalvage was set on volumes that hadn't been touched in
a while.

With DAFS and the VLRU additions, setting dontSalvage now happens when a
volume is demoted a VLRU generation, and no separate list is kept. So if
a volume has been idle enough to demote, and it hasn't been accessed in
SALVAGE_INTERVAL time, dontSalvage will be set automatically by the VLRU
scanner.

-- Vnode

Source files: src/vol/vnode.c, src/vol/vnode.h, src/vol/vnode_inline.h

The changes to the vnode package are largely very similar to those in
the volume package. A Vnode is put into specific states, some of which
are exclusive and act like locks (see VnChangeState_r,
VnIsExclusiveState). Vnodes also have refcounts, incremented and
decremented with VnCreateReservation_r and VnCancelReservation_r like
you would expect. I/O should be done outside of any global locks;
instead, the vnode is 'locked' by being put in an exclusive state if
necessary.

In addition to a state, vnodes also have a count of readers. When a
caller gets a vnode with a read lock, we of course must wait for the
vnode to be in a nonexclusive state (VnWaitExclusive_r), then the number
of readers is incremented (VnBeginRead_r), but the vnode is kept in a
non-exclusive state (VN_STATE_READ).

When a caller gets a vnode with a write lock, we must wait not only for
the vnode to be in a nonexclusive state, but also for there to be no
readers (VnWaitQuiescent_r), so we can actually change it.

VnLock still exists in DAFS, but it's almost a no-op. All we do for DAFS
in VnLock is set vnp->writer to the current thread id for a write lock,
for some consistency checks later (read locks are actually no-ops).
Actual mutual exclusion in DAFS is done by the vnode state machine and
the reader count.

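Schematically, the reader path described above looks something like this
sketch; the function names are the ones mentioned in this section, but
treat the exact signatures as assumptions (the real logic is in
src/vol/vnode.c):

    /* Sketch: what acquiring a vnode "read lock" amounts to under DAFS.
     * Assumes the appropriate glock is held, as for the *_r functions;
     * names/signatures are taken from the text above and may differ
     * slightly from the real code. */
    static void
    get_vnode_for_read_r(Vnode *vnp)
    {
        VnCreateReservation_r(vnp); /* refcount: vnp won't be reclaimed */

        VnWaitExclusive_r(vnp);     /* wait out any exclusive
                                     * (lock-like) state */

        VnBeginRead_r(vnp);         /* bump the reader count; the vnode
                                     * stays in the non-exclusive
                                     * VN_STATE_READ state, so other
                                     * readers can come and go */
    }

    /* A writer instead waits with VnWaitQuiescent_r(), which requires
     * both a non-exclusive state and a reader count of zero. */
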
- viced state serialization

See src/viced/serialize_state.* and ShutDownAndCore in
src/viced/viced.c

Before DAFS, whenever a fileserver restarted, it lost all information
about all clients, what callbacks they had, etc. So when a client with
existing callbacks contacted the fileserver, all callback information
needed to be reset, potentially causing a bunch of unnecessary traffic.
And of course, if a client did not contact the fileserver again, it
would never receive callbacks it should have been sent.

DAFS now has the ability to save the host and CB data to a file on
shutdown, and restore it when it starts up again. So when a fileserver
is restarted, the host and CB information should be effectively the same
as when it shut down, and a client may not even notice that the
fileserver was restarted.

Getting this state information can be a little difficult, since the host
package data structures aren't necessarily always consistent, even after
H_LOCK is dropped. What we attempt to do is stop all of the background
threads early in the shutdown process (set fs_state.mode =
FS_MODE_SHUTDOWN), and wait for the background threads to exit (or be
marked as 'tranquil'; see the fs_state struct) later on, before trying
to save state. This makes it a lot less likely for anything to be
modifying the host or CB structures by the time we try to save them.

- volume group cache

See: src/vol/vg_cache* and src/vol/vg_scan.c

The VGC is a mechanism in DAFS to speed up volume salvages. Pre-VGC,
whenever the salvager code salvaged an individual volume, it would need
to read all of the volume headers on the partition in order to know
which volumes are in the volume group it is salvaging, and thus which
volumes to tell the fileserver to take offline. With demand-salvages,
this can make salvaging take a very long time, since reading in all of
the volume headers can take much longer than actually salvaging a single
volume group.

To avoid the need to scan the partition volume headers every single
time, the fileserver maintains a cache of which volumes are in which
volume groups. The cache is populated by scanning a partition's volume
headers; the scan is started in the background upon receiving the first
salvage request for a partition (VVGCache_scanStart_r,
_VVGC_scan_start).

After the VGC is populated, it is kept up to date with volumes being
created and deleted via the FSSYNC VG_ADD and VG_DEL commands. These are
called every time a volume header is created, removed, or changed when
using the volume header wrappers in vutil.c (VCreateVolumeDiskHeader,
VDestroyVolumeDiskHeader, VWriteVolumeDiskHeader). These wrappers should
always be used to create/remove/modify vol headers, to ensure that the
necessary FSSYNC commands are called.

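For instance, code creating a new volume on disk should write the
on-disk header through the wrapper rather than by hand, so the VG_ADD is
sent for it. A rough sketch, with the VolumeDiskHeader_t fields and the
VCreateVolumeDiskHeader() signature treated as assumptions (see
src/vol/vutil.c for the real code):

    /* Sketch: create a volume disk header through the vutil.c wrapper
     * so the fileserver's VGC is told about the new volume (VG_ADD).
     * Field names and the wrapper signature are assumptions. */
    static afs_int32
    create_vol_header(VolumeId volid, VolumeId parentId,
                      struct DiskPartition64 *partP)
    {
        VolumeDiskHeader_t diskHeader;

        memset(&diskHeader, 0, sizeof(diskHeader));
        diskHeader.id = volid;        /* this volume's id */
        diskHeader.parent = parentId; /* RW (volume group) id */
        /* ... fill in the stamp and inode fields ... */

        /* The wrapper writes the header file and issues the FSSYNC
         * VG_ADD, keeping the VGC in sync with what is on disk. */
        return VCreateVolumeDiskHeader(&diskHeader, partP);
    }
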
-- race prevention

In order to prevent races between volume changes and VGC partition scans
(that is, someone scanning a header while it is being written and not
yet valid), updates to the VGC involving adding or modifying volume
headers should always be done under the 'partition header lock'. This is
a per-partition lock that conceptually locks the set of volume headers
on that partition. It is only read-held when something is writing to a
volume header, and it is write-held for something that is scanning the
partition for volume headers (the VGC or partition salvager). This is a
little counterintuitive, but it is what we want: we want to allow
multiple headers to be written to at once, but if we are the VGC
scanner, we want to ensure nobody else is writing when we look at a
header file.

Because the race described above is so rare, vol header scanners don't
actually hold the lock unless a problem is detected. So, what they do is
read a particular volume header without any lock, and if there is a
problem with it, they grab a write lock on the partition vol headers,
and try again. If it still has a problem, the header is just faulty; if
it's okay, then we avoided the race.

Note that destroying vol headers does not require any locks, since
unlink()s are atomic and don't cause any races for us here.

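In pseudocode, a header scan with this optimistic retry looks roughly as
follows; read_one_header() and the partition header lock helpers here
are illustrative stand-ins, not the real function names.

    /* Sketch of the unlocked-read-then-retry pattern used by header
     * scanners.  read_one_header(), part_hdr_write_lock(), and
     * part_hdr_write_unlock() are hypothetical stand-ins for the real
     * VGC / partition header lock code. */
    static int
    scan_one_header(struct DiskPartition64 *partP, const char *hdrPath)
    {
        int code;

        /* Optimistic pass: read the header with no lock held. */
        code = read_one_header(hdrPath);
        if (code == 0)
            return 0;   /* header looked fine; no race to worry about */

        /* The header looked bad, but it may simply have been mid-write.
         * Take the per-partition header write lock and look again. */
        part_hdr_write_lock(partP);
        code = read_one_header(hdrPath);
        part_hdr_write_unlock(partP);

        return code;    /* still bad => the header is genuinely faulty */
    }
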
- partition and volume locking

Previously, whenever the volserver would attach a volume or the salvager
would salvage anything, the partition would be locked
(VLockPartition_r). This unnecessarily serializes part of most volserver
operations. It also means that only one salvage can run on a partition
at a time, and that a volserver operation cannot occur at the same time
as a salvage. With the addition of the VGC (previous section), the
salvager partition lock is unnecessary on namei, since the salvager does
not need to scan all volume headers.

Instead of the rather heavyweight partition lock, in DAFS we now lock
individual volumes. Locking an individual volume is done by locking a
certain byte in the file /vicepX/.volume.lock. To lock the volume with
ID 1234, you lock 1 byte at offset 1234 (with VLockFile: fcntl on unix,
LockFileEx on windows as of the time of this writing). To read-lock the
volume, acquire a read lock; to write-lock the volume, acquire a write
lock.

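The locking primitive itself is just an ordinary byte-range lock on the
per-partition lock file. The following self-contained illustration shows
the idea on unix; it is not the OpenAFS implementation (that is the
VLockFile code in src/vol), just the underlying mechanism.

    /* Illustration of per-volume byte-range locking: lock one byte at
     * offset <volume id> in /vicepX/.volume.lock.  This mirrors the
     * idea behind VLockFile, not its actual code. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    /* Returns 0 on success, -1 if the lock is held by someone else. */
    static int
    lock_volume_byte(int lockfd, unsigned int volid, int writeLock)
    {
        struct flock fl;

        memset(&fl, 0, sizeof(fl));
        fl.l_type = writeLock ? F_WRLCK : F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = volid;  /* byte offset == volume id */
        fl.l_len = 1;        /* lock exactly one byte */

        /* F_SETLK (not F_SETLKW): non-blocking, so a stuck lock holder
         * cannot make other processes hang waiting for the lock. */
        return fcntl(lockfd, F_SETLK, &fl);
    }

The caller opens /vicepX/.volume.lock once and passes the descriptor in;
unlocking is the same call with l_type = F_UNLCK, or simply closing the
descriptor.
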
Due to the potentially very large number of volumes attached by the
fileserver at once, the fileserver does not keep volumes locked the
entire time they are attached (which would make volume locking
potentially very slow). Rather, it locks the volume before attaching,
and unlocks it when the volume has been attached. However, all other
programs are expected to hold a volume lock for the entire duration
they interact with the volume. Whether a read or write lock is obtained
is determined by the attachment mode, and whether or not the volume in
question is an RW volume (see VVolLockType()).

These locks are all acquired non-blocking, so an attempt simply fails if
the lock cannot be acquired immediately. That is, an errant process
holding a file-level lock cannot cause another process to hang waiting
for a lock.

-- re-reading volume headers

Since we cannot know whether a volume is writable or not until the
volume header is read, and we cannot atomically upgrade file-level
locks, part of attachment can now occur twice (see attach2 and
attach_volume_header). What happens is that we first read the vol header
under the assumption that the volume is read-only (acquiring a read or
write lock as appropriate for that case). If, after reading the vol
header, we discover that the volume is actually writable and so needs a
write lock, we read the vol header again while acquiring a write lock on
the header.

-- verifying checkouts

Since the fileserver does not hold volume locks for the entire time a
volume is attached, there is a potential race between the fileserver and
other programs. Consider when a non-fileserver program checks out a
volume from the fileserver via FSSYNC, then locks the volume. Before the
program locked the volume, the fileserver could have restarted and
attached the volume. Since the fileserver releases the volume lock after
attachment, the fileserver and the other program could both think they
have control over the volume, which is a problem.

To prevent this, non-fileserver programs are expected to verify that
their volume is checked out after locking it (FSYNC_VerifyCheckout).
What this does is ask the fileserver for the current volume operation on
the specific volume, and verify that it matches how the program checked
out the volume.

For example, programType X checks out volume V from the fileserver, and
then locks it. We then ask the fileserver for the current volume
operation on volume V. If the programType on the vol operation does not
match (or the PID, or the checkout mode, or other things), we know the
fileserver must have restarted or something similar, and we do not have
the volume checked out like we thought we did.

If the program determines that the fileserver may have restarted, it
then must retry checking out and locking the volume (or return an
error).

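Putting this section and the locking sections together, the attach path
of a non-fileserver program looks roughly like the sketch below.
FSYNC_VerifyCheckout is the routine named above, but its exact signature
and return convention are assumed here; the other helpers are purely
illustrative stand-ins.

    /* Sketch of the checkout / lock / verify sequence for a
     * non-fileserver program.  checkout_volume(), lock_volume(),
     * unlock_volume(), and give_back_volume() are hypothetical
     * stand-ins; FSYNC_VerifyCheckout's signature and return value are
     * assumptions. */
    static int
    safely_grab_volume(VolumeId volid, char *partName)
    {
        int tries;

        for (tries = 0; tries < 2; tries++) {
            /* 1. Check the volume out from the fileserver via FSSYNC. */
            if (checkout_volume(volid, partName) != 0)
                return -1;

            /* 2. Lock the volume (byte-range lock; see above). */
            if (lock_volume(volid, partName) != 0) {
                give_back_volume(volid, partName);
                return -1;
            }

            /* 3. Ask the fileserver whether the volume operation it has
             * recorded still matches our programType, PID, and checkout
             * mode.  If the fileserver restarted between (1) and (2),
             * it will not. */
            if (FSYNC_VerifyCheckout(volid, partName,
                                     FSYNC_VOL_NEEDVOLUME, 0 /* reason */))
                return 0;   /* verified; the volume is really ours */

            /* Verification failed: undo and retry the whole sequence. */
            unlock_volume(volid, partName);
            give_back_volume(volid, partName);
        }
        return -1;  /* give up; the caller returns an error */
    }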