mirror of
https://git.openafs.org/openafs.git
synced 2025-01-18 15:00:12 +00:00
b8e8145fa9
Add rx protocol spec and rx debug spec written by Nickolia Zeldovich. Rx protocol specification draft (2002) Nickolai Zeldovich, kolya@MIT.EDU Change-Id: I65a9a83a8889503f3a82c8fde7a87f84d2736c8d Reviewed-on: https://gerrit.openafs.org/12676 Tested-by: BuildBot <buildbot@rampaginggeek.com> Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
922 lines
38 KiB
Plaintext
922 lines
38 KiB
Plaintext
Rx protocol specification draft
|
|
Nickolai Zeldovich, kolya@MIT.EDU
|
|
|
|
Introduction
|
|
============
|
|
|
|
Rx is a client-server RPC protocol, an extended and combined version
|
|
of the older R and RFTP protocols. This document describes Rx, but
|
|
the details of Rx security protocols (such as Rxkad) are not specified.
|
|
|
|
Rx communicates via UDP datagrams on a user-specified port. Rx also
|
|
provides for multiplexing of Rx services on a single port, via a
|
|
16-bit service ID which identifies a particular Rx service that's
|
|
listening on a given port akin to a port number. Therefore, an Rx
|
|
service is identified by a triple of <IP address; UDP port number;
|
|
Rx service ID>.
|
|
|
|
The protocol is connection-oriented -- a client and a server must
|
|
first hand-shake and establish a connection before Rx calls can be
|
|
made. Said hand-shaking is implicit upon the first request if no
|
|
authentication is desired, or can consist of a pair of Challenge
|
|
and Response requests in order to establish authentication between
|
|
the client and the server.
|
|
|
|
Protocol Overview
|
|
=================
|
|
|
|
As mentioned above, Rx uses UDP/IP datagrams on a user-specified
|
|
port to communicate. An optional user-selectable authentication
|
|
and encryption method can be used to achieve desired security.
|
|
Each Rx server may provide multiple services, specified by the
|
|
Service ID. This allows for service multiplexing, much in the
|
|
same way as UDP port numbers allow for multiplexing of UDP
|
|
datagrams addressed to the same host.
|
|
|
|
Each client and server pair that want to communicate using Rx must
|
|
establish an Rx connection, which can be thought of as a context
|
|
for all subsequent Rx activity between these two parties. An Rx
|
|
connection can only be associated with a single Rx service.
|
|
|
|
Each Rx connection context contains multiple channels, which are
|
|
used for data transmission and actually performing an RPC call.
|
|
The channels are independent of each other, allowing multiple
|
|
RPC calls to be performed to the same Rx server simultaneously.
|
|
|
|
An Rx call involves the transmission of call arguments over an Rx
|
|
channel to the server and reception of the reply data. For each
|
|
Rx call, an available Rx channel must be allocated exclusively to
|
|
that call. The channel cannot be used for anything else until the
|
|
call completes. After call completion, the channel may be reused
|
|
for subsequent Rx calls.
|
|
|
|
Rx Connections
|
|
==============
|
|
|
|
This section makes many references to fields of an Rx header; see
|
|
the ``Packet Formats'' section for specific layout of the Rx header.
|
|
|
|
The connection epoch is a unique value chosen by Rx on startup and
|
|
used by the peer to both to identify connections to this host, and
|
|
to detect when this host's Rx restarts. An Rx connection between
|
|
two hosts is identified by:
|
|
|
|
{ Epoch, Connection ID, Peer IP, Peer Port },
|
|
if the high bit of the epoch (+) is not set
|
|
{ Epoch, Connection ID },
|
|
if the high bit of the epoch (+) is set
|
|
|
|
This means that if the high epoch bit is set, the recipient of a
|
|
packet should accept packets for this Rx connection from any IP
|
|
address and port number. Conversely, if the high bit is not set,
|
|
the IP and port number must be the same in order for packets to
|
|
be properly recognized as being part of the same connection.
|
|
|
|
Connection ID is chosen by the client that establishes the connection.
|
|
The last two bits of the same 32-bit field are used by Rx to multiplex
|
|
between 4 parallel calls on the same connection. Each one of them is
|
|
called an Rx channel, and therefore the field is denoted "Channel ID".
|
|
|
|
Call number identifies a particular call within a channel (so there
|
|
are four call numbers associated with an Rx connection). Each new
|
|
call should start with a higher number than the previous call, and
|
|
typically this is just the previous call number + 1. The initial
|
|
call number must be non-zero, since call number zero indicates a
|
|
connection-only Rx packet (see below). The call number is chosen
|
|
by the peer initiating the call. Although only one call can use
|
|
a channel at one time, the call number allows peers to distinguish
|
|
packets on the same channel that belong to different calls.
|
|
|
|
The sequence number is similar to the sequence number in TCP, but
|
|
instead of bytes they count packets within a call. Sequence numbers
|
|
always start with 1 at the beginning of each call, and are incremented
|
|
by 1 for each additional packet sent. Retransmissions in Rx are done
|
|
on a packet-by-packet basis, identified by these sequence numbers.
|
|
|
|
Every outgoing packet associated with a certain connection is stamped
|
|
with a serial number in the serial field, and the serial number is
|
|
incremented by 1 for every packet sent. This is used by the flow
|
|
control mechanisms (described below). The serial number for a
|
|
connection should start out with 1 (i.e., the first packet sent
|
|
should have a serial number of 1.)
|
|
|
|
Service ID identifies a particular Rx service running on a given
|
|
host/port combination. This is analogous to how UDP port numbers
|
|
allow multiplexing packets to a single IP address. Note that once
|
|
an Rx connection has been created, the service ID may not be changed;
|
|
existing implementations cache the service ID value for a given
|
|
connection, and will ignore service ID values in subsequent packets.
|
|
|
|
The Checksum field allows for an optional packet checksum. A zero
|
|
checksum field value means that checksums are not being computed.
|
|
An Rx security protocol (identified by the security field, described
|
|
below) may choose to use this field to transport some checksum of
|
|
the packet that is computed and verified by it (for example, rxkad
|
|
uses this field for a cryptographic header checksum). Rx itself
|
|
makes no use of the checksum field.
|
|
|
|
The status field allows for additional user flags to be transported
|
|
with each packet. These have no significance to the protocol itself.
|
|
These flags are associated with a call rather than an individual
|
|
packet.
|
|
|
|
The security field specifies the type of security in use on this
|
|
connection. These values don't have a defined mapping in the Rx
|
|
protocol but rather are mapped to specific Rx security types by
|
|
the application using Rx.
|
|
|
|
An Rx security protocol can use the checksum field as described
|
|
above, and can also modify the packet payload in any way, for
|
|
instance by encrypting the contents or adding headers or trailers
|
|
specific to the security protocol (although the end result must
|
|
be a properly sized packet that Rx will be able to transmit.)
|
|
|
|
The "Flags" field consists of a number of single-bit flags with
|
|
meanings as follows. The actual bit values are defined below,
|
|
in the ``Protocol Constants'' section.
|
|
|
|
* CLIENT-INITIATED
|
|
This packet originated from an Rx client (as opposed
|
|
to server). To avoid packet loops, a server should
|
|
always clear the CLIENT-INITIATED flag on any packets
|
|
it sends, and discard incoming packets without the
|
|
CLIENT-INITIATED flag.
|
|
|
|
* REQUEST-ACK
|
|
Sender is requesting acknowledgement of this packet,
|
|
via an Ack packet response.
|
|
|
|
* LAST-PACKET
|
|
This packet is the last packet in this call from the
|
|
sender.
|
|
|
|
NOTE: some older Rx implementations, which do not
|
|
support the trailing packet size fields in Rx Ack
|
|
packets, use the LAST-PACKET flag for computing the
|
|
MTU. In particular, when a DATA packet with the
|
|
REQUEST-ACK flag but without the LAST-PACKET flag
|
|
is received, the MTU is adjusted down to the size
|
|
of that packet.
|
|
|
|
* MORE-PACKETS
|
|
More packets are going to be following this one. This
|
|
flag is set on all but the last packet by the sender
|
|
transmitting a list of packets at once, for possible
|
|
optimization at the receiver end.
|
|
|
|
* SLOW-START-OK
|
|
In an ack packet, indicates that the sender of this
|
|
packet supports the slow-start mechanism, described
|
|
below under ``Flow Control''.
|
|
|
|
* JUMBO-PACKET
|
|
In a data packet, indicates that this packet is part
|
|
of a jumbogram, and is not the last one. See the
|
|
``Jumbograms'' section below for more details.
|
|
|
|
Packet Types
|
|
============
|
|
|
|
The "Type" field indicates the contents of this packet. Actual
|
|
values are specified in the ``Protocol Constants'' section.
|
|
This section describes the simpler packet types, and subsequent
|
|
sections cover more complex packet types in more detail.
|
|
|
|
Certain type packets are connection-only requests (that is, they
|
|
are not associated with an RPC call). A connection-only request
|
|
is indicated by a zero call number. Valid packet types in a
|
|
connection-only context are Abort, Challenge, Response, Debug,
|
|
Version, and the parameter exchange packet types. All other
|
|
packets can only be used in the context of a call. Additionally,
|
|
Abort can be used both in a connection and call context.
|
|
|
|
The payload of the packet following the header depends on the
|
|
type of the field, as follows:
|
|
|
|
* DATA type (Standard data packet)
|
|
|
|
The payload of a data packet is simply the Rx payload,
|
|
corresponding to the sequence number and call specified
|
|
in the header. The actual data that is transmitted in
|
|
Rx data packets is described below.
|
|
|
|
The receipt of a data packet by a client implicitly
|
|
acknowledges that the server has received and processed
|
|
all the packets that have been transmitted to it as
|
|
part of this call.
|
|
|
|
* ACK type (Acknowledgement of received data)
|
|
|
|
An acknowledgement packet provides information about
|
|
which packets were or were not received by the peer,
|
|
and other useful parameters. The semantics of these
|
|
packets are described below in the ``Call Layer''
|
|
section.
|
|
|
|
* BUSY type (Busy response)
|
|
|
|
When a client tries to start a new call on a channel
|
|
which the server still considers active, a busy response
|
|
is returned. The call and channel number in the packet
|
|
header indicate which call is being rejected. This packet
|
|
type has no payload associated with it.
|
|
|
|
* ABORT type (Abort packet)
|
|
|
|
Indicates that the relevant connection or call (if the
|
|
call number field is non-zero) has encountered an error
|
|
and has been terminated. The payload of the packet has
|
|
a network-byte-order 32-bit user error code.
|
|
|
|
* ACKALL type (Acknowledgement of all packets)
|
|
|
|
An acknowledge-all packet indicates the obvious: the peer
|
|
wants to acknowledge the receipt of all packets sent to
|
|
it. This could be used, for example, when a connection
|
|
is being closed and the client wants to ensure that no
|
|
retransmissions are attempted after it exits.
|
|
|
|
There is no payload associated with an acknowledge-all
|
|
packet.
|
|
|
|
* CHALLENGE, RESPONSE types (Challenge request/response)
|
|
|
|
The payload of the packet is security-layer-specific
|
|
data, and is used to authenticate an Rx connection.
|
|
|
|
Perhaps this should include a reference to some spec
|
|
on rxkad (or rxkad should just be added to this spec.)
|
|
|
|
* DEBUG type (Debug packet)
|
|
|
|
Rx supports an optional debugging interface; see the
|
|
``Debugging'' section below for more details.
|
|
|
|
* PARAMS types (Parameter exchange)
|
|
|
|
These types were assigned in AFS 3.2 but never used for
|
|
anything, and therefore have no protocol significance
|
|
at this time.
|
|
|
|
* VERSION type (Get AFS version)
|
|
|
|
If a server receives a packet with a type value of 13, and
|
|
the client-initiated flag set, it should respond with a
|
|
65-byte payload containing a string that identifies the
|
|
version of AFS software it is running. The response should
|
|
not have the client-initiated flag set.
|
|
|
|
Nothing should respond to a version packet without the
|
|
client-initiated flag, to avoid infinite packet loops.
|
|
|
|
Call Layer
|
|
==========
|
|
|
|
The call layer provides a reliable data transport over an
|
|
Rx channel, and is used by the RPC layer to make Rx calls.
|
|
One of the most important pieces of the call layer is the
|
|
Rx acknowledgement packet. The acknowledgement packet is
|
|
used by Rx to determine when retransmissions are needed,
|
|
as well as determining the proper transmission / receiving
|
|
parameters to use (such as the transmit window size and
|
|
jumbogram length, described in more detail below).
|
|
|
|
A new call is established by the client simply sending a
|
|
data packet to the server on an available channel. Either
|
|
side can indicate that they have no more data to send by
|
|
setting the LAST-PACKET flag in their last Rx packet. The
|
|
call remains open until the upper layer informs Rx that it
|
|
is done with the call. (The upper layer in this case would
|
|
most likely be the Rx RPC layer.)
|
|
|
|
The structure of an Rx acknowledgement packet is described
|
|
in the Packet Formats section. We will refer to particular
|
|
fields of the acknowledgement packet here by names.
|
|
|
|
The <Buffer Space> field specifies the number of packets that
|
|
the sender of the acknowledgement is willing to provide for
|
|
receiving packets for this call. The sender, presumably,
|
|
should not send packets beyond the number specified here,
|
|
without receiving further acknowledgement allowing it.
|
|
|
|
The <Max Skew> field indicates the maximum packet skew that
|
|
the sender of this packet has seen for this call. If a
|
|
packet is received N packets later than expected (based
|
|
on the packet's serial number, i.e. if the last received
|
|
packet's serial number is N higher than this packet's),
|
|
then it is defined to have a skew of N. This can be used
|
|
to avoid retransmission because of packet reordering.
|
|
|
|
The <First Sequence> number specifies the sequence number of
|
|
the first packet that is being explicitly acknowledged (either
|
|
positively or negatively) by this packet. All packets with
|
|
sequence numbers smaller than this are implicitly acknowledged.
|
|
|
|
The <Reserved> field, previously used to indicate the previous
|
|
received packet, is no longer used. It should be set to zero
|
|
by the sender and not interpreted by the receiver.
|
|
|
|
The <Serial Number> field indicates the serial number of the
|
|
packet which has triggered this acknowledgement, or zero if there
|
|
is no such packet (i.e. the ack packet was delayed and should not
|
|
be used for round-trip time computation). The receiver should
|
|
note that any transmitted packets with a serial number less than
|
|
this, which are not acknowledged by this packet, are likely lost
|
|
or reordered. Thus, these packets should be retransmitted, after
|
|
a possible delay to allow for packet reordering (as measured by
|
|
packet skew).
|
|
|
|
The trailing fields after the variable-length acknowledgements
|
|
section are not always 32-bit aligned with respect to the packet,
|
|
and aren't always present. (Their presence depends on the Rx
|
|
version of the peer.) The maximum and recommended packet sizes
|
|
are, respectively, the largest possible packet size that the peer
|
|
is willing to accept from us, and the size of the packet they
|
|
would prefer to receive. In absence of these fields, it should
|
|
be assumed that the maximum allowed packet size is 1444 bytes.
|
|
|
|
The receive window size indicates the size of the ACK sender's
|
|
receive window, in packets. Its use is described below in
|
|
the "Flow Control" section. If this field is absent, the
|
|
implementation must assume a maximum window size of 15 packets;
|
|
older implementations that do not support this trailing field
|
|
only allow for a window of 15 packets.
|
|
|
|
The "Max Packets per Jumbogram" field indicates how many packets
|
|
the ACK sender is willing to receive in a jumbogram (also
|
|
described below). All packets in a jumbogram are always of the
|
|
same size (except the last one), regardless of the maximum and
|
|
recommended packet sizes described above.
|
|
|
|
The <Reason> field specifies a particular type of an ack packet.
|
|
Valid reason codes are specified in the ``Packet Formats and
|
|
Protocol Constants'' section; their meanings are as follows:
|
|
|
|
REQUESTED
|
|
Acknowledgement was requested. The peer received
|
|
a packet from us with the acknowledgement-requested
|
|
flag set, and is acknowledging it.
|
|
|
|
DUPLICATE
|
|
A duplicate packet was received. The duplicate
|
|
packet's serial number is in the <Serial> field.
|
|
|
|
OUT-OF-SEQUENCE
|
|
A packet was received out of sequence. The serial
|
|
number of said packet is in the <Serial> field.
|
|
|
|
WINDOW-EXCEEDED
|
|
A packet was received but exceeded the current
|
|
receive window, and was dropped.
|
|
|
|
NO-SPACE
|
|
A packet was received, but no buffer space was
|
|
available and therefore it was dropped.
|
|
|
|
PING
|
|
This is a keep-alive packet, used to verify that
|
|
the peer is still alive. If the REQUEST-ACK flag
|
|
in the Rx packet is set, the recipient of this
|
|
packet should reply with a PING-RESPONSE packet.
|
|
|
|
PING-RESPONSE
|
|
This is a response to a keep-alive ack (ping).
|
|
|
|
DELAYED
|
|
A delayed acknowledgement, usually because a certain
|
|
amount of time has passed since the receipt of the
|
|
last packet and there are outstanding unacknowledged
|
|
packets. Should not be used for RTT computation.
|
|
|
|
OTHER
|
|
Un-delayed general acknowledgement, which does not
|
|
fall in any of the above categories.
|
|
|
|
A peer should never delay the transmission of an ack packet
|
|
in response to a received packet unless it sets the delayed
|
|
ack type field. This is because ack packets (except for
|
|
delayed ones) are used for RTT computation by Rx.
|
|
|
|
All acknowledgement packets should have the REQUEST-ACK
|
|
flag in the Rx header turned off, except for PING type
|
|
ack packets.
|
|
|
|
The <Ack Count> field specifies the number of bytes following
|
|
in the acknowledgements section. Each of those bytes indicate
|
|
the acknowledgement status corresponding to a sequence number
|
|
between firstSequence and firstSequence+ackCount-1 inclusively.
|
|
There can be up to 255 bytes in the acknowledgements section.
|
|
Typically the ack count is the receive window size of the
|
|
ack packet sender, and the individual packet status bytes
|
|
correspond to the packets in the current receive window.
|
|
The values in each of those bytes can be as follows:
|
|
|
|
0 Explicit negative acknowledgement: packet with the
|
|
corresponding sequence number has not been received
|
|
or has been dropped.
|
|
1 Explicit acknowledgement: packet with the corresponding
|
|
sequence number has been received but not processed by
|
|
the application yet.
|
|
|
|
It's important to note the distinction between packets with
|
|
sequence numbers before firstSequence, between firstSequence
|
|
and firstSequence+ackCount-1, and those with sequence numbers
|
|
of at least firstSequence+ackCount. Those in the first category
|
|
have been passed up to the application level and the sender
|
|
(recipient of this ack) can recycle packets with such sequence
|
|
numbers.
|
|
|
|
Packets in the second category are individually acknowledged
|
|
in the acknowledgements section, either as being queued for
|
|
the application or not received. The recipient of the ack
|
|
should keep all packets with sequence numbers in this range,
|
|
but avoid retransmitting the positively acknowledged ones.
|
|
Negatively acknowledged packets should be retransmitted.
|
|
A more detailed explaination of the retransmit strategy is
|
|
given below.
|
|
|
|
Packets in the third category are not acknowledged at all,
|
|
and the recipient of the ack should assume no knowledge
|
|
of their state. Since the Rx receive window should not
|
|
exceed the size of an ack packet, the sender shouldn't
|
|
have transmitted any packets in this category anyway.
|
|
|
|
* Round-trip time computation
|
|
|
|
To determine when packet retransmission is necessary, Rx
|
|
computes some statistics about the round-trip time between
|
|
the two hosts: exponentially-decaying averages of the
|
|
round-trip time and the standard deviation thereof. Each
|
|
acknowledgement packet which mentions a specific packet in
|
|
the <Serial> field and is not delayed is used to update the
|
|
round-trip statistics. First, the round-trip time for this
|
|
packet (R) is computed as the difference between the arrival
|
|
time of the ack packet and the time we transmitted the
|
|
packet with the serial number specified in <Serial>.
|
|
|
|
Next, the round-trip time average and standard deviation
|
|
values are updated. For instance, this algorithm could
|
|
be used:
|
|
|
|
RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4
|
|
RTTavg = RTTavg * (7/8) + R / 8
|
|
|
|
* Packet retransmission
|
|
|
|
In order to support reliable data transport, Rx must retransmit
|
|
packet which are lost in the network. This must not be done
|
|
too early, otherwise we might retransmit a packet whose first
|
|
copy is still in transit, thereby wasting bandwidth.
|
|
|
|
Rx computes a retransmit timeout value T, and retransmits any
|
|
packet which hasn't been positively acknowledged since last
|
|
transmission for at least T seconds. This timeout could be
|
|
computed as follows from the round-trip statistics above:
|
|
|
|
T = RTTavg + 4 * RTTdev + 0.350
|
|
|
|
This allows the packet to be up to 4 deviations late and still
|
|
not be retransmitted. The 350 msec fudge factor is used to
|
|
compensate for bursty networks, though it is likely becoming
|
|
less relevant (and accurate) with time.
|
|
|
|
A more clever algorithm could take into account the maximum
|
|
packet skew rate, and improve the retransmission strategy to
|
|
take into the account the likelihood that a given packet has
|
|
been reordered, and give it extra time before retransmission.
|
|
|
|
* Keepalive and Timeout
|
|
|
|
The upper layer (either the Rx RPC layer or the application)
|
|
have to specify a timeout, T, to the call layer. If the peer
|
|
is not heard from within T seconds, the call layer declares
|
|
the call to be dead and propagates the error to the upper
|
|
layer.
|
|
|
|
In order to determine whether the peer is still alive or not,
|
|
keepalive requests are used. These take form of an ack PING
|
|
and PING-RESPONSE packets. When the client has not received
|
|
any response from the server, either to the original request
|
|
or the keepalive requests, in T seconds, the call times out.
|
|
|
|
The following strategy may be used to determine when to send
|
|
keepalive requests:
|
|
|
|
Compute a keepalive timeout, KT = T/6
|
|
|
|
If the call was initiated KT seconds ago, or KT
|
|
seconds have passed since the last keepalive
|
|
request transmission, send a keepalive packet.
|
|
|
|
This strategy limits the number of transmitted keepalive
|
|
packets to a fixed number in the case of a dead server,
|
|
and proportional to the real timeout in case of a slow
|
|
server. It also allows up to 5 keepalives to be dropped
|
|
before the server is erroneously declared dead.
|
|
|
|
* Flow Control
|
|
|
|
Every Rx client or server has associated with each Rx call a
|
|
receive and transmit window. These windows indicate the number
|
|
of packets that haven't been fully acknowledged packets (that
|
|
is, not read by the peer's application) that an Rx sender can
|
|
have outstanding at any time. A sender's transmit window may
|
|
never be greater than it's peer's receive window for that call.
|
|
The receive windows are exchanged via the "Receive Window Size"
|
|
parameter in an Ack packet.
|
|
|
|
Rx ``sliding windows'' are similar to those used by TCP, except
|
|
they measure packets rather than bytes. Also, in TCP the window
|
|
effectively applies to bytes in flight between the two peers,
|
|
whileas in Rx the window applies to packets between the user
|
|
applications. For example, a transmit window of 8 on a certain
|
|
Rx connection means that at most 8 packets can be transmitted
|
|
and not yet read by the peer's application at any time. The
|
|
sequence number of the first packet that hasn't been read by
|
|
the application is indicated by the First Sequence field of
|
|
an Ack packet.
|
|
|
|
The selection of initial window sizes isn't strictly defined
|
|
by the Rx protocol, but here are a few things that one might
|
|
want to consider when choosing initial windows:
|
|
|
|
* A useful strategy can be to advertise a small receive
|
|
window until the application starts reading data, and
|
|
advertise a larger window afterwards.
|
|
|
|
* The transmit window should be initially a conservative
|
|
small value. Once an Ack packet is received, the peer's
|
|
advertised receive window can be used to choose a better
|
|
transmit window.
|
|
|
|
Rx uses the slow start, congestion avoidance, and fast recovery
|
|
algorithms[6]. The algorithms are modified to work in the context
|
|
of Rx packet-based transmission windows, and are described below.
|
|
|
|
These algorithms require two additional variables to be maintained
|
|
for each active Rx call: a congestion window, cwind, and a slow
|
|
start threshold, ssthresh.
|
|
|
|
Define a "negative ack" as an Ack packet that contains a negative
|
|
acknowledgement followed by a positive one. Similarly, define a
|
|
"positive ack" to be any Ack that is not negative. Upon receiving
|
|
three negative acks for a call in a row since the last congestion
|
|
avoidance attempt (if any), the Rx protocol enters congestion
|
|
avoidance for that Rx call.
|
|
|
|
* Slow start, congestion avoidance, and fast recovery algorithms
|
|
|
|
First, the congestion window, cwind, is initialized to 1.
|
|
The number of unread transmitted packets is now limited not
|
|
only by the transmission window, but also by the congestion
|
|
window. The latter limit is a little different: Rx may
|
|
send up to cwind packets (by sequence number) past the last
|
|
contiguous positively acknowledged packet. For example,
|
|
if an Ack packet indicates that packets 1, 2 and 8 were
|
|
received, and cwind is 2, Rx may transmit packets 3 and 4.
|
|
|
|
When congestion occurs (indicated by a negative ack or a
|
|
packet retransmission timeout), Rx enters congestion avoidance
|
|
and fast recovery. The slow-start threshold, ssthresh, is
|
|
set to half of the effective transmission window (minimum of
|
|
cwind and transmit window), but no less than 2 packets.
|
|
|
|
If triggered by a negative ack, any negatively acknowledged
|
|
packets should be retransmitted as soon as possible (i.e.
|
|
window-permitting).
|
|
|
|
If triggered by a retransmission timeout, the congestion
|
|
window is reset to a single packet.
|
|
|
|
When in fast-recovery mode, every additional negative ack
|
|
packet received causes cwind to be increased by one packet.
|
|
A positive ack packet causes cwind to be set to ssthresh,
|
|
and terminates fast recovery. At this point we are back
|
|
to congestion avoidance, since the cwind is half the original
|
|
transmission window.
|
|
|
|
When packet acknowledgements are received, the congestion
|
|
window should be increased. If cwind is less than ssthresh,
|
|
cwind should be increased by 1 for each newly acknowledged
|
|
packet. If cwind is at least ssthresh, cwind is increased
|
|
by 1 for each newly received Ack packet.
|
|
|
|
The size of the receive window should not grow past the size of
|
|
an Rx ack packet (which can acknowledge up to 255 packets at a
|
|
time.)
|
|
|
|
Debugging
|
|
=========
|
|
|
|
Rx provides for an optional debugging interface, using the Debug AFS
|
|
packet type, allowing remote Rx clients to query an Rx server for
|
|
some Rx protocol statistics. Not all implementations are required
|
|
to implement this interface. Some parts of this interface may also
|
|
be specific to a particular implementation of Rx. In order to prevent
|
|
packet loops, a server should only reply to debug packets with the
|
|
client-initiated flag set.
|
|
|
|
The payload of a debug request packet is always the same; both of
|
|
the 32-bit quantities are in network byte order:
|
|
|
|
0 1 2 3
|
|
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Debug Type |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Debug Index |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
The debug type indicates the kind of debug information being sent
|
|
or requested, and determines the format of the rest of the packet.
|
|
The debug index allows some debug types to export array-like data,
|
|
indexed by this field. The following debug types are defined for
|
|
the Transarc implementation:
|
|
|
|
0x01 Retrieve basic connection statistics
|
|
0x02 Get information about some connections
|
|
0x03 Get information about all connections
|
|
0x04 Get all Rx stats
|
|
0x05 Get all peers of this server
|
|
|
|
The index field in the debug packet indicates which element of the
|
|
debug information the client wants to access, in cases where there
|
|
are multiple entries in question.
|
|
|
|
The responses to each of those debug queries contain the following
|
|
information:
|
|
|
|
1. Retrieve basic connection stats
|
|
|
|
An array of general statistics about packet allocation,
|
|
server performance, and so on. The first octet in this
|
|
response represents the debug protocol version being used
|
|
by the server. See RX_DEBUGI_VERSION* in rx/rx.h.
|
|
|
|
2, 3. Get information about connections
|
|
|
|
Both of these calls return a struct rx_debugConn (see
|
|
rx/rx.h), indexed by the "index" field.
|
|
|
|
The first version of the debug call (type 2) only retrieves
|
|
information about connections which are deemed interesting,
|
|
that is, connections which are active, or about to be
|
|
reaped.
|
|
|
|
The end of the list is signaled by a response where the
|
|
connection ID value is 0xFFFFFFFF.
|
|
|
|
4. Get Rx stats
|
|
|
|
This call returns a struct rx_stats to the client in network
|
|
byte order, containing various statistics about the state of
|
|
Rx on the server (see rx/rx.h).
|
|
|
|
5. Get all Rx peers
|
|
|
|
Similar to the connection request above (2, 3) this call
|
|
returns all the Rx peers of the server (in a network-byte-order
|
|
struct rx_debugPeer), indexed by the index field in the request.
|
|
End of list is indicated by a host value of 0xFFFFFFFF. (These
|
|
are the first 4 octets.)
|
|
|
|
In response to unknown requests, the server returns 0xFFFFFFF8 in the
|
|
debug type field.
|
|
|
|
XXX The response interface should probably be fixed
|
|
to include a fixed header that indicates whether
|
|
the request was successfully completed.
|
|
|
|
Jumbograms
|
|
==========
|
|
|
|
To be able to transmit more data in a single packet, Rx supports
|
|
``jumbograms'', which are single UDP datagrams containing multiple
|
|
sequential Rx DATA packets. In a jumbogram, all packets except the
|
|
last one must be of a fixed maximal size (1412 bytes). Because all
|
|
the packets in the jumbogram are sequential, only one full header
|
|
is needed. Here is what a jumbogram could look like:
|
|
|
|
+-----------+---------------+--------------+---------------+
|
|
| Rx header | 1412 byte pkt | Short header | 1412 byte pkt | ->
|
|
+-----------+---------------+--------------+---------------+
|
|
|
|
+--------------+- -+-----------------------+
|
|
-> | Short header | ... | <= 1412 byte last pkt |
|
|
+--------------+- -+-----------------------+
|
|
|
|
Every Rx packet in a jumbogram except the first one must be preceeded
|
|
by the short Rx header, and all packets except the last one must have
|
|
the Jumbogram Rx flag set in their respective headers. The number of
|
|
packets in a jumbogram may not exceed the peer's advertised Max Packets
|
|
Per Jumbogram value in the Ack packet.
|
|
|
|
The maximum number of packets per jumbogram should be assumed to be 1
|
|
(i.e., no jumbograms) unless explicitly specified otherwise by an Ack
|
|
packet. If an Ack packet is received without the packet-per-jumbogram
|
|
field, it might indicate that the peer is now running a version of Rx
|
|
that does not support jumbograms, and therefore no jumbograms should
|
|
be sent until they are explicitly enabled again.
|
|
|
|
The short header in a jumbogram has the following makeup:
|
|
|
|
0 1
|
|
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Flags | Reserved |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Checksum |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
All the packets in the jumbogram have the same Rx header fields
|
|
(from the full Rx header) except for Flags, Checksum, Sequence,
|
|
and Serial. The flags and checksum field for subsequent packets
|
|
are taken from the short header preceeding that packet in the
|
|
jumbogram. The sequence and serial numbers are assumed to be
|
|
consecutive, and are incremented by 1 from the first packet in
|
|
the jumbogram (ie the full Rx header).
|
|
|
|
Retransmitted packets should not be sent in a jumbogram.
|
|
|
|
RPC Layer
|
|
=========
|
|
|
|
This section discusses how an RPC call is made using the Rx protocol.
|
|
There are two common ``types'' of Rx calls: simple and streaming.
|
|
These mostly reflect a difference in the upper-level API rather than
|
|
in the Rx protocol. A simple Rx call has a fixed number of input
|
|
variables and a fixed number of output variables. A streaming Rx
|
|
call, in addition to the above, allows the user to send and receive
|
|
arbitrary amounts of data (whose length should be specified as a
|
|
fixed-length argument.)
|
|
|
|
In either case, an Rx call consists of two basic stages: client
|
|
sending the data to the server, and server sending the response
|
|
back to the client. No data can be sent by the client in the
|
|
same call after the server has started sending its response.
|
|
|
|
Each remote function call associated with a particular Rx service
|
|
(identified by the IP-port-serviceId triplet, as mentioned above)
|
|
is assigned a 32-bit integer opcode number. To make a simple Rx
|
|
call, the caller must transmit the opcode number followed by the
|
|
expected arguments for that call over an Rx channel using XDR
|
|
encoding. The callee uses XDR to unmarshall the opcode and input
|
|
arguments, performs a function call corresponding to that opcode
|
|
and arguments, and then uses XDR to encode the return values back
|
|
to the caller. The caller then uses XDR to receive the output
|
|
variables.
|
|
|
|
For streaming calls which send data from the caller to the callee,
|
|
the convention is to include the length of the data to be sent as
|
|
one of the fixed-length arguments, and send the variable-length
|
|
data immediately after the fixed-length portion. For streaming
|
|
calls which receive data, the convention is for the callee to first
|
|
reply with a fixed-length field specifying the number of bytes it's
|
|
about to send, and then send those bytes. Upon completion of the
|
|
streaming part of the call, the output arguments are sent back to
|
|
the caller in fixed-length XDR form, as with simple calls.
|
|
|
|
Packet Formats and Protocol Constants
|
|
=====================================
|
|
|
|
* Rx packet
|
|
|
|
Every simple Rx packet has an Rx header, of the form below.
|
|
All quantities are in network byte order.
|
|
|
|
0 1 2 3
|
|
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|+| Connection Epoch |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Connection ID | * |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Call Number |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Sequence Number |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Serial Number |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Type | Flags | Status | Security |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Checksum | Service ID |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Payload ....
|
|
+-+-+-+-+-
|
|
|
|
[*] The field marked with * is the Channel ID. The last
|
|
two bits of the connection ID are used to multiplex
|
|
between 4 parallel calls.
|
|
|
|
[+] The bit marked with + is used to indicate that only
|
|
the connection ID should be used to identify this
|
|
connection, and sender host/port should not be used.
|
|
|
|
The values for the Flags field are defined as follows:
|
|
|
|
0000 0001 CLIENT-INITIATED
|
|
0000 0010 REQUEST-ACK
|
|
0000 0100 LAST-PACKET
|
|
0000 1000 MORE-PACKETS
|
|
0001 0000 - Reserved -
|
|
0010 0000 SLOW-START-OK
|
|
0010 0000 JUMBO-PACKET
|
|
|
|
Commonly, but not necessarily, the following value mappings
|
|
for the Security field are used:
|
|
|
|
0 No security or encryption
|
|
1 bcrypt security, only used in AFS 2.0
|
|
2 "krb4" rxkad
|
|
3 "krb4" rxkad with encryption (sometimes)
|
|
|
|
The following packet type values are defined:
|
|
|
|
1 DATA Standard data packet
|
|
2 ACK Acknowledgement of received data
|
|
3 BUSY Busy response
|
|
4 ABORT Abort packet
|
|
5 ACKALL Acknowledgement of all packets
|
|
6 CHALLENGE Challenge request
|
|
7 RESPONSE Challenge response
|
|
8 DEBUG Debug packet
|
|
9 PARAMS Exchange of parameters
|
|
10 PARAMS Exchange of parameters
|
|
11 PARAMS Exchange of parameters
|
|
12 PARAMS Exchange of parameters
|
|
13 VERSION Get AFS version
|
|
|
|
* Rx acknowledgement packet
|
|
|
|
0 1 2 3
|
|
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Buffer Space | Max Skew |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| First Sequence |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Reserved |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Serial |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Reason | Ack Count | Acknowledgements ...
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ..
|
|
|
|
... -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
... Acks | Reserved | Reserved |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Maximum Packet Size |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Recommended Packet Size |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Receive Window Size |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
| Max Packets per Jumbogram |
|
|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|
|
|
|
Note that the trailing fields can have arbitrary alignment,
|
|
determined by the number of individual acks in the packet.
|
|
There are three reserved octets between the variable acks
|
|
section and the start of the trailing fields; they also have
|
|
no particular alignment.
|
|
|
|
The valid values for the Reason code are:
|
|
|
|
1 REQUESTED
|
|
2 DUPLICATE
|
|
3 OUT-OF-SEQUENCE
|
|
4 WINDOW-EXCEEDED
|
|
5 NO-SPACE
|
|
6 PING
|
|
7 PING-RESPONSE
|
|
8 DELAYED
|
|
9 OTHER
|
|
|
|
Acknowledgements
|
|
================
|
|
|
|
Jeffrey Hutzelman <jhutz@cmu.edu> reviewed an early draft of this
|
|
specification, and provided much appreciated feedback on technical
|
|
details as well as document structuring.
|
|
|
|
Love Hornquist-Astrand <lha@stacken.kth.se> made many corrections
|
|
to this specification, especially regarding backwards-compatibility
|
|
with older Rx implementations.
|
|
|
|
References
|
|
==========
|
|
|
|
[1] /afs/sipb.mit.edu/contrib/doc/AFS/hijacking-afs.ps.gz
|
|
|
|
[2] OpenAFS: src/rx/
|
|
|
|
[3] /afs/sipb.mit.edu/contrib/doc/AFS/ps/rx-spec.ps
|
|
|
|
[4] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/r.vdoc
|
|
|
|
[5] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/rx.mss
|
|
|
|
[6] http://web.mit.edu/rfc/rfc2001.txt
|
|
|
|
$Id: rx-spec,v 1.22 2002/10/20 06:46:00 kolya Exp $
|