mirror of
https://git.openafs.org/openafs.git
synced 2025-01-18 15:00:12 +00:00
doc: add kolya's rx-spec to doc/txt
Add rx protocol spec and rx debug spec written by Nickolai Zeldovich.

Rx protocol specification draft (2002)
Nickolai Zeldovich, kolya@MIT.EDU

Change-Id: I65a9a83a8889503f3a82c8fde7a87f84d2736c8d
Reviewed-on: https://gerrit.openafs.org/12676
Tested-by: BuildBot <buildbot@rampaginggeek.com>
Reviewed-by: Benjamin Kaduk <kaduk@mit.edu>
parent
c6f5ebc4cf
commit
b8e8145fa9
921
doc/txt/rx-spec.txt
Normal file
@ -0,0 +1,921 @@
Rx protocol specification draft
Nickolai Zeldovich, kolya@MIT.EDU

Introduction
============

Rx is a client-server RPC protocol, an extended and combined version
of the older R and RFTP protocols. This document describes Rx, but
the details of Rx security protocols (such as Rxkad) are not specified.

Rx communicates via UDP datagrams on a user-specified port. Rx also
provides for multiplexing of Rx services on a single port, via a
16-bit service ID, akin to a port number, which identifies a
particular Rx service listening on a given port. Therefore, an Rx
service is identified by a triple of <IP address; UDP port number;
Rx service ID>.

The protocol is connection-oriented -- a client and a server must
first hand-shake and establish a connection before Rx calls can be
made. Said hand-shaking is implicit upon the first request if no
authentication is desired, or can consist of a pair of Challenge
and Response requests in order to establish authentication between
the client and the server.

Protocol Overview
=================

As mentioned above, Rx uses UDP/IP datagrams on a user-specified
port to communicate. An optional user-selectable authentication
and encryption method can be used to achieve desired security.
Each Rx server may provide multiple services, specified by the
Service ID. This allows for service multiplexing, much in the
same way as UDP port numbers allow for multiplexing of UDP
datagrams addressed to the same host.

Each client and server pair that wants to communicate using Rx must
establish an Rx connection, which can be thought of as a context
for all subsequent Rx activity between these two parties. An Rx
connection can only be associated with a single Rx service.

Each Rx connection context contains multiple channels, which are
used for data transmission and actually performing an RPC call.
The channels are independent of each other, allowing multiple
RPC calls to be performed to the same Rx server simultaneously.

An Rx call involves the transmission of call arguments over an Rx
channel to the server and reception of the reply data. For each
Rx call, an available Rx channel must be allocated exclusively to
that call. The channel cannot be used for anything else until the
call completes. After call completion, the channel may be reused
for subsequent Rx calls.

Rx Connections
==============

This section makes many references to fields of an Rx header; see
the ``Packet Formats'' section for specific layout of the Rx header.

The connection epoch is a unique value chosen by Rx on startup and
used by the peer both to identify connections to this host, and
to detect when this host's Rx restarts. An Rx connection between
two hosts is identified by:

    { Epoch, Connection ID, Peer IP, Peer Port },
        if the high bit of the epoch (+) is not set
    { Epoch, Connection ID },
        if the high bit of the epoch (+) is set

This means that if the high epoch bit is set, the recipient of a
packet should accept packets for this Rx connection from any IP
address and port number. Conversely, if the high bit is not set,
the IP and port number must be the same in order for packets to
be properly recognized as being part of the same connection.
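
The identification rule above can be sketched as a lookup-key function.
This is only an illustration: the function name and the EPOCH_HIGH_BIT
constant are hypothetical, and a 32-bit epoch field is assumed.

```python
# Sketch only: names are illustrative, not taken from any Rx implementation.
EPOCH_HIGH_BIT = 0x80000000  # assumes the epoch is a 32-bit field


def connection_key(epoch, conn_id, peer_ip, peer_port):
    """Return the tuple that identifies an Rx connection.

    If the high epoch bit is set, the peer address is ignored, so packets
    arriving from any IP address and port map to the same connection.
    """
    if epoch & EPOCH_HIGH_BIT:
        return (epoch, conn_id)
    return (epoch, conn_id, peer_ip, peer_port)
```

A receiver could use the returned tuple as a dictionary key when looking
up the connection for an incoming packet.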

The Connection ID is chosen by the client that establishes the
connection. The last (low-order) two bits of the same 32-bit field
are used by Rx to multiplex between 4 parallel calls on the same
connection. Each one of them is called an Rx channel, and therefore
the field is denoted "Channel ID".
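
As a rough illustration of the field split, the sketch below separates
the 32-bit field into its connection and channel parts. It assumes that
"last two bits" means the two low-order bits, and that the connection
part is kept in place (masked, not shifted); the names are hypothetical.

```python
# Sketch only: assumes the channel occupies the two low-order bits.
CHANNEL_BITS = 2
CHANNEL_MASK = (1 << CHANNEL_BITS) - 1  # 0x3, i.e. 4 channels


def split_cid(field):
    """Split the 32-bit field into (connection ID, channel number)."""
    conn_id = field & ~CHANNEL_MASK & 0xFFFFFFFF
    channel = field & CHANNEL_MASK
    return conn_id, channel
```

For example, split_cid(0x12345677) yields connection ID 0x12345674 and
channel 3.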

Call number identifies a particular call within a channel (so there
are four call numbers associated with an Rx connection). Each new
call should start with a higher number than the previous call, and
typically this is just the previous call number + 1. The initial
call number must be non-zero, since call number zero indicates a
connection-only Rx packet (see below). The call number is chosen
by the peer initiating the call. Although only one call can use
a channel at one time, the call number allows peers to distinguish
packets on the same channel that belong to different calls.

The sequence number is similar to the sequence number in TCP, but
instead of bytes it counts packets within a call. Sequence numbers
always start with 1 at the beginning of each call, and are incremented
by 1 for each additional packet sent. Retransmissions in Rx are done
on a packet-by-packet basis, identified by these sequence numbers.

Every outgoing packet associated with a certain connection is stamped
with a serial number in the serial field, and the serial number is
incremented by 1 for every packet sent. This is used by the flow
control mechanisms (described below). The serial number for a
connection should start out with 1 (i.e., the first packet sent
should have a serial number of 1).

Service ID identifies a particular Rx service running on a given
host/port combination. This is analogous to how UDP port numbers
allow multiplexing packets to a single IP address. Note that once
an Rx connection has been created, the service ID may not be changed;
existing implementations cache the service ID value for a given
connection, and will ignore service ID values in subsequent packets.

The Checksum field allows for an optional packet checksum. A zero
checksum field value means that checksums are not being computed.
An Rx security protocol (identified by the security field, described
below) may choose to use this field to transport some checksum of
the packet that is computed and verified by it (for example, rxkad
uses this field for a cryptographic header checksum). Rx itself
makes no use of the checksum field.

The status field allows for additional user flags to be transported
with each packet. These have no significance to the protocol itself.
These flags are associated with a call rather than an individual
packet.

The security field specifies the type of security in use on this
connection. These values don't have a defined mapping in the Rx
protocol but rather are mapped to specific Rx security types by
the application using Rx.

An Rx security protocol can use the checksum field as described
above, and can also modify the packet payload in any way, for
instance by encrypting the contents or adding headers or trailers
specific to the security protocol (although the end result must
be a properly sized packet that Rx will be able to transmit).

The "Flags" field consists of a number of single-bit flags with
meanings as follows. The actual bit values are defined below,
in the ``Protocol Constants'' section.

    * CLIENT-INITIATED
        This packet originated from an Rx client (as opposed
        to server). To avoid packet loops, a server should
        always clear the CLIENT-INITIATED flag on any packets
        it sends, and discard incoming packets without the
        CLIENT-INITIATED flag.

    * REQUEST-ACK
        Sender is requesting acknowledgement of this packet,
        via an Ack packet response.

    * LAST-PACKET
        This packet is the last packet in this call from the
        sender.

        NOTE: some older Rx implementations, which do not
        support the trailing packet size fields in Rx Ack
        packets, use the LAST-PACKET flag for computing the
        MTU. In particular, when a DATA packet with the
        REQUEST-ACK flag but without the LAST-PACKET flag
        is received, the MTU is adjusted down to the size
        of that packet.

    * MORE-PACKETS
        More packets are going to be following this one. This
        flag is set on all but the last packet by the sender
        transmitting a list of packets at once, for possible
        optimization at the receiver end.

    * SLOW-START-OK
        In an ack packet, indicates that the sender of this
        packet supports the slow-start mechanism, described
        below under ``Flow Control''.

    * JUMBO-PACKET
        In a data packet, indicates that this packet is part
        of a jumbogram, and is not the last one. See the
        ``Jumbograms'' section below for more details.
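
The loop-avoidance rule for CLIENT-INITIATED can be sketched as two
small helpers. The bit value 0x01 and the 8-bit width of the flags
field are assumptions made for this illustration; the authoritative
values are those given in the ``Protocol Constants'' section.

```python
# Sketch only: the bit value below is an assumption for illustration.
CLIENT_INITIATED = 0x01


def server_should_accept(flags):
    """A server discards packets that lack the CLIENT-INITIATED flag."""
    return bool(flags & CLIENT_INITIATED)


def server_outgoing_flags(flags):
    """A server clears CLIENT-INITIATED on every packet it sends."""
    return flags & ~CLIENT_INITIATED & 0xFF  # assumes an 8-bit flags field
```

Because a server never sets the flag and never accepts packets without
it, two servers cannot bounce replies off each other indefinitely.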

Packet Types
============

The "Type" field indicates the contents of this packet. Actual
values are specified in the ``Protocol Constants'' section.
This section describes the simpler packet types, and subsequent
sections cover more complex packet types in more detail.

Certain packet types are connection-only requests (that is, they
are not associated with an RPC call). A connection-only request
is indicated by a zero call number. Valid packet types in a
connection-only context are Abort, Challenge, Response, Debug,
Version, and the parameter exchange packet types. All other
packets can only be used in the context of a call. Additionally,
Abort can be used in both a connection and a call context.
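
The context rule above can be expressed as a small validity check. The
sketch uses symbolic type names rather than wire values (those are in
the ``Protocol Constants'' section); "PARAMS" stands in for the
parameter exchange types, and the function name is hypothetical.

```python
# Sketch only: symbolic names are used instead of numeric packet types.
CONNECTION_ONLY_TYPES = {
    "ABORT", "CHALLENGE", "RESPONSE", "DEBUG", "VERSION", "PARAMS",
}
CALL_TYPES = {"DATA", "ACK", "BUSY", "ACKALL", "ABORT"}


def type_allowed(packet_type, call_number):
    """Check a packet type against its context.

    A zero call number marks a connection-only packet; any other value
    means the packet belongs to a call. ABORT is valid in both contexts.
    """
    if call_number == 0:
        return packet_type in CONNECTION_ONLY_TYPES
    return packet_type in CALL_TYPES
```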

The payload of the packet following the header depends on the
value of the Type field, as follows:

    * DATA type (Standard data packet)

        The payload of a data packet is simply the Rx payload,
        corresponding to the sequence number and call specified
        in the header. The actual data that is transmitted in
        Rx data packets is described below.

        The receipt of a data packet by a client implicitly
        acknowledges that the server has received and processed
        all the packets that have been transmitted to it as
        part of this call.

    * ACK type (Acknowledgement of received data)

        An acknowledgement packet provides information about
        which packets were or were not received by the peer,
        and other useful parameters. The semantics of these
        packets are described below in the ``Call Layer''
        section.

    * BUSY type (Busy response)

        When a client tries to start a new call on a channel
        which the server still considers active, a busy response
        is returned. The call and channel number in the packet
        header indicate which call is being rejected. This packet
        type has no payload associated with it.

    * ABORT type (Abort packet)

        Indicates that the relevant connection or call (if the
        call number field is non-zero) has encountered an error
        and has been terminated. The payload of the packet has
        a network-byte-order 32-bit user error code.

    * ACKALL type (Acknowledgement of all packets)

        An acknowledge-all packet indicates the obvious: the peer
        wants to acknowledge the receipt of all packets sent to
        it. This could be used, for example, when a connection
        is being closed and the client wants to ensure that no
        retransmissions are attempted after it exits.

        There is no payload associated with an acknowledge-all
        packet.

    * CHALLENGE, RESPONSE types (Challenge request/response)

        The payload of the packet is security-layer-specific
        data, and is used to authenticate an Rx connection.

        Perhaps this should include a reference to some spec
        on rxkad (or rxkad should just be added to this spec.)

    * DEBUG type (Debug packet)

        Rx supports an optional debugging interface; see the
        ``Debugging'' section below for more details.

    * PARAMS types (Parameter exchange)

        These types were assigned in AFS 3.2 but never used for
        anything, and therefore have no protocol significance
        at this time.

    * VERSION type (Get AFS version)

        If a server receives a packet with a type value of 13, and
        the client-initiated flag set, it should respond with a
        65-byte payload containing a string that identifies the
        version of AFS software it is running. The response should
        not have the client-initiated flag set.

        Nothing should respond to a version packet without the
        client-initiated flag, to avoid infinite packet loops.

Call Layer
==========

The call layer provides a reliable data transport over an
Rx channel, and is used by the RPC layer to make Rx calls.
One of the most important pieces of the call layer is the
Rx acknowledgement packet. The acknowledgement packet is
used by Rx to determine when retransmissions are needed,
as well as determining the proper transmission / receiving
parameters to use (such as the transmit window size and
jumbogram length, described in more detail below).

A new call is established by the client simply sending a
data packet to the server on an available channel. Either
side can indicate that they have no more data to send by
setting the LAST-PACKET flag in their last Rx packet. The
call remains open until the upper layer informs Rx that it
is done with the call. (The upper layer in this case would
most likely be the Rx RPC layer.)

The structure of an Rx acknowledgement packet is described
in the Packet Formats section. We will refer to particular
fields of the acknowledgement packet here by names.

The <Buffer Space> field specifies the number of packets that
the sender of the acknowledgement is willing to provide for
receiving packets for this call. The sender, presumably,
should not send packets beyond the number specified here,
without receiving further acknowledgement allowing it.

The <Max Skew> field indicates the maximum packet skew that
the sender of this packet has seen for this call. If a
packet is received N packets later than expected (based
on the packet's serial number, i.e. if the last received
packet's serial number is N higher than this packet's),
then it is defined to have a skew of N. This can be used
to avoid retransmission because of packet reordering.
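
A minimal sketch of skew tracking follows. It interprets "the last
received packet's serial number" as the highest serial number seen so
far, which is an assumption of this example rather than something the
text states; the class and attribute names are hypothetical.

```python
# Sketch only: compares against the highest serial seen so far (an
# assumption; the text speaks of "the last received packet").
class SkewTracker:
    def __init__(self):
        self.highest_serial = 0
        self.max_skew = 0

    def on_packet(self, serial):
        """Record an arriving serial number and return that packet's skew."""
        skew = max(0, self.highest_serial - serial)
        self.max_skew = max(self.max_skew, skew)
        self.highest_serial = max(self.highest_serial, serial)
        return skew
```

With arrivals in the order 1, 2, 5, 3, the packet with serial 3 arrives
after serial 5 and so has a skew of 2.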

The <First Sequence> number specifies the sequence number of
the first packet that is being explicitly acknowledged (either
positively or negatively) by this packet. All packets with
sequence numbers smaller than this are implicitly acknowledged.

The <Reserved> field, previously used to indicate the previous
received packet, is no longer used. It should be set to zero
by the sender and not interpreted by the receiver.

The <Serial Number> field indicates the serial number of the
packet which has triggered this acknowledgement, or zero if there
is no such packet (i.e. the ack packet was delayed and should not
be used for round-trip time computation). The receiver should
note that any transmitted packets with a serial number less than
this, which are not acknowledged by this packet, are likely lost
or reordered. Thus, these packets should be retransmitted, after
a possible delay to allow for packet reordering (as measured by
packet skew).

The trailing fields after the variable-length acknowledgements
section are not always 32-bit aligned with respect to the packet,
and aren't always present. (Their presence depends on the Rx
version of the peer.) The maximum and recommended packet sizes
are, respectively, the largest possible packet size that the peer
is willing to accept from us, and the size of the packet they
would prefer to receive. In absence of these fields, it should
be assumed that the maximum allowed packet size is 1444 bytes.

The receive window size indicates the size of the ACK sender's
receive window, in packets. Its use is described below in
the "Flow Control" section. If this field is absent, the
implementation must assume a maximum window size of 15 packets;
older implementations that do not support this trailing field
only allow for a window of 15 packets.

The "Max Packets per Jumbogram" field indicates how many packets
the ACK sender is willing to receive in a jumbogram (also
described below). All packets in a jumbogram are always of the
same size (except the last one), regardless of the maximum and
recommended packet sizes described above.

The <Reason> field specifies a particular type of an ack packet.
Valid reason codes are specified in the ``Packet Formats and
Protocol Constants'' section; their meanings are as follows:

    REQUESTED
        Acknowledgement was requested. The peer received
        a packet from us with the acknowledgement-requested
        flag set, and is acknowledging it.

    DUPLICATE
        A duplicate packet was received. The duplicate
        packet's serial number is in the <Serial> field.

    OUT-OF-SEQUENCE
        A packet was received out of sequence. The serial
        number of said packet is in the <Serial> field.

    WINDOW-EXCEEDED
        A packet was received but exceeded the current
        receive window, and was dropped.

    NO-SPACE
        A packet was received, but no buffer space was
        available and therefore it was dropped.

    PING
        This is a keep-alive packet, used to verify that
        the peer is still alive. If the REQUEST-ACK flag
        in the Rx packet is set, the recipient of this
        packet should reply with a PING-RESPONSE packet.

    PING-RESPONSE
        This is a response to a keep-alive ack (ping).

    DELAYED
        A delayed acknowledgement, usually because a certain
        amount of time has passed since the receipt of the
        last packet and there are outstanding unacknowledged
        packets. Should not be used for RTT computation.

    OTHER
        Un-delayed general acknowledgement, which does not
        fall in any of the above categories.

A peer should never delay the transmission of an ack packet
in response to a received packet unless it sets the delayed
ack type field. This is because ack packets (except for
delayed ones) are used for RTT computation by Rx.

All acknowledgement packets should have the REQUEST-ACK
flag in the Rx header turned off, except for PING type
ack packets.

The <Ack Count> field specifies the number of bytes following
in the acknowledgements section. Each of those bytes indicates
the acknowledgement status corresponding to a sequence number
between firstSequence and firstSequence+ackCount-1 inclusive.
There can be up to 255 bytes in the acknowledgements section.
Typically the ack count is the receive window size of the
ack packet sender, and the individual packet status bytes
correspond to the packets in the current receive window.
The values in each of those bytes can be as follows:

    0   Explicit negative acknowledgement: packet with the
        corresponding sequence number has not been received
        or has been dropped.
    1   Explicit acknowledgement: packet with the corresponding
        sequence number has been received but not processed by
        the application yet.

It's important to note the distinction between packets with
sequence numbers before firstSequence, between firstSequence
and firstSequence+ackCount-1, and those with sequence numbers
of at least firstSequence+ackCount. Those in the first category
have been passed up to the application level and the sender
(recipient of this ack) can recycle packets with such sequence
numbers.

Packets in the second category are individually acknowledged
in the acknowledgements section, either as being queued for
the application or not received. The recipient of the ack
should keep all packets with sequence numbers in this range,
but avoid retransmitting the positively acknowledged ones.
Negatively acknowledged packets should be retransmitted.
A more detailed explanation of the retransmit strategy is
given below.

Packets in the third category are not acknowledged at all,
and the recipient of the ack should assume no knowledge
of their state. Since the Rx receive window should not
exceed the size of an ack packet, the sender shouldn't
have transmitted any packets in this category anyway.
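
The three categories can be illustrated with a short classifier. The
function and the category labels it returns are hypothetical names
chosen for this sketch; only the firstSequence / ackCount arithmetic
and the 0/1 status values come from the text.

```python
# Sketch only: labels and names are illustrative.
def classify(seq, first_sequence, acks):
    """Classify a sequence number against one ack packet.

    `acks` holds the per-packet status bytes of the acknowledgements
    section (0 = negative, 1 = positive); its length is the ack count.
    """
    if seq < first_sequence:
        return "delivered"   # passed to the application; may be recycled
    index = seq - first_sequence
    if index < len(acks):
        # keep the packet; retransmit only if negatively acknowledged
        return "acked" if acks[index] == 1 else "nacked"
    return "unknown"         # beyond the ack; assume nothing
```

For an ack with firstSequence 5 and status bytes 1, 0, 1, packet 4 is
"delivered", 5 and 7 are "acked", 6 is "nacked", and 8 is "unknown".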

* Round-trip time computation

To determine when packet retransmission is necessary, Rx
computes some statistics about the round-trip time between
the two hosts: exponentially-decaying averages of the
round-trip time and the standard deviation thereof. Each
acknowledgement packet which mentions a specific packet in
the <Serial> field and is not delayed is used to update the
round-trip statistics. First, the round-trip time for this
packet (R) is computed as the difference between the arrival
time of the ack packet and the time we transmitted the
packet with the serial number specified in <Serial>.

Next, the round-trip time average and standard deviation
values are updated. For instance, this algorithm could
be used:

    RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4
    RTTavg = RTTavg * (7/8) + R / 8
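
The update can be sketched as follows. The function name is
hypothetical, and the deviation is computed from the old average
because it appears first in the formulas above; initial values for
RTTavg and RTTdev are not specified here and are left to the caller.

```python
# Sketch only: a direct transcription of the two update formulas.
def update_rtt(rtt_avg, rtt_dev, r):
    """Return (new RTTavg, new RTTdev) after observing round-trip time r."""
    new_dev = rtt_dev * (3 / 4) + abs(rtt_avg - r) / 4
    new_avg = rtt_avg * (7 / 8) + r / 8
    return new_avg, new_dev
```

For example, starting from an average of 100 ms and a deviation of
20 ms, a 180 ms sample moves the average to 110 ms and the deviation to
35 ms.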

* Packet retransmission

In order to support reliable data transport, Rx must retransmit
packets which are lost in the network. This must not be done
too early, otherwise we might retransmit a packet whose first
copy is still in transit, thereby wasting bandwidth.

Rx computes a retransmit timeout value T, and retransmits any
packet which hasn't been positively acknowledged since last
transmission for at least T seconds. This timeout could be
computed as follows from the round-trip statistics above:

    T = RTTavg + 4 * RTTdev + 0.350

This allows the packet to be up to 4 deviations late and still
not be retransmitted. The 350 msec fudge factor is used to
compensate for bursty networks, though it is likely becoming
less relevant (and accurate) with time.
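
A minimal sketch of the timeout computation and the retransmit test is
shown below. All times are in seconds; the function names and the
per-packet bookkeeping arguments are hypothetical.

```python
# Sketch only: times are in seconds; names are illustrative.
FUDGE_SECONDS = 0.350


def retransmit_timeout(rtt_avg, rtt_dev):
    """T = RTTavg + 4 * RTTdev + 0.350"""
    return rtt_avg + 4 * rtt_dev + FUDGE_SECONDS


def needs_retransmit(last_sent, now, acked, timeout):
    """True if an unacknowledged packet has waited at least `timeout`."""
    return (not acked) and (now - last_sent) >= timeout
```

With RTTavg = 110 ms and RTTdev = 35 ms, the timeout works out to
0.110 + 0.140 + 0.350 = 0.600 seconds.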

A more clever algorithm could take into account the maximum
packet skew rate, and improve the retransmission strategy to
take into account the likelihood that a given packet has
been reordered, and give it extra time before retransmission.

* Keepalive and Timeout

The upper layer (either the Rx RPC layer or the application)
has to specify a timeout, T, to the call layer. If the peer
is not heard from within T seconds, the call layer declares
the call to be dead and propagates the error to the upper
layer.

In order to determine whether the peer is still alive or not,
keepalive requests are used. These take the form of ack PING
and PING-RESPONSE packets. When the client has not received
any response from the server, either to the original request
or the keepalive requests, in T seconds, the call times out.

The following strategy may be used to determine when to send
keepalive requests:

    Compute a keepalive timeout, KT = T/6

    If the call was initiated KT seconds ago, or KT
    seconds have passed since the last keepalive
    request transmission, send a keepalive packet.

This strategy limits the number of transmitted keepalive
packets to a fixed number in the case of a dead server,
and proportional to the real timeout in case of a slow
server. It also allows up to 5 keepalives to be dropped
before the server is erroneously declared dead.
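
The keepalive strategy can be sketched as two predicates. The function
names and the convention of passing None when no keepalive has been
sent yet are assumptions of this example; times are in seconds.

```python
# Sketch only: names and the None convention are illustrative.
def keepalive_due(now, call_start, last_keepalive, timeout):
    """True when a keepalive (ack PING) should be sent.

    KT = T/6 is measured from the start of the call until the first
    keepalive, and from the previous keepalive afterwards.
    """
    kt = timeout / 6
    reference = call_start if last_keepalive is None else last_keepalive
    return now - reference >= kt


def call_timed_out(now, last_heard, timeout):
    """True when nothing has been heard from the peer for T seconds."""
    return now - last_heard >= timeout
```

With T = 60 seconds, a keepalive is sent every 10 seconds, so a dead
server is detected after at most six keepalives have gone unanswered.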

* Flow Control

Every Rx client or server has associated with each Rx call a
receive and transmit window. These windows indicate the number
of packets that haven't been fully acknowledged (that is, not
read by the peer's application) that an Rx sender can have
outstanding at any time. A sender's transmit window may
never be greater than its peer's receive window for that call.
The receive windows are exchanged via the "Receive Window Size"
parameter in an Ack packet.

Rx ``sliding windows'' are similar to those used by TCP, except
they measure packets rather than bytes. Also, in TCP the window
effectively applies to bytes in flight between the two peers,
whereas in Rx the window applies to packets between the user
applications. For example, a transmit window of 8 on a certain
Rx connection means that at most 8 packets can be transmitted
and not yet read by the peer's application at any time. The
sequence number of the first packet that hasn't been read by
the application is indicated by the First Sequence field of
an Ack packet.

The selection of initial window sizes isn't strictly defined
by the Rx protocol, but here are a few things that one might
want to consider when choosing initial windows:

    * A useful strategy can be to advertise a small receive
      window until the application starts reading data, and
      advertise a larger window afterwards.

    * The transmit window should be initially a conservative
      small value. Once an Ack packet is received, the peer's
      advertised receive window can be used to choose a better
      transmit window.

Rx uses the slow start, congestion avoidance, and fast recovery
algorithms[6]. The algorithms are modified to work in the context
of Rx packet-based transmission windows, and are described below.

These algorithms require two additional variables to be maintained
for each active Rx call: a congestion window, cwind, and a slow
start threshold, ssthresh.

Define a "negative ack" as an Ack packet that contains a negative
acknowledgement followed by a positive one. Similarly, define a
"positive ack" to be any Ack that is not negative. Upon receiving
three negative acks for a call in a row since the last congestion
avoidance attempt (if any), the Rx protocol enters congestion
avoidance for that Rx call.

* Slow start, congestion avoidance, and fast recovery algorithms

First, the congestion window, cwind, is initialized to 1.
The number of unread transmitted packets is now limited not
only by the transmission window, but also by the congestion
window. The latter limit is a little different: Rx may
send up to cwind packets (by sequence number) past the last
contiguous positively acknowledged packet. For example,
if an Ack packet indicates that packets 1, 2 and 8 were
received, and cwind is 2, Rx may transmit packets 3 and 4.
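
The congestion-window limit in the example above can be sketched as
follows. The function name is hypothetical, the transmit window limit
is deliberately left out, and the sketch relies on sequence numbers
starting at 1 for each call.

```python
# Sketch only: applies the cwind limit alone, not the transmit window.
def sendable(received, cwind):
    """Sequence numbers Rx may transmit under the cwind limit.

    `received` is the set of positively acknowledged sequence numbers.
    Up to `cwind` sequence numbers past the last contiguous acknowledged
    packet are eligible, skipping any already acknowledged.
    """
    last_contiguous = 0  # sequence numbers start at 1 for each call
    while last_contiguous + 1 in received:
        last_contiguous += 1
    window = range(last_contiguous + 1, last_contiguous + cwind + 1)
    return [seq for seq in window if seq not in received]
```

Here sendable({1, 2, 8}, 2) returns [3, 4], matching the example in the
text.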

When congestion occurs (indicated by a negative ack or a
packet retransmission timeout), Rx enters congestion avoidance
and fast recovery. The slow-start threshold, ssthresh, is
set to half of the effective transmission window (minimum of
cwind and transmit window), but no less than 2 packets.

If triggered by a negative ack, any negatively acknowledged
packets should be retransmitted as soon as possible (i.e.
window-permitting).

If triggered by a retransmission timeout, the congestion
window is reset to a single packet.

When in fast-recovery mode, every additional negative ack
packet received causes cwind to be increased by one packet.
A positive ack packet causes cwind to be set to ssthresh,
and terminates fast recovery. At this point we are back
to congestion avoidance, since the cwind is half the original
transmission window.

When packet acknowledgements are received, the congestion
window should be increased. If cwind is less than ssthresh,
cwind should be increased by 1 for each newly acknowledged
packet. If cwind is at least ssthresh, cwind is increased
by 1 for each newly received Ack packet.
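
The rules above can be collected into a small state sketch. Several
points are assumptions made for this illustration rather than
statements from the text: the class and method names, the initial value
of ssthresh (set to the transmit window), the use of three consecutive
negative acks as the trigger, and the choice that a retransmission
timeout does not enter fast recovery.

```python
# Sketch only: see the lead-in for the assumptions this makes.
class CongestionState:
    NACK_THRESHOLD = 3  # negative acks in a row before reacting

    def __init__(self, transmit_window):
        self.transmit_window = transmit_window
        self.cwind = 1
        self.ssthresh = transmit_window  # assumed initial value
        self.nacks_in_row = 0
        self.fast_recovery = False

    def _halve_threshold(self):
        effective = min(self.cwind, self.transmit_window)
        self.ssthresh = max(2, effective // 2)

    def on_negative_ack(self):
        if self.fast_recovery:
            self.cwind += 1  # each further negative ack grows cwind
            return
        self.nacks_in_row += 1
        if self.nacks_in_row >= self.NACK_THRESHOLD:
            self._halve_threshold()
            self.fast_recovery = True
            self.nacks_in_row = 0

    def on_positive_ack(self, newly_acked):
        self.nacks_in_row = 0
        if self.fast_recovery:
            self.cwind = self.ssthresh  # leave fast recovery
            self.fast_recovery = False
        elif self.cwind < self.ssthresh:
            self.cwind += newly_acked   # slow start: per packet
        else:
            self.cwind += 1             # avoidance: per Ack packet

    def on_timeout(self):
        self._halve_threshold()
        self.cwind = 1
        self.fast_recovery = False
        self.nacks_in_row = 0
```

The caller would still cap the number of outstanding packets by both
cwind and the transmit window when deciding what to send.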

The size of the receive window should not grow past the size of
an Rx ack packet (which can acknowledge up to 255 packets at a
time.)

Debugging
=========

Rx provides for an optional debugging interface, using the Debug AFS
packet type, allowing remote Rx clients to query an Rx server for
some Rx protocol statistics. Not all implementations are required
to implement this interface. Some parts of this interface may also
be specific to a particular implementation of Rx. In order to prevent
packet loops, a server should only reply to debug packets with the
client-initiated flag set.

The payload of a debug request packet is always the same; both of
the 32-bit quantities are in network byte order:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Debug Type                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          Debug Index                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
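
Encoding and decoding this 8-byte request payload can be sketched with
Python's struct module, where "!" selects network byte order and "I" an
unsigned 32-bit integer. The function names are hypothetical, and the
Rx header that precedes the payload is not shown.

```python
# Sketch only: covers the debug request payload, not the Rx header.
import struct

DEBUG_REQUEST_FORMAT = "!II"  # two unsigned 32-bit fields, network order


def pack_debug_request(debug_type, debug_index):
    """Build the 8-byte debug request payload."""
    return struct.pack(DEBUG_REQUEST_FORMAT, debug_type, debug_index)


def unpack_debug_request(payload):
    """Return (debug type, debug index) from the first 8 payload bytes."""
    return struct.unpack(DEBUG_REQUEST_FORMAT, payload[:8])
```

For instance, pack_debug_request(2, 5) produces the bytes
00 00 00 02 00 00 00 05.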
|
||||
|
||||
The debug type indicates the kind of debug information being sent
or requested, and determines the format of the rest of the packet.
The debug index allows some debug types to export array-like data,
indexed by this field. The following debug types are defined for
the Transarc implementation:

    0x01  Retrieve basic connection statistics
    0x02  Get information about some connections
    0x03  Get information about all connections
    0x04  Get all Rx stats
    0x05  Get all peers of this server

The index field in the debug packet indicates which element of the
debug information the client wants to access, in cases where there
are multiple entries in question.

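The request payload described above can be sketched in a few lines
of Python. This is only an illustration: the function names are
hypothetical and are not part of any Rx implementation, and only the
8-octet payload is built, not the Rx header that must precede it.

```python
import struct

# Debug type codes from the list above (Transarc implementation).
DEBUG_TYPES = {
    0x01: "basic connection statistics",
    0x02: "some connections",
    0x03: "all connections",
    0x04: "all Rx stats",
    0x05: "all peers",
}

def pack_debug_request(debug_type, index=0):
    """Build the 8-octet payload: two 32-bit words in network byte order."""
    if debug_type not in DEBUG_TYPES:
        raise ValueError("unknown debug type: %#x" % debug_type)
    return struct.pack("!II", debug_type, index)

def unpack_debug_request(payload):
    """Return (debug_type, index) from the first 8 octets of a payload."""
    return struct.unpack("!II", payload[:8])

# Ask for the fourth entry (index 3) of the peer list.
request = pack_debug_request(0x05, index=3)
```

For types that do not export array-like data, the index can simply
be left at 0.
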
The responses to each of those debug queries contain the following
information:

    1. Retrieve basic connection stats

        An array of general statistics about packet allocation,
        server performance, and so on. The first octet in this
        response represents the debug protocol version being used
        by the server. See RX_DEBUGI_VERSION* in rx/rx.h.

    2, 3. Get information about connections

        Both of these calls return a struct rx_debugConn (see
        rx/rx.h), indexed by the "index" field.

        The first version of the debug call (type 2) only retrieves
        information about connections which are deemed interesting,
        that is, connections which are active or about to be
        reaped. The second version (type 3) retrieves information
        about all connections.

        The end of the list is signaled by a response where the
        connection ID value is 0xFFFFFFFF.

    4. Get Rx stats

        This call returns a struct rx_stats to the client in network
        byte order, containing various statistics about the state of
        Rx on the server (see rx/rx.h).

    5. Get all Rx peers

        Similar to the connection requests above (2, 3), this call
        returns the Rx peers of the server (each as a
        network-byte-order struct rx_debugPeer), indexed by the index
        field in the request. The end of the list is indicated by a
        host value of 0xFFFFFFFF; the host value occupies the first
        4 octets of the response.

In response to unknown requests, the server returns 0xFFFFFFF8 in the
debug type field.

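A client could walk the peer list along the following lines. This
is a sketch under stated assumptions: `query` is a hypothetical
callable that sends one debug request and returns the response
payload, and the unknown-request marker is assumed to appear in the
first 4 octets of the response (where the request carries its debug
type). The rx_debugPeer record itself is returned undecoded.

```python
import struct

END_OF_LIST = 0xFFFFFFFF      # host value that ends the peer list
UNKNOWN_REQUEST = 0xFFFFFFF8  # debug type returned for unknown requests
DEBUG_GET_PEERS = 0x05

def iter_peers(query):
    """Yield raw peer records until the end-of-list marker is seen.

    `query(debug_type, index)` is a hypothetical transport function
    returning the response payload as bytes.
    """
    index = 0
    while True:
        payload = query(DEBUG_GET_PEERS, index)
        (first_word,) = struct.unpack("!I", payload[:4])
        # Without a fixed response header, an error can only be told
        # apart from data by this magic value.
        if first_word == UNKNOWN_REQUEST:
            raise NotImplementedError("server rejected debug type 5")
        if first_word == END_OF_LIST:
            return
        yield payload
        index += 1

def fake_query(debug_type, index):
    """Stand-in for a real server: two peers, then the end marker."""
    hosts = [0x0A000001, 0x0A000002]
    host = hosts[index] if index < len(hosts) else END_OF_LIST
    return struct.pack("!I", host) + b"\x00" * 4

peers = list(iter_peers(fake_query))
```

The same loop shape applies to the connection queries (types 2 and
3), with the end-of-list test made on the connection ID instead.
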
    XXX The response interface should probably be fixed
        to include a fixed header that indicates whether
        the request was successfully completed.

Jumbograms
==========

To be able to transmit more data in a single packet, Rx supports
``jumbograms'', which are single UDP datagrams containing multiple
sequential Rx DATA packets. In a jumbogram, all packets except the
last one must be of a fixed maximal size (1412 bytes). Because all
the packets in the jumbogram are sequential, only one full header
is needed. Here is what a jumbogram could look like:

    +-----------+---------------+--------------+---------------+
    | Rx header | 1412 byte pkt | Short header | 1412 byte pkt | ->
    +-----------+---------------+--------------+---------------+

       +--------------+-   -+-----------------------+
    -> | Short header | ... | <= 1412 byte last pkt |
       +--------------+-   -+-----------------------+

Every Rx packet in a jumbogram except the first one must be preceded
by the short Rx header, and all packets except the last one must have
the JUMBO-PACKET flag set in their respective headers. The number of
packets in a jumbogram may not exceed the peer's advertised Max Packets
Per Jumbogram value in the Ack packet.

The maximum number of packets per jumbogram should be assumed to be 1
(i.e., no jumbograms) unless explicitly specified otherwise by an Ack
packet. If an Ack packet is received without the Max Packets Per
Jumbogram field, it might indicate that the peer is now running a
version of Rx that does not support jumbograms, and therefore no
jumbograms should be sent until they are explicitly enabled again.

The short header in a jumbogram has the following makeup:

     0                   1
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     Flags     |   Reserved    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |           Checksum            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

All the packets in the jumbogram have the same Rx header fields
(from the full Rx header) except for Flags, Checksum, Sequence,
and Serial. The flags and checksum fields for subsequent packets
are taken from the short header preceding that packet in the
jumbogram. The sequence and serial numbers are assumed to be
consecutive: each is incremented by 1 for every subsequent packet,
starting from the values given for the first packet in the
jumbogram (i.e., in the full Rx header).

Retransmitted packets should not be sent in a jumbogram.

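The layout above can be unpacked as in the following sketch. It
assumes the 28-octet full Rx header shown under "Packet Formats and
Protocol Constants" and the JUMBO-PACKET flag value 0x20 from the
flags table; the function and constant names are hypothetical, and
no checksum verification is attempted.

```python
import struct

RX_HEADER = struct.Struct("!IIIIIBBBBHH")  # 28-octet full Rx header
SHORT_HEADER = struct.Struct("!BxH")       # flags, reserved octet, checksum
JUMBO_PACKET_SIZE = 1412                   # size of every packet but the last
JUMBO_FLAG = 0x20                          # JUMBO-PACKET bit in Flags

def split_jumbogram(datagram):
    """Return a list of (seq, serial, flags, checksum, payload) tuples."""
    (_epoch, _cid, _call, seq, serial, _type, flags,
     _status, _security, checksum, _service) = RX_HEADER.unpack_from(datagram, 0)
    offset = RX_HEADER.size
    packets = []
    while flags & JUMBO_FLAG:
        end = offset + JUMBO_PACKET_SIZE
        if len(datagram) < end + SHORT_HEADER.size:
            raise ValueError("truncated jumbogram")
        packets.append((seq, serial, flags, checksum, datagram[offset:end]))
        # The short header supplies flags and checksum for the next
        # packet; sequence and serial numbers advance by one.
        flags, checksum = SHORT_HEADER.unpack_from(datagram, end)
        offset = end + SHORT_HEADER.size
        seq += 1
        serial += 1
    last = datagram[offset:]
    if len(last) > JUMBO_PACKET_SIZE:
        raise ValueError("last packet exceeds 1412 bytes")
    packets.append((seq, serial, flags, checksum, last))
    return packets
```

A datagram whose first header does not carry the JUMBO-PACKET flag
comes back as a single-element list, so the same function can be
applied to ordinary DATA packets.
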
RPC Layer
=========

This section discusses how an RPC call is made using the Rx protocol.
There are two common ``types'' of Rx calls: simple and streaming.
These mostly reflect a difference in the upper-level API rather than
in the Rx protocol. A simple Rx call has a fixed number of input
variables and a fixed number of output variables. A streaming Rx
call, in addition to the above, allows the user to send and receive
arbitrary amounts of data (whose length should be specified as a
fixed-length argument).

In either case, an Rx call consists of two basic stages: the client
sending the data to the server, and the server sending the response
back to the client. No data can be sent by the client in the
same call after the server has started sending its response.

Each remote function call associated with a particular Rx service
(identified by the IP-port-serviceId triplet, as mentioned above)
is assigned a 32-bit integer opcode number. To make a simple Rx
call, the caller must transmit the opcode number followed by the
expected arguments for that call over an Rx channel using XDR
encoding. The callee uses XDR to unmarshal the opcode and input
arguments, performs a function call corresponding to that opcode
and arguments, and then uses XDR to encode the return values back
to the caller. The caller then uses XDR to receive the output
variables.

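As an illustration of the marshalling step, the sketch below encodes
a call by hand using the basic XDR rules (4-octet big-endian
integers, and variable-length data prefixed by its length and padded
to a multiple of 4 octets). The opcode 42 and its argument list are
hypothetical; real opcodes and signatures are defined by each Rx
service.

```python
import struct

def xdr_uint(value):
    """XDR unsigned int: 4 octets, big-endian."""
    return struct.pack("!I", value)

def xdr_string(data):
    """XDR variable-length data: length word, bytes, zero padding to 4."""
    pad = (-len(data)) % 4
    return xdr_uint(len(data)) + data + b"\x00" * pad

def marshal_call(opcode, *encoded_args):
    """Opcode first, then the already-encoded arguments, in order."""
    return xdr_uint(opcode) + b"".join(encoded_args)

# Hypothetical opcode 42 taking an unsigned int and a string.
request = marshal_call(42, xdr_uint(7), xdr_string(b"afs"))

# The callee reads the opcode back before decoding the arguments.
(opcode,) = struct.unpack_from("!I", request, 0)
```

The resulting byte string is what would be written to the Rx call
and carried in the payload of one or more DATA packets.
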
For streaming calls which send data from the caller to the callee,
the convention is to include the length of the data to be sent as
one of the fixed-length arguments, and send the variable-length
data immediately after the fixed-length portion. For streaming
calls which receive data, the convention is for the callee to first
reply with a fixed-length field specifying the number of bytes it's
about to send, and then send those bytes. Upon completion of the
streaming part of the call, the output arguments are sent back to
the caller in fixed-length XDR form, as with simple calls.

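Both streaming conventions can be sketched as follows. The sketch
assumes that the length is carried as a 32-bit XDR unsigned integer
and that the streamed bytes are sent as-is, without XDR padding;
neither detail is fixed by the text above. The opcode and the
single trailing output argument are hypothetical.

```python
import io
import struct

def write_streaming_call(opcode, data):
    """Caller side of a sending call: opcode, length argument, then data."""
    return struct.pack("!II", opcode, len(data)) + data

def _read_exact(stream, count):
    """Read exactly `count` bytes or fail."""
    data = stream.read(count)
    if len(data) != count:
        raise EOFError("call ended before all expected bytes arrived")
    return data

def read_streaming_reply(stream):
    """Caller side of a receiving call: length word, then that many bytes.

    Whatever follows in `stream` is the fixed-length output arguments.
    """
    (length,) = struct.unpack("!I", _read_exact(stream, 4))
    return _read_exact(stream, length)

# Hypothetical opcode 130 sending three bytes to the callee.
outgoing = write_streaming_call(130, b"abc")

# A reply carrying 5 streamed bytes followed by one output argument.
reply = io.BytesIO(struct.pack("!I", 5) + b"hello" + struct.pack("!I", 0))
body = read_streaming_reply(reply)
(status,) = struct.unpack("!I", _read_exact(reply, 4))
```
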
Packet Formats and Protocol Constants
=====================================

* Rx packet

    Every simple Rx packet has an Rx header, of the form below.
    All quantities are in network byte order.

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |+|                      Connection Epoch                       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                       Connection ID                       | * |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                          Call Number                          |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                        Sequence Number                        |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         Serial Number                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |     Type      |     Flags     |    Status     |   Security    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |           Checksum            |          Service ID           |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Payload ....
    +-+-+-+-+-

    [*] The field marked with * is the Channel ID. The last
        two bits of the connection ID are used to multiplex
        between 4 parallel calls.

    [+] The bit marked with + is used to indicate that only
        the connection ID should be used to identify this
        connection, and sender host/port should not be used.

    The values for the Flags field are defined as follows.
    SLOW-START-OK and JUMBO-PACKET share the same bit, which is
    interpreted according to the packet type:

        0000 0001  CLIENT-INITIATED
        0000 0010  REQUEST-ACK
        0000 0100  LAST-PACKET
        0000 1000  MORE-PACKETS
        0001 0000  - Reserved -
        0010 0000  SLOW-START-OK (in Ack packets)
        0010 0000  JUMBO-PACKET (in Data packets)

    Commonly, but not necessarily, the following value mappings
    for the Security field are used:

        0  No security or encryption
        1  bcrypt security, only used in AFS 2.0
        2  "krb4" rxkad
        3  "krb4" rxkad with encryption (sometimes)

    The following packet type values are defined:

         1  DATA       Standard data packet
         2  ACK        Acknowledgement of received data
         3  BUSY       Busy response
         4  ABORT      Abort packet
         5  ACKALL     Acknowledgement of all packets
         6  CHALLENGE  Challenge request
         7  RESPONSE   Challenge response
         8  DEBUG      Debug packet
         9  PARAMS     Exchange of parameters
        10  PARAMS     Exchange of parameters
        11  PARAMS     Exchange of parameters
        12  PARAMS     Exchange of parameters
        13  VERSION    Get AFS version

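The header layout and the tables above translate directly into a
small decoder. The sketch below is illustrative only: the names are
hypothetical, and it assumes that the shared 0010 0000 flag bit is
read as SLOW-START-OK on ACK packets and as JUMBO-PACKET on DATA
packets.

```python
import struct
from collections import namedtuple

RX_HEADER = struct.Struct("!IIIIIBBBBHH")  # 28 octets, network byte order

RxHeader = namedtuple(
    "RxHeader",
    "epoch cid call_number sequence serial "
    "type flags status security checksum service_id")

PACKET_TYPES = {1: "DATA", 2: "ACK", 3: "BUSY", 4: "ABORT", 5: "ACKALL",
                6: "CHALLENGE", 7: "RESPONSE", 8: "DEBUG", 9: "PARAMS",
                10: "PARAMS", 11: "PARAMS", 12: "PARAMS", 13: "VERSION"}

FLAG_NAMES = {0x01: "CLIENT-INITIATED", 0x02: "REQUEST-ACK",
              0x04: "LAST-PACKET", 0x08: "MORE-PACKETS"}

def parse_header(packet):
    """Decode the fixed Rx header at the start of a packet."""
    return RxHeader._make(RX_HEADER.unpack_from(packet, 0))

def describe(header):
    """Interpret the packed fields of a decoded header."""
    names = [name for bit, name in FLAG_NAMES.items() if header.flags & bit]
    if header.flags & 0x20:
        if header.type == 2:
            names.append("SLOW-START-OK")
        elif header.type == 1:
            names.append("JUMBO-PACKET")
    return {
        "type": PACKET_TYPES.get(header.type, "unknown"),
        "flags": names,
        "connection": header.cid & 0xFFFFFFFC,  # upper 30 bits
        "channel": header.cid & 0x3,            # [*] last two bits
        "id_only": bool(header.epoch >> 31),    # [+] top bit of the epoch
    }
```

The payload, if any, starts at offset `RX_HEADER.size` (28) in the
datagram.
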
* Rx acknowledgement packet

     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |         Buffer Space          |           Max Skew            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                        First Sequence                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                           Reserved                            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                            Serial                             |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |    Reason     |   Ack Count   |  Acknowledgements ...
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ..

             ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           ... Acks |   Reserved    |           Reserved            |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                      Maximum Packet Size                      |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                    Recommended Packet Size                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                      Receive Window Size                      |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   Max Packets per Jumbogram                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Note that the trailing fields can have arbitrary alignment,
    determined by the number of individual acks in the packet.
    There are three reserved octets between the variable acks
    section and the start of the trailing fields; they also have
    no particular alignment.

    The valid values for the Reason code are:

        1  REQUESTED
        2  DUPLICATE
        3  OUT-OF-SEQUENCE
        4  WINDOW-EXCEEDED
        5  NO-SPACE
        6  PING
        7  PING-RESPONSE
        8  DELAYED
        9  OTHER

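Because the trailing fields are unaligned and may be absent, an Ack
parser has to walk the payload by offset. The following sketch shows
one way to do so; the names are hypothetical, the individual ack
octets are returned undecoded, and a missing Max Packets per
Jumbogram field is reported as 1, following the rule given in the
Jumbograms section.

```python
import struct

ACK_FIXED = struct.Struct("!HHIIIBB")  # 18 octets before the ack array

ACK_REASONS = {1: "REQUESTED", 2: "DUPLICATE", 3: "OUT-OF-SEQUENCE",
               4: "WINDOW-EXCEEDED", 5: "NO-SPACE", 6: "PING",
               7: "PING-RESPONSE", 8: "DELAYED", 9: "OTHER"}

TRAILER_FIELDS = ("max_packet_size", "recommended_packet_size",
                  "receive_window", "max_packets_per_jumbogram")

def parse_ack(payload):
    """Decode an Ack payload (the bytes following the Rx header)."""
    (buffer_space, max_skew, first_sequence, _reserved,
     serial, reason, ack_count) = ACK_FIXED.unpack_from(payload, 0)
    acks_end = ACK_FIXED.size + ack_count
    if len(payload) < acks_end:
        raise ValueError("ack array is truncated")
    ack = {
        "buffer_space": buffer_space,
        "max_skew": max_skew,
        "first_sequence": first_sequence,
        "serial": serial,
        "reason": ACK_REASONS.get(reason, "unknown"),
        "acks": payload[ACK_FIXED.size:acks_end],
    }
    # Three reserved octets, then up to four unaligned 32-bit fields.
    offset = acks_end + 3
    for name in TRAILER_FIELDS:
        if len(payload) < offset + 4:
            break
        (ack[name],) = struct.unpack_from("!I", payload, offset)
        offset += 4
    # No field means no jumbograms until the peer says otherwise.
    ack.setdefault("max_packets_per_jumbogram", 1)
    return ack
```
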
Acknowledgements
================

Jeffrey Hutzelman <jhutz@cmu.edu> reviewed an early draft of this
specification, and provided much appreciated feedback on technical
details as well as document structuring.

Love Hornquist-Astrand <lha@stacken.kth.se> made many corrections
to this specification, especially regarding backwards-compatibility
with older Rx implementations.

References
==========

[1] /afs/sipb.mit.edu/contrib/doc/AFS/hijacking-afs.ps.gz

[2] OpenAFS: src/rx/

[3] /afs/sipb.mit.edu/contrib/doc/AFS/ps/rx-spec.ps

[4] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/r.vdoc

[5] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/rx.mss

[6] http://web.mit.edu/rfc/rfc2001.txt

$Id: rx-spec,v 1.22 2002/10/20 06:46:00 kolya Exp $