doc/draft-ietf-ppsp-grishchenko-swift.nroff

   1 .\" \# TD4  -- Set TOC depth by altering this value (TD5 = depth 5)\r
   2 .\" \# TOC\r
   3 .\" Auto generated Nroff by NroffEdit on April 12, 2010\r
   4 .pl 10.0i\r
   5 .po 0\r
   6 .ll 7.2i\r
   7 .lt 7.2i\r
   8 .nr LL 7.2i\r
   9 .nr LT 7.2i\r
  10 .ds LF Grishchenko\r
  11 .ds RF FORMFEED[Page %]\r
  12 .ds LH Internet-Draft\r
  13 .ds RH April 2010\r
  14 .ds CH swift\r
  15 .ds CF Expires October 12, 2010\r
  16 .hy 0\r
  17 .nh\r
  18 .ad l\r
  19 .in 0\r
  20 .nf\r
  21 .tl 'PPSP WG' 'V. Grishchenko'\r
  22 .tl 'Internet-Draft' 'TU Delft'\r
  23 .tl 'Intended status: Experimental''April 12, 2010'\r
  24 .tl 'Expires: October 12, 2010'''\r
  25 \r
  26 \r
  27 .fi\r
  28 .in 3\r
  29 .in 12\r
  30 .ti 8\r
  31 The Generic Multiparty Transport Protocol (swift) \%<draft-ietf-ppsp-grishchenko-swift-00.txt>\r
  32 \r
  33 .ti 0\r
  34 Abstract\r
  35 \r
  36 .ti 3\r
  37 The swift is a generic multiparty (swarming) transport protocol.\r
  38 \r
  39 .fi\r
  40 .in 3\r
  41 TCP, today's dominant transport protocol, is connection/conversation-oriented. But traffic-wise, the currently dominant usecase is content dissemination. There is a multitude of incompatible approaches to resolving that discrepancy above/below the transport layer: peer-to-peer, CDN, caches, mirrors, multicast, etc.\r
  42 The swift aims at creating a single unified content-centric transport protocol serving as a lingua-franca of content distribution.\r
  43 To implement that ultimate data cloud model, the protocol has to unify use cases of data download, video-on-demand and live streaming. It must work in the settings of client-server, peer-to-peer, CDN or \%peer-assisted networks, effectively blending those architectures.\r
  44 \r
  45 \r
  46 .ti 0\r
  47 Status of this memo\r
  48 \r
  49 .fi\r
  50 This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79.\r
  51 \r
  52 Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups.  Note that other groups may also distribute working documents as Internet-Drafts.\r
  53 \r
  54 Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time.  It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."\r
  55 \r
  56 The list of current Internet-Drafts can be accessed at \%http://www.ietf.org/ietf/1id-abstracts.txt.\r
  57 \r
  58 The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.\r
  59 \r
  60 \r
  61 .nf\r
  62 Copyright (c) 2010 IETF Trust and the persons identified as the\r
  63 document authors.  All rights reserved.\r
  64 This document is subject to BCP 78 and the IETF Trust's Legal\r
  65 Provisions Relating to IETF Documents\r
  66 \%(http://trustee.ietf.org/license-info) in effect on the date of\r
  67 publication of this document.  Please review these documents\r
  68 carefully, as they describe your rights and restrictions with respect\r
  69 to this document.  Code Components extracted from this document must\r
  70 include Simplified BSD License text as described in Section 4.e of\r
  71 the Trust Legal Provisions and are provided without warranty as\r
  72 described in the Simplified BSD License.\r
  73 \r
  74 \r
  75 .ti 0\r
  76 Table of Contents\r
  77 \r
  78 1.  Requirements notation\r
  79 2.  Introduction\r
  80 3.  Design goals\r
  81 4.  swift subsystems and design choices\r
  82   4.1.  The atomic datagram principle\r
  83   4.2.  Handshake and multiplexing\r
  84   4.3.  Generic acknowledgments\r
  85   4.4.  Data integrity and on-demand Merkle hashes\r
  86   4.5.  Peer exchange and NAT hole punching\r
  87   4.6.  Data requests (HINTs)\r
  88   4.7.  Subsetting of the protocol\r
  89   4.8.  Directory lists\r
  90 5. Enveloping\r
  91   5.1.  IP\r
  92   5.2.  UDP\r
  93   5.3.  TCP\r
  94 6. Security Considerations\r
  95 7. Extensibility\r
  96 References\r
  97 Author's address\r
  98 \r
  99 \r
 100 .ti 0\r
 101 1.  Requirements notation\r
 102 \r
 103 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",\r
 104 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in\r
 105 this document are to be interpreted as described in [RFC2119].\r
 106 \r
 107 \r
 108 .ti 0\r
 109 2.  Introduction\r
 110 \r
 111 Historically, the Internet was based on end-to-end unicast;\r
 112 considering the failure of multicast, content dissemination was\r
 113 addressed by different technologies, which ultimately boiled down to\r
 114 maintaining and coordinating distributed replicas. On one hand,\r
 115 downloading from a nearby well-provisioned replica is somewhat faster\r
 116 and/or cheaper; on the other hand, it requires coordinating multiple\r
 117 parties (the data source, mirrors/CDN sites/peers, consumers). As\r
 118 the Internet progresses to richer and richer content, the overhead\r
 119 of peer/replica coordination becomes dwarfed by the mass of the\r
 120 download itself. Thus, the niche for multiparty transfers expands.\r
 121 Still, current, relevant technologies are tightly coupled to a\r
 122 single usecase or even infrastructure of a particular corporation.\r
 123 The mission of the project is to create a generic content-centric\r
 124 multiparty transport protocol to allow seamless, effortless data\r
 125 dissemination on the Net.\r
 126 \r
 127       | mirror-based   peer-assisted        peer-to-peer\r
 128 ------+----------------------------------------------------\r
 129 data  | SunSITE        CacheLogic VelociX   BitTorrent\r
 130 VoD   | YouTube        Azureus(+seedboxes)  SwarmPlayer\r
 131 live  | Akamai Str.    Octoshape, Joost     PPlive\r
 132                     TABLE 1. Usecases.\r
 133 \r
 134 .fi\r
 135 The protocol must be designed for maximum genericity: it must focus on the very core of the mission and contain no magic constants and no hardwired policies. Effectively, it is a set of messages that allows a peer to securely retrieve data from whatever sources are available, in parallel. The protocol must be able to run over IP as an independent transport protocol. For compatibility reasons, it must also run over UDP and TCP.\r
 136 \r
 137 \r
 138 .ti 0\r
 139 3.  Design goals\r
 140 \r
 141 .fi\r
 142 The technical focus of the swift protocol is to find the simplest solution involving the minimum set of primitives that is still sufficient to implement all the targeted usecases (see Table 1) and suitable for use in general-purpose software and hardware (e.g. a web browser or a set-top box). The five design goals for the protocol are:\r
 143 \r
 144 .nf\r
 145 1. Embeddable kernel-ready protocol.\r
 146 2. Embrace real-time streaming, in- and out-of-order download.\r
 147 3. Have short warm-up times.\r
 148 4. Traverse NATs transparently.\r
 149 5. Be extensible, allow for a multitude of implementations over\r
 150    diverse media, allow for drop-in pluggability.\r
 151 \r
 152 Later in the draft, the objectives are referenced as (1)-(5).\r
 153 \r
 154 .fi\r
 155 The goal of embedding (1) means that the protocol must be ready to function as a regular transport protocol inside a set-top box, a mobile device, a browser and/or in the kernel space. Thus, the protocol must have a light footprint, preferably lighter than that of TCP, in spite of the necessity to support numerous ongoing connections as well as to constantly probe the network for new possibilities. The practical overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We aim at <1KB per peer connected. Also, the amount of code necessary to make a basic implementation must be limited to 10KLoC of C. Otherwise, besides the resource considerations, maintaining and auditing the code might become prohibitively expensive.\r
 156 \r
 157 The support for all three basic usecases of real-time streaming, \%in-order download and out-of-order download (2) is necessary for the manifested goal of THE multiparty transport protocol as no single usecase dominates over the others.\r
 158 \r
 159 The objective of short warm-up times (3) is a matter of end-user experience; the playback must start as soon as possible. Thus any unnecessary initialization roundtrips and warm-up cycles must be eliminated from the transport layer.\r
 160 \r
 161 .fi\r
 162 Transparent NAT traversal (4) is absolutely necessary as at least 60% of today's users are hidden behind NATs. NATs severely affect connection patterns in P2P networks thus impacting performance and fairness [MOLNAT,LUCNAT].\r
 163 \r
 164 The protocol must define a common message set (5) to be used by implementations; it must not hardwire any magic constants, algorithms or schemes beyond that. For example, an implementation is free to use its own congestion control, connection rotation or reciprocity algorithms. Still, the protocol must enable such algorithms by supplying sufficient information. For example, trackerless peer discovery needs peer exchange messages, scavenger congestion control may need timestamped acknowledgments, etc.\r
 165 \r
 166 \r
 167 .ti 0\r
 168 4.  swift subsystems and design choices\r
 169 \r
 170 .fi\r
 171 To a large extent, the swift design is defined by the cornerstone decision\r
 172 to get rid of TCP and not to reinvent any TCP-like transports on\r
 173 top of UDP or otherwise. The requirements (1), (4), (5) make TCP a\r
 174 bad choice due to its high per-connection footprint, complex and\r
 175 less reliable NAT traversal and fixed predefined congestion control\r
 176 algorithms. Besides that, an important consideration is that no\r
 177 block of TCP functionality turns out to be useful for the general\r
 178 case of swarming downloads. Namely,\r
 179 .nf\r
 180   1. in-order delivery is less useful as peer-to-peer protocols\r
 181   often employ out-of-order delivery themselves and in either case\r
 182   \%out-of-order data can still be stored;\r
 183   2. reliable delivery/retransmissions are less useful because\r
 184   the same data might be requested from different sources; as\r
 185   in-order delivery is not required, packet losses might be\r
 186   patched up lazily, without stopping the flow of data;\r
 187   3. flow control is not necessary as the receiver is much less\r
 188   likely to be saturated with the data and even if so, that\r
 189   situation is perfectly detected by the congestion control;\r
 190   4. TCP congestion control is less useful as custom congestion\r
 191   control is often needed [LEDBAT].\r
 192 In general, TCP is built and optimized for a different usecase than\r
 193 we have with swarmed downloads. The abstraction of a "data pipe"\r
 194 orderly delivering some stream of bytes from one peer to another\r
 195 turned out to be irrelevant. In even more general terms, TCP\r
 196 supports the abstraction of pairwise _conversations_, while we need\r
 197 a content-centric protocol built around the abstraction of a cloud\r
 198 of participants disseminating the same _data_ in any way and order\r
 199 that is convenient to them.\r
 200 \r
 201 .fi\r
 202 Thus, the choice is to design a protocol that runs on top of unreliable datagrams. Instead of reimplementing TCP, we create a \%datagram-based protocol, completely dropping the sequential data stream abstraction. Removing unnecessary features of TCP makes it easier both to implement the protocol and to verify it; numerous TCP vulnerabilities were caused by complexity of the protocol's state machine. Still, we reserve the possibility to run swift on top of TCP or HTTP. The draft itself assumes swift-over-UDP implementation; the necessary adjustments to run the protocol over IP or TCP are listed in Sec. 5.\r
 203 \r
 204 Pursuing the maxim of making things as simple as possible but not simpler, we fit the protocol into the constraints of the transport layer by dropping all the transmission's technical metadata except for the content's root hash (compare that to metadata files used in BitTorrent). Elimination of technical metadata is achieved through the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file transfers and other techniques. As a result, a transfer is identified and bootstrapped by its root hash only.\r
 205 \r
 206 .fi\r
 207 To avoid the usual layering of positive/negative acknowledgment mechanisms we introduce a scale-invariant acknowledgment system (see Sec. 4.3). The system allows for aggregation and variable level of detail in requesting, announcing and acknowledging data, and serves \%in-order and out-of-order retrieval with equal ease.\r
 208 Besides the protocol's footprint, we also aim at lowering the size of a minimal useful interaction. Once a single datagram is received, it must be checked for data integrity, and then either dropped or accepted, consumed and relayed.\r
 209 \r
 210 .ti 0\r
 211 4.1.  The atomic datagram principle\r
 212 \r
 213 .fi\r
 214 Ideally, every datagram sent must be independent of other datagrams, so each datagram SHOULD be processed separately and the loss of one datagram MUST NOT disrupt the flow. Thus, a datagram carries zero or more messages, and neither messages nor message interdependencies should span over multiple datagrams. In particular, any data piece is verified using uncle hash chains; all hashes necessary for verifying data integrity are put into the same datagram as the data (Sec. 4.4). As a general rule, if some additional data is still missing to process a message within a datagram, the message SHOULD be dropped.\r
 215 \r
 216 .fi\r
 217 Each datagram starts with four bytes corresponding to the receiving channel number (Sec. 4.2). The rest of a datagram is a concatenation of messages. Each message within a datagram has fixed length, depending on the type of the message. The first byte of a message denotes its type. Integers are serialized in the network \%(big-endian) byte order. Variable-length messages, free-form text or JSON/bencoded objects are not allowed.\r
 218 Consider an example of an acknowledgment message (Sec. 4.3). It has message type of 2 and a payload of a four-byte integer (say, 1); it might be written in hex as: "02 00000001". Later in the document, a \%hex-like two-characters-per-byte notation is used to represent message formats.\r
 219 \r
 220 In case a datagram has a piece of data, a sender MUST always put the data message (type id 1) in the tail of a datagram. Such a message consists of type id, bin number (see Sec. 4.3) and the actual data. Normally there is 1 kilobyte of data, except the case when file size is not a multiple of 1024 bytes, so the tail packet is somewhat shorter. Example:\r
 221 .nf\r
 222 01 00000000 48656c6c6f20776f726c6421\r
 223 (This message accommodates an entire file: "Hello world!")\r
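
The layout of the data message above can be sketched in Python as
follows. This is only an illustration: the helper name data_message
is hypothetical, and the 4-byte width of the bin number is taken
from the example rather than from a normative statement.

```python
import struct

DATA = 1  # message type id of a data message, as stated above


def data_message(bin_number: int, payload: bytes) -> bytes:
    """Serialize type id, 32-bit big-endian bin number, then raw data."""
    if len(payload) > 1024:
        raise ValueError("a data message carries at most 1 kilobyte")
    return struct.pack(">BI", DATA, bin_number) + payload


message = data_message(0, b"Hello world!")
print(message.hex())
```

The printed hex string corresponds to the "Hello world!" message
shown above, without the spaces.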
 224 \r
 225 \r
 226 .ti 0\r
 227 4.2.  Handshake and multiplexing\r
 228 \r
 229 .fi\r
 230 For the sake of simplicity, one transfer always deals with one file only. Retrieval of large collections of files is done by retrieving a directory list file and then recursively retrieving files, which might also turn out to be directory lists (see Sec. 4.8). To distinguish different transfers between the same pair of peers, the protocol introduces an additional layer of multiplexing, the channels. "Channels" loosely correspond to TCP connections; "content" of a single "channel" is a single file. A channel is established with a handshake. To start a handshake, the initiating peer needs to know:\r
 231 .nf\r
 232 (1) the IP address of a peer\r
 233 (2) peer's UDP port and\r
 234 (3) the root hash of the content (see Sec. 4.4).\r
 235 .fi\r
 236 The handshake is made by a HANDSHAKE message, whose only payload is a channel number. HANDSHAKE message type is 0. The initiating handshake must be accompanied by the transfer's root hash, sent in the same datagram as a hash message (type 4) for bin 0x7FFFFFFF.\r
 237 \r
 238 The initiator sends the first datagram to its peer:\r
 239 .nf\r
 240    00000000  04 7FFFFFFF 1234123412341234123412341234123412341234\r
 241    00 00000011\r
 242 (to unknown channel, handshake from channel 0x11, initiating a\r
 243 transfer of a file with a root hash 123...1234)\r
 244 \r
 245 Peer's response datagram:\r
 246    00000011  00 00000022  03 00000003\r
 247 (peer to the initiator: use channel number 0x22 for this transfer;\r
 248 I also have first 4 kilobytes of the file, see Sec. 4.3)\r
 249 \r
 250 .fi\r
 251 At this point, the initiator knows that the peer really responds; for that purpose channel ids MUST be random enough to prevent easy guessing. So, the third datagram of a handshake MAY already contain some heavy payload. To minimize the number of initialization roundtrips, the first two datagrams MAY also contain some minor payload, e.g. a couple of HAVE messages roughly indicating the current progress of a peer or a HINT (see Sec. 4.6).\r
 252 .nf\r
 253    00000022\r
 254 (this is a simple zero-payload keepalive datagram consisting of\r
 255 a 4-byte channel id only. At this point both peers have the\r
 256 proof they really talk to each other; three-way handshake is\r
 257 complete)\r
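
As an illustration, the three handshake datagrams above could be
split into messages with a small Python parser. This is a sketch
only: parse_datagram and PAYLOAD_SIZE are hypothetical names, and
the payload sizes are assumptions read off the examples in this
draft (signed peak hashes, type 7, are left out because their
signature length is not stated).

```python
import struct

# Payload sizes in bytes per message type (assumed from the examples).
# The data message (type 1) is special: it consumes the rest.
PAYLOAD_SIZE = {
    0: 4,    # HANDSHAKE: channel number
    2: 8,    # ACK: bin number + timestamp
    3: 4,    # HAVE: bin number
    4: 24,   # hash: bin number + 20-byte SHA1
    5: 6,    # PEX "peer connected": IPv4 address + port
    6: 6,    # PEX "peer disconnected"
    8: 4,    # HINT: bin number
    9: 4,    # SWIFT_MSGTYPE_RCVD: bitmask
}


def parse_datagram(datagram: bytes):
    """Return (channel, [(type, payload), ...]) or raise ValueError."""
    if len(datagram) < 4:
        raise ValueError("datagram shorter than a channel number")
    (channel,) = struct.unpack_from(">I", datagram, 0)
    messages, pos = [], 4
    while pos < len(datagram):
        msg_type = datagram[pos]
        pos += 1
        if msg_type == 1:  # a data message is always the tail
            messages.append((1, datagram[pos:]))
            break
        size = PAYLOAD_SIZE.get(msg_type)
        if size is None or pos + size > len(datagram):
            raise ValueError("unknown or truncated message")
        messages.append((msg_type, datagram[pos:pos + size]))
        pos += size
    return channel, messages


first = bytes.fromhex("00000000" + "04" + "7FFFFFFF" + "1234" * 10
                      + "00" + "00000011")
response = bytes.fromhex("00000011" + "00" + "00000022"
                         + "03" + "00000003")
keepalive = bytes.fromhex("00000022")

for datagram in (first, response, keepalive):
    print(parse_datagram(datagram))
```

A caller would treat ValueError as "discard the datagram", in line
with the rule that invalid messages are dropped without a response.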
 258 \r
 259 .fi\r
 260 In general, no error codes or responses are used in the protocol; absence of any response indicates an error. Invalid messages are discarded. Explicit closing of a channel may be achieved by setting channel number to zero by a handshake message: 00 00000000.\r
 261 \r
 262 Simple NAT hole punching [SNP] introduces the scenario when both parties of the handshake are initiators. To avoid creation of two transfers in the case both initiating datagrams get through, both peers must then act as responding peers. Thus, once an initiating datagram is sent and another initiating "counter"-datagram is received, the initiating peer sends a response datagram with the same channel id as in the outstanding initiating datagram.\r
 263 \r
 264 \r
 265 .ti 0\r
 266 4.3.  Generic acknowledgments\r
 267 \r
 268 .nf\r
 269 Generic acknowledgments came out of the need to simplify the\r
 270 data addressing/requesting/acknowledging mechanics, which tends\r
 271 to become overly complex and multilayered with the conventional\r
 272 approach. Take BitTorrent+TCP tandem for example:\r
 273 \r
 274 1. The basic data unit is of course a byte of content in a file.\r
 275 2. BitTorrent's highest-level unit is a "torrent", physically a\r
 276 byte range resulting from concatenation of content files.\r
 277 3. A torrent is divided into "pieces", typically about a thousand\r
 278 of them. Pieces are used to communicate own progress to other\r
 279 peers. Pieces are also basic data integrity units, as the torrent's\r
 280 metadata includes SHA1 hash for every piece.\r
 281 4. The actual data transfers are requested and made in 16KByte\r
 282 units, named "blocks" or chunks.\r
 283 5. Still, one layer lower, TCP also operates with bytes and byte\r
 284 offsets which are totally different from the torrent's bytes and\r
 285 offsets, as TCP considers cumulative byte offsets for all content\r
 286 sent by a connection, be it data, metadata or commands.\r
 287 6. Finally, another layer lower, IP transfers independent datagrams\r
 288 (typically around a kilobyte), which TCP then reassembles into\r
 289 continuous streams.\r
 290 \r
 291 Obviously, such addressing schemes need lots of mappings; from\r
 292 piece number and block to file(s) and offset(s) to TCP sequence\r
 293 numbers to the actual packets and the other way around. Lots of\r
 294 complexity is introduced by mismatch of bounds: packet bounds are\r
 295 different from file, block or hash/piece bounds. The picture is\r
 296 typical for a codebase which was historically layered.\r
 297 \r
 298 To simplify this aspect, we employ a generic content addressing\r
 299 scheme based on binary intervals (abbreviated as "bins"). The base\r
 300 interval is a 1KB "packet", the top interval is the complete 2**63\r
 301 range.  Till Sec. 4.4.1, any file is considered to be 2**k bytes long.\r
 302 The binary tree of intervals is simple, well-understood, correlates\r
 303 well with machine representation of integers and the structure of\r
 304 Merkle hashes (Sec. 4.4). A novel addition to the classical scheme\r
 305 are "bin numbers", a scheme of numbering binary intervals which\r
 306 lays them out into a vector nicely. Bin numbering is done in the\r
 307 order of interval's "center", ascending, namely:\r
 308 \r
 309            7\r
 310      3          11\r
 311   1     5     9    13\r
 312 0  2  4  6   8 10 12 14\r
 313 \r
 314 .fi\r
 315 The number 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stands for an empty interval; 0x7FFF...FFF stands for "everything". In general, this numbering system makes it possible to work with simpler data structures, e.g. to use arrays instead of binary trees in many cases. As a minor convenience, it also makes it possible to use one integer instead of two to denote an interval. By requiring that every message uses bin numbers, we enforce genericity.\r
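
The bin numbering shown in the tree above can be expressed with a
few lines of integer arithmetic. The following Python sketch is an
illustration only; the function names are hypothetical, and the
all-ones "empty interval" value is not handled.

.nf
```python
def bin_number(layer: int, offset: int) -> int:
    """Bin at the given layer (0 = packets) and offset within it."""
    return (offset << (layer + 1)) + (1 << layer) - 1


def layer_of(b: int) -> int:
    """The layer of a bin equals its number of trailing 1 bits."""
    layer = 0
    while b & 1:
        b >>= 1
        layer += 1
    return layer


def parent(b: int) -> int:
    layer = layer_of(b)
    return (b | (1 << layer)) & ~(1 << (layer + 1))


def sibling(b: int) -> int:
    return b ^ (1 << (layer_of(b) + 1))


def packet_range(b: int):
    """Half-open range [first, last) of 1KB packets covered by a bin."""
    layer = layer_of(b)
    first = (b >> (layer + 1)) << layer
    return first, first + (1 << layer)


print(packet_range(3), packet_range(9), packet_range(12))
```
.fi

Under this formula, 0x7FFFFFFF is simply the bin at layer 31 and offset 0, which matches its reading as "everything" in the 32-bit case.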
 316 \r
 317 Back to the acknowledgment message. A HAVE message (type 3) states that the sending peer obtained the specified bin and successfully checked its integrity:\r
 318 .nf\r
 319 03 00000003\r
 320 (got/checked first four kilobytes of a file/stream)\r
 321 \r
 322 The data is acknowledged in terms of bins; as a result, every\r
 323 single packet is acknowledged a logarithmic number of times. That\r
 324 provides some necessary redundancy of acknowledgments and\r
 325 sufficiently compensates for the unreliability of datagrams. Compare\r
 326 that e.g. to TCP acknowledgments, which are (linearly) cumulative.\r
 327 For keeping the state information, an implementation MAY use the\r
 328 "binmap" data structure, which is a hybrid of a bitmap and a binary\r
 329 tree, discussed in detail in [BINMAP].\r
 330 An ACK message (type 2) acknowledges data that was received from\r
 331 its addressee; to facilitate delay-based congestion control, an\r
 332 ACK message contains a timestamp:\r
 333 \r
 334 02 00000002 12345678\r
 335 (got the second kilobyte of the file from you; my microsecond\r
 336 timer was showing 0x12345678 at that moment)\r
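
To illustrate the aggregation that bins allow, the sketch below
collapses a set of received packet indices into maximal bins, which
a peer could then announce. It is one possible approach, not a
prescribed algorithm; the names aggregate and layer_of are
hypothetical.

```python
def layer_of(b: int) -> int:
    """The layer of a bin equals its number of trailing 1 bits."""
    layer = 0
    while b & 1:
        b >>= 1
        layer += 1
    return layer


def aggregate(packets):
    """Collapse packet indices into maximal bins covering exactly them."""
    bins = {2 * p for p in packets}  # the layer-0 bin of packet p is 2*p
    merged = True
    while merged:
        merged = False
        for b in sorted(bins):
            if b not in bins:        # already merged in this pass
                continue
            layer = layer_of(b)
            sib = b ^ (1 << (layer + 1))
            if sib in bins:          # both halves present: use the parent
                bins -= {b, sib}
                bins.add((b | (1 << layer)) & ~(1 << (layer + 1)))
                merged = True
    return sorted(bins)


print(aggregate({0, 1, 2, 3}))           # first four kilobytes
print(aggregate({0, 1, 2, 3, 4, 5, 6}))  # first seven kilobytes
```

For the first four packets the result is the single bin 3; for the
first seven packets it is bins 3, 9 and 12.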
 337 \r
 338 \r
 339 .ti 0\r
 340 4.4.  Data integrity and on-demand Merkle hashes\r
 341 \r
 342 .fi\r
 343 The integrity checking scheme is unified for two usecases of download and streaming. Also, it works down to the level of a single datagram by employing Merkle hash trees [MERKLE]. Peers receive chains of uncle hashes just in time to check the incoming data. As metadata is restricted to just a single root hash, newcomer peers derive the size of a file from hashes. That functionality heavily depends on the concept of peak hashes, discussed in Sec. 4.4.1. Specifics related to the cases of file download and streaming are discussed in Sec. 4.4.2 and 4.4.3, respectively.\r
 344 \r
 345 Here, we discuss the common part of the workflow. As a general\r
 346 rule, the sender SHOULD prepend data with hashes which are\r
 347 necessary for verifying that data, no more, no less. While some\r
 348 optimistic optimizations are definitely possible, the receiver\r
 349 SHOULD drop data if it is impossible to verify it. Before sending a\r
 350 packet of data to the receiver, the sender inspects the receiver's\r
 351 previous acknowledgments to derive which hashes the receiver\r
 352 already has for sure.\r
 353 Suppose the receiver has acknowledged bin 1 (the first two kilobytes\r
 354 of the file); then it must already have uncle hashes 5, 11 and so\r
 355 on. That is because those hashes are necessary to check packets of\r
 356 bin 1 against the root hash. Then, hashes 3, 7 and so on must be\r
 357 also known as they are calculated in the process of checking the\r
 358 uncle hash chain. Hence, to send bin 12 (i.e. the 7th kilobyte of\r
 359 data), the sender needs to prepend hashes for bins 14 and 9, which\r
 360 let the data be checked against hash 11 which is already known to\r
 361 the receiver.\r
 362 The sender MUST put into the datagram the chain of uncle hashes\r
 363 necessary for verification of the packet, always before the data\r
 364 message itself, i.e.:\r
 365 \r
 366 .nf\r
 367 04 00000009 F01234567890ABCDEF1234567890ABCDEF123456\r
 368 04 0000000E 01234567890ABCDEF1234567890ABCDEF1234567\r
 369 (uncle hashes for the packet 12)\r
 370 01 0000000C DA1ADA1ADA1A...\r
 371 (packet 12 itself)\r
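
The selection of uncle hashes described above can be sketched as a
walk from the data bin towards the root that stops at the first
hash the receiver already knows. This is a simplified illustration:
uncle_chain, layer_of and the top_layer bound are hypothetical, and
the sketch does not skip siblings that the receiver may already
hold individually.

```python
def layer_of(bin_number: int) -> int:
    """The layer of a bin equals its number of trailing 1 bits."""
    layer = 0
    while bin_number & 1:
        bin_number >>= 1
        layer += 1
    return layer


def uncle_chain(bin_number: int, known: set, top_layer: int = 63) -> list:
    """Bins whose hashes must accompany bin_number, listed bottom-up."""
    chain = []
    current = bin_number
    while current not in known:
        layer = layer_of(current)
        if layer >= top_layer:
            raise ValueError("no known hash on the path to the root")
        chain.append(current ^ (1 << (layer + 1)))                 # sibling
        current = (current | (1 << layer)) & ~(1 << (layer + 1))  # parent
    return chain


# After acknowledging bin 1 the receiver knows hashes 5, 11, 3 and 7.
known_to_receiver = {5, 11, 3, 7}
print(uncle_chain(12, known_to_receiver))
```

For bin 12 the sketch yields bins 14 and 9, the same two hashes
carried by the datagram above (which lists them top-down).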
 372 \r
 373 .fi\r
 374 The sender MAY optimistically skip hashes which were sent out in previous (still unacknowledged) datagrams.\r
 375 It is an optimization tradeoff between redundant hash transmission and possibility of collateral data loss in the case some necessary hashes were lost in the network so some delivered data cannot be verified and thus has to be dropped.\r
 376 In either case, the receiver builds the Merkle tree on-demand, incrementally, starting from the root hash, and uses it for data validation.\r
 377 \r
 378 \r
 379 .ti 0\r
 380 4.4.1. Peak hashes\r
 381 \r
 382 .fi\r
 383 The concept of peak hashes enables two cornerstone features of swift:\r
 384 download/streaming unification and file size proving. Formally,\r
 385 peak hashes are hashes defined over filled bins, whose parent\r
 386 hashes are defined over incomplete (not filled) bins. Filled bin is\r
 387 a bin which does not extend past the end of the file, or, more\r
 388 precisely, contains no empty packets. Practically, we use peaks\r
 389 to cover the data range with logarithmic number of hashes,\r
 390 so each hash is defined over a "round" aligned 2^k interval.\r
 391 As an example, suppose a file is 7162 bytes long. That fits into\r
 392 7 packets, the tail packet being 1018 bytes long. The binary\r
 393 representation for 7 is 111. Here we might note that in general,\r
 394 every "1" in binary representation of the file's packet length\r
 395 corresponds to a peak hash. Namely, for this particular file we'll\r
 396 have three peaks, bin numbers 3, 9, 12.\r
 397 Thus, once a newcomer joins a swarm, the first peer who sends him\r
 398 data prepends it with peak hashes. The newcomer checks them against\r
 399 the root hash (see Sec 4.4.2).\r
 400 \r
 401 .nf\r
 402 04 00000003 1234567890ABCDEF1234567890ABCDEF12345678\r
 403 04 00000009 234567890ABCDEF1234567890ABCDEF123456789\r
 404 04 0000000C 34567890ABCDEF1234567890ABCDEF1234567890\r
 405 (this sequence of peak hashes proves that a file is 7KB long)\r
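
The correspondence between the "1" bits of the packet count and the
peak bins can be sketched as follows. The function names are
hypothetical; the 1024-byte packet size is the one used throughout
this draft.

```python
PACKET_SIZE = 1024


def packets_in(file_size: int) -> int:
    """Number of packets needed for file_size bytes (rounded up)."""
    return -(-file_size // PACKET_SIZE)


def peak_bins(packets: int) -> list:
    """Peak bins, left to right, for a file of the given packet count."""
    peaks, start = [], 0
    for layer in reversed(range(packets.bit_length())):
        if packets & (1 << layer):           # every 1 bit yields a peak
            offset = start >> layer
            peaks.append((offset << (layer + 1)) + (1 << layer) - 1)
            start += 1 << layer              # packets covered so far
    return peaks


print(peak_bins(packets_in(7162)))
```

For the 7162-byte file this gives bins 3, 9 and 12; a file of
exactly eight packets has the single peak 7.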
 406 \r
 407 \r
 408 .ti 0\r
 409 4.4.2. Hash trees for files\r
 410 \r
 411 The root hash covers the entire data range (2**63 bytes). Every\r
 412 hash in the tree is defined in the usual way, as a SHA1 hash of a\r
 413 concatenation of two \%lower-level SHA1 hashes, which correspond to\r
 414 left and right data \%half-ranges respectively. For example,\r
 415              hash_1 = SHA1 (hash_0+hash_2)\r
 416 where + stands for concatenation and hash_i stands for Merkle hash\r
 417 of the bin number i. Obviously, that does not hold for the\r
 418 \%base-layer hashes. Those are normal SHA1 hashes over 1KB data\r
 419 ranges ("packets"), except probably for the tail packet, which\r
 420 might have less than 1KB of data. The normal recursive formula does\r
 421 not apply to empty bins, i.e. bins that contain no data at all;\r
 422 their hashes are just zeros.\r
 423 \r
 424 Lemma. Peak hashes could be checked against the root hash.\r
 425 Proof. (a) Any peak hash is always the left sibling. Otherwise, be\r
 426 it the right sibling, its left neighbor/sibling must also be\r
 427 defined over a filled bin, so their parent is also defined over a\r
 428 filled bin, contradiction. (b) For the rightmost peak hash, its\r
 429 right sibling is zero. (c) For any peak hash, its right sibling\r
 430 might be calculated using peak hashes to the left and zeros for\r
 431 empty bins. (d) Once the right sibling of the leftmost peak hash\r
 432 is calculated, its parent might be calculated. (e) Once that parent\r
 433 is calculated, we might trivially get to the root hash by\r
 434 concatenating the hash with zeros and hashing it repeatedly.\r
 435 \r
 436 .fi\r
 437 Informally, the Lemma might be expressed as follows: peak hashes cover all data, so the remaining hashes are either trivial (zeros) or might be calculated from peak hashes and zero hashes.\r
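
The Lemma can be illustrated with a short Python sketch that folds
peak hashes into a root hash and compares the result with a root
computed directly from the data. The names are hypothetical, and
TOP_LAYER = 53 is an assumption derived from a 2**63 byte range
split into 1KB packets.

.nf
```python
import hashlib

PACKET = 1024      # bytes per base-layer packet
ZERO = bytes(20)   # hash of an empty bin
TOP_LAYER = 53     # assumption: 2**63 bytes / 2**10 bytes per packet


def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def bin_hash(data: bytes, layer: int, offset: int) -> bytes:
    """Merkle hash of the bin at (layer, offset), from the raw data."""
    start = (offset << layer) * PACKET
    if start >= len(data):
        return ZERO                      # empty bin
    if layer == 0:
        return sha1(data[start:start + PACKET])
    return sha1(bin_hash(data, layer - 1, 2 * offset) +
                bin_hash(data, layer - 1, 2 * offset + 1))


def root_from_peaks(peaks) -> bytes:
    """Fold (layer, hash) peaks, listed left to right, into the root."""
    layer, digest = peaks[-1]
    for peak_layer, peak_digest in reversed(peaks[:-1]):
        while layer < peak_layer:        # the right sibling is empty
            digest = sha1(digest + ZERO)
            layer += 1
        digest = sha1(peak_digest + digest)
        layer += 1
    while layer < TOP_LAYER:             # pad up to the root with zeros
        digest = sha1(digest + ZERO)
        layer += 1
    return digest


data = (b"swift" * 2000)[:7162]          # a 7162-byte example file
peaks = [(2, bin_hash(data, 2, 0)),      # bin 3
         (1, bin_hash(data, 1, 2)),      # bin 9
         (0, bin_hash(data, 0, 6))]      # bin 12
root = root_from_peaks(peaks)
print(root == bin_hash(data, TOP_LAYER, 0))
```
.fi

A newcomer that knows only the root hash could run the same fold over received peak hashes and accept them when the results match.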
 438 \r
 439 .nf\r
 440 Thus, once a peer gets peak hashes and checks them against the\r
 441 root hash, it learns the file size and it also gets practical\r
 442 anchors for building uncle chains during the transmission (as the\r
 443 root hash is too high in the sky). A newcomer peer MAY signal it\r
 444 already has peak hashes by acknowledging any bin, even the empty one:\r
 445 \r
 446 03 FFFFFFFF\r
 447 \r
 448 .fi\r
 449 Otherwise, the first of the senders SHOULD bootstrap him with all the peak hashes.\r
 450 \r
 451 \r
 452 .ti 0\r
 453 4.4.3. Hash trees for streams\r
 454 \r
 455 .fi\r
 456 In the case of live streaming a transfer is bootstrapped with a\r
 457 public key instead of a root hash, as the root hash is undefined\r
 458 or, more precisely, transient, as long as new data keeps coming.\r
 459 Streaming/download unification is achieved by sending signed peak\r
 460 hashes on-demand, ahead of the actual data. Similarly to the\r
 461 previous case, the sender might use acknowledgments to derive which\r
 462 data range the receiver has peak hashes for and to prepend the data\r
 463 hashes with the necessary (signed) peak hashes.\r
 464 Except for the fact that the set of peak hashes changes with the\r
 465 time, other parts of the algorithm work as described in 4.4.2. As we\r
 466 see, in both cases data length is not known in advance, but derived\r
 467 \%on-the-go from the peak hashes. Suppose our 7KB stream is extended by\r
 468 another kilobyte. Thus, now hash 7 becomes the only peak hash,\r
 469 eating hashes 3, 9 and 12. So, the source sends out a signed peak hash\r
 470 message (type 7) to announce the fact:\r
 471 \r
 472 .nf\r
 473 07 00000007 1234567890ABCDEF1234567890ABCDEF12345678 SOME-SIGN-HERE\r
 474 \r
 475 \r
 476 .ti 0\r
 477 4.5.  Peer exchange and NAT hole punching\r
 478 \r
 479 Peer exchange messages are common for many peer-to-peer protocols. By exchanging peer IP addresses in gossip fashion, peers relieve central coordinating entities (the trackers) from unnecessary work. Following the example of BitTorrent, swift features two types of PEX messages: "peer connected" (type 5) and "peer disconnected" (type 6). Peers are represented as IPv4 address-port pairs:\r
 480 .nf\r
 481 05 7F000001 1F40\r
 482 (connected to 127.0.0.1:8000)\r
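
As an illustration, a PEX message for an IPv4 address-port pair can
be packed and unpacked with the Python standard library. The helper
names are hypothetical; the layout (type, 4-byte address, 2-byte
port, all big-endian) follows the example above.

```python
import socket
import struct

PEX_CONNECTED = 5      # "peer connected"
PEX_DISCONNECTED = 6   # "peer disconnected"


def pex_message(msg_type: int, host: str, port: int) -> bytes:
    """Serialize a PEX message: type, IPv4 address, UDP port."""
    return struct.pack(">B4sH", msg_type, socket.inet_aton(host), port)


def parse_pex(message: bytes):
    """Return (type, dotted IPv4 address, port) of a 7-byte message."""
    msg_type, packed, port = struct.unpack(">B4sH", message)
    return msg_type, socket.inet_ntoa(packed), port


message = pex_message(PEX_CONNECTED, "127.0.0.1", 8000)
print(message.hex())
print(parse_pex(message))
```

Here 127.0.0.1 serializes to 7f000001 and port 8000 to 1f40.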
 483 \r
 484 .fi\r
 485 To unify peer exchange and NAT hole punching functionality, the\r
 486 sending pattern of PEX messages is restricted. As swift handshake\r
 487 is able to do simple NAT hole punching [SNP] transparently, PEX\r
 488 messages must be emitted in a way that facilitates that. Namely,\r
 489 once peer A introduces peer B to peer C by sending a PEX message to\r
 490 C, it SHOULD also send a message to B introducing C. The messages\r
 491 SHOULD be sent within 2 seconds of each other, but MAY be, and\r
 492 preferably are, separated by a gap of twice the "typical" RTT, i.e.\r
 493 \%300-600ms. The peers are supposed to initiate handshakes to each\r
 494 other thus forming a simple NAT hole punching pattern where the\r
 495 introducing peer effectively acts as a STUN server. Still, peers\r
 496 MAY ignore PEX messages if uninterested in obtaining new peers or\r
 497 because of security considerations (rate limiting) or any other\r
 498 reason.\r
 499 \r
 500 \r
 501 .ti 0\r
 502 4.6.  Data requests (HINTs)\r
 503 \r
 504 .fi\r
While bulk download protocols normally make explicit requests for
certain ranges of data (e.g. BitTorrent's REQUEST message), live
streaming protocols quite often do without them to save round trips.
Explicit requests are often needed for security purposes; consider
that BitTorrent can only verify hashes of complete pieces, which
might consist of multiple blocks requested from many peers.
As swift has no such limitation, it is supposed to work both
ways. Namely, a peer SHOULD send out requested pieces, but it
may also send some other data in case it runs out of requests or
for some other reason. To emphasize that, request messages are named
HINTs; their only purpose is to coordinate peers and to avoid
unnecessary data retransmission. A peer SHOULD process
HINTs sequentially. The HINT message type is 8.
 518 .nf\r
 519 08 00000009\r
 520 (a peer requests fifth and sixth packets)\r
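
.fi
The mapping from a bin number to a packet range in this example can be sketched as follows. The sketch assumes in-order bin numbering, where the number of trailing one bits of a bin number gives its layer in the tree; that assumption is consistent with bin 9 covering the fifth and sixth packets. The function names are hypothetical.

.nf
```python
import struct

HINT = 8


def bin_to_packets(bin_number):
    """Return zero-based (first, last) packet indices covered by a bin."""
    layer = 0
    # Trailing one bits give the layer of the bin in the tree.
    while (bin_number >> layer) & 1:
        layer += 1
    offset = bin_number >> (layer + 1)
    first = offset << layer
    last = first + (1 << layer) - 1
    return first, last


def decode_hint(data):
    """Decode a 32-bit HINT message into the requested packet range."""
    msg_type, bin_number = struct.unpack(">BI", data[:5])
    if msg_type != HINT:
        raise ValueError("not a HINT message")
    return bin_to_packets(bin_number)
```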
 521 \r
 522 \r
 523 .ti 0\r
 524 4.7.  Subsetting of the protocol\r
 525 \r
 526 .fi\r
As the same protocol is supposed to serve diverse use cases,
different peers may support different subsets of messages. The
supported subset SHOULD be signaled in the handshake packets.
The SWIFT_MSGTYPE_RCVD message (type 9) serves exactly this
purpose. It contains a 32-bit big-endian number with bits set
to 1 at offsets corresponding to supported message type ids.
E.g. for a tracker peer which receives only handshakes and
(root) hashes, and sends out handshakes and PEX_ADD messages, that
message will look like:
.nf
09 00000011
.fi
Similarly, peers running over TCP may not accept ACK messages.
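
The bit mask can be computed as in the following sketch. The hash type id 4 follows from the 64-bit hash typeid 68 = 64+4; the handshake type id 0 is an assumption that happens to agree with the 00000011 example. The function names are hypothetical.

.nf
```python
import struct

MSGTYPE_RCVD = 9
HANDSHAKE = 0  # assumed type id
HASH = 4


def encode_msgtype_rcvd(accepted_types):
    """Build a MSGTYPE_RCVD message from a set of accepted type ids."""
    mask = 0
    for type_id in accepted_types:
        if not 0 <= type_id < 32:
            raise ValueError("type id does not fit a 32-bit mask")
        mask |= 1 << type_id
    return struct.pack(">BI", MSGTYPE_RCVD, mask)


def accepts(message, type_id):
    """Tell whether the sender of the message accepts the given type."""
    kind, mask = struct.unpack(">BI", message[:5])
    if kind != MSGTYPE_RCVD:
        raise ValueError("not a MSGTYPE_RCVD message")
    return bool((mask >> type_id) & 1)
```
.fi
For the tracker above, encode_msgtype_rcvd({HANDSHAKE, HASH}) yields the bytes 09 00000011.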
 538 \r
 539 \r
 540 .ti 0\r
 541 4.8.  Directory lists\r
 542 \r
 543 .fi\r
 544 Directory list files MUST start with magic bytes ".\n..\n\n". The rest of the file is a newline-separated list of hashes and file names for the content of the directory. An example:\r
 545 \r
 546 .nf\r
 547 \&.\r
 548 \&..\r
 549 1234567890ABCDEF1234567890ABCDEF12345678  readme.txt\r
 550 01234567890ABCDEF1234567890ABCDEF1234567  big_file.dat\r
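
.fi
A directory list could be parsed as in the following sketch. It takes the magic bytes literally, i.e. it expects an empty line after the ".." line, and it assumes 40-digit hexadecimal hashes separated from the file name by whitespace, as in the example. The function name is hypothetical.

.nf
```python
def parse_directory_list(text):
    """Return a list of (hash_hex, file_name) pairs."""
    lines = text.splitlines()
    # The magic bytes amount to the lines ".", ".." and an empty line.
    if lines[:3] != [".", "..", ""]:
        raise ValueError("missing directory list magic")
    entries = []
    for line in lines[3:]:
        if not line:
            continue
        # Split on the first run of whitespace only, so that
        # file names containing spaces survive intact.
        hash_hex, name = line.split(None, 1)
        if len(hash_hex) != 40:
            raise ValueError("unexpected hash length")
        bytes.fromhex(hash_hex)  # raises ValueError on non-hex input
        entries.append((hash_hex, name))
    return entries
```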
 551 \r
 552 \r
 553 .ti 0\r
 554 5. Enveloping\r
 555 \r
 556 .ti 0\r
 557 5.1.  IP\r
 558 \r
 559 .fi\r
The most theoretically correct way is to run swift on top of IP, as another transport protocol like TCP or UDP. However, that option has significant downsides. First, it inevitably causes NAT/firewall compatibility problems. Second, it necessitates an in-kernel implementation for all peers.
 561 \r
 562 \r
 563 .ti 0\r
 564 5.2.  UDP\r
 565 \r
.fi
Currently, swift-over-UDP is the default deployment option. Effectively, UDP allows the use of IP with minimal overhead; it also allows userspace implementations.
Besides the classic 1KB packet scenario, the bin numbering allows swift to be used over jumbo frames/datagrams. Both data and acknowledgments may use e.g. 8KB packets instead of the "standard" 1KB. The hashing scheme stays the same.
Using swift with 512 or 256-byte packets is theoretically possible with 64-bit byte-precise bin numbers, but IP fragmentation might be a better method to achieve the same result.
 570 \r
 571 \r
 572 .ti 0\r
 573 5.3.  TCP\r
 574 \r
 575 .fi\r
If run over TCP, the swift becomes functionally equivalent to BitTorrent. Namely, most swift messages have corresponding BitTorrent messages and vice versa, except for BitTorrent's explicit interest declarations and choking/unchoking, which serve the classic implementation of the tit-for-tat algorithm [TIT4TAT].
 577 \r
 578 \r
 579 .ti 0\r
 580 6. Security Considerations\r
 581 \r
Like any other network protocol, the swift faces a common set of security challenges. An implementation must consider the possibility of buffer overruns, DoS attacks and manipulation (e.g. reflection attacks). Any guarantee of privacy seems unlikely, as the user exposes their IP address to the peers. A probable exception is the case of a user hidden behind a public NAT or proxy.
 583 \r
 584 \r
 585 .ti 0\r
 586 7. Extensibility\r
 587 \r
 588 .ti 0\r
 589 7.1. 32 bit vs 64 bit\r
 590 \r
 591 .nf\r
While in principle the protocol supports bigger (>1TB) files, all
the mentioned counters are 32-bit. This is an optimization, as using
\%64-bit numbers on-wire may cost ~2% practical overhead. The 64-bit
version of every message has a typeid of 64+t, e.g. typeid 68 for
the \%64-bit hash message:
 597 44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567\r
Once a 32-bit message is supported, its 64-bit version MUST be
supported as well.
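
.fi
The 64+t rule can be illustrated for the hash message as follows. The sketch assumes a 20-byte digest, matching the 40 hexadecimal digits in the example, and a layout of type, bin number and digest; the function name and the WIDE constant are hypothetical.

.nf
```python
import struct

HASH = 4   # 32-bit hash message type id
WIDE = 64  # shift added to t for the 64-bit version


def encode_hash(bin_number, digest, wide=False):
    """Encode a hash message with a 32-bit or 64-bit bin number."""
    if len(digest) != 20:
        raise ValueError("expected a 20-byte digest")
    if wide:
        head = struct.pack(">BQ", HASH + WIDE, bin_number)
    else:
        head = struct.pack(">BI", HASH, bin_number)
    return head + digest
```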
 599 .bp\r
 600 .ti 0\r
 601 7.2. IPv6\r
 602 \r
IPv6 versions of PEX messages use the same 64+t shift as in 7.1.
 604 \r
 605 \r
 606 .ti 0\r
 607 7.3. Congestion control algorithms\r
 608 \r
 609 .fi\r
The congestion control algorithm is left to the implementation and may even vary from peer to peer. Congestion control is implemented entirely by the sending peer; the receiver only provides clues, such as hints, acknowledgments and timestamps.
In general, it is expected that servers would use TCP-like congestion control schemes such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to use weaker-than-TCP (less than best effort) congestion control, such as LEDBAT [LEDBAT], to minimize seeding counter-incentives.
 612 \r
 613 \r
 614 .ti 0\r
 615 7.4. Piece picking algorithms\r
 616 \r
Piece picking depends entirely on the receiving peer. The sending peer is made aware of preferred pieces by means of HINT messages, but may ignore those hints and send unrequested data.
 618 \r
 619 \r
 620 .ti 0\r
 621 7.5. Reciprocity algorithms\r
 622 \r
Reciprocity algorithms are the sole responsibility of the sending peer. Reciprocal intentions of the sender are not manifested by separate messages (such as BitTorrent's CHOKE/UNCHOKE), as those do not guarantee anything anyway (the "snubbing" syndrome).
 624 \r
 625 \r
 626 .ti 0\r
 627 7.6. Different crypto/hashing schemes\r
 628 \r
 629 .fi\r
If a flavour of swift needs to use a different crypto scheme
(e.g. SHA-256), a message type should be allocated for that. As the root
 632 hash is supplied in the handshake message, the crypto scheme in use\r
 633 will be known from the very beginning. As the root hash is the\r
 634 content's identifier, different schemes of crypto cannot be mixed\r
 635 in the same swarm; different swarms may distribute the same content\r
 636 using different crypto.\r
 637 \r
 638 \r
 639 .ti 0\r
 640 References\r
 641 \r
 642 .nf\r
 643 .in 0\r
[RFC2119] S. Bradner: "Key words for use in RFCs to Indicate
    Requirement Levels", BCP 14, RFC 2119, March 1997
 645 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with\r
 646     Mochiweb", Part 3. http://www.metabrew.com/article/\r
 647     \%a-million-user-comet-application-with-mochiweb-part-3\r
 648 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:\r
 649     \%"Free-riding, Fairness, and Firewalls in P2P File-Sharing"\r
 650 [LUCNAT] submitted\r
 651 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps\r
 652     and binary trees" http://bouillon.math.usu.ru/articles/\r
 653     \%binmaps-alenex.pdf\r
 654 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication\r
 655     Across Network Address Translators",\r
 656     http://www.brynosaurus.com/pub/net/p2pnat/\r
 657 [MERKLE] Merkle, R. A Digital Signature Based on a Conventional\r
 658     Encryption Function. Proceedings CRYPTO'87, Santa Barbara, CA,\r
 659     USA, Aug 1987. pp 369-378.\r
 660 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BEP 30,\r
 661     http://bittorrent.org/beps/bep_0030.html\r
[CUBIC] Injong Rhee and Lisong Xu: "CUBIC: A New TCP-Friendly
 663     \%High-Speed TCP Variant",\r
 664     \%http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf\r
 665 [LEDBAT] S. Shalunov: "Low Extra Delay Background Transport (LEDBAT)"\r
 666     \%http://www.ietf.org/id/draft-ietf-ledbat-congestion-00.txt\r
 667 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", 2003,\r
 668     http://www.bittorrent.org/bittorrentecon.pdf\r
 669 \r
 670 Author's address\r
 671 \r
 672 .in 3\r
 673 Victor Grishchenko\r
 674 TU Delft, EWI PDS\r
 675 Mekelweg 4, HB 9.240\r
 676 2628CD Delft\r
 677 The Netherlands\r
 678 \r
 679 Email: victor.grishchenko@gmail.com\r
 680 \r
 681 .ce 0\r