src/libswift_udp/doc/draft-ietf-ppsp-peer-protocol-00.nroff

   1 .\" Auto generated Nroff by NroffEdit on April 12, 2010\r
   2 .pl 10.0i\r
   3 .po 0\r
   4 .ll 7.2i\r
   5 .lt 7.2i\r
   6 .nr LL 7.2i\r
   7 .nr LT 7.2i\r
   8 .ds LF Grishchenko and Bakker\r
   9 .ds RF FORMFEED[Page %]\r
  10 .ds LH Internet-Draft\r
  11 .ds RH December 19, 2011\r
  12 .ds CH swift\r
  13 .ds CF Expires June 21, 2012\r
  14 .hy 0\r
  15 .nh\r
  16 .ad l\r
  17 .in 0\r
  18 .nf\r
  19 .tl 'PPSP' 'A. Bakker'\r
  20 .tl 'Internet-Draft' 'TU Delft'\r
  21 .tl 'Intended status: Informational' \r
  22 .tl 'Expires: June 21, 2012' 'December 19, 2011'\r
  23 \r
  24 .fi\r
  25 .in 3\r
  26 .in 12\r
  27 .ti 8\r
  28 Peer-to-Peer Streaming Protocol (PPSP) \%<draft-ietf-ppsp-peer-protocol-00.txt>\r
  29 \r
  30 .ti 0\r
  31 Abstract\r
  32 \r
  33 .in 3\r
  34 The Generic Multiparty Protocol (swift) is a peer-to-peer based transport\r
  35 protocol for content dissemination. It can be used for streaming on-demand\r
  36 and live video content, as well as conventional downloading. In swift, the\r
  37 clients consuming the content participate in the dissemination by forwarding\r
  38 the content to other clients via a mesh-like structure.  It is a generic\r
  39 protocol which can run directly on top of UDP, TCP, HTTP or as a RTP \r
  40 profile. Features of swift are short time-till-playback and extensibility. \r
  41 Hence, it can use different mechanisms to prevent freeriding, and work \r
  42 with different peer discovery schemes (centralized trackers or\r
  43 Distributed Hash Tables). Depending on the underlying transport\r
  44 protocol, swift can also use different congestion control algorithms, \r
  45 such as LEDBAT, and offer transparent NAT traversal. Finally, swift maintains \r
  46 only a small amount of state per peer and detects malicious modification of \r
  47 content. This documents describes swift and how it satisfies the requirements \r
  48 for the IETF Peer-to-Peer Streaming Protocol (PPSP) Working Group's peer \r
  49 protocol.\r
  50 \r
  51 \r
  52 .ti 0\r
  53 Status of this memo\r
  54 \r
  55 This Internet-Draft is submitted to IETF in full conformance with the\r
  56 provisions of BCP 78 and BCP 79.\r
  57 \r
  58 Internet-Drafts are working documents of the Internet Engineering Task Force\r
  59 (IETF), its areas, and its working groups.  Note that other groups may also\r
  60 distribute working documents as Internet- Drafts.\r
  61 \r
  62 Internet-Drafts are draft documents valid for a maximum of six months and\r
  63 may be updated, replaced, or obsoleted by other documents at any time.  It\r
  64 is inappropriate to use Internet-Drafts as reference material or to cite\r
  65 them other than as "work in progress."\r
  66 \r
  67 The list of current Internet-Drafts can be accessed at\r
  68 \%http://www.ietf.org/ietf/1id-abstracts.txt.\r
  69 \r
  70 The list of Internet-Draft Shadow Directories can be accessed at\r
  71 http://www.ietf.org/shadow.html.\r
  72 \r
  73 \r
  74 .nf\r
  75 Copyright (c) 2011 IETF Trust and the persons identified as the\r
  76 document authors.  All rights reserved.\r
  77 \r
  78 This document is subject to BCP 78 and the IETF Trust's Legal\r
  79 Provisions Relating to IETF Documents\r
  80 \%(http://trustee.ietf.org/license-info) in effect on the date of\r
  81 publication of this document.  Please review these documents\r
  82 carefully, as they describe your rights and restrictions with respect\r
  83 to this document.  Code Components extracted from this document must\r
  84 include Simplified BSD License text as described in Section 4.e of\r
  85 the Trust Legal Provisions and are provided without warranty as\r
  86 described in the Simplified BSD License.\r
  87 \r
  88 .\" \# TD4  -- Set TOC depth by altering this value (TD5 = depth 5)\r
  89 .\" \# TOC  -- Beginning of auto updated Table of Contents\r
  90 .in 0\r
  91 Table of Contents\r
  92 \r
  93 .nf\r
  94    1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3\r
  95      1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . .  3\r
  96      1.2. Conventions Used in This Document . . . . . . . . . . . . .  4\r
  97      1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . .  5\r
  98    2. Overall Operation . . . . . . . . . . . . . . . . . . . . . . .  6\r
  99      2.1. Joining a Swarm . . . . . . . . . . . . . . . . . . . . . .  6\r
 100      2.2. Exchanging Chunks . . . . . . . . . . . . . . . . . . . . .  6\r
 101      2.3. Leaving a Swarm . . . . . . . . . . . . . . . . . . . . . .  7\r
 102    3. Messages  . . . . . . . . . . . . . . . . . . . . . . . . . . .  7\r
 103      3.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . .  8\r
 104      3.3. HAVE  . . . . . . . . . . . . . . . . . . . . . . . . . . .  8\r
 105        3.3.1. Bin Numbers . . . . . . . . . . . . . . . . . . . . . .  8\r
 106        3.3.2. HAVE Message  . . . . . . . . . . . . . . . . . . . . .  9\r
 107      3.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9\r
 108      3.5. DATA and HASH . . . . . . . . . . . . . . . . . . . . . . . 10\r
 109        3.5.1. Merkle Hash Tree  . . . . . . . . . . . . . . . . . . . 10\r
 110        3.5.2. Content Integrity Verification  . . . . . . . . . . . . 11\r
 111        3.5.3. The Atomic Datagram Principle . . . . . . . . . . . . . 11\r
 112        3.5.4. DATA and HASH Messages  . . . . . . . . . . . . . . . . 12\r
 113      3.6. HINT  . . . . . . . . . . . . . . . . . . . . . . . . . . . 13\r
 114      3.7. Peer Address Exchange and NAT Hole Punching . . . . . . . . 13\r
 115      3.8. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . . . 14\r
 116      3.9. VERSION . . . . . . . . . . . . . . . . . . . . . . . . . . 14\r
 117      3.10. Conveying Peer Capabilities  . . . . . . . . . . . . . . . 14\r
 118      3.11. Directory Lists  . . . . . . . . . . . . . . . . . . . . . 14\r
 119    4. Automatic Detection of Content Size . . . . . . . . . . . . . . 14\r
 120      4.1. Peak Hashes . . . . . . . . . . . . . . . . . . . . . . . . 15\r
 121      4.2. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 16\r
 122    5. Live streaming  . . . . . . . . . . . . . . . . . . . . . . . . 17\r
 123    6. Transport Protocols and Encapsulation . . . . . . . . . . . . . 17\r
 124      6.1. UDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17\r
 125        6.1.1. Chunk Size  . . . . . . . . . . . . . . . . . . . . . . 17\r
 126        6.1.2. Datagrams and Messages  . . . . . . . . . . . . . . . . 18\r
 127        6.1.3. Channels  . . . . . . . . . . . . . . . . . . . . . . . 18\r
 128        6.1.4. HANDSHAKE and VERSION . . . . . . . . . . . . . . . . . 19\r
 129        6.1.5. HAVE  . . . . . . . . . . . . . . . . . . . . . . . . . 20\r
 130        6.1.6. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . 20\r
 131        6.1.7. HASH  . . . . . . . . . . . . . . . . . . . . . . . . . 20\r
 132        6.1.8. DATA  . . . . . . . . . . . . . . . . . . . . . . . . . 20\r
 133        6.1.9. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . 20\r
 134        6.1.10. Flow and Congestion Control  . . . . . . . . . . . . . 21\r
 135      6.2. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21\r
 136      6.3. RTP Profile for PPSP  . . . . . . . . . . . . . . . . . . . 21\r
 137        6.3.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 22\r
 138        6.3.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 24\r
 139      6.4. HTTP (as PPSP)  . . . . . . . . . . . . . . . . . . . . . . 27\r
 140        6.4.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 27\r
 141        6.4.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 29\r
 142    7. Security Considerations . . . . . . . . . . . . . . . . . . . . 32\r
 143    8. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 32\r
 144      8.1. 32 bit vs 64 bit  . . . . . . . . . . . . . . . . . . . . . 32\r
 145      8.2. IPv6  . . . . . . . . . . . . . . . . . . . . . . . . . . . 32\r
 146      8.3. Congestion Control Algorithms . . . . . . . . . . . . . . . 32\r
 147      8.4. Piece Picking Algorithms  . . . . . . . . . . . . . . . . . 33\r
 148      8.5. Reciprocity Algorithms  . . . . . . . . . . . . . . . . . . 33\r
 149      8.6. Different crypto/hashing schemes  . . . . . . . . . . . . . 33\r
 150    9. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . 33\r
 151      9.1.  Design Goals . . . . . . . . . . . . . . . . . . . . . . . 34\r
 152      9.2.  Not TCP  . . . . . . . . . . . . . . . . . . . . . . . . . 35\r
 153      9.3.  Generic Acknowledgments  . . . . . . . . . . . . . . . . . 36\r
 154    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 37\r
 155    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37\r
 156    Authors' addresses . . . . . . . . . . . . . . . . . . . . . . . . 39\r
 157 .fi\r
 158 .in 3\r
 159 \r
 160 .\" \# ETC -- End of auto updated Table of Contents\r
 161 \r
 162 \r
 163 \r
 164 .ti 0\r
 165 1.  Introduction\r
 166 \r
 167 .ti 0\r
 168 1.1. Purpose\r
 169 \r
 170 This document describes the Generic Multiparty Protocol (swift), designed\r
 171 from the ground up for the task of disseminating the same content to a group\r
 172 of interested parties. Swift supports streaming on-demand and\r
 173 live video content, as well as conventional downloading, thus covering\r
 174 today's three major use cases for content distribution. To fulfil this task,\r
 175 clients consuming the content are put on equal footing with the servers\r
 176 initially providing the content to create a peer-to-peer system where\r
 177 everyone can provide data. Each peer connects to a random set of other peers\r
 178 resulting in a mesh-like structure. \r
 179 \r
 180 Swift uses a simple method of naming content based on self-certification. In\r
 181 particular, content in swift is identified by a single cryptographic hash\r
 182 that is the root hash in a Merkle hash tree calculated recursively from the\r
 183 content [ABMRKL].  This self-certifying hash tree allows every peer to\r
 184 directly detect when a malicious peer tries to distribute fake content. It\r
 185 also ensures only a small amount of information is needed to start a\r
 186 download (just the root hash and some peer addresses).\r
 187 \r
 188 Swift uses a novel method of addressing chunks of content called "bin\r
 189 numbers". Bin numbers allow the addressing of a binary interval of data\r
 190 using a single integer. This reduces the amount of state that needs to be\r
 191 recorded per peer and the space needed to denote intervals on the wire,\r
 192 making the protocol light-weight. In general, this numbering system allows \r
 193 swift to work with simpler data structures, e.g. to use arrays instead of \r
 194 binary trees, thus reducing complexity.\r
 195 \r
 196 Swift is a generic protocol which can run directly on top of UDP, TCP, HTTP, \r
 197 or as a layer below RTP, similar to SRTP [RFC3711]. As such, swift defines a \r
 198 common set of messages that make up the protocol, which can have different\r
 199 representations on the wire depending on the lower-level protocol used. When\r
 200 the lower-level transport is UDP, swift can also use different congestion \r
 201 control algorithms and facilitate NAT traversal. \r
 202 \r
 203 In addition, swift is extensible in the mechanisms it uses to promote client\r
 204 contribution and prevent freeriding, that is, how to deal with peers \r
 205 that only download content but never upload to others. Furthermore, \r
 206 it can work with different peer discovery schemes, \r
 207 such as centralized trackers or fast Distributed Hash Tables [JIM11]. \r
 208 \r
 209 This documents describes not only the swift protocol but also how it\r
 210 satisfies the requirements for the IETF Peer-to-Peer Streaming Protocol\r
 211 (PPSP) Working Group's peer protocol [PPSPCHART,I-D.ietf-ppsp-reqs].\r
 212 A reference implementation of swift over UDP is available [SWIFTIMPL].\r
 213 \r
 214 \r
 215 .ti 0\r
 216 1.2. Conventions Used in This Document\r
 217 \r
 218 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",\r
 219 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in\r
 220 this document are to be interpreted as described in [RFC2119].\r
 221 \r
 222 .ti 0\r
 223 1.3. Terminology\r
 224 \r
 225 .in 3\r
 226 message\r
 227 .br\r
 228 .in 8\r
 229 The basic unit of swift communication. A message will have different \r
 230 representations on the wire depending on the transport protocol used. Messages \r
 231 are typically multiplexed into a datagram for transmission.\r
 232 \r
 233 .in 3\r
 234 datagram\r
 235 .br\r
 236 .in 8\r
 237 A sequence of messages that is offered as a unit to the underlying transport \r
 238 protocol (UDP, etc.). The datagram is swift's Protocol Data Unit (PDU).\r
 239 \r
 240 .in 3\r
 241 content\r
 242 .br\r
 243 .in 8\r
 244 Either a live transmission, a pre-recorded multimedia asset, or a file.\r
 245 \r
 246 .in 3\r
 247 bin\r
 248 .br\r
 249 .in 8\r
 250 A number denoting a specific binary interval of the content (i.e., one \r
 251 or more consecutive chunks).\r
 252 \r
 253 .in 3\r
 254 chunk\r
 255 .br\r
 256 .in 8\r
 257 The basic unit in which the content is divided. E.g. a block of N \r
 258 kilobyte.\r
 259 \r
 260 .in 3\r
 261 hash\r
 262 .br\r
 263 .in 8\r
 264 The result of applying a cryptographic hash function, more specifically \r
 265 a modification detection code (MDC) [HAC01], such as SHA1 [FIPS180-2], \r
 266 to a piece of data. \r
 267 \r
 268 .in 3\r
 269 root hash\r
 270 .br\r
 271 .in 8\r
 272 The root in a Merkle hash tree calculated recursively from the content.\r
 273 \r
 274 .in 3\r
 275 swarm\r
 276 .br\r
 277 .in 8\r
 278 A group of peers participating in the distribution of the same content.\r
 279 \r
 280 .in 3\r
 281 swarm ID\r
 282 .br\r
 283 .in 8\r
 284 Unique identifier for a swarm of peers, in swift the root hash of the \r
 285 content (video-on-demand,download) or a public key (live streaming).\r
 286 \r
 287 .in 3\r
 288 tracker\r
 289 .br\r
 290 .in 8\r
 291 An entity that records the addresses of peers participating in a swarm,\r
 292 usually for a set of swarms, and makes this membership information \r
 293 available to other peers on request.\r
 294 \r
 295 .in 3\r
 296 choking\r
 297 .br\r
 298 .in 8\r
 299 When a peer A is choking peer B it means that A is currently not \r
 300 willing to accept requests for content from B.\r
 301 .in 3\r
 302 \r
 303 \r
 304 .ti 0\r
 305 2. Overall Operation\r
 306 \r
 307 The basic unit of communication in swift is the message. Multiple messages are\r
 308 multiplexed into a single datagram for transmission. A datagram (and hence the \r
 309 messages it contains) will have different representations on the wire depending \r
 310 on the transport protocol used (see Sec. 6).\r
 311 \r
 312 \r
 313 .ti 0\r
 314 2.1. Joining a Swarm\r
 315 \r
 316 Consider a peer A that wants to download a certain content asset.\r
 317 To commence a swift download, peer A must have the swarm ID of the content \r
 318 and a list of one or more tracker contact points (e.g. host+port). The list \r
 319 of trackers is optional in the presence of a decentralized tracking mechanism. \r
 320 The swarm ID consists of the swift root hash of the content (video-on-demand,\r
 321 downloading) or a public key (live streaming).\r
 322 \r
 323 Peer A now registers with the tracker following e.g. the PPSP tracker\r
 324 protocol [I-D.ietf.ppsp-reqs] and receives the IP address and port of peers\r
 325 already in the swarm, say B, C, and D. Peer A now sends a datagram\r
 326 containing a HANDSHAKE message to B, C, and D. This message serves as an\r
 327 end-to-end check that the peers are actually in the correct swarm, and\r
 328 contains the root hash of the swarm.  Peer B and C respond with datagrams\r
 329 containing a HANDSHAKE message and one or more HAVE messages. A HAVE message\r
 330 conveys (part of) the chunk availability of a peer and thus contains a bin\r
 331 number that denotes what chunks of the content peer B, resp. C have. Peer D\r
 332 sends a datagram with just a HANDSHAKE and omits HAVE messages as a way of \r
 333 choking A.\r
 334 \r
 335 .ti 0\r
 336 2.2. Exchanging Chunks\r
 337 \r
 338 In response to B and C, A sends new datagrams to B and C containing HINT messages.\r
 339 A HINT or request message indicates the chunks that a peer wants to\r
 340 download, and contains a bin number. The HINT messages to B and C refer to\r
 341 disjunct sets of chunks.  B and C respond with datagrams containing HASH,\r
 342 HAVE and DATA messages.  The HASH messages contains all cryptographic hashes\r
 343 that peer A needs to verify the integrity of the content chunk sent in the\r
 344 DATA message, using the content's root hash as trusted anchor, see Sec. 3.5.\r
 345 Using these hashes peer A verifies that the chunks received from B and C are \r
 346 correct. It also updates the chunk availability of B and C using the information\r
 347 in the received HAVE messages. \r
 348 \r
 349 After processing, A sends a datagram containing HAVE messages for the chunks\r
 350 it just received to all its peers. In the datagram to B and C it includes an\r
 351 ACK message acknowledging the receipt of the chunks, and adds HINT messages\r
 352 for new chunks. ACK messages are not used when a reliable transport protocol\r
 353 is used. When e.g. C finds that A obtained a chunk (from B) that C did not\r
 354 yet have, C's next datagram includes a HINT for that chunk.\r
 355 \r
 356 Peer D does not send HAVE messages to A when it downloads chunks from other peers,\r
 357 until D decides to unchoke peer A. In the case, it sends a datagram with\r
 358 HAVE messages to inform A of its current availability. If B or C decide to\r
 359 choke A they stop sending HAVE and DATA messages and A should then rerequest\r
 360 from other peers. They may continue to send HINT messages, or periodic\r
 361 KEEPALIVE messages such that A keeps sending them HAVE messages.\r
 362 \r
 363 Once peer A has received all content (video-on-demand use case) it stops\r
 364 sending messages to all other peers that have all content (a.k.a. seeders).\r
 365 Peer A MAY also contact the tracker or another source again to obtain more \r
 366 peer addresses.\r
 367 \r
 368 \r
 369 .ti 0\r
 370 2.3. Leaving a Swarm\r
 371 \r
 372 Depending on the transport protocol used, peers should either use explicit \r
 373 leave messages or implicitly leave a swarm by stopping to respond\r
 374 to messages. Peers that learn about the departure should remove these peers \r
 375 from the current peer list. The implicit-leave mechanism works for both graceful and \r
 376 ungraceful leaves (i.e., peer crashes or disconnects). When leaving gracefully, a \r
 377 peer should deregister from the tracker following the (PPSP) tracker protocol.\r
 378 \r
 379 \r
 380 .ti 0\r
 381 3. Messages\r
 382 \r
 383 .fi\r
 384 In general, no error codes or responses are used in the protocol; absence \r
 385 of any response indicates an error. Invalid messages are discarded. \r
 386 \r
 387 For the sake of simplicity, one swarm of peers always deals with one content \r
 388 asset (e.g. file) only. Retrieval of large collections of files is done by \r
 389 retrieving a directory list file and then recursively retrieving files, which \r
 390 might also turn to be directory lists, as described in Sec. 3.11. \r
 391 \r
 392 .ti 0\r
 393 3.1. HANDSHAKE\r
 394 \r
 395 As an end-to-end check that the peers are actually in the correct swarm, the\r
 396 initiating peer and the addressed peer SHOULD send a HANDSHAKE message in\r
 397 the first datagrams they exchange. The only payload of the HANDSHAKE message\r
 398 is the root hash of the content.  \r
 399 \r
 400 After the handshakes are exchanged, the initiator knows that the peer really\r
 401 responds. Hence, the second datagram the initiator sends MAY already contain\r
 402 some heavy payload. To minimize the number of initialization roundtrips, \r
 403 implementations MAY dispense with the HANDSHAKE message. To the same end,\r
 404 the first two datagrams exchanged MAY also contain some minor payload, e.g.\r
 405 HAVE messages to indicate the current progress of a peer or a HINT \r
 406 (see Sec. 3.6).\r
 407 \r
 408 \r
 409 .ti 0\r
 410 3.3. HAVE\r
 411 \r
 412 The HAVE message is used to convey which chunks a peers has available,\r
 413 expressed in a new content addressing scheme called "bin numbers".\r
 414 \r
 415 .ti 0\r
 416 3.3.1. Bin Numbers\r
 417 \r
 418 Swift employs a generic content addressing scheme based on binary intervals \r
 419 ("bins" in short). The smallest interval is a chunk (e.g. a N kilobyte \r
 420 block), the top interval is the complete 2**63 range.  A novel addition \r
 421 to the classical scheme are "bin numbers", a scheme of numbering binary \r
 422 intervals which lays them out into a vector nicely. Consider an chunk\r
 423 interval of width W. To derive the bin numbers of the complete interval and\r
 424 the subintervals, a minimal balanced binary tree is built that is at least W \r
 425 chunks wide at the base. The leaves from left-to-right correspond to \r
 426 the chunks 0..W in the interval, and have bin number I*2 where I is the \r
 427 index of the chunk (counting beyond W-1 to balance the tree). The higher \r
 428 level nodes P in the tree have bin number\r
 429 \r
 430          binP = (binL + binR) / 2\r
 431 \r
 432 .br      \r
 433 where binL is the bin of node P's left-hand child and binR is the bin of node\r
 434 P's right-hand child. Given that each node in the tree represents a\r
 435 subinterval of the original interval, each such subinterval now is\r
 436 addressable by a bin number, a single integer. The bin number tree of a\r
 437 interval of width W=8 looks like this:\r
 438 \r
 439 \r
 440 \r
 441 \r
 442                         7\r
 443 .br\r
 444                        / \\\r
 445 .br\r
 446                      /     \\\r
 447 .br\r
 448                    /         \\\r
 449 .br\r
 450                  /             \\\r
 451 .br\r
 452                3                11\r
 453 .br\r
 454               / \\              / \\\r
 455 .br\r
 456              /   \\            /   \\\r
 457 .br\r
 458             /     \\          /     \\\r
 459 .br\r
 460            1       5        9       13 \r
 461 .br\r
 462           / \\     / \\      / \\      / \\\r
 463 .br\r
 464          0   2   4   6    8   10  12   14\r
 465 .br \r
 466 \r
 467 .fi\r
 468 So bin 7 represents the complete interval, 3 represents the interval of\r
 469 chunk 0..3 and 1 represents the interval of chunks 0 and 1. The special\r
 470 numbers 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stands for an\r
 471 empty interval, and 0x7FFF...FFF stands for "everything". \r
 472 \r
 473 \r
 474 .ti 0\r
 475 3.3.2. HAVE Message\r
 476 \r
 477 When a receiving peer has successfully checked the integrity of a chunk or\r
 478 interval of chunks it MUST send a HAVE message to all peers it wants to \r
 479 interact with. The latter allows the HAVE message to be used as a method of \r
 480 choking. The HAVE message MUST contain the bin number of the biggest\r
 481 complete interval of all chunks the receiver has received and checked so far\r
 482 that fully includes the interval of chunks just received. So the bin number\r
 483 MUST denote at least the interval received, but the receiver is supposed to \r
 484 aggregate and acknowledge bigger bins, when possible.\r
 485 \r
 486 As a result, every single chunk is acknowledged a logarithmic number of times.\r
 487 That provides some necessary redundancy of acknowledgments and sufficiently \r
 488 compensates for unreliable transport protocols.\r
 489 \r
 490 To record which chunks a peer has in the state that an implementation keeps\r
 491 for each peer, an implementation MAY use the "binmap" data structure, which\r
 492 is a hybrid of a bitmap and a binary tree, discussed in detail in [BINMAP].\r
 493 \r
 494 \r
 495 .ti 0\r
 496 3.4. ACK\r
 497 \r
 498 When swift is run over an unreliable transport protocol, an implementation MAY\r
 499 choose to use ACK messages to acknowledge received data. When a receiving \r
 500 peer has successfully checked the integrity of a chunk or interval of chunks\r
 501 C it MUST send a ACK message containing the bin number of its biggest, complete,\r
 502 interval covering C to the sending peer (see HAVE). To facilitate delay-based\r
 503 congestion control, an ACK message contains a timestamp.\r
 504 \r
 505 \r
 506 .ti 0\r
 507 3.5. DATA and HASH\r
 508 \r
 509 The DATA message is used to transfer chunks of content. The associated HASH message \r
 510 carries cryptographic hashes that are necessary for a receiver to check the\r
 511 the integrity of the chunk. Swift's content integrity protection is based on a \r
 512 Merkle hash tree and works as follows.\r
 513 \r
 514 .ti 0\r
 515 3.5.1. Merkle Hash Tree\r
 516 \r
 517 Swift uses a method of naming content based on self-certification. In particular, \r
 518 content in swift is identified by a single cryptographic hash that is the \r
 519 root hash in a Merkle hash tree calculated recursively from the content [ABMRKL]. \r
 520 This self-certifying hash tree allows every peer to directly detect when a malicious \r
 521 peer tries to distribute fake content. It also ensures only a small the amount \r
 522 of information is needed to start a download (the root hash and some peer \r
 523 addresses). For live streaming public keys and dynamic trees are used, see below.\r
 524 \r
 525 The Merkle hash tree of a content asset that is divided into N chunks\r
 526 is constructed as follows. Note the construction does not assume chunks\r
 527 of content to be fixed size. Given a cryptographic hash function, more specifically \r
 528 a modification detection code (MDC) [HAC01], such as SHA1, the hashes of all the \r
 529 chunks of the content are calculated. Next, a binary tree of sufficient height \r
 530 is created. Sufficient \r
 531 height means that the lowest level in the tree has enough nodes to hold all \r
 532 chunk hashes in the set, as before, see HAVE message. The figure below shows \r
 533 the tree for a content asset consisting of 7 chunks. As before with the\r
 534 content addressing scheme, the leaves of the tree correspond to a chunk and \r
 535 in this case are assigned the hash of that chunk, starting at the left-most leaf. \r
 536 As the base of the tree may be wider than the number of chunks, any remaining \r
 537 leaves in the tree are assigned a empty hash value of all zeros. Finally, the hash \r
 538 values of the higher levels in the tree are calculated, by concatenating the hash \r
 539 values of the two children (again left to right) and computing the hash of that \r
 540 aggregate. This process ends in a hash value for the root node, which is called\r
 541 the "root hash". Note the root hash only depends on the content and any modification\r
 542 of the content will result in a different root hash. \r
 543 \r
 544 \r
 545 \r
 546 \r
 547                         7 = root hash\r
 548 .br\r
 549                        / \\\r
 550 .br\r
 551                      /     \\\r
 552 .br\r
 553                    /         \\\r
 554 .br\r
 555                  /             \\\r
 556 .br\r
 557                3*               11\r
 558 .br\r
 559               / \\              / \\\r
 560 .br\r
 561              /   \\            /   \\\r
 562 .br\r
 563             /     \\          /     \\\r
 564 .br\r
 565            1       5        9       13* = uncle hash\r
 566 .br\r
 567           / \\     / \\      / \\      / \\\r
 568 .br\r
 569          0   2   4   6    8   10* 12   14\r
 570 .br\r
 571         \r
 572          C0  C1  C2  C3   C4  C5  C6   E\r
 573 .br\r
 574          =chunk index     ^^           = empty hash\r
 575 .br\r
 576                            \r
 577 \r
 578 .ti 0\r
 579 3.5.2. Content Integrity Verification\r
 580 \r
 581 .fi\r
 582 Assuming a peer receives the root hash of the content it wants to download\r
 583 from a trusted source, it can can check the integrity of any chunk of that \r
 584 content it receives as follows. It first calculates the hash of the chunk \r
 585 it received, for example chunk C4 in the previous figure. Along with this \r
 586 chunk it MUST receive the hashes required to check the integrity of that \r
 587 chunk. In principle, these are the hash of the chunk's sibling (C5) and \r
 588 that of its "uncles". A chunk's uncles are the sibling Y of its parent X, \r
 589 and the uncle of that Y, recursively until the root is reached. For chunk C4 \r
 590 its uncles are bins 13 and 3, marked with * in the figure. Using this information \r
 591 the peer recalculates the root hash of the tree, and compares it to the \r
 592 root hash it received from the trusted source. If they match the chunk of \r
 593 content has been positively verified to be the requested part of the content. \r
 594 Otherwise, the sending peer either sent the wrong content or the wrong \r
 595 sibling or uncle hashes. For simplicity, the set of sibling and uncles \r
 596 hashes is collectively referred to as the "uncle hashes". \r
 597 \r
 598 In the case of live streaming the tree of chunks grows dynamically and\r
 599 content is identified with a public key instead of a root hash, as the root \r
 600 hash is undefined or, more precisely, transient, as long as new data is \r
 601 generated by the live source. Live streaming is described in more detail\r
 602 below, but content verification works the same for both live and predefined\r
 603 content.\r
 604 \r
 605 .ti 0\r
 606 3.5.3. The Atomic Datagram Principle\r
 607 \r
 608 As explained above, a datagram consists of a sequence of messages. Ideally,\r
 609 every datagram sent must be independent of other datagrams, so each \r
 610 datagram SHOULD be processed separately and a loss of one datagram MUST NOT\r
 611 disrupt the flow.  Thus, as a datagram carries zero or more messages,\r
 612 neither messages nor message interdependencies should span over multiple\r
 613 datagrams. \r
 614 \r
 615 This principle implies that as any chunk is verified using its uncle \r
 616 hashes the necessary hashes MUST be put into the same datagram as the \r
 617 chunk's data (Sec. 3.5.4).  As a general rule, if some additional data is \r
 618 still missing to process a message within a datagram, the message SHOULD be \r
 619 dropped. \r
 620 \r
 621 The hashes necessary to verify a chunk are in principle its sibling's hash\r
 622 and all its uncle hashes, but the set of hashes to sent can be optimized. \r
 623 Before sending a packet of data to the receiver, the sender inspects the \r
 624 receiver's previous acknowledgments (HAVE or ACK) to derive which hashes the\r
 625 receiver already has for sure. Suppose, the receiver had acknowledged bin 1\r
 626 (first two chunks of the file), then it must already have uncle hashes 5,\r
 627 11 and so on. That is because those hashes are necessary to check packets of\r
 628 bin 1 against the root hash. Then, hashes 3, 7 and so on must be also known\r
 629 as they are calculated in the process of checking the uncle hash chain.\r
 630 Hence, to send bin 12 (i.e. the 7th chunk of content), the sender needs to\r
 631 include just the hashes for bins 14 and 9, which let the data be checked \r
 632 against hash 11 which is already known to the receiver. \r
 633 \r
 634 The sender MAY optimistically skip hashes which were sent out in previous,\r
 635 still unacknowledged datagrams. It is an optimization tradeoff between\r
 636 redundant hash transmission and possibility of collateral data loss in the\r
 637 case some necessary hashes were lost in the network so some delivered data\r
 638 cannot be verified and thus has to be dropped. In either case, the receiver\r
 639 builds the Merkle tree on-demand, incrementally, starting from the root\r
 640 hash, and uses it for data validation.\r
 641 \r
 642 In short, the sender MUST put into the datagram the missing hashes necessary \r
 643 for the receiver to verify the chunk.\r
 644 \r
 645 .ti 0\r
 646 3.5.4. DATA and HASH Messages\r
 647 \r
 648 Concretely, a peer that wants to send a chunk of content creates a datagram \r
 649 that MUST consist of one or more HASH messages and a DATA message. The datagram \r
 650 MUST contain a HASH message for each hash the receiver misses for integrity \r
 651 checking. A HASH message MUST contain the bin number and hash data for each \r
 652 of those hashes. The DATA message MUST contain the bin number of the chunk \r
 653 and chunk itself. A peer MAY send the required messages for multiple chunks \r
 654 in the same datagram. \r
 655 \r
 656 \r
 657 .ti 0\r
 658 3.6. HINT\r
 659 \r
 660 While bulk download protocols normally do explicit requests for certain ranges \r
 661 of data (i.e., use a pull model, for example, BitTorrent [BITTORRENT]), live \r
 662 streaming protocols quite often use a request-less push model to save round \r
 663 trips. Swift supports both models of operation. \r
 664 \r
 665 A peer MUST send a HINT message containing the bin of the chunk interval it \r
 666 wants to download. A peer receiving a HINT message MAY send out requested \r
 667 pieces. When it receives multiple HINTs (either in one datagram or in multiple), \r
 668 the peer SHOULD process the HINTs sequentially. When live streaming, \r
 669 it also may send some other chunks in case it runs out of requests or\r
 670 for some other reason. In that case the only purpose of HINT messages is \r
 671 to coordinate peers and to avoid unnecessary data retransmission, hence \r
 672 the name. \r
 673 \r
 674 \r
 675 \r
 676 .ti 0\r
 677 3.7. Peer Address Exchange and NAT Hole Punching\r
 678 \r
 679 Peer address exchange messages (or PEX messages for short) are common for \r
 680 many peer-to-peer protocols. By exchanging peer addresses in gossip fashion, \r
 681 peers relieve central coordinating entities (the trackers) from unnecessary \r
 682 work. swift optionally features two types of PEX messages: PEX_REQ and PEX_ADD.\r
 683 A peer that wants to retrieve some peer addresses MUST send a PEX_REQ message.\r
 684 The receiving peer MAY respond with a PEX_ADD message containing the addresses\r
 685 of several peers. The addresses MUST be of peers it has recently exchanged\r
 686 messages with to guarantee liveliness. \r
 687 \r
 688 .fi\r
 689 To unify peer exchange and NAT hole punching functionality, the\r
 690 sending pattern of PEX messages is restricted. As the swift handshake\r
 691 is able to do simple NAT hole punching [SNP] transparently, PEX\r
 692 messages must be emitted in the way to facilitate that. Namely,\r
 693 once peer A introduces peer B to peer C by sending a PEX_ADD message to\r
 694 C, it SHOULD also send a message to B introducing C. The messages\r
 695 SHOULD be within 2 seconds from each other, but MAY not be,\r
 696 simultaneous, instead leaving a gap of twice the "typical" RTT, i.e.\r
 697 \%300-600ms. The peers are supposed to initiate handshakes to each\r
 698 other thus forming a simple NAT hole punching pattern where the\r
 699 introducing peer effectively acts as a STUN server [RFC5389]. Still, peers\r
 700 MAY ignore PEX messages if uninterested in obtaining new peers or\r
 701 because of security considerations (rate limiting) or any other\r
 702 reason.\r
 703 \r
 704 The PEX messages can be used to construct a dedicated tracker peer.\r
 705 \r
 706 \r
 707 .ti 0\r
 708 3.8. KEEPALIVE\r
 709 \r
 710 A peer MUST send a datagram containing a KEEPALIVE message periodically\r
 711 to each peer it wants to interact with in the future but has no\r
 712 other messages to send them at present.\r
 713 \r
 714 \r
 715 .ti 0\r
 716 3.9. VERSION\r
 717 .fi\r
 718 Peers MUST convey which version of the swift protocol they support using a \r
 719 VERSION message. This message MUST be included in the initial (handshake) \r
 720 datagrams and MUST indicate which version of the swift protocol the sending \r
 721 peer supports. \r
 722 \r
 723 \r
 724 .ti 0\r
 725 3.10. Conveying Peer Capabilities\r
 726 .fi\r
 727 Peers may support just a subset of the swift messages. For example, peers \r
 728 running over TCP may not accept ACK messages, or peers used with a centralized \r
 729 tracking infrastructure may not accept PEX messages. For these reasons, peers \r
 730 SHOULD signal which subset of the swift messages they support by means of \r
 731 the MSGTYPE_RCVD message. This message SHOULD be included in the initial \r
 732 (handshake) datagrams and MUST indicate which swift protocol messages \r
 733 the sending peer supports. \r
 734 \r
 735 \r
 736 .ti 0\r
 737 3.11. Directory Lists\r
 738 \r
 739 Directory list files MUST start with magic bytes ".\\n..\\n". The rest of\r
 740 the file is a newline-separated list of hashes and file names for the\r
 741 content of the directory. An example:\r
 742 \r
 743 .nf\r
 744 \&.\r
 745 \&..\r
 746 1234567890ABCDEF1234567890ABCDEF12345678  readme.txt\r
 747 01234567890ABCDEF1234567890ABCDEF1234567  big_file.dat\r
 748 \r
 749 \r
 750 \r
 751 .ti 0\r
 752 4. Automatic Detection of Content Size\r
 753 \r
 754 .fi\r
 755 In swift, the root hash of a static content asset, such as a video file,\r
 756 along with some peer addresses is sufficient to start a download.\r
 757 In addition, swift can reliably and automatically derive the size\r
 758 of such content from information received from the network when fixed\r
 759 sized chunks are used. As a result, it is not necessary to include \r
 760 the size of the content asset as the metadata of the content,\r
 761 in addition to the root hash. Implementations of swift MAY use \r
 762 this automatic detection feature.  \r
 763 \r
 764 .ti 0\r
 765 4.1. Peak Hashes\r
 766 \r
 767 The ability for a newcomer peer to detect the size of the content \r
 768 depends heavily on the concept of peak hashes. Peak hashes,\r
 769 in general, enable two cornerstone features of swift: reliable file \r
 770 size detection and download/live streaming unification (see Sec. 5). \r
 771 The concept of peak hashes depends on the concepts of filled and\r
 772 incomplete bins. Recall that when constructing the binary trees\r
 773 for content verification and addressing the base of the tree may\r
 774 have more leaves than the number of chunks in the content. In the\r
 775 Merkle hash tree these leaves were assigned empty all-zero \r
 776 hashes to be able to calculate the higher level hashes. A filled\r
 777 bin is now defined as a bin number that addresses an interval\r
 778 of leaves that consists only of hashes of content chunks, not\r
 779 empty hashes. Reversely, an incomplete (not filled) bin \r
 780 addresses an interval that contains also empty hashes,\r
 781 typically an interval that extends past the end of the file.\r
 782 In the following figure bins 7, 11, 13 and 14 are incomplete\r
 783 the rest is filled.\r
 784 \r
 785 Formally, a peak hash is a hash in the Merkle tree defined \r
 786 over a filled bin, whose sibling is defined over an incomplete bin. \r
 787 Practically, suppose a file is 7162 bytes long and a chunk \r
 788 is 1 kilobyte. That file fits into 7 chunks, the tail chunk being \r
 789 1018 bytes long. The Merkle tree for that file looks as follows. \r
 790 Following the definition the peak hashes of this file are in \r
 791 bins 3, 9 and 12, denoted with a *. E denotes an empty \r
 792 hash.\r
 793 \r
 794                         7\r
 795 .br\r
 796                        / \\\r
 797 .br\r
 798                      /     \\\r
 799 .br\r
 800                    /         \\\r
 801 .br\r
 802                  /             \\\r
 803 .br\r
 804                3*               11\r
 805 .br\r
 806               / \\              / \\\r
 807 .br\r
 808              /   \\            /   \\\r
 809 .br\r
 810             /     \\          /     \\\r
 811 .br\r
 812            1       5        9*      13 \r
 813 .br\r
 814           / \\     / \\      / \\      / \\\r
 815 .br\r
 816          0   2   4   6    8   10  12*  14\r
 817 .br\r
 818 \r
 819          C0  C1  C2  C3   C4  C5  C6   E\r
 820 .br\r
 821                                   = 1018 bytes\r
 822 .br\r
 823 \r
 824 Peak hashes can be explained by the binary representation of the \r
 825 number of chunks the file occupies. The binary representation for \r
 826 7 is 111. Every "1" in binary representation of the file's packet \r
 827 length corresponds to a peak hash. For this particular file there\r
 828 are indeed three peaks, bin numbers 3, 9, 12. The number of peak\r
 829 hashes for a file is therefore also at most logarithmic with its \r
 830 size.\r
 831 \r
 832 .fi\r
 833 A peer knowing which bins contain the peak hashes for the file \r
 834 can therefore calculate the number of chunks it consists of, and \r
 835 thus get an estimate of the file size (given all chunks but the last \r
 836 are fixed size). Which bins are the peaks can  be securely communicated \r
 837 from one (untrusted) peer A to another B by letting A send the \r
 838 peak hashes and their bin numbers to B. It can be shown that \r
 839 the root hash that B obtained from a trusted source is sufficient \r
 840 to verify that these are indeed the right peak hashes, as follows.  \r
 841 \r
 842 Lemma: Peak hashes can be checked against the root hash.\r
 843 \r
 844 Proof: (a) Any peak hash is always the left sibling. Otherwise, be\r
 845 it the right sibling, its left neighbor/sibling must also be\r
 846 defined over a filled bin, because of the way chunks are laid\r
 847 out in the leaves, contradiction. (b) For the rightmost \r
 848 peak hash, its right sibling is zero. (c) For any peak hash, \r
 849 its right sibling might be calculated using peak hashes to the \r
 850 left and zeros for empty bins. (d) Once the right sibling of \r
 851 the leftmost peak hash is calculated, its parent might be \r
 852 calculated. (e) Once that parent is calculated, we might \r
 853 trivially get to the root hash by concatenating the hash with \r
 854 zeros and hashing it repeatedly.\r
 855 \r
 856 .fi\r
 857 Informally, the Lemma might be expressed as follows: peak hashes cover all\r
 858 data, so the remaining hashes are either trivial (zeros) or might be\r
 859 calculated from peak hashes and zero hashes.\r
 860 \r
 861 Finally, once peer B has obtained the number of chunks in the content it\r
 862 can determine the exact file size as follows. Given that all chunks\r
 863 except the last are fixed size B just needs to know the size of the last\r
 864 chunk. Knowing the number of chunks B can calculate the bin number of the \r
 865 last chunk and download it. As always B verifies the integrity of this \r
 866 chunk against the trusted root hash. As there is only one chunk of data \r
 867 that leads to a successful verification the size of this chunk must \r
 868 be correct. B can then determine the exact file size as \r
 869 \r
 870     (number of chunks -1) * fixed chunk size + size of last chunk\r
 871 \r
 872 \r
 873 .ti 0\r
 874 4.2. Procedure\r
 875 \r
 876 A swift implementation that wants to use automatic size detection MUST\r
 877 operate as follows. When a peer B sends a DATA message for the first time\r
 878 to a peer A, B MUST include all the peak hashes for the content in the\r
 879 same datagram, unless A has already signalled earlier in the exchange \r
 880 that it knows the peak hashes by having acknowledged any bin, even the empty \r
 881 one. The receiver A MUST check the peak hashes against the root hash\r
 882 to determine the approximate content size. To obtain the definite content size\r
 883 peer A MUST download the last chunk of the content from any peer that offers it.\r
 884 \r
 885 \r
 886 \r
 887 \r
 888 \r
 889 .ti 0\r
 890 5. Live streaming\r
 891 \r
 892 .fi\r
 893 In the case of live streaming a transfer is bootstrapped with a public key\r
 894 instead of a root hash, as the root hash is undefined or, more precisely,\r
 895 transient, as long as new data is being generated by the live source. \r
 896 Live/download unification is achieved by sending signed peak hashes on-demand, \r
 897 ahead of the actual data. As before, the sender might use acknowledgements \r
 898 to derive which content range the receiver has peak hashes for and to prepend \r
 899 the data hashes with the necessary (signed) peak hashes. \r
 900 Except for the fact that the set of peak hashes changes with time, \r
 901 other parts of the algorithm work as described in Sec. 3. \r
 902 \r
 903 As with static content assets in the previous section, in live streaming \r
 904 content length is not known on advance, but derived \%on-the-go from the peak \r
 905 hashes. Suppose, our 7 KB stream extended to another kilobyte. Thus, now hash 7 \r
 906 becomes the only peak hash, eating hashes 3, 9 and 12. So, the source sends \r
 907 out a SIGNED_HASH message to announce the fact.\r
 908 \r
 909 The number of cryptographic operations will be limited. For example,\r
 910 consider a 25 frame/second video transmitted over UDP. When each frame is \r
 911 transmitted in its own chunk, only 25 signature verification operations \r
 912 per second are required at the receiver for bitrates up to ~12.8 \r
 913 megabit/second. For higher bitrates multiple UDP packets per frame \r
 914 are needed and the number of verifications doubles.\r
 915 \r
 916 \r
 917 \r
 918 \r
 919 \r
 920 .ti 0\r
 921 6. Transport Protocols and Encapsulation\r
 922 \r
 923 .ti 0\r
 924 6.1. UDP\r
 925 \r
 926 .ti 0\r
 927 6.1.1. Chunk Size\r
 928 \r
 929 Currently, swift-over-UDP is the preferred deployment option. Effectively, UDP\r
 930 allows the use of IP with minimal overhead and it also allows userspace\r
 931 implementations. The default is to use chunks of 1 kilobyte such that a\r
 932 datagram fits in an Ethernet-sized IP packet. The bin numbering\r
 933 allows to use swift over Jumbo frames/datagrams. Both DATA and\r
 934 HAVE/ACK messages may use e.g. 8 kilobyte packets instead of the standard 1 \r
 935 KiB. The hashing scheme stays the same. Using swift with 512 or 256-byte \r
 936 packets is theoretically possible with 64-bit byte-precise bin numbers, but IP\r
 937 fragmentation might be a better method to achieve the same result.\r
 938 \r
 939 \r
 940 .ti 0\r
 941 6.1.2. Datagrams and Messages\r
 942 \r
 943 When using UDP, the abstract datagram described above corresponds directly\r
 944 to a UDP datagram. Each message within a datagram has a fixed length, which\r
 945 depends on the type of the message. The first byte of a message denotes its type.\r
 946 The currently defined types are:\r
 947 \r
 948         HANDSHAKE = 0x00\r
 949 .br        \r
 950         DATA = 0x01\r
 951 .br\r
 952         ACK = 0x02\r
 953 .br        \r
 954         HAVE = 0x03\r
 955 .br        \r
 956         HASH = 0x04\r
 957 .br        \r
 958         PEX_ADD = 0x05\r
 959 .br        \r
 960         PEX_REQ = 0x06\r
 961 .br        \r
 962         SIGNED_HASH = 0x07\r
 963 .br        \r
 964         HINT = 0x08\r
 965 .br        \r
 966         MSGTYPE_RCVD = 0x09\r
 967 .br        \r
 968         VERSION = 0x10\r
 969 .br        \r
 970 \r
 971 \r
 972 Furthermore, integers are serialized in the network \%(big-endian) byte order. \r
 973 So consider the example of an ACK message (Sec 3.4). It has message type of 0x02\r
 974 and a payload of a bin number, a four-byte integer (say, 1); hence, its on the wire\r
 975 representation for UDP can be written in hex as: "02 00000001". This \%hex-like two \r
 976 character-per-byte notation is used to represent message formats in the rest of \r
 977 this section.\r
 978 \r
 979 .ti 0\r
 980 6.1.3. Channels  \r
 981 \r
 982 As it is increasingly complex for peers to enable UDP communication\r
 983 between each other due to NATs and firewalls, swift-over-UDP uses\r
 984 a multiplexing scheme, called "channels", to allow multiple swarms to use the \r
 985 same UDP port. Channels loosely correspond to TCP connections and each channel\r
 986 belongs to a single swarm. When channels are used, each datagram starts with \r
 987 four bytes corresponding to the receiving channel number. \r
 988 \r
 989  \r
 990 .ti 0\r
 991 6.1.4. HANDSHAKE and VERSION\r
 992 \r
 993 A channel is established with a handshake. To start a handshake, the initiating \r
 994 peer needs to know:\r
 995 \r
 996 .nf\r
 997 (1) the IP address of a peer\r
 998 (2) peer's UDP port and\r
 999 (3) the root hash of the content (see Sec. 3.5.1).\r
1000 .fi\r
1001 \r
1002 To do the handshake the initiating peer sends a datagram that MUST start \r
1003 with an all 0-zeros channel number followed by a VERSION message, then a \r
1004 HASH message whose payload is the root hash, and a HANDSHAKE message, whose \r
1005 only payload is a locally unused channel number. \r
1006 \r
1007 On the wire the datagram will look something like this:\r
1008 .nf\r
1009    00000000 10 01  \r
1010    04 7FFFFFFF 1234123412341234123412341234123412341234\r
1011    00 00000011\r
1012 .fi\r
1013 (to unknown channel, handshake from channel 0x11 speaking protocol version 0x01, \r
1014 initiating a transfer of a file with a root hash 123...1234)\r
1015 \r
1016 The receiving peer MUST respond with a datagram that starts with the\r
1017 channel number from the sender's HANDSHAKE message, followed by a \r
1018 VERSION message, then a HANDSHAKE message, whose only payload is a locally \r
1019 unused channel number, followed by any other messages it wants to send. \r
1020 \r
1021 Peer's response datagram on the wire:\r
1022 .nf\r
1023    00000011 10 01  \r
1024    00 00000022  03 00000003\r
1025 .fi\r
1026 (peer to the initiator: use channel number 0x22 for this transfer and \r
1027 proto version 0x01; I also have first 4 chunks of the file, see Sec. 4.3)\r
1028 \r
1029 .fi\r
1030 At this point, the initiator knows that the peer really responds; for that\r
1031 purpose channel ids MUST be random enough to prevent easy guessing. So, the\r
1032 third datagram of a handshake MAY already contain some heavy payload. To\r
1033 minimize the number of initialization roundtrips, the first two datagrams\r
1034 MAY also contain some minor payload, e.g. a couple of HAVE messages roughly\r
1035 indicating the current progress of a peer or a HINT (see Sec. 3.6).\r
1036 When receiving the third datagram, both peers have the proof they really talk \r
1037 to each other; three-way handshake is complete.\r
1038 \r
1039 A peer MAY explicit close a channel by sending a HANDSHAKE message that MUST\r
1040 contain an all 0-zeros channel number. \r
1041 \r
1042 On the wire:\r
1043 .nf \r
1044    00 00000000\r
1045 \r
1046 \r
1047 .ti 0\r
1048 6.1.5. HAVE\r
1049 \r
1050 A HAVE message (type 0x03) states that the sending peer has the \r
1051 complete specified bin and successfully checked its integrity:\r
1052 .nf\r
1053    03 00000003\r
1054 (got/checked first four kilobytes of a file/stream)\r
1055 \r
1056 \r
1057 \r
1058 .ti 0\r
1059 6.1.6. ACK\r
1060 \r
1061 An ACK message (type 0x02) acknowledges data that was received from\r
1062 its addressee; to facilitate delay-based congestion control, an\r
1063 ACK message contains a timestamp, in particular, a 64-bit microsecond\r
1064 time. \r
1065 .nf\r
1066    02 00000002 12345678\r
1067 (got the second kilobyte of the file from you; my microsecond\r
1068 timer was showing 0x12345678 at that moment)\r
1069 \r
1070 \r
1071 .ti 0\r
1072 6.1.7. HASH\r
1073 \r
1074 A HASH message (type 0x04) consists of a four-byte bin number and\r
1075 the cryptographic hash (e.g. a 20-byte SHA1 hash)\r
1076 .nf\r
1077    04 7FFFFFFF 1234123412341234123412341234123412341234\r
1078 \r
1079 \r
1080 .ti 0\r
1081 6.1.8. DATA\r
1082 \r
1083 .fi\r
1084 A DATA message (type 0x01) consists of a four-byte bin number and the \r
1085 actual chunk. In case a datagram contains a DATA message, a sender \r
1086 MUST always put the data message in the tail of the datagram. For \r
1087 example:\r
1088 .nf\r
1089    01 00000000 48656c6c6f20776f726c6421\r
1090 (This message accommodates an entire file: "Hello world!")\r
1091 \r
1092 \r
1093 .ti 0\r
1094 6.1.9. KEEPALIVE\r
1095 \r
1096 Keepalives do not have a message type on UDP. They are just simple \r
1097 datagrams consisting of a 4-byte channel id only.\r
1098 \r
1099 On the wire:\r
1100 .nf\r
1101    00000022\r
1102 \r
1103 .ti 0\r
1104 6.1.10. Flow and Congestion Control\r
1105 \r
1106 .fi\r
1107 Explicit flow control is not necessary in swift-over-UDP. In the case of\r
1108 video-on-demand the receiver will request data explicitly from peers and is \r
1109 therefore in control of how much data is coming towards it. In the case of\r
1110 live streaming, where a push-model may be used, the amount of data incoming\r
1111 is limited to the bitrate, which the receiver must be able to process otherwise\r
1112 it cannot play the stream. Should, for any reason, the receiver get saturated \r
1113 with data that situation is perfectly detected by the congestion control.\r
1114 Swift-over-UDP can support different congestion control algorithms, in particular,\r
1115 it supports the new IETF Low Extra Delay Background Transport (LEDBAT) congestion \r
1116 control algorithm that ensures that peer-to-peer traffic yields to regular \r
1117 best-effort traffic [LEDBAT]. \r
1118 \r
1119 \r
1120 .ti 0\r
1121 6.2. TCP\r
1122 \r
1123 .fi\r
1124 When run over TCP, swift becomes functionally equivalent to BitTorrent.\r
1125 Namely, most swift messages have corresponding BitTorrent messages and vice\r
1126 versa, except for BitTorrent's explicit interest declarations and\r
1127 choking/unchoking, which serve the classic implementation of the tit-for-tat\r
1128 algorithm [TIT4TAT]. However, TCP is not well suited for multiparty communication,\r
1129 as argued in Sec. 9.\r
1130 \r
1131 \r
1132 .ti 0\r
1133 6.3. RTP Profile for PPSP\r
1134 \r
1135 .fi\r
1136 In this section we sketch how swift can be integrated into RTP [RFC3550] to\r
1137 form the Peer-to-Peer Streaming Protocol (PPSP) [I-D.ietf-ppsp-reqs] running\r
1138 over UDP. The PPSP charter requires existing media transfer protocols be\r
1139 used [PPSPCHART]. Hence, the general idea is to define swift as a profile\r
1140 of RTP, in the same way as the Secure Real-time Transport Protocol (SRTP) \r
1141 [RFC3711]. SRTP, and therefore swift is considered ``a "bump in the stack"\r
1142 implementation which resides between the RTP application and the transport\r
1143 layer. [swift] intercepts RTP packets and then forwards an equivalent\r
1144 [swift] packet on the sending side, and intercepts [swift] packets and passes \r
1145 an equivalent RTP packet up the stack on the receiving side.'' [RFC3711].\r
1146 \r
1147 In particular, to encode a swift datagram in an RTP packet all the \r
1148 non-DATA messages of swift such as HINT and HAVE are postfixed to the RTP \r
1149 packet using the UDP encoding and the content of DATA messages is sent \r
1150 in the payload field. Implementations MAY omit the RTP header for packets\r
1151 without payload. This construction allows the streaming application to use \r
1152 of all RTP's current features, and with a modification to the Merkle tree \r
1153 hashing scheme (see below) meets swift's atomic datagram principle. The latter \r
1154 means that a receiving peer can autonomously verify the RTP packet as being \r
1155 correct content, thus preventing the spread of corrupt data (see \r
1156 requirement PPSP.SEC-REQ-4). \r
1157 \r
1158 The use of ACK messages for reliability is left as a choice of the\r
1159 application using PPSP.\r
1160 \r
1161 \r
1162 .ti 0\r
1163 6.3.1. Design\r
1164 \r
1165 6.3.1.1. Joining a Swarm\r
1166 \r
1167 To commence a PPSP download a peer A must have the swarm ID of the stream\r
1168 and a list of one or more tracker contact points (e.g. host+port). The list\r
1169 of trackers is optional in the presence of a decentralized tracking\r
1170 mechanism. The swarm ID consists of the swift root hash of the content,\r
1171 which is divided into chunks (see Discussion).\r
1172 \r
1173 Peer A now registers with the PPSP tracker following the tracker protocol\r
1174 [I-D.ietf.ppsp-reqs] and receives the IP address and RTP port of peers\r
1175 already in the swarm, say B, C, and D. Peer A now sends an RTP packet\r
1176 containing a HANDSHAKE without channel information to B, C, and D. This\r
1177 serves as an end-to-end check that the peers are actually in the correct\r
1178 swarm. Optionally A could include a HINT message in some RTP packets if it\r
1179 wants to start receiving content immediately. B and C respond with a\r
1180 HANDSHAKE and HAVE messages.  D sends just a HANDSHAKE and omits HAVE\r
1181 messages as a way of choking A.\r
1182 \r
1183 \r
1184 6.3.1.2. Exchanging Chunks\r
1185 \r
1186 In response to B and C, A sends new RTP packets to B and C with HINTs for\r
1187 disjunct sets of chunks. B and C respond with the requested chunks in the\r
1188 payload and HAVE messages, updating their chunk availability. Upon receipt, \r
1189 A sends HAVE for the chunks received and new HINT messages to B and C. When \r
1190 e.g. C finds that A obtained a chunk (from B) that C did not yet have, C's \r
1191 response includes a HINT for that chunk.\r
1192 \r
1193 D does not send HAVE messages, instead if D decides to unchoke peer A, it\r
1194 sends an RTP packet with HAVE messages to inform A of its current\r
1195 availability. If B or C decide to choke A they stop sending HAVE and DATA\r
1196 messages and A should then rerequest from other peers. They may continue to\r
1197 send HINT messages, or exponentially slowing KEEPALIVE messages such that A\r
1198 keeps sending them HAVE messages.\r
1199 \r
1200 Once A has received all content (video-on-demand use case) it stops\r
1201 sending messages to all other peers that have all content (a.k.a. seeders).\r
1202 \r
1203 \r
1204 6.3.1.3. Leaving a Swarm\r
1205 \r
1206 Peers can implicitly leave a swarm by stopping to respond to messages.\r
1207 Sending peers should remove these peers from the current peer list. This\r
1208 mechanism works for both graceful and ungraceful leaves (i.e., peer crashes\r
1209 or disconnects). When leaving gracefully, a peer should deregister from the\r
1210 tracker following the PPSP tracker protocol.\r
1211 \r
1212 More explicit graceful leaves could be implemented using RTCP. In\r
1213 particular, a peer could send a RTCP BYE on the RTCP port that is derivable\r
1214 from a peer's RTP port for all peers in its current peer list. However, to\r
1215 prevent malicious peers from sending BYEs a form of peer authentication is\r
1216 required (e.g. using public keys as peer IDs [PERMIDS].)\r
1217 \r
1218 \r
1219 6.3.1.4. Discussion\r
1220 \r
1221 Using swift as an RTP profile requires a change to the content\r
1222 integrity protection scheme (see Sec. 3.5). The fields in the RTP \r
1223 header, such as the timestamp and PT fields, must be protected by the Merkle \r
1224 tree hashing scheme to prevent malicious alterations. Therefore, the Merkle\r
1225 tree is no longer constructed from pure content chunks, but from the complete \r
1226 RTP packet for a chunk as it would be transmitted (minus the non-DATA swift messages).\r
1227 In other words, the hash of the leaves in the tree is the hash over the\r
1228 Authenticated Portion of the RTP packet as defined by SRTP, illustrated\r
1229 in the following figure (extended from [RFC3711]). There is no need for the \r
1230 RTP packets to be fixed size, as the hashing scheme can deal with \r
1231 variable-sized leaves. \r
1232 \r
1233 \r
1234 \r
1235 \r
1236 \r
1237 \r
1238 \r
1239 \r
1240 \r
1241 \r
1242 \r
1243 \r
1244 \r
1245 \r
1246 \r
1247 \r
1248 \r
1249 \r
1250 \r
1251 \r
1252 \r
1253 \r
1254 \r
1255 \r
1256 \r
1257 \r
1258 \r
1259 \r
1260 \r
1261 \r
1262 \r
1263 \r
1264 \r
1265 \r
1266 \r
1267 \r
1268 \r
1269 .in 0\r
1270         0                   1                   2                   3\r
1271 .br\r
1272       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1\r
1273 .br\r
1274      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+\r
1275 .br\r
1276      |V=2|P|X|  CC   |M|     PT      |       sequence number         | |\r
1277 .br\r
1278      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1279 .br\r
1280      |                           timestamp                           | |\r
1281 .br\r
1282      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1283 .br\r
1284      |           synchronization source (SSRC) identifier            | |\r
1285 .br\r
1286      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ |\r
1287 .br\r
1288      |            contributing source (CSRC) identifiers             | |\r
1289 .br\r
1290      |                               ....                            | |\r
1291 .br\r
1292      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1293 .br\r
1294      |                   RTP extension (OPTIONAL)                    | |\r
1295 .br\r
1296      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1297 .br\r
1298      |                          payload  ...                         | |\r
1299 .br\r
1300      |                               +-------------------------------+ |\r
1301 .br\r
1302      |                               | RTP padding   | RTP pad count | |\r
1303 .br\r
1304      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+\r
1305 .br\r
1306      ~                     swift non-DATA messages (REQUIRED)        ~ |\r
1307 .br\r
1308      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1309 .br\r
1310      |                 length of swift messages (REQUIRED)           | |\r
1311 .br\r
1312      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |\r
1313 .br\r
1314                                                                        |\r
1315 .br\r
1316                                               Authenticated Portion ---+\r
1317 .br\r
1318 \r
1319            Figure: The format of an RTP-Swift packet.                                         \r
1320 .in 3     \r
1321                                                                   \r
1322 As a downside, with variable-sized payloads the automatic content size \r
1323 detection of Section 4 no longer works, so content length MUST be explicit in the \r
1324 metadata. In addition, storage on disk is more complex with out-of-order, \r
1325 variable-sized packets. On the upside, carrying RTP over swift allow \r
1326 decryption-less caching.\r
1327  \r
1328 As with UDP, another matter is how much data is carried inside each packet.\r
1329 An important swift-specific factor here is the resulting number of hash \r
1330 calculations per second needed to verify chunks. Experiments should be \r
1331 conducted to ensure they are not excessive for, e.g., mobile hardware.\r
1332 \r
1333 At present, Peer IDs are not required in this design.\r
1334 \r
1335 \r
1336 .ti 0\r
1337 6.3.2. PPSP Requirements\r
1338 \r
1339 6.3.2.1. Basic Requirements\r
1340 \r
1341 - PPSP.REQ-1: The swift PEX message can also be used as the basis for a tracker protocol, to be discussed elsewhere.\r
1342 \r
1343 - PPSP.REQ-2: This draft preserves the properties of RTP.\r
1344 \r
1345 - PPSP.REQ-3: This draft does not place requirements on peer IDs, IP+port is sufficient. \r
1346 \r
1347 - PPSP.REQ-4: The content is identified by its root hash (video-on-demand) or a public key (live streaming).\r
1348 \r
1349 - PPSP.REQ-5: The content is partitioned by the streaming application.\r
1350 \r
1351 - PPSP.REQ-6: Each chunk is identified by a bin number (and its cryptographic hash.)\r
1352 \r
1353 - PPSP.REQ-7: The protocol is carried over UDP because RTP is.\r
1354 \r
1355 - PPSP.REQ-8: The protocol has been designed to allow meaningful data transfer between\r
1356 peers as soon as possible and to avoid unnecessary round-trips. It supports small and\r
1357 variable chunk sizes, and its content integrity protection enables wide scale caching.\r
1358 \r
1359 \r
1360 6.3.2.2. Peer Protocol Requirements\r
1361 \r
1362 - PPSP.PP.REQ-1: A GET_HAVE would have to be added to request which\r
1363 chunks are available from a peer, if the proposed push-based HAVE\r
1364 mechanism is not sufficient. \r
1365 \r
1366 - PPSP.PP.REQ-2: A set of HAVE messages satisfies this. \r
1367 \r
1368 - PPSP.PP.REQ-3: The PEX_REQ message satisfies this. Care should be taken with \r
1369 peer address exchange in general, as the use of such hearsay is a risk for the \r
1370 protocol as it may be exploited by malicious peers (as a DDoS attack mechanism).\r
1371 A secure tracking / peer sampling protocol like [PUPPETCAST] may be needed \r
1372 to make peer-address exchange safe.\r
1373 \r
1374 - PPSP.PP.REQ-4: HAVE messages convey current availability via a push model.\r
1375 \r
1376 - PPSP.PP.REQ-5: Bin numbering enables a compact representation of chunk availability.\r
1377 \r
1378 - PPSP.PP.REQ-6: A new PPSP specific Peer Report message would have to be\r
1379 added to RTCP. \r
1380 \r
1381 - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in this\r
1382 protocol.\r
1383 \r
1384 \r
1385 6.3.2.3. Security Requirements\r
1386 \r
1387 - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms\r
1388 [CLOSED] would have to be added. \r
1389 \r
1390 - PPSP.SEC.REQ-2: As RTP is carried verbatim over swift, RTP encryption\r
1391 can be used. Note that just encrypting the RTP part will allow for \r
1392 caching servers that are part of the swarm but do not need access to the \r
1393 decryption keys. They just need access to the swift HASHES in the postfix to \r
1394 verify the packet's integrity.\r
1395 \r
1396 - PPSP.SEC.REQ-3: RTP encryption or IPsec [RFC4303] can be used, if the\r
1397 swift messages must also be encrypted. \r
1398 \r
1399 - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the indirect\r
1400 spread of corrupt content, as peers will only forward chunks to others if\r
1401 their integrity check out. Another protection mechanism is to not depend on\r
1402 hearsay (i.e., do not forward other peers' availability information), or to\r
1403 only use it when the information spread is self-certified by its subjects.\r
1404 \r
1405 Other attacks, such as a malicious peer claiming it has content but not\r
1406 replying, are still possible. Or wasting CPU and bandwidth at a receiving\r
1407 peer by sending packets where the DATA doesn't match the HASHes.\r
1408 \r
1409 \r
1410 - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving\r
1411 peer to detect a malicious or faulty sender, which it can subsequently\r
1412 ignore. Spreading this knowledge to other peers such that they know\r
1413 about this bad behavior is hearsay. \r
1414 \r
1415 \r
1416 - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that malicious\r
1417 peers launch an Eclipse [ECLIPSE] attack on the initial injectors of the\r
1418 content (in particular in live streaming). The attack tries to let the\r
1419 injector upload to just malicious peers which then do not forward the\r
1420 content to others, thus stopping the distribution. An Eclipse attack could\r
1421 also be launched on an individual peer. Letting these injectors only use\r
1422 trusted trackers that provide true random samples of the population or using\r
1423 a secure peer sampling service [PUPPETCAST] can help negate such an attack. \r
1424 \r
1425 \r
1426 - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or\r
1427 additional mechanisms such as DHTs [SECDHTS], but self-certification of\r
1428 addresses is needed. Self-certification means For example, that each peer \r
1429 has a public/private key pair [PERMIDS] and creates self-certified address \r
1430 changes that include the swarm ID and a timestamp, which are then exchanged\r
1431 among peers or stored in DHTs. See also discussion of PPSP.PP.REQ-3 \r
1432 above. Content distribution can continue as long as there are peers that \r
1433 have it available.\r
1434 \r
1435 - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a\r
1436 trusted source is well-established in the BitTorrent protocol [BITTORRENT].\r
1437 The proposed Merkle tree scheme is a secure extension of this idea.\r
1438 Self-certification and not using hearsay are other lessons learned from\r
1439 existing distributed systems.\r
1440 \r
1441 - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via\r
1442 self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.\r
1443 \r
1444 \r
1445 .ti 0\r
1446 6.4. HTTP (as PPSP)\r
1447 \r
1448 In this section we sketch how swift can be carried over HTTP [RFC2616] to\r
1449 form the PPSP running over TCP. The general idea is to encode a swift\r
1450 datagram in HTTP GET and PUT requests and their replies by transmitting all\r
1451 the non-DATA messages such as HINTs and HAVEs as headers and send DATA\r
1452 messages in the body. This idea follows the atomic datagram principle for\r
1453 each request and reply. So a receiving peer can autonomously verify the\r
1454 message as carrying correct data, thus preventing the spread of corrupt data\r
1455 (see requirement PPSP.SEC-REQ-4). \r
1456 \r
1457 A problem with HTTP is that it is a client/server protocol. To overcome this\r
1458 problem, a peer A uses a PUT request instead of a GET request if the peer B\r
1459 has indicated in a reply that it wants to retrieve a chunk from A. In cases\r
1460 where peer A is no longer interested in receiving requests from B (described\r
1461 below) B may need to establish a new HTTP connection to A to quickly\r
1462 download a chunk, instead of waiting for a convenient time when A sends\r
1463 another request. As an alternative design, two HTTP connections could be\r
1464 used always., but this is inefficient.\r
1465 \r
1466 .ti 0\r
1467 6.4.1. Design\r
1468 \r
1469 6.4.1.1. Joining a Swarm\r
1470 \r
1471 To commence a PPSP download a peer A must have the swarm ID of the stream\r
1472 and a list of one or more tracker contact points, as above. The swarm ID as\r
1473 earlier also consists of the swift root hash of the content, divided in\r
1474 chunks by the streaming application (e.g. fixed-size chunks of 1 kilobyte\r
1475 for video-on-demand). \r
1476 \r
1477 Peer A now registers with the PPSP tracker following the tracker protocol\r
1478 [I-D.ietf-ppsp-reqs] and receives the IP address and HTTP port of peers\r
1479 already in the swarm, say B, C, and D. Peer A now establishes persistent\r
1480 HTTP connections with B, C, D and sends GET requests with the Request-URI\r
1481 set to /<encoded roothash>. Optionally A could include a HINT message in\r
1482 some requests if it wants to start receiving content immediately. A HINT is\r
1483 encoded as a Range header with a new "bins" unit [RFC2616,$14.35].\r
1484 \r
1485 B and C respond with a 200 OK reply with header-encoded HAVE messages.  A\r
1486 HAVE message is encoded as an extended Accept-Ranges: header [RFC2616,$14.5] \r
1487 with the new bins unit and the possibility of listing the set of accepted bins. \r
1488 If no HINT/Range header was present in the request, the body of the reply is\r
1489 empty. D sends just a 200 OK reply and omits the HAVE/Accept-Ranges header\r
1490 as a way of choking A.\r
1491 \r
1492 6.4.1.2. Exchanging Chunks\r
1493 \r
1494 In response to B and C, A sends GET requests with Range headers, requesting\r
1495 disjunct sets of chunks. B and C respond with 206 Partial Content replies\r
1496 with the requested chunks in the body and Accept-Ranges headers, updating\r
1497 their chunk availability. The HASHES for the chunks are encoded in a new\r
1498 Content-Merkle header and the Content-Range is set to identify the chunk \r
1499 [RFC2616,$14.16]. A new "multipart-bin ranges" equivalent to the \r
1500 "multipart-bytes ranges" media type may be used to transmit multiple chunks \r
1501 in one reply.\r
1502 \r
1503 Upon receipt, A sends a new GET request with a HAVE/Accept-Ranges header for\r
1504 the chunks received and new HINT/Range headers to B and C. Now when e.g. C\r
1505 finds that A obtained a chunk (from B) that C did not yet have, C's response\r
1506 includes a HINT/Range for that chunk. In this case, A's next request to C is\r
1507 not a GET request, but a PUT request with the requested chunk sent in the\r
1508 body.\r
1509 \r
1510 Again, working around the fact that HTTP is a client/server protocol, peer A\r
1511 periodically sends HEAD requests to peer D (which was virtually choking A)\r
1512 that serve as keepalives and may contain HAVE/Accept-Ranges headers. If D\r
1513 decides to unchoke peer A, it includes an Accept-Ranges header in the "200 OK"\r
1514 reply to inform A of its current chunk availability. \r
1515 \r
1516 If B or C decide to choke A they start responding with 204 No Content\r
1517 replies without HAVE/Accept-Ranges headers and A should then re-request from\r
1518 other peers. However, if their replies contain HINT/Range headers A should\r
1519 keep on sending PUT requests with the desired data (another client/server\r
1520 workaround). If not, A should slowly send HEAD requests as keepalive and\r
1521 content availability update.\r
1522 \r
1523 Once A has received all content (video-on-demand use case) it closes the\r
1524 persistent connections to all other peers that have all content (a.k.a.\r
1525 seeders).\r
1526 \r
1527 \r
1528 6.4.1.3. Leaving a Swarm\r
1529 \r
1530 Peers can explicitly leave a swarm by closing the connection. This mechanism\r
1531 works for both graceful and ungraceful leaves (i.e., peer crashes or\r
1532 disconnects). When leaving gracefully, a peer should deregister from the\r
1533 tracker following the PPSP tracker protocol.\r
1534 \r
1535 \r
1536 6.4.1.4. Discussion\r
1537 \r
1538 As mentioned earlier, this design suffers from the fact that HTTP is a\r
1539 client/server protocol. A solution where a peer establishes two HTTP\r
1540 connections with every other peer may be more elegant, but inefficient. \r
1541 The mapping of swift messages to headers remains the same:\r
1542 \r
1543         HINT = Range\r
1544 .br\r
1545         HAVE = Accept-Ranges\r
1546 .br\r
1547         HASH = Content-Merkle\r
1548 .br     \r
1549         PEX = e.g. extended Content-Location\r
1550 .br\r
1551 \r
1552 The Content-Merkle header should include some parameters to\r
1553 indicate the hash function and chunk size (e.g. SHA1 and 1K) used to\r
1554 build the Merkle tree.\r
1555 \r
1556 \r
1557 .ti 0\r
1558 6.4.2. PPSP Requirements\r
1559 \r
1560 6.4.2.1. Basic Requirements\r
1561 \r
1562 - PPSP.REQ-1: The HTTP-based BitTorrent tracker protocol [BITTORRENT] can be\r
1563 used as the basis for a tracker protocol, to be discussed elsewhere.\r
1564 \r
1565 - PPSP.REQ-2: This draft preserves the properties of HTTP, but \r
1566 extra mechanisms may be necessary to protect against faulty\r
1567 or malicious peers.\r
1568 \r
1569 - PPSP.REQ-3: This draft does not place requirements on peer IDs,\r
1570 IP+port is sufficient. \r
1571 \r
1572 - PPSP.REQ-4: The content is identified by its root hash (video-on-demand)\r
1573 or a public key (live streaming).\r
1574 \r
1575 - PPSP.REQ-5: The content is partitioned into chunks by the streaming application (see 6.4.1.1.)\r
1576 \r
1577 - PPSP.REQ-6: Each chunk is identified by a bin number (and its cryptographic hash.)\r
1578 \r
1579 - PPSP.REQ-7: The protocol is carried over TCP because HTTP is.\r
1580 \r
1581 \r
1582 6.4.2.2. Peer Protocol Requirements\r
1583 \r
1584 - PPSP.PP.REQ-1: A HEAD request can be used to find out which\r
1585 chunks are available from a peer, which returns the new Accept-Ranges header.\r
1586 \r
1587 - PPSP.PP.REQ-2: The new Accept-Ranges header satisfies this. \r
1588 \r
1589 - PPSP.PP.REQ-3: A GET with a request-URI requesting the peers of a resource\r
1590 (e.g. /<encoded roothash>/peers) would have to be added to request known\r
1591 peers from a peer, if the proposed push-based PEX/~Content-Location\r
1592 mechanism is not sufficient. Care should be taken with peer address exchange \r
1593 in general, as the use of such hearsay is a risk for the protocol as it may be \r
1594 exploited by malicious peers (as a DDoS attack mechanism). A secure \r
1595 tracking / peer sampling protocol like [PUPPETCAST] may be needed \r
1596 to make peer-address exchange safe.\r
1597 \r
1598 \r
1599 - PPSP.PP.REQ-4: HAVE/Accept-Ranges headers convey current availability.\r
1600 \r
1601 - PPSP.PP.REQ-5: Bin numbering enables a compact representation of chunk availability.\r
1602 \r
1603 - PPSP.PP.REQ-6: A new PPSP specific Peer-Report header would have to be\r
1604 added. \r
1605 \r
1606 - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in this\r
1607 protocol.\r
1608 \r
1609 \r
1610 \r
1611 \r
1612 6.4.2.3. Security Requirements\r
1613 \r
1614 - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms\r
1615 [CLOSED] would have to be added. \r
1616 \r
1617 - PPSP.SEC.REQ-2: As swift is carried over HTTP, HTTPS encryption can be\r
1618 used instead. Alternatively, just the body could be encrypted. The latter\r
1619 allows for caching servers that are part of the swarm but do not need access\r
1620 to the decryption keys (they need access to the swift HASHES in the headers\r
1621 to verify the packet's integrity).\r
1622 \r
1623 - PPSP.SEC.REQ-3: HTTPS encryption or the content encryption facilities of\r
1624 HTTP can be used.\r
1625 \r
1626 - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the indirect\r
1627 spread of corrupt content, as peers will only forward content to others if\r
1628 its integrity checks out. Another protection mechanism is to not depend on\r
1629 hearsay (i.e., do not forward other peers' availability information), or to\r
1630 only use it when the information spread is self-certified by its subjects.\r
1631 \r
1632 Other attacks such as a malicious peer claiming it has content, but not\r
1633 replying are still possible. Or wasting CPU and bandwidth at a receiving\r
1634 peer by sending packets where the body doesn't match the HASH/Content-Merkle\r
1635 headers.\r
1636 \r
1637 \r
1638 - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving peer to\r
1639 detect a malicious or faulty sender, which it can subsequently close its\r
1640 connection to and ignore. Spreading this knowledge to other peers such that\r
1641 they know about this bad behavior is hearsay. \r
1642 \r
1643 \r
1644 - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that malicious\r
1645 peers launch an Eclipse [ECLIPSE] attack on the initial injectors of the\r
1646 content (in particular in live streaming). The attack tries to let the\r
1647 injector upload to just malicious peers which then do not forward the\r
1648 content to others, thus stopping the distribution. An Eclipse attack could\r
1649 also be launched on an individual peer. Letting these injectors only use\r
1650 trusted trackers that provide true random samples of the population or using\r
1651 a secure peer sampling service [PUPPETCAST] can help negate such an attack. \r
1652 \r
1653 \r
1654 - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or\r
1655 additional mechanisms such as DHTs [SECDHTS], but self-certification of\r
1656 addresses is needed. Self-certification means For example, that each peer \r
1657 has a public/private key pair [PERMIDS] and creates self-certified address \r
1658 changes that include the swarm ID and a timestamp, which are then exchanged\r
1659 among peers or stored in DHTs. See also discussion of PPSP.PP.REQ-3 \r
1660 above. Content distribution can continue as long as there are peers that \r
1661 have it available.\r
1662 \r
1663 - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a\r
1664 trusted source is well-established in the BitTorrent protocol [BITTORRENT].\r
1665 The proposed Merkle tree scheme is a secure extension of this idea.\r
1666 Self-certification and not using hearsay are other lessons learned from\r
1667 existing distributed systems.\r
1668 \r
1669 - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via\r
1670 self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.\r
1671 \r
1672 \r
1673 .ti 0\r
1674 7. Security Considerations\r
1675 \r
1676 As any other network protocol, the swift faces a common set of security\r
1677 challenges. An implementation must consider the possibility of buffer\r
1678 overruns, DoS attacks and manipulation (i.e. reflection attacks). Any\r
1679 guarantee of privacy seems unlikely, as the user is exposing its IP address\r
1680 to the peers. A probable exception is the case of the user being hidden\r
1681 behind a public NAT or proxy.\r
1682 \r
1683 \r
1684 .ti 0\r
1685 8. Extensibility\r
1686 \r
1687 .ti 0\r
1688 8.1. 32 bit vs 64 bit\r
1689 \r
1690 .nf\r
1691 While in principle the protocol supports bigger (>1TB) files, all the\r
1692 mentioned counters are 32-bit. It is an optimization, as using\r
1693 \%64-bit numbers on-wire may cost ~2% practical overhead. The 64-bit\r
1694 version of every message has typeid of 64+t, e.g. typeid 68 for\r
1695 \%64-bit hash message:\r
1696 .nf\r
1697    44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567\r
1698 \r
1699 .ti 0\r
1700 8.2. IPv6\r
1701 \r
1702 .fi\r
1703 IPv6 versions of PEX messages use the same 64+t shift as just mentioned.\r
1704 \r
1705 \r
1706 .ti 0\r
1707 8.3. Congestion Control Algorithms\r
1708 \r
1709 .fi\r
1710 Congestion control algorithm is left to the implementation and may even vary\r
1711 from peer to peer. Congestion control is entirely implemented by the sending\r
1712 peer, the receiver only provides clues, such as hints, acknowledgments and\r
1713 timestamps. In general, it is expected that servers would use TCP-like\r
1714 congestion control schemes such as classic AIMD or CUBIC [CUBIC]. End-user\r
1715 peers are expected to use weaker-than-TCP (least than best effort)\r
1716 congestion control, such as [LEDBAT] to minimize seeding counter-incentives.\r
1717 \r
1718 \r
1719 .ti 0\r
1720 8.4. Piece Picking Algorithms\r
1721 \r
1722 Piece picking entirely depends on the receiving peer. The sender peer is\r
1723 made aware of preferred pieces by the means of HINT messages. In some\r
1724 scenarios it may be beneficial to allow the sender to ignore those hints and\r
1725 send unrequested data.\r
1726 \r
1727 \r
1728 .ti 0\r
1729 8.5. Reciprocity Algorithms\r
1730 \r
1731 Reciprocity algorithms are the sole responsibility of the sender peer.\r
1732 Reciprocal intentions of the sender are not manifested by separate messages\r
1733 (as BitTorrent's CHOKE/UNCHOKE), as it does not guarantee anything anyway\r
1734 (the "snubbing" syndrome).\r
1735 \r
1736 \r
1737 .ti 0\r
1738 8.6. Different crypto/hashing schemes\r
1739 \r
1740 .fi\r
1741 Once a flavor of swift will need to use a different crypto scheme\r
1742 (e.g., SHA-256), a message should be allocated for that. As the root\r
1743 hash is supplied in the handshake message, the crypto scheme in use\r
1744 will be known from the very beginning. As the root hash is the\r
1745 content's identifier, different schemes of crypto cannot be mixed\r
1746 in the same swarm; different swarms may distribute the same content\r
1747 using different crypto.\r
1748 \r
1749 \r
1750 .ti 0\r
1751 9. Rationale\r
1752 \r
1753 Historically, the Internet was based on end-to-end unicast\r
1754 and, considering the failure of multicast, was addressed by\r
1755 different technologies, which ultimately boiled down to maintaining\r
1756 and coordinating distributed replicas. On one hand, downloading\r
1757 from a nearby well-provisioned replica is somewhat faster and/or\r
1758 cheaper; on the other hand, it requires to coordinate multiple\r
1759 parties (the data source, mirrors/CDN sites/peers, consumers). As\r
1760 the Internet progresses to richer and richer content, the overhead\r
1761 of peer/replica coordination becomes dwarfed by the mass of the\r
1762 download itself. Thus, the niche for multiparty transfers expands.\r
1763 Still, current, relevant technologies are tightly coupled to a\r
1764 single use case or even infrastructure of a particular corporation.\r
1765 The mission of our project is to create a generic content-centric\r
1766 multiparty transport protocol to allow seamless, effortless data\r
1767 dissemination on the Net.\r
1768 \r
1769                     TABLE 1. Use cases.\r
1770 \r
1771 .br\r
1772       | mirror-based   peer-assisted        peer-to-peer\r
1773 .br\r
1774 ------+----------------------------------------------------\r
1775 .br\r
1776 data  | SunSITE        CacheLogic VelociX   BitTorrent\r
1777 .br\r
1778 VoD   | YouTube        Azureus(+seedboxes)  SwarmPlayer\r
1779 .br\r
1780 live  | Akamai Str.    Octoshape, Joost     PPlive\r
1781 .br\r
1782 \r
1783 .fi\r
1784 The protocol must be designed for maximum genericity, thus \r
1785 focusing on the very core of the mission, contain no magic\r
1786  constants and no hardwired policies. Effectively, it is a \r
1787 set of messages allowing to securely retrieve data from\r
1788 whatever source available, in parallel. Ideally, the protocol must\r
1789 be able to run over IP as an independent transport protocol.\r
1790 Practically, it must run over UDP and TCP.\r
1791 \r
1792 \r
1793 .ti 0\r
1794 9.1.  Design Goals\r
1795 \r
1796 .fi\r
1797 The technical focus of the swift protocol is to find the \r
1798 simplest solution involving the minimum set of primitives, still \r
1799 being sufficient to implement all the targeted usecases (see \r
1800 Table 1), suitable for use in general-purpose software and hardware\r
1801 (i.e. a web browser or a set-top box). The five design goals for \r
1802 the protocol are:\r
1803 \r
1804 .nf\r
1805 1. Embeddable kernel-ready protocol.\r
1806 2. Embrace real-time streaming, in- and out-of-order download.\r
1807 3. Have short warm-up times.\r
1808 4. Traverse NATs transparently.\r
1809 5. Be extensible, allow for multitude of implementation over\r
1810    diverse mediums, allow for drop-in pluggability.\r
1811 \r
1812 The objectives are referenced as (1)-(5).\r
1813 \r
1814 .fi\r
1815 The goal of embedding (1) means that the protocol must be ready \r
1816 to function as a regular transport protocol inside a set-top box, \r
1817 mobile device, a browser and/or in the kernel space. Thus, the \r
1818 protocol must have light footprint, preferably less than TCP, \r
1819 in spite of the necessity to support numerous ongoing connections \r
1820 as well as to constantly probe the network for new possibilities. \r
1821 The practical overhead for TCP is estimated at 10KB per connection \r
1822 [HTTP1MLN]. We aim at <1KB per peer connected. Also, the amount of\r
1823  code necessary to make a basic implementation must be limited to \r
1824 10KLoC of C. Otherwise, besides the resource considerations, \r
1825 maintaining and auditing the code might become prohibitively expensive.\r
1826 \r
1827 The support for all three basic usecases of real-time streaming,\r
1828  \%in-order download and out-of-order download (2) is necessary \r
1829 for the manifested goal of THE multiparty transport protocol as no\r
1830  single usecase dominates over the others.\r
1831 \r
1832 The objective of short warm-up times (3) is the matter of end-user\r
1833  experience; the playback must start as soon as possible. Thus any \r
1834 unnecessary initialization roundtrips and warm-up cycles must be \r
1835 eliminated from the transport layer.\r
1836 \r
1837 .fi\r
1838 Transparent NAT traversal (4) is absolutely necessary as at least\r
1839 60% of today's users are hidden behind NATs. NATs severely affect\r
1840 connection patterns in P2P networks thus impacting performance \r
1841 and fairness [MOLNAT,LUCNAT].\r
1842 \r
1843 The protocol must define a common message set (5) to be used by \r
1844 implementations; it must not hardwire any magic constants, \r
1845 algorithms or schemes beyond that. For example, an implementation\r
1846  is free to use its own congestion control, connection rotation \r
1847 or reciprocity algorithms. Still, the protocol must enable such \r
1848 algorithms by supplying sufficient information. For example, \r
1849 trackerless peer discovery needs peer exchange messages, \r
1850 scavenger congestion control may need timestamped acknowledgments,\r
1851  etc.\r
1852 \r
1853 \r
1854 .ti 0\r
1855 9.2.  Not TCP\r
1856 \r
1857 .fi\r
1858 To large extent, swift's design is defined by the cornerstone decision\r
1859 to get rid of TCP and not to reinvent any TCP-like transports on\r
1860 top of UDP or otherwise. The requirements (1), (4), (5) make TCP a\r
1861 bad choice due to its high per-connection footprint, complex and\r
1862 less reliable NAT traversal and fixed predefined congestion control\r
1863 algorithms. Besides that, an important consideration is that no\r
1864 block of TCP functionality turns out to be useful for the general\r
1865 case of swarming downloads. Namely,\r
1866 .nf\r
1867   1. in-order delivery is less useful as peer-to-peer protocols\r
1868   often employ out-of-order delivery themselves and in either case\r
1869   \%out-of-order data can still be stored;\r
1870   2. reliable delivery/retransmissions are not useful because\r
1871   the same data might be requested from different sources; as\r
1872   in-order delivery is not required, packet losses might be\r
1873   patched up lazily, without stopping the flow of data;\r
1874   3. flow control is not necessary as the receiver is much less\r
1875   likely to be saturated with the data and even if so, that\r
1876   situation is perfectly detected by the congestion control;\r
1877   4. TCP congestion control is less useful as custom congestion\r
1878   control is often needed [LEDBAT].\r
1879 In general, TCP is built and optimized for a different usecase than\r
1880 we have with swarming downloads. The abstraction of a "data pipe"\r
1881 orderly delivering some stream of bytes from one peer to another\r
1882 turned out to be irrelevant. In even more general terms, TCP\r
1883 supports the abstraction of pairwise _conversations_, while we need\r
1884 a content-centric protocol built around the abstraction of a cloud\r
1885 of participants disseminating the same _data_ in any way and order\r
1886 that is convenient to them.\r
1887 \r
1888 .fi\r
1889 Thus, the choice is to design a protocol that runs on top of \r
1890 unreliable datagrams. Instead of reimplementing TCP, we create a \r
1891 \%datagram-based protocol, completely dropping the sequential data \r
1892 stream abstraction. Removing unnecessary features of TCP makes \r
1893 it easier both to implement the protocol and to verify it; \r
1894 numerous TCP vulnerabilities were caused by complexity of the \r
1895 protocol's state machine. Still, we reserve the possibility \r
1896 to run swift on top of TCP or HTTP. \r
1897 \r
1898 Pursuing the maxim of making things as simple as possible but not \r
1899 simpler, we fit the protocol into the constraints of the transport\r
1900 layer by dropping all the transmission's technical metadata except \r
1901 for the content's root hash (compare that to metadata files used \r
1902 in BitTorrent). Elimination of technical metadata is achieved \r
1903 through the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively\r
1904 single-file transfers and other techniques. As a result, a transfer\r
1905 is identified and bootstrapped by its root hash only.\r
1906 \r
1907 .fi\r
1908 To avoid the usual layering of positive/negative acknowledgment \r
1909 mechanisms we introduce a scale-invariant acknowledgment system\r
1910  (see Sec 4.4). The system allows for aggregation and variable \r
1911 level of detail in requesting, announcing and acknowledging data,\r
1912  serves \%in-order and out-of-order retrieval with equal ease.\r
1913 Besides the protocol's footprint, we also aim at lowering the \r
1914 size of a minimal useful interaction. Once a single datagram is \r
1915 received, it must be checked for data integrity, and then either \r
1916 dropped or accepted, consumed and relayed.\r
1917 \r
1918 \r
1919 \r
1920 .ti 0\r
1921 9.3.  Generic Acknowledgments\r
1922 \r
1923 .nf\r
1924 Generic acknowledgments came out of the need to simplify the\r
1925 data addressing/requesting/acknowledging mechanics, which tends\r
1926 to become overly complex and multilayered with the conventional\r
1927 approach. Take the BitTorrent+TCP tandem for example:\r
1928 \r
1929 1. The basic data unit is a byte of content in a file.\r
1930 2. BitTorrent's highest-level unit is a "torrent", physically a\r
1931 byte range resulting from concatenation of content files.\r
1932 3. A torrent is divided into "pieces", typically about a thousand\r
1933 of them. Pieces are used to communicate progress to other\r
1934 peers. Pieces are also basic data integrity units, as the torrent's\r
1935 metadata includes a SHA1 hash for every piece.\r
1936 4. The actual data transfers are requested and made in 16KByte\r
1937 units, named "blocks" or chunks.\r
1938 5. Still, one layer lower, TCP also operates with bytes and byte\r
1939 offsets which are totally different from the torrent's bytes and\r
1940 offsets, as TCP considers cumulative byte offsets for all content\r
1941 sent by a connection, be it data, metadata or commands.\r
1942 6. Finally, another layer lower, IP transfers independent datagrams\r
1943 (typically around 1.5 kilobyte), which TCP then reassembles into\r
1944 continuous streams.\r
1945 \r
1946 Obviously, such addressing schemes need lots of mappings; from\r
1947 piece number and block to file(s) and offset(s) to TCP sequence\r
1948 numbers to the actual packets and the other way around. Lots of\r
1949 complexity is introduced by mismatch of bounds: packet bounds are\r
1950 different from file, block or hash/piece bounds. The picture is\r
1951 typical for a codebase which was historically layered.\r
1952 \r
1953 To simplify this aspect, we employ a generic content addressing\r
1954 scheme based on binary intervals, or "bins" for short.\r
1955 \r
1956 \r
1957 \r
1958 .ti 0\r
1959 Acknowledgements\r
1960 \r
1961 Arno Bakker and Victor Grishchenko are partially supported by the\r
1962 P2P-Next project (http://www.p2p-next.org/), a research project \r
1963 supported by the European Community under its 7th Framework Programme\r
1964 (grant agreement no. 216217).  The views and conclusions contained \r
1965 herein are those of the authors and should not be interpreted as \r
1966 necessarily representing the official policies or endorsements, \r
1967 either expressed or implied, of the P2P-Next project or the European \r
1968 Commission.\r
1969 \r
1970 .fi\r
1971 The swift protocol was designed by Victor Grishchenko at Technische Universiteit Delft.\r
1972 The authors would like to thank the following people for their \r
1973 contributions to this draft: Mihai Capota, Raul Jiminez, Flutra Osmani, \r
1974 Riccardo Petrocco, Johan Pouwelse, and Raynor Vliegendhart.\r
1975 \r
1976 \r
1977 .ti 0\r
1978 References\r
1979 \r
1980 .nf\r
1981 .in 0\r
1982 [RFC2119] Key words for use in RFCs to Indicate Requirement Levels\r
1983 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with\r
1984     Mochiweb", Part 3. http://www.metabrew.com/article/\r
1985     \%a-million-user-comet-application-with-mochiweb-part-3\r
1986 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:\r
1987     \%"Free-riding, Fairness, and Firewalls in P2P File-Sharing"\r
1988     Proc. Eighth International Conference on Peer-to-Peer Computing \r
1989     (P2P '08), Aachen, Germany, 8-11 Sept. 2008, pp. 301 - 310.\r
1990 [LUCNAT] L. D'Acunto and M. Meulpolder and R. Rahman and J.A. \r
1991     Pouwelse and H.J. Sips. "Modeling and Analyzing the Effects\r
1992     of Firewalls and NATs in P2P Swarming Systems". In Proc. of \r
1993     IEEE IPDPS (HotP2P), Atlanta, USA, April 23, 2010.\r
1994 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps\r
1995     and binary trees". Technical Report PDS-2011-005, Parallel and \r
1996     Distributed Systems Group, Fac. of Electrical Engineering, \r
1997     Mathematics, and Computer Science, Delft University of Technology, \r
1998     The Netherlands, April 2009.\r
1999 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication\r
2000     Across Network Address Translators",\r
2001     http://www.brynosaurus.com/pub/net/p2pnat/\r
2002 [FIPS180-2]\r
2003     Federal Information Processing Standards Publication 180-2:\r
2004     "Secure Hash Standard" 2002 August 1.\r
2005 [MERKLE] Merkle, R. "Secrecy, Authentication, and Public Key Systems", \r
2006     Ph.D. thesis, Dept. of Electrical Engineering, Stanford University, \r
2007     CA, USA, 1979. pp 40-45.\r
2008 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BitTorrent \r
2009     Enhancement Proposal 30, Mar 2009.\r
2010     http://bittorrent.org/beps/bep_0030.html\r
2011 [CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly\r
2012     \%High-Speed TCP Variant", Proc. Third International Workshop\r
2013     on Protocols for Fast Long-Distance Networks (PFLDnet), Lyon,\r
2014     France, Feb 2005.\r
2015 [LEDBAT] S. Shalunov et al. "Low Extra Delay Background Transport \r
2016     (LEDBAT)", IETF Internet-Draft draft-ietf-ledbat-congestion\r
2017     (work in progress), Oct 2011.\r
2018     \%http://datatracker.ietf.org/doc/draft-ietf-ledbat-congestion/ \r
2019 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", \r
2020     Proc. 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, \r
2021     CA, USA, Jun 2003.\r
2022 [BITTORRENT] B. Cohen, "The BitTorrent Protocol Specification",\r
2023      February 2008, \%http://www.bittorrent.org/beps/bep_0003.html\r
2024 [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.\r
2025     Jacobson, "RTP: A Transport Protocol for Real-Time\r
2026     Applications", STD 64, RFC 3550, July 2003.\r
2027 [RFC3711] M. Baugher, D. McGrew, M. Naslund, E. Carrara, K. Norrman,\r
2028     "The Secure Real-time Transport Protocol (SRTP), RFC 3711, March \r
2029     2004.\r
2030 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,\r
2031    "Session Traversal Utilities for NAT (STUN)", RFC 5389, October 2008.\r
2032 [RFC2616] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, \r
2033    P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1",\r
2034    RFC2616, June 1999.\r
2035 [I-D.ietf-ppsp-reqs] Zong, N., Zhang, Y., Pascual, V., Williams, C.,\r
2036     and L. Xiao, "P2P  Streaming Protocol (PPSP) Requirements", \r
2037     draft-ietf-ppsp-reqs-05 (work in progress), October 2011.\r
2038 [PPSPCHART] M. Stiemerling et al. "Peer to Peer Streaming Protocol (ppsp)\r
2039     Description of Working Group" \r
2040    \%http://datatracker.ietf.org/wg/ppsp/charter/\r
2041 [PERMIDS] A. Bakker et al. "Next-Share Platform M8--Specification\r
2042     Part", App. C. P2P-Next project deliverable D4.0.1 (revised), \r
2043     June 2009.\r
2044     \%http://www.p2p-next.org/download.php?id=E7750C654035D8C2E04D836243E6526E\r
2045 [PUPPETCAST] A. Bakker and M. van Steen. "PuppetCast: A Secure Peer\r
2046     Sampling Protocol". Proceedings 4th Annual European Conference on\r
2047     Computer Network Defense (EC2ND'08), pp. 3-10, Dublin, Ireland,\r
2048     11-12 December 2008.\r
2049 [CLOSED] N. Borch, K. Michell, I. Arntzen, and D. Gabrijelcic: "Access\r
2050     control to BitTorrent swarms using closed swarms". In Proceedings\r
2051     of the 2010 ACM workshop on Advanced video streaming techniques\r
2052     for peer-to-peer networks and social networking (AVSTP2P '10).\r
2053     ACM, New York, NY, USA, 25-30. \r
2054     \%http://doi.acm.org/10.1145/1877891.1877898 \r
2055 [ECLIPSE] E. Sit and R. Morris, "Security Considerations for\r
2056     Peer-to-Peer Distributed Hash Tables", IPTPS '01: Revised Papers\r
2057     from the First International Workshop on Peer-to-Peer Systems, pp.\r
2058     261-269, Springer-Verlag, London, UK, 2002.\r
2059 [SECDHTS] G. Urdaneta, G. Pierre, M. van Steen, "A Survey of DHT\r
2060    Security Techniques", ACM Computing Surveys, vol. 43(2), June 2011.\r
2061 [SWIFTIMPL] V. Grishchenko, et al. "Swift M40 reference implementation",\r
2062    \%http://swarmplayer.p2p-next.org/download/Next-Share-M40.tar.bz2\r
2063    (subdirectory Next-Share/TUD/swift-trial-r2242/), July 2011.\r
2064 [CCNWIKI] http://en.wikipedia.org/wiki/Content-centric_networking\r
2065 [HAC01] A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. "Handbook of\r
2066    Applied Cryptography", CRC Press, October 1996 (Fifth Printing, \r
2067    August 2001). \r
2068 [JIM11] R. Jimenez, F. Osmani, and B. Knutsson. "Sub-Second Lookups on \r
2069    a Large-Scale Kademlia-Based Overlay". 11th IEEE International \r
2070    Conference on Peer-to-Peer Computing 2011, Kyoto, Japan, Aug. 2011 \r
2071 \r
2072 \r
2073 .ti 0\r
2074 Authors' addresses\r
2075 \r
2076 .in 3\r
2077 A. Bakker\r
2078 Technische Universiteit Delft \r
2079 Department EWI/ST/PDS\r
2080 Room HB 9.160\r
2081 Mekelweg 4\r
2082 2628CD Delft\r
2083 The Netherlands\r
2084 \r
2085 Email: arno@cs.vu.nl\r
2086 \r
2087 .ce 0\r