src/libswift/doc/draft-ietf-ppsp-peer-protocol-00.txt

   1
   2
   3
   4
   5 PPSP                                                           A. Bakker
   6 Internet-Draft                                                  TU Delft
   7
   8 Expires: June 21, 2012                                 December 19, 2011
   9
  10         Peer-to-Peer Streaming Protocol (PPSP)
  11             <draft-ietf-ppsp-peer-protocol-00.txt>
  12
  13 Abstract
  14
  15    The Generic Multiparty Protocol (swift) is a peer-to-peer based
  16    transport protocol for content dissemination. It can be used for
  17    streaming on-demand and live video content, as well as conventional
  18    downloading. In swift, the clients consuming the content participate
  19    in the dissemination by forwarding the content to other clients via a
  20    mesh-like structure.  It is a generic protocol which can run directly
  21    on top of UDP, TCP, HTTP or as a RTP profile. Features of swift are
  22    short time-till-playback and extensibility. Hence, it can use
  23    different mechanisms to prevent freeriding, and work with different
  24    peer discovery schemes (centralized trackers or Distributed Hash
  25    Tables). Depending on the underlying transport protocol, swift can
  26    also use different congestion control algorithms, such as LEDBAT, and
  27    offer transparent NAT traversal. Finally, swift maintains only a
  28    small amount of state per peer and detects malicious modification of
  29    content. This documents describes swift and how it satisfies the
  30    requirements for the IETF Peer-to-Peer Streaming Protocol (PPSP)
  31    Working Group's peer protocol.
  32
  33
  34 Status of this memo
  35
  36    This Internet-Draft is submitted to IETF in full conformance with the
  37    provisions of BCP 78 and BCP 79.
  38
  39    Internet-Drafts are working documents of the Internet Engineering
  40    Task Force (IETF), its areas, and its working groups.  Note that
  41    other groups may also distribute working documents as Internet-
  42    Drafts.
  43
  44    Internet-Drafts are draft documents valid for a maximum of six months
  45    and may be updated, replaced, or obsoleted by other documents at any
  46    time.  It is inappropriate to use Internet-Drafts as reference
  47    material or to cite them other than as "work in progress."
  48
  49    The list of current Internet-Drafts can be accessed at
  50    http://www.ietf.org/ietf/1id-abstracts.txt.
  51
  52    The list of Internet-Draft Shadow Directories can be accessed at
  53
  54
  55
  56 Grishchenko and Bakker   Expires June 21, 2012                  [Page 1]
  57 \f
  58 Internet-Draft                   swift                 December 19, 2011
  59
  60
  61    http://www.ietf.org/shadow.html.
  62
  63
  64    Copyright (c) 2011 IETF Trust and the persons identified as the
  65    document authors.  All rights reserved.
  66
  67    This document is subject to BCP 78 and the IETF Trust's Legal
  68    Provisions Relating to IETF Documents
  69    (http://trustee.ietf.org/license-info) in effect on the date of
  70    publication of this document.  Please review these documents
  71    carefully, as they describe your rights and restrictions with respect
  72    to this document.  Code Components extracted from this document must
  73    include Simplified BSD License text as described in Section 4.e of
  74    the Trust Legal Provisions and are provided without warranty as
  75    described in the Simplified BSD License.
  76
  77 Table of Contents
  78
  79    1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
  80      1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . .  3
  81      1.2. Conventions Used in This Document . . . . . . . . . . . . .  4
  82      1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . .  5
  83    2. Overall Operation . . . . . . . . . . . . . . . . . . . . . . .  6
  84      2.1. Joining a Swarm . . . . . . . . . . . . . . . . . . . . . .  6
  85      2.2. Exchanging Chunks . . . . . . . . . . . . . . . . . . . . .  6
  86      2.3. Leaving a Swarm . . . . . . . . . . . . . . . . . . . . . .  7
  87    3. Messages  . . . . . . . . . . . . . . . . . . . . . . . . . . .  7
  88      3.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . .  8
  89      3.3. HAVE  . . . . . . . . . . . . . . . . . . . . . . . . . . .  8
  90        3.3.1. Bin Numbers . . . . . . . . . . . . . . . . . . . . . .  8
  91        3.3.2. HAVE Message  . . . . . . . . . . . . . . . . . . . . .  9
  92      3.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . .  9
  93      3.5. DATA and HASH . . . . . . . . . . . . . . . . . . . . . . . 10
  94        3.5.1. Merkle Hash Tree  . . . . . . . . . . . . . . . . . . . 10
  95        3.5.2. Content Integrity Verification  . . . . . . . . . . . . 11
  96        3.5.3. The Atomic Datagram Principle . . . . . . . . . . . . . 11
  97        3.5.4. DATA and HASH Messages  . . . . . . . . . . . . . . . . 12
  98      3.6. HINT  . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
  99      3.7. Peer Address Exchange and NAT Hole Punching . . . . . . . . 13
 100      3.8. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . . . 14
 101      3.9. VERSION . . . . . . . . . . . . . . . . . . . . . . . . . . 14
 102      3.10. Conveying Peer Capabilities  . . . . . . . . . . . . . . . 14
 103      3.11. Directory Lists  . . . . . . . . . . . . . . . . . . . . . 14
 104    4. Automatic Detection of Content Size . . . . . . . . . . . . . . 14
 105      4.1. Peak Hashes . . . . . . . . . . . . . . . . . . . . . . . . 15
 106      4.2. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 16
 107    5. Live streaming  . . . . . . . . . . . . . . . . . . . . . . . . 17
 108    6. Transport Protocols and Encapsulation . . . . . . . . . . . . . 17
 109
 110
 111
 112 Grishchenko and Bakker   Expires June 21, 2012                  [Page 2]
 113 \f
 114 Internet-Draft                   swift                 December 19, 2011
 115
 116
 117      6.1. UDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
 118        6.1.1. Chunk Size  . . . . . . . . . . . . . . . . . . . . . . 17
 119        6.1.2. Datagrams and Messages  . . . . . . . . . . . . . . . . 18
 120        6.1.3. Channels  . . . . . . . . . . . . . . . . . . . . . . . 18
 121        6.1.4. HANDSHAKE and VERSION . . . . . . . . . . . . . . . . . 19
 122        6.1.5. HAVE  . . . . . . . . . . . . . . . . . . . . . . . . . 20
 123        6.1.6. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . 20
 124        6.1.7. HASH  . . . . . . . . . . . . . . . . . . . . . . . . . 20
 125        6.1.8. DATA  . . . . . . . . . . . . . . . . . . . . . . . . . 20
 126        6.1.9. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . 20
 127        6.1.10. Flow and Congestion Control  . . . . . . . . . . . . . 21
 128      6.2. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
 129      6.3. RTP Profile for PPSP  . . . . . . . . . . . . . . . . . . . 21
 130        6.3.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 22
 131        6.3.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 24
 132      6.4. HTTP (as PPSP)  . . . . . . . . . . . . . . . . . . . . . . 27
 133        6.4.1. Design  . . . . . . . . . . . . . . . . . . . . . . . . 27
 134        6.4.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 29
 135    7. Security Considerations . . . . . . . . . . . . . . . . . . . . 32
 136    8. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 32
 137      8.1. 32 bit vs 64 bit  . . . . . . . . . . . . . . . . . . . . . 32
 138      8.2. IPv6  . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
 139      8.3. Congestion Control Algorithms . . . . . . . . . . . . . . . 32
 140      8.4. Piece Picking Algorithms  . . . . . . . . . . . . . . . . . 33
 141      8.5. Reciprocity Algorithms  . . . . . . . . . . . . . . . . . . 33
 142      8.6. Different crypto/hashing schemes  . . . . . . . . . . . . . 33
 143    9. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
 144      9.1.  Design Goals . . . . . . . . . . . . . . . . . . . . . . . 34
 145      9.2.  Not TCP  . . . . . . . . . . . . . . . . . . . . . . . . . 35
 146      9.3.  Generic Acknowledgments  . . . . . . . . . . . . . . . . . 36
 147    Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 37
 148    References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
 149    Authors' addresses . . . . . . . . . . . . . . . . . . . . . . . . 39
 150
 151
 152
 153
 154 1.  Introduction
 155
 156 1.1. Purpose
 157
 158    This document describes the Generic Multiparty Protocol (swift),
 159    designed from the ground up for the task of disseminating the same
 160    content to a group of interested parties. Swift supports streaming
 161    on-demand and live video content, as well as conventional
 162    downloading, thus covering today's three major use cases for content
 163    distribution. To fulfil this task, clients consuming the content are
 164    put on equal footing with the servers initially providing the content
 165
 166
 167
 168 Grishchenko and Bakker   Expires June 21, 2012                  [Page 3]
 169 \f
 170 Internet-Draft                   swift                 December 19, 2011
 171
 172
 173    to create a peer-to-peer system where everyone can provide data. Each
 174    peer connects to a random set of other peers resulting in a mesh-like
 175    structure.
 176
 177    Swift uses a simple method of naming content based on self-
 178    certification. In particular, content in swift is identified by a
 179    single cryptographic hash that is the root hash in a Merkle hash tree
 180    calculated recursively from the content [ABMRKL].  This self-
 181    certifying hash tree allows every peer to directly detect when a
 182    malicious peer tries to distribute fake content. It also ensures only
 183    a small amount of information is needed to start a download (just the
 184    root hash and some peer addresses).
 185
 186    Swift uses a novel method of addressing chunks of content called "bin
 187    numbers". Bin numbers allow the addressing of a binary interval of
 188    data using a single integer. This reduces the amount of state that
 189    needs to be recorded per peer and the space needed to denote
 190    intervals on the wire, making the protocol light-weight. In general,
 191    this numbering system allows swift to work with simpler data
 192    structures, e.g. to use arrays instead of binary trees, thus reducing
 193    complexity.
 194
 195    Swift is a generic protocol which can run directly on top of UDP,
 196    TCP, HTTP, or as a layer below RTP, similar to SRTP [RFC3711]. As
 197    such, swift defines a common set of messages that make up the
 198    protocol, which can have different representations on the wire
 199    depending on the lower-level protocol used. When the lower-level
 200    transport is UDP, swift can also use different congestion control
 201    algorithms and facilitate NAT traversal.
 202
 203    In addition, swift is extensible in the mechanisms it uses to promote
 204    client contribution and prevent freeriding, that is, how to deal with
 205    peers that only download content but never upload to others.
 206    Furthermore, it can work with different peer discovery schemes, such
 207    as centralized trackers or fast Distributed Hash Tables [JIM11].
 208
 209    This documents describes not only the swift protocol but also how it
 210    satisfies the requirements for the IETF Peer-to-Peer Streaming
 211    Protocol (PPSP) Working Group's peer protocol [PPSPCHART,I-D.ietf-
 212    ppsp-reqs]. A reference implementation of swift over UDP is available
 213    [SWIFTIMPL].
 214
 215
 216 1.2. Conventions Used in This Document
 217
 218    The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
 219    "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
 220    document are to be interpreted as described in [RFC2119].
 221
 222
 223
 224 Grishchenko and Bakker   Expires June 21, 2012                  [Page 4]
 225 \f
 226 Internet-Draft                   swift                 December 19, 2011
 227
 228
 229 1.3. Terminology
 230
 231    message
 232         The basic unit of swift communication. A message will have
 233         different representations on the wire depending on the transport
 234         protocol used. Messages are typically multiplexed into a
 235         datagram for transmission.
 236
 237    datagram
 238         A sequence of messages that is offered as a unit to the
 239         underlying transport protocol (UDP, etc.). The datagram is
 240         swift's Protocol Data Unit (PDU).
 241
 242    content
 243         Either a live transmission, a pre-recorded multimedia asset, or
 244         a file.
 245
 246    bin
 247         A number denoting a specific binary interval of the content
 248         (i.e., one or more consecutive chunks).
 249
 250    chunk
 251         The basic unit in which the content is divided. E.g. a block of
 252         N kilobyte.
 253
 254    hash
 255         The result of applying a cryptographic hash function, more
 256         specifically a modification detection code (MDC) [HAC01], such
 257         as SHA1 [FIPS180-2], to a piece of data.
 258
 259    root hash
 260         The root in a Merkle hash tree calculated recursively from the
 261         content.
 262
 263    swarm
 264         A group of peers participating in the distribution of the same
 265         content.
 266
 267    swarm ID
 268         Unique identifier for a swarm of peers, in swift the root hash
 269         of the content (video-on-demand,download) or a public key (live
 270         streaming).
 271
 272    tracker
 273         An entity that records the addresses of peers participating in a
 274         swarm, usually for a set of swarms, and makes this membership
 275         information available to other peers on request.
 276
 277
 278
 279
 280 Grishchenko and Bakker   Expires June 21, 2012                  [Page 5]
 281 \f
 282 Internet-Draft                   swift                 December 19, 2011
 283
 284
 285    choking
 286         When a peer A is choking peer B it means that A is currently not
 287         willing to accept requests for content from B.
 288
 289
 290 2. Overall Operation
 291
 292    The basic unit of communication in swift is the message. Multiple
 293    messages are multiplexed into a single datagram for transmission. A
 294    datagram (and hence the messages it contains) will have different
 295    representations on the wire depending on the transport protocol used
 296    (see Sec. 6).
 297
 298
 299 2.1. Joining a Swarm
 300
 301    Consider a peer A that wants to download a certain content asset. To
 302    commence a swift download, peer A must have the swarm ID of the
 303    content and a list of one or more tracker contact points (e.g.
 304    host+port). The list of trackers is optional in the presence of a
 305    decentralized tracking mechanism. The swarm ID consists of the swift
 306    root hash of the content (video-on-demand, downloading) or a public
 307    key (live streaming).
 308
 309    Peer A now registers with the tracker following e.g. the PPSP tracker
 310    protocol [I-D.ietf.ppsp-reqs] and receives the IP address and port of
 311    peers already in the swarm, say B, C, and D. Peer A now sends a
 312    datagram containing a HANDSHAKE message to B, C, and D. This message
 313    serves as an end-to-end check that the peers are actually in the
 314    correct swarm, and contains the root hash of the swarm.  Peer B and C
 315    respond with datagrams containing a HANDSHAKE message and one or more
 316    HAVE messages. A HAVE message conveys (part of) the chunk
 317    availability of a peer and thus contains a bin number that denotes
 318    what chunks of the content peer B, resp. C have. Peer D sends a
 319    datagram with just a HANDSHAKE and omits HAVE messages as a way of
 320    choking A.
 321
 322 2.2. Exchanging Chunks
 323
 324    In response to B and C, A sends new datagrams to B and C containing
 325    HINT messages. A HINT or request message indicates the chunks that a
 326    peer wants to download, and contains a bin number. The HINT messages
 327    to B and C refer to disjunct sets of chunks.  B and C respond with
 328    datagrams containing HASH, HAVE and DATA messages.  The HASH messages
 329    contains all cryptographic hashes that peer A needs to verify the
 330    integrity of the content chunk sent in the DATA message, using the
 331    content's root hash as trusted anchor, see Sec. 3.5. Using these
 332    hashes peer A verifies that the chunks received from B and C are
 333
 334
 335
 336 Grishchenko and Bakker   Expires June 21, 2012                  [Page 6]
 337 \f
 338 Internet-Draft                   swift                 December 19, 2011
 339
 340
 341    correct. It also updates the chunk availability of B and C using the
 342    information in the received HAVE messages.
 343
 344    After processing, A sends a datagram containing HAVE messages for the
 345    chunks it just received to all its peers. In the datagram to B and C
 346    it includes an ACK message acknowledging the receipt of the chunks,
 347    and adds HINT messages for new chunks. ACK messages are not used when
 348    a reliable transport protocol is used. When e.g. C finds that A
 349    obtained a chunk (from B) that C did not yet have, C's next datagram
 350    includes a HINT for that chunk.
 351
 352    Peer D does not send HAVE messages to A when it downloads chunks from
 353    other peers, until D decides to unchoke peer A. In the case, it sends
 354    a datagram with HAVE messages to inform A of its current
 355    availability. If B or C decide to choke A they stop sending HAVE and
 356    DATA messages and A should then rerequest from other peers. They may
 357    continue to send HINT messages, or periodic KEEPALIVE messages such
 358    that A keeps sending them HAVE messages.
 359
 360    Once peer A has received all content (video-on-demand use case) it
 361    stops sending messages to all other peers that have all content
 362    (a.k.a. seeders). Peer A MAY also contact the tracker or another
 363    source again to obtain more peer addresses.
 364
 365
 366 2.3. Leaving a Swarm
 367
 368    Depending on the transport protocol used, peers should either use
 369    explicit leave messages or implicitly leave a swarm by stopping to
 370    respond to messages. Peers that learn about the departure should
 371    remove these peers from the current peer list. The implicit-leave
 372    mechanism works for both graceful and ungraceful leaves (i.e., peer
 373    crashes or disconnects). When leaving gracefully, a peer should
 374    deregister from the tracker following the (PPSP) tracker protocol.
 375
 376
 377 3. Messages
 378
 379    In general, no error codes or responses are used in the protocol;
 380    absence of any response indicates an error. Invalid messages are
 381    discarded.
 382
 383    For the sake of simplicity, one swarm of peers always deals with one
 384    content asset (e.g. file) only. Retrieval of large collections of
 385    files is done by retrieving a directory list file and then
 386    recursively retrieving files, which might also turn to be directory
 387    lists, as described in Sec. 3.11.
 388
 389
 390
 391
 392 Grishchenko and Bakker   Expires June 21, 2012                  [Page 7]
 393 \f
 394 Internet-Draft                   swift                 December 19, 2011
 395
 396
 397 3.1. HANDSHAKE
 398
 399    As an end-to-end check that the peers are actually in the correct
 400    swarm, the initiating peer and the addressed peer SHOULD send a
 401    HANDSHAKE message in the first datagrams they exchange. The only
 402    payload of the HANDSHAKE message is the root hash of the content.
 403
 404    After the handshakes are exchanged, the initiator knows that the peer
 405    really responds. Hence, the second datagram the initiator sends MAY
 406    already contain some heavy payload. To minimize the number of
 407    initialization roundtrips, implementations MAY dispense with the
 408    HANDSHAKE message. To the same end, the first two datagrams exchanged
 409    MAY also contain some minor payload, e.g. HAVE messages to indicate
 410    the current progress of a peer or a HINT (see Sec. 3.6).
 411
 412
 413 3.3. HAVE
 414
 415    The HAVE message is used to convey which chunks a peers has
 416    available, expressed in a new content addressing scheme called "bin
 417    numbers".
 418
 419 3.3.1. Bin Numbers
 420
 421    Swift employs a generic content addressing scheme based on binary
 422    intervals ("bins" in short). The smallest interval is a chunk (e.g. a
 423    N kilobyte block), the top interval is the complete 2**63 range.  A
 424    novel addition to the classical scheme are "bin numbers", a scheme of
 425    numbering binary intervals which lays them out into a vector nicely.
 426    Consider an chunk interval of width W. To derive the bin numbers of
 427    the complete interval and the subintervals, a minimal balanced binary
 428    tree is built that is at least W chunks wide at the base. The leaves
 429    from left-to-right correspond to the chunks 0..W in the interval, and
 430    have bin number I*2 where I is the index of the chunk (counting
 431    beyond W-1 to balance the tree). The higher level nodes P in the tree
 432    have bin number
 433
 434          binP = (binL + binR) / 2
 435
 436    where binL is the bin of node P's left-hand child and binR is the bin
 437    of node P's right-hand child. Given that each node in the tree
 438    represents a subinterval of the original interval, each such
 439    subinterval now is addressable by a bin number, a single integer. The
 440    bin number tree of a interval of width W=8 looks like this:
 441
 442
 443
 444
 445
 446
 447
 448 Grishchenko and Bakker   Expires June 21, 2012                  [Page 8]
 449 \f
 450 Internet-Draft                   swift                 December 19, 2011
 451
 452
 453                            7
 454                           / \
 455                         /     \
 456                       /         \
 457                     /             \
 458                   3                11
 459                  / \              / \
 460                 /   \            /   \
 461                /     \          /     \
 462               1       5        9       13
 463              / \     / \      / \      / \
 464             0   2   4   6    8   10  12   14
 465
 466    So bin 7 represents the complete interval, 3 represents the interval
 467    of chunk 0..3 and 1 represents the interval of chunks 0 and 1. The
 468    special numbers 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit)
 469    stands for an empty interval, and 0x7FFF...FFF stands for
 470    "everything".
 471
 472
 473 3.3.2. HAVE Message
 474
 475    When a receiving peer has successfully checked the integrity of a
 476    chunk or interval of chunks it MUST send a HAVE message to all peers
 477    it wants to interact with. The latter allows the HAVE message to be
 478    used as a method of choking. The HAVE message MUST contain the bin
 479    number of the biggest complete interval of all chunks the receiver
 480    has received and checked so far that fully includes the interval of
 481    chunks just received. So the bin number MUST denote at least the
 482    interval received, but the receiver is supposed to aggregate and
 483    acknowledge bigger bins, when possible.
 484
 485    As a result, every single chunk is acknowledged a logarithmic number
 486    of times. That provides some necessary redundancy of acknowledgments
 487    and sufficiently compensates for unreliable transport protocols.
 488
 489    To record which chunks a peer has in the state that an implementation
 490    keeps for each peer, an implementation MAY use the "binmap" data
 491    structure, which is a hybrid of a bitmap and a binary tree, discussed
 492    in detail in [BINMAP].
 493
 494
 495 3.4. ACK
 496
 497    When swift is run over an unreliable transport protocol, an
 498    implementation MAY choose to use ACK messages to acknowledge received
 499    data. When a receiving peer has successfully checked the integrity of
 500    a chunk or interval of chunks C it MUST send a ACK message containing
 501
 502
 503
 504 Grishchenko and Bakker   Expires June 21, 2012                  [Page 9]
 505 \f
 506 Internet-Draft                   swift                 December 19, 2011
 507
 508
 509    the bin number of its biggest, complete, interval covering C to the
 510    sending peer (see HAVE). To facilitate delay-based congestion
 511    control, an ACK message contains a timestamp.
 512
 513
 514 3.5. DATA and HASH
 515
 516    The DATA message is used to transfer chunks of content. The
 517    associated HASH message carries cryptographic hashes that are
 518    necessary for a receiver to check the the integrity of the chunk.
 519    Swift's content integrity protection is based on a Merkle hash tree
 520    and works as follows.
 521
 522 3.5.1. Merkle Hash Tree
 523
 524    Swift uses a method of naming content based on self-certification. In
 525    particular, content in swift is identified by a single cryptographic
 526    hash that is the root hash in a Merkle hash tree calculated
 527    recursively from the content [ABMRKL]. This self-certifying hash tree
 528    allows every peer to directly detect when a malicious peer tries to
 529    distribute fake content. It also ensures only a small the amount of
 530    information is needed to start a download (the root hash and some
 531    peer addresses). For live streaming public keys and dynamic trees are
 532    used, see below.
 533
 534    The Merkle hash tree of a content asset that is divided into N chunks
 535    is constructed as follows. Note the construction does not assume
 536    chunks of content to be fixed size. Given a cryptographic hash
 537    function, more specifically a modification detection code (MDC)
 538    [HAC01], such as SHA1, the hashes of all the chunks of the content
 539    are calculated. Next, a binary tree of sufficient height is created.
 540    Sufficient height means that the lowest level in the tree has enough
 541    nodes to hold all chunk hashes in the set, as before, see HAVE
 542    message. The figure below shows the tree for a content asset
 543    consisting of 7 chunks. As before with the content addressing scheme,
 544    the leaves of the tree correspond to a chunk and in this case are
 545    assigned the hash of that chunk, starting at the left-most leaf. As
 546    the base of the tree may be wider than the number of chunks, any
 547    remaining leaves in the tree are assigned a empty hash value of all
 548    zeros. Finally, the hash values of the higher levels in the tree are
 549    calculated, by concatenating the hash values of the two children
 550    (again left to right) and computing the hash of that aggregate. This
 551    process ends in a hash value for the root node, which is called the
 552    "root hash". Note the root hash only depends on the content and any
 553    modification of the content will result in a different root hash.
 554
 555
 556
 557
 558
 559
 560 Grishchenko and Bakker   Expires June 21, 2012                 [Page 10]
 561 \f
 562 Internet-Draft                   swift                 December 19, 2011
 563
 564
 565                            7 = root hash
 566                           / \
 567                         /     \
 568                       /         \
 569                     /             \
 570                   3*               11
 571                  / \              / \
 572                 /   \            /   \
 573                /     \          /     \
 574               1       5        9       13* = uncle hash
 575              / \     / \      / \      / \
 576             0   2   4   6    8   10* 12   14
 577
 578             C0  C1  C2  C3   C4  C5  C6   E
 579             =chunk index     ^^           = empty hash
 580
 581
 582 3.5.2. Content Integrity Verification
 583
 584    Assuming a peer receives the root hash of the content it wants to
 585    download from a trusted source, it can can check the integrity of any
 586    chunk of that content it receives as follows. It first calculates the
 587    hash of the chunk it received, for example chunk C4 in the previous
 588    figure. Along with this chunk it MUST receive the hashes required to
 589    check the integrity of that chunk. In principle, these are the hash
 590    of the chunk's sibling (C5) and that of its "uncles". A chunk's
 591    uncles are the sibling Y of its parent X, and the uncle of that Y,
 592    recursively until the root is reached. For chunk C4 its uncles are
 593    bins 13 and 3, marked with * in the figure. Using this information
 594    the peer recalculates the root hash of the tree, and compares it to
 595    the root hash it received from the trusted source. If they match the
 596    chunk of content has been positively verified to be the requested
 597    part of the content. Otherwise, the sending peer either sent the
 598    wrong content or the wrong sibling or uncle hashes. For simplicity,
 599    the set of sibling and uncles hashes is collectively referred to as
 600    the "uncle hashes".
 601
 602    In the case of live streaming the tree of chunks grows dynamically
 603    and content is identified with a public key instead of a root hash,
 604    as the root hash is undefined or, more precisely, transient, as long
 605    as new data is generated by the live source. Live streaming is
 606    described in more detail below, but content verification works the
 607    same for both live and predefined content.
 608
 609 3.5.3. The Atomic Datagram Principle
 610
 611    As explained above, a datagram consists of a sequence of messages.
 612    Ideally, every datagram sent must be independent of other datagrams,
 613
 614
 615
 616 Grishchenko and Bakker   Expires June 21, 2012                 [Page 11]
 617 \f
 618 Internet-Draft                   swift                 December 19, 2011
 619
 620
 621    so each datagram SHOULD be processed separately and a loss of one
 622    datagram MUST NOT disrupt the flow.  Thus, as a datagram carries zero
 623    or more messages, neither messages nor message interdependencies
 624    should span over multiple datagrams.
 625
 626    This principle implies that as any chunk is verified using its uncle
 627    hashes the necessary hashes MUST be put into the same datagram as the
 628    chunk's data (Sec. 3.5.4).  As a general rule, if some additional
 629    data is still missing to process a message within a datagram, the
 630    message SHOULD be dropped.
 631
 632    The hashes necessary to verify a chunk are in principle its sibling's
 633    hash and all its uncle hashes, but the set of hashes to sent can be
 634    optimized. Before sending a packet of data to the receiver, the
 635    sender inspects the receiver's previous acknowledgments (HAVE or ACK)
 636    to derive which hashes the receiver already has for sure. Suppose,
 637    the receiver had acknowledged bin 1 (first two chunks of the file),
 638    then it must already have uncle hashes 5, 11 and so on. That is
 639    because those hashes are necessary to check packets of bin 1 against
 640    the root hash. Then, hashes 3, 7 and so on must be also known as they
 641    are calculated in the process of checking the uncle hash chain.
 642    Hence, to send bin 12 (i.e. the 7th chunk of content), the sender
 643    needs to include just the hashes for bins 14 and 9, which let the
 644    data be checked against hash 11 which is already known to the
 645    receiver.
 646
 647    The sender MAY optimistically skip hashes which were sent out in
 648    previous, still unacknowledged datagrams. It is an optimization
 649    tradeoff between redundant hash transmission and possibility of
 650    collateral data loss in the case some necessary hashes were lost in
 651    the network so some delivered data cannot be verified and thus has to
 652    be dropped. In either case, the receiver builds the Merkle tree on-
 653    demand, incrementally, starting from the root hash, and uses it for
 654    data validation.
 655
 656    In short, the sender MUST put into the datagram the missing hashes
 657    necessary for the receiver to verify the chunk.
 658
 659 3.5.4. DATA and HASH Messages
 660
 661    Concretely, a peer that wants to send a chunk of content creates a
 662    datagram that MUST consist of one or more HASH messages and a DATA
 663    message. The datagram MUST contain a HASH message for each hash the
 664    receiver misses for integrity checking. A HASH message MUST contain
 665    the bin number and hash data for each of those hashes. The DATA
 666    message MUST contain the bin number of the chunk and chunk itself. A
 667    peer MAY send the required messages for multiple chunks in the same
 668    datagram.
 669
 670
 671
 672 Grishchenko and Bakker   Expires June 21, 2012                 [Page 12]
 673 \f
 674 Internet-Draft                   swift                 December 19, 2011
 675
 676
 677 3.6. HINT
 678
 679    While bulk download protocols normally do explicit requests for
 680    certain ranges of data (i.e., use a pull model, for example,
 681    BitTorrent [BITTORRENT]), live streaming protocols quite often use a
 682    request-less push model to save round trips. Swift supports both
 683    models of operation.
 684
 685    A peer MUST send a HINT message containing the bin of the chunk
 686    interval it wants to download. A peer receiving a HINT message MAY
 687    send out requested pieces. When it receives multiple HINTs (either in
 688    one datagram or in multiple), the peer SHOULD process the HINTs
 689    sequentially. When live streaming, it also may send some other chunks
 690    in case it runs out of requests or for some other reason. In that
 691    case the only purpose of HINT messages is to coordinate peers and to
 692    avoid unnecessary data retransmission, hence the name.
 693
 694
 695
 696 3.7. Peer Address Exchange and NAT Hole Punching
 697
 698    Peer address exchange messages (or PEX messages for short) are common
 699    for many peer-to-peer protocols. By exchanging peer addresses in
 700    gossip fashion, peers relieve central coordinating entities (the
 701    trackers) from unnecessary work. swift optionally features two types
 702    of PEX messages: PEX_REQ and PEX_ADD. A peer that wants to retrieve
 703    some peer addresses MUST send a PEX_REQ message. The receiving peer
 704    MAY respond with a PEX_ADD message containing the addresses of
 705    several peers. The addresses MUST be of peers it has recently
 706    exchanged messages with to guarantee liveliness.
 707
 708    To unify peer exchange and NAT hole punching functionality, the
 709    sending pattern of PEX messages is restricted. As the swift handshake
 710    is able to do simple NAT hole punching [SNP] transparently, PEX
 711    messages must be emitted in the way to facilitate that. Namely, once
 712    peer A introduces peer B to peer C by sending a PEX_ADD message to C,
 713    it SHOULD also send a message to B introducing C. The messages SHOULD
 714    be within 2 seconds from each other, but MAY not be, simultaneous,
 715    instead leaving a gap of twice the "typical" RTT, i.e. 300-600ms. The
 716    peers are supposed to initiate handshakes to each other thus forming
 717    a simple NAT hole punching pattern where the introducing peer
 718    effectively acts as a STUN server [RFC5389]. Still, peers MAY ignore
 719    PEX messages if uninterested in obtaining new peers or because of
 720    security considerations (rate limiting) or any other reason.
 721
 722    The PEX messages can be used to construct a dedicated tracker peer.
 723
 724
 725
 726
 727
 728 Grishchenko and Bakker   Expires June 21, 2012                 [Page 13]
 729 \f
 730 Internet-Draft                   swift                 December 19, 2011
 731
 732
 733 3.8. KEEPALIVE
 734
 735    A peer MUST send a datagram containing a KEEPALIVE message
 736    periodically to each peer it wants to interact with in the future but
 737    has no other messages to send them at present.
 738
 739
 740 3.9. VERSION
 741    Peers MUST convey which version of the swift protocol they support
 742    using a VERSION message. This message MUST be included in the initial
 743    (handshake) datagrams and MUST indicate which version of the swift
 744    protocol the sending peer supports.
 745
 746
 747 3.10. Conveying Peer Capabilities
 748    Peers may support just a subset of the swift messages. For example,
 749    peers running over TCP may not accept ACK messages, or peers used
 750    with a centralized tracking infrastructure may not accept PEX
 751    messages. For these reasons, peers SHOULD signal which subset of the
 752    swift messages they support by means of the MSGTYPE_RCVD message.
 753    This message SHOULD be included in the initial (handshake) datagrams
 754    and MUST indicate which swift protocol messages the sending peer
 755    supports.
 756
 757
 758 3.11. Directory Lists
 759
 760    Directory list files MUST start with magic bytes ".\n..\n". The rest
 761    of the file is a newline-separated list of hashes and file names for
 762    the content of the directory. An example:
 763
 764    .
 765    ..
 766    1234567890ABCDEF1234567890ABCDEF12345678  readme.txt
 767    01234567890ABCDEF1234567890ABCDEF1234567  big_file.dat
 768
 769
 770
 771 4. Automatic Detection of Content Size
 772
 773    In swift, the root hash of a static content asset, such as a video
 774    file, along with some peer addresses is sufficient to start a
 775    download. In addition, swift can reliably and automatically derive
 776    the size of such content from information received from the network
 777    when fixed sized chunks are used. As a result, it is not necessary to
 778    include the size of the content asset as the metadata of the content,
 779    in addition to the root hash. Implementations of swift MAY use this
 780    automatic detection feature.
 781
 782
 783
 784 Grishchenko and Bakker   Expires June 21, 2012                 [Page 14]
 785 \f
 786 Internet-Draft                   swift                 December 19, 2011
 787
 788
 789 4.1. Peak Hashes
 790
 791    The ability for a newcomer peer to detect the size of the content
 792    depends heavily on the concept of peak hashes. Peak hashes, in
 793    general, enable two cornerstone features of swift: reliable file size
 794    detection and download/live streaming unification (see Sec. 5). The
 795    concept of peak hashes depends on the concepts of filled and
 796    incomplete bins. Recall that when constructing the binary trees for
 797    content verification and addressing the base of the tree may have
 798    more leaves than the number of chunks in the content. In the Merkle
 799    hash tree these leaves were assigned empty all-zero hashes to be able
 800    to calculate the higher level hashes. A filled bin is now defined as
 801    a bin number that addresses an interval of leaves that consists only
 802    of hashes of content chunks, not empty hashes. Reversely, an
 803    incomplete (not filled) bin addresses an interval that contains also
 804    empty hashes, typically an interval that extends past the end of the
 805    file. In the following figure bins 7, 11, 13 and 14 are incomplete
 806    the rest is filled.
 807
 808    Formally, a peak hash is a hash in the Merkle tree defined over a
 809    filled bin, whose sibling is defined over an incomplete bin.
 810    Practically, suppose a file is 7162 bytes long and a chunk is 1
 811    kilobyte. That file fits into 7 chunks, the tail chunk being 1018
 812    bytes long. The Merkle tree for that file looks as follows. Following
 813    the definition the peak hashes of this file are in bins 3, 9 and 12,
 814    denoted with a *. E denotes an empty hash.
 815
 816                            7
 817                           / \
 818                         /     \
 819                       /         \
 820                     /             \
 821                   3*               11
 822                  / \              / \
 823                 /   \            /   \
 824                /     \          /     \
 825               1       5        9*      13
 826              / \     / \      / \      / \
 827             0   2   4   6    8   10  12*  14
 828
 829             C0  C1  C2  C3   C4  C5  C6   E
 830                                      = 1018 bytes
 831
 832    Peak hashes can be explained by the binary representation of the
 833    number of chunks the file occupies. The binary representation for 7
 834    is 111. Every "1" in binary representation of the file's packet
 835    length corresponds to a peak hash. For this particular file there are
 836    indeed three peaks, bin numbers 3, 9, 12. The number of peak hashes
 837
 838
 839
 840 Grishchenko and Bakker   Expires June 21, 2012                 [Page 15]
 841 \f
 842 Internet-Draft                   swift                 December 19, 2011
 843
 844
 845    for a file is therefore also at most logarithmic with its size.
 846
 847    A peer knowing which bins contain the peak hashes for the file can
 848    therefore calculate the number of chunks it consists of, and thus get
 849    an estimate of the file size (given all chunks but the last are fixed
 850    size). Which bins are the peaks can  be securely communicated from
 851    one (untrusted) peer A to another B by letting A send the peak hashes
 852    and their bin numbers to B. It can be shown that the root hash that B
 853    obtained from a trusted source is sufficient to verify that these are
 854    indeed the right peak hashes, as follows.
 855
 856    Lemma: Peak hashes can be checked against the root hash.
 857
 858    Proof: (a) Any peak hash is always the left sibling. Otherwise, be it
 859    the right sibling, its left neighbor/sibling must also be defined
 860    over a filled bin, because of the way chunks are laid out in the
 861    leaves, contradiction. (b) For the rightmost peak hash, its right
 862    sibling is zero. (c) For any peak hash, its right sibling might be
 863    calculated using peak hashes to the left and zeros for empty bins.
 864    (d) Once the right sibling of the leftmost peak hash is calculated,
 865    its parent might be calculated. (e) Once that parent is calculated,
 866    we might trivially get to the root hash by concatenating the hash
 867    with zeros and hashing it repeatedly.
 868
 869    Informally, the Lemma might be expressed as follows: peak hashes
 870    cover all data, so the remaining hashes are either trivial (zeros) or
 871    might be calculated from peak hashes and zero hashes.
 872
 873    Finally, once peer B has obtained the number of chunks in the content
 874    it can determine the exact file size as follows. Given that all
 875    chunks except the last are fixed size B just needs to know the size
 876    of the last chunk. Knowing the number of chunks B can calculate the
 877    bin number of the last chunk and download it. As always B verifies
 878    the integrity of this chunk against the trusted root hash. As there
 879    is only one chunk of data that leads to a successful verification the
 880    size of this chunk must be correct. B can then determine the exact
 881    file size as
 882
 883        (number of chunks -1) * fixed chunk size + size of last chunk
 884
 885
 886 4.2. Procedure
 887
 888    A swift implementation that wants to use automatic size detection
 889    MUST operate as follows. When a peer B sends a DATA message for the
 890    first time to a peer A, B MUST include all the peak hashes for the
 891    content in the same datagram, unless A has already signalled earlier
 892    in the exchange that it knows the peak hashes by having acknowledged
 893
 894
 895
 896 Grishchenko and Bakker   Expires June 21, 2012                 [Page 16]
 897 \f
 898 Internet-Draft                   swift                 December 19, 2011
 899
 900
 901    any bin, even the empty one. The receiver A MUST check the peak
 902    hashes against the root hash to determine the approximate content
 903    size. To obtain the definite content size peer A MUST download the
 904    last chunk of the content from any peer that offers it.
 905
 906
 907
 908
 909
 910 5. Live streaming
 911
 912    In the case of live streaming a transfer is bootstrapped with a
 913    public key instead of a root hash, as the root hash is undefined or,
 914    more precisely, transient, as long as new data is being generated by
 915    the live source. Live/download unification is achieved by sending
 916    signed peak hashes on-demand, ahead of the actual data. As before,
 917    the sender might use acknowledgements to derive which content range
 918    the receiver has peak hashes for and to prepend the data hashes with
 919    the necessary (signed) peak hashes. Except for the fact that the set
 920    of peak hashes changes with time, other parts of the algorithm work
 921    as described in Sec. 3.
 922
 923    As with static content assets in the previous section, in live
 924    streaming content length is not known on advance, but derived
 925    on-the-go from the peak hashes. Suppose, our 7 KB stream extended to
 926    another kilobyte. Thus, now hash 7 becomes the only peak hash, eating
 927    hashes 3, 9 and 12. So, the source sends out a SIGNED_HASH message to
 928    announce the fact.
 929
 930    The number of cryptographic operations will be limited. For example,
 931    consider a 25 frame/second video transmitted over UDP. When each
 932    frame is transmitted in its own chunk, only 25 signature verification
 933    operations per second are required at the receiver for bitrates up to
 934    ~12.8 megabit/second. For higher bitrates multiple UDP packets per
 935    frame are needed and the number of verifications doubles.
 936
 937
 938
 939
 940
 941 6. Transport Protocols and Encapsulation
 942
 943 6.1. UDP
 944
 945 6.1.1. Chunk Size
 946
 947    Currently, swift-over-UDP is the preferred deployment option.
 948    Effectively, UDP allows the use of IP with minimal overhead and it
 949
 950
 951
 952 Grishchenko and Bakker   Expires June 21, 2012                 [Page 17]
 953 \f
 954 Internet-Draft                   swift                 December 19, 2011
 955
 956
 957    also allows userspace implementations. The default is to use chunks
 958    of 1 kilobyte such that a datagram fits in an Ethernet-sized IP
 959    packet. The bin numbering allows to use swift over Jumbo
 960    frames/datagrams. Both DATA and HAVE/ACK messages may use e.g. 8
 961    kilobyte packets instead of the standard 1 KiB. The hashing scheme
 962    stays the same. Using swift with 512 or 256-byte packets is
 963    theoretically possible with 64-bit byte-precise bin numbers, but IP
 964    fragmentation might be a better method to achieve the same result.
 965
 966
 967 6.1.2. Datagrams and Messages
 968
 969    When using UDP, the abstract datagram described above corresponds
 970    directly to a UDP datagram. Each message within a datagram has a
 971    fixed length, which depends on the type of the message. The first
 972    byte of a message denotes its type. The currently defined types are:
 973
 974            HANDSHAKE = 0x00
 975            DATA = 0x01
 976            ACK = 0x02
 977            HAVE = 0x03
 978            HASH = 0x04
 979            PEX_ADD = 0x05
 980            PEX_REQ = 0x06
 981            SIGNED_HASH = 0x07
 982            HINT = 0x08
 983            MSGTYPE_RCVD = 0x09
 984            VERSION = 0x10
 985
 986
 987    Furthermore, integers are serialized in the network (big-endian) byte
 988    order. So consider the example of an ACK message (Sec 3.4). It has
 989    message type of 0x02 and a payload of a bin number, a four-byte
 990    integer (say, 1); hence, its on the wire representation for UDP can
 991    be written in hex as: "02 00000001". This hex-like two character-per-
 992    byte notation is used to represent message formats in the rest of
 993    this section.
 994
 995 6.1.3. Channels
 996
 997    As it is increasingly complex for peers to enable UDP communication
 998    between each other due to NATs and firewalls, swift-over-UDP uses a
 999    multiplexing scheme, called "channels", to allow multiple swarms to
1000    use the same UDP port. Channels loosely correspond to TCP connections
1001    and each channel belongs to a single swarm. When channels are used,
1002    each datagram starts with four bytes corresponding to the receiving
1003    channel number.
1004
1005
1006
1007
1008 Grishchenko and Bakker   Expires June 21, 2012                 [Page 18]
1009 \f
1010 Internet-Draft                   swift                 December 19, 2011
1011
1012
1013 6.1.4. HANDSHAKE and VERSION
1014
1015    A channel is established with a handshake. To start a handshake, the
1016    initiating peer needs to know:
1017
1018    (1) the IP address of a peer
1019    (2) peer's UDP port and
1020    (3) the root hash of the content (see Sec. 3.5.1).
1021
1022    To do the handshake the initiating peer sends a datagram that MUST
1023    start with an all 0-zeros channel number followed by a VERSION
1024    message, then a HASH message whose payload is the root hash, and a
1025    HANDSHAKE message, whose only payload is a locally unused channel
1026    number.
1027
1028    On the wire the datagram will look something like this:
1029       00000000 10 01
1030       04 7FFFFFFF 1234123412341234123412341234123412341234
1031       00 00000011
1032    (to unknown channel, handshake from channel 0x11 speaking protocol
1033    version 0x01, initiating a transfer of a file with a root hash
1034    123...1234)
1035
1036    The receiving peer MUST respond with a datagram that starts with the
1037    channel number from the sender's HANDSHAKE message, followed by a
1038    VERSION message, then a HANDSHAKE message, whose only payload is a
1039    locally unused channel number, followed by any other messages it
1040    wants to send.
1041
1042    Peer's response datagram on the wire:
1043       00000011 10 01
1044       00 00000022  03 00000003
1045    (peer to the initiator: use channel number 0x22 for this transfer and
1046    proto version 0x01; I also have first 4 chunks of the file, see Sec.
1047    4.3)
1048
1049    At this point, the initiator knows that the peer really responds; for
1050    that purpose channel ids MUST be random enough to prevent easy
1051    guessing. So, the third datagram of a handshake MAY already contain
1052    some heavy payload. To minimize the number of initialization
1053    roundtrips, the first two datagrams MAY also contain some minor
1054    payload, e.g. a couple of HAVE messages roughly indicating the
1055    current progress of a peer or a HINT (see Sec. 3.6). When receiving
1056    the third datagram, both peers have the proof they really talk to
1057    each other; three-way handshake is complete.
1058
1059    A peer MAY explicit close a channel by sending a HANDSHAKE message
1060    that MUST contain an all 0-zeros channel number.
1061
1062
1063
1064 Grishchenko and Bakker   Expires June 21, 2012                 [Page 19]
1065 \f
1066 Internet-Draft                   swift                 December 19, 2011
1067
1068
1069    On the wire:
1070       00 00000000
1071
1072
1073 6.1.5. HAVE
1074
1075    A HAVE message (type 0x03) states that the sending peer has the
1076    complete specified bin and successfully checked its integrity:
1077       03 00000003
1078    (got/checked first four kilobytes of a file/stream)
1079
1080
1081
1082 6.1.6. ACK
1083
1084    An ACK message (type 0x02) acknowledges data that was received from
1085    its addressee; to facilitate delay-based congestion control, an
1086    ACK message contains a timestamp, in particular, a 64-bit microsecond
1087    time.
1088       02 00000002 12345678
1089    (got the second kilobyte of the file from you; my microsecond
1090    timer was showing 0x12345678 at that moment)
1091
1092
1093 6.1.7. HASH
1094
1095    A HASH message (type 0x04) consists of a four-byte bin number and
1096    the cryptographic hash (e.g. a 20-byte SHA1 hash)
1097       04 7FFFFFFF 1234123412341234123412341234123412341234
1098
1099
1100 6.1.8. DATA
1101
1102    A DATA message (type 0x01) consists of a four-byte bin number and the
1103    actual chunk. In case a datagram contains a DATA message, a sender
1104    MUST always put the data message in the tail of the datagram. For
1105    example:
1106       01 00000000 48656c6c6f20776f726c6421
1107    (This message accommodates an entire file: "Hello world!")
1108
1109
1110 6.1.9. KEEPALIVE
1111
1112    Keepalives do not have a message type on UDP. They are just simple
1113    datagrams consisting of a 4-byte channel id only.
1114
1115    On the wire:
1116       00000022
1117
1118
1119
1120 Grishchenko and Bakker   Expires June 21, 2012                 [Page 20]
1121 \f
1122 Internet-Draft                   swift                 December 19, 2011
1123
1124
1125 6.1.10. Flow and Congestion Control
1126
1127    Explicit flow control is not necessary in swift-over-UDP. In the case
1128    of video-on-demand the receiver will request data explicitly from
1129    peers and is therefore in control of how much data is coming towards
1130    it. In the case of live streaming, where a push-model may be used,
1131    the amount of data incoming is limited to the bitrate, which the
1132    receiver must be able to process otherwise it cannot play the stream.
1133    Should, for any reason, the receiver get saturated with data that
1134    situation is perfectly detected by the congestion control. Swift-
1135    over-UDP can support different congestion control algorithms, in
1136    particular, it supports the new IETF Low Extra Delay Background
1137    Transport (LEDBAT) congestion control algorithm that ensures that
1138    peer-to-peer traffic yields to regular best-effort traffic [LEDBAT].
1139
1140
1141 6.2. TCP
1142
1143    When run over TCP, swift becomes functionally equivalent to
1144    BitTorrent. Namely, most swift messages have corresponding BitTorrent
1145    messages and vice versa, except for BitTorrent's explicit interest
1146    declarations and choking/unchoking, which serve the classic
1147    implementation of the tit-for-tat algorithm [TIT4TAT]. However, TCP
1148    is not well suited for multiparty communication, as argued in Sec. 9.
1149
1150
1151 6.3. RTP Profile for PPSP
1152
1153    In this section we sketch how swift can be integrated into RTP
1154    [RFC3550] to form the Peer-to-Peer Streaming Protocol (PPSP) [I-
1155    D.ietf-ppsp-reqs] running over UDP. The PPSP charter requires
1156    existing media transfer protocols be used [PPSPCHART]. Hence, the
1157    general idea is to define swift as a profile of RTP, in the same way
1158    as the Secure Real-time Transport Protocol (SRTP) [RFC3711]. SRTP,
1159    and therefore swift is considered ``a "bump in the stack"
1160    implementation which resides between the RTP application and the
1161    transport layer. [swift] intercepts RTP packets and then forwards an
1162    equivalent [swift] packet on the sending side, and intercepts [swift]
1163    packets and passes an equivalent RTP packet up the stack on the
1164    receiving side.'' [RFC3711].
1165
1166    In particular, to encode a swift datagram in an RTP packet all the
1167    non-DATA messages of swift such as HINT and HAVE are postfixed to the
1168    RTP packet using the UDP encoding and the content of DATA messages is
1169    sent in the payload field. Implementations MAY omit the RTP header
1170    for packets without payload. This construction allows the streaming
1171    application to use of all RTP's current features, and with a
1172    modification to the Merkle tree hashing scheme (see below) meets
1173
1174
1175
1176 Grishchenko and Bakker   Expires June 21, 2012                 [Page 21]
1177 \f
1178 Internet-Draft                   swift                 December 19, 2011
1179
1180
1181    swift's atomic datagram principle. The latter means that a receiving
1182    peer can autonomously verify the RTP packet as being correct content,
1183    thus preventing the spread of corrupt data (see requirement PPSP.SEC-
1184    REQ-4).
1185
1186    The use of ACK messages for reliability is left as a choice of the
1187    application using PPSP.
1188
1189
1190 6.3.1. Design
1191
1192    6.3.1.1. Joining a Swarm
1193
1194    To commence a PPSP download a peer A must have the swarm ID of the
1195    stream and a list of one or more tracker contact points (e.g.
1196    host+port). The list of trackers is optional in the presence of a
1197    decentralized tracking mechanism. The swarm ID consists of the swift
1198    root hash of the content, which is divided into chunks (see
1199    Discussion).
1200
1201    Peer A now registers with the PPSP tracker following the tracker
1202    protocol [I-D.ietf.ppsp-reqs] and receives the IP address and RTP
1203    port of peers already in the swarm, say B, C, and D. Peer A now sends
1204    an RTP packet containing a HANDSHAKE without channel information to
1205    B, C, and D. This serves as an end-to-end check that the peers are
1206    actually in the correct swarm. Optionally A could include a HINT
1207    message in some RTP packets if it wants to start receiving content
1208    immediately. B and C respond with a HANDSHAKE and HAVE messages.  D
1209    sends just a HANDSHAKE and omits HAVE messages as a way of choking A.
1210
1211
1212    6.3.1.2. Exchanging Chunks
1213
1214    In response to B and C, A sends new RTP packets to B and C with HINTs
1215    for disjunct sets of chunks. B and C respond with the requested
1216    chunks in the payload and HAVE messages, updating their chunk
1217    availability. Upon receipt, A sends HAVE for the chunks received and
1218    new HINT messages to B and C. When e.g. C finds that A obtained a
1219    chunk (from B) that C did not yet have, C's response includes a HINT
1220    for that chunk.
1221
1222    D does not send HAVE messages, instead if D decides to unchoke peer
1223    A, it sends an RTP packet with HAVE messages to inform A of its
1224    current availability. If B or C decide to choke A they stop sending
1225    HAVE and DATA messages and A should then rerequest from other peers.
1226    They may continue to send HINT messages, or exponentially slowing
1227    KEEPALIVE messages such that A keeps sending them HAVE messages.
1228
1229
1230
1231
1232 Grishchenko and Bakker   Expires June 21, 2012                 [Page 22]
1233 \f
1234 Internet-Draft                   swift                 December 19, 2011
1235
1236
1237    Once A has received all content (video-on-demand use case) it stops
1238    sending messages to all other peers that have all content (a.k.a.
1239    seeders).
1240
1241
1242    6.3.1.3. Leaving a Swarm
1243
1244    Peers can implicitly leave a swarm by stopping to respond to
1245    messages. Sending peers should remove these peers from the current
1246    peer list. This mechanism works for both graceful and ungraceful
1247    leaves (i.e., peer crashes or disconnects). When leaving gracefully,
1248    a peer should deregister from the tracker following the PPSP tracker
1249    protocol.
1250
1251    More explicit graceful leaves could be implemented using RTCP. In
1252    particular, a peer could send a RTCP BYE on the RTCP port that is
1253    derivable from a peer's RTP port for all peers in its current peer
1254    list. However, to prevent malicious peers from sending BYEs a form of
1255    peer authentication is required (e.g. using public keys as peer IDs
1256    [PERMIDS].)
1257
1258
1259    6.3.1.4. Discussion
1260
1261    Using swift as an RTP profile requires a change to the content
1262    integrity protection scheme (see Sec. 3.5). The fields in the RTP
1263    header, such as the timestamp and PT fields, must be protected by the
1264    Merkle tree hashing scheme to prevent malicious alterations.
1265    Therefore, the Merkle tree is no longer constructed from pure content
1266    chunks, but from the complete RTP packet for a chunk as it would be
1267    transmitted (minus the non-DATA swift messages). In other words, the
1268    hash of the leaves in the tree is the hash over the Authenticated
1269    Portion of the RTP packet as defined by SRTP, illustrated in the
1270    following figure (extended from [RFC3711]). There is no need for the
1271    RTP packets to be fixed size, as the hashing scheme can deal with
1272    variable-sized leaves.
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288 Grishchenko and Bakker   Expires June 21, 2012                 [Page 23]
1289 \f
1290 Internet-Draft                   swift                 December 19, 2011
1291
1292
1293         0                   1                   2                   3
1294       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1295      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+
1296      |V=2|P|X|  CC   |M|     PT      |       sequence number         | |
1297      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1298      |                           timestamp                           | |
1299      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1300      |           synchronization source (SSRC) identifier            | |
1301      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ |
1302      |            contributing source (CSRC) identifiers             | |
1303      |                               ....                            | |
1304      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1305      |                   RTP extension (OPTIONAL)                    | |
1306      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1307      |                          payload  ...                         | |
1308      |                               +-------------------------------+ |
1309      |                               | RTP padding   | RTP pad count | |
1310      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+
1311      ~                     swift non-DATA messages (REQUIRED)        ~ |
1312      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1313      |                 length of swift messages (REQUIRED)           | |
1314      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1315                                                                        |
1316                                               Authenticated Portion ---+
1317
1318            Figure: The format of an RTP-Swift packet.
1319
1320
1321    As a downside, with variable-sized payloads the automatic content
1322    size detection of Section 4 no longer works, so content length MUST
1323    be explicit in the metadata. In addition, storage on disk is more
1324    complex with out-of-order, variable-sized packets. On the upside,
1325    carrying RTP over swift allow decryption-less caching.
1326
1327    As with UDP, another matter is how much data is carried inside each
1328    packet. An important swift-specific factor here is the resulting
1329    number of hash calculations per second needed to verify chunks.
1330    Experiments should be conducted to ensure they are not excessive for,
1331    e.g., mobile hardware.
1332
1333    At present, Peer IDs are not required in this design.
1334
1335
1336 6.3.2. PPSP Requirements
1337
1338    6.3.2.1. Basic Requirements
1339
1340    - PPSP.REQ-1: The swift PEX message can also be used as the basis for
1341
1342
1343
1344 Grishchenko and Bakker   Expires June 21, 2012                 [Page 24]
1345 \f
1346 Internet-Draft                   swift                 December 19, 2011
1347
1348
1349    a tracker protocol, to be discussed elsewhere.
1350
1351    - PPSP.REQ-2: This draft preserves the properties of RTP.
1352
1353    - PPSP.REQ-3: This draft does not place requirements on peer IDs,
1354    IP+port is sufficient.
1355
1356    - PPSP.REQ-4: The content is identified by its root hash (video-on-
1357    demand) or a public key (live streaming).
1358
1359    - PPSP.REQ-5: The content is partitioned by the streaming
1360    application.
1361
1362    - PPSP.REQ-6: Each chunk is identified by a bin number (and its
1363    cryptographic hash.)
1364
1365    - PPSP.REQ-7: The protocol is carried over UDP because RTP is.
1366
1367    - PPSP.REQ-8: The protocol has been designed to allow meaningful data
1368    transfer between peers as soon as possible and to avoid unnecessary
1369    round-trips. It supports small and variable chunk sizes, and its
1370    content integrity protection enables wide scale caching.
1371
1372
1373    6.3.2.2. Peer Protocol Requirements
1374
1375    - PPSP.PP.REQ-1: A GET_HAVE would have to be added to request which
1376    chunks are available from a peer, if the proposed push-based HAVE
1377    mechanism is not sufficient.
1378
1379    - PPSP.PP.REQ-2: A set of HAVE messages satisfies this.
1380
1381    - PPSP.PP.REQ-3: The PEX_REQ message satisfies this. Care should be
1382    taken with peer address exchange in general, as the use of such
1383    hearsay is a risk for the protocol as it may be exploited by
1384    malicious peers (as a DDoS attack mechanism). A secure tracking /
1385    peer sampling protocol like [PUPPETCAST] may be needed to make peer-
1386    address exchange safe.
1387
1388    - PPSP.PP.REQ-4: HAVE messages convey current availability via a push
1389    model.
1390
1391    - PPSP.PP.REQ-5: Bin numbering enables a compact representation of
1392    chunk availability.
1393
1394    - PPSP.PP.REQ-6: A new PPSP specific Peer Report message would have
1395    to be added to RTCP.
1396
1397
1398
1399
1400 Grishchenko and Bakker   Expires June 21, 2012                 [Page 25]
1401 \f
1402 Internet-Draft                   swift                 December 19, 2011
1403
1404
1405    - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in
1406    this protocol.
1407
1408
1409    6.3.2.3. Security Requirements
1410
1411    - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
1412    [CLOSED] would have to be added.
1413
1414    - PPSP.SEC.REQ-2: As RTP is carried verbatim over swift, RTP
1415    encryption can be used. Note that just encrypting the RTP part will
1416    allow for caching servers that are part of the swarm but do not need
1417    access to the decryption keys. They just need access to the swift
1418    HASHES in the postfix to verify the packet's integrity.
1419
1420    - PPSP.SEC.REQ-3: RTP encryption or IPsec [RFC4303] can be used, if
1421    the swift messages must also be encrypted.
1422
1423    - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
1424    indirect spread of corrupt content, as peers will only forward chunks
1425    to others if their integrity check out. Another protection mechanism
1426    is to not depend on hearsay (i.e., do not forward other peers'
1427    availability information), or to only use it when the information
1428    spread is self-certified by its subjects.
1429
1430    Other attacks, such as a malicious peer claiming it has content but
1431    not replying, are still possible. Or wasting CPU and bandwidth at a
1432    receiving peer by sending packets where the DATA doesn't match the
1433    HASHes.
1434
1435
1436    - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
1437    peer to detect a malicious or faulty sender, which it can
1438    subsequently ignore. Spreading this knowledge to other peers such
1439    that they know about this bad behavior is hearsay.
1440
1441
1442    - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
1443    malicious peers launch an Eclipse [ECLIPSE] attack on the initial
1444    injectors of the content (in particular in live streaming). The
1445    attack tries to let the injector upload to just malicious peers which
1446    then do not forward the content to others, thus stopping the
1447    distribution. An Eclipse attack could also be launched on an
1448    individual peer. Letting these injectors only use trusted trackers
1449    that provide true random samples of the population or using a secure
1450    peer sampling service [PUPPETCAST] can help negate such an attack.
1451
1452
1453
1454
1455
1456 Grishchenko and Bakker   Expires June 21, 2012                 [Page 26]
1457 \f
1458 Internet-Draft                   swift                 December 19, 2011
1459
1460
1461    - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
1462    additional mechanisms such as DHTs [SECDHTS], but self-certification
1463    of addresses is needed. Self-certification means For example, that
1464    each peer has a public/private key pair [PERMIDS] and creates self-
1465    certified address changes that include the swarm ID and a timestamp,
1466    which are then exchanged among peers or stored in DHTs. See also
1467    discussion of PPSP.PP.REQ-3 above. Content distribution can continue
1468    as long as there are peers that have it available.
1469
1470    - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
1471    trusted source is well-established in the BitTorrent protocol
1472    [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
1473    of this idea. Self-certification and not using hearsay are other
1474    lessons learned from existing distributed systems.
1475
1476    - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
1477    self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.
1478
1479
1480 6.4. HTTP (as PPSP)
1481
1482    In this section we sketch how swift can be carried over HTTP
1483    [RFC2616] to form the PPSP running over TCP. The general idea is to
1484    encode a swift datagram in HTTP GET and PUT requests and their
1485    replies by transmitting all the non-DATA messages such as HINTs and
1486    HAVEs as headers and send DATA messages in the body. This idea
1487    follows the atomic datagram principle for each request and reply. So
1488    a receiving peer can autonomously verify the message as carrying
1489    correct data, thus preventing the spread of corrupt data (see
1490    requirement PPSP.SEC-REQ-4).
1491
1492    A problem with HTTP is that it is a client/server protocol. To
1493    overcome this problem, a peer A uses a PUT request instead of a GET
1494    request if the peer B has indicated in a reply that it wants to
1495    retrieve a chunk from A. In cases where peer A is no longer
1496    interested in receiving requests from B (described below) B may need
1497    to establish a new HTTP connection to A to quickly download a chunk,
1498    instead of waiting for a convenient time when A sends another
1499    request. As an alternative design, two HTTP connections could be used
1500    always., but this is inefficient.
1501
1502 6.4.1. Design
1503
1504    6.4.1.1. Joining a Swarm
1505
1506    To commence a PPSP download a peer A must have the swarm ID of the
1507    stream and a list of one or more tracker contact points, as above.
1508    The swarm ID as earlier also consists of the swift root hash of the
1509
1510
1511
1512 Grishchenko and Bakker   Expires June 21, 2012                 [Page 27]
1513 \f
1514 Internet-Draft                   swift                 December 19, 2011
1515
1516
1517    content, divided in chunks by the streaming application (e.g. fixed-
1518    size chunks of 1 kilobyte for video-on-demand).
1519
1520    Peer A now registers with the PPSP tracker following the tracker
1521    protocol [I-D.ietf-ppsp-reqs] and receives the IP address and HTTP
1522    port of peers already in the swarm, say B, C, and D. Peer A now
1523    establishes persistent HTTP connections with B, C, D and sends GET
1524    requests with the Request-URI set to /<encoded roothash>. Optionally
1525    A could include a HINT message in some requests if it wants to start
1526    receiving content immediately. A HINT is encoded as a Range header
1527    with a new "bins" unit [RFC2616,$14.35].
1528
1529    B and C respond with a 200 OK reply with header-encoded HAVE
1530    messages.  A HAVE message is encoded as an extended Accept-Ranges:
1531    header [RFC2616,$14.5] with the new bins unit and the possibility of
1532    listing the set of accepted bins. If no HINT/Range header was present
1533    in the request, the body of the reply is empty. D sends just a 200 OK
1534    reply and omits the HAVE/Accept-Ranges header as a way of choking A.
1535
1536    6.4.1.2. Exchanging Chunks
1537
1538    In response to B and C, A sends GET requests with Range headers,
1539    requesting disjunct sets of chunks. B and C respond with 206 Partial
1540    Content replies with the requested chunks in the body and Accept-
1541    Ranges headers, updating their chunk availability. The HASHES for the
1542    chunks are encoded in a new Content-Merkle header and the Content-
1543    Range is set to identify the chunk [RFC2616,$14.16]. A new
1544    "multipart-bin ranges" equivalent to the "multipart-bytes ranges"
1545    media type may be used to transmit multiple chunks in one reply.
1546
1547    Upon receipt, A sends a new GET request with a HAVE/Accept-Ranges
1548    header for the chunks received and new HINT/Range headers to B and C.
1549    Now when e.g. C finds that A obtained a chunk (from B) that C did not
1550    yet have, C's response includes a HINT/Range for that chunk. In this
1551    case, A's next request to C is not a GET request, but a PUT request
1552    with the requested chunk sent in the body.
1553
1554    Again, working around the fact that HTTP is a client/server protocol,
1555    peer A periodically sends HEAD requests to peer D (which was
1556    virtually choking A) that serve as keepalives and may contain
1557    HAVE/Accept-Ranges headers. If D decides to unchoke peer A, it
1558    includes an Accept-Ranges header in the "200 OK" reply to inform A of
1559    its current chunk availability.
1560
1561    If B or C decide to choke A they start responding with 204 No Content
1562    replies without HAVE/Accept-Ranges headers and A should then re-
1563    request from other peers. However, if their replies contain
1564    HINT/Range headers A should keep on sending PUT requests with the
1565
1566
1567
1568 Grishchenko and Bakker   Expires June 21, 2012                 [Page 28]
1569 \f
1570 Internet-Draft                   swift                 December 19, 2011
1571
1572
1573    desired data (another client/server workaround). If not, A should
1574    slowly send HEAD requests as keepalive and content availability
1575    update.
1576
1577    Once A has received all content (video-on-demand use case) it closes
1578    the persistent connections to all other peers that have all content
1579    (a.k.a. seeders).
1580
1581
1582    6.4.1.3. Leaving a Swarm
1583
1584    Peers can explicitly leave a swarm by closing the connection. This
1585    mechanism works for both graceful and ungraceful leaves (i.e., peer
1586    crashes or disconnects). When leaving gracefully, a peer should
1587    deregister from the tracker following the PPSP tracker protocol.
1588
1589
1590    6.4.1.4. Discussion
1591
1592    As mentioned earlier, this design suffers from the fact that HTTP is
1593    a client/server protocol. A solution where a peer establishes two
1594    HTTP connections with every other peer may be more elegant, but
1595    inefficient. The mapping of swift messages to headers remains the
1596    same:
1597
1598         HINT = Range
1599         HAVE = Accept-Ranges
1600         HASH = Content-Merkle
1601         PEX = e.g. extended Content-Location
1602
1603    The Content-Merkle header should include some parameters to indicate
1604    the hash function and chunk size (e.g. SHA1 and 1K) used to build the
1605    Merkle tree.
1606
1607
1608 6.4.2. PPSP Requirements
1609
1610    6.4.2.1. Basic Requirements
1611
1612    - PPSP.REQ-1: The HTTP-based BitTorrent tracker protocol [BITTORRENT]
1613    can be used as the basis for a tracker protocol, to be discussed
1614    elsewhere.
1615
1616    - PPSP.REQ-2: This draft preserves the properties of HTTP, but extra
1617    mechanisms may be necessary to protect against faulty or malicious
1618    peers.
1619
1620    - PPSP.REQ-3: This draft does not place requirements on peer IDs,
1621
1622
1623
1624 Grishchenko and Bakker   Expires June 21, 2012                 [Page 29]
1625 \f
1626 Internet-Draft                   swift                 December 19, 2011
1627
1628
1629    IP+port is sufficient.
1630
1631    - PPSP.REQ-4: The content is identified by its root hash (video-on-
1632    demand) or a public key (live streaming).
1633
1634    - PPSP.REQ-5: The content is partitioned into chunks by the streaming
1635    application (see 6.4.1.1.)
1636
1637    - PPSP.REQ-6: Each chunk is identified by a bin number (and its
1638    cryptographic hash.)
1639
1640    - PPSP.REQ-7: The protocol is carried over TCP because HTTP is.
1641
1642
1643    6.4.2.2. Peer Protocol Requirements
1644
1645    - PPSP.PP.REQ-1: A HEAD request can be used to find out which chunks
1646    are available from a peer, which returns the new Accept-Ranges
1647    header.
1648
1649    - PPSP.PP.REQ-2: The new Accept-Ranges header satisfies this.
1650
1651    - PPSP.PP.REQ-3: A GET with a request-URI requesting the peers of a
1652    resource (e.g. /<encoded roothash>/peers) would have to be added to
1653    request known peers from a peer, if the proposed push-based
1654    PEX/~Content-Location mechanism is not sufficient. Care should be
1655    taken with peer address exchange in general, as the use of such
1656    hearsay is a risk for the protocol as it may be exploited by
1657    malicious peers (as a DDoS attack mechanism). A secure tracking /
1658    peer sampling protocol like [PUPPETCAST] may be needed to make peer-
1659    address exchange safe.
1660
1661
1662    - PPSP.PP.REQ-4: HAVE/Accept-Ranges headers convey current
1663    availability.
1664
1665    - PPSP.PP.REQ-5: Bin numbering enables a compact representation of
1666    chunk availability.
1667
1668    - PPSP.PP.REQ-6: A new PPSP specific Peer-Report header would have to
1669    be added.
1670
1671    - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in
1672    this protocol.
1673
1674
1675
1676
1677
1678
1679
1680 Grishchenko and Bakker   Expires June 21, 2012                 [Page 30]
1681 \f
1682 Internet-Draft                   swift                 December 19, 2011
1683
1684
1685    6.4.2.3. Security Requirements
1686
1687    - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
1688    [CLOSED] would have to be added.
1689
1690    - PPSP.SEC.REQ-2: As swift is carried over HTTP, HTTPS encryption can
1691    be used instead. Alternatively, just the body could be encrypted. The
1692    latter allows for caching servers that are part of the swarm but do
1693    not need access to the decryption keys (they need access to the swift
1694    HASHES in the headers to verify the packet's integrity).
1695
1696    - PPSP.SEC.REQ-3: HTTPS encryption or the content encryption
1697    facilities of HTTP can be used.
1698
1699    - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
1700    indirect spread of corrupt content, as peers will only forward
1701    content to others if its integrity checks out. Another protection
1702    mechanism is to not depend on hearsay (i.e., do not forward other
1703    peers' availability information), or to only use it when the
1704    information spread is self-certified by its subjects.
1705
1706    Other attacks such as a malicious peer claiming it has content, but
1707    not replying are still possible. Or wasting CPU and bandwidth at a
1708    receiving peer by sending packets where the body doesn't match the
1709    HASH/Content-Merkle headers.
1710
1711
1712    - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
1713    peer to detect a malicious or faulty sender, which it can
1714    subsequently close its connection to and ignore. Spreading this
1715    knowledge to other peers such that they know about this bad behavior
1716    is hearsay.
1717
1718
1719    - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
1720    malicious peers launch an Eclipse [ECLIPSE] attack on the initial
1721    injectors of the content (in particular in live streaming). The
1722    attack tries to let the injector upload to just malicious peers which
1723    then do not forward the content to others, thus stopping the
1724    distribution. An Eclipse attack could also be launched on an
1725    individual peer. Letting these injectors only use trusted trackers
1726    that provide true random samples of the population or using a secure
1727    peer sampling service [PUPPETCAST] can help negate such an attack.
1728
1729
1730    - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
1731    additional mechanisms such as DHTs [SECDHTS], but self-certification
1732    of addresses is needed. Self-certification means For example, that
1733
1734
1735
1736 Grishchenko and Bakker   Expires June 21, 2012                 [Page 31]
1737 \f
1738 Internet-Draft                   swift                 December 19, 2011
1739
1740
1741    each peer has a public/private key pair [PERMIDS] and creates self-
1742    certified address changes that include the swarm ID and a timestamp,
1743    which are then exchanged among peers or stored in DHTs. See also
1744    discussion of PPSP.PP.REQ-3 above. Content distribution can continue
1745    as long as there are peers that have it available.
1746
1747    - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
1748    trusted source is well-established in the BitTorrent protocol
1749    [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
1750    of this idea. Self-certification and not using hearsay are other
1751    lessons learned from existing distributed systems.
1752
1753    - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
1754    self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.
1755
1756
1757 7. Security Considerations
1758
1759    As any other network protocol, the swift faces a common set of
1760    security challenges. An implementation must consider the possibility
1761    of buffer overruns, DoS attacks and manipulation (i.e. reflection
1762    attacks). Any guarantee of privacy seems unlikely, as the user is
1763    exposing its IP address to the peers. A probable exception is the
1764    case of the user being hidden behind a public NAT or proxy.
1765
1766
1767 8. Extensibility
1768
1769 8.1. 32 bit vs 64 bit
1770
1771    While in principle the protocol supports bigger (>1TB) files, all the
1772    mentioned counters are 32-bit. It is an optimization, as using
1773    64-bit numbers on-wire may cost ~2% practical overhead. The 64-bit
1774    version of every message has typeid of 64+t, e.g. typeid 68 for
1775    64-bit hash message:
1776       44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567
1777
1778 8.2. IPv6
1779
1780    IPv6 versions of PEX messages use the same 64+t shift as just
1781    mentioned.
1782
1783
1784 8.3. Congestion Control Algorithms
1785
1786    Congestion control algorithm is left to the implementation and may
1787    even vary from peer to peer. Congestion control is entirely
1788    implemented by the sending peer, the receiver only provides clues,
1789
1790
1791
1792 Grishchenko and Bakker   Expires June 21, 2012                 [Page 32]
1793 \f
1794 Internet-Draft                   swift                 December 19, 2011
1795
1796
1797    such as hints, acknowledgments and timestamps. In general, it is
1798    expected that servers would use TCP-like congestion control schemes
1799    such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to
1800    use weaker-than-TCP (least than best effort) congestion control, such
1801    as [LEDBAT] to minimize seeding counter-incentives.
1802
1803
1804 8.4. Piece Picking Algorithms
1805
1806    Piece picking entirely depends on the receiving peer. The sender peer
1807    is made aware of preferred pieces by the means of HINT messages. In
1808    some scenarios it may be beneficial to allow the sender to ignore
1809    those hints and send unrequested data.
1810
1811
1812 8.5. Reciprocity Algorithms
1813
1814    Reciprocity algorithms are the sole responsibility of the sender
1815    peer. Reciprocal intentions of the sender are not manifested by
1816    separate messages (as BitTorrent's CHOKE/UNCHOKE), as it does not
1817    guarantee anything anyway (the "snubbing" syndrome).
1818
1819
1820 8.6. Different crypto/hashing schemes
1821
1822    Once a flavor of swift will need to use a different crypto scheme
1823    (e.g., SHA-256), a message should be allocated for that. As the root
1824    hash is supplied in the handshake message, the crypto scheme in use
1825    will be known from the very beginning. As the root hash is the
1826    content's identifier, different schemes of crypto cannot be mixed in
1827    the same swarm; different swarms may distribute the same content
1828    using different crypto.
1829
1830
1831 9. Rationale
1832
1833    Historically, the Internet was based on end-to-end unicast and,
1834    considering the failure of multicast, was addressed by different
1835    technologies, which ultimately boiled down to maintaining and
1836    coordinating distributed replicas. On one hand, downloading from a
1837    nearby well-provisioned replica is somewhat faster and/or cheaper; on
1838    the other hand, it requires to coordinate multiple parties (the data
1839    source, mirrors/CDN sites/peers, consumers). As the Internet
1840    progresses to richer and richer content, the overhead of peer/replica
1841    coordination becomes dwarfed by the mass of the download itself.
1842    Thus, the niche for multiparty transfers expands. Still, current,
1843    relevant technologies are tightly coupled to a single use case or
1844    even infrastructure of a particular corporation. The mission of our
1845
1846
1847
1848 Grishchenko and Bakker   Expires June 21, 2012                 [Page 33]
1849 \f
1850 Internet-Draft                   swift                 December 19, 2011
1851
1852
1853    project is to create a generic content-centric multiparty transport
1854    protocol to allow seamless, effortless data dissemination on the Net.
1855
1856                        TABLE 1. Use cases.
1857
1858          | mirror-based   peer-assisted        peer-to-peer
1859    ------+----------------------------------------------------
1860    data  | SunSITE        CacheLogic VelociX   BitTorrent
1861    VoD   | YouTube        Azureus(+seedboxes)  SwarmPlayer
1862    live  | Akamai Str.    Octoshape, Joost     PPlive
1863
1864    The protocol must be designed for maximum genericity, thus focusing
1865    on the very core of the mission, contain no magic constants and no
1866    hardwired policies. Effectively, it is a set of messages allowing to
1867    securely retrieve data from whatever source available, in parallel.
1868    Ideally, the protocol must be able to run over IP as an independent
1869    transport protocol. Practically, it must run over UDP and TCP.
1870
1871
1872 9.1.  Design Goals
1873
1874    The technical focus of the swift protocol is to find the simplest
1875    solution involving the minimum set of primitives, still being
1876    sufficient to implement all the targeted usecases (see Table 1),
1877    suitable for use in general-purpose software and hardware (i.e. a web
1878    browser or a set-top box). The five design goals for the protocol
1879    are:
1880
1881    1. Embeddable kernel-ready protocol.
1882    2. Embrace real-time streaming, in- and out-of-order download.
1883    3. Have short warm-up times.
1884    4. Traverse NATs transparently.
1885    5. Be extensible, allow for multitude of implementation over
1886       diverse mediums, allow for drop-in pluggability.
1887
1888    The objectives are referenced as (1)-(5).
1889
1890    The goal of embedding (1) means that the protocol must be ready to
1891    function as a regular transport protocol inside a set-top box, mobile
1892    device, a browser and/or in the kernel space. Thus, the protocol must
1893    have light footprint, preferably less than TCP, in spite of the
1894    necessity to support numerous ongoing connections as well as to
1895    constantly probe the network for new possibilities. The practical
1896    overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We
1897    aim at <1KB per peer connected. Also, the amount of code necessary to
1898    make a basic implementation must be limited to 10KLoC of C.
1899    Otherwise, besides the resource considerations, maintaining and
1900    auditing the code might become prohibitively expensive.
1901
1902
1903
1904 Grishchenko and Bakker   Expires June 21, 2012                 [Page 34]
1905 \f
1906 Internet-Draft                   swift                 December 19, 2011
1907
1908
1909    The support for all three basic usecases of real-time streaming,
1910    in-order download and out-of-order download (2) is necessary for the
1911    manifested goal of THE multiparty transport protocol as no single
1912    usecase dominates over the others.
1913
1914    The objective of short warm-up times (3) is the matter of end-user
1915    experience; the playback must start as soon as possible. Thus any
1916    unnecessary initialization roundtrips and warm-up cycles must be
1917    eliminated from the transport layer.
1918
1919    Transparent NAT traversal (4) is absolutely necessary as at least 60%
1920    of today's users are hidden behind NATs. NATs severely affect
1921    connection patterns in P2P networks thus impacting performance and
1922    fairness [MOLNAT,LUCNAT].
1923
1924    The protocol must define a common message set (5) to be used by
1925    implementations; it must not hardwire any magic constants, algorithms
1926    or schemes beyond that. For example, an implementation is free to use
1927    its own congestion control, connection rotation or reciprocity
1928    algorithms. Still, the protocol must enable such algorithms by
1929    supplying sufficient information. For example, trackerless peer
1930    discovery needs peer exchange messages, scavenger congestion control
1931    may need timestamped acknowledgments, etc.
1932
1933
1934 9.2.  Not TCP
1935
1936    To large extent, swift's design is defined by the cornerstone
1937    decision to get rid of TCP and not to reinvent any TCP-like
1938    transports on top of UDP or otherwise. The requirements (1), (4), (5)
1939    make TCP a bad choice due to its high per-connection footprint,
1940    complex and less reliable NAT traversal and fixed predefined
1941    congestion control algorithms. Besides that, an important
1942    consideration is that no block of TCP functionality turns out to be
1943    useful for the general case of swarming downloads. Namely,
1944      1. in-order delivery is less useful as peer-to-peer protocols
1945      often employ out-of-order delivery themselves and in either case
1946      out-of-order data can still be stored;
1947      2. reliable delivery/retransmissions are not useful because
1948      the same data might be requested from different sources; as
1949      in-order delivery is not required, packet losses might be
1950      patched up lazily, without stopping the flow of data;
1951      3. flow control is not necessary as the receiver is much less
1952      likely to be saturated with the data and even if so, that
1953      situation is perfectly detected by the congestion control;
1954      4. TCP congestion control is less useful as custom congestion
1955      control is often needed [LEDBAT].
1956    In general, TCP is built and optimized for a different usecase than
1957
1958
1959
1960 Grishchenko and Bakker   Expires June 21, 2012                 [Page 35]
1961 \f
1962 Internet-Draft                   swift                 December 19, 2011
1963
1964
1965    we have with swarming downloads. The abstraction of a "data pipe"
1966    orderly delivering some stream of bytes from one peer to another
1967    turned out to be irrelevant. In even more general terms, TCP
1968    supports the abstraction of pairwise _conversations_, while we need
1969    a content-centric protocol built around the abstraction of a cloud
1970    of participants disseminating the same _data_ in any way and order
1971    that is convenient to them.
1972
1973    Thus, the choice is to design a protocol that runs on top of
1974    unreliable datagrams. Instead of reimplementing TCP, we create a
1975    datagram-based protocol, completely dropping the sequential data
1976    stream abstraction. Removing unnecessary features of TCP makes it
1977    easier both to implement the protocol and to verify it; numerous TCP
1978    vulnerabilities were caused by complexity of the protocol's state
1979    machine. Still, we reserve the possibility to run swift on top of TCP
1980    or HTTP.
1981
1982    Pursuing the maxim of making things as simple as possible but not
1983    simpler, we fit the protocol into the constraints of the transport
1984    layer by dropping all the transmission's technical metadata except
1985    for the content's root hash (compare that to metadata files used in
1986    BitTorrent). Elimination of technical metadata is achieved through
1987    the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file
1988    transfers and other techniques. As a result, a transfer is identified
1989    and bootstrapped by its root hash only.
1990
1991    To avoid the usual layering of positive/negative acknowledgment
1992    mechanisms we introduce a scale-invariant acknowledgment system (see
1993    Sec 4.4). The system allows for aggregation and variable level of
1994    detail in requesting, announcing and acknowledging data, serves
1995    in-order and out-of-order retrieval with equal ease. Besides the
1996    protocol's footprint, we also aim at lowering the size of a minimal
1997    useful interaction. Once a single datagram is received, it must be
1998    checked for data integrity, and then either dropped or accepted,
1999    consumed and relayed.
2000
2001
2002
2003 9.3.  Generic Acknowledgments
2004
2005    Generic acknowledgments came out of the need to simplify the
2006    data addressing/requesting/acknowledging mechanics, which tends
2007    to become overly complex and multilayered with the conventional
2008    approach. Take the BitTorrent+TCP tandem for example:
2009
2010    1. The basic data unit is a byte of content in a file.
2011    2. BitTorrent's highest-level unit is a "torrent", physically a
2012    byte range resulting from concatenation of content files.
2013
2014
2015
2016 Grishchenko and Bakker   Expires June 21, 2012                 [Page 36]
2017 \f
2018 Internet-Draft                   swift                 December 19, 2011
2019
2020
2021    3. A torrent is divided into "pieces", typically about a thousand
2022    of them. Pieces are used to communicate progress to other
2023    peers. Pieces are also basic data integrity units, as the torrent's
2024    metadata includes a SHA1 hash for every piece.
2025    4. The actual data transfers are requested and made in 16KByte
2026    units, named "blocks" or chunks.
2027    5. Still, one layer lower, TCP also operates with bytes and byte
2028    offsets which are totally different from the torrent's bytes and
2029    offsets, as TCP considers cumulative byte offsets for all content
2030    sent by a connection, be it data, metadata or commands.
2031    6. Finally, another layer lower, IP transfers independent datagrams
2032    (typically around 1.5 kilobyte), which TCP then reassembles into
2033    continuous streams.
2034
2035    Obviously, such addressing schemes need lots of mappings; from
2036    piece number and block to file(s) and offset(s) to TCP sequence
2037    numbers to the actual packets and the other way around. Lots of
2038    complexity is introduced by mismatch of bounds: packet bounds are
2039    different from file, block or hash/piece bounds. The picture is
2040    typical for a codebase which was historically layered.
2041
2042    To simplify this aspect, we employ a generic content addressing
2043    scheme based on binary intervals, or "bins" for short.
2044
2045
2046
2047 Acknowledgements
2048
2049    Arno Bakker and Victor Grishchenko are partially supported by the
2050    P2P-Next project (http://www.p2p-next.org/), a research project
2051    supported by the European Community under its 7th Framework Programme
2052    (grant agreement no. 216217).  The views and conclusions contained
2053    herein are those of the authors and should not be interpreted as
2054    necessarily representing the official policies or endorsements,
2055    either expressed or implied, of the P2P-Next project or the European
2056    Commission.
2057
2058    The swift protocol was designed by Victor Grishchenko at Technische
2059    Universiteit Delft. The authors would like to thank the following
2060    people for their contributions to this draft: Mihai Capota, Raul
2061    Jiminez, Flutra Osmani, Riccardo Petrocco, Johan Pouwelse, and Raynor
2062    Vliegendhart.
2063
2064
2065 References
2066
2067 [RFC2119] Key words for use in RFCs to Indicate Requirement Levels
2068 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with
2069
2070
2071
2072 Grishchenko and Bakker   Expires June 21, 2012                 [Page 37]
2073 \f
2074 Internet-Draft                   swift                 December 19, 2011
2075
2076
2077     Mochiweb", Part 3. http://www.metabrew.com/article/
2078     a-million-user-comet-application-with-mochiweb-part-3
2079 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:
2080     "Free-riding, Fairness, and Firewalls in P2P File-Sharing"
2081     Proc. Eighth International Conference on Peer-to-Peer Computing
2082     (P2P '08), Aachen, Germany, 8-11 Sept. 2008, pp. 301 - 310.
2083 [LUCNAT] L. D'Acunto and M. Meulpolder and R. Rahman and J.A.
2084     Pouwelse and H.J. Sips. "Modeling and Analyzing the Effects
2085     of Firewalls and NATs in P2P Swarming Systems". In Proc. of
2086     IEEE IPDPS (HotP2P), Atlanta, USA, April 23, 2010.
2087 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps
2088     and binary trees". Technical Report PDS-2011-005, Parallel and
2089     Distributed Systems Group, Fac. of Electrical Engineering,
2090     Mathematics, and Computer Science, Delft University of Technology,
2091     The Netherlands, April 2009.
2092 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication
2093     Across Network Address Translators",
2094     http://www.brynosaurus.com/pub/net/p2pnat/
2095 [FIPS180-2]
2096     Federal Information Processing Standards Publication 180-2:
2097     "Secure Hash Standard" 2002 August 1.
2098 [MERKLE] Merkle, R. "Secrecy, Authentication, and Public Key Systems",
2099     Ph.D. thesis, Dept. of Electrical Engineering, Stanford University,
2100     CA, USA, 1979. pp 40-45.
2101 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BitTorrent
2102     Enhancement Proposal 30, Mar 2009.
2103     http://bittorrent.org/beps/bep_0030.html
2104 [CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly
2105     High-Speed TCP Variant", Proc. Third International Workshop
2106     on Protocols for Fast Long-Distance Networks (PFLDnet), Lyon,
2107     France, Feb 2005.
2108 [LEDBAT] S. Shalunov et al. "Low Extra Delay Background Transport
2109     (LEDBAT)", IETF Internet-Draft draft-ietf-ledbat-congestion
2110     (work in progress), Oct 2011.
2111     http://datatracker.ietf.org/doc/draft-ietf-ledbat-congestion/
2112 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent",
2113     Proc. 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley,
2114     CA, USA, Jun 2003.
2115 [BITTORRENT] B. Cohen, "The BitTorrent Protocol Specification",
2116      February 2008, http://www.bittorrent.org/beps/bep_0003.html
2117 [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
2118     Jacobson, "RTP: A Transport Protocol for Real-Time
2119     Applications", STD 64, RFC 3550, July 2003.
2120 [RFC3711] M. Baugher, D. McGrew, M. Naslund, E. Carrara, K. Norrman,
2121     "The Secure Real-time Transport Protocol (SRTP), RFC 3711, March
2122     2004.
2123 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
2124    "Session Traversal Utilities for NAT (STUN)", RFC 5389, October 2008.
2125
2126
2127
2128 Grishchenko and Bakker   Expires June 21, 2012                 [Page 38]
2129 \f
2130 Internet-Draft                   swift                 December 19, 2011
2131
2132
2133 [RFC2616] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter,
2134    P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1",
2135    RFC2616, June 1999.
2136 [I-D.ietf-ppsp-reqs] Zong, N., Zhang, Y., Pascual, V., Williams, C.,
2137     and L. Xiao, "P2P  Streaming Protocol (PPSP) Requirements",
2138     draft-ietf-ppsp-reqs-05 (work in progress), October 2011.
2139 [PPSPCHART] M. Stiemerling et al. "Peer to Peer Streaming Protocol (ppsp)
2140     Description of Working Group"
2141    http://datatracker.ietf.org/wg/ppsp/charter/
2142 [PERMIDS] A. Bakker et al. "Next-Share Platform M8--Specification
2143     Part", App. C. P2P-Next project deliverable D4.0.1 (revised),
2144     June 2009.
2145     http://www.p2p-next.org/download.php?id=E7750C654035D8C2E04D836243E6526E
2146 [PUPPETCAST] A. Bakker and M. van Steen. "PuppetCast: A Secure Peer
2147     Sampling Protocol". Proceedings 4th Annual European Conference on
2148     Computer Network Defense (EC2ND'08), pp. 3-10, Dublin, Ireland,
2149     11-12 December 2008.
2150 [CLOSED] N. Borch, K. Michell, I. Arntzen, and D. Gabrijelcic: "Access
2151     control to BitTorrent swarms using closed swarms". In Proceedings
2152     of the 2010 ACM workshop on Advanced video streaming techniques
2153     for peer-to-peer networks and social networking (AVSTP2P '10).
2154     ACM, New York, NY, USA, 25-30.
2155     http://doi.acm.org/10.1145/1877891.1877898
2156 [ECLIPSE] E. Sit and R. Morris, "Security Considerations for
2157     Peer-to-Peer Distributed Hash Tables", IPTPS '01: Revised Papers
2158     from the First International Workshop on Peer-to-Peer Systems, pp.
2159     261-269, Springer-Verlag, London, UK, 2002.
2160 [SECDHTS] G. Urdaneta, G. Pierre, M. van Steen, "A Survey of DHT
2161    Security Techniques", ACM Computing Surveys, vol. 43(2), June 2011.
2162 [SWIFTIMPL] V. Grishchenko, et al. "Swift M40 reference implementation",
2163    http://swarmplayer.p2p-next.org/download/Next-Share-M40.tar.bz2
2164    (subdirectory Next-Share/TUD/swift-trial-r2242/), July 2011.
2165 [CCNWIKI] http://en.wikipedia.org/wiki/Content-centric_networking
2166 [HAC01] A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. "Handbook of
2167    Applied Cryptography", CRC Press, October 1996 (Fifth Printing,
2168    August 2001).
2169 [JIM11] R. Jimenez, F. Osmani, and B. Knutsson. "Sub-Second Lookups on
2170    a Large-Scale Kademlia-Based Overlay". 11th IEEE International
2171    Conference on Peer-to-Peer Computing 2011, Kyoto, Japan, Aug. 2011
2172
2173
2174 Authors' addresses
2175
2176    A. Bakker
2177    Technische Universiteit Delft
2178    Department EWI/ST/PDS
2179    Room HB 9.160
2180    Mekelweg 4
2181
2182
2183
2184 Grishchenko and Bakker   Expires June 21, 2012                 [Page 39]
2185 \f
2186 Internet-Draft                   swift                 December 19, 2011
2187
2188
2189    2628CD Delft
2190    The Netherlands
2191
2192    Email: arno@cs.vu.nl
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240 Grishchenko and Bakker   Expires June 21, 2012                 [Page 40]