X-Git-Url: http://p2p-next.cs.pub.ro/gitweb/?a=blobdiff_plain;f=src%2Flibswift%2Fdoc%2Fdraft-ietf-ppsp-peer-protocol-00.nroff;fp=src%2Flibswift%2Fdoc%2Fdraft-ietf-ppsp-peer-protocol-00.nroff;h=8f6dd7dd9e571b1f3868bd31da824ceb9d28d640;hb=45963a7511531cd1656ad5d3847d2dafd015c54d;hp=0000000000000000000000000000000000000000;hpb=d069796805ad79542fd7e4406d1e9c6d2d8c2ef7;p=swifty.git diff --git a/src/libswift/doc/draft-ietf-ppsp-peer-protocol-00.nroff b/src/libswift/doc/draft-ietf-ppsp-peer-protocol-00.nroff new file mode 100644 index 0000000..8f6dd7d --- /dev/null +++ b/src/libswift/doc/draft-ietf-ppsp-peer-protocol-00.nroff @@ -0,0 +1,2087 @@ +.\" Auto generated Nroff by NroffEdit on April 12, 2010 +.pl 10.0i +.po 0 +.ll 7.2i +.lt 7.2i +.nr LL 7.2i +.nr LT 7.2i +.ds LF Grishchenko and Bakker +.ds RF FORMFEED[Page %] +.ds LH Internet-Draft +.ds RH December 19, 2011 +.ds CH swift +.ds CF Expires June 21, 2012 +.hy 0 +.nh +.ad l +.in 0 +.nf +.tl 'PPSP' 'A. Bakker' +.tl 'Internet-Draft' 'TU Delft' +.tl 'Intended status: Informational' +.tl 'Expires: June 21, 2012' 'December 19, 2011' + +.fi +.in 3 +.in 12 +.ti 8 +Peer-to-Peer Streaming Protocol (PPSP) \% + +.ti 0 +Abstract + +.in 3 +The Generic Multiparty Protocol (swift) is a peer-to-peer based transport +protocol for content dissemination. It can be used for streaming on-demand +and live video content, as well as conventional downloading. In swift, the +clients consuming the content participate in the dissemination by forwarding +the content to other clients via a mesh-like structure. It is a generic +protocol which can run directly on top of UDP, TCP, HTTP or as a RTP +profile. Features of swift are short time-till-playback and extensibility. +Hence, it can use different mechanisms to prevent freeriding, and work +with different peer discovery schemes (centralized trackers or +Distributed Hash Tables). Depending on the underlying transport +protocol, swift can also use different congestion control algorithms, +such as LEDBAT, and offer transparent NAT traversal. Finally, swift maintains +only a small amount of state per peer and detects malicious modification of +content. This documents describes swift and how it satisfies the requirements +for the IETF Peer-to-Peer Streaming Protocol (PPSP) Working Group's peer +protocol. + + +.ti 0 +Status of this memo + +This Internet-Draft is submitted to IETF in full conformance with the +provisions of BCP 78 and BCP 79. + +Internet-Drafts are working documents of the Internet Engineering Task Force +(IETF), its areas, and its working groups. Note that other groups may also +distribute working documents as Internet- Drafts. + +Internet-Drafts are draft documents valid for a maximum of six months and +may be updated, replaced, or obsoleted by other documents at any time. It +is inappropriate to use Internet-Drafts as reference material or to cite +them other than as "work in progress." + +The list of current Internet-Drafts can be accessed at +\%http://www.ietf.org/ietf/1id-abstracts.txt. + +The list of Internet-Draft Shadow Directories can be accessed at +http://www.ietf.org/shadow.html. + + +.nf +Copyright (c) 2011 IETF Trust and the persons identified as the +document authors. All rights reserved. + +This document is subject to BCP 78 and the IETF Trust's Legal +Provisions Relating to IETF Documents +\%(http://trustee.ietf.org/license-info) in effect on the date of +publication of this document. 
Please review these documents +carefully, as they describe your rights and restrictions with respect +to this document. Code Components extracted from this document must +include Simplified BSD License text as described in Section 4.e of +the Trust Legal Provisions and are provided without warranty as +described in the Simplified BSD License. + +.\" \# TD4 -- Set TOC depth by altering this value (TD5 = depth 5) +.\" \# TOC -- Beginning of auto updated Table of Contents +.in 0 +Table of Contents + +.nf + 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . 3 + 1.2. Conventions Used in This Document . . . . . . . . . . . . . 4 + 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . 5 + 2. Overall Operation . . . . . . . . . . . . . . . . . . . . . . . 6 + 2.1. Joining a Swarm . . . . . . . . . . . . . . . . . . . . . . 6 + 2.2. Exchanging Chunks . . . . . . . . . . . . . . . . . . . . . 6 + 2.3. Leaving a Swarm . . . . . . . . . . . . . . . . . . . . . . 7 + 3. Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 + 3.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . . 8 + 3.3. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 + 3.3.1. Bin Numbers . . . . . . . . . . . . . . . . . . . . . . 8 + 3.3.2. HAVE Message . . . . . . . . . . . . . . . . . . . . . 9 + 3.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 + 3.5. DATA and HASH . . . . . . . . . . . . . . . . . . . . . . . 10 + 3.5.1. Merkle Hash Tree . . . . . . . . . . . . . . . . . . . 10 + 3.5.2. Content Integrity Verification . . . . . . . . . . . . 11 + 3.5.3. The Atomic Datagram Principle . . . . . . . . . . . . . 11 + 3.5.4. DATA and HASH Messages . . . . . . . . . . . . . . . . 12 + 3.6. HINT . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 + 3.7. Peer Address Exchange and NAT Hole Punching . . . . . . . . 13 + 3.8. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . . . 14 + 3.9. VERSION . . . . . . . . . . . . . . . . . . . . . . . . . . 14 + 3.10. Conveying Peer Capabilities . . . . . . . . . . . . . . . 14 + 3.11. Directory Lists . . . . . . . . . . . . . . . . . . . . . 14 + 4. Automatic Detection of Content Size . . . . . . . . . . . . . . 14 + 4.1. Peak Hashes . . . . . . . . . . . . . . . . . . . . . . . . 15 + 4.2. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 16 + 5. Live streaming . . . . . . . . . . . . . . . . . . . . . . . . 17 + 6. Transport Protocols and Encapsulation . . . . . . . . . . . . . 17 + 6.1. UDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 + 6.1.1. Chunk Size . . . . . . . . . . . . . . . . . . . . . . 17 + 6.1.2. Datagrams and Messages . . . . . . . . . . . . . . . . 18 + 6.1.3. Channels . . . . . . . . . . . . . . . . . . . . . . . 18 + 6.1.4. HANDSHAKE and VERSION . . . . . . . . . . . . . . . . . 19 + 6.1.5. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.1.6. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.1.7. HASH . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.1.8. DATA . . . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.1.9. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . 20 + 6.1.10. Flow and Congestion Control . . . . . . . . . . . . . 21 + 6.2. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 + 6.3. RTP Profile for PPSP . . . . . . . . . . . . . . . . . . . 21 + 6.3.1. Design . . . . . . . . . . . . . . . . . . . . . . . . 22 + 6.3.2. 
PPSP Requirements . . . . . . . . . . . . . . . . . . . 24 + 6.4. HTTP (as PPSP) . . . . . . . . . . . . . . . . . . . . . . 27 + 6.4.1. Design . . . . . . . . . . . . . . . . . . . . . . . . 27 + 6.4.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 29 + 7. Security Considerations . . . . . . . . . . . . . . . . . . . . 32 + 8. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 32 + 8.1. 32 bit vs 64 bit . . . . . . . . . . . . . . . . . . . . . 32 + 8.2. IPv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 + 8.3. Congestion Control Algorithms . . . . . . . . . . . . . . . 32 + 8.4. Piece Picking Algorithms . . . . . . . . . . . . . . . . . 33 + 8.5. Reciprocity Algorithms . . . . . . . . . . . . . . . . . . 33 + 8.6. Different crypto/hashing schemes . . . . . . . . . . . . . 33 + 9. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 + 9.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 34 + 9.2. Not TCP . . . . . . . . . . . . . . . . . . . . . . . . . 35 + 9.3. Generic Acknowledgments . . . . . . . . . . . . . . . . . 36 + Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 37 + References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 + Authors' addresses . . . . . . . . . . . . . . . . . . . . . . . . 39 +.fi +.in 3 + +.\" \# ETC -- End of auto updated Table of Contents + + + +.ti 0 +1. Introduction + +.ti 0 +1.1. Purpose + +This document describes the Generic Multiparty Protocol (swift), designed +from the ground up for the task of disseminating the same content to a group +of interested parties. Swift supports streaming on-demand and +live video content, as well as conventional downloading, thus covering +today's three major use cases for content distribution. To fulfil this task, +clients consuming the content are put on equal footing with the servers +initially providing the content to create a peer-to-peer system where +everyone can provide data. Each peer connects to a random set of other peers +resulting in a mesh-like structure. + +Swift uses a simple method of naming content based on self-certification. In +particular, content in swift is identified by a single cryptographic hash +that is the root hash in a Merkle hash tree calculated recursively from the +content [ABMRKL]. This self-certifying hash tree allows every peer to +directly detect when a malicious peer tries to distribute fake content. It +also ensures only a small amount of information is needed to start a +download (just the root hash and some peer addresses). + +Swift uses a novel method of addressing chunks of content called "bin +numbers". Bin numbers allow the addressing of a binary interval of data +using a single integer. This reduces the amount of state that needs to be +recorded per peer and the space needed to denote intervals on the wire, +making the protocol light-weight. In general, this numbering system allows +swift to work with simpler data structures, e.g. to use arrays instead of +binary trees, thus reducing complexity. + +Swift is a generic protocol which can run directly on top of UDP, TCP, HTTP, +or as a layer below RTP, similar to SRTP [RFC3711]. As such, swift defines a +common set of messages that make up the protocol, which can have different +representations on the wire depending on the lower-level protocol used. When +the lower-level transport is UDP, swift can also use different congestion +control algorithms and facilitate NAT traversal. 
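+
+As a non-normative illustration of the bin numbering introduced above
+(and specified in Sec. 3.3.1), the following Python sketch maps a bin
+number to the chunk interval it covers. It is an editorial aid only;
+the function names and layout are illustrative and not part of the
+protocol.
+
+.nf
+   def bin_width(b):
+       # Chunks covered by bin b: 2**layer, where the layer equals
+       # the number of trailing one-bits in the bin number.
+       width = 1
+       while b & 1:
+           width <<= 1
+           b >>= 1
+       return width
+
+   def bin_chunks(b):
+       # Return the (first, last) chunk index addressed by bin b.
+       w = bin_width(b)
+       first = (b + 1 - w) // 2
+       return first, first + w - 1
+
+   # With the width-8 tree of Sec. 3.3.1:
+   #   bin_chunks(7)  == (0, 7)    the complete interval
+   #   bin_chunks(3)  == (0, 3)    chunks 0..3
+   #   bin_chunks(12) == (6, 6)    the single 7th chunk
+.fi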
+ +In addition, swift is extensible in the mechanisms it uses to promote client +contribution and prevent freeriding, that is, how to deal with peers +that only download content but never upload to others. Furthermore, +it can work with different peer discovery schemes, +such as centralized trackers or fast Distributed Hash Tables [JIM11]. + +This documents describes not only the swift protocol but also how it +satisfies the requirements for the IETF Peer-to-Peer Streaming Protocol +(PPSP) Working Group's peer protocol [PPSPCHART,I-D.ietf-ppsp-reqs]. +A reference implementation of swift over UDP is available [SWIFTIMPL]. + + +.ti 0 +1.2. Conventions Used in This Document + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in +this document are to be interpreted as described in [RFC2119]. + +.ti 0 +1.3. Terminology + +.in 3 +message +.br +.in 8 +The basic unit of swift communication. A message will have different +representations on the wire depending on the transport protocol used. Messages +are typically multiplexed into a datagram for transmission. + +.in 3 +datagram +.br +.in 8 +A sequence of messages that is offered as a unit to the underlying transport +protocol (UDP, etc.). The datagram is swift's Protocol Data Unit (PDU). + +.in 3 +content +.br +.in 8 +Either a live transmission, a pre-recorded multimedia asset, or a file. + +.in 3 +bin +.br +.in 8 +A number denoting a specific binary interval of the content (i.e., one +or more consecutive chunks). + +.in 3 +chunk +.br +.in 8 +The basic unit in which the content is divided. E.g. a block of N +kilobyte. + +.in 3 +hash +.br +.in 8 +The result of applying a cryptographic hash function, more specifically +a modification detection code (MDC) [HAC01], such as SHA1 [FIPS180-2], +to a piece of data. + +.in 3 +root hash +.br +.in 8 +The root in a Merkle hash tree calculated recursively from the content. + +.in 3 +swarm +.br +.in 8 +A group of peers participating in the distribution of the same content. + +.in 3 +swarm ID +.br +.in 8 +Unique identifier for a swarm of peers, in swift the root hash of the +content (video-on-demand,download) or a public key (live streaming). + +.in 3 +tracker +.br +.in 8 +An entity that records the addresses of peers participating in a swarm, +usually for a set of swarms, and makes this membership information +available to other peers on request. + +.in 3 +choking +.br +.in 8 +When a peer A is choking peer B it means that A is currently not +willing to accept requests for content from B. +.in 3 + + +.ti 0 +2. Overall Operation + +The basic unit of communication in swift is the message. Multiple messages are +multiplexed into a single datagram for transmission. A datagram (and hence the +messages it contains) will have different representations on the wire depending +on the transport protocol used (see Sec. 6). + + +.ti 0 +2.1. Joining a Swarm + +Consider a peer A that wants to download a certain content asset. +To commence a swift download, peer A must have the swarm ID of the content +and a list of one or more tracker contact points (e.g. host+port). The list +of trackers is optional in the presence of a decentralized tracking mechanism. +The swarm ID consists of the swift root hash of the content (video-on-demand, +downloading) or a public key (live streaming). + +Peer A now registers with the tracker following e.g. 
the PPSP tracker +protocol [I-D.ietf.ppsp-reqs] and receives the IP address and port of peers +already in the swarm, say B, C, and D. Peer A now sends a datagram +containing a HANDSHAKE message to B, C, and D. This message serves as an +end-to-end check that the peers are actually in the correct swarm, and +contains the root hash of the swarm. Peer B and C respond with datagrams +containing a HANDSHAKE message and one or more HAVE messages. A HAVE message +conveys (part of) the chunk availability of a peer and thus contains a bin +number that denotes what chunks of the content peer B, resp. C have. Peer D +sends a datagram with just a HANDSHAKE and omits HAVE messages as a way of +choking A. + +.ti 0 +2.2. Exchanging Chunks + +In response to B and C, A sends new datagrams to B and C containing HINT messages. +A HINT or request message indicates the chunks that a peer wants to +download, and contains a bin number. The HINT messages to B and C refer to +disjunct sets of chunks. B and C respond with datagrams containing HASH, +HAVE and DATA messages. The HASH messages contains all cryptographic hashes +that peer A needs to verify the integrity of the content chunk sent in the +DATA message, using the content's root hash as trusted anchor, see Sec. 3.5. +Using these hashes peer A verifies that the chunks received from B and C are +correct. It also updates the chunk availability of B and C using the information +in the received HAVE messages. + +After processing, A sends a datagram containing HAVE messages for the chunks +it just received to all its peers. In the datagram to B and C it includes an +ACK message acknowledging the receipt of the chunks, and adds HINT messages +for new chunks. ACK messages are not used when a reliable transport protocol +is used. When e.g. C finds that A obtained a chunk (from B) that C did not +yet have, C's next datagram includes a HINT for that chunk. + +Peer D does not send HAVE messages to A when it downloads chunks from other peers, +until D decides to unchoke peer A. In the case, it sends a datagram with +HAVE messages to inform A of its current availability. If B or C decide to +choke A they stop sending HAVE and DATA messages and A should then rerequest +from other peers. They may continue to send HINT messages, or periodic +KEEPALIVE messages such that A keeps sending them HAVE messages. + +Once peer A has received all content (video-on-demand use case) it stops +sending messages to all other peers that have all content (a.k.a. seeders). +Peer A MAY also contact the tracker or another source again to obtain more +peer addresses. + + +.ti 0 +2.3. Leaving a Swarm + +Depending on the transport protocol used, peers should either use explicit +leave messages or implicitly leave a swarm by stopping to respond +to messages. Peers that learn about the departure should remove these peers +from the current peer list. The implicit-leave mechanism works for both graceful and +ungraceful leaves (i.e., peer crashes or disconnects). When leaving gracefully, a +peer should deregister from the tracker following the (PPSP) tracker protocol. + + +.ti 0 +3. Messages + +.fi +In general, no error codes or responses are used in the protocol; absence +of any response indicates an error. Invalid messages are discarded. + +For the sake of simplicity, one swarm of peers always deals with one content +asset (e.g. file) only. 
Retrieval of large collections of files is done by
+retrieving a directory list file and then recursively retrieving files, which
+might also turn out to be directory lists, as described in Sec. 3.11.
+
+.ti 0
+3.1. HANDSHAKE
+
+As an end-to-end check that the peers are actually in the correct swarm, the
+initiating peer and the addressed peer SHOULD send a HANDSHAKE message in
+the first datagrams they exchange. The only payload of the HANDSHAKE message
+is the root hash of the content.
+
+After the handshakes are exchanged, the initiator knows that the peer really
+responds. Hence, the second datagram the initiator sends MAY already contain
+some heavy payload. To minimize the number of initialization roundtrips,
+implementations MAY dispense with the HANDSHAKE message. To the same end,
+the first two datagrams exchanged MAY also contain some minor payload, e.g.
+HAVE messages to indicate the current progress of a peer or a HINT
+(see Sec. 3.6).
+
+
+.ti 0
+3.3. HAVE
+
+The HAVE message is used to convey which chunks a peer has available,
+expressed in a new content addressing scheme called "bin numbers".
+
+.ti 0
+3.3.1. Bin Numbers
+
+Swift employs a generic content addressing scheme based on binary intervals
+("bins" in short). The smallest interval is a chunk (e.g. an N kilobyte
+block); the top interval is the complete 2**63 range. A novel addition
+to the classical scheme is "bin numbers", a scheme of numbering binary
+intervals which lays them out into a vector nicely. Consider a chunk
+interval of width W. To derive the bin numbers of the complete interval and
+the subintervals, a minimal balanced binary tree is built that is at least W
+chunks wide at the base. The leaves from left-to-right correspond to
+the chunks 0..W-1 in the interval, and have bin number I*2, where I is the
+index of the chunk (counting beyond W-1 to balance the tree). The higher
+level nodes P in the tree have bin number
+
+   binP = (binL + binR) / 2
+
+.br
+where binL is the bin of node P's left-hand child and binR is the bin of node
+P's right-hand child. Given that each node in the tree represents a
+subinterval of the original interval, each such subinterval now is
+addressable by a bin number, a single integer. The bin number tree of an
+interval of width W=8 looks like this:
+
+
+
+                          7
+.br
+                       /     \\
+.br
+                    /           \\
+.br
+                 /                 \\
+.br
+               /                     \\
+.br
+              3                       11
+.br
+            /   \\                   /   \\
+.br
+          /       \\               /       \\
+.br
+         /         \\             /         \\
+.br
+        1           5           9           13
+.br
+      /   \\       /   \\       /   \\       /   \\
+.br
+     0     2     4     6     8     10    12    14
+.br
+
+.fi
+So bin 7 represents the complete interval, 3 represents the interval of
+chunks 0..3 and 1 represents the interval of chunks 0 and 1. The special
+numbers 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stand for an
+empty interval, and 0x7FFF...FFF stands for "everything".
+
+
+.ti 0
+3.3.2. HAVE Message
+
+When a receiving peer has successfully checked the integrity of a chunk or
+interval of chunks it MUST send a HAVE message to all peers it wants to
+interact with. The latter allows the HAVE message to be used as a method of
+choking. The HAVE message MUST contain the bin number of the biggest
+complete interval of all chunks the receiver has received and checked so far
+that fully includes the interval of chunks just received. So the bin number
+MUST denote at least the interval received, but the receiver is supposed to
+aggregate and acknowledge bigger bins, when possible.
+
+As a result, every single chunk is acknowledged a logarithmic number of times.
+That provides some necessary redundancy of acknowledgments and sufficiently
+compensates for unreliable transport protocols.
+
+To record which chunks a peer has in the state that an implementation keeps
+for each peer, an implementation MAY use the "binmap" data structure, which
+is a hybrid of a bitmap and a binary tree, discussed in detail in [BINMAP].
+
+
+.ti 0
+3.4. ACK
+
+When swift is run over an unreliable transport protocol, an implementation MAY
+choose to use ACK messages to acknowledge received data. When a receiving
+peer has successfully checked the integrity of a chunk or interval of chunks
+C it MUST send an ACK message containing the bin number of its biggest complete
+interval covering C to the sending peer (see HAVE). To facilitate delay-based
+congestion control, an ACK message contains a timestamp.
+
+
+.ti 0
+3.5. DATA and HASH
+
+The DATA message is used to transfer chunks of content. The associated HASH message
+carries cryptographic hashes that are necessary for a receiver to check
+the integrity of the chunk. Swift's content integrity protection is based on a
+Merkle hash tree and works as follows.
+
+.ti 0
+3.5.1. Merkle Hash Tree
+
+Swift uses a method of naming content based on self-certification. In particular,
+content in swift is identified by a single cryptographic hash that is the
+root hash in a Merkle hash tree calculated recursively from the content [ABMRKL].
+This self-certifying hash tree allows every peer to directly detect when a malicious
+peer tries to distribute fake content. It also ensures only a small amount
+of information is needed to start a download (the root hash and some peer
+addresses). For live streaming, public keys and dynamic trees are used; see below.
+
+The Merkle hash tree of a content asset that is divided into N chunks
+is constructed as follows. Note the construction does not assume chunks
+of content to be fixed size. Given a cryptographic hash function, more specifically
+a modification detection code (MDC) [HAC01], such as SHA1, the hashes of all the
+chunks of the content are calculated. Next, a binary tree of sufficient height
+is created. Sufficient
+height means that the lowest level in the tree has enough nodes to hold all
+chunk hashes in the set (as before, see the HAVE message). The figure below shows
+the tree for a content asset consisting of 7 chunks. As before with the
+content addressing scheme, the leaves of the tree correspond to a chunk and
+in this case are assigned the hash of that chunk, starting at the left-most leaf.
+As the base of the tree may be wider than the number of chunks, any remaining
+leaves in the tree are assigned an empty hash value of all zeros. Finally, the hash
+values of the higher levels in the tree are calculated, by concatenating the hash
+values of the two children (again left to right) and computing the hash of that
+aggregate. This process ends in a hash value for the root node, which is called
+the "root hash". Note that the root hash only depends on the content, and any
+modification of the content will result in a different root hash.
+
+
+
+                          7  = root hash
+.br
+                       /     \\
+.br
+                    /           \\
+.br
+                 /                 \\
+.br
+               /                     \\
+.br
+              3*                      11
+.br
+            /   \\                   /   \\
+.br
+          /       \\               /       \\
+.br
+         /         \\             /         \\
+.br
+        1           5           9           13* = uncle hash
+.br
+      /   \\       /   \\       /   \\       /   \\
+.br
+     0     2     4     6     8     10*   12    14
+.br
+
+     C0    C1    C2    C3    C4    C5    C6    E
+.br
+     =chunk index                             ^^ = empty hash
+.br
+
+
+.ti 0
+3.5.2. Content Integrity Verification
+
+.fi
+Assuming a peer receives the root hash of the content it wants to download
+from a trusted source, it can check the integrity of any chunk of that
+content it receives as follows. It first calculates the hash of the chunk
+it received, for example chunk C4 in the previous figure. Along with this
+chunk it MUST receive the hashes required to check the integrity of that
+chunk. In principle, these are the hash of the chunk's sibling (C5) and
+that of its "uncles". A chunk's uncles are the sibling Y of its parent X,
+and the uncle of that Y, recursively until the root is reached. For chunk C4
+its uncles are bins 13 and 3, marked with * in the figure. Using this information
+the peer recalculates the root hash of the tree, and compares it to the
+root hash it received from the trusted source. If they match, the chunk of
+content has been positively verified to be the requested part of the content.
+Otherwise, the sending peer either sent the wrong content or the wrong
+sibling or uncle hashes. For simplicity, the set of sibling and uncle
+hashes is collectively referred to as the "uncle hashes".
+
+In the case of live streaming the tree of chunks grows dynamically and
+content is identified with a public key instead of a root hash, as the root
+hash is undefined or, more precisely, transient, as long as new data is
+generated by the live source. Live streaming is described in more detail
+below, but content verification works the same for both live and predefined
+content.
+
+.ti 0
+3.5.3. The Atomic Datagram Principle
+
+As explained above, a datagram consists of a sequence of messages. Ideally,
+every datagram sent must be independent of other datagrams, so each
+datagram SHOULD be processed separately and a loss of one datagram MUST NOT
+disrupt the flow. Thus, as a datagram carries zero or more messages,
+neither messages nor message interdependencies should span over multiple
+datagrams.
+
+This principle implies that, as any chunk is verified using its uncle
+hashes, the necessary hashes MUST be put into the same datagram as the
+chunk's data (Sec. 3.5.4). As a general rule, if some additional data is
+still missing to process a message within a datagram, the message SHOULD be
+dropped.
+
+The hashes necessary to verify a chunk are in principle its sibling's hash
+and all its uncle hashes, but the set of hashes to send can be optimized.
+Before sending a packet of data to the receiver, the sender inspects the
+receiver's previous acknowledgments (HAVE or ACK) to derive which hashes the
+receiver already has for sure. Suppose the receiver had acknowledged bin 1
+(the first two chunks of the file); then it must already have uncle hashes 5,
+11 and so on. That is because those hashes are necessary to check packets of
+bin 1 against the root hash. Then, hashes 3, 7 and so on must also be known
+as they are calculated in the process of checking the uncle hash chain.
+Hence, to send bin 12 (i.e. the 7th chunk of content), the sender needs to
+include just the hashes for bins 14 and 9, which let the data be checked
+against hash 11 which is already known to the receiver.
+
+The sender MAY optimistically skip hashes which were sent out in previous,
+still unacknowledged datagrams. It is an optimization tradeoff between
+redundant hash transmission and the possibility of collateral data loss in
+case some necessary hashes were lost in the network, so that some delivered
+data cannot be verified and thus has to be dropped.
In either case, the receiver +builds the Merkle tree on-demand, incrementally, starting from the root +hash, and uses it for data validation. + +In short, the sender MUST put into the datagram the missing hashes necessary +for the receiver to verify the chunk. + +.ti 0 +3.5.4. DATA and HASH Messages + +Concretely, a peer that wants to send a chunk of content creates a datagram +that MUST consist of one or more HASH messages and a DATA message. The datagram +MUST contain a HASH message for each hash the receiver misses for integrity +checking. A HASH message MUST contain the bin number and hash data for each +of those hashes. The DATA message MUST contain the bin number of the chunk +and chunk itself. A peer MAY send the required messages for multiple chunks +in the same datagram. + + +.ti 0 +3.6. HINT + +While bulk download protocols normally do explicit requests for certain ranges +of data (i.e., use a pull model, for example, BitTorrent [BITTORRENT]), live +streaming protocols quite often use a request-less push model to save round +trips. Swift supports both models of operation. + +A peer MUST send a HINT message containing the bin of the chunk interval it +wants to download. A peer receiving a HINT message MAY send out requested +pieces. When it receives multiple HINTs (either in one datagram or in multiple), +the peer SHOULD process the HINTs sequentially. When live streaming, +it also may send some other chunks in case it runs out of requests or +for some other reason. In that case the only purpose of HINT messages is +to coordinate peers and to avoid unnecessary data retransmission, hence +the name. + + + +.ti 0 +3.7. Peer Address Exchange and NAT Hole Punching + +Peer address exchange messages (or PEX messages for short) are common for +many peer-to-peer protocols. By exchanging peer addresses in gossip fashion, +peers relieve central coordinating entities (the trackers) from unnecessary +work. swift optionally features two types of PEX messages: PEX_REQ and PEX_ADD. +A peer that wants to retrieve some peer addresses MUST send a PEX_REQ message. +The receiving peer MAY respond with a PEX_ADD message containing the addresses +of several peers. The addresses MUST be of peers it has recently exchanged +messages with to guarantee liveliness. + +.fi +To unify peer exchange and NAT hole punching functionality, the +sending pattern of PEX messages is restricted. As the swift handshake +is able to do simple NAT hole punching [SNP] transparently, PEX +messages must be emitted in the way to facilitate that. Namely, +once peer A introduces peer B to peer C by sending a PEX_ADD message to +C, it SHOULD also send a message to B introducing C. The messages +SHOULD be within 2 seconds from each other, but MAY not be, +simultaneous, instead leaving a gap of twice the "typical" RTT, i.e. +\%300-600ms. The peers are supposed to initiate handshakes to each +other thus forming a simple NAT hole punching pattern where the +introducing peer effectively acts as a STUN server [RFC5389]. Still, peers +MAY ignore PEX messages if uninterested in obtaining new peers or +because of security considerations (rate limiting) or any other +reason. + +The PEX messages can be used to construct a dedicated tracker peer. + + +.ti 0 +3.8. KEEPALIVE + +A peer MUST send a datagram containing a KEEPALIVE message periodically +to each peer it wants to interact with in the future but has no +other messages to send them at present. + + +.ti 0 +3.9. 
VERSION +.fi +Peers MUST convey which version of the swift protocol they support using a +VERSION message. This message MUST be included in the initial (handshake) +datagrams and MUST indicate which version of the swift protocol the sending +peer supports. + + +.ti 0 +3.10. Conveying Peer Capabilities +.fi +Peers may support just a subset of the swift messages. For example, peers +running over TCP may not accept ACK messages, or peers used with a centralized +tracking infrastructure may not accept PEX messages. For these reasons, peers +SHOULD signal which subset of the swift messages they support by means of +the MSGTYPE_RCVD message. This message SHOULD be included in the initial +(handshake) datagrams and MUST indicate which swift protocol messages +the sending peer supports. + + +.ti 0 +3.11. Directory Lists + +Directory list files MUST start with magic bytes ".\\n..\\n". The rest of +the file is a newline-separated list of hashes and file names for the +content of the directory. An example: + +.nf +\&. +\&.. +1234567890ABCDEF1234567890ABCDEF12345678 readme.txt +01234567890ABCDEF1234567890ABCDEF1234567 big_file.dat + + + +.ti 0 +4. Automatic Detection of Content Size + +.fi +In swift, the root hash of a static content asset, such as a video file, +along with some peer addresses is sufficient to start a download. +In addition, swift can reliably and automatically derive the size +of such content from information received from the network when fixed +sized chunks are used. As a result, it is not necessary to include +the size of the content asset as the metadata of the content, +in addition to the root hash. Implementations of swift MAY use +this automatic detection feature. + +.ti 0 +4.1. Peak Hashes + +The ability for a newcomer peer to detect the size of the content +depends heavily on the concept of peak hashes. Peak hashes, +in general, enable two cornerstone features of swift: reliable file +size detection and download/live streaming unification (see Sec. 5). +The concept of peak hashes depends on the concepts of filled and +incomplete bins. Recall that when constructing the binary trees +for content verification and addressing the base of the tree may +have more leaves than the number of chunks in the content. In the +Merkle hash tree these leaves were assigned empty all-zero +hashes to be able to calculate the higher level hashes. A filled +bin is now defined as a bin number that addresses an interval +of leaves that consists only of hashes of content chunks, not +empty hashes. Reversely, an incomplete (not filled) bin +addresses an interval that contains also empty hashes, +typically an interval that extends past the end of the file. +In the following figure bins 7, 11, 13 and 14 are incomplete +the rest is filled. + +Formally, a peak hash is a hash in the Merkle tree defined +over a filled bin, whose sibling is defined over an incomplete bin. +Practically, suppose a file is 7162 bytes long and a chunk +is 1 kilobyte. That file fits into 7 chunks, the tail chunk being +1018 bytes long. The Merkle tree for that file looks as follows. +Following the definition the peak hashes of this file are in +bins 3, 9 and 12, denoted with a *. E denotes an empty +hash. 
+ + 7 +.br + / \\ +.br + / \\ +.br + / \\ +.br + / \\ +.br + 3* 11 +.br + / \\ / \\ +.br + / \\ / \\ +.br + / \\ / \\ +.br + 1 5 9* 13 +.br + / \\ / \\ / \\ / \\ +.br + 0 2 4 6 8 10 12* 14 +.br + + C0 C1 C2 C3 C4 C5 C6 E +.br + = 1018 bytes +.br + +Peak hashes can be explained by the binary representation of the +number of chunks the file occupies. The binary representation for +7 is 111. Every "1" in binary representation of the file's packet +length corresponds to a peak hash. For this particular file there +are indeed three peaks, bin numbers 3, 9, 12. The number of peak +hashes for a file is therefore also at most logarithmic with its +size. + +.fi +A peer knowing which bins contain the peak hashes for the file +can therefore calculate the number of chunks it consists of, and +thus get an estimate of the file size (given all chunks but the last +are fixed size). Which bins are the peaks can be securely communicated +from one (untrusted) peer A to another B by letting A send the +peak hashes and their bin numbers to B. It can be shown that +the root hash that B obtained from a trusted source is sufficient +to verify that these are indeed the right peak hashes, as follows. + +Lemma: Peak hashes can be checked against the root hash. + +Proof: (a) Any peak hash is always the left sibling. Otherwise, be +it the right sibling, its left neighbor/sibling must also be +defined over a filled bin, because of the way chunks are laid +out in the leaves, contradiction. (b) For the rightmost +peak hash, its right sibling is zero. (c) For any peak hash, +its right sibling might be calculated using peak hashes to the +left and zeros for empty bins. (d) Once the right sibling of +the leftmost peak hash is calculated, its parent might be +calculated. (e) Once that parent is calculated, we might +trivially get to the root hash by concatenating the hash with +zeros and hashing it repeatedly. + +.fi +Informally, the Lemma might be expressed as follows: peak hashes cover all +data, so the remaining hashes are either trivial (zeros) or might be +calculated from peak hashes and zero hashes. + +Finally, once peer B has obtained the number of chunks in the content it +can determine the exact file size as follows. Given that all chunks +except the last are fixed size B just needs to know the size of the last +chunk. Knowing the number of chunks B can calculate the bin number of the +last chunk and download it. As always B verifies the integrity of this +chunk against the trusted root hash. As there is only one chunk of data +that leads to a successful verification the size of this chunk must +be correct. B can then determine the exact file size as + + (number of chunks -1) * fixed chunk size + size of last chunk + + +.ti 0 +4.2. Procedure + +A swift implementation that wants to use automatic size detection MUST +operate as follows. When a peer B sends a DATA message for the first time +to a peer A, B MUST include all the peak hashes for the content in the +same datagram, unless A has already signalled earlier in the exchange +that it knows the peak hashes by having acknowledged any bin, even the empty +one. The receiver A MUST check the peak hashes against the root hash +to determine the approximate content size. To obtain the definite content size +peer A MUST download the last chunk of the content from any peer that offers it. + + + + + +.ti 0 +5. 
Live streaming + +.fi +In the case of live streaming a transfer is bootstrapped with a public key +instead of a root hash, as the root hash is undefined or, more precisely, +transient, as long as new data is being generated by the live source. +Live/download unification is achieved by sending signed peak hashes on-demand, +ahead of the actual data. As before, the sender might use acknowledgements +to derive which content range the receiver has peak hashes for and to prepend +the data hashes with the necessary (signed) peak hashes. +Except for the fact that the set of peak hashes changes with time, +other parts of the algorithm work as described in Sec. 3. + +As with static content assets in the previous section, in live streaming +content length is not known on advance, but derived \%on-the-go from the peak +hashes. Suppose, our 7 KB stream extended to another kilobyte. Thus, now hash 7 +becomes the only peak hash, eating hashes 3, 9 and 12. So, the source sends +out a SIGNED_HASH message to announce the fact. + +The number of cryptographic operations will be limited. For example, +consider a 25 frame/second video transmitted over UDP. When each frame is +transmitted in its own chunk, only 25 signature verification operations +per second are required at the receiver for bitrates up to ~12.8 +megabit/second. For higher bitrates multiple UDP packets per frame +are needed and the number of verifications doubles. + + + + + +.ti 0 +6. Transport Protocols and Encapsulation + +.ti 0 +6.1. UDP + +.ti 0 +6.1.1. Chunk Size + +Currently, swift-over-UDP is the preferred deployment option. Effectively, UDP +allows the use of IP with minimal overhead and it also allows userspace +implementations. The default is to use chunks of 1 kilobyte such that a +datagram fits in an Ethernet-sized IP packet. The bin numbering +allows to use swift over Jumbo frames/datagrams. Both DATA and +HAVE/ACK messages may use e.g. 8 kilobyte packets instead of the standard 1 +KiB. The hashing scheme stays the same. Using swift with 512 or 256-byte +packets is theoretically possible with 64-bit byte-precise bin numbers, but IP +fragmentation might be a better method to achieve the same result. + + +.ti 0 +6.1.2. Datagrams and Messages + +When using UDP, the abstract datagram described above corresponds directly +to a UDP datagram. Each message within a datagram has a fixed length, which +depends on the type of the message. The first byte of a message denotes its type. +The currently defined types are: + + HANDSHAKE = 0x00 +.br + DATA = 0x01 +.br + ACK = 0x02 +.br + HAVE = 0x03 +.br + HASH = 0x04 +.br + PEX_ADD = 0x05 +.br + PEX_REQ = 0x06 +.br + SIGNED_HASH = 0x07 +.br + HINT = 0x08 +.br + MSGTYPE_RCVD = 0x09 +.br + VERSION = 0x10 +.br + + +Furthermore, integers are serialized in the network \%(big-endian) byte order. +So consider the example of an ACK message (Sec 3.4). It has message type of 0x02 +and a payload of a bin number, a four-byte integer (say, 1); hence, its on the wire +representation for UDP can be written in hex as: "02 00000001". This \%hex-like two +character-per-byte notation is used to represent message formats in the rest of +this section. + +.ti 0 +6.1.3. Channels + +As it is increasingly complex for peers to enable UDP communication +between each other due to NATs and firewalls, swift-over-UDP uses +a multiplexing scheme, called "channels", to allow multiple swarms to use the +same UDP port. Channels loosely correspond to TCP connections and each channel +belongs to a single swarm. 
When channels are used, each datagram starts with +four bytes corresponding to the receiving channel number. + + +.ti 0 +6.1.4. HANDSHAKE and VERSION + +A channel is established with a handshake. To start a handshake, the initiating +peer needs to know: + +.nf +(1) the IP address of a peer +(2) peer's UDP port and +(3) the root hash of the content (see Sec. 3.5.1). +.fi + +To do the handshake the initiating peer sends a datagram that MUST start +with an all 0-zeros channel number followed by a VERSION message, then a +HASH message whose payload is the root hash, and a HANDSHAKE message, whose +only payload is a locally unused channel number. + +On the wire the datagram will look something like this: +.nf + 00000000 10 01 + 04 7FFFFFFF 1234123412341234123412341234123412341234 + 00 00000011 +.fi +(to unknown channel, handshake from channel 0x11 speaking protocol version 0x01, +initiating a transfer of a file with a root hash 123...1234) + +The receiving peer MUST respond with a datagram that starts with the +channel number from the sender's HANDSHAKE message, followed by a +VERSION message, then a HANDSHAKE message, whose only payload is a locally +unused channel number, followed by any other messages it wants to send. + +Peer's response datagram on the wire: +.nf + 00000011 10 01 + 00 00000022 03 00000003 +.fi +(peer to the initiator: use channel number 0x22 for this transfer and +proto version 0x01; I also have first 4 chunks of the file, see Sec. 4.3) + +.fi +At this point, the initiator knows that the peer really responds; for that +purpose channel ids MUST be random enough to prevent easy guessing. So, the +third datagram of a handshake MAY already contain some heavy payload. To +minimize the number of initialization roundtrips, the first two datagrams +MAY also contain some minor payload, e.g. a couple of HAVE messages roughly +indicating the current progress of a peer or a HINT (see Sec. 3.6). +When receiving the third datagram, both peers have the proof they really talk +to each other; three-way handshake is complete. + +A peer MAY explicit close a channel by sending a HANDSHAKE message that MUST +contain an all 0-zeros channel number. + +On the wire: +.nf + 00 00000000 + + +.ti 0 +6.1.5. HAVE + +A HAVE message (type 0x03) states that the sending peer has the +complete specified bin and successfully checked its integrity: +.nf + 03 00000003 +(got/checked first four kilobytes of a file/stream) + + + +.ti 0 +6.1.6. ACK + +An ACK message (type 0x02) acknowledges data that was received from +its addressee; to facilitate delay-based congestion control, an +ACK message contains a timestamp, in particular, a 64-bit microsecond +time. +.nf + 02 00000002 12345678 +(got the second kilobyte of the file from you; my microsecond +timer was showing 0x12345678 at that moment) + + +.ti 0 +6.1.7. HASH + +A HASH message (type 0x04) consists of a four-byte bin number and +the cryptographic hash (e.g. a 20-byte SHA1 hash) +.nf + 04 7FFFFFFF 1234123412341234123412341234123412341234 + + +.ti 0 +6.1.8. DATA + +.fi +A DATA message (type 0x01) consists of a four-byte bin number and the +actual chunk. In case a datagram contains a DATA message, a sender +MUST always put the data message in the tail of the datagram. For +example: +.nf + 01 00000000 48656c6c6f20776f726c6421 +(This message accommodates an entire file: "Hello world!") + + +.ti 0 +6.1.9. KEEPALIVE + +Keepalives do not have a message type on UDP. They are just simple +datagrams consisting of a 4-byte channel id only. 
+ +On the wire: +.nf + 00000022 + +.ti 0 +6.1.10. Flow and Congestion Control + +.fi +Explicit flow control is not necessary in swift-over-UDP. In the case of +video-on-demand the receiver will request data explicitly from peers and is +therefore in control of how much data is coming towards it. In the case of +live streaming, where a push-model may be used, the amount of data incoming +is limited to the bitrate, which the receiver must be able to process otherwise +it cannot play the stream. Should, for any reason, the receiver get saturated +with data that situation is perfectly detected by the congestion control. +Swift-over-UDP can support different congestion control algorithms, in particular, +it supports the new IETF Low Extra Delay Background Transport (LEDBAT) congestion +control algorithm that ensures that peer-to-peer traffic yields to regular +best-effort traffic [LEDBAT]. + + +.ti 0 +6.2. TCP + +.fi +When run over TCP, swift becomes functionally equivalent to BitTorrent. +Namely, most swift messages have corresponding BitTorrent messages and vice +versa, except for BitTorrent's explicit interest declarations and +choking/unchoking, which serve the classic implementation of the tit-for-tat +algorithm [TIT4TAT]. However, TCP is not well suited for multiparty communication, +as argued in Sec. 9. + + +.ti 0 +6.3. RTP Profile for PPSP + +.fi +In this section we sketch how swift can be integrated into RTP [RFC3550] to +form the Peer-to-Peer Streaming Protocol (PPSP) [I-D.ietf-ppsp-reqs] running +over UDP. The PPSP charter requires existing media transfer protocols be +used [PPSPCHART]. Hence, the general idea is to define swift as a profile +of RTP, in the same way as the Secure Real-time Transport Protocol (SRTP) +[RFC3711]. SRTP, and therefore swift is considered ``a "bump in the stack" +implementation which resides between the RTP application and the transport +layer. [swift] intercepts RTP packets and then forwards an equivalent +[swift] packet on the sending side, and intercepts [swift] packets and passes +an equivalent RTP packet up the stack on the receiving side.'' [RFC3711]. + +In particular, to encode a swift datagram in an RTP packet all the +non-DATA messages of swift such as HINT and HAVE are postfixed to the RTP +packet using the UDP encoding and the content of DATA messages is sent +in the payload field. Implementations MAY omit the RTP header for packets +without payload. This construction allows the streaming application to use +of all RTP's current features, and with a modification to the Merkle tree +hashing scheme (see below) meets swift's atomic datagram principle. The latter +means that a receiving peer can autonomously verify the RTP packet as being +correct content, thus preventing the spread of corrupt data (see +requirement PPSP.SEC-REQ-4). + +The use of ACK messages for reliability is left as a choice of the +application using PPSP. + + +.ti 0 +6.3.1. Design + +6.3.1.1. Joining a Swarm + +To commence a PPSP download a peer A must have the swarm ID of the stream +and a list of one or more tracker contact points (e.g. host+port). The list +of trackers is optional in the presence of a decentralized tracking +mechanism. The swarm ID consists of the swift root hash of the content, +which is divided into chunks (see Discussion). + +Peer A now registers with the PPSP tracker following the tracker protocol +[I-D.ietf.ppsp-reqs] and receives the IP address and RTP port of peers +already in the swarm, say B, C, and D. 
Peer A now sends an RTP packet +containing a HANDSHAKE without channel information to B, C, and D. This +serves as an end-to-end check that the peers are actually in the correct +swarm. Optionally A could include a HINT message in some RTP packets if it +wants to start receiving content immediately. B and C respond with a +HANDSHAKE and HAVE messages. D sends just a HANDSHAKE and omits HAVE +messages as a way of choking A. + + +6.3.1.2. Exchanging Chunks + +In response to B and C, A sends new RTP packets to B and C with HINTs for +disjunct sets of chunks. B and C respond with the requested chunks in the +payload and HAVE messages, updating their chunk availability. Upon receipt, +A sends HAVE for the chunks received and new HINT messages to B and C. When +e.g. C finds that A obtained a chunk (from B) that C did not yet have, C's +response includes a HINT for that chunk. + +D does not send HAVE messages, instead if D decides to unchoke peer A, it +sends an RTP packet with HAVE messages to inform A of its current +availability. If B or C decide to choke A they stop sending HAVE and DATA +messages and A should then rerequest from other peers. They may continue to +send HINT messages, or exponentially slowing KEEPALIVE messages such that A +keeps sending them HAVE messages. + +Once A has received all content (video-on-demand use case) it stops +sending messages to all other peers that have all content (a.k.a. seeders). + + +6.3.1.3. Leaving a Swarm + +Peers can implicitly leave a swarm by stopping to respond to messages. +Sending peers should remove these peers from the current peer list. This +mechanism works for both graceful and ungraceful leaves (i.e., peer crashes +or disconnects). When leaving gracefully, a peer should deregister from the +tracker following the PPSP tracker protocol. + +More explicit graceful leaves could be implemented using RTCP. In +particular, a peer could send a RTCP BYE on the RTCP port that is derivable +from a peer's RTP port for all peers in its current peer list. However, to +prevent malicious peers from sending BYEs a form of peer authentication is +required (e.g. using public keys as peer IDs [PERMIDS].) + + +6.3.1.4. Discussion + +Using swift as an RTP profile requires a change to the content +integrity protection scheme (see Sec. 3.5). The fields in the RTP +header, such as the timestamp and PT fields, must be protected by the Merkle +tree hashing scheme to prevent malicious alterations. Therefore, the Merkle +tree is no longer constructed from pure content chunks, but from the complete +RTP packet for a chunk as it would be transmitted (minus the non-DATA swift messages). +In other words, the hash of the leaves in the tree is the hash over the +Authenticated Portion of the RTP packet as defined by SRTP, illustrated +in the following figure (extended from [RFC3711]). There is no need for the +RTP packets to be fixed size, as the hashing scheme can deal with +variable-sized leaves. 
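+
+The following non-normative Python sketch summarizes this modified
+hashing scheme, assuming SHA1 as in Sec. 3.5.1; the function names are
+illustrative only.
+
+.nf
+   import hashlib
+
+   def leaf_hash(auth_portion):
+       # Hash over the Authenticated Portion of the RTP packet for one
+       # chunk (header, extension and payload, excluding the appended
+       # swift non-DATA messages and their length field).
+       return hashlib.sha1(auth_portion).digest()
+
+   def parent_hash(left, right):
+       # As in Sec. 3.5.1: concatenate the child hashes left-to-right
+       # and hash the aggregate.
+       return hashlib.sha1(left + right).digest()
+.fi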
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.in 0 + 0 1 2 3 +.br + 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+ +.br + |V=2|P|X| CC |M| PT | sequence number | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | timestamp | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | synchronization source (SSRC) identifier | | +.br + +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ | +.br + | contributing source (CSRC) identifiers | | +.br + | .... | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | RTP extension (OPTIONAL) | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | payload ... | | +.br + | +-------------------------------+ | +.br + | | RTP padding | RTP pad count | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+ +.br + ~ swift non-DATA messages (REQUIRED) ~ | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | length of swift messages (REQUIRED) | | +.br + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | +.br + | +.br + Authenticated Portion ---+ +.br + + Figure: The format of an RTP-Swift packet. +.in 3 + +As a downside, with variable-sized payloads the automatic content size +detection of Section 4 no longer works, so content length MUST be explicit in the +metadata. In addition, storage on disk is more complex with out-of-order, +variable-sized packets. On the upside, carrying RTP over swift allow +decryption-less caching. + +As with UDP, another matter is how much data is carried inside each packet. +An important swift-specific factor here is the resulting number of hash +calculations per second needed to verify chunks. Experiments should be +conducted to ensure they are not excessive for, e.g., mobile hardware. + +At present, Peer IDs are not required in this design. + + +.ti 0 +6.3.2. PPSP Requirements + +6.3.2.1. Basic Requirements + +- PPSP.REQ-1: The swift PEX message can also be used as the basis for a tracker protocol, to be discussed elsewhere. + +- PPSP.REQ-2: This draft preserves the properties of RTP. + +- PPSP.REQ-3: This draft does not place requirements on peer IDs, IP+port is sufficient. + +- PPSP.REQ-4: The content is identified by its root hash (video-on-demand) or a public key (live streaming). + +- PPSP.REQ-5: The content is partitioned by the streaming application. + +- PPSP.REQ-6: Each chunk is identified by a bin number (and its cryptographic hash.) + +- PPSP.REQ-7: The protocol is carried over UDP because RTP is. + +- PPSP.REQ-8: The protocol has been designed to allow meaningful data transfer between +peers as soon as possible and to avoid unnecessary round-trips. It supports small and +variable chunk sizes, and its content integrity protection enables wide scale caching. + + +6.3.2.2. Peer Protocol Requirements + +- PPSP.PP.REQ-1: A GET_HAVE would have to be added to request which +chunks are available from a peer, if the proposed push-based HAVE +mechanism is not sufficient. + +- PPSP.PP.REQ-2: A set of HAVE messages satisfies this. + +- PPSP.PP.REQ-3: The PEX_REQ message satisfies this. Care should be taken with +peer address exchange in general, as the use of such hearsay is a risk for the +protocol as it may be exploited by malicious peers (as a DDoS attack mechanism). 
+A secure tracking / peer sampling protocol like [PUPPETCAST] may be needed
+to make peer-address exchange safe.
+
+- PPSP.PP.REQ-4: HAVE messages convey current availability via a push model.
+
+- PPSP.PP.REQ-5: Bin numbering enables a compact representation of chunk availability.
+
+- PPSP.PP.REQ-6: A new PPSP specific Peer Report message would have to be
+added to RTCP.
+
+- PPSP.PP.REQ-7: Transmission and chunk requests are integrated in this
+protocol.
+
+
+6.3.2.3. Security Requirements
+
+- PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
+[CLOSED] would have to be added.
+
+- PPSP.SEC.REQ-2: As RTP is carried verbatim over swift, RTP encryption
+can be used. Note that just encrypting the RTP part will allow for
+caching servers that are part of the swarm but do not need access to the
+decryption keys. They just need access to the swift HASHES in the postfix to
+verify the packet's integrity.
+
+- PPSP.SEC.REQ-3: RTP encryption or IPsec [RFC4303] can be used, if the
+swift messages must also be encrypted.
+
+- PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the indirect
+spread of corrupt content, as peers will only forward chunks to others if
+their integrity checks out. Another protection mechanism is to not depend on
+hearsay (i.e., do not forward other peers' availability information), or to
+only use it when the information spread is self-certified by its subjects.
+
+Other attacks, such as a malicious peer claiming it has content but not
+replying, or wasting CPU and bandwidth at a receiving peer by sending
+packets where the DATA does not match the HASHes, are still possible.
+
+
+- PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
+peer to detect a malicious or faulty sender, which it can subsequently
+ignore. Spreading this knowledge to other peers such that they know
+about this bad behavior is hearsay.
+
+
+- PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that malicious
+peers launch an Eclipse [ECLIPSE] attack on the initial injectors of the
+content (in particular in live streaming). The attack tries to let the
+injector upload to just malicious peers which then do not forward the
+content to others, thus stopping the distribution. An Eclipse attack could
+also be launched on an individual peer. Letting these injectors only use
+trusted trackers that provide true random samples of the population or using
+a secure peer sampling service [PUPPETCAST] can help negate such an attack.
+
+
+- PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
+additional mechanisms such as DHTs [SECDHTS], but self-certification of
+addresses is needed. Self-certification means, for example, that each peer
+has a public/private key pair [PERMIDS] and creates self-certified address
+changes that include the swarm ID and a timestamp, which are then exchanged
+among peers or stored in DHTs. See also discussion of PPSP.PP.REQ-3
+above. Content distribution can continue as long as there are peers that
+have it available.
+
+- PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
+trusted source is well-established in the BitTorrent protocol [BITTORRENT].
+The proposed Merkle tree scheme is a secure extension of this idea.
+Self-certification and not using hearsay are other lessons learned from
+existing distributed systems.
+
+- PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
+self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.
+
+
+.ti 0
+6.4.
+.ti 0
+6.4. HTTP (as PPSP)
+
+In this section we sketch how swift can be carried over HTTP [RFC2616] to
+form the PPSP running over TCP. The general idea is to encode a swift
+datagram in HTTP GET and PUT requests and their replies by transmitting all
+the non-DATA messages, such as HINTs and HAVEs, as headers and sending DATA
+messages in the body. This idea follows the atomic datagram principle for
+each request and reply, so a receiving peer can autonomously verify the
+message as carrying correct data, thus preventing the spread of corrupt data
+(see requirement PPSP.SEC.REQ-4).
+
+A problem with HTTP is that it is a client/server protocol. To overcome this
+problem, a peer A uses a PUT request instead of a GET request if the peer B
+has indicated in a reply that it wants to retrieve a chunk from A. In cases
+where peer A is no longer interested in receiving requests from B (described
+below), B may need to establish a new HTTP connection to A to quickly
+download a chunk, instead of waiting for a convenient time when A sends
+another request. As an alternative design, two HTTP connections could
+always be used, but this is inefficient.
+
+.ti 0
+6.4.1. Design
+
+6.4.1.1. Joining a Swarm
+
+To commence a PPSP download a peer A must have the swarm ID of the stream
+and a list of one or more tracker contact points, as above. The swarm ID, as
+earlier, consists of the swift root hash of the content, which is divided
+into chunks by the streaming application (e.g. fixed-size chunks of 1
+kilobyte for video-on-demand).
+
+Peer A now registers with the PPSP tracker following the tracker protocol
+[I-D.ietf-ppsp-reqs] and receives the IP address and HTTP port of peers
+already in the swarm, say B, C, and D. Peer A now establishes persistent
+HTTP connections with B, C, D and sends GET requests with the Request-URI
+set to /<swarm ID>. Optionally A could include a HINT message in
+some requests if it wants to start receiving content immediately. A HINT is
+encoded as a Range header with a new "bins" unit [RFC2616,$14.35].
+
+B and C respond with a 200 OK reply with header-encoded HAVE messages. A
+HAVE message is encoded as an extended Accept-Ranges: header [RFC2616,$14.5]
+with the new bins unit and the possibility of listing the set of accepted bins.
+If no HINT/Range header was present in the request, the body of the reply is
+empty. D sends just a 200 OK reply and omits the HAVE/Accept-Ranges header
+as a way of choking A.
+
+6.4.1.2. Exchanging Chunks
+
+In response to B and C, A sends GET requests with Range headers, requesting
+disjoint sets of chunks. B and C respond with 206 Partial Content replies
+with the requested chunks in the body and Accept-Ranges headers, updating
+their chunk availability. The HASHES for the chunks are encoded in a new
+Content-Merkle header and the Content-Range is set to identify the chunk
+[RFC2616,$14.16]. A new "multipart-bin ranges" equivalent to the
+"multipart/byteranges" media type may be used to transmit multiple chunks
+in one reply.
+
+Upon receipt, A sends a new GET request with a HAVE/Accept-Ranges header for
+the chunks received and new HINT/Range headers to B and C. Now when, e.g., C
+finds that A obtained a chunk (from B) that C did not yet have, C's response
+includes a HINT/Range for that chunk. In this case, A's next request to C is
+not a GET request, but a PUT request with the requested chunk sent in the
+body.
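+
+As an illustration, the exchange of a single chunk between peers A and B
+might look as follows. This is a sketch only: the host name, the bin
+numbers and the value syntax of the Range, Accept-Ranges, Content-Range and
+Content-Merkle headers are hypothetical, as this document does not define a
+concrete encoding for them.
+
+.nf
+   GET /<swarm ID> HTTP/1.1
+   Host: peer-b.example.net
+   Range: bins=26
+   Accept-Ranges: bins 3, 11
+
+   HTTP/1.1 206 Partial Content
+   Accept-Ranges: bins 3, 11, 26
+   Content-Range: bins 26/*
+   Content-Merkle: sha1; chunk=1024; <bin>=<hash>, <bin>=<hash>, ...
+   Content-Length: 1024
+
+   (the 1024 bytes of the chunk in bin 26)
+.fi
+
+Here the Range header carries A's HINT for bin 26, the Accept-Ranges header
+in the request carries A's own HAVE set, and the reply returns the requested
+chunk together with B's updated availability and the Merkle hashes needed to
+verify the chunk against the swarm's root hash.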
+
+
+Again, working around the fact that HTTP is a client/server protocol, peer A
+periodically sends HEAD requests to peer D (which was virtually choking A)
+that serve as keepalives and may contain HAVE/Accept-Ranges headers. If D
+decides to unchoke peer A, it includes an Accept-Ranges header in the "200 OK"
+reply to inform A of its current chunk availability.
+
+If B or C decide to choke A, they start responding with 204 No Content
+replies without HAVE/Accept-Ranges headers, and A should then re-request from
+other peers. However, if their replies contain HINT/Range headers, A should
+keep on sending PUT requests with the desired data (another client/server
+workaround). If not, A should slowly send HEAD requests that serve as
+keepalives and content availability updates.
+
+Once A has received all content (video-on-demand use case) it closes the
+persistent connections to all other peers that have all content (a.k.a.
+seeders).
+
+
+6.4.1.3. Leaving a Swarm
+
+Peers can explicitly leave a swarm by closing the connection. This mechanism
+works for both graceful and ungraceful leaves (i.e., peer crashes or
+disconnects). When leaving gracefully, a peer should deregister from the
+tracker following the PPSP tracker protocol.
+
+
+6.4.1.4. Discussion
+
+As mentioned earlier, this design suffers from the fact that HTTP is a
+client/server protocol. A solution where a peer establishes two HTTP
+connections with every other peer may be more elegant, but inefficient.
+The mapping of swift messages to headers remains the same:
+
+    HINT  = Range
+.br
+    HAVE  = Accept-Ranges
+.br
+    HASH  = Content-Merkle
+.br
+    PEX   = e.g. extended Content-Location
+.br
+
+The Content-Merkle header should include some parameters to
+indicate the hash function and chunk size (e.g. SHA1 and 1K) used to
+build the Merkle tree.
+
+
+.ti 0
+6.4.2. PPSP Requirements
+
+6.4.2.1. Basic Requirements
+
+- PPSP.REQ-1: The HTTP-based BitTorrent tracker protocol [BITTORRENT] can be
+used as the basis for a tracker protocol, to be discussed elsewhere.
+
+- PPSP.REQ-2: This draft preserves the properties of HTTP, but
+extra mechanisms may be necessary to protect against faulty
+or malicious peers.
+
+- PPSP.REQ-3: This draft does not place requirements on peer IDs;
+IP+port is sufficient.
+
+- PPSP.REQ-4: The content is identified by its root hash (video-on-demand)
+or a public key (live streaming).
+
+- PPSP.REQ-5: The content is partitioned into chunks by the streaming
+application (see 6.4.1.1).
+
+- PPSP.REQ-6: Each chunk is identified by a bin number (and its
+cryptographic hash).
+
+- PPSP.REQ-7: The protocol is carried over TCP because HTTP is.
+
+
+6.4.2.2. Peer Protocol Requirements
+
+- PPSP.PP.REQ-1: A HEAD request, which returns the new Accept-Ranges header,
+can be used to find out which chunks are available from a peer.
+
+- PPSP.PP.REQ-2: The new Accept-Ranges header satisfies this.
+
+- PPSP.PP.REQ-3: A GET with a request-URI requesting the peers of a resource
+(e.g. /<swarm ID>/peers) would have to be added to request known
+peers from a peer, if the proposed push-based PEX/Content-Location
+mechanism is not sufficient. Care should be taken with peer address exchange
+in general, as such hearsay is a risk for the protocol: it may be
+exploited by malicious peers (as a DDoS attack mechanism). A secure
+tracking / peer sampling protocol like [PUPPETCAST] may be needed
+to make peer-address exchange safe.
+
+
+- PPSP.PP.REQ-4: HAVE/Accept-Ranges headers convey current availability.
+
+
+- PPSP.PP.REQ-5: Bin numbering enables a compact representation of chunk availability.
+
+- PPSP.PP.REQ-6: A new PPSP-specific Peer-Report header would have to be
+added.
+
+- PPSP.PP.REQ-7: Transmission and chunk requests are integrated in this
+protocol.
+
+
+
+
+6.4.2.3. Security Requirements
+
+- PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
+[CLOSED] would have to be added.
+
+- PPSP.SEC.REQ-2: As swift is carried over HTTP, HTTPS encryption can be
+used instead. Alternatively, just the body could be encrypted. The latter
+allows for caching servers that are part of the swarm but do not need access
+to the decryption keys (they need access to the swift HASHES in the headers
+to verify the packet's integrity).
+
+- PPSP.SEC.REQ-3: HTTPS encryption or the content encryption facilities of
+HTTP can be used.
+
+- PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the indirect
+spread of corrupt content, as peers will only forward content to others if
+its integrity checks out. Another protection mechanism is to not depend on
+hearsay (i.e., do not forward other peers' availability information), or to
+only use it when the information spread is self-certified by its subjects.
+
+Other attacks, such as a malicious peer claiming it has content but not
+replying, or wasting CPU and bandwidth at a receiving peer by sending
+packets whose body does not match the HASH/Content-Merkle headers, are
+still possible.
+
+
+- PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving peer to
+detect a malicious or faulty sender, which it can subsequently close its
+connection to and ignore. Spreading this knowledge to other peers so that
+they know about this bad behavior would again be hearsay.
+
+
+- PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that malicious
+peers launch an Eclipse [ECLIPSE] attack on the initial injectors of the
+content (in particular in live streaming). The attack tries to let the
+injector upload to just malicious peers, which then do not forward the
+content to others, thus stopping the distribution. An Eclipse attack could
+also be launched on an individual peer. Letting these injectors use only
+trusted trackers that provide true random samples of the population, or
+using a secure peer sampling service [PUPPETCAST], can help negate such an
+attack.
+
+
+- PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
+additional mechanisms such as DHTs [SECDHTS], but self-certification of
+addresses is needed. Self-certification means, for example, that each peer
+has a public/private key pair [PERMIDS] and creates self-certified address
+changes that include the swarm ID and a timestamp, which are then exchanged
+among peers or stored in DHTs. See also the discussion of PPSP.PP.REQ-3
+above. Content distribution can continue as long as there are peers that
+have it available.
+
+- PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
+trusted source is well-established in the BitTorrent protocol [BITTORRENT].
+The proposed Merkle tree scheme is a secure extension of this idea.
+Self-certification and not using hearsay are other lessons learned from
+existing distributed systems.
+
+- PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
+self-certified naming of content, see PPSP.SEC.REQ-5 and Sec. 3.5.1.
+
+
+.ti 0
+7. Security Considerations
+
+Like any other network protocol, swift faces a common set of security
+challenges.
+An implementation must consider the possibility of buffer
+overruns, DoS attacks and manipulation (e.g. reflection attacks). Any
+guarantee of privacy seems unlikely, as the user is exposing its IP address
+to the peers. A probable exception is the case of the user being hidden
+behind a public NAT or proxy.
+
+
+.ti 0
+8. Extensibility
+
+.ti 0
+8.1. 32 bit vs 64 bit
+
+.nf
+While in principle the protocol supports bigger (>1TB) files, all the
+counters mentioned are 32-bit. This is an optimization, as using
+\%64-bit numbers on the wire may cost ~2% of practical overhead. The
+\%64-bit version of every message has a typeid of 64+t, e.g. typeid 68
+(0x44) for the \%64-bit hash message:
+.nf
+    44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567
+
+.ti 0
+8.2. IPv6
+
+.fi
+IPv6 versions of PEX messages use the same 64+t shift as just mentioned.
+
+
+.ti 0
+8.3. Congestion Control Algorithms
+
+.fi
+The congestion control algorithm is left to the implementation and may even
+vary from peer to peer. Congestion control is entirely implemented by the
+sending peer; the receiver only provides clues, such as hints,
+acknowledgments and timestamps. In general, it is expected that servers
+would use TCP-like congestion control schemes such as classic AIMD or CUBIC
+[CUBIC]. End-user peers are expected to use weaker-than-TCP (less than best
+effort) congestion control, such as [LEDBAT], to minimize seeding
+counter-incentives.
+
+
+.ti 0
+8.4. Piece Picking Algorithms
+
+Piece picking entirely depends on the receiving peer. The sending peer is
+made aware of preferred pieces by means of HINT messages. In some
+scenarios it may be beneficial to allow the sender to ignore those hints and
+send unrequested data.
+
+
+.ti 0
+8.5. Reciprocity Algorithms
+
+Reciprocity algorithms are the sole responsibility of the sending peer.
+Reciprocal intentions of the sender are not manifested by separate messages
+(like BitTorrent's CHOKE/UNCHOKE), as such announcements do not guarantee
+anything anyway (the "snubbing" syndrome).
+
+
+.ti 0
+8.6. Different crypto/hashing schemes
+
+.fi
+Once a flavor of swift needs to use a different crypto scheme
+(e.g., SHA-256), a message type should be allocated for that. As the root
+hash is supplied in the handshake message, the crypto scheme in use
+will be known from the very beginning. As the root hash is the
+content's identifier, different crypto schemes cannot be mixed
+in the same swarm; different swarms may distribute the same content
+using different crypto.
+
+
+.ti 0
+9. Rationale
+
+Historically, the Internet was based on end-to-end unicast and, after the
+failure of multicast, large-scale content dissemination was addressed by
+different technologies, which ultimately boiled down to maintaining
+and coordinating distributed replicas. On one hand, downloading
+from a nearby well-provisioned replica is somewhat faster and/or
+cheaper; on the other hand, it requires coordinating multiple
+parties (the data source, mirrors/CDN sites/peers, consumers). As
+the Internet progresses to richer and richer content, the overhead
+of peer/replica coordination becomes dwarfed by the mass of the
+download itself. Thus, the niche for multiparty transfers expands.
+Still, current, relevant technologies are tightly coupled to a
+single use case or even to the infrastructure of a particular corporation.
+The mission of our project is to create a generic content-centric
+multiparty transport protocol to allow seamless, effortless data
+dissemination on the Net.
+
+.nf
+               TABLE 1. Use cases.
+
+.br
+      |  mirror-based         peer-assisted         peer-to-peer
+.br
+------+----------------------------------------------------
+.br
+data  |  SunSITE              CacheLogic VelociX    BitTorrent
+.br
+VoD   |  YouTube              Azureus(+seedboxes)   SwarmPlayer
+.br
+live  |  Akamai Str.          Octoshape, Joost      PPlive
+.br
+
+.fi
+The protocol must be designed for maximum genericity, thus focusing on the
+very core of the mission: it must contain no magic constants and no
+hardwired policies. Effectively, it is a set of messages that allows data
+to be securely retrieved from whatever sources are available, in parallel.
+Ideally, the protocol must be able to run over IP as an independent
+transport protocol. Practically, it must run over UDP and TCP.
+
+
+.ti 0
+9.1. Design Goals
+
+.fi
+The technical focus of the swift protocol is to find the simplest solution
+involving the minimum set of primitives, still being sufficient to
+implement all the targeted usecases (see Table 1), suitable for use in
+general-purpose software and hardware (e.g. a web browser or a set-top
+box). The five design goals for the protocol are:
+
+.nf
+1. Embeddable kernel-ready protocol.
+2. Embrace real-time streaming, in- and out-of-order download.
+3. Have short warm-up times.
+4. Traverse NATs transparently.
+5. Be extensible, allow for a multitude of implementations over
+   diverse media, allow for drop-in pluggability.
+
+The objectives are referenced as (1)-(5).
+
+.fi
+The goal of embedding (1) means that the protocol must be ready to function
+as a regular transport protocol inside a set-top box, mobile device, a
+browser and/or in the kernel space. Thus, the protocol must have a light
+footprint, preferably lighter than that of TCP, in spite of the necessity
+to support numerous ongoing connections as well as to constantly probe the
+network for new possibilities. The practical overhead for TCP is estimated
+at 10KB per connection [HTTP1MLN]. We aim at <1KB per peer connected.
+Also, the amount of code necessary to make a basic implementation must be
+limited to 10KLoC of C. Otherwise, besides the resource considerations,
+maintaining and auditing the code might become prohibitively expensive.
+
+The support for all three basic usecases of real-time streaming,
+\%in-order download and out-of-order download (2) is necessary
+for the manifested goal of being THE multiparty transport protocol, as no
+single usecase dominates over the others.
+
+The objective of short warm-up times (3) is a matter of end-user
+experience; the playback must start as soon as possible. Thus any
+unnecessary initialization roundtrips and warm-up cycles must be
+eliminated from the transport layer.
+
+.fi
+Transparent NAT traversal (4) is absolutely necessary, as at least
+60% of today's users are hidden behind NATs. NATs severely affect
+connection patterns in P2P networks, thus impacting performance
+and fairness [MOLNAT,LUCNAT].
+
+The protocol must define a common message set (5) to be used by
+implementations; it must not hardwire any magic constants,
+algorithms or schemes beyond that. For example, an implementation
+is free to use its own congestion control, connection rotation
+or reciprocity algorithms. Still, the protocol must enable such
+algorithms by supplying sufficient information. For example,
+trackerless peer discovery needs peer exchange messages, and
+scavenger congestion control may need timestamped acknowledgments,
+etc.
+
+
+.ti 0
+9.2. Not TCP
+
+.fi
+To a large extent, swift's design is defined by the cornerstone decision
+to get rid of TCP and not to reinvent any TCP-like transports on
+top of UDP or otherwise. The requirements (1), (4), (5) make TCP a
+bad choice due to its high per-connection footprint, complex and
+less reliable NAT traversal, and fixed predefined congestion control
+algorithms. Besides that, an important consideration is that no
+block of TCP functionality turns out to be useful for the general
+case of swarming downloads. Namely,
+.nf
+ 1. in-order delivery is less useful as peer-to-peer protocols
+    often employ out-of-order delivery themselves and in either case
+    \%out-of-order data can still be stored;
+ 2. reliable delivery/retransmissions are not useful because
+    the same data might be requested from different sources; as
+    in-order delivery is not required, packet losses might be
+    patched up lazily, without stopping the flow of data;
+ 3. flow control is not necessary as the receiver is much less
+    likely to be saturated with the data and even if so, that
+    situation is perfectly detected by the congestion control;
+ 4. TCP congestion control is less useful as custom congestion
+    control is often needed [LEDBAT].
+In general, TCP is built and optimized for a different usecase than
+we have with swarming downloads. The abstraction of a "data pipe"
+orderly delivering some stream of bytes from one peer to another
+turned out to be irrelevant. In even more general terms, TCP
+supports the abstraction of pairwise _conversations_, while we need
+a content-centric protocol built around the abstraction of a cloud
+of participants disseminating the same _data_ in any way and order
+that is convenient to them.
+
+.fi
+Thus, the choice is to design a protocol that runs on top of
+unreliable datagrams. Instead of reimplementing TCP, we create a
+\%datagram-based protocol, completely dropping the sequential data
+stream abstraction. Removing unnecessary features of TCP makes
+it easier both to implement the protocol and to verify it;
+numerous TCP vulnerabilities were caused by the complexity of the
+protocol's state machine. Still, we reserve the possibility
+to run swift on top of TCP or HTTP.
+
+Pursuing the maxim of making things as simple as possible but not
+simpler, we fit the protocol into the constraints of the transport
+layer by dropping all the transmission's technical metadata except
+for the content's root hash (compare that to metadata files used
+in BitTorrent). Elimination of technical metadata is achieved
+through the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively
+single-file transfers and other techniques. As a result, a transfer
+is identified and bootstrapped by its root hash only.
+
+.fi
+To avoid the usual layering of positive/negative acknowledgment
+mechanisms we introduce a scale-invariant acknowledgment system
+(see Sec. 3.3.1 and 3.4). The system allows for aggregation and a variable
+level of detail in requesting, announcing and acknowledging data,
+and serves \%in-order and out-of-order retrieval with equal ease.
+Besides the protocol's footprint, we also aim at lowering the
+size of a minimal useful interaction. Once a single datagram is
+received, it must be checked for data integrity, and then either
+dropped or accepted, consumed and relayed.
+
+
+
+.ti 0
+9.3.
Generic Acknowledgments + +.nf +Generic acknowledgments came out of the need to simplify the +data addressing/requesting/acknowledging mechanics, which tends +to become overly complex and multilayered with the conventional +approach. Take the BitTorrent+TCP tandem for example: + +1. The basic data unit is a byte of content in a file. +2. BitTorrent's highest-level unit is a "torrent", physically a +byte range resulting from concatenation of content files. +3. A torrent is divided into "pieces", typically about a thousand +of them. Pieces are used to communicate progress to other +peers. Pieces are also basic data integrity units, as the torrent's +metadata includes a SHA1 hash for every piece. +4. The actual data transfers are requested and made in 16KByte +units, named "blocks" or chunks. +5. Still, one layer lower, TCP also operates with bytes and byte +offsets which are totally different from the torrent's bytes and +offsets, as TCP considers cumulative byte offsets for all content +sent by a connection, be it data, metadata or commands. +6. Finally, another layer lower, IP transfers independent datagrams +(typically around 1.5 kilobyte), which TCP then reassembles into +continuous streams. + +Obviously, such addressing schemes need lots of mappings; from +piece number and block to file(s) and offset(s) to TCP sequence +numbers to the actual packets and the other way around. Lots of +complexity is introduced by mismatch of bounds: packet bounds are +different from file, block or hash/piece bounds. The picture is +typical for a codebase which was historically layered. + +To simplify this aspect, we employ a generic content addressing +scheme based on binary intervals, or "bins" for short. + + + +.ti 0 +Acknowledgements + +Arno Bakker and Victor Grishchenko are partially supported by the +P2P-Next project (http://www.p2p-next.org/), a research project +supported by the European Community under its 7th Framework Programme +(grant agreement no. 216217). The views and conclusions contained +herein are those of the authors and should not be interpreted as +necessarily representing the official policies or endorsements, +either expressed or implied, of the P2P-Next project or the European +Commission. + +.fi +The swift protocol was designed by Victor Grishchenko at Technische Universiteit Delft. +The authors would like to thank the following people for their +contributions to this draft: Mihai Capota, Raul Jiminez, Flutra Osmani, +Riccardo Petrocco, Johan Pouwelse, and Raynor Vliegendhart. + + +.ti 0 +References + +.nf +.in 0 +[RFC2119] Key words for use in RFCs to Indicate Requirement Levels +[HTTP1MLN] Richard Jones. "A Million-user Comet Application with + Mochiweb", Part 3. http://www.metabrew.com/article/ + \%a-million-user-comet-application-with-mochiweb-part-3 +[MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips: + \%"Free-riding, Fairness, and Firewalls in P2P File-Sharing" + Proc. Eighth International Conference on Peer-to-Peer Computing + (P2P '08), Aachen, Germany, 8-11 Sept. 2008, pp. 301 - 310. +[LUCNAT] L. D'Acunto and M. Meulpolder and R. Rahman and J.A. + Pouwelse and H.J. Sips. "Modeling and Analyzing the Effects + of Firewalls and NATs in P2P Swarming Systems". In Proc. of + IEEE IPDPS (HotP2P), Atlanta, USA, April 23, 2010. +[BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps + and binary trees". Technical Report PDS-2011-005, Parallel and + Distributed Systems Group, Fac. 
of Electrical Engineering, + Mathematics, and Computer Science, Delft University of Technology, + The Netherlands, April 2009. +[SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication + Across Network Address Translators", + http://www.brynosaurus.com/pub/net/p2pnat/ +[FIPS180-2] + Federal Information Processing Standards Publication 180-2: + "Secure Hash Standard" 2002 August 1. +[MERKLE] Merkle, R. "Secrecy, Authentication, and Public Key Systems", + Ph.D. thesis, Dept. of Electrical Engineering, Stanford University, + CA, USA, 1979. pp 40-45. +[ABMRKL] Arno Bakker: "Merkle hash torrent extension", BitTorrent + Enhancement Proposal 30, Mar 2009. + http://bittorrent.org/beps/bep_0030.html +[CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly + \%High-Speed TCP Variant", Proc. Third International Workshop + on Protocols for Fast Long-Distance Networks (PFLDnet), Lyon, + France, Feb 2005. +[LEDBAT] S. Shalunov et al. "Low Extra Delay Background Transport + (LEDBAT)", IETF Internet-Draft draft-ietf-ledbat-congestion + (work in progress), Oct 2011. + \%http://datatracker.ietf.org/doc/draft-ietf-ledbat-congestion/ +[TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", + Proc. 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley, + CA, USA, Jun 2003. +[BITTORRENT] B. Cohen, "The BitTorrent Protocol Specification", + February 2008, \%http://www.bittorrent.org/beps/bep_0003.html +[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. + Jacobson, "RTP: A Transport Protocol for Real-Time + Applications", STD 64, RFC 3550, July 2003. +[RFC3711] M. Baugher, D. McGrew, M. Naslund, E. Carrara, K. Norrman, + "The Secure Real-time Transport Protocol (SRTP), RFC 3711, March + 2004. +[RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing, + "Session Traversal Utilities for NAT (STUN)", RFC 5389, October 2008. +[RFC2616] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, + P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", + RFC2616, June 1999. +[I-D.ietf-ppsp-reqs] Zong, N., Zhang, Y., Pascual, V., Williams, C., + and L. Xiao, "P2P Streaming Protocol (PPSP) Requirements", + draft-ietf-ppsp-reqs-05 (work in progress), October 2011. +[PPSPCHART] M. Stiemerling et al. "Peer to Peer Streaming Protocol (ppsp) + Description of Working Group" + \%http://datatracker.ietf.org/wg/ppsp/charter/ +[PERMIDS] A. Bakker et al. "Next-Share Platform M8--Specification + Part", App. C. P2P-Next project deliverable D4.0.1 (revised), + June 2009. + \%http://www.p2p-next.org/download.php?id=E7750C654035D8C2E04D836243E6526E +[PUPPETCAST] A. Bakker and M. van Steen. "PuppetCast: A Secure Peer + Sampling Protocol". Proceedings 4th Annual European Conference on + Computer Network Defense (EC2ND'08), pp. 3-10, Dublin, Ireland, + 11-12 December 2008. +[CLOSED] N. Borch, K. Michell, I. Arntzen, and D. Gabrijelcic: "Access + control to BitTorrent swarms using closed swarms". In Proceedings + of the 2010 ACM workshop on Advanced video streaming techniques + for peer-to-peer networks and social networking (AVSTP2P '10). + ACM, New York, NY, USA, 25-30. + \%http://doi.acm.org/10.1145/1877891.1877898 +[ECLIPSE] E. Sit and R. Morris, "Security Considerations for + Peer-to-Peer Distributed Hash Tables", IPTPS '01: Revised Papers + from the First International Workshop on Peer-to-Peer Systems, pp. + 261-269, Springer-Verlag, London, UK, 2002. +[SECDHTS] G. Urdaneta, G. Pierre, M. van Steen, "A Survey of DHT + Security Techniques", ACM Computing Surveys, vol. 
43(2), June 2011. +[SWIFTIMPL] V. Grishchenko, et al. "Swift M40 reference implementation", + \%http://swarmplayer.p2p-next.org/download/Next-Share-M40.tar.bz2 + (subdirectory Next-Share/TUD/swift-trial-r2242/), July 2011. +[CCNWIKI] http://en.wikipedia.org/wiki/Content-centric_networking +[HAC01] A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. "Handbook of + Applied Cryptography", CRC Press, October 1996 (Fifth Printing, + August 2001). +[JIM11] R. Jimenez, F. Osmani, and B. Knutsson. "Sub-Second Lookups on + a Large-Scale Kademlia-Based Overlay". 11th IEEE International + Conference on Peer-to-Peer Computing 2011, Kyoto, Japan, Aug. 2011 + + +.ti 0 +Authors' addresses + +.in 3 +A. Bakker +Technische Universiteit Delft +Department EWI/ST/PDS +Room HB 9.160 +Mekelweg 4 +2628CD Delft +The Netherlands + +Email: arno@cs.vu.nl + +.ce 0