6 Internet-Draft TU Delft
8 Expires: June 21, 2012 December 19, 2011
10 Peer-to-Peer Streaming Protocol (PPSP)
11 <draft-ietf-ppsp-peer-protocol-00.txt>
15 The Generic Multiparty Protocol (swift) is a peer-to-peer based
16 transport protocol for content dissemination. It can be used for
17 streaming on-demand and live video content, as well as conventional
18 downloading. In swift, the clients consuming the content participate
19 in the dissemination by forwarding the content to other clients via a
20 mesh-like structure. It is a generic protocol which can run directly
21 on top of UDP, TCP, HTTP or as a RTP profile. Features of swift are
22 short time-till-playback and extensibility. Hence, it can use
23 different mechanisms to prevent freeriding, and work with different
24 peer discovery schemes (centralized trackers or Distributed Hash
25 Tables). Depending on the underlying transport protocol, swift can
26 also use different congestion control algorithms, such as LEDBAT, and
27 offer transparent NAT traversal. Finally, swift maintains only a
28 small amount of state per peer and detects malicious modification of
29 content. This documents describes swift and how it satisfies the
30 requirements for the IETF Peer-to-Peer Streaming Protocol (PPSP)
31 Working Group's peer protocol.
36 This Internet-Draft is submitted to IETF in full conformance with the
37 provisions of BCP 78 and BCP 79.
39 Internet-Drafts are working documents of the Internet Engineering
40 Task Force (IETF), its areas, and its working groups. Note that
41 other groups may also distribute working documents as Internet-
44 Internet-Drafts are draft documents valid for a maximum of six months
45 and may be updated, replaced, or obsoleted by other documents at any
46 time. It is inappropriate to use Internet-Drafts as reference
47 material or to cite them other than as "work in progress."
49 The list of current Internet-Drafts can be accessed at
50 http://www.ietf.org/ietf/1id-abstracts.txt.
52 The list of Internet-Draft Shadow Directories can be accessed at
56 Grishchenko and Bakker Expires June 21, 2012 [Page 1]
58 Internet-Draft swift December 19, 2011
61 http://www.ietf.org/shadow.html.
64 Copyright (c) 2011 IETF Trust and the persons identified as the
65 document authors. All rights reserved.
67 This document is subject to BCP 78 and the IETF Trust's Legal
68 Provisions Relating to IETF Documents
69 (http://trustee.ietf.org/license-info) in effect on the date of
70 publication of this document. Please review these documents
71 carefully, as they describe your rights and restrictions with respect
72 to this document. Code Components extracted from this document must
73 include Simplified BSD License text as described in Section 4.e of
74 the Trust Legal Provisions and are provided without warranty as
75 described in the Simplified BSD License.
79 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3
80 1.1. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . 3
81 1.2. Conventions Used in This Document . . . . . . . . . . . . . 4
82 1.3. Terminology . . . . . . . . . . . . . . . . . . . . . . . . 5
83 2. Overall Operation . . . . . . . . . . . . . . . . . . . . . . . 6
84 2.1. Joining a Swarm . . . . . . . . . . . . . . . . . . . . . . 6
85 2.2. Exchanging Chunks . . . . . . . . . . . . . . . . . . . . . 6
86 2.3. Leaving a Swarm . . . . . . . . . . . . . . . . . . . . . . 7
87 3. Messages . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
88 3.1. HANDSHAKE . . . . . . . . . . . . . . . . . . . . . . . . . 8
89 3.3. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
90 3.3.1. Bin Numbers . . . . . . . . . . . . . . . . . . . . . . 8
91 3.3.2. HAVE Message . . . . . . . . . . . . . . . . . . . . . 9
92 3.4. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
93 3.5. DATA and HASH . . . . . . . . . . . . . . . . . . . . . . . 10
94 3.5.1. Merkle Hash Tree . . . . . . . . . . . . . . . . . . . 10
95 3.5.2. Content Integrity Verification . . . . . . . . . . . . 11
96 3.5.3. The Atomic Datagram Principle . . . . . . . . . . . . . 11
97 3.5.4. DATA and HASH Messages . . . . . . . . . . . . . . . . 12
98 3.6. HINT . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
99 3.7. Peer Address Exchange and NAT Hole Punching . . . . . . . . 13
100 3.8. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . . . 14
101 3.9. VERSION . . . . . . . . . . . . . . . . . . . . . . . . . . 14
102 3.10. Conveying Peer Capabilities . . . . . . . . . . . . . . . 14
103 3.11. Directory Lists . . . . . . . . . . . . . . . . . . . . . 14
104 4. Automatic Detection of Content Size . . . . . . . . . . . . . . 14
105 4.1. Peak Hashes . . . . . . . . . . . . . . . . . . . . . . . . 15
106 4.2. Procedure . . . . . . . . . . . . . . . . . . . . . . . . . 16
107 5. Live streaming . . . . . . . . . . . . . . . . . . . . . . . . 17
108 6. Transport Protocols and Encapsulation . . . . . . . . . . . . . 17
112 Grishchenko and Bakker Expires June 21, 2012 [Page 2]
114 Internet-Draft swift December 19, 2011
117 6.1. UDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
118 6.1.1. Chunk Size . . . . . . . . . . . . . . . . . . . . . . 17
119 6.1.2. Datagrams and Messages . . . . . . . . . . . . . . . . 18
120 6.1.3. Channels . . . . . . . . . . . . . . . . . . . . . . . 18
121 6.1.4. HANDSHAKE and VERSION . . . . . . . . . . . . . . . . . 19
122 6.1.5. HAVE . . . . . . . . . . . . . . . . . . . . . . . . . 20
123 6.1.6. ACK . . . . . . . . . . . . . . . . . . . . . . . . . . 20
124 6.1.7. HASH . . . . . . . . . . . . . . . . . . . . . . . . . 20
125 6.1.8. DATA . . . . . . . . . . . . . . . . . . . . . . . . . 20
126 6.1.9. KEEPALIVE . . . . . . . . . . . . . . . . . . . . . . . 20
127 6.1.10. Flow and Congestion Control . . . . . . . . . . . . . 21
128 6.2. TCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
129 6.3. RTP Profile for PPSP . . . . . . . . . . . . . . . . . . . 21
130 6.3.1. Design . . . . . . . . . . . . . . . . . . . . . . . . 22
131 6.3.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 24
132 6.4. HTTP (as PPSP) . . . . . . . . . . . . . . . . . . . . . . 27
133 6.4.1. Design . . . . . . . . . . . . . . . . . . . . . . . . 27
134 6.4.2. PPSP Requirements . . . . . . . . . . . . . . . . . . . 29
135 7. Security Considerations . . . . . . . . . . . . . . . . . . . . 32
136 8. Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . 32
137 8.1. 32 bit vs 64 bit . . . . . . . . . . . . . . . . . . . . . 32
138 8.2. IPv6 . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
139 8.3. Congestion Control Algorithms . . . . . . . . . . . . . . . 32
140 8.4. Piece Picking Algorithms . . . . . . . . . . . . . . . . . 33
141 8.5. Reciprocity Algorithms . . . . . . . . . . . . . . . . . . 33
142 8.6. Different crypto/hashing schemes . . . . . . . . . . . . . 33
143 9. Rationale . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
144 9.1. Design Goals . . . . . . . . . . . . . . . . . . . . . . . 34
145 9.2. Not TCP . . . . . . . . . . . . . . . . . . . . . . . . . 35
146 9.3. Generic Acknowledgments . . . . . . . . . . . . . . . . . 36
147 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . 37
148 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
149 Authors' addresses . . . . . . . . . . . . . . . . . . . . . . . . 39
158 This document describes the Generic Multiparty Protocol (swift),
159 designed from the ground up for the task of disseminating the same
160 content to a group of interested parties. Swift supports streaming
161 on-demand and live video content, as well as conventional
162 downloading, thus covering today's three major use cases for content
163 distribution. To fulfil this task, clients consuming the content are
164 put on equal footing with the servers initially providing the content
168 Grishchenko and Bakker Expires June 21, 2012 [Page 3]
170 Internet-Draft swift December 19, 2011
173 to create a peer-to-peer system where everyone can provide data. Each
174 peer connects to a random set of other peers resulting in a mesh-like
177 Swift uses a simple method of naming content based on self-
178 certification. In particular, content in swift is identified by a
179 single cryptographic hash that is the root hash in a Merkle hash tree
180 calculated recursively from the content [ABMRKL]. This self-
181 certifying hash tree allows every peer to directly detect when a
182 malicious peer tries to distribute fake content. It also ensures only
183 a small amount of information is needed to start a download (just the
184 root hash and some peer addresses).
186 Swift uses a novel method of addressing chunks of content called "bin
187 numbers". Bin numbers allow the addressing of a binary interval of
188 data using a single integer. This reduces the amount of state that
189 needs to be recorded per peer and the space needed to denote
190 intervals on the wire, making the protocol light-weight. In general,
191 this numbering system allows swift to work with simpler data
192 structures, e.g. to use arrays instead of binary trees, thus reducing
195 Swift is a generic protocol which can run directly on top of UDP,
196 TCP, HTTP, or as a layer below RTP, similar to SRTP [RFC3711]. As
197 such, swift defines a common set of messages that make up the
198 protocol, which can have different representations on the wire
199 depending on the lower-level protocol used. When the lower-level
200 transport is UDP, swift can also use different congestion control
201 algorithms and facilitate NAT traversal.
203 In addition, swift is extensible in the mechanisms it uses to promote
204 client contribution and prevent freeriding, that is, how to deal with
205 peers that only download content but never upload to others.
206 Furthermore, it can work with different peer discovery schemes, such
207 as centralized trackers or fast Distributed Hash Tables [JIM11].
209 This documents describes not only the swift protocol but also how it
210 satisfies the requirements for the IETF Peer-to-Peer Streaming
211 Protocol (PPSP) Working Group's peer protocol [PPSPCHART,I-D.ietf-
212 ppsp-reqs]. A reference implementation of swift over UDP is available
216 1.2. Conventions Used in This Document
218 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
219 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
220 document are to be interpreted as described in [RFC2119].
224 Grishchenko and Bakker Expires June 21, 2012 [Page 4]
226 Internet-Draft swift December 19, 2011
232 The basic unit of swift communication. A message will have
233 different representations on the wire depending on the transport
234 protocol used. Messages are typically multiplexed into a
235 datagram for transmission.
238 A sequence of messages that is offered as a unit to the
239 underlying transport protocol (UDP, etc.). The datagram is
240 swift's Protocol Data Unit (PDU).
243 Either a live transmission, a pre-recorded multimedia asset, or
247 A number denoting a specific binary interval of the content
248 (i.e., one or more consecutive chunks).
251 The basic unit in which the content is divided. E.g. a block of
255 The result of applying a cryptographic hash function, more
256 specifically a modification detection code (MDC) [HAC01], such
257 as SHA1 [FIPS180-2], to a piece of data.
260 The root in a Merkle hash tree calculated recursively from the
264 A group of peers participating in the distribution of the same
268 Unique identifier for a swarm of peers, in swift the root hash
269 of the content (video-on-demand,download) or a public key (live
273 An entity that records the addresses of peers participating in a
274 swarm, usually for a set of swarms, and makes this membership
275 information available to other peers on request.
280 Grishchenko and Bakker Expires June 21, 2012 [Page 5]
282 Internet-Draft swift December 19, 2011
286 When a peer A is choking peer B it means that A is currently not
287 willing to accept requests for content from B.
292 The basic unit of communication in swift is the message. Multiple
293 messages are multiplexed into a single datagram for transmission. A
294 datagram (and hence the messages it contains) will have different
295 representations on the wire depending on the transport protocol used
301 Consider a peer A that wants to download a certain content asset. To
302 commence a swift download, peer A must have the swarm ID of the
303 content and a list of one or more tracker contact points (e.g.
304 host+port). The list of trackers is optional in the presence of a
305 decentralized tracking mechanism. The swarm ID consists of the swift
306 root hash of the content (video-on-demand, downloading) or a public
307 key (live streaming).
309 Peer A now registers with the tracker following e.g. the PPSP tracker
310 protocol [I-D.ietf.ppsp-reqs] and receives the IP address and port of
311 peers already in the swarm, say B, C, and D. Peer A now sends a
312 datagram containing a HANDSHAKE message to B, C, and D. This message
313 serves as an end-to-end check that the peers are actually in the
314 correct swarm, and contains the root hash of the swarm. Peer B and C
315 respond with datagrams containing a HANDSHAKE message and one or more
316 HAVE messages. A HAVE message conveys (part of) the chunk
317 availability of a peer and thus contains a bin number that denotes
318 what chunks of the content peer B, resp. C have. Peer D sends a
319 datagram with just a HANDSHAKE and omits HAVE messages as a way of
322 2.2. Exchanging Chunks
324 In response to B and C, A sends new datagrams to B and C containing
325 HINT messages. A HINT or request message indicates the chunks that a
326 peer wants to download, and contains a bin number. The HINT messages
327 to B and C refer to disjunct sets of chunks. B and C respond with
328 datagrams containing HASH, HAVE and DATA messages. The HASH messages
329 contains all cryptographic hashes that peer A needs to verify the
330 integrity of the content chunk sent in the DATA message, using the
331 content's root hash as trusted anchor, see Sec. 3.5. Using these
332 hashes peer A verifies that the chunks received from B and C are
336 Grishchenko and Bakker Expires June 21, 2012 [Page 6]
338 Internet-Draft swift December 19, 2011
341 correct. It also updates the chunk availability of B and C using the
342 information in the received HAVE messages.
344 After processing, A sends a datagram containing HAVE messages for the
345 chunks it just received to all its peers. In the datagram to B and C
346 it includes an ACK message acknowledging the receipt of the chunks,
347 and adds HINT messages for new chunks. ACK messages are not used when
348 a reliable transport protocol is used. When e.g. C finds that A
349 obtained a chunk (from B) that C did not yet have, C's next datagram
350 includes a HINT for that chunk.
352 Peer D does not send HAVE messages to A when it downloads chunks from
353 other peers, until D decides to unchoke peer A. In the case, it sends
354 a datagram with HAVE messages to inform A of its current
355 availability. If B or C decide to choke A they stop sending HAVE and
356 DATA messages and A should then rerequest from other peers. They may
357 continue to send HINT messages, or periodic KEEPALIVE messages such
358 that A keeps sending them HAVE messages.
360 Once peer A has received all content (video-on-demand use case) it
361 stops sending messages to all other peers that have all content
362 (a.k.a. seeders). Peer A MAY also contact the tracker or another
363 source again to obtain more peer addresses.
368 Depending on the transport protocol used, peers should either use
369 explicit leave messages or implicitly leave a swarm by stopping to
370 respond to messages. Peers that learn about the departure should
371 remove these peers from the current peer list. The implicit-leave
372 mechanism works for both graceful and ungraceful leaves (i.e., peer
373 crashes or disconnects). When leaving gracefully, a peer should
374 deregister from the tracker following the (PPSP) tracker protocol.
379 In general, no error codes or responses are used in the protocol;
380 absence of any response indicates an error. Invalid messages are
383 For the sake of simplicity, one swarm of peers always deals with one
384 content asset (e.g. file) only. Retrieval of large collections of
385 files is done by retrieving a directory list file and then
386 recursively retrieving files, which might also turn to be directory
387 lists, as described in Sec. 3.11.
392 Grishchenko and Bakker Expires June 21, 2012 [Page 7]
394 Internet-Draft swift December 19, 2011
399 As an end-to-end check that the peers are actually in the correct
400 swarm, the initiating peer and the addressed peer SHOULD send a
401 HANDSHAKE message in the first datagrams they exchange. The only
402 payload of the HANDSHAKE message is the root hash of the content.
404 After the handshakes are exchanged, the initiator knows that the peer
405 really responds. Hence, the second datagram the initiator sends MAY
406 already contain some heavy payload. To minimize the number of
407 initialization roundtrips, implementations MAY dispense with the
408 HANDSHAKE message. To the same end, the first two datagrams exchanged
409 MAY also contain some minor payload, e.g. HAVE messages to indicate
410 the current progress of a peer or a HINT (see Sec. 3.6).
415 The HAVE message is used to convey which chunks a peers has
416 available, expressed in a new content addressing scheme called "bin
421 Swift employs a generic content addressing scheme based on binary
422 intervals ("bins" in short). The smallest interval is a chunk (e.g. a
423 N kilobyte block), the top interval is the complete 2**63 range. A
424 novel addition to the classical scheme are "bin numbers", a scheme of
425 numbering binary intervals which lays them out into a vector nicely.
426 Consider an chunk interval of width W. To derive the bin numbers of
427 the complete interval and the subintervals, a minimal balanced binary
428 tree is built that is at least W chunks wide at the base. The leaves
429 from left-to-right correspond to the chunks 0..W in the interval, and
430 have bin number I*2 where I is the index of the chunk (counting
431 beyond W-1 to balance the tree). The higher level nodes P in the tree
434 binP = (binL + binR) / 2
436 where binL is the bin of node P's left-hand child and binR is the bin
437 of node P's right-hand child. Given that each node in the tree
438 represents a subinterval of the original interval, each such
439 subinterval now is addressable by a bin number, a single integer. The
440 bin number tree of a interval of width W=8 looks like this:
448 Grishchenko and Bakker Expires June 21, 2012 [Page 8]
450 Internet-Draft swift December 19, 2011
466 So bin 7 represents the complete interval, 3 represents the interval
467 of chunk 0..3 and 1 represents the interval of chunks 0 and 1. The
468 special numbers 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit)
469 stands for an empty interval, and 0x7FFF...FFF stands for
475 When a receiving peer has successfully checked the integrity of a
476 chunk or interval of chunks it MUST send a HAVE message to all peers
477 it wants to interact with. The latter allows the HAVE message to be
478 used as a method of choking. The HAVE message MUST contain the bin
479 number of the biggest complete interval of all chunks the receiver
480 has received and checked so far that fully includes the interval of
481 chunks just received. So the bin number MUST denote at least the
482 interval received, but the receiver is supposed to aggregate and
483 acknowledge bigger bins, when possible.
485 As a result, every single chunk is acknowledged a logarithmic number
486 of times. That provides some necessary redundancy of acknowledgments
487 and sufficiently compensates for unreliable transport protocols.
489 To record which chunks a peer has in the state that an implementation
490 keeps for each peer, an implementation MAY use the "binmap" data
491 structure, which is a hybrid of a bitmap and a binary tree, discussed
492 in detail in [BINMAP].
497 When swift is run over an unreliable transport protocol, an
498 implementation MAY choose to use ACK messages to acknowledge received
499 data. When a receiving peer has successfully checked the integrity of
500 a chunk or interval of chunks C it MUST send a ACK message containing
504 Grishchenko and Bakker Expires June 21, 2012 [Page 9]
506 Internet-Draft swift December 19, 2011
509 the bin number of its biggest, complete, interval covering C to the
510 sending peer (see HAVE). To facilitate delay-based congestion
511 control, an ACK message contains a timestamp.
516 The DATA message is used to transfer chunks of content. The
517 associated HASH message carries cryptographic hashes that are
518 necessary for a receiver to check the the integrity of the chunk.
519 Swift's content integrity protection is based on a Merkle hash tree
520 and works as follows.
522 3.5.1. Merkle Hash Tree
524 Swift uses a method of naming content based on self-certification. In
525 particular, content in swift is identified by a single cryptographic
526 hash that is the root hash in a Merkle hash tree calculated
527 recursively from the content [ABMRKL]. This self-certifying hash tree
528 allows every peer to directly detect when a malicious peer tries to
529 distribute fake content. It also ensures only a small the amount of
530 information is needed to start a download (the root hash and some
531 peer addresses). For live streaming public keys and dynamic trees are
534 The Merkle hash tree of a content asset that is divided into N chunks
535 is constructed as follows. Note the construction does not assume
536 chunks of content to be fixed size. Given a cryptographic hash
537 function, more specifically a modification detection code (MDC)
538 [HAC01], such as SHA1, the hashes of all the chunks of the content
539 are calculated. Next, a binary tree of sufficient height is created.
540 Sufficient height means that the lowest level in the tree has enough
541 nodes to hold all chunk hashes in the set, as before, see HAVE
542 message. The figure below shows the tree for a content asset
543 consisting of 7 chunks. As before with the content addressing scheme,
544 the leaves of the tree correspond to a chunk and in this case are
545 assigned the hash of that chunk, starting at the left-most leaf. As
546 the base of the tree may be wider than the number of chunks, any
547 remaining leaves in the tree are assigned a empty hash value of all
548 zeros. Finally, the hash values of the higher levels in the tree are
549 calculated, by concatenating the hash values of the two children
550 (again left to right) and computing the hash of that aggregate. This
551 process ends in a hash value for the root node, which is called the
552 "root hash". Note the root hash only depends on the content and any
553 modification of the content will result in a different root hash.
560 Grishchenko and Bakker Expires June 21, 2012 [Page 10]
562 Internet-Draft swift December 19, 2011
574 1 5 9 13* = uncle hash
578 C0 C1 C2 C3 C4 C5 C6 E
579 =chunk index ^^ = empty hash
582 3.5.2. Content Integrity Verification
584 Assuming a peer receives the root hash of the content it wants to
585 download from a trusted source, it can can check the integrity of any
586 chunk of that content it receives as follows. It first calculates the
587 hash of the chunk it received, for example chunk C4 in the previous
588 figure. Along with this chunk it MUST receive the hashes required to
589 check the integrity of that chunk. In principle, these are the hash
590 of the chunk's sibling (C5) and that of its "uncles". A chunk's
591 uncles are the sibling Y of its parent X, and the uncle of that Y,
592 recursively until the root is reached. For chunk C4 its uncles are
593 bins 13 and 3, marked with * in the figure. Using this information
594 the peer recalculates the root hash of the tree, and compares it to
595 the root hash it received from the trusted source. If they match the
596 chunk of content has been positively verified to be the requested
597 part of the content. Otherwise, the sending peer either sent the
598 wrong content or the wrong sibling or uncle hashes. For simplicity,
599 the set of sibling and uncles hashes is collectively referred to as
602 In the case of live streaming the tree of chunks grows dynamically
603 and content is identified with a public key instead of a root hash,
604 as the root hash is undefined or, more precisely, transient, as long
605 as new data is generated by the live source. Live streaming is
606 described in more detail below, but content verification works the
607 same for both live and predefined content.
609 3.5.3. The Atomic Datagram Principle
611 As explained above, a datagram consists of a sequence of messages.
612 Ideally, every datagram sent must be independent of other datagrams,
616 Grishchenko and Bakker Expires June 21, 2012 [Page 11]
618 Internet-Draft swift December 19, 2011
621 so each datagram SHOULD be processed separately and a loss of one
622 datagram MUST NOT disrupt the flow. Thus, as a datagram carries zero
623 or more messages, neither messages nor message interdependencies
624 should span over multiple datagrams.
626 This principle implies that as any chunk is verified using its uncle
627 hashes the necessary hashes MUST be put into the same datagram as the
628 chunk's data (Sec. 3.5.4). As a general rule, if some additional
629 data is still missing to process a message within a datagram, the
630 message SHOULD be dropped.
632 The hashes necessary to verify a chunk are in principle its sibling's
633 hash and all its uncle hashes, but the set of hashes to sent can be
634 optimized. Before sending a packet of data to the receiver, the
635 sender inspects the receiver's previous acknowledgments (HAVE or ACK)
636 to derive which hashes the receiver already has for sure. Suppose,
637 the receiver had acknowledged bin 1 (first two chunks of the file),
638 then it must already have uncle hashes 5, 11 and so on. That is
639 because those hashes are necessary to check packets of bin 1 against
640 the root hash. Then, hashes 3, 7 and so on must be also known as they
641 are calculated in the process of checking the uncle hash chain.
642 Hence, to send bin 12 (i.e. the 7th chunk of content), the sender
643 needs to include just the hashes for bins 14 and 9, which let the
644 data be checked against hash 11 which is already known to the
647 The sender MAY optimistically skip hashes which were sent out in
648 previous, still unacknowledged datagrams. It is an optimization
649 tradeoff between redundant hash transmission and possibility of
650 collateral data loss in the case some necessary hashes were lost in
651 the network so some delivered data cannot be verified and thus has to
652 be dropped. In either case, the receiver builds the Merkle tree on-
653 demand, incrementally, starting from the root hash, and uses it for
656 In short, the sender MUST put into the datagram the missing hashes
657 necessary for the receiver to verify the chunk.
659 3.5.4. DATA and HASH Messages
661 Concretely, a peer that wants to send a chunk of content creates a
662 datagram that MUST consist of one or more HASH messages and a DATA
663 message. The datagram MUST contain a HASH message for each hash the
664 receiver misses for integrity checking. A HASH message MUST contain
665 the bin number and hash data for each of those hashes. The DATA
666 message MUST contain the bin number of the chunk and chunk itself. A
667 peer MAY send the required messages for multiple chunks in the same
672 Grishchenko and Bakker Expires June 21, 2012 [Page 12]
674 Internet-Draft swift December 19, 2011
679 While bulk download protocols normally do explicit requests for
680 certain ranges of data (i.e., use a pull model, for example,
681 BitTorrent [BITTORRENT]), live streaming protocols quite often use a
682 request-less push model to save round trips. Swift supports both
685 A peer MUST send a HINT message containing the bin of the chunk
686 interval it wants to download. A peer receiving a HINT message MAY
687 send out requested pieces. When it receives multiple HINTs (either in
688 one datagram or in multiple), the peer SHOULD process the HINTs
689 sequentially. When live streaming, it also may send some other chunks
690 in case it runs out of requests or for some other reason. In that
691 case the only purpose of HINT messages is to coordinate peers and to
692 avoid unnecessary data retransmission, hence the name.
696 3.7. Peer Address Exchange and NAT Hole Punching
698 Peer address exchange messages (or PEX messages for short) are common
699 for many peer-to-peer protocols. By exchanging peer addresses in
700 gossip fashion, peers relieve central coordinating entities (the
701 trackers) from unnecessary work. swift optionally features two types
702 of PEX messages: PEX_REQ and PEX_ADD. A peer that wants to retrieve
703 some peer addresses MUST send a PEX_REQ message. The receiving peer
704 MAY respond with a PEX_ADD message containing the addresses of
705 several peers. The addresses MUST be of peers it has recently
706 exchanged messages with to guarantee liveliness.
708 To unify peer exchange and NAT hole punching functionality, the
709 sending pattern of PEX messages is restricted. As the swift handshake
710 is able to do simple NAT hole punching [SNP] transparently, PEX
711 messages must be emitted in the way to facilitate that. Namely, once
712 peer A introduces peer B to peer C by sending a PEX_ADD message to C,
713 it SHOULD also send a message to B introducing C. The messages SHOULD
714 be within 2 seconds from each other, but MAY not be, simultaneous,
715 instead leaving a gap of twice the "typical" RTT, i.e. 300-600ms. The
716 peers are supposed to initiate handshakes to each other thus forming
717 a simple NAT hole punching pattern where the introducing peer
718 effectively acts as a STUN server [RFC5389]. Still, peers MAY ignore
719 PEX messages if uninterested in obtaining new peers or because of
720 security considerations (rate limiting) or any other reason.
722 The PEX messages can be used to construct a dedicated tracker peer.
728 Grishchenko and Bakker Expires June 21, 2012 [Page 13]
730 Internet-Draft swift December 19, 2011
735 A peer MUST send a datagram containing a KEEPALIVE message
736 periodically to each peer it wants to interact with in the future but
737 has no other messages to send them at present.
741 Peers MUST convey which version of the swift protocol they support
742 using a VERSION message. This message MUST be included in the initial
743 (handshake) datagrams and MUST indicate which version of the swift
744 protocol the sending peer supports.
747 3.10. Conveying Peer Capabilities
748 Peers may support just a subset of the swift messages. For example,
749 peers running over TCP may not accept ACK messages, or peers used
750 with a centralized tracking infrastructure may not accept PEX
751 messages. For these reasons, peers SHOULD signal which subset of the
752 swift messages they support by means of the MSGTYPE_RCVD message.
753 This message SHOULD be included in the initial (handshake) datagrams
754 and MUST indicate which swift protocol messages the sending peer
758 3.11. Directory Lists
760 Directory list files MUST start with magic bytes ".\n..\n". The rest
761 of the file is a newline-separated list of hashes and file names for
762 the content of the directory. An example:
766 1234567890ABCDEF1234567890ABCDEF12345678 readme.txt
767 01234567890ABCDEF1234567890ABCDEF1234567 big_file.dat
771 4. Automatic Detection of Content Size
773 In swift, the root hash of a static content asset, such as a video
774 file, along with some peer addresses is sufficient to start a
775 download. In addition, swift can reliably and automatically derive
776 the size of such content from information received from the network
777 when fixed sized chunks are used. As a result, it is not necessary to
778 include the size of the content asset as the metadata of the content,
779 in addition to the root hash. Implementations of swift MAY use this
780 automatic detection feature.
784 Grishchenko and Bakker Expires June 21, 2012 [Page 14]
786 Internet-Draft swift December 19, 2011
791 The ability for a newcomer peer to detect the size of the content
792 depends heavily on the concept of peak hashes. Peak hashes, in
793 general, enable two cornerstone features of swift: reliable file size
794 detection and download/live streaming unification (see Sec. 5). The
795 concept of peak hashes depends on the concepts of filled and
796 incomplete bins. Recall that when constructing the binary trees for
797 content verification and addressing the base of the tree may have
798 more leaves than the number of chunks in the content. In the Merkle
799 hash tree these leaves were assigned empty all-zero hashes to be able
800 to calculate the higher level hashes. A filled bin is now defined as
801 a bin number that addresses an interval of leaves that consists only
802 of hashes of content chunks, not empty hashes. Reversely, an
803 incomplete (not filled) bin addresses an interval that contains also
804 empty hashes, typically an interval that extends past the end of the
805 file. In the following figure bins 7, 11, 13 and 14 are incomplete
808 Formally, a peak hash is a hash in the Merkle tree defined over a
809 filled bin, whose sibling is defined over an incomplete bin.
810 Practically, suppose a file is 7162 bytes long and a chunk is 1
811 kilobyte. That file fits into 7 chunks, the tail chunk being 1018
812 bytes long. The Merkle tree for that file looks as follows. Following
813 the definition the peak hashes of this file are in bins 3, 9 and 12,
814 denoted with a *. E denotes an empty hash.
829 C0 C1 C2 C3 C4 C5 C6 E
832 Peak hashes can be explained by the binary representation of the
833 number of chunks the file occupies. The binary representation for 7
834 is 111. Every "1" in binary representation of the file's packet
835 length corresponds to a peak hash. For this particular file there are
836 indeed three peaks, bin numbers 3, 9, 12. The number of peak hashes
840 Grishchenko and Bakker Expires June 21, 2012 [Page 15]
842 Internet-Draft swift December 19, 2011
845 for a file is therefore also at most logarithmic with its size.
847 A peer knowing which bins contain the peak hashes for the file can
848 therefore calculate the number of chunks it consists of, and thus get
849 an estimate of the file size (given all chunks but the last are fixed
850 size). Which bins are the peaks can be securely communicated from
851 one (untrusted) peer A to another B by letting A send the peak hashes
852 and their bin numbers to B. It can be shown that the root hash that B
853 obtained from a trusted source is sufficient to verify that these are
854 indeed the right peak hashes, as follows.
856 Lemma: Peak hashes can be checked against the root hash.
858 Proof: (a) Any peak hash is always the left sibling. Otherwise, be it
859 the right sibling, its left neighbor/sibling must also be defined
860 over a filled bin, because of the way chunks are laid out in the
861 leaves, contradiction. (b) For the rightmost peak hash, its right
862 sibling is zero. (c) For any peak hash, its right sibling might be
863 calculated using peak hashes to the left and zeros for empty bins.
864 (d) Once the right sibling of the leftmost peak hash is calculated,
865 its parent might be calculated. (e) Once that parent is calculated,
866 we might trivially get to the root hash by concatenating the hash
867 with zeros and hashing it repeatedly.
869 Informally, the Lemma might be expressed as follows: peak hashes
870 cover all data, so the remaining hashes are either trivial (zeros) or
871 might be calculated from peak hashes and zero hashes.
873 Finally, once peer B has obtained the number of chunks in the content
874 it can determine the exact file size as follows. Given that all
875 chunks except the last are fixed size B just needs to know the size
876 of the last chunk. Knowing the number of chunks B can calculate the
877 bin number of the last chunk and download it. As always B verifies
878 the integrity of this chunk against the trusted root hash. As there
879 is only one chunk of data that leads to a successful verification the
880 size of this chunk must be correct. B can then determine the exact
883 (number of chunks -1) * fixed chunk size + size of last chunk
888 A swift implementation that wants to use automatic size detection
889 MUST operate as follows. When a peer B sends a DATA message for the
890 first time to a peer A, B MUST include all the peak hashes for the
891 content in the same datagram, unless A has already signalled earlier
892 in the exchange that it knows the peak hashes by having acknowledged
896 Grishchenko and Bakker Expires June 21, 2012 [Page 16]
898 Internet-Draft swift December 19, 2011
901 any bin, even the empty one. The receiver A MUST check the peak
902 hashes against the root hash to determine the approximate content
903 size. To obtain the definite content size peer A MUST download the
904 last chunk of the content from any peer that offers it.
912 In the case of live streaming a transfer is bootstrapped with a
913 public key instead of a root hash, as the root hash is undefined or,
914 more precisely, transient, as long as new data is being generated by
915 the live source. Live/download unification is achieved by sending
916 signed peak hashes on-demand, ahead of the actual data. As before,
917 the sender might use acknowledgements to derive which content range
918 the receiver has peak hashes for and to prepend the data hashes with
919 the necessary (signed) peak hashes. Except for the fact that the set
920 of peak hashes changes with time, other parts of the algorithm work
921 as described in Sec. 3.
923 As with static content assets in the previous section, in live
924 streaming content length is not known on advance, but derived
925 on-the-go from the peak hashes. Suppose, our 7 KB stream extended to
926 another kilobyte. Thus, now hash 7 becomes the only peak hash, eating
927 hashes 3, 9 and 12. So, the source sends out a SIGNED_HASH message to
930 The number of cryptographic operations will be limited. For example,
931 consider a 25 frame/second video transmitted over UDP. When each
932 frame is transmitted in its own chunk, only 25 signature verification
933 operations per second are required at the receiver for bitrates up to
934 ~12.8 megabit/second. For higher bitrates multiple UDP packets per
935 frame are needed and the number of verifications doubles.
941 6. Transport Protocols and Encapsulation
947 Currently, swift-over-UDP is the preferred deployment option.
948 Effectively, UDP allows the use of IP with minimal overhead and it
952 Grishchenko and Bakker Expires June 21, 2012 [Page 17]
954 Internet-Draft swift December 19, 2011
957 also allows userspace implementations. The default is to use chunks
958 of 1 kilobyte such that a datagram fits in an Ethernet-sized IP
959 packet. The bin numbering allows to use swift over Jumbo
960 frames/datagrams. Both DATA and HAVE/ACK messages may use e.g. 8
961 kilobyte packets instead of the standard 1 KiB. The hashing scheme
962 stays the same. Using swift with 512 or 256-byte packets is
963 theoretically possible with 64-bit byte-precise bin numbers, but IP
964 fragmentation might be a better method to achieve the same result.
967 6.1.2. Datagrams and Messages
969 When using UDP, the abstract datagram described above corresponds
970 directly to a UDP datagram. Each message within a datagram has a
971 fixed length, which depends on the type of the message. The first
972 byte of a message denotes its type. The currently defined types are:
987 Furthermore, integers are serialized in the network (big-endian) byte
988 order. So consider the example of an ACK message (Sec 3.4). It has
989 message type of 0x02 and a payload of a bin number, a four-byte
990 integer (say, 1); hence, its on the wire representation for UDP can
991 be written in hex as: "02 00000001". This hex-like two character-per-
992 byte notation is used to represent message formats in the rest of
997 As it is increasingly complex for peers to enable UDP communication
998 between each other due to NATs and firewalls, swift-over-UDP uses a
999 multiplexing scheme, called "channels", to allow multiple swarms to
1000 use the same UDP port. Channels loosely correspond to TCP connections
1001 and each channel belongs to a single swarm. When channels are used,
1002 each datagram starts with four bytes corresponding to the receiving
1008 Grishchenko and Bakker Expires June 21, 2012 [Page 18]
1010 Internet-Draft swift December 19, 2011
1013 6.1.4. HANDSHAKE and VERSION
1015 A channel is established with a handshake. To start a handshake, the
1016 initiating peer needs to know:
1018 (1) the IP address of a peer
1019 (2) peer's UDP port and
1020 (3) the root hash of the content (see Sec. 3.5.1).
1022 To do the handshake the initiating peer sends a datagram that MUST
1023 start with an all 0-zeros channel number followed by a VERSION
1024 message, then a HASH message whose payload is the root hash, and a
1025 HANDSHAKE message, whose only payload is a locally unused channel
1028 On the wire the datagram will look something like this:
1030 04 7FFFFFFF 1234123412341234123412341234123412341234
1032 (to unknown channel, handshake from channel 0x11 speaking protocol
1033 version 0x01, initiating a transfer of a file with a root hash
1036 The receiving peer MUST respond with a datagram that starts with the
1037 channel number from the sender's HANDSHAKE message, followed by a
1038 VERSION message, then a HANDSHAKE message, whose only payload is a
1039 locally unused channel number, followed by any other messages it
1042 Peer's response datagram on the wire:
1044 00 00000022 03 00000003
1045 (peer to the initiator: use channel number 0x22 for this transfer and
1046 proto version 0x01; I also have first 4 chunks of the file, see Sec.
1049 At this point, the initiator knows that the peer really responds; for
1050 that purpose channel ids MUST be random enough to prevent easy
1051 guessing. So, the third datagram of a handshake MAY already contain
1052 some heavy payload. To minimize the number of initialization
1053 roundtrips, the first two datagrams MAY also contain some minor
1054 payload, e.g. a couple of HAVE messages roughly indicating the
1055 current progress of a peer or a HINT (see Sec. 3.6). When receiving
1056 the third datagram, both peers have the proof they really talk to
1057 each other; three-way handshake is complete.
1059 A peer MAY explicit close a channel by sending a HANDSHAKE message
1060 that MUST contain an all 0-zeros channel number.
1064 Grishchenko and Bakker Expires June 21, 2012 [Page 19]
1066 Internet-Draft swift December 19, 2011
1075 A HAVE message (type 0x03) states that the sending peer has the
1076 complete specified bin and successfully checked its integrity:
1078 (got/checked first four kilobytes of a file/stream)
1084 An ACK message (type 0x02) acknowledges data that was received from
1085 its addressee; to facilitate delay-based congestion control, an
1086 ACK message contains a timestamp, in particular, a 64-bit microsecond
1088 02 00000002 12345678
1089 (got the second kilobyte of the file from you; my microsecond
1090 timer was showing 0x12345678 at that moment)
1095 A HASH message (type 0x04) consists of a four-byte bin number and
1096 the cryptographic hash (e.g. a 20-byte SHA1 hash)
1097 04 7FFFFFFF 1234123412341234123412341234123412341234
1102 A DATA message (type 0x01) consists of a four-byte bin number and the
1103 actual chunk. In case a datagram contains a DATA message, a sender
1104 MUST always put the data message in the tail of the datagram. For
1106 01 00000000 48656c6c6f20776f726c6421
1107 (This message accommodates an entire file: "Hello world!")
1112 Keepalives do not have a message type on UDP. They are just simple
1113 datagrams consisting of a 4-byte channel id only.
1120 Grishchenko and Bakker Expires June 21, 2012 [Page 20]
1122 Internet-Draft swift December 19, 2011
1125 6.1.10. Flow and Congestion Control
1127 Explicit flow control is not necessary in swift-over-UDP. In the case
1128 of video-on-demand the receiver will request data explicitly from
1129 peers and is therefore in control of how much data is coming towards
1130 it. In the case of live streaming, where a push-model may be used,
1131 the amount of data incoming is limited to the bitrate, which the
1132 receiver must be able to process otherwise it cannot play the stream.
1133 Should, for any reason, the receiver get saturated with data that
1134 situation is perfectly detected by the congestion control. Swift-
1135 over-UDP can support different congestion control algorithms, in
1136 particular, it supports the new IETF Low Extra Delay Background
1137 Transport (LEDBAT) congestion control algorithm that ensures that
1138 peer-to-peer traffic yields to regular best-effort traffic [LEDBAT].
1143 When run over TCP, swift becomes functionally equivalent to
1144 BitTorrent. Namely, most swift messages have corresponding BitTorrent
1145 messages and vice versa, except for BitTorrent's explicit interest
1146 declarations and choking/unchoking, which serve the classic
1147 implementation of the tit-for-tat algorithm [TIT4TAT]. However, TCP
1148 is not well suited for multiparty communication, as argued in Sec. 9.
1151 6.3. RTP Profile for PPSP
1153 In this section we sketch how swift can be integrated into RTP
1154 [RFC3550] to form the Peer-to-Peer Streaming Protocol (PPSP) [I-
1155 D.ietf-ppsp-reqs] running over UDP. The PPSP charter requires
1156 existing media transfer protocols be used [PPSPCHART]. Hence, the
1157 general idea is to define swift as a profile of RTP, in the same way
1158 as the Secure Real-time Transport Protocol (SRTP) [RFC3711]. SRTP,
1159 and therefore swift is considered ``a "bump in the stack"
1160 implementation which resides between the RTP application and the
1161 transport layer. [swift] intercepts RTP packets and then forwards an
1162 equivalent [swift] packet on the sending side, and intercepts [swift]
1163 packets and passes an equivalent RTP packet up the stack on the
1164 receiving side.'' [RFC3711].
1166 In particular, to encode a swift datagram in an RTP packet all the
1167 non-DATA messages of swift such as HINT and HAVE are postfixed to the
1168 RTP packet using the UDP encoding and the content of DATA messages is
1169 sent in the payload field. Implementations MAY omit the RTP header
1170 for packets without payload. This construction allows the streaming
1171 application to use of all RTP's current features, and with a
1172 modification to the Merkle tree hashing scheme (see below) meets
1176 Grishchenko and Bakker Expires June 21, 2012 [Page 21]
1178 Internet-Draft swift December 19, 2011
1181 swift's atomic datagram principle. The latter means that a receiving
1182 peer can autonomously verify the RTP packet as being correct content,
1183 thus preventing the spread of corrupt data (see requirement PPSP.SEC-
1186 The use of ACK messages for reliability is left as a choice of the
1187 application using PPSP.
1192 6.3.1.1. Joining a Swarm
1194 To commence a PPSP download a peer A must have the swarm ID of the
1195 stream and a list of one or more tracker contact points (e.g.
1196 host+port). The list of trackers is optional in the presence of a
1197 decentralized tracking mechanism. The swarm ID consists of the swift
1198 root hash of the content, which is divided into chunks (see
1201 Peer A now registers with the PPSP tracker following the tracker
1202 protocol [I-D.ietf.ppsp-reqs] and receives the IP address and RTP
1203 port of peers already in the swarm, say B, C, and D. Peer A now sends
1204 an RTP packet containing a HANDSHAKE without channel information to
1205 B, C, and D. This serves as an end-to-end check that the peers are
1206 actually in the correct swarm. Optionally A could include a HINT
1207 message in some RTP packets if it wants to start receiving content
1208 immediately. B and C respond with a HANDSHAKE and HAVE messages. D
1209 sends just a HANDSHAKE and omits HAVE messages as a way of choking A.
1212 6.3.1.2. Exchanging Chunks
1214 In response to B and C, A sends new RTP packets to B and C with HINTs
1215 for disjunct sets of chunks. B and C respond with the requested
1216 chunks in the payload and HAVE messages, updating their chunk
1217 availability. Upon receipt, A sends HAVE for the chunks received and
1218 new HINT messages to B and C. When e.g. C finds that A obtained a
1219 chunk (from B) that C did not yet have, C's response includes a HINT
1222 D does not send HAVE messages, instead if D decides to unchoke peer
1223 A, it sends an RTP packet with HAVE messages to inform A of its
1224 current availability. If B or C decide to choke A they stop sending
1225 HAVE and DATA messages and A should then rerequest from other peers.
1226 They may continue to send HINT messages, or exponentially slowing
1227 KEEPALIVE messages such that A keeps sending them HAVE messages.
1232 Grishchenko and Bakker Expires June 21, 2012 [Page 22]
1234 Internet-Draft swift December 19, 2011
1237 Once A has received all content (video-on-demand use case) it stops
1238 sending messages to all other peers that have all content (a.k.a.
1242 6.3.1.3. Leaving a Swarm
1244 Peers can implicitly leave a swarm by stopping to respond to
1245 messages. Sending peers should remove these peers from the current
1246 peer list. This mechanism works for both graceful and ungraceful
1247 leaves (i.e., peer crashes or disconnects). When leaving gracefully,
1248 a peer should deregister from the tracker following the PPSP tracker
1251 More explicit graceful leaves could be implemented using RTCP. In
1252 particular, a peer could send a RTCP BYE on the RTCP port that is
1253 derivable from a peer's RTP port for all peers in its current peer
1254 list. However, to prevent malicious peers from sending BYEs a form of
1255 peer authentication is required (e.g. using public keys as peer IDs
1261 Using swift as an RTP profile requires a change to the content
1262 integrity protection scheme (see Sec. 3.5). The fields in the RTP
1263 header, such as the timestamp and PT fields, must be protected by the
1264 Merkle tree hashing scheme to prevent malicious alterations.
1265 Therefore, the Merkle tree is no longer constructed from pure content
1266 chunks, but from the complete RTP packet for a chunk as it would be
1267 transmitted (minus the non-DATA swift messages). In other words, the
1268 hash of the leaves in the tree is the hash over the Authenticated
1269 Portion of the RTP packet as defined by SRTP, illustrated in the
1270 following figure (extended from [RFC3711]). There is no need for the
1271 RTP packets to be fixed size, as the hashing scheme can deal with
1272 variable-sized leaves.
1288 Grishchenko and Bakker Expires June 21, 2012 [Page 23]
1290 Internet-Draft swift December 19, 2011
1294 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1295 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+
1296 |V=2|P|X| CC |M| PT | sequence number | |
1297 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1299 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1300 | synchronization source (SSRC) identifier | |
1301 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ |
1302 | contributing source (CSRC) identifiers | |
1304 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1305 | RTP extension (OPTIONAL) | |
1306 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1308 | +-------------------------------+ |
1309 | | RTP padding | RTP pad count | |
1310 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+<+
1311 ~ swift non-DATA messages (REQUIRED) ~ |
1312 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1313 | length of swift messages (REQUIRED) | |
1314 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |
1316 Authenticated Portion ---+
1318 Figure: The format of an RTP-Swift packet.
1321 As a downside, with variable-sized payloads the automatic content
1322 size detection of Section 4 no longer works, so content length MUST
1323 be explicit in the metadata. In addition, storage on disk is more
1324 complex with out-of-order, variable-sized packets. On the upside,
1325 carrying RTP over swift allow decryption-less caching.
1327 As with UDP, another matter is how much data is carried inside each
1328 packet. An important swift-specific factor here is the resulting
1329 number of hash calculations per second needed to verify chunks.
1330 Experiments should be conducted to ensure they are not excessive for,
1331 e.g., mobile hardware.
1333 At present, Peer IDs are not required in this design.
1336 6.3.2. PPSP Requirements
1338 6.3.2.1. Basic Requirements
1340 - PPSP.REQ-1: The swift PEX message can also be used as the basis for
1344 Grishchenko and Bakker Expires June 21, 2012 [Page 24]
1346 Internet-Draft swift December 19, 2011
1349 a tracker protocol, to be discussed elsewhere.
1351 - PPSP.REQ-2: This draft preserves the properties of RTP.
1353 - PPSP.REQ-3: This draft does not place requirements on peer IDs,
1354 IP+port is sufficient.
1356 - PPSP.REQ-4: The content is identified by its root hash (video-on-
1357 demand) or a public key (live streaming).
1359 - PPSP.REQ-5: The content is partitioned by the streaming
1362 - PPSP.REQ-6: Each chunk is identified by a bin number (and its
1363 cryptographic hash.)
1365 - PPSP.REQ-7: The protocol is carried over UDP because RTP is.
1367 - PPSP.REQ-8: The protocol has been designed to allow meaningful data
1368 transfer between peers as soon as possible and to avoid unnecessary
1369 round-trips. It supports small and variable chunk sizes, and its
1370 content integrity protection enables wide scale caching.
1373 6.3.2.2. Peer Protocol Requirements
1375 - PPSP.PP.REQ-1: A GET_HAVE would have to be added to request which
1376 chunks are available from a peer, if the proposed push-based HAVE
1377 mechanism is not sufficient.
1379 - PPSP.PP.REQ-2: A set of HAVE messages satisfies this.
1381 - PPSP.PP.REQ-3: The PEX_REQ message satisfies this. Care should be
1382 taken with peer address exchange in general, as the use of such
1383 hearsay is a risk for the protocol as it may be exploited by
1384 malicious peers (as a DDoS attack mechanism). A secure tracking /
1385 peer sampling protocol like [PUPPETCAST] may be needed to make peer-
1386 address exchange safe.
1388 - PPSP.PP.REQ-4: HAVE messages convey current availability via a push
1391 - PPSP.PP.REQ-5: Bin numbering enables a compact representation of
1394 - PPSP.PP.REQ-6: A new PPSP specific Peer Report message would have
1395 to be added to RTCP.
1400 Grishchenko and Bakker Expires June 21, 2012 [Page 25]
1402 Internet-Draft swift December 19, 2011
1405 - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in
1409 6.3.2.3. Security Requirements
1411 - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
1412 [CLOSED] would have to be added.
1414 - PPSP.SEC.REQ-2: As RTP is carried verbatim over swift, RTP
1415 encryption can be used. Note that just encrypting the RTP part will
1416 allow for caching servers that are part of the swarm but do not need
1417 access to the decryption keys. They just need access to the swift
1418 HASHES in the postfix to verify the packet's integrity.
1420 - PPSP.SEC.REQ-3: RTP encryption or IPsec [RFC4303] can be used, if
1421 the swift messages must also be encrypted.
1423 - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
1424 indirect spread of corrupt content, as peers will only forward chunks
1425 to others if their integrity check out. Another protection mechanism
1426 is to not depend on hearsay (i.e., do not forward other peers'
1427 availability information), or to only use it when the information
1428 spread is self-certified by its subjects.
1430 Other attacks, such as a malicious peer claiming it has content but
1431 not replying, are still possible. Or wasting CPU and bandwidth at a
1432 receiving peer by sending packets where the DATA doesn't match the
1436 - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
1437 peer to detect a malicious or faulty sender, which it can
1438 subsequently ignore. Spreading this knowledge to other peers such
1439 that they know about this bad behavior is hearsay.
1442 - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
1443 malicious peers launch an Eclipse [ECLIPSE] attack on the initial
1444 injectors of the content (in particular in live streaming). The
1445 attack tries to let the injector upload to just malicious peers which
1446 then do not forward the content to others, thus stopping the
1447 distribution. An Eclipse attack could also be launched on an
1448 individual peer. Letting these injectors only use trusted trackers
1449 that provide true random samples of the population or using a secure
1450 peer sampling service [PUPPETCAST] can help negate such an attack.
1456 Grishchenko and Bakker Expires June 21, 2012 [Page 26]
1458 Internet-Draft swift December 19, 2011
1461 - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
1462 additional mechanisms such as DHTs [SECDHTS], but self-certification
1463 of addresses is needed. Self-certification means For example, that
1464 each peer has a public/private key pair [PERMIDS] and creates self-
1465 certified address changes that include the swarm ID and a timestamp,
1466 which are then exchanged among peers or stored in DHTs. See also
1467 discussion of PPSP.PP.REQ-3 above. Content distribution can continue
1468 as long as there are peers that have it available.
1470 - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
1471 trusted source is well-established in the BitTorrent protocol
1472 [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
1473 of this idea. Self-certification and not using hearsay are other
1474 lessons learned from existing distributed systems.
1476 - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
1477 self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.
1482 In this section we sketch how swift can be carried over HTTP
1483 [RFC2616] to form the PPSP running over TCP. The general idea is to
1484 encode a swift datagram in HTTP GET and PUT requests and their
1485 replies by transmitting all the non-DATA messages such as HINTs and
1486 HAVEs as headers and send DATA messages in the body. This idea
1487 follows the atomic datagram principle for each request and reply. So
1488 a receiving peer can autonomously verify the message as carrying
1489 correct data, thus preventing the spread of corrupt data (see
1490 requirement PPSP.SEC-REQ-4).
1492 A problem with HTTP is that it is a client/server protocol. To
1493 overcome this problem, a peer A uses a PUT request instead of a GET
1494 request if the peer B has indicated in a reply that it wants to
1495 retrieve a chunk from A. In cases where peer A is no longer
1496 interested in receiving requests from B (described below) B may need
1497 to establish a new HTTP connection to A to quickly download a chunk,
1498 instead of waiting for a convenient time when A sends another
1499 request. As an alternative design, two HTTP connections could be used
1500 always., but this is inefficient.
1504 6.4.1.1. Joining a Swarm
1506 To commence a PPSP download a peer A must have the swarm ID of the
1507 stream and a list of one or more tracker contact points, as above.
1508 The swarm ID as earlier also consists of the swift root hash of the
1512 Grishchenko and Bakker Expires June 21, 2012 [Page 27]
1514 Internet-Draft swift December 19, 2011
1517 content, divided in chunks by the streaming application (e.g. fixed-
1518 size chunks of 1 kilobyte for video-on-demand).
1520 Peer A now registers with the PPSP tracker following the tracker
1521 protocol [I-D.ietf-ppsp-reqs] and receives the IP address and HTTP
1522 port of peers already in the swarm, say B, C, and D. Peer A now
1523 establishes persistent HTTP connections with B, C, D and sends GET
1524 requests with the Request-URI set to /<encoded roothash>. Optionally
1525 A could include a HINT message in some requests if it wants to start
1526 receiving content immediately. A HINT is encoded as a Range header
1527 with a new "bins" unit [RFC2616,$14.35].
1529 B and C respond with a 200 OK reply with header-encoded HAVE
1530 messages. A HAVE message is encoded as an extended Accept-Ranges:
1531 header [RFC2616,$14.5] with the new bins unit and the possibility of
1532 listing the set of accepted bins. If no HINT/Range header was present
1533 in the request, the body of the reply is empty. D sends just a 200 OK
1534 reply and omits the HAVE/Accept-Ranges header as a way of choking A.
1536 6.4.1.2. Exchanging Chunks
1538 In response to B and C, A sends GET requests with Range headers,
1539 requesting disjunct sets of chunks. B and C respond with 206 Partial
1540 Content replies with the requested chunks in the body and Accept-
1541 Ranges headers, updating their chunk availability. The HASHES for the
1542 chunks are encoded in a new Content-Merkle header and the Content-
1543 Range is set to identify the chunk [RFC2616,$14.16]. A new
1544 "multipart-bin ranges" equivalent to the "multipart-bytes ranges"
1545 media type may be used to transmit multiple chunks in one reply.
1547 Upon receipt, A sends a new GET request with a HAVE/Accept-Ranges
1548 header for the chunks received and new HINT/Range headers to B and C.
1549 Now when e.g. C finds that A obtained a chunk (from B) that C did not
1550 yet have, C's response includes a HINT/Range for that chunk. In this
1551 case, A's next request to C is not a GET request, but a PUT request
1552 with the requested chunk sent in the body.
1554 Again, working around the fact that HTTP is a client/server protocol,
1555 peer A periodically sends HEAD requests to peer D (which was
1556 virtually choking A) that serve as keepalives and may contain
1557 HAVE/Accept-Ranges headers. If D decides to unchoke peer A, it
1558 includes an Accept-Ranges header in the "200 OK" reply to inform A of
1559 its current chunk availability.
1561 If B or C decide to choke A they start responding with 204 No Content
1562 replies without HAVE/Accept-Ranges headers and A should then re-
1563 request from other peers. However, if their replies contain
1564 HINT/Range headers A should keep on sending PUT requests with the
1568 Grishchenko and Bakker Expires June 21, 2012 [Page 28]
1570 Internet-Draft swift December 19, 2011
1573 desired data (another client/server workaround). If not, A should
1574 slowly send HEAD requests as keepalive and content availability
1577 Once A has received all content (video-on-demand use case) it closes
1578 the persistent connections to all other peers that have all content
1582 6.4.1.3. Leaving a Swarm
1584 Peers can explicitly leave a swarm by closing the connection. This
1585 mechanism works for both graceful and ungraceful leaves (i.e., peer
1586 crashes or disconnects). When leaving gracefully, a peer should
1587 deregister from the tracker following the PPSP tracker protocol.
1592 As mentioned earlier, this design suffers from the fact that HTTP is
1593 a client/server protocol. A solution where a peer establishes two
1594 HTTP connections with every other peer may be more elegant, but
1595 inefficient. The mapping of swift messages to headers remains the
1599 HAVE = Accept-Ranges
1600 HASH = Content-Merkle
1601 PEX = e.g. extended Content-Location
1603 The Content-Merkle header should include some parameters to indicate
1604 the hash function and chunk size (e.g. SHA1 and 1K) used to build the
1608 6.4.2. PPSP Requirements
1610 6.4.2.1. Basic Requirements
1612 - PPSP.REQ-1: The HTTP-based BitTorrent tracker protocol [BITTORRENT]
1613 can be used as the basis for a tracker protocol, to be discussed
1616 - PPSP.REQ-2: This draft preserves the properties of HTTP, but extra
1617 mechanisms may be necessary to protect against faulty or malicious
1620 - PPSP.REQ-3: This draft does not place requirements on peer IDs,
1624 Grishchenko and Bakker Expires June 21, 2012 [Page 29]
1626 Internet-Draft swift December 19, 2011
1629 IP+port is sufficient.
1631 - PPSP.REQ-4: The content is identified by its root hash (video-on-
1632 demand) or a public key (live streaming).
1634 - PPSP.REQ-5: The content is partitioned into chunks by the streaming
1635 application (see 6.4.1.1.)
1637 - PPSP.REQ-6: Each chunk is identified by a bin number (and its
1638 cryptographic hash.)
1640 - PPSP.REQ-7: The protocol is carried over TCP because HTTP is.
1643 6.4.2.2. Peer Protocol Requirements
1645 - PPSP.PP.REQ-1: A HEAD request can be used to find out which chunks
1646 are available from a peer, which returns the new Accept-Ranges
1649 - PPSP.PP.REQ-2: The new Accept-Ranges header satisfies this.
1651 - PPSP.PP.REQ-3: A GET with a request-URI requesting the peers of a
1652 resource (e.g. /<encoded roothash>/peers) would have to be added to
1653 request known peers from a peer, if the proposed push-based
1654 PEX/~Content-Location mechanism is not sufficient. Care should be
1655 taken with peer address exchange in general, as the use of such
1656 hearsay is a risk for the protocol as it may be exploited by
1657 malicious peers (as a DDoS attack mechanism). A secure tracking /
1658 peer sampling protocol like [PUPPETCAST] may be needed to make peer-
1659 address exchange safe.
1662 - PPSP.PP.REQ-4: HAVE/Accept-Ranges headers convey current
1665 - PPSP.PP.REQ-5: Bin numbering enables a compact representation of
1668 - PPSP.PP.REQ-6: A new PPSP specific Peer-Report header would have to
1671 - PPSP.PP.REQ-7: Transmission and chunk requests are integrated in
1680 Grishchenko and Bakker Expires June 21, 2012 [Page 30]
1682 Internet-Draft swift December 19, 2011
1685 6.4.2.3. Security Requirements
1687 - PPSP.SEC.REQ-1: An access control mechanism like Closed Swarms
1688 [CLOSED] would have to be added.
1690 - PPSP.SEC.REQ-2: As swift is carried over HTTP, HTTPS encryption can
1691 be used instead. Alternatively, just the body could be encrypted. The
1692 latter allows for caching servers that are part of the swarm but do
1693 not need access to the decryption keys (they need access to the swift
1694 HASHES in the headers to verify the packet's integrity).
1696 - PPSP.SEC.REQ-3: HTTPS encryption or the content encryption
1697 facilities of HTTP can be used.
1699 - PPSP.SEC.REQ-4: The Merkle tree hashing scheme prevents the
1700 indirect spread of corrupt content, as peers will only forward
1701 content to others if its integrity checks out. Another protection
1702 mechanism is to not depend on hearsay (i.e., do not forward other
1703 peers' availability information), or to only use it when the
1704 information spread is self-certified by its subjects.
1706 Other attacks such as a malicious peer claiming it has content, but
1707 not replying are still possible. Or wasting CPU and bandwidth at a
1708 receiving peer by sending packets where the body doesn't match the
1709 HASH/Content-Merkle headers.
1712 - PPSP.SEC.REQ-5: The Merkle tree hashing scheme allows a receiving
1713 peer to detect a malicious or faulty sender, which it can
1714 subsequently close its connection to and ignore. Spreading this
1715 knowledge to other peers such that they know about this bad behavior
1719 - PPSP.SEC.REQ-6: A risk in peer-to-peer streaming systems is that
1720 malicious peers launch an Eclipse [ECLIPSE] attack on the initial
1721 injectors of the content (in particular in live streaming). The
1722 attack tries to let the injector upload to just malicious peers which
1723 then do not forward the content to others, thus stopping the
1724 distribution. An Eclipse attack could also be launched on an
1725 individual peer. Letting these injectors only use trusted trackers
1726 that provide true random samples of the population or using a secure
1727 peer sampling service [PUPPETCAST] can help negate such an attack.
1730 - PPSP.SEC.REQ-7: swift supports decentralized tracking via PEX or
1731 additional mechanisms such as DHTs [SECDHTS], but self-certification
1732 of addresses is needed. Self-certification means For example, that
1736 Grishchenko and Bakker Expires June 21, 2012 [Page 31]
1738 Internet-Draft swift December 19, 2011
1741 each peer has a public/private key pair [PERMIDS] and creates self-
1742 certified address changes that include the swarm ID and a timestamp,
1743 which are then exchanged among peers or stored in DHTs. See also
1744 discussion of PPSP.PP.REQ-3 above. Content distribution can continue
1745 as long as there are peers that have it available.
1747 - PPSP.SEC.REQ-8: The verification of data via hashes obtained from a
1748 trusted source is well-established in the BitTorrent protocol
1749 [BITTORRENT]. The proposed Merkle tree scheme is a secure extension
1750 of this idea. Self-certification and not using hearsay are other
1751 lessons learned from existing distributed systems.
1753 - PPSP.SEC.REQ-9: Swift has built-in content integrity protection via
1754 self-certified naming of content, see SEC.REQ-5 and Sec. 3.5.1.
1757 7. Security Considerations
1759 As any other network protocol, the swift faces a common set of
1760 security challenges. An implementation must consider the possibility
1761 of buffer overruns, DoS attacks and manipulation (i.e. reflection
1762 attacks). Any guarantee of privacy seems unlikely, as the user is
1763 exposing its IP address to the peers. A probable exception is the
1764 case of the user being hidden behind a public NAT or proxy.
1769 8.1. 32 bit vs 64 bit
1771 While in principle the protocol supports bigger (>1TB) files, all the
1772 mentioned counters are 32-bit. It is an optimization, as using
1773 64-bit numbers on-wire may cost ~2% practical overhead. The 64-bit
1774 version of every message has typeid of 64+t, e.g. typeid 68 for
1775 64-bit hash message:
1776 44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567
1780 IPv6 versions of PEX messages use the same 64+t shift as just
1784 8.3. Congestion Control Algorithms
1786 Congestion control algorithm is left to the implementation and may
1787 even vary from peer to peer. Congestion control is entirely
1788 implemented by the sending peer, the receiver only provides clues,
1792 Grishchenko and Bakker Expires June 21, 2012 [Page 32]
1794 Internet-Draft swift December 19, 2011
1797 such as hints, acknowledgments and timestamps. In general, it is
1798 expected that servers would use TCP-like congestion control schemes
1799 such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to
1800 use weaker-than-TCP (least than best effort) congestion control, such
1801 as [LEDBAT] to minimize seeding counter-incentives.
1804 8.4. Piece Picking Algorithms
1806 Piece picking entirely depends on the receiving peer. The sender peer
1807 is made aware of preferred pieces by the means of HINT messages. In
1808 some scenarios it may be beneficial to allow the sender to ignore
1809 those hints and send unrequested data.
1812 8.5. Reciprocity Algorithms
1814 Reciprocity algorithms are the sole responsibility of the sender
1815 peer. Reciprocal intentions of the sender are not manifested by
1816 separate messages (as BitTorrent's CHOKE/UNCHOKE), as it does not
1817 guarantee anything anyway (the "snubbing" syndrome).
1820 8.6. Different crypto/hashing schemes
1822 Once a flavor of swift will need to use a different crypto scheme
1823 (e.g., SHA-256), a message should be allocated for that. As the root
1824 hash is supplied in the handshake message, the crypto scheme in use
1825 will be known from the very beginning. As the root hash is the
1826 content's identifier, different schemes of crypto cannot be mixed in
1827 the same swarm; different swarms may distribute the same content
1828 using different crypto.
1833 Historically, the Internet was based on end-to-end unicast and,
1834 considering the failure of multicast, was addressed by different
1835 technologies, which ultimately boiled down to maintaining and
1836 coordinating distributed replicas. On one hand, downloading from a
1837 nearby well-provisioned replica is somewhat faster and/or cheaper; on
1838 the other hand, it requires to coordinate multiple parties (the data
1839 source, mirrors/CDN sites/peers, consumers). As the Internet
1840 progresses to richer and richer content, the overhead of peer/replica
1841 coordination becomes dwarfed by the mass of the download itself.
1842 Thus, the niche for multiparty transfers expands. Still, current,
1843 relevant technologies are tightly coupled to a single use case or
1844 even infrastructure of a particular corporation. The mission of our
1848 Grishchenko and Bakker Expires June 21, 2012 [Page 33]
1850 Internet-Draft swift December 19, 2011
1853 project is to create a generic content-centric multiparty transport
1854 protocol to allow seamless, effortless data dissemination on the Net.
1858 | mirror-based peer-assisted peer-to-peer
1859 ------+----------------------------------------------------
1860 data | SunSITE CacheLogic VelociX BitTorrent
1861 VoD | YouTube Azureus(+seedboxes) SwarmPlayer
1862 live | Akamai Str. Octoshape, Joost PPlive
1864 The protocol must be designed for maximum genericity, thus focusing
1865 on the very core of the mission, contain no magic constants and no
1866 hardwired policies. Effectively, it is a set of messages allowing to
1867 securely retrieve data from whatever source available, in parallel.
1868 Ideally, the protocol must be able to run over IP as an independent
1869 transport protocol. Practically, it must run over UDP and TCP.
1874 The technical focus of the swift protocol is to find the simplest
1875 solution involving the minimum set of primitives, still being
1876 sufficient to implement all the targeted usecases (see Table 1),
1877 suitable for use in general-purpose software and hardware (i.e. a web
1878 browser or a set-top box). The five design goals for the protocol
1881 1. Embeddable kernel-ready protocol.
1882 2. Embrace real-time streaming, in- and out-of-order download.
1883 3. Have short warm-up times.
1884 4. Traverse NATs transparently.
1885 5. Be extensible, allow for multitude of implementation over
1886 diverse mediums, allow for drop-in pluggability.
1888 The objectives are referenced as (1)-(5).
1890 The goal of embedding (1) means that the protocol must be ready to
1891 function as a regular transport protocol inside a set-top box, mobile
1892 device, a browser and/or in the kernel space. Thus, the protocol must
1893 have light footprint, preferably less than TCP, in spite of the
1894 necessity to support numerous ongoing connections as well as to
1895 constantly probe the network for new possibilities. The practical
1896 overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We
1897 aim at <1KB per peer connected. Also, the amount of code necessary to
1898 make a basic implementation must be limited to 10KLoC of C.
1899 Otherwise, besides the resource considerations, maintaining and
1900 auditing the code might become prohibitively expensive.
1904 Grishchenko and Bakker Expires June 21, 2012 [Page 34]
1906 Internet-Draft swift December 19, 2011
1909 The support for all three basic usecases of real-time streaming,
1910 in-order download and out-of-order download (2) is necessary for the
1911 manifested goal of THE multiparty transport protocol as no single
1912 usecase dominates over the others.
1914 The objective of short warm-up times (3) is the matter of end-user
1915 experience; the playback must start as soon as possible. Thus any
1916 unnecessary initialization roundtrips and warm-up cycles must be
1917 eliminated from the transport layer.
1919 Transparent NAT traversal (4) is absolutely necessary as at least 60%
1920 of today's users are hidden behind NATs. NATs severely affect
1921 connection patterns in P2P networks thus impacting performance and
1922 fairness [MOLNAT,LUCNAT].
1924 The protocol must define a common message set (5) to be used by
1925 implementations; it must not hardwire any magic constants, algorithms
1926 or schemes beyond that. For example, an implementation is free to use
1927 its own congestion control, connection rotation or reciprocity
1928 algorithms. Still, the protocol must enable such algorithms by
1929 supplying sufficient information. For example, trackerless peer
1930 discovery needs peer exchange messages, scavenger congestion control
1931 may need timestamped acknowledgments, etc.
1936 To large extent, swift's design is defined by the cornerstone
1937 decision to get rid of TCP and not to reinvent any TCP-like
1938 transports on top of UDP or otherwise. The requirements (1), (4), (5)
1939 make TCP a bad choice due to its high per-connection footprint,
1940 complex and less reliable NAT traversal and fixed predefined
1941 congestion control algorithms. Besides that, an important
1942 consideration is that no block of TCP functionality turns out to be
1943 useful for the general case of swarming downloads. Namely,
1944 1. in-order delivery is less useful as peer-to-peer protocols
1945 often employ out-of-order delivery themselves and in either case
1946 out-of-order data can still be stored;
1947 2. reliable delivery/retransmissions are not useful because
1948 the same data might be requested from different sources; as
1949 in-order delivery is not required, packet losses might be
1950 patched up lazily, without stopping the flow of data;
1951 3. flow control is not necessary as the receiver is much less
1952 likely to be saturated with the data and even if so, that
1953 situation is perfectly detected by the congestion control;
1954 4. TCP congestion control is less useful as custom congestion
1955 control is often needed [LEDBAT].
1956 In general, TCP is built and optimized for a different usecase than
1960 Grishchenko and Bakker Expires June 21, 2012 [Page 35]
1962 Internet-Draft swift December 19, 2011
1965 we have with swarming downloads. The abstraction of a "data pipe"
1966 orderly delivering some stream of bytes from one peer to another
1967 turned out to be irrelevant. In even more general terms, TCP
1968 supports the abstraction of pairwise _conversations_, while we need
1969 a content-centric protocol built around the abstraction of a cloud
1970 of participants disseminating the same _data_ in any way and order
1971 that is convenient to them.
1973 Thus, the choice is to design a protocol that runs on top of
1974 unreliable datagrams. Instead of reimplementing TCP, we create a
1975 datagram-based protocol, completely dropping the sequential data
1976 stream abstraction. Removing unnecessary features of TCP makes it
1977 easier both to implement the protocol and to verify it; numerous TCP
1978 vulnerabilities were caused by complexity of the protocol's state
1979 machine. Still, we reserve the possibility to run swift on top of TCP
1982 Pursuing the maxim of making things as simple as possible but not
1983 simpler, we fit the protocol into the constraints of the transport
1984 layer by dropping all the transmission's technical metadata except
1985 for the content's root hash (compare that to metadata files used in
1986 BitTorrent). Elimination of technical metadata is achieved through
1987 the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file
1988 transfers and other techniques. As a result, a transfer is identified
1989 and bootstrapped by its root hash only.
1991 To avoid the usual layering of positive/negative acknowledgment
1992 mechanisms we introduce a scale-invariant acknowledgment system (see
1993 Sec 4.4). The system allows for aggregation and variable level of
1994 detail in requesting, announcing and acknowledging data, serves
1995 in-order and out-of-order retrieval with equal ease. Besides the
1996 protocol's footprint, we also aim at lowering the size of a minimal
1997 useful interaction. Once a single datagram is received, it must be
1998 checked for data integrity, and then either dropped or accepted,
1999 consumed and relayed.
2003 9.3. Generic Acknowledgments
2005 Generic acknowledgments came out of the need to simplify the
2006 data addressing/requesting/acknowledging mechanics, which tends
2007 to become overly complex and multilayered with the conventional
2008 approach. Take the BitTorrent+TCP tandem for example:
2010 1. The basic data unit is a byte of content in a file.
2011 2. BitTorrent's highest-level unit is a "torrent", physically a
2012 byte range resulting from concatenation of content files.
2016 Grishchenko and Bakker Expires June 21, 2012 [Page 36]
2018 Internet-Draft swift December 19, 2011
2021 3. A torrent is divided into "pieces", typically about a thousand
2022 of them. Pieces are used to communicate progress to other
2023 peers. Pieces are also basic data integrity units, as the torrent's
2024 metadata includes a SHA1 hash for every piece.
2025 4. The actual data transfers are requested and made in 16KByte
2026 units, named "blocks" or chunks.
2027 5. Still, one layer lower, TCP also operates with bytes and byte
2028 offsets which are totally different from the torrent's bytes and
2029 offsets, as TCP considers cumulative byte offsets for all content
2030 sent by a connection, be it data, metadata or commands.
2031 6. Finally, another layer lower, IP transfers independent datagrams
2032 (typically around 1.5 kilobyte), which TCP then reassembles into
2035 Obviously, such addressing schemes need lots of mappings; from
2036 piece number and block to file(s) and offset(s) to TCP sequence
2037 numbers to the actual packets and the other way around. Lots of
2038 complexity is introduced by mismatch of bounds: packet bounds are
2039 different from file, block or hash/piece bounds. The picture is
2040 typical for a codebase which was historically layered.
2042 To simplify this aspect, we employ a generic content addressing
2043 scheme based on binary intervals, or "bins" for short.
2049 Arno Bakker and Victor Grishchenko are partially supported by the
2050 P2P-Next project (http://www.p2p-next.org/), a research project
2051 supported by the European Community under its 7th Framework Programme
2052 (grant agreement no. 216217). The views and conclusions contained
2053 herein are those of the authors and should not be interpreted as
2054 necessarily representing the official policies or endorsements,
2055 either expressed or implied, of the P2P-Next project or the European
2058 The swift protocol was designed by Victor Grishchenko at Technische
2059 Universiteit Delft. The authors would like to thank the following
2060 people for their contributions to this draft: Mihai Capota, Raul
2061 Jiminez, Flutra Osmani, Riccardo Petrocco, Johan Pouwelse, and Raynor
2067 [RFC2119] Key words for use in RFCs to Indicate Requirement Levels
2068 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with
2072 Grishchenko and Bakker Expires June 21, 2012 [Page 37]
2074 Internet-Draft swift December 19, 2011
2077 Mochiweb", Part 3. http://www.metabrew.com/article/
2078 a-million-user-comet-application-with-mochiweb-part-3
2079 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:
2080 "Free-riding, Fairness, and Firewalls in P2P File-Sharing"
2081 Proc. Eighth International Conference on Peer-to-Peer Computing
2082 (P2P '08), Aachen, Germany, 8-11 Sept. 2008, pp. 301 - 310.
2083 [LUCNAT] L. D'Acunto and M. Meulpolder and R. Rahman and J.A.
2084 Pouwelse and H.J. Sips. "Modeling and Analyzing the Effects
2085 of Firewalls and NATs in P2P Swarming Systems". In Proc. of
2086 IEEE IPDPS (HotP2P), Atlanta, USA, April 23, 2010.
2087 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps
2088 and binary trees". Technical Report PDS-2011-005, Parallel and
2089 Distributed Systems Group, Fac. of Electrical Engineering,
2090 Mathematics, and Computer Science, Delft University of Technology,
2091 The Netherlands, April 2009.
2092 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication
2093 Across Network Address Translators",
2094 http://www.brynosaurus.com/pub/net/p2pnat/
2096 Federal Information Processing Standards Publication 180-2:
2097 "Secure Hash Standard" 2002 August 1.
2098 [MERKLE] Merkle, R. "Secrecy, Authentication, and Public Key Systems",
2099 Ph.D. thesis, Dept. of Electrical Engineering, Stanford University,
2100 CA, USA, 1979. pp 40-45.
2101 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BitTorrent
2102 Enhancement Proposal 30, Mar 2009.
2103 http://bittorrent.org/beps/bep_0030.html
2104 [CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly
2105 High-Speed TCP Variant", Proc. Third International Workshop
2106 on Protocols for Fast Long-Distance Networks (PFLDnet), Lyon,
2108 [LEDBAT] S. Shalunov et al. "Low Extra Delay Background Transport
2109 (LEDBAT)", IETF Internet-Draft draft-ietf-ledbat-congestion
2110 (work in progress), Oct 2011.
2111 http://datatracker.ietf.org/doc/draft-ietf-ledbat-congestion/
2112 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent",
2113 Proc. 1st Workshop on Economics of Peer-to-Peer Systems, Berkeley,
2115 [BITTORRENT] B. Cohen, "The BitTorrent Protocol Specification",
2116 February 2008, http://www.bittorrent.org/beps/bep_0003.html
2117 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V.
2118 Jacobson, "RTP: A Transport Protocol for Real-Time
2119 Applications", STD 64, RFC 3550, July 2003.
2120 [RFC3711] M. Baugher, D. McGrew, M. Naslund, E. Carrara, K. Norrman,
2121 "The Secure Real-time Transport Protocol (SRTP), RFC 3711, March
2123 [RFC5389] Rosenberg, J., Mahy, R., Matthews, P., and D. Wing,
2124 "Session Traversal Utilities for NAT (STUN)", RFC 5389, October 2008.
2128 Grishchenko and Bakker Expires June 21, 2012 [Page 38]
2130 Internet-Draft swift December 19, 2011
2133 [RFC2616] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter,
2134 P. Leach, T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1",
2136 [I-D.ietf-ppsp-reqs] Zong, N., Zhang, Y., Pascual, V., Williams, C.,
2137 and L. Xiao, "P2P Streaming Protocol (PPSP) Requirements",
2138 draft-ietf-ppsp-reqs-05 (work in progress), October 2011.
2139 [PPSPCHART] M. Stiemerling et al. "Peer to Peer Streaming Protocol (ppsp)
2140 Description of Working Group"
2141 http://datatracker.ietf.org/wg/ppsp/charter/
2142 [PERMIDS] A. Bakker et al. "Next-Share Platform M8--Specification
2143 Part", App. C. P2P-Next project deliverable D4.0.1 (revised),
2145 http://www.p2p-next.org/download.php?id=E7750C654035D8C2E04D836243E6526E
2146 [PUPPETCAST] A. Bakker and M. van Steen. "PuppetCast: A Secure Peer
2147 Sampling Protocol". Proceedings 4th Annual European Conference on
2148 Computer Network Defense (EC2ND'08), pp. 3-10, Dublin, Ireland,
2149 11-12 December 2008.
2150 [CLOSED] N. Borch, K. Michell, I. Arntzen, and D. Gabrijelcic: "Access
2151 control to BitTorrent swarms using closed swarms". In Proceedings
2152 of the 2010 ACM workshop on Advanced video streaming techniques
2153 for peer-to-peer networks and social networking (AVSTP2P '10).
2154 ACM, New York, NY, USA, 25-30.
2155 http://doi.acm.org/10.1145/1877891.1877898
2156 [ECLIPSE] E. Sit and R. Morris, "Security Considerations for
2157 Peer-to-Peer Distributed Hash Tables", IPTPS '01: Revised Papers
2158 from the First International Workshop on Peer-to-Peer Systems, pp.
2159 261-269, Springer-Verlag, London, UK, 2002.
2160 [SECDHTS] G. Urdaneta, G. Pierre, M. van Steen, "A Survey of DHT
2161 Security Techniques", ACM Computing Surveys, vol. 43(2), June 2011.
2162 [SWIFTIMPL] V. Grishchenko, et al. "Swift M40 reference implementation",
2163 http://swarmplayer.p2p-next.org/download/Next-Share-M40.tar.bz2
2164 (subdirectory Next-Share/TUD/swift-trial-r2242/), July 2011.
2165 [CCNWIKI] http://en.wikipedia.org/wiki/Content-centric_networking
2166 [HAC01] A.J. Menezes, P.C. van Oorschot and S.A. Vanstone. "Handbook of
2167 Applied Cryptography", CRC Press, October 1996 (Fifth Printing,
2169 [JIM11] R. Jimenez, F. Osmani, and B. Knutsson. "Sub-Second Lookups on
2170 a Large-Scale Kademlia-Based Overlay". 11th IEEE International
2171 Conference on Peer-to-Peer Computing 2011, Kyoto, Japan, Aug. 2011
2177 Technische Universiteit Delft
2178 Department EWI/ST/PDS
2184 Grishchenko and Bakker Expires June 21, 2012 [Page 39]
2186 Internet-Draft swift December 19, 2011
2192 Email: arno@cs.vu.nl
2240 Grishchenko and Bakker Expires June 21, 2012 [Page 40]