5 PPSP WG V. Grishchenko
\r
6 Internet-Draft TU Delft
\r
7 Intended status: Experimental July 27, 2010
\r
8 Expires: December 27, 2010
\r
11 The Generic Multiparty Transport Protocol (swift)
\r
12 <draft-grishchenko-ppsp-swift-01.txt>
\r
16 The swift is a generic multiparty (swarming) transport protocol.
\r
18 The TCP, today's dominating transport protocol, is connection/
\r
19 conversation-oriented. But traffic-wise, the currently dominating
\r
20 usecase is content dissemination. There is a multitude of
\r
21 incompatible approaches to resolve that discrepancy above/below the
\r
22 transport layer: peer-to-peer, CDN, caches, mirrors, multicast, etc.
\r
23 The swift aims at creating a single unified content-centric transport
\r
24 protocol serving as a lingua-franca of content distribution. To
\r
25 implement that ultimate data cloud model, the protocol has to unify
\r
26 use cases of data download, video-on-demand and live streaming. It
\r
27 must work in the settings of client-server, peer-to-peer, CDN or
\r
28 peer-assisted networks, effectively blending those architectures.
\r
33 This Internet-Draft is submitted to IETF in full conformance with the
\r
34 provisions of BCP 78 and BCP 79.
\r
36 Internet-Drafts are working documents of the Internet Engineering
\r
37 Task Force (IETF), its areas, and its working groups. Note that
\r
38 other groups may also distribute working documents as Internet-
\r
41 Internet-Drafts are draft documents valid for a maximum of six months
\r
42 and may be updated, replaced, or obsoleted by other documents at any
\r
43 time. It is inappropriate to use Internet-Drafts as reference
\r
44 material or to cite them other than as "work in progress."
\r
46 The list of current Internet-Drafts can be accessed at
\r
47 http://www.ietf.org/ietf/1id-abstracts.txt.
\r
49 The list of Internet-Draft Shadow Directories can be accessed at
\r
50 http://www.ietf.org/shadow.html.
\r
56 Grishchenko Expires October 12, 2010 [Page 1]
\r
58 Internet-Draft swift April 2010
\r
61 Copyright (c) 2010 IETF Trust and the persons identified as the
\r
62 document authors. All rights reserved.
\r
63 This document is subject to BCP 78 and the IETF Trust's Legal
\r
64 Provisions Relating to IETF Documents
\r
65 (http://trustee.ietf.org/license-info) in effect on the date of
\r
66 publication of this document. Please review these documents
\r
67 carefully, as they describe your rights and restrictions with respect
\r
68 to this document. Code Components extracted from this document must
\r
69 include Simplified BSD License text as described in Section 4.e of
\r
70 the Trust Legal Provisions and are provided without warranty as
\r
71 described in the Simplified BSD License.
\r
76 1. Requirements notation
\r
79 4. swift subsystems and design choices
\r
80 4.1. The atomic datagram principle
\r
81 4.2. Handshake and multiplexing
\r
82 4.3. Generic acknowledgments
\r
83 4.4. Data integrity and on-demand Merkle hashes
\r
84 4.5. Peer exchange and NAT hole punching
\r
85 4.6. Data requests (HINTs)
\r
86 4.7. Subsetting of the protocol
\r
87 4.8. Directory lists
\r
92 6. Security Considerations
\r
98 1. Requirements notation
\r
100 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
\r
101 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
\r
102 this document are to be interpreted as described in [RFC2119].
\r
107 Historically, the Internet was based on end-to-end unicast
\r
108 and, considering the failure of multicast, was addressed by
\r
112 Grishchenko Expires October 12, 2010 [Page 2]
\r
114 Internet-Draft swift April 2010
\r
117 different technologies, which ultimately boiled down to maintaining
\r
118 and coordinating distributed replicas. On one hand, downloading
\r
119 from a nearby well-provisioned replica is somewhat faster and/or
\r
120 cheaper; on the other hand, it requires to coordinate multiple
\r
121 parties (the data source, mirrors/CDN sites/peers, consumers). As
\r
122 the Internet progresses to richer and richer content, the overhead
\r
123 of peer/replica coordination becomes dwarfed by the mass of the
\r
124 download itself. Thus, the niche for multiparty transfers expands.
\r
125 Still, current, relevant technologies are tightly coupled to a
\r
126 single usecase or even infrastructure of a particular corporation.
\r
127 The mission of the project is to create a generic content-centric
\r
128 multiparty transport protocol to allow seamless, effortless data
\r
129 dissemination on the Net.
\r
131 | mirror-based peer-assisted peer-to-peer
\r
132 ------+----------------------------------------------------
\r
133 data | SunSITE CacheLogic VelociX BitTorrent
\r
134 VoD | YouTube Azureus(+seedboxes) SwarmPlayer
\r
135 live | Akamai Str. Octoshape, Joost PPlive
\r
138 The protocol must be designed for maximum genericity, thus focusing
\r
139 on the very core of the mission, contain no magic constants and no
\r
140 hardwired policies. Effectively, it is a set of messages allowing to
\r
141 securely retrieve data from whatever source available, in parallel.
\r
142 The protocol must be able to run over IP as an independent transport
\r
143 protocol. For compatibility reasons, it must also run over UDP and
\r
149 The technical focus of the swift protocol is to find the simplest
\r
150 solution involving the minimum set of primitives, still being
\r
151 sufficient to implement all the targeted usecases (see Table 1),
\r
152 suitable for use in general-purpose software and hardware (i.e. a web
\r
153 browser or a set-top box). The five design goals for the protocol
\r
156 1. Embeddable kernel-ready protocol.
\r
157 2. Embrace real-time streaming, in- and out-of-order download.
\r
158 3. Have short warm-up times.
\r
159 4. Traverse NATs transparently.
\r
160 5. Be extensible, allow for multitude of implementation over
\r
161 diverse mediums, allow for drop-in pluggability.
\r
163 Later in the draft, the objectives are referenced as (1)-(5).
\r
168 Grishchenko Expires October 12, 2010 [Page 3]
\r
170 Internet-Draft swift April 2010
\r
173 The goal of embedding (1) means that the protocol must be ready to
\r
174 function as a regular transport protocol inside a set-top box, mobile
\r
175 device, a browser and/or in the kernel space. Thus, the protocol must
\r
176 have light footprint, preferably less than TCP, in spite the
\r
177 necessity to support numerous ongoing connections as well as to
\r
178 constantly probe the network for new possibilities. The practical
\r
179 overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We
\r
180 aim at <1KB per peer connected. Also, the amount of code necessary to
\r
181 make a basic implementation must be limited to 10KLoC of C.
\r
182 Otherwise, besides the resource considerations, maintaining and
\r
183 auditing the code might become prohibitively expensive.
\r
185 The support for all three basic usecases of real-time streaming,
\r
186 in-order download and out-of-order download (2) is necessary for the
\r
187 manifested goal of THE multiparty transport protocol as no single
\r
188 usecase dominates over the others.
\r
190 The objective of short warm-up times (3) is the matter of end-user
\r
191 experience; the playback must start as soon as possible. Thus any
\r
192 unnecessary initialization roundtrips and warm-up cycles must be
\r
193 eliminated from the transport layer.
\r
195 Transparent NAT traversal (4) is absolutely necessary as at least 60%
\r
196 of today's users are hidden behind NATs. NATs severely affect
\r
197 connection patterns in P2P networks thus impacting performance and
\r
198 fairness [MOLNAT,LUCNAT].
\r
200 The protocol must define a common message set (5) to be used by
\r
201 implementations; it must not hardwire any magic constants, algorithms
\r
202 or schemes beyound that. For example, an implementation is free to
\r
203 use its own congestion control, connection rotation or reciprocity
\r
204 algorithms. Still, the protocol must enable such algorithms by
\r
205 supplying sufficient information. For example, trackerless peer
\r
206 discovery needs peer exchange messages, scavenger congestion control
\r
207 may need timestamped acknowledgments, etc.
\r
210 4. swift subsystems and design choices
\r
212 To large extent, swift design is defined by the cornerstone decision
\r
213 to get rid of TCP and not to reinvent any TCP-like transports on top
\r
214 of UDP or otherwise. The requirements (1), (4), (5) make TCP a bad
\r
215 choice due to its high per-connection footprint, complex and less
\r
216 reliable NAT traversal and fixed predefined congestion control
\r
217 algorithms. Besides that, an important consideration is that no block
\r
218 of TCP functionality turns out to be useful for the general case of
\r
219 swarming downloads. Namely,
\r
220 1. in-order delivery is less useful as peer-to-peer protocols
\r
224 Grishchenko Expires October 12, 2010 [Page 4]
\r
226 Internet-Draft swift April 2010
\r
229 often employ out-of-order delivery themselves and in either case
\r
230 out-of-order data can still be stored;
\r
231 2. reliable delivery/retransmissions are less useful because
\r
232 the same data might be requested from different sources; as
\r
233 in-order delivery is not required, packet losses might be
\r
234 patched up lazily, without stopping the flow of data;
\r
235 3. flow control is not necessary as the receiver is much less
\r
236 likely to be saturated with the data and even if so, that
\r
237 situation is perfectly detected by the congestion control;
\r
238 4. TCP congestion control is less useful as custom congestion
\r
239 control is often needed [LEDBAT].
\r
240 In general, TCP is built and optimized for a different usecase than
\r
241 we have with swarmed downloads. The abstraction of a "data pipe"
\r
242 orderly delivering some stream of bytes from one peer to another
\r
243 turned out to be irrelevant. In even more general terms, TCP
\r
244 supports the abstraction of pairwise _conversations_, while we need
\r
245 a content-centric protocol built around the abstraction of a cloud
\r
246 of participants disseminating the same _data_ in any way and order
\r
247 that is convenient to them.
\r
249 Thus, the choice is to design a protocol that runs on top of
\r
250 unreliable datagrams. Instead of reimplementing TCP, we create a
\r
251 datagram-based protocol, completely dropping the sequential data
\r
252 stream abstraction. Removing unnecessary features of TCP makes it
\r
253 easier both to implement the protocol and to verify it; numerous TCP
\r
254 vulnerabilities were caused by complexity of the protocol's state
\r
255 machine. Still, we reserve the possibility to run swift on top of TCP
\r
256 or HTTP. The draft itself assumes swift-over-UDP implementation; the
\r
257 necessary adjustments to run the protocol over IP or TCP are listed
\r
260 Pursuing the maxim of making things as simple as possible but not
\r
261 simpler, we fit the protocol into the constraints of the transport
\r
262 layer by dropping all the transmission's technical metadata except
\r
263 for the content's root hash (compare that to metadata files used in
\r
264 BitTorrent). Elimination of technical metadata is achieved through
\r
265 the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file
\r
266 transfers and other techniques. As a result, a transfer is identified
\r
267 and bootstrapped by its root hash only.
\r
269 To avoid the usual layering of positive/negative acknowledgment
\r
270 mechanisms we introduce a scale-invariant acknowledgment system (see
\r
271 Sec 4.4). The system allows for aggregation and variable level of
\r
272 detail in requesting, announcing and acknowledging data, serves
\r
273 in-order and out-of-order retrieval with equal ease. Besides the
\r
274 protocol's footprint, we also aim at lowering the size of a minimal
\r
275 useful interaction. Once a single datagram is received, it must be
\r
276 checked for data integrity, and then either dropped or accepted,
\r
280 Grishchenko Expires October 12, 2010 [Page 5]
\r
282 Internet-Draft swift April 2010
\r
285 consumed and relayed.
\r
287 4.1. The atomic datagram principle
\r
289 Ideally, every datagram sent must be independent of other datagrams,
\r
290 so each datagram SHOULD be processed separately and a loss of one
\r
291 datagram MUST NOT disrupt the flow. Thus, a datagram carries zero or
\r
292 more messages, and neither messages nor message interdependencies
\r
293 should span over multiple datagrams. In particular, any data piece is
\r
294 verified using uncle hash chains; all hashes necessary for verifying
\r
295 data integrity are put into the same datagram as the data (Sec. 4.3).
\r
296 As a general rule, if some additional data is still missing to
\r
297 process a message within a datagram, the message SHOULD be dropped.
\r
299 Each datagram starts with four bytes corresponding to the receiving
\r
300 channel number (Sec. 4.2). The rest of a datagram is a concatenation
\r
301 of messages. Each message within a datagram has fixed length,
\r
302 depending on the type of the message. The first byte of a message
\r
303 denotes its type. Integers are serialized in the network (big-endian)
\r
304 byte order. Variable-length messages, free-form text or JSON/bencoded
\r
305 objects are not allowed. Consider an example of an acknowledgment
\r
306 message (Sec 4.4). It has message type of 2 and a payload of a four-
\r
307 byte integer (say, 1); it might be written in hex as: "02 00000001".
\r
308 Later in the document, a hex-like two char per byte notation is used
\r
309 to represent message formats.
\r
311 In case a datagram has a piece of data, a sender MUST always put the
\r
312 data message (type id 1) in the tail of a datagram. Such a message
\r
313 consists of type id, bin number (see Sec. 4.3) and the actual data.
\r
314 Normally there is 1 kilobyte of data, except the case when file size
\r
315 is not a multiple of 1024 bytes, so the tail packet is somewhat
\r
317 01 00000000 48656c6c6f20776f726c6421
\r
318 (This message accommodates an entire file: "Hello world!")
\r
321 4.2. Handshake and multiplexing
\r
323 For the sake of simplicity, one transfer always deals with one file
\r
324 only. Retrieval of large collections of files is done by retrieving a
\r
325 directory list file and then recursively retrieving files, which
\r
326 might also turn to be directory lists (see Sec. 4.9). To distinguish
\r
327 different transfers between the same pair of peers, the protocol
\r
328 introduces an additional layer of multiplexing, the channels.
\r
329 "Channels" loosely correspond to TCP connections; "content" of a
\r
330 single "channel" is a single file. A channel is established with a
\r
331 handshake. To start a handshake, the initiating peer needs to know:
\r
332 (1) the IP address of a peer
\r
336 Grishchenko Expires October 12, 2010 [Page 6]
\r
338 Internet-Draft swift April 2010
\r
341 (2) peer's UDP port and
\r
342 (3) the root hash of the content (see Sec. 4.5).
\r
343 The handshake is made by a HANDSHAKE message, whose only payload is a
\r
344 channel number. HANDSHAKE message type is 0. The initiating handshake
\r
345 must be followed by the transfer's root hash.
\r
347 The initiator sends first datagram to its peer:
\r
348 00000000 04 7FFFFFFF 1234123412341234123412341234123412341234
\r
350 (to unknown channel, handshake from channel 0x11, initiating a
\r
351 transfer of a file with a root hash 123...1234)
\r
353 Peer's response datagram:
\r
354 00000011 00 00000022 03 00000003
\r
355 (peer to the initiator: use channel number 0x22 for this transfer;
\r
356 I also have first 4 kilobytes of the file, see Sec. 4.3)
\r
358 At this point, the initiator knows that the peer really responds; for
\r
359 that purpose channel ids MUST be random enough to prevent easy
\r
360 guessing. So, the third datagram of a handshake MAY already contain
\r
361 some heavy payload. To minimize the number of initialization
\r
362 roundtrips, the first two datagrams MAY also contain some minor
\r
363 payload, e.g. a couple of HAVE messages roughly indicating the
\r
364 current progress of a peer or a HINT (see Sec. 4.7).
\r
366 (this is a simple zero-payload keepalive datagram consisting of
\r
367 a 4-byte channel id only. At this point both peers have the
\r
368 proof they really talk to each other; three-way handshake is
\r
371 In general, no error codes or responses are used in the protocol;
\r
372 absence of any response indicates an error. Invalid messages are
\r
373 discarded. Explicit closing of a channel may be achieved by setting
\r
374 channel number to zero by a handshake message: 00 00000000.
\r
376 Simple NAT hole punching [SNP] introduces the scenario when both
\r
377 parties of the handshake are initiators. To avoid creation of two
\r
378 transfers in the case both initiating datagrams get through, both
\r
379 peers must then act as responding peers. Thus, once an initiating
\r
380 datagram is sent and another initiating "counter"-datagram is
\r
381 received, the initiating peer sends a response datagram with the same
\r
382 channel id as in the outstanding initiating datagram.
\r
385 4.3. Generic acknowledgments
\r
387 Generic acknowledgments came out of the need to simplify the
\r
388 data addressing/requesting/acknowledging mechanics, which tends
\r
392 Grishchenko Expires October 12, 2010 [Page 7]
\r
394 Internet-Draft swift April 2010
\r
397 to become overly complex and multilayered with the conventional
\r
398 approach. Take BitTorrent+TCP tandem for example:
\r
400 1. The basic data unit is of course a byte of content in a file.
\r
401 2. BitTorrent's highest-level unit is a "torrent", physically a
\r
402 byte range resulting from concatenation of content files.
\r
403 3. A torrent is divided into "pieces", typically about a thousand
\r
404 of them. Pieces are used to communicate own progress to other
\r
405 peers. Pieces are also basic data integrity units, as the torrent's
\r
406 metadata includes SHA1 hash for every piece.
\r
407 4. The actual data transfers are requested and made in 16KByte
\r
408 units, named "blocks" or chunks.
\r
409 5. Still, one layer lower, TCP also operates with bytes and byte
\r
410 offsets which are totally different from the torrent's bytes and
\r
411 offsets, as TCP considers cumulative byte offsets for all content
\r
412 sent by a connection, be it data, metadata or commands.
\r
413 6. Finally, another layer lower, IP transfers independent datagrams
\r
414 (typically around a kilobyte), which TCP then reassembles into
\r
415 continuous streams.
\r
417 Obviously, such addressing schemes need lots of mappings; from
\r
418 piece number and block to file(s) and offset(s) to TCP sequence
\r
419 numbers to the actual packets and the other way around. Lots of
\r
420 complexity is introduced by mismatch of bounds: packet bounds are
\r
421 different from file, block or hash/piece bounds. The picture is
\r
422 typical for a codebase which was historically layered.
\r
424 To simplify this aspect, we employ a generic content addressing
\r
425 scheme based on binary intervals (shortcutted "bins"). The base
\r
426 interval is 1KB "packet", the top interval is the complete 2**63
\r
427 range. Till Sec. 4.4.1, any file is considered to be 2**k bytes long.
\r
428 The binary tree of intervals is simple, well-understood, correlates
\r
429 well with machine representation of integers and the structure of
\r
430 Merkle hashes (Sec. 4.4). A novel addition to the classical scheme
\r
431 are "bin numbers", a scheme of numbering binary intervals which
\r
432 lays them out into a vector nicely. Bin numbering is done in the
\r
433 order of interval's "center", ascending, namely:
\r
440 The number 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit) stands
\r
441 for an empty interval; 0x7FFF...FFF stands for "everything". In
\r
442 general, this numbering system allows to work with simpler data
\r
443 structures, e.g. to use arrays instead of binary trees in many cases.
\r
444 As a minor convenience, it also allows to use one integer instead of
\r
448 Grishchenko Expires October 12, 2010 [Page 8]
\r
450 Internet-Draft swift April 2010
\r
453 two to denote an interval. By requiring that every message uses bin
\r
454 numbers, we enforce genericity.
\r
456 Back to the acknowledgment message. A HAVE message (type 3) states
\r
457 that the sending peer obtained the specified bin and successfully
\r
458 checked its integrity:
\r
460 (got/checked first four kilobytes of a file/stream)
\r
462 The data is acknowledged in terms of bins; as a result, every
\r
463 single packet is acknowledged logarithmic number of times. That
\r
464 provides some necessary redundancy of acknowledgments and
\r
465 sufficiently compensates unreliability of datagrams. Compare that
\r
466 e.g. to TCP acknowledgments, which are (linearly) cumulative.
\r
467 For keeping the state information, an implementation MAY use the
\r
468 "binmap" data structure, which is a hybrid of a bitmap and a binary
\r
469 tree, discussed in detail in [BINMAP].
\r
470 An ACK message (type 2) acknowledges data that was received from
\r
471 its addressee; to facilitate delay-based congestion control, an
\r
472 ACK message contains a timestamp:
\r
474 02 00000002 12345678
\r
475 (got the second kilobyte of the file from you; my microsecond
\r
476 timer was showing 0x12345678 at that moment)
\r
479 4.4. Data integrity and on-demand Merkle hashes
\r
481 The integrity checking scheme is unified for two usecases of download
\r
482 and streaming. Also, it works down to the level of a single datagram
\r
483 by employing Merkle hash trees [MERKLE]. Peers receive chains of
\r
484 uncle hashes just in time to check the incoming data. As metadata is
\r
485 restricted to just a single root hash, newcomer peers derive the size
\r
486 of a file from hashes. That functionality heavily depends on the
\r
487 concept of peak hashes, discussed in Sec. 4.4.1. Any specifics
\r
488 related to the cases of file download and streaming is discussed in
\r
489 Sec. 4.4.2 and 4.4.3, respectively.
\r
491 Here, we discuss the common part of the workflow. As a general rule,
\r
492 the sender SHOULD prepend data with hashes which are necessary for
\r
493 verifying that data, no more, no less. While some optimistic
\r
494 optimizations are definitely possible, the receiver SHOULD drop data
\r
495 if it is impossible to verify it. Before sending a packet of data to
\r
496 the receiver, the sender inspects the receiver's previous
\r
497 acknowledgments to derive which hashes the receiver already has for
\r
498 sure. Suppose, the receiver had acknowledged bin 1 (first two
\r
499 kilobytes of the file), then it must already have uncle hashes 5, 11
\r
500 and so on. That is because those hashes are necessary to check
\r
504 Grishchenko Expires October 12, 2010 [Page 9]
\r
506 Internet-Draft swift April 2010
\r
509 packets of bin 1 against the root hash. Then, hashes 3, 7 and so on
\r
510 must be also known as they are calculated in the process of checking
\r
511 the uncle hash chain. Hence, to send bin 12 (i.e. the 7th kilobyte of
\r
512 data), the sender needs to prepend hashes for bins 14 and 9, which
\r
513 let the data be checked against hash 11 which is already known to the
\r
514 receiver. The sender MUST put into the datagram the chain of uncle
\r
515 hashes necessary for verification of the packet, always before the
\r
516 data message itself, i.e.:
\r
518 04 00000009 F01234567890ABCDEF1234567890ABCDEF123456
\r
519 04 0000000E 01234567890ABCDEF1234567890ABCDEF1234567
\r
520 (uncle hashes for the packet 12)
\r
521 01 0000000C DA1ADA1ADA1A...
\r
524 The sender MAY optimistically skip hashes which were sent out in
\r
525 previous (still unacknowledged) datagrams. It is an optimization
\r
526 tradeoff between redundant hash transmission and possibility of
\r
527 collateral data loss in the case some necessary hashes were lost in
\r
528 the network so some delivered data cannot be verified and thus has to
\r
529 be dropped. In either case, the receiver builds the Merkle tree on-
\r
530 demand, incrementally, starting from the root hash, and uses it for
\r
536 The concept of peak hashes enables two cornerstone features of swift:
\r
537 download/streaming unification and file size proving. Formally, peak
\r
538 hashes are hashes defined over filled bins, whose parent hashes are
\r
539 defined over incomplete (not filled) bins. Filled bin is a bin which
\r
540 does not extend past the end of the file, or, more precisely,
\r
541 contains no empty packets. Practically, we use peaks to cover the
\r
542 data range with logarithmic number of hashes, so each hash is defined
\r
543 over a "round" aligned 2^k interval. As an example, suppose a file is
\r
544 7162 bytes long. That fits into 7 packets, the tail packet being 1018
\r
545 bytes long. The binary representation for 7 is 111. Here we might
\r
546 note that in general, every "1" in binary representation of the
\r
547 file's packet length corresponds to a peak hash. Namely, for this
\r
548 particular file we'll have three peaks, bin numbers 3, 9, 12. Thus,
\r
549 once a newcomer joins a swarm, the first peer who sends him data
\r
550 prepends it with peak hashes. The newcomer checks them against the
\r
551 root hash (see Sec 4.4.2).
\r
553 04 00000003 1234567890ABCDEF1234567890ABCDEF12345678
\r
554 04 00000009 234567890ABCDEF1234567890ABCDEF123456789
\r
555 04 0000000C 34567890ABCDEF1234567890ABCDEF1234567890
\r
556 (this sequence of peak hashes proves that a file is 7KB long)
\r
560 Grishchenko Expires October 12, 2010 [Page 10]
\r
562 Internet-Draft swift April 2010
\r
565 4.4.2. Hash trees for files
\r
567 the entire data range (2**63 bytes). Every hash in the tree is
\r
568 defined in the usual way, as a SHA1 hash of a concatenation of two
\r
569 lower-level SHA1 hashes, which correspond to left and right data
\r
570 half-ranges respectively. For example,
\r
571 hash_1 = SHA1 (hash_0+hash_2)
\r
572 where + stands for concatenation and hash_i stands for Merkle hash
\r
573 of the bin number i. Obviously, that does not hold for the
\r
574 base-layer hashes. Those are normal SHA1 hashes over 1KB data
\r
575 ranges ("packets"), except probably for the tail packet, which
\r
576 might have less than 1KB of data. The normal recursive formula does
\r
577 not apply to empty bins, i.e. bins that have no data absolutely;
\r
578 their hashes are just zeros.
\r
580 Lemma. Peak hashes could be checked against the root hash.
\r
581 Proof. (a) Any peak hash is always the left sibling. Otherwise, be
\r
582 it the right sibling, its left neighbor/sibling must also be
\r
583 defined over a filled bin, so their parent is also defined over a
\r
584 filled bin, contradiction. (b) For the rightmost peak hash, its
\r
585 right sibling is zero. (c) For any peak hash, its right sibling
\r
586 might be calculated using peak hashes to the left and zeros for
\r
587 empty bins. (d) Once the right sibling of the leftmost peak hash
\r
588 is calculated, its parent might be calculated. (e) Once that parent
\r
589 is calculated, we might trivially get to the root hash by
\r
590 concatenating the hash with zeros and hashing it repeatedly.
\r
592 Informally, the Lemma might be expressed as follows: peak hashes
\r
593 cover all data, so the remaining hashes are either trivial (zeros) or
\r
594 might be calculated from peak hashes and zero hashes.
\r
596 Thus, once a peer gets peak hashes and checks them against the
\r
597 root hash, it learns the file size and it also gets practical
\r
598 anchors for building uncle chains during the transmission (as the
\r
599 root hash is too high in the sky). A newcomer peer MAY signal it
\r
600 already has peak hashes by acknowledging any bin, even the empty one:
\r
604 Otherwise, the first of the senders SHOULD bootstrap him with all the
\r
608 4.4.3. Hash trees for streams
\r
610 In the case of live streaming a transfer is bootstrapped with a
\r
611 public key instead of a root hash, as the root hash is undefined or,
\r
612 more precisely, transient, as long as new data keeps coming.
\r
616 Grishchenko Expires October 12, 2010 [Page 11]
\r
618 Internet-Draft swift April 2010
\r
621 Streaming/download unification is achieved by sending signed peak
\r
622 hashes on-demand, ahead of the actual data. Similarly to the previous
\r
623 case, the sender mightuse acknowledgements to derive which data range
\r
624 the receiver has peak hashes for and to prepend the data hashes with
\r
625 the necessary (signed) peak hashes. Except for the fact that the set
\r
626 of peak hashes changes with the time, other parts of the algorithm
\r
627 work as described in 4.4.2. As we see, in both cases data length is
\r
628 not known on advance, but derived on-the-go from the peak hashes.
\r
629 Suppose, our 7KB stream extended to another kilobyte. Thus, now hash
\r
630 7 becomes the only peak hash, eating hashes 3, 9 and 12. So, the
\r
631 source sends out a signed peak hash message (type 7) to announce the
\r
634 07 00000007 1234567890ABCDEF1234567890ABCDEF12345678 SOME-SIGN-HERE
\r
637 4.5. Peer exchange and NAT hole punching
\r
639 Peer exchange messages are common for many peer-to-peer protocols. By
\r
640 exchanging peer IP addresses in gossip fashion, peers relieve central
\r
641 coordinating entities (the trackers) from unnecessary work. Following
\r
642 the example of BitTorrent, swift features two types of PEX messages:
\r
643 "peer connected" (type 5) and "peer disconnected" (type 6). Peers are
\r
644 represented as IPv4 address-port pairs:
\r
646 (connected to 127.0.0.1:8000)
\r
648 To unify peer exchange and NAT hole punching functionality, the
\r
649 sending pattern of PEX messages is restricted. As swift handshake is
\r
650 able to do simple NAT hole punching [SNP] transparently, PEX messages
\r
651 must be emitted in the way to facilitate that. Namely, once peer A
\r
652 introduces peer B to peer C by sending a PEX message to C, it SHOULD
\r
653 also send a message to B introducing C. The messages SHOULD be within
\r
654 2 seconds from each other, but MAY and better not be simultaneous,
\r
655 leaving a gap of twice the "typical" RTT, i.e. 300-600ms. The peers
\r
656 are supposed to initiate handshakes to each other thus forming a
\r
657 simple NAT hole punching pattern where the introducing peer
\r
658 effectively acts as a STUN server. Still, peers MAY ignore PEX
\r
659 messages if uninterested in obtaining new peers or because of
\r
660 security considerations (rate limiting) or any other reason.
\r
663 4.6. Data requests (HINTs)
\r
665 While bulk download protocols normally do explicit requests for
\r
666 certain ranges of data (e.g. BitTorrent's REQUEST message), live
\r
667 streaming protocols quite often do without to save round trips.
\r
668 Explicit requests are often needed for security purposes; consider
\r
672 Grishchenko Expires October 12, 2010 [Page 12]
\r
674 Internet-Draft swift April 2010
\r
677 that BitTorrent can only verify hashes of complete pieces that might
\r
678 consist of multiple blocks requested from many peers. As swift has no
\r
679 such implications, it is supposed to work both ways. Namely, a peer
\r
680 SHOULD send out requested pieces, while it also may send some other
\r
681 data in case it runs out of requests or on some other reason. To
\r
682 emphasize that, request messages are named HINTs; their only purpose
\r
683 is to coordinate peers and to avoid unnecessary data retransmission.
\r
684 A peer SHOULD to process HINTs sequentially. HINT message type is 8.
\r
686 (a peer requests fifth and sixth packets)
\r
689 4.7. Subsetting of the protocol
\r
691 As the same protocol is supposed to serve diverse usecases, different
\r
692 peers may support different subsets of messages. The supported subset
\r
693 SHOULD be signaled in the handshake packets. The SWIFT_MSGTYPE_RCVD
\r
694 message (type 9) serves exactly this purpose. It contains a 32-bit
\r
695 big-endian number with bits set to 1 at offsets corresponding to
\r
696 supported message type ids. E.g. for a tracker peer which receives
\r
697 only handshakes and (root) hashes, sends out handshakes and PEX_ADD
\r
698 messages, that message will look like: 09 00000011 Peers running over
\r
699 TCP may not accept ACK messages, etc etc.
\r
702 4.8. Directory lists
\r
704 Directory list files MUST start with magic bytes ".\n..\n". The rest
\r
705 of the file is a newline-separated list of hashes and file names for
\r
706 the content of the directory. An example:
\r
710 1234567890ABCDEF1234567890ABCDEF12345678 readme.txt
\r
711 01234567890ABCDEF1234567890ABCDEF1234567 big_file.dat
\r
718 The most theoretically correct way is to run swift on top of IP, as
\r
719 another transport protocol like TCP or UDP. Albeit, that option has
\r
720 significant downsides. First, that is inevitable NAT/firewall
\r
721 compatibility problems. Second, that necessitates in-kernel
\r
722 implementation for all peers.
\r
728 Grishchenko Expires October 12, 2010 [Page 13]
\r
730 Internet-Draft swift April 2010
\r
735 Currently, swift-over-UDP is the default deployment option.
\r
736 Effectively, UDP allows to use IP with minimal overhead, it also
\r
737 allows userspace implementations. Besides the classic 1KB packet
\r
738 scenario, the bin numbering allows to use swift over Jumbo
\r
739 frames/datagrams. Both data and acknowledgments may use e.g. 8KB
\r
740 packets instead of "standard" 1KB. Hashing scheme stays the same.
\r
741 Using swift with 512 or 256-byte packets is theoretically possible
\r
742 with 64-bit byte-precise bin numbers, but IP fragmentation might be a
\r
743 better method to achieve the same result.
\r
748 If ran over TCP, the swift becomes functionally equivalent to
\r
749 BitTorrent. Namely, most swift messages have corresponding BitTorrent
\r
750 messages and vice versa, except for BitTorrent's explicit interest
\r
751 declarations and choking/unchoking, which serve the classic
\r
752 implementation of the tit-for-tat algorithm [TIT4TAT].
\r
755 6. Security Considerations
\r
757 As any other network protocol, the swift faces a common set of
\r
758 security challenges. An implementation must consider the possibility
\r
759 of buffer overruns, DoS attacks and manipulation (i.e. reflection
\r
760 attacks). Any guarantee of privacy seems unlikely, as the user is
\r
761 exposing its IP address to the peers. A probable exception is the
\r
762 case of user being hidden behind a public NAT or proxy.
\r
767 7.1. 32 bit vs 64 bit
\r
769 While in principle the protocol supports bigger (>1TB) files, all
\r
770 the mentioned counters are 32-bit. It is an optimization, as using
\r
771 64-bit numbers on-wire may cost ~2% practical overhead. 64-bit
\r
772 version of every message has typeid of 64+t, e.g. typeid 68 for
\r
773 64-bit hash message:
\r
774 44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567
\r
775 Once 32-bit message is supported, its 64-bit version MUST be
\r
784 Grishchenko Expires October 12, 2010 [Page 14]
\r
786 Internet-Draft swift April 2010
\r
791 IPv6 versions of PEX messages use the same 64+t shift as in 6.1.1.
\r
794 7.3. Congestion control algorithms
\r
796 Congestion control algorithm is left to the implementation and may
\r
797 even vary from peer to peer. Congestion control is entirely
\r
798 implemented by the sending peer, the receiver only provides clues,
\r
799 such as hints, acknowledgments and timestamps. In general, it is
\r
800 expected that servers would use TCP-like congestion control schemes
\r
801 such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to
\r
802 use weaker-than-TCP (least than best effort) congestion control, such
\r
803 as [LEDBAT] to minimize seeding counter-incentives.
\r
806 7.4. Piece picking algorithms
\r
808 Piece picking entirely depends on the receiving peer. The sender peer
\r
809 is made aware of preferred pieces by the means of HINT messages, but
\r
810 may ignore those hints and send unrequested data.
\r
813 7.5. Reciprocity algorithms
\r
815 Reciprocity algorithms is the sole responsibility of the sender peer.
\r
816 Reciprocal intentions of the sender are not manifested by separate
\r
817 messages (as BitTorrent's CHOKE/UNCHOKE), as it does not guarantee
\r
818 anything anyway (the "snubbing" syndrome).
\r
821 7.6. Different crypto/hashing schemes
\r
823 Once a flavour of swift will need to use a different crypto scheme
\r
824 (e.g. SHA-256), a message should be allocated for that. As the root
\r
825 hash is supplied in the handshake message, the crypto scheme in use
\r
826 will be known from the very beginning. As the root hash is the
\r
827 content's identifier, different schemes of crypto cannot be mixed in
\r
828 the same swarm; different swarms may distribute the same content
\r
829 using different crypto.
\r
834 [RFC2119] Key words for use in RFCs to Indicate Requirement Levels
\r
835 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with
\r
836 Mochiweb", Part 3. http://www.metabrew.com/article/
\r
840 Grishchenko Expires October 12, 2010 [Page 15]
\r
842 Internet-Draft swift April 2010
\r
845 a-million-user-comet-application-with-mochiweb-part-3
\r
846 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:
\r
847 "Free-riding, Fairness, and Firewalls in P2P File-Sharing"
\r
849 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps
\r
850 and binary trees" http://bouillon.math.usu.ru/articles/
\r
852 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication
\r
853 Across Network Address Translators",
\r
854 http://www.brynosaurus.com/pub/net/p2pnat/
\r
855 [MERKLE] Merkle, R. A Digital Signature Based on a Conventional
\r
856 Encryption Function. Proceedings CRYPTO'87, Santa Barbara, CA,
\r
857 USA, Aug 1987. pp 369-378.
\r
858 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BEP 30,
\r
859 http://bittorrent.org/beps/bep_0030.html
\r
860 [CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly
\r
861 High-Speed TCP Variant",
\r
862 http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf
\r
863 [LEDBAT] S. Shalunov: "Low Extra Delay Background Transport (LEDBAT)"
\r
864 http://www.ietf.org/id/draft-ietf-ledbat-congestion-00.txt
\r
865 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", 2003,
\r
866 http://www.bittorrent.org/bittorrentecon.pdf
\r
872 Mekelweg 4, HB 9.240
\r
876 Email: victor.grishchenko@gmail.com
\r
896 Grishchenko Expires October 12, 2010 [Page 16]
\r