5 PPSP WG V. Grishchenko
\r
6 Internet-Draft TU Delft
\r
7 Intended status: Experimental April 12, 2010
\r
8 Expires: October 12, 2010
\r
11 The Generic Multiparty Transport Protocol (swift)
\r
12 <draft-ietf-ppsp-grishchenko-swift-00.txt>
\r
16 The swift is a generic multiparty (swarming) transport protocol.
\r
18 The TCP, today's dominating transport protocol, is connection/
\r
19 conversation-oriented. But traffic-wise, the currently dominating
\r
20 usecase is content dissemination. There is a multitude of
\r
21 incompatible approaches to resolve that discrepancy above/below the
\r
22 transport layer: peer-to-peer, CDN, caches, mirrors, multicast, etc.
\r
23 The swift aims at creating a single unified content-centric transport
\r
24 protocol serving as a lingua-franca of content distribution.
\r
25 To implement that ultimate data cloud model, the protocol has to
\r
26 unify use cases of data download, video-on-demand and live streaming.
\r
27 It must work in the settings of client-server, peer-to-peer, CDN or
\r
28 peer-assisted networks, effectively blending those architectures.
\r
33 This Internet-Draft is submitted to IETF in full conformance with the
\r
34 provisions of BCP 78 and BCP 79.
\r
36 Internet-Drafts are working documents of the Internet Engineering
\r
37 Task Force (IETF), its areas, and its working groups. Note that
\r
38 other groups may also distribute working documents as Internet-
\r
41 Internet-Drafts are draft documents valid for a maximum of six months
\r
42 and may be updated, replaced, or obsoleted by other documents at any
\r
43 time. It is inappropriate to use Internet-Drafts as reference
\r
44 material or to cite them other than as "work in progress."
\r
46 The list of current Internet-Drafts can be accessed at
\r
47 http://www.ietf.org/ietf/1id-abstracts.txt.
\r
49 The list of Internet-Draft Shadow Directories can be accessed at
\r
50 http://www.ietf.org/shadow.html.
\r
56 Grishchenko Expires October 12, 2010 [Page 1]
\r
58 Internet-Draft swift April 2010
\r
61 Copyright (c) 2010 IETF Trust and the persons identified as the
\r
62 document authors. All rights reserved.
\r
63 This document is subject to BCP 78 and the IETF Trust's Legal
\r
64 Provisions Relating to IETF Documents
\r
65 (http://trustee.ietf.org/license-info) in effect on the date of
\r
66 publication of this document. Please review these documents
\r
67 carefully, as they describe your rights and restrictions with respect
\r
68 to this document. Code Components extracted from this document must
\r
69 include Simplified BSD License text as described in Section 4.e of
\r
70 the Trust Legal Provisions and are provided without warranty as
\r
71 described in the Simplified BSD License.
\r
76 1. Requirements notation
\r
79 4. swift subsystems and design choices
\r
80 4.1. The atomic datagram principle
\r
81 4.2. Handshake and multiplexing
\r
82 4.3. Data integrity and on-demand Merkle hashes
\r
83 4.4. Generic acknowledgments
\r
84 4.5. Peer exchange and NAT hole punching
\r
85 4.6. Congestion control
\r
86 4.7. Hints and piece picking
\r
91 6. Security Considerations
\r
93 8. Normative References
\r
97 1. Requirements notation
\r
99 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
\r
100 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in
\r
101 this document are to be interpreted as described in [RFC2119].
\r
106 Historically, the Internet was based on end-to-end unicast
\r
107 and, considering the failure of multicast, was addressed by
\r
108 different technologies, which ultimately boiled down to maintaining
\r
112 Grishchenko Expires October 12, 2010 [Page 2]
\r
114 Internet-Draft swift April 2010
\r
117 and coordinating distributed replicas. On one hand, downloading
\r
118 from a nearby well-provisioned replica is somewhat faster and/or
\r
119 cheaper; on the other hand, it requires to coordinate multiple
\r
120 parties (the data source, mirrors/CDN sites/peers, consumers). As
\r
121 the Internet progresses to richer and richer content, the overhead
\r
122 of peer/replica coordination becomes dwarfed by the mass of the
\r
123 download itself. Thus, the niche for multiparty transfers expands.
\r
124 Still, current, relevant technologies are tightly coupled to a
\r
125 single usecase or even infrastructure of a particular corporation.
\r
126 The mission of the project is to create a generic content-centric
\r
127 multiparty transport protocol to allow seamless, effortless data
\r
128 dissemination on the Net.
\r
130 | mirror-based peer-assisted peer-to-peer
\r
131 ------+----------------------------------------------------
\r
132 data | SunSITE CacheLogic VelociX BitTorrent
\r
133 VoD | YouTube Azureus(+seedboxes) SwarmPlayer
\r
134 live | Akamai Str. Octoshape, Joost PPlive
\r
137 The protocol must be designed for maximum genericity, thus focusing
\r
138 on the very core of the mission, contain no magic constants and no
\r
139 hardwired policies. Effectively, it is a set of messages allowing to
\r
140 securely retrieve data from whatever source available, in parallel.
\r
141 The protocol must be able to run over IP as an independent transport
\r
142 protocol. For compatibility reasons, it must also run over UDP and
\r
148 The technical focus of the swift protocol is to find the simplest
\r
149 solution involving the minimum set of primitives, still being
\r
150 sufficient to implement all the targeted usecases (see Table 1),
\r
151 suitable for use in general-purpose software and hardware (i.e. a
\r
152 web browser or a set-top box).
\r
153 The five design goals for the protocol are:
\r
155 1. Embeddable kernel-ready protocol.
\r
156 2. Embrace real-time streaming, in- and out-of-order download.
\r
157 3. Have short warm-up times.
\r
158 4. Traverse NATs transparently.
\r
159 5. Pluggability/extensibility.
\r
161 Later in the draft, the objectives are referenced as (1)-(5).
\r
163 device, a browser or even in the kernel space. Thus, the protocol
\r
164 must have light footprint, preferably less than TCP, in spite
\r
168 Grishchenko Expires October 12, 2010 [Page 3]
\r
170 Internet-Draft swift April 2010
\r
173 the necessity to support numerous ongoing connections as well as to
\r
174 constantly probe the network for new possibilities. The practical
\r
175 overhead for TCP is estimated at 10KB per connection [HTTP1MLN]. We
\r
176 aim at <1KB per peer connected. Also, the amount of code necessary
\r
177 to make a basic implementation must be limited to 10KLoC of C.
\r
178 Otherwise, besides the resource considerations, maintaining and
\r
179 auditing the code might become prohibitively expensive.
\r
181 The support for all three basic usecases of real-time streaming,
\r
182 in-order download and out-of-order download (2) is necessary for
\r
183 the manifested goal of THE multiparty transport protocol as no
\r
184 single usecase dominates over the others.
\r
186 The objective of short warm-up times (3) is the matter of end-user
\r
187 experience; the playback must start as soon as possible. Thus
\r
188 any unnecessary initialization roundtrips and warm-up cycles must be
\r
189 eliminated from the transport layer.
\r
191 Transparent NAT traversal (4) is absolutely necessary as at least 60%
\r
192 of today's users are hidden behind NATs. NATs severely affect
\r
193 connection patterns in P2P networks thus impacting performance and
\r
194 fairness [MOLNAT,LUCNAT].
\r
196 The protocol must define a common message set (5) to be used by
\r
197 implementations; it must not hardwire any magic constants, algorithms
\r
198 or schemes beyound that. For example, an implementation is free to
\r
199 use its own congestion control, connection rotation or reciprocity
\r
200 algorithms. Still, the protocol must enable such algorithms by
\r
201 supplying sufficient information. For example, trackerless peer
\r
202 discovery needs peer exchange messages, scavenger congestion control
\r
203 may need timestamped acknowledgments, etc.
\r
206 4. swift subsystems and design choices
\r
208 To large extent, swift design is defined by the cornerstone decision
\r
209 to get rid of TCP and not to reinvent any TCP-like transports on
\r
210 top of UDP or otherwise. The requirements (1), (4), (5) make TCP a
\r
211 bad choice due to its high per-connection footprint, complex and
\r
212 less reliable NAT traversal and fixed predefined congestion control
\r
213 algorithms. Besides that, an important consideration is that no
\r
214 block of TCP functionality turns out to be useful for the general
\r
215 case of swarming downloads. Namely,
\r
216 1. in-order delivery is less useful as peer-to-peer protocols
\r
217 often employ out-of-order delivery themselves and in either case
\r
218 out-of-order data can still be stored;
\r
219 in-order delivery is not necessary, packet losses might be
\r
220 patched up lazily, without stopping the flow of data;
\r
224 Grishchenko Expires October 12, 2010 [Page 4]
\r
226 Internet-Draft swift April 2010
\r
229 3. flow control is not necessary as the receiver is much less
\r
230 likely to be saturated with the data and even if so, that
\r
231 situation is perfectly detected by the congestion control
\r
232 4. TCP congestion control is less useful as custom congestion
\r
233 control is often needed.
\r
234 In general, TCP is built and optimized for a different usecase than
\r
235 we have with swarmed downloads. The abstraction of a "data pipe"
\r
236 orderly delivering some stream of bytes from one peer to another
\r
237 turned out to be irrelevant. In even more general terms, TCP
\r
238 supports the abstraction of pairwise _conversations_, while we need
\r
239 a content-centric protocol built around the abstraction of a cloud
\r
240 of participants disseminating the same _data_ in any way and order
\r
241 that is convenient to them.
\r
243 Thus, the choice is to design the protocol that runs on top of
\r
244 unreliable datagrams. Instead of reimplementing TCP we create a
\r
245 datagram-based protocol, completely dropping the sequential data
\r
246 stream abstraction. Ripping off unnecessary features of TCP makes it
\r
247 easier both to implement the protocol, and to check/verify it; e.g.
\r
248 numerous TCP vulnerabilities were caused by complexity of the
\r
249 protocol's state machine. Still, we reserve the possibility to run
\r
250 swift on top of TCP or HTTP. The draft itself assumes swift-over-UDP
\r
251 implementation; the necessary adjustments to run the protocol over IP
\r
252 or TCP a listed in Sec. 5.
\r
254 Pursuing the maxim of making things as simple as possible but not
\r
255 simpler, we fit the protocol into the constraints of the transport
\r
256 layer by dropping all the transmission's technical metadata except
\r
257 for the content's root hash (compare that to metadata files used in
\r
258 BitTorrent). Elimination of technical metadata is achieved through
\r
259 the use of Merkle [MERKLE,ABMRKL] hash trees, exclusively single-file
\r
260 transfers and other techniques. As a result, a transfer is identified
\r
261 and bootstrapped by its root hash only.
\r
263 To avoid the usual layering of positive/negative acknowledgment
\r
264 mechanisms we introduce a scale-invariant acknowledgment system (see
\r
265 Sec 4.4). The system allows for aggregation and variable level of
\r
266 detail in requesting, announcing and acknowledging data, serves
\r
267 in-order and out-of-order retrieval with equal ease.
\r
268 Besides the protocol's footprint, we also aim at lowering the size of
\r
269 a minimal useful interaction. Once a single datagram is received,
\r
270 it must be checked for data integrity, and then either dropped or
\r
271 accepted, consumed and relayed.
\r
273 4.1. The atomic datagram principle
\r
275 loss of one datagram MUST NOT disrupt the flow. Thus, a datagram
\r
276 carries zero or more messages, and neither messages nor message
\r
280 Grishchenko Expires October 12, 2010 [Page 5]
\r
282 Internet-Draft swift April 2010
\r
285 interdependencies should span over multiple datagrams. In particular,
\r
286 any data piece is verified using uncle hash chains; all hashes
\r
287 necessary for verifying data integrity are put into the same datagram
\r
288 as the data (Sec. 4.3). As a general rule, if some additional data is
\r
289 still missing to process a message within a datagram, the message
\r
292 Each datagram starts with four bytes corresponding to the receiving
\r
293 channel number (Sec. 4.2). The rest of a datagram is a
\r
294 concatenation of messages. Each message within a datagram has fixed
\r
295 length, depending on the type of the message. The first byte of a
\r
296 message denotes its type. Integers are serialized in the network
\r
297 (big-endian) byte order. Variable-length messages, free-form text
\r
298 or JSON/bencoded objects are not allowed.
\r
299 Consider an example of an acknowledgment message (Sec 4.4). It has
\r
300 message type of 2 and a payload of a four-byte integer (say, 1); it
\r
301 might be written in hex as: "02 00000001". Later in the document, a
\r
302 hex-like two char per byte notation is used to represent message
\r
305 In case a datagram has a piece of data, a sender MUST always put
\r
306 the data message (type id 1) in the tail of a datagram. Such a
\r
307 message consists of type id, bin number (see Sec. 4.3) and the
\r
308 actual data. Normally there is 1 kilobyte of data, except the case
\r
309 when file size is not a multiple of 1024 bytes, so the tail packet
\r
310 is somewhat shorter. Example:
\r
311 01 00000000 48656c6c6f20776f726c6421
\r
312 (This message accommodates an entire file: "Hello world!")
\r
315 4.2. Handshake and multiplexing
\r
317 For the sake of simplicity, one transfer always deals with one file
\r
318 only. Retrieval of large collections of files is done by retrieving
\r
319 a directory list file and then recursively retrieving files, which
\r
320 might also turn to be directory lists (see Sec. 4.9). To distinguish
\r
321 different transfers between the same pair of peers, the protocol
\r
322 introduces an additional layer of multiplexing, the channels.
\r
323 "Channels" loosely correspond to TCP connections; "content" of a
\r
324 single "channel" is a single file. A channel is established with a
\r
325 handshake. To start a handshake, the initiating peer needs to know
\r
326 (1) the IP address of a peer
\r
327 (2) peer's UDP port and
\r
328 (3) the root hash of the content (see Sec. 4.5).
\r
329 The handshake is made by a HANDSHAKE message, whose only payload is
\r
330 a channel number. HANDSHAKE message type is 0. The initiating
\r
331 Initiator sends an initiating datagram to a peer:
\r
332 00000000 04 7FFFFFFF 1234123412341234123412341234123412341234
\r
336 Grishchenko Expires October 12, 2010 [Page 6]
\r
338 Internet-Draft swift April 2010
\r
342 (to unknown channel, handshake from channel 0x11, initiating a
\r
343 transfer of a file with a root hash 123...1234)
\r
345 Peer's response datagram:
\r
346 00000011 00 00000022 03 00000003
\r
347 (peer to the initiator: use channel number 0x22 for this transfer;
\r
348 I also have first 4 kilobytes of the file, see Sec. 4.3)
\r
350 At this point, the initiator knows that the peer really responds;
\r
351 for that purpose channel ids MUST be random enough to prevent easy
\r
352 guessing. So, the third datagram of a handshake MAY already contain
\r
353 some heavy payload. To minimize the number of initialization
\r
354 roundtrips, the first two datagrams MAY also contain some minor
\r
355 payload, e.g. a couple of HAVE messages roughly indicating the
\r
356 current progress of a peer or a HINT (see Sec. 4.7).
\r
358 (this is a simple zero-payload keepalive datagram consisting of
\r
359 a 4-byte channel id only. At this point both peers have the
\r
360 proof they really talk to each other; three-way handshake is
\r
363 In general, no error codes or responses are used in the protocol;
\r
364 absence of any response indicates an error. Invalid messages are
\r
365 discarded. Explicit closing of a channel may be achieved by setting
\r
366 channel number to zero by a handshake message: 00 00000000.
\r
368 Simple NAT hole punching [SNP] introduces the scenario when both
\r
369 parties of the handshake are initiators. To avoid creation of two
\r
370 transfers in the case both initiating datagrams get through, both
\r
371 peers must then act as responding peers. Thus, once an initiating
\r
372 datagram is sent and another initiating "counter"-datagram is
\r
373 received, the initiating peer sends a response datagram with the same
\r
374 channel id as in the outstanding initiating datagram.
\r
377 4.3. Generic acknowledgments
\r
379 Generic acknowledgments came out of the need to simplify the
\r
380 data addressing/requesting/acknowledging mechanics, which tends
\r
381 to become overly complex and multilayered with the conventional
\r
382 approach. Take BitTorrent+TCP tandem for example:
\r
384 1. The basic data unit is of course a byte of content in a file.
\r
385 2. BitTorrent's highest-level unit is a "torrent", physically a
\r
386 2. A torrent is divided into "pieces", typically about a thousand
\r
387 of them. Pieces are used to communicate own progress to other
\r
388 peers. Pieces are also basic data integrity units, as the torrent's
\r
392 Grishchenko Expires October 12, 2010 [Page 7]
\r
394 Internet-Draft swift April 2010
\r
397 metadata includes SHA1 hash for every piece.
\r
398 3. The actual data transfers are requested and made in 16KByte
\r
399 units, named "blocks" or chunks.
\r
400 5. Still, one layer lower, TCP also operates with bytes and byte
\r
401 offsets which are totally different from the torrent's bytes and
\r
402 offsets as TCP considers cumulative byte offsets for all content
\r
403 sent by a connection, be it data, metadata or commands.
\r
404 6. Finally, another layer lower, IP transfers independent datagrams
\r
405 (typically around a kilobyte), which TCP then reassembles into
\r
406 continuous streams.
\r
408 Obviously, such addressing schemes need lots of mappings; from
\r
409 piece number and block to file(s) and offset(s) to TCP sequence
\r
410 numbers to the actual packets and the other way around. Lots of
\r
411 complexity is introduced by mismatch of bounds: packet bounds are
\r
412 different from file, block or hash/piece bounds. The picture is
\r
413 typical for a codebase which was historically layered.
\r
415 To simplify this aspect, we employ a generic content addressing
\r
416 scheme based on binary intervals (shortcutted "bins"). The base
\r
417 interval is 1KB "packet", the top interval is the complete 2**63
\r
418 range. Till Sec. 4.4.1 any file is considered to be 2**k bytes long.
\r
419 The binary tree of intervals is simple, well-understood, correlates
\r
420 well with machine representation of integers and the structure of
\r
421 Merkle hashes (Sec. 4.4). A novel addition to the classical scheme
\r
422 are "bin numbers", a scheme of numbering binary intervals which
\r
423 lays them out into a vector nicely. Bin numbering is done in the
\r
424 order of interval's "center", ascending, namely:
\r
431 The number 0xFFFFFFFF (32-bit) or 0xFFFFFFFFFFFFFFFF (64-bit)
\r
432 stands for an empty interval; 0x7FFF...FFF stands for "everything".
\r
433 In general, this numbering system allows to work with
\r
434 simpler data structures, e.g. to use arrays instead of binary trees
\r
435 in many cases. As a minor convenience, it also allows to use one
\r
436 integer instead of two to denote an interval. By requiring every
\r
437 message to use bin numbers, we enforce genericity.
\r
439 Back to the acknowledgment message. A HAVE message (type 3) states
\r
440 that the sending peer obtained the specified bin and successfully
\r
441 checked its integrity:
\r
442 (got/checked first four kilobytes of a file/stream)
\r
444 The data is acknowledged in terms of bins; as a result, every
\r
448 Grishchenko Expires October 12, 2010 [Page 8]
\r
450 Internet-Draft swift April 2010
\r
453 single packet is acknowledged logarithmic number of times. That
\r
454 provides some necessary redundancy of acknowledgments and
\r
455 sufficiently compensates unreliability of datagrams. Compare that
\r
456 e.g. to TCP acknowledgments, which are (linearly) cumulative.
\r
457 For keeping the state information, an implementation MAY use the
\r
458 "binmap" data structure, which is a hybrid of a bitmap and a binary
\r
459 tree, discussed in detail in [BINMAP].
\r
460 An ACK message (type 2) acknowledges data that was received from
\r
461 its addressee; to facilitate delay-based congestion control, an
\r
462 ACK message contains a timestamp:
\r
464 02 00000002 12345678
\r
465 (got the second kilobyte of the file from you; my microsecond
\r
466 timer was showing 0x12345678 at that moment)
\r
469 4.4. Data integrity and on-demand Merkle hashes
\r
471 The integrity checking scheme is unified for two usecases of
\r
472 download and streaming. Also, it works down to the level of a
\r
473 single datagram by employing Merkle hash trees [MERKLE]. Peers
\r
474 receive chains of uncle hashes just in time to check the incoming
\r
475 data. As metadata is restricted to just a single root hash,
\r
476 newcomer peers derive the size of a file from hashes. That
\r
477 functionalities heavily depend on the concept of peak hashes,
\r
478 discussed in Sec. 4.4.1. Any specifics related to the cases of file
\r
479 download and streaming is discussed in Sec. 4.4.2, 4.4.3
\r
482 Here, we discuss the common part of the workflow. As a general
\r
483 rule, the sender SHOULD prepend data with hashes which are
\r
484 necessary for verifying that data, no more, no less. While some
\r
485 optimistic optimizations are definitely possible, the receiver
\r
486 SHOULD drop data if it is impossible to verify it. Before sending a
\r
487 packet of data to the receiver, the sender inspects the receiver's
\r
488 previous acknowledgments to derive which hashes the receiver
\r
489 already has for sure.
\r
490 Suppose, the receiver had acknowledged bin 1 (first two kilobytes
\r
491 of the file), then it must already have uncle hashes 5, 11 and so
\r
492 on. That is because those hashes are necessary to check packets of
\r
493 bin 1 against the root hash. Then, hashes 3, 7 and so on must be
\r
494 also known as they are calculated in the process of checking the
\r
495 uncle hash chain. Hence, to send bin 12 (i.e. the 7th kilobyte of
\r
496 data), the sender needs to prepend hashes for bins 14 and 9, which
\r
497 let the data be checked against hash 11 which is already known to
\r
498 message itself, i.e.:
\r
500 04 00000009 F01234567890ABCDEF1234567890ABCDEF123456
\r
504 Grishchenko Expires October 12, 2010 [Page 9]
\r
506 Internet-Draft swift April 2010
\r
509 04 0000000E 01234567890ABCDEF1234567890ABCDEF1234567
\r
510 (uncle hashes for the packet 12)
\r
511 01 0000000C DA1ADA1ADA1A...
\r
514 The sender MAY optimistically skip hashes which were sent out in
\r
515 previous (still unacknowledged) datagrams. It is an optimization
\r
516 tradeoff between redundant hash transmission and possibility of
\r
517 collateral data loss in the case some necessary hashes were lost in
\r
518 the network so some delivered data cannot be verified and thus
\r
520 In either way, the receiver builds the Merkle tree on-demand,
\r
521 incrementally, starting from the root hash, and uses it for data
\r
527 The concept of peak hashes enables two cornerstone features of swift:
\r
528 download/streaming unification and file size proving. Formally,
\r
529 peak hashes are hashes defined over filled bins, whose parent
\r
530 hashes are defined over incomplete (not filled) bins. Filled bin is
\r
531 a bin which does not extend past the end of the file, or, more
\r
532 precisely, contains no empty packets. Practically, we use peak
\r
533 hashes to cover the data range with logarithmic number of hashes,
\r
534 so each hash is defined over a "round" aligned 2^k interval.
\r
535 As an example, suppose a file is 7162 bytes long. That fits into
\r
536 7 packets, the tail packet being 1018 bytes long. The binary
\r
537 representation for 7 is 111. Here we might note that in general,
\r
538 every "1" in binary representation of the file's packet length
\r
539 corresponds to a peak hash. Namely, for this particular file we'll
\r
540 have three peaks, bin numbers 3, 9, 12.
\r
541 Thus, once a newcomer joins a swarm, the first peer sending him
\r
542 data prepends it with peak hashes; the newcomer checks them against
\r
543 the root hash (see Sec 4.4.2).
\r
545 04 00000003 1234567890ABCDEF1234567890ABCDEF12345678
\r
546 04 00000009 234567890ABCDEF1234567890ABCDEF123456789
\r
547 04 0000000C 34567890ABCDEF1234567890ABCDEF1234567890
\r
548 (this sequence of peak hashes proves that a file is 7KB long)
\r
551 4.4.2. Hash trees for files
\r
553 the entire data range (2**63 bytes). Every hash in the tree is
\r
554 defined in the usual way, as a SHA1 hash of a concatenation of two
\r
555 lower-level SHA1 hashes, corresponding to left and right data
\r
556 half-ranges respectively. For example,
\r
560 Grishchenko Expires October 12, 2010 [Page 10]
\r
562 Internet-Draft swift April 2010
\r
565 hash_1 = SHA1 (hash_0+hash_2)
\r
566 where + stands for concatenation and hash_i stands for Merkle hash
\r
567 of the bin number i. Obviously, that does not hold for the
\r
568 base-layer hashes, which are normal SHA1 hashes over 1KB data
\r
569 ranges ("packets"), except probably for the tail packet, which
\r
570 might have less than 1KB of data. The normal recursive formula does
\r
571 not apply to empty bins, i.e. bins that have no data absolutely;
\r
572 their hashes are just zeros.
\r
574 Lemma. Peak hashes could be checked against the root hash.
\r
575 Proof. (a) Any peak hash is always the left sibling. Otherwise, be
\r
576 it the right sibling, its left neighbor/sibling must also be
\r
577 defined over a filled bin, so their parent is also defined over a
\r
578 filled bin, contradiction. (b) For the rightmost peak hash, its
\r
579 right sibling is zero. (c) For any peak hash, its right sibling
\r
580 might be calculated using peak hashes to the left and zeros for
\r
581 empty bins. (d) Once the right sibling of the leftmost peak hash
\r
582 is calculated, its parent might be calculated. (e) Once that parent
\r
583 is calculated, we might trivially get to the root hash by
\r
584 concatenating the hash with zeros and hashing it repeatedly.
\r
586 Informally, the Lemma might be expressed as follows: peak hashes
\r
587 cover all data, so the remaining hashes are either trivial (zeros) or
\r
588 might be calculated from peak hashes and zero hashes.
\r
590 Thus, once a peer gets peak hashes and checks them against the
\r
591 root hash, it learns the file size and it also gets practical
\r
592 anchors for building uncle chains during the transmission (as the
\r
593 root hash is too high in the sky). A newcomer peer MAY signal it
\r
594 already has peak hashes by acknowledging any bin, even the empty one:
\r
598 Otherwise, the first of the senders SHOULD bootstrap him with all the
\r
602 4.4.3. Hash trees for streams
\r
604 In the case of live streaming a transfer is bootstrapped with a
\r
605 public key instead of a root hash, as the root hash is undefined
\r
606 or, more precisely, transient, as long as new data keeps coming.
\r
607 Stream/download unification is achieved by sending signed peak
\r
608 hashes on-demand, ahead of the actual data. Similarly to the
\r
609 hashes with the necessary (signed) peak hashes.
\r
610 Except for the fact that the set of peak hashes changes with the
\r
611 time, other parts of the algorithm work as described in 4.4.2 As we
\r
612 see, in both cases data length is not known on advance, but derived
\r
616 Grishchenko Expires October 12, 2010 [Page 11]
\r
618 Internet-Draft swift April 2010
\r
621 on-the-go from the peak hashes. Suppose, our 7KB stream extended to
\r
622 another kilobyte. Thus, now hash 7 becomes the only peak hash,
\r
623 eating hashes 3, 9 and 12, so the source sends out a signed peak hash
\r
624 message (type 7) to announce the fact:
\r
626 07 00000007 1234567890ABCDEF1234567890ABCDEF12345678 SOME-SIGN-HERE
\r
629 4.5. Peer exchange and NAT hole punching
\r
631 Peer exchange messages are common for many peer-to-peer protocols.
\r
632 By exchanging peer IP addresses in gossip fashion, the central
\r
633 coordinating entities (trackers) might be relieved of unnecessary
\r
634 work. Following the example of BitTorrent, swift features two types
\r
635 of PEX messages: "peer connected" (type 5) and "peer disconnected"
\r
636 (type 6). Peers are represented as IPv4 address-port pairs:
\r
638 (connected to 127.0.0.1:8000)
\r
640 To unify peer exchange and NAT hole punching functionality, the
\r
641 sending pattern of PEX messages is restricted. As swift handshake
\r
642 is able to do simple NAT hole punching [SNP] transparently, PEX
\r
643 messages must be emitted in the way to facilitate that. Namely,
\r
644 once peer A introduces peer B to peer C by sending a PEX message to
\r
645 C, it SHOULD also send a message to B introducing C. The messages
\r
646 SHOULD be within 2 seconds from each other, but MAY and better not be
\r
647 simultaneous, leaving a gap of twice the "typical" RTT, i.e.
\r
648 300-600ms. The peers are supposed to initiate handshakes to each
\r
649 other thus forming a simple NAT hole punching pattern where the
\r
650 introducing peer effectively acts as a STUN server. Still, peers
\r
651 MAY ignore PEX messages if uninterested in obtaining new peers or
\r
652 because of security considerations (rate limiting) or any other
\r
656 4.6. Congestion control
\r
658 swift employs pluggable congestion control. In general, it is
\r
659 expected that servers would use TCP-like congestion control schemes
\r
660 such as classic AIMD or CUBIC [CUBIC]. End-user peers are expected to
\r
661 use weaker-than-TCP (least than best effort) congestion control, such
\r
662 as [LEDBAT] to minimize seeding counter-incentives.
\r
665 While bulk download protocols normally do explicit requests for
\r
666 certain ranges of data (e.g. BitTorrent's REQUEST message), live
\r
667 streaming protocols quite often do without to save round trips.
\r
668 Explicit requests are often needed for security purposes; consider
\r
672 Grishchenko Expires October 12, 2010 [Page 12]
\r
674 Internet-Draft swift April 2010
\r
677 that BitTorrent can only verify hashes of complete pieces that
\r
678 might consist of multiple blocks requested from many peers.
\r
679 As swift has no such implications, it is supposed to work both
\r
680 ways. Namely, a peer SHOULD send out requested pieces, while it
\r
681 also may send some other data in case it runs out of requests or
\r
682 on some other reason. To emphasize that, request messages are named
\r
683 HINTs; their only purpose is to coordinate peers and to avoid
\r
684 unnecessary data retransmission. A peer is supposed to process
\r
685 HINTs sequentially. HINT message type is 8.
\r
687 (a peer requests fifth and sixth packets)
\r
690 4.8. Subsetting of the protocol
\r
692 As the same protocol is supposed to serve diverse usecases,
\r
693 different peers may support different subsets of messages. The
\r
694 supported subset SHOULD be signaled in the handshake packets.
\r
695 The SWIFT_MSGTYPE_RCVD message (type 9) serves exactly this
\r
696 purpose. It contains a 32-bit big-endian number with bits set
\r
697 to 1 at offsets corresponding to supported message type ids.
\r
698 E.g. for a tracker peer which receives only handshakes and
\r
699 (root) hashes, sends out handshakes and PEX_ADD messages, that
\r
700 message will look like:
\r
702 Peers running over TCP may not accept ACK messages, etc etc.
\r
705 4.9. Directory lists
\r
707 Directory list files MUST start with magic bytes ".\n..\n\n". The
\r
708 rest of the file is a newline-separated list of hashes and file names
\r
709 for the content of the directory. An example:
\r
711 1234567890ABCDEF1234567890ABCDEF12345678 readme.txt
\r
712 01234567890ABCDEF1234567890ABCDEF1234567 big_file.dat
\r
719 albeit it has downsides: first NAT/firewall compatibility problems,
\r
720 and second the necessity of in-kernel implementation on both ends.
\r
728 Grishchenko Expires October 12, 2010 [Page 13]
\r
730 Internet-Draft swift April 2010
\r
733 Currently, swift-over-UDP is the default deployment option.
\r
734 Effectively, UDP allows to use IP with minimal overhead, it also
\r
735 allows userspace implementations.
\r
736 Besides the classic 1KB packet scenario, the bin numbering allows
\r
737 to use swift over Jumbo frames/datagrams. Both data and
\r
738 acknowledgments may use e.g. 8KB packets instead of "standard"
\r
739 1KB. Hashing scheme stays the same.
\r
740 Using swift with 512 or 256-byte packets is theoretically possible
\r
741 with 64-bit byte-precise bin numbers, but IP fragmentation might be
\r
742 a better method to achieve the same result.
\r
747 If ran over TCP, the swift becomes functionally equivalent to
\r
748 BitTorrent. Namely, most swift messages have corresponding BitTorrent
\r
749 messages and vice versa, except for BitTorrent's explicit interest
\r
750 declarations and choking/unchoking, which serve the classic
\r
751 implementation of the tit-for-tat algorithm [TIT4TAT].
\r
754 6. Security Considerations
\r
756 As any other network protocol, the swift faces a common set of
\r
757 security challenges. An implementation must consider the possibility
\r
758 of buffer overruns, DoS attacks and manipulation (i.e. reflection
\r
759 attacks). Any guarantee of privacy seems unlikely, as the user is
\r
760 exposing its IP address to the peers. A probable exception is the
\r
761 case of user being hidden behind a public NAT or proxy.
\r
766 7.1. 32 bit vs 64 bit
\r
768 While in principle the protocol supports bigger (>1TB) files, all
\r
769 the mentioned counters are 32-bit. It is an optimization, as using
\r
770 64-bit numbers on-wire may cost ~2% practical overhead. 64-bit
\r
771 version of every message has typeid of 64+t, e.g. typeid 68 for
\r
772 64-bit hash message:
\r
773 44 000000000000000E 01234567890ABCDEF1234567890ABCDEF1234567
\r
774 Once 32-bit message is supported, its 64-bit version MUST be
\r
784 Grishchenko Expires October 12, 2010 [Page 14]
\r
786 Internet-Draft swift April 2010
\r
791 IPv6 versions of PEX messages use the same 64+t shift as in 6.1.1.
\r
794 7.3. Congestion control algorithms
\r
796 Congestion control algorithm is left to the implementation and may
\r
797 even vary from peer to peer. Congestion control is entirely
\r
798 implemented by the sending peer, the receiver only provides clues,
\r
799 such as hints, acknowledgments and timestamps.
\r
802 7.4. Piece picking algorithms
\r
804 Piece picking entirely depends on the receiving peer. The sender peer
\r
805 is made aware of preferred pieces by the means of HINT messages, but
\r
806 may ignore those hints and send unrequested data.
\r
809 7.5. Reciprocity algorithms
\r
811 Reciprocity algorithms is the sole responsibility of the sender peer.
\r
812 Reciprocal intentions of the sender are not manifested by separate
\r
813 messages (as BitTorrent's CHOKE/UNCHOKE), as it does not guarantee
\r
814 anything anyway (the "snubbing" syndrome).
\r
817 7.6. Different crypto/hashing schemes
\r
819 Once a flavour of swift will need to use a different crypto scheme
\r
820 (e.g. SHA-256), a message should be allocated for that. As the root
\r
821 hash is supplied in the handshake message, the crypto scheme in use
\r
822 will be known from the very beginning. As the root hash is the
\r
823 content's identifier, different schemes of crypto cannot be mixed
\r
824 in the same swarm; different swarms may distribute the same content
\r
825 using different crypto.
\r
830 [RFC2119] Key words for use in RFCs to Indicate Requirement Levels
\r
831 [HTTP1MLN] Richard Jones. "A Million-user Comet Application with
\r
832 Mochiweb", Part 3. http://www.metabrew.com/article/
\r
833 a-million-user-comet-application-with-mochiweb-part-3
\r
834 [MOLNAT] J.J.D. Mol, J.A. Pouwelse, D.H.J. Epema and H.J. Sips:
\r
835 "Free-riding, Fairness, and Firewalls in P2P File-Sharing"
\r
840 Grishchenko Expires October 12, 2010 [Page 15]
\r
842 Internet-Draft swift April 2010
\r
845 [BINMAP] V. Grishchenko, J. Pouwelse: "Binmaps: hybridizing bitmaps
\r
846 and binary trees" http://bouillon.math.usu.ru/articles/
\r
848 [SNP] B. Ford, P. Srisuresh, D. Kegel: "Peer-to-Peer Communication
\r
849 Across Network Address Translators",
\r
850 http://www.brynosaurus.com/pub/net/p2pnat/
\r
851 [MERKLE] Merkle, R. A Digital Signature Based on a Conventional
\r
852 Encryption Function. Proceedings CRYPTO'87, Santa Barbara, CA,
\r
853 USA, Aug 1987. pp 369-378.
\r
854 [ABMRKL] Arno Bakker: "Merkle hash torrent extension", BEP 30,
\r
855 http://bittorrent.org/beps/bep_0030.html
\r
856 [CUBIC] Injong Rhee, and Lisong Xu: "CUBIC: A New TCP-Friendly
\r
857 High-Speed TCP Variant",
\r
858 http://www4.ncsu.edu/~rhee/export/bitcp/cubic-paper.pdf
\r
859 [LEDBAT] S. Shalunov: "Low Extra Delay Background Transport (LEDBAT)"
\r
860 http://www.ietf.org/id/draft-ietf-ledbat-congestion-00.txt
\r
861 [TIT4TAT] Bram Cohen: "Incentives Build Robustness in BitTorrent", 2003,
\r
862 http://www.bittorrent.org/bittorrentecon.pdf
\r
869 Email: victor.grishchenko@gmail.com
\r
896 Grishchenko Expires October 12, 2010 [Page 16]
\r