This is a list of arguments and counter arguments about raising the Internet MTU. The central position is that MTU should scale with data rate. See the main page on raising the Internet MTU.
TCP worked really well on the 1988 Internet. After all, Van Jacobson tested his landmark TCP congestion control algorithms at this time. When the NSFnet backbone was T1 (1.5 Mbit/s) and the default Maximum Segment Size (MSS) was 512 bytes, typical packet transmission times were about 3 ms. With the lower data rate the coast-to-coast Round-Trip-Time was about 120ms, so it only took 40 packets in flight to ``fill the pipe''.
Imagine a ``jumbo TCP'' that uses ``blocks'' (1 kByte) instead of ``bytes'' for its basic data unit (and links with an appropriate MTU). Then, a jumbo TCP packet using an MSS of 1000 Blocks (1 MByte) on 10 Gb/s Ethernet would have a packet time of 800 microseconds and would only take about 100 packets in flight to fill a coast-to-coast pipe.
Jumbo TCP with 100 packets in flight would have roughly the same protocol behavior and dynamics as plain old TCP had in 1988 with 40 packets of 512 bytes in flight. Clearly much of the network overhead (CPU, I/O, and memory processing) is proportional to packet rate (or the number of packets in flight). However, there are a couple of overhead terms that scale as the square of the pipe size in packets. In particular, the gain (and noise immunity) of the TCP congestion control system goes as the square of the window size in packets. On this path, these terms would be (1000/1.5)^2, or about 400,000, times more expensive with 1500 Byte packets than with 1 MB packets.
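The packet times and pipe sizes quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, the T1 rate is rounded to 1.5 Mbit/s as in the text, and the 80 ms round-trip time for the jumbo case is an assumption inferred from the "about 100 packets" figure rather than a number stated on this page.

```python
def packet_time_s(packet_bytes, link_bits_per_s):
    """Serialization time of one packet, in seconds."""
    return packet_bytes * 8 / link_bits_per_s


def packets_in_flight(rtt_s, packet_bytes, link_bits_per_s):
    """Number of packets needed to fill a pipe with the given round-trip time."""
    return rtt_s / packet_time_s(packet_bytes, link_bits_per_s)


# 1988: 512-byte MSS on a T1 backbone, 120 ms coast-to-coast RTT.
t_1988 = packet_time_s(512, 1.5e6)               # about 2.7 ms ("about 3 ms" in the text)
n_1988 = packets_in_flight(0.120, 512, 1.5e6)    # about 44; the text's 40 uses the rounded 3 ms

# Hypothetical jumbo TCP: 1 MByte (10^6 byte) packets on 10 Gb/s Ethernet.
t_jumbo = packet_time_s(1_000_000, 10e9)         # 800 microseconds
n_jumbo = packets_in_flight(0.080, 1_000_000, 10e9)  # 100, assuming an 80 ms RTT

print(f"1988:  {t_1988 * 1e3:.2f} ms per packet, {n_1988:.0f} packets in flight")
print(f"jumbo: {t_jumbo * 1e6:.0f} us per packet, {n_jumbo:.0f} packets in flight")
```

The point of the comparison is that both cases land at a window of tens to a hundred packets, which is the regime the 1988 congestion control work was tuned for.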
Put another way: With constant size packets each order of magnitude rise in link rate lowers TCP's tolerance to other problems in the network by two orders of magnitude. If industry had scaled up packet sizes rather than scaling down packet times, the network throughput of individual data flows would be very different today.
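The quadratic scaling claim can be illustrated numerically. The sketch below assumes only what the text states, namely that the relevant overhead terms grow as the square of the window size in packets; the function names are hypothetical.

```python
def quadratic_cost_ratio(small_packet_bytes, large_packet_bytes):
    """Ratio of the (window in packets)^2 terms on a fixed path.

    On the same path the window in bytes is fixed, so the window in
    packets is inversely proportional to the packet size.
    """
    window_ratio = large_packet_bytes / small_packet_bytes
    return window_ratio ** 2


def squared_window_growth(rate_multiplier):
    """Growth of the squared terms when the link rate rises and the
    packet size stays constant (window in packets grows linearly)."""
    return rate_multiplier ** 2


# 1500-byte packets versus 1 MByte packets on the same path.
ratio = quadratic_cost_ratio(1500, 1_000_000)
print(f"(1000/1.5)^2 = {ratio:,.0f}")

# One order of magnitude in link rate gives two orders of magnitude in the squared terms.
print(squared_window_growth(10))
```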
While larger MTU sizes do exist within the Internet, they have not been widely deployed and used since the MTU path discovery algorithm [RFC1191, RFC1981] is not effective.
Unfortunately, this is correct. We are just going to have to fix it.
The problems with path MTU discovery are documented in RFC2923. It is fragile because it depends on ICMP "Can't Fragment" messages from the network. When there is an "ICMP black hole" the messages don't get delivered, causing path MTU discovery to fail and the TCP connection to hang.
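The failure mode can be seen in a toy model of classic path MTU discovery. This is a deliberately simplified sketch, not an implementation of RFC 1191: the function name, the hop MTU values, and the retry limit are all hypothetical, and real stacks have timers and fallbacks that are not modeled here.

```python
def classic_pmtud(link_mtus, initial_mtu, icmp_delivered=True, max_tries=10):
    """Toy model of ICMP-driven path MTU discovery.

    link_mtus: MTU of each hop along the path, in bytes (made-up values).
    Returns the discovered path MTU, or None if the sender never learns
    it within max_tries (the "connection hangs" case).
    """
    mtu = initial_mtu
    for _ in range(max_tries):
        # First hop that cannot forward a packet of the current size, if any.
        bottleneck = next((m for m in link_mtus if m < mtu), None)
        if bottleneck is None:
            return mtu        # packet fits on every hop and is delivered
        if not icmp_delivered:
            continue          # ICMP black hole: silent drop, sender retries the same size
        mtu = bottleneck      # ICMP message reports the next-hop MTU; sender shrinks
    return None


path = [9000, 1500, 4470]     # hypothetical hop MTUs
print(classic_pmtud(path, 9000))                        # converges to the smallest hop MTU
print(classic_pmtud(path, 9000, icmp_delivered=False))  # never converges
```

In the black-hole case the sender keeps retransmitting full-size packets that can never get through, which is why small transfers may work while bulk transfers stall.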
Since path MTU discovery is fragile, vendors ship computer systems with it turned off. They instead pick a safe MTU for a system wide default, typically 1500 bytes.
Since computer systems are shipped not to use path MTU discovery, almost nobody notices the paths on the Internet that do support larger MTUs.
The first step to mitigating this large and legitimate market disincentive to deploying larger MTUs is to fix path MTU discovery.
See the new rough-draft Internet-Draft on path MSS discovery.
It has been reported in several places, such as Phil Dykstra's page on Jumbo frames, that the CRC-32 used by Ethernet is limited to about 12 kBytes.
This statement is based on an engineering requirement that the maximum allowed probability that the CRC fails to detect a corrupted packet is below some threshold. (I would greatly appreciate it if somebody can provide a specific reference on this calculation.)
I believe the logic behind the calculation is flawed in the following sense: If you have a large quantity of data to move (say a petabyte, 1 × 10^15 bytes) then the total undetected error rate is independent of the packet size across a huge range of sizes. This is because changes in per-packet exposure to undetected errors are exactly offset by the change in the number of packets.
For example, if I send 1 × 10^15 bytes over a link that has a raw bit error rate of 1 × 10^-12, then I compute:
Using 1000 byte packets:
Raw bit error rate: 1.0 × 10^-12
Per packet error rate: 8.0 × 10^-9
Undetected packet errors: 2.0 × 10^-18
Packets per data set: 1.0 × 10^12
Total undetected errors per data set: 2.0 × 10^-6
Using 10000000 byte packets:
Raw bit error rate: 1.0 × 10^-12
Per packet error rate: 8.0 × 10^-5
Undetected packet errors: 2.0 × 10^-14
Packets per data set: 1.0 × 10^8
Total undetected errors per data set: 2.0 × 10^-6
The undetected packet error rate was calculated using the following assumptions:
The "raw strength" of the CRC is 1 part in 2^32. I.e. a single arbitrary burst error will yield a "random" CRC, which will be a false pass once per 4 × 10^9 packets. Actual CRCs are stronger than this because all error patterns in some of the more common cases (e.g. single bit errors) can be proven to never cause a false pass.
The ability of the CRC to detect a given burst error is not affected by the amount of correct data in the same packet.
The probability of there being 2 burst errors in the same packet is low. If this is not the case, you introduce second order terms on both sides of the calculation.
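The calculation under these three assumptions can be reproduced in a few lines. This is a sketch of the arithmetic only: the names are illustrative, and the linear per-packet error estimate is valid only while the per-packet error probability is much less than 1 (the single-burst assumption above).

```python
CRC_FALSE_PASS = 1 / 2**32   # "raw strength" assumption: 1 false pass in 2^32 corrupted packets


def undetected_errors(total_bytes, packet_bytes, bit_error_rate):
    """Expected number of undetected packet errors over a whole data set."""
    per_packet_error = packet_bytes * 8 * bit_error_rate      # probability a packet is corrupted
    undetected_per_packet = per_packet_error * CRC_FALSE_PASS  # corrupted and CRC still passes
    packets = total_bytes / packet_bytes
    return undetected_per_packet * packets


small = undetected_errors(1e15, 1_000, 1e-12)        # 1000 byte packets
large = undetected_errors(1e15, 10_000_000, 1e-12)   # 10,000,000 byte packets

print(f"1000 byte packets:     {small:.1e}")
print(f"10000000 byte packets: {large:.1e}")
```

The packet size cancels out of the product, so both cases give the same total, roughly 2 × 10^-6 undetected errors per petabyte.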
I am in complete agreement that CRC-32 is not strong enough for large data sets. It probably does need to be improved.
The CRC issue does not provide an argument for limiting MTU, only that current Ethernet may not be suitable for large data sets.
The packet times shown on the MTU main page are monotonically decreasing as the data rates get higher, so newer networks will never break existing applications. If a mission oriented network needs substantially smaller jitter there is always the option of artificially limiting the MTU on selected subnets, and relying on path MTU discovery to inform the bulk transport users.
We predict that raising the MTU will effectively be a larger performance gain for the end user than replacing 10 Mb/s Ethernet with 100 Mb/s Ethernet was. If true, once the users are educated, there will be the opportunity to re-sell the entire current installed base, replacing 100 Mb/s gear with big packet gear at either 100 Mb/s or 1 Gb/s.
Today, there are several different bottlenecks that limit performance of wide area connections to less than 10 Mb/s - the hypothetical bandwidth available more than a decade ago. The primary bottleneck today is actually end-system TCP buffer tuning, which is being addressed by the web100 project. However the primary deliverable for the project is an extended performance MIB for TCP, which has enabled us to examine other bottlenecks in the system.
From this information we have come to believe that for the vast majority of university users, raising the MTU would be more valuable than migration to faster Ethernets using 1500 Byte packets. In particular, raising the MTU (together with fixing pMTU discovery and TCP tuning) would mean that most users would get their full fair share of the bottleneck line rate (i.e. fill some link in the network).
On the other hand, since the vast majority of today's end systems can only use a small fraction of the available NIC data rate, raising the NIC data rate does little to help actual performance, and therefore there is no general demand for host interfaces faster than 100 Mb/s.
Therefore, we believe that a deployed R&E core that supports big packets, plus deployed fixes to path MTU discovery and end-system TCP tuning (or user education about manual workarounds), has the potential to cause widespread demand for larger packets, perhaps on faster NICs, everywhere.
The letter below is the IEEE response to draft-kaplan-isis-ext-eth-02.txt. That draft evolved to draft-ietf-isis-ext-eth-01.txt, which includes the letter below as appendix 1 with a rebuttal by the draft's authors as appendix 2. Note that these documents were "works in progress" and have already expired; they have no current standing in the IETF.
From: Geoff Thompson, Chair, IEEE 802.3
To: Scott O. Bradner, IETF
Re: 802.3 Position on Extended Ethernet Frame Size Support
This is in response to your query for a position regarding the publication of Extended Ethernet Frame Size Support - draft-kaplan-isis-ext-eth-02.txt - as an informational RFC. This response was approved in concept and draft by 802.3 during its closing plenary at Hilton Head on March 15. The final form was drafted by myself and reviewed by an ad hoc that was formed during our closing plenary. It should be considered the position of the 802.3 Working Group.
The response is composed of two parts: specific comments on the draft and general comments on the use of jumbo frames in Ethernet networks. However, virtually all traffic uses the type/length field as a type field. It seems unlikely that the implementations using the length format would take advantage of longer packets. Therefore, the draft conveys a very limited value.
Specific comments on: Extended Ethernet Frame Size Support - draft-kaplan-isis-ext-eth-02.txt
The draft makes no mention that extended frames are not likely to be successfully handled by Ethernet equipment unless the network is composed entirely of equipment that is specifically designed, beyond the specifications of the Ethernet Standard, to relay extended size frames.
In section 2, Abstract, the document asserts that it presents an extension to the "current Ethernet Frame Standards to support payloads greater than 1500 bytes..." Neither the original Ethernet specification (it was not a "Standard") nor IEEE Std. 802.3 is a "frame standard". They are, rather, complete specifications for hardware and frame format with the expectation that parameters from one portion of the standard can be taken as a given in other portions of the Standard. Moreover, this draft is not an "extension" to those documents but rather a proposal to violate specific provisions of those documents.
In section 3, the draft refers to "Ethernet II" [ETH] and points to the reference [ETH]. The reference, as cited, is incorrect or incomplete.
Ethernet II would seem to point to Ethernet Version 2.0. That would specifically not be "version 1.0...September 1980". The citation in fact points to 2 different documents and fails to note that the November 1982 edition is in fact Version 2.0. Further, both of these are obsolete references and have been superseded by IEEE Std. 802.3 and ISO/IEC 8802-3. The current version of these Standards is IEEE Std. 802.3 [2000 Edition] and ISO/IEC 8802-3 : 2000.
The details of section 4 are badly out of date. IEEE Std. 802.3 has included both Type and Length encoded packets ever since the adoption of IEEE Std. 802.3x on March 20, 1997. The current text of the 802.3 text covering this reads:
3.2.6 Length/Type Field
This two-octet field takes one of two meanings, depending on its numeric value. For numerical evaluation, the first octet is the most significant octet of this field.
a) If the value of this field is less than or equal to the value of maxValidFrame (as specified in 18.104.22.168), then the Length/Type field indicates the number of MAC client data octets contained in the subsequent data field of the frame (Length interpretation).
b) If the value of this field is greater than or equal to 1536 decimal (equal to 0600 hexadecimal), then the Length/Type field indicates the nature of the MAC client protocol (Type interpretation). The Length and Type interpretations of this field are mutually exclusive.
Please note that any value over "the value of maxValidFrame" is NOT a valid value for encoding length. Additionally, the values between maxValidFrame and "1536 decimal" are undefined in the Ethernet standard. The behavior of equipment at these values is not specified and can not be depended on. The draft implies that these values are valid type fields. This is not true. These values are not valid for either Type or Length.
Section 4 Re: "...are not limited in length to 1500 bytes by framing." While this seems to be true, it is not necessarily true for a number of sometimes subtle reasons, some of which are noted in the "General" section below.
Section 5: Regarding the statement "Although the 802.3 length field is missing, the frame length is known by virtue of the frame being accepted by the network interface." This statement is not correct. Many Ethernet interfaces, particularly those of relay equipment, accept frames without regard for packet type or content. There is no reasonable expectation that standards based Ethernet/802.3 equipment will reject the proposed frames. They may very well accept the frame and corrupt it before passing it on. This corruption may consist of truncation or alteration of the data within the packet.
General comments on the use of jumbo frames in Ethernet networks:
Consideration #1: The expectation of no more than 15-1600 bytes between frames and an interpacket gap before the next frame is deeply ingrained throughout the design and implementation of standardized Ethernet/802.3 hardware. This shows up in buffer allocation schemes, clock skew and tolerance compensation and fifo design.
Consideration #2: For some Ethernet/802.3 hardware (repeaters are one specific example) it is not possible to design compliant equipment which meets all of the requirements and will still pass extra long frames. Further, since clock frequency may vary with time and temperature, equipment may successfully pass long frames at times and corrupt them at other times. Therefore, attempts to verify the ability to send long frames over a path may produce inaccurate results.
Consideration #3: The error checking mechanism embodied in the 4 byte checksum has not been well characterized at greater frame lengths, but is known to degrade. Therefore the data reliability of transfers in long frame transfers will have a greater rate of undetected frame errors.
Consideration #4: The length of frames proposed by this draft can not be assured to pass through standards conformant hardware. The huge value of Ethernet/802.3 systems in the data networking universe is their standardization and the resulting assurance that systems will all interoperate. No such assurance can be provided for oversize frames with both the current broadly accepted standard and the large installed base of standards based equipment.
In summary with regard to greatly longer frames for Ethernet, much of the gear produced today would be intolerant of greatly longer frames. There is no way proposed to distinguish between frame types in the network as they arrive from the media. Bridges might and repeaters would drop or truncate frames (and cause errors doing so) right and left for uncharacterized reasons. It would be a mess. What might seem okay for small carefully characterized networks would be enormously difficult or impossible to do across the Standard.
The choice of frame size for Ethernet packets is really the domain of 802.3 (CSMA/CD) and 802.1 (Bridging, VLANs). The only time the frame size has been modified over the history of the Standard was in order to increase maximum length by four bytes in order to accommodate VLANs. 802.1 initiated this work and 802.3 also modified the Ethernet standard to include these few extra bytes. The people with the experience dealing with this sort of thing attend IEEE 802. It's easy to define a new ethertype, but it's not too easy to figure out what happens when these non-standard frames are given to standardized transmission equipment, e.g. bridges. We would expect discussions of this type to take place in both 802.3 & 802.1.
The giant frame issue has been mentioned several times over the years in 802.3, discussed in the back halls and considered each time we move to a higher speed. It has never had consensus support in that context. It has never been brought forward as a separate proposal. Backward compatibility has always been more important than ease of performance improvement. The problem is that the change is very easy to do in the standard and hard to do in the world. It is just like changing the gauge on railroad tracks. All you have to do is change one line in the standard, never mind all of the rails you have to move.
The Kaplan draft is just meant for carrying IS-IS routing protocol frames (the IS-IS working group is the intended sponsor of this draft). We expect those vendors supporting the larger frame will show up and support this proposal. Those vendors not supporting the larger frame as well as those protecting the installed base will not support this activity nor having this sort of item standardized outside IEEE 802.3. [emphasis added - MM]
With best regards,
Geoff Thompson, Chair, IEEE 802.3
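As an editorial aside to the letter above, the Length/Type rule it quotes (clause 3.2.6) can be sketched as a small classifier. The function names are hypothetical, and the default of 1500 for maxValidFrame is an assumption made for illustration; the letter itself only refers to the value by name.

```python
def classify_length_type(value, max_valid_frame=1500):
    """Interpret a two-octet Length/Type value per the rule quoted in the letter.

    max_valid_frame defaults to 1500 here as an assumption.
    """
    if not 0 <= value <= 0xFFFF:
        raise ValueError("Length/Type is a two-octet field")
    if value <= max_valid_frame:
        return "length"       # number of MAC client data octets
    if value >= 0x0600:       # 1536 decimal
        return "type"         # identifies the MAC client protocol
    return "undefined"        # between maxValidFrame and 1536: neither Length nor Type


def parse_length_type(two_octets):
    """The first octet is the most significant octet of the field."""
    return int.from_bytes(two_octets, "big")


for raw in (b"\x05\xdc", b"\x05\xdd", b"\x06\x00", b"\x08\x00"):
    value = parse_length_type(raw)
    print(f"0x{value:04x} ({value}): {classify_length_type(value)}")
```

A frame carrying more than 1500 payload octets therefore cannot announce its size in this field, which is the gap the letter points to when it says values above maxValidFrame are not valid lengths.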
For additional information check out these pages: Raising the Internet MTU, Pittsburgh Supercomputing Center, Network research at PSC, or Matt Mathis. Please send comments and suggestions to firstname.lastname@example.org.