NO tinygrams
Raising the Internet MTU

The "Internet Cell size" is effectively 1500 bytes - the Maximum Transmission Unit (MTU) for Ethernet. This is orders of magnitude smaller than the optimal MTU for many high performance applications running over todays high speed Internet. Note that although traditional "jumbograms" (9kB) are a huge improvement, they are not really large enough for networks that are faster than 1 Gb/s.

We are not proposing to standardize any particular large MTU. We are proposing that the Internet needs to support diverse MTUs (path MTU discovery has to be completely robust), such that different communities can use different MTUs in different parts of the Internet. Backbones supporting data-intensive users (e.g. NLR, ETF/DTF, Internet2, DoE, NASA, etc.) and their attached campuses are likely to want larger MTUs. Other communities, such as ISPs serving millions of low-rate customers, may find that 1500 bytes is sufficient.

This material is divided into four main areas:

Just remember: The glass is neither half full nor half empty, it is merely the wrong size.

General Resources and Background:


Deploying Robust Path MTU Discovery (RFC 4821):

Traditional path MTU discovery, as specified in RFC 1191 and RFC 1981, is fragile when ICMP Packet Too Big or Fragmentation Needed messages are not reliably generated or delivered. So-called "ICMP black holes" can cause very hard-to-diagnose connection and application hangs and other problems, as documented in RFC 2923 and RFC 4459, among others. These problems arise whether the differing MTUs are due to jumbogram support or to tunnels with MTUs slightly smaller than the native infrastructure (e.g. PPPoE, VPNs, IPsec, etc.).

RFC 4821 describes a robust new method for Packetization Layer Path MTU Discovery (PLPMTUD), in which TCP or some other protocol can determine the path MTU without relying on ICMP or other messages from the network. In this algorithm the packetization layer, which is the protocol responsible for choosing packet boundaries (e.g., segment sizes), probes the path with progressively larger packets. If a probe packet is successfully delivered, the effective path MTU is raised to the probe size. The isolated loss of a probe packet (with or without an ICMP message) is treated as an indication of an MTU limit, not as a congestion indicator; in this case alone, the packetization protocol is permitted to retransmit one segment of missing data without adjusting the congestion window. Read the RFC itself for a full description of the algorithm.
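
The core of the search can be sketched in a few lines. The Python fragment below is only an illustration of the idea, not the RFC 4821 state machine: send_probe() and probe_acked() are hypothetical stand-ins for the real packetization layer (for TCP, sending an oversized segment and watching for its acknowledgment), and the size constants are assumptions rather than values taken from the RFC.

    def plpmtud_search(send_probe, probe_acked, floor=1500, ceiling=9000, step=32):
        # Binary-search the largest packet size the path will deliver.
        #   floor   -- largest size already known to work (e.g. the current limit)
        #   ceiling -- largest size the local interface could possibly use
        #   step    -- stop when the remaining search window is this small
        good, bad = floor, ceiling + 1
        while bad - good > step:
            probe_size = (good + bad) // 2
            send_probe(probe_size)
            if probe_acked(probe_size):
                good = probe_size    # delivered: raise the effective path MTU
            else:
                bad = probe_size     # isolated probe loss: treat as an MTU limit,
                                     # retransmit the data, do not shrink cwnd
        return good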

Traditional path MTU discovery requires that all nearby routers know a host's MTU and that they send the proper ICMP messages to the remote hosts, essentially by proxy. Since PLPMTUD does not require messages from the network, routers do not need to know the host's MTU. The end-to-end path MTU is deduced by observing which packet sizes are delivered and which are discarded (e.g. in a black hole). The very situation that causes classical path MTU discovery to fail becomes the primary signal for PLPMTUD.

RFC 4821 can be configured in a number of different ways. The natural first step is as an ICMP black hole recovery algorithm. In this configuration it is only invoked when a connection might be hung due to an ICMP black hole. It raises the robustness of RFC 1191 and RFC 1981 path MTU discovery with no significant downside. We recommend deploying it in all operating systems as soon as reasonably feasible. We are aware of at least 3 vendors who participated in the IETF pmtud WG and have experimental implementations.

In a slightly more aggressive configuration, it can implement "opportunistic jumbo MTU discovery". Some high-performance host interfaces can be pre-configured to use the largest MTU efficiently supported by the memory subsystem and NIC chipset, without prior knowledge of the MTU actually supported on the local network. The initial MTU for each connection is selected by a heuristic based on the history of discovered path MTUs (typically initialized to 1500 bytes). PLPMTUD can then probe up from 1500 bytes to detect whether the full path supports larger MTUs, without any additional protocol support or site-specific configuration.
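
A minimal sketch of such a heuristic, assuming a simple per-destination cache (the names and default values below are ours, not taken from the RFC):

    DEFAULT_MTU = 1500    # safe, universally supported starting point
    IFACE_MTU = 9000      # assumed: whatever the local interface was configured for

    path_mtu_cache = {}   # destination address -> last discovered path MTU

    def initial_mtu(dest):
        # Start from the cached value for this destination if we have one,
        # otherwise from the safe default; never exceed the interface MTU.
        return min(path_mtu_cache.get(dest, DEFAULT_MTU), IFACE_MTU)

    def record_discovered_mtu(dest, mtu):
        # Remember what PLPMTUD eventually discovered, so later connections
        # to the same destination can start higher than 1500 bytes.
        path_mtu_cache[dest] = mtu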

Opportunistic jumbo MTU discovery has the potential to greatly ease jumbogram deployment, since it relaxes some of the requirements on mixed-MTU networks. With pure RFC 1191 path MTU discovery, every subnet is required to have a single, uniform MTU, and any one device that cannot be upgraded vetoes the entire upgrade. Opportunistic path MTU discovery supports deployment strategies with mixed MTUs per subnet.

Implementations:

As we gain field experience with wide deployment of RFC 4821 in the above two configurations, we will document any additional recommendations for implementors. Watch this page for future information.


Enabling 9K "Jumbograms" in the Internet today:

We are pushing the wide deployment of 9 kB jumbograms, even though we would prefer to go to larger sizes. Just getting the Internet to the point where it fully supports mixed MTUs will break the current strong local optimum at a 1500 B MTU. It is interesting to note that a huge fraction of the deployed 1 Gb/s and faster gear already supports 9 kB jumbograms, but it is not enabled all the way to the end systems due to the problems with RFC 1191 listed above.

Jumbogram Resources:


Pushing up the Internet MTU

We are pushing for the deployment of really large MTUs in the high performance parts of the Internet. The standard MTU, 1500 bytes, is about three orders of magnitude too small for the fastest links in use today. At 10 Gb/s (standard trunks for most mid-sized ISPs), a 1500 byte packet takes only 1.2 uS (microseconds) on the wire, which is less than the wire time of an ATM cell at the time of peak ATM deployment. A 9 kB "jumbogram" takes only 7.3 uS, which is not much better. Since current packet times are so short, many of the problems that dogged ATM are hurting the Internet as well. In particular, filling a long, fast path (e.g. 60 ms) with 1.2 uS packets requires the transport protocol to manage 50,000 packets in flight concurrently. Not too surprisingly, TCP and all other protocols have a great deal of difficulty managing this many outstanding packets.

If packets were 100 times larger (150 kB), the wire time would be 120 uS and the same flow would require only 500 packets in flight. Modern protocols have no difficulty at all managing this number of packets in flight.
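
These figures follow directly from the wire time (MTU x 8 / rate) and the bandwidth-delay product; the short Python sketch below simply reproduces the arithmetic for the 10 Gb/s, 60 ms example used above.

    RATE_BPS = 10e9    # 10 Gb/s link
    RTT_S = 60e-3      # 60 ms path

    def wire_time_us(mtu_bytes, rate_bps=RATE_BPS):
        return mtu_bytes * 8 / rate_bps * 1e6

    def packets_in_flight(mtu_bytes, rate_bps=RATE_BPS, rtt_s=RTT_S):
        # bandwidth-delay product divided by the packet size
        return rate_bps * rtt_s / (mtu_bytes * 8)

    print(wire_time_us(1500))            # ~1.2 uS per 1500 byte packet
    print(packets_in_flight(1500))       # ~50,000 packets to fill the path
    print(wire_time_us(150_000))         # ~120 uS per 150 kB packet
    print(packets_in_flight(150_000))    # ~500 packets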

How large do we want? Our initial vision was that each factor-of-ten bandwidth step should have been allocated to a factor-of-8 increase in payload size and a 20% reduction in packet time. Further consideration suggests that other models might be more practical. We show two here (see the sketch after the table): constant packet time (125 uS, the voice/SONET frame time), and capping the MTU at 64 kB, which is the natural limit for a number of protocols, including IPv4, due to the number of bits in the length field (this limit does not apply to IPv6).

Note that there is no specific reason to require any particular MTU at any particular rate. As a general principle, we prefer declining packet times (and declining worst case jitter) as you go to higher rates.
                Actual             Vision             Alternate 1        Alternate 2
Rate      Year  MTU     Wire Time  MTU     Wire Time  MTU     Wire Time  MTU     Wire Time
10 Mb/s   1982  1.5 kB  1200 uS    -       -          -       -          -       -
100 Mb/s  1995  1.5 kB  120 uS     12 kB   960 uS     9 kB    720 uS     4.3 kB  433 uS
1 Gb/s    1998  1.5 kB  12 uS      96 kB   768 uS     64 kB   512 uS     9 kB    72 uS
10 Gb/s   2002  1.5 kB  1.2 uS     750 kB  600 uS     150 kB  120 uS     64 kB   51.2 uS
100 Gb/s  -     -       -          6 MB    480 uS     1.5 MB  120 uS     64 kB   5.12 uS
1 Tb/s    -     -       -          50 MB   400 uS     15 MB   120 uS     64 kB   0.512 uS
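
As a rough sketch of the two alternate models (our own approximation: the table rounds to convenient sizes and keeps smaller, already-common MTUs at the lower rates, so it does not follow these formulas exactly):

    CAP_BYTES = 64_000    # the table's "64 kB" entries correspond to 64,000 bytes

    def constant_time_mtu(rate_bps, target_us=125.0):
        # Alternate 1: hold the packet wire time near the 125 uS frame time.
        return rate_bps * (target_us / 1e6) / 8

    def capped_mtu(rate_bps, target_us=125.0, cap=CAP_BYTES):
        # Alternate 2: the same target, but never exceed the 64 kB limit.
        return min(constant_time_mtu(rate_bps, target_us), cap)

    for rate_gbps in (10, 100, 1000):
        rate = rate_gbps * 1e9
        print(rate_gbps, "Gb/s:",
              round(constant_time_mtu(rate) / 1e3), "kB vs",
              round(capped_mtu(rate) / 1e3), "kB")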

The above numbers are very speculative about what MTUs might make sense in the market. We keep updating them as we learn more about how MTU affects the balance between switching and end-system costs and end-to-end performance. The Internet as a whole will be seeking to optimize total cost versus performance across several different communities.


This page is http://www.psc.edu/~mathis/MTU/index.html.

For additional information check out these pages: Raising the Internet MTU, Pittsburgh Supercomputing Center, Network research at PSC, or Matt Mathis. Please send comments and suggestions to mathis@psc.edu.