Jan 31, 2002 - A coincident source Stereo speaker in accordance with the. (60) Provisional ... range components are oriented one above the other so that their respective sound ... there are typically at least two woofers that are symmetri cally arran
Mar 20, 2001 - (75) Inventors: Kenneth P. Fishkin, Redwood City, CA. (US); Beverly L. Harrison, Palo Alto,. FOREIGN PATENT DOCUMENTS. CA (US); Carlos ...
devil is in the detailsâ. Nature Biotech 15: 1222-1223, 1997 .... USA, vol. 84:2926-2930 (1987). Huang, Fei et al., âThe Mechanisms of Differential Sensitivity to an.
Airmagnet. AirMagnet Distributed System. .... WildPackets Inc. Airopeek Wireless LAN Analyzer. .... In one embodiment, a computer-readable medium is pro.
with Standard left and right analog channels using RCA-style audio jacks. HDMI and ... What is needed is multimedia appliance that can self-adapt and configure ..... processing by an MPEG-2 AVC/VC-1 decoder or the like in the TV 102. In one ...... in
Cybex, Director Installer/User Guide, Nov. .... Rose Electronics, UltraView Installation and Operation Manual, ..... airports, coffee shops, college campuses, etc.
TV compass, âThe iTV Market.â http://www.tv.compass.com/ itv market.html, printed ..... 19, the user has highlighted the score of the Brewers. Dodgers game and ...
U.S. Patent. Jul. 13, 2010. Sheet 2 of 19. US 7,753,577 B2. 3. 3. s. 1- N. -II. S. S. Y. & ... Another way to provide lighting for these tiles is to use fiber optics that are powered .... method has four components: the LED strip, a Splitter, a clear
A vandal resistant sports goal net clip, or fastener, for attaching nets to the frames of sports goals. The fasteners are difficult to remove by hand and require the ...
g s A: 3.E. Paaske et al., âHigh SpeedViterbi Decoder Architecture.â First ESA. EP. O 923 247 A2 6, .... Internet Wire, Sunbeam Joins Microsoft in University Plug and Play. Forum to Establish a ... Proceedings of the First NASA/DOD Workshop on Ev
(12) United States Patent
(10) Patent No.:
Bergamasco et al.
US 7,961,621 B2
(45) Date of Patent:
(54) METHODS AND DEVICES FOR BACKWARD
aim eta11~ ,
1 s et a .
6,404,768 B1 6,456,590 B1
(75) Inventors: Davide Bergamasco, Sunnyvale, CA
6/2002 Basak et al. 9/2002 R t l.
(US); Andrea Baldini, San Francisco,
CA (Us); Valentina Alaria’ San
3/2003 Kikuchi et al.
Francisco, CA (US); Flavio Bonomi, Palo Alto, CA (US); Rong Pan,
4/2003 pared (Commue )
Sunnyvale’ CA (Us) (73)
FOREIGN PATENT DOCUMENTS
Assignee: Cisco Technology, Inc., San Jose, CA
Subject to any disclaimer, the term of this
Patent 15 extended or adlusted under 35
MAC Control Pause Operation, 3lB.3.l Transmit Operation, Annex
U-S-C- 154(1)) by 1200 days-
31B, IEEE Std 802.3ae-2002, 4 pages.
(21) Appl. No.: 11/248,933 (22)
Jun. 14, 2011
Oct, 11, 2005
Primary Examiner * KWang B Yao Assistant Examiner * Kenan Cehic
Prior Publication Data
Us 2007/0081454 A1
(51) Int CL H04] 3/14 (52) (58)
(74) Attorney, Agent, or Firm * Weaver Austin Villeneuve
Apr. 12, 2007
& Sampson LLP
(57) ABSTRACT The present invention provides improved methods and
us. Cl. ..................................................... .. 370/236 Field of Classi?cation Search _____ __ 370/2301i236
deviees for managing network eerrgeerierr- Preferred imple mentations of the invention alloW congestion to be pushed
370/468; 709/225’ 229’ 232*235 See application ?le for Complete Search history
from congestion points in the core of a netWork to reaction points, Which may be edge devices, host devices or compo nents thereof. Preferably, rate limiters shape individual ?oWs of the reaction points that are causing congestion. Parameters
of these rate limiters are preferably tuned based on feedback
from congestion points, e.g., in the form of backward con gestion noti?cation (“BCN”)_ messages. In some implemen _ _ _
U.S. PATENT DOCUMENTS 5,402,416 A
5 , 5 26,3 50 A 5,742,604 A
6/1996 Gittins et a1, 4/1998 Edsall et a1.
71999 Handel eglal
The instantaneous measure(s) of congestion may be relative
8 1999 A an et
to a threshold of a particular queue and/ or relative to a thresh
Cieslak et a1.
tat1ons, such BCN mes sages 1nclude congestion change 1nfor mation and at least one instantaneous measure of congestion.
Haddock et al.
old of a buffer that includes a plural1ty of queues.
PCT Search Report mailed May 23, 2008 from International Appli cation No. PCT/US08/051986, including Noti?cation of Transmittal,
Sancho et al., “Analyzing the In?uence of Virtual Lanes on the Performance on In?niband Networks”; 2002; IEEE Proceeding of the International Parallel and Distributed processing Symposium (IPDPS’02); pp. 1-10. International Search Report, dated Sep. 21, 2006 from International Application No. PCT/US05/37069, 4 pp.
(4 PP~)~ PCT Written Opinion mailed May 23, 2008 from International Appli cation No. PCT/US08/051986, including Noti?cation of Transmittal,
(4 PP~)~ CIPO Second Of?ce Action mailed Feb. 27, 2009 in Chinese Appli cation No. 200580035946.
Written Opinion of the International Searching Authority, dated Sep.
CIPO Of?ce Action mailed Apr. 4, 2009, in Application No.
21, 2006 from International Application No. PCT/U S05/3 7069, 7 pp. International Search Report, mailed Nov. 1, 2006 from International Application No. PCT/US05/36700, 3 pp.
Written Opinion of the International Searching Authority, mailed Nov. 1, 2006 from International Application No. PCT/US05/36700,
US Of?ce Action mailed Mar. 4, 2009 for US. Appl. No. 11/152,991. US. Final Of?ce Action mailed Mar. 17, 2009 for US. Appl. No.
International Search Report, mailed Oct. 18, 2006, from International Application PCT/US05/37765, 3 pp. Written Opinion of the International Searching Authority, mailed Oct. 18, 2006, from International Application PCT/US05/37765, 7
US Notice of Allowance mailed Mar. 23, 2009 in US. Appl. No.
International Search Report, mailed Jan. 16, 2007, from International Application No. PCT/US05/37239. Written Opinion of the International Searching Authority, mailed Jan. 16, 2007, from International Application No. PCT/US05/ 37239. International Search Report, mailed Feb. 20, 2007, from International Application No. PCT/US05/37651. Written Opinion of the International Searching Authority, mailed Feb. 20, 2007, from International Application No. PCT/US05/37651. International Search Report, mailed Jun. 4, 2008, from International Application No. PCT/US2007/066027. Written Opinion of the International Searching Authority, mailed Jun. 4, 2008, from International Application No. PCT/US2007/
1 1/078,992. Of?ce Action mailed Apr. 7, 2009 for US. Appl. No. 11/094,877. US. Of?ce Action mailed Apr. 22, 2009 for US. Appl. No. 1 1/084,587. Of?ce Action mailed Jun. 22, 2009 for US. Appl. No. 11/400,671. Notice of Allowance issued May 29, 2009 for US. Appl. No. 1 1/15 5,388. Of?ce Action mailed Apr. 15, 2009 for US. Appl. No. 11/670,544. Cisco Systems, Inc., “Cisco data Center Network Architecture and
Solutions Overview,” http://www.cisco.com/application/pdf/en/us/ guest/netsol/ns377/c643/cdcconti0900aecd802c9a4f.pdf, 2006. F. Kamoun, et al., “Analysis of Shared Finite Storage in a Computer Network Node Environment Under General Traf?c Conditions”, IEEE Transactions on Communications, Jul. 1990.
AK. Choudry, et al., “Dynamic Queue Length Thresholds for Shared-Memory Packet Switches”, IEEE/ACM Transactions on Net
working, Apr. 1998.
CIPO Of?ce Action mailed Aug. 8, 2008 in Chinese Application No.
AK. Choudry, et al., “A New Buffer Management Scheme for Hier archical Shared Memory Switches”, IEEE/ACM Transactions on Networking, 26 pp., 1997. J. Mahdavi, et al., “IPPM Metrics for Measuring Connectivity”, RFC
CIPO Of?ce Action mailed Jul. 18, 2008 in Chinese Application No. 2005800346460.
Of?ce Action mailed Mar. 31, 2008 for US. Appl. No. 11/084,587. Of?ce Action mailed Jan. 30, 2008 for US. Appl. No. 11/078,992. Final Of?ce Action mailed Jul. 11, 2008 for US. Appl. No. 1 1/ 078,992. Of?ce Action mailed Jul. 3, 2008 for US. Appl. No. 11/400,671. Of?ce Action mailed Feb. 21, 2008 for US. Appl. No. 11/094,877. Of?ce Action mailed Jul. 28, 2008 for US. Appl. No. 11/094,877. Of?ce Action mailed Jan. 24, 2008 for US. Appl. No. 11/152,991. Final Of?ce Action mailed Sep. 8, 2008 for US. Appl. No. 1 1/ 152,991 .
Of?ce Action mailed May 29, 2008 for US. Appl. No. 11/155,388. Final Of?ce Action mailed Sep. 15, 2008 for US. Appl. No. 1 1/ 155,388. Of?ce Action mailed Oct. 23, 2008 for US. Appl. No. 11/078,992. Of?ce Action mailed Oct. 28, 2008 for US. Appl. No. 11/084,587. Final Of?ce Action mailed Dec. 10, 2008 for US. Appl. No. 1 1/ 094,877. IEEE Standards 802.3TMi2002, IEEE Computer Society, Mar. 8, 2002, 1513 pages. PCT Search Report mailed Sep. 27, 2007, from International Appli cation No. PCT/US06/38858, including Noti?cation of Transmittal,
2678, pp. 1-9, Sep. 1999. J. Postel, “Internet Control Message Protocol, DARPA Internet Pro gram Protocol Speci?cation”, RFC 792, pp. 1-15, Sep. 1981. Wei Cao Huawei Technologies: “IEEE 802.1ah Mode for Ethernet Over MPLS; draft-cao-pwe3 -801-1ah-00.tXt” IETF Standard-Work
EPO Of?ce Action mailed Oct. 19, 2009, in EP Application No. 058 10800 .2.
EPO Search Report mailed Mar. 19, 2010, in EP Application No.
PCT Written Opinion of the International Searching Authority mailed Sep. 27, 2007, from International Application No. PCT/
US06/38858 (6 pp.). “Random Early Detection Gateways for Congestion Avoidance,” (Transactions on Networking, Aug. 1993). RFC 3168, “TheAddition ofExplicit Congestion Noti?cation (E CN) to IP” (K. Ramakrishnan et al., Sep. 2001). U.S.Appl. No. 10/ 777,886, entitled “End-to-End Congestion Control in a Fibre Channel Network”, ?led Feb. 11, 2004.
US Of?ce Action mailed Nov. 23, 2009 in related U.S. Appl. No.
1 1/084,587. US Of?ce Action mailed Dec. 9, 2009 in related U.S. Appl. No. 1 1/400,671 .
US Of?ce Action mailed Nov. 4, 2009 in related U.S. Appl. No.
1 1/094,877. US Notice of Allowance mailed Apr. 23, 2010 in related U.S. Appl. No. 11/094,877. US Final Of?ce Action mailed Aug. 18, 2009 in related U.S. Appl.
METHODS AND DEVICES FOR BACKWARD CONGESTION NOTIFICATION
coupled With proportionally smaller buffer siZes and loW latency, causes buffers to ?ll up very quickly When congestion arises.
Some exemplary high-speed, loW latency netWorks having
CROSS-REFERENCE TO RELATED APPLICATIONS
relatively small buffers, Which Will be referred to herein as Data Center Ethernet (“DCE”) or the like, are described in
US. patent application Ser. No 11/084,587, entitled “Ether
This application is related to US. patent application Ser. No. 11/ 155,388 entitled “ACTIVE QUEUE MANAGE
net Extension for the Data Center” and ?led on Mar. 18, 2005,
MENT METHODS AND DEVICES” and ?led on Jun. 16,
to US. patent application Ser. No 11/078,992, entitled “Fibre Channel Over Ethernet” and ?led on Mar. 10, 2005 and to
2005 (the “AQM Application”), Which is hereby incorporated
US. patent application Ser. No 11/094,877, entitled “Net Work Device Architecture for Consolidating Input/Output and Reducing Latency” and ?led on Mar. 30, 2005, (the “DCE Applications”), all of Which are incorporated by refer
by reference for all purposes. BACKGROUND OF THE INVENTION
ence for all purposes. DCE netWorks are a challenging environment for conges
Congestion avoidance techniques are essential to the operation of netWorks and netWork devices. One such tech nique knoWn in the art as “Random Early Discard” or “RED” is described in a publication by S. Floyd and V. Jacobson
entitled “Random Early Detection Gateways for Congestion Avoidance, ” (Transactions on Networking, August 1993), Which is hereby incorporated by reference for all purposes. The basic principle behind RED is to control the average length of a netWork device’s (e. g., a router’ s) output queue in order to avoid long-term congestion. To achieve this goal, RED must Work tightly coupled With transport protocols,
tion management because of their high speed (minimum 10 Gbps) and loW latency (feW microseconds of round trip). 20
cations. If link-level ?oW-control is being used, congestion
spreads almost instantly. Prior art congestion control techniques such as RED and ECN have been shoWn to Work poorly With small buffers 25
because of the extremely compressed dynamics exhibited by such buffers. In fact, under congestion conditions a buffer in a DCE netWork ?lls up instantly When such techniques are employed, causing RED or ECN to Work in the region of
such as TCP, Which are equipped With their oWn congestion avoidance mechanisms and are thus capable to react to con
gestion indications generated by RED routers. FIG. 1A includes graph 100 that illustrates hoW RED
Also, in certain cases, such netWorks make use of 802.3X link-level ?oW control to guarantee Zero packet loss to appli
maximum drop/mark probability. This, in turn, causes the 30
traf?c ?oWs to sloW doWn more than necessary, Which causes
Works. For each incoming packet, the average queue length is
a loss of throughput.
calculated. (Please note that the terms “packe ” and “frame”
More advanced congestion control mechanisms tailored for netWorks characterized by operational parameters similar
may be used interchangeably herein.) If the average queue length is beloW a prede?ned minimum threshold 102, the packet is accepted and stored in the output queue for trans
to DCE have been considered. One such mechanism is Fibre 35
Channel Congestion Control (“FCC”), a congestion manage
mission. If the average queue siZe is above the minimum threshold 102 but beloW a prede?ned maximum threshold
ment mechanism for Fibre Channel netWorks that is
104, a packet marking probability is computed and the packet gets marked according to this probability. The marking prob
10/777,886, entitled “End-to-End Congestion Control in a Fibre Channel NetWork” and ?led on Feb. 1 1, 2004, Which is
ability is proportional to the average queue siZe. Therefore,
described in co-pending US. patent application Ser. No
When the queue is larger, there is a higher probability for an
a continuation-in-part of co-pending US. patent application Ser. No. 10/026,583, entitled “Methods and Apparatus for
incoming packet to be marked. Finally, if the average queue siZe is above the maximum threshold 104, all incoming pack
NetWork Congestion Control” and ?led on Dec. 18, 2001, both of Which are incorporated herein by reference for all
ets are marked until the average queue siZe falls again beloW the maximum threshold 104.
It is responsibility of the transport protocol to take the appropriate countermeasures When it detects packets marked
While quite effective at controlling congestion When it arises, FCC uses a conservative, time-driven rate recovery
process to accelerate traf?c ?oWs When congestion is improv ing. Therefore, FCC may take a longer-than-optimal time to recover the original rate of tra?ic ?oWs in congested high speed, loW-latency netWorks such as DCE netWorks.
al., September 2001), Which is hereby incorporated by refer
Many of the congestion management challenges of DCE
ence. When TCP is being used in the absence of an explicit
netWorks are shared by other netWorks, including but not limited to Fibre Channel netWorks and hi gh- speed Ethernet. It Would be very desirable to implement methods and devices that address at least some of the shortcomings of the prior art.
by RED. One explicit method of marking packets in this context is described in RFC 3168, “The Addition of Explicit Congestion Noti?cation (ECN) to IP” (K. Ramakrishnan et
method of marking packets, packets can only be “marked” by discarding them, With TCP interpreting the loss of packets as a congestion indication. When packet drops are detected, TCP sources immediately reduce their transmission rate, causing a reduction of the traf?c volume at the congested router(s). Discarding packets is also a useful means to control
SUMMARY OF THE INVENTION
average queue siZe When non-reactive transport protocols such as UDP are exploited.
As noted in the Background section of the AQM Applica tion, the RED algorithm presents scalability issues and other challenges. Moreover, as the speed of netWork tra?ic increases, controlling netWork congestion in an acceptable manner becomes increasingly challenging. This is true in part because it is not economically feasible to increase buffer siZes
in proportion to the higher netWork speeds. High speed,
The present invention provides improved methods and devices for managing netWork tra?ic. Preferred implementa tions of the invention alloW congestion to be pushed from congestion points in the core of a netWork to reaction points, Which may be edge devices, host devices or components thereof. Preferably, rate limiters shape individual ?oWs of the reaction points that are causing congestion. Parameters of these rate limiters are preferably tuned based on feedback
from congestion points, e.g., in the form of backWard con
US 7,961,621 B2 3
gestion noti?cation (“BCN”) messages. In some implemen tations, such BCN messages include congestion change infor
Alternative methods of the invention control rates of tra?ic injected into a netWork. One such method includes these steps: receiving a ?rst feedback message from a congestion point of a netWork, the ?rst feedback message comprising an
mation and at least one instantaneous measure of congestion.
The instantaneous measure(s) of congestion may be relative to a threshold of a particular queue and/ or relative to a thresh
instantaneous measure of congestion for the congestion
old of a buffer that includes a plurality of queues.
point, congestion change information for the congestion point and identity data for the congestion point; calculating a
Some implementations of the invention provide a conges tion management method that includes the folloWing steps: detecting netWork congestion at a ?rst congestion point of a network; identifying a ?rst congested entity of the network; calculating feedback information regarding a congestion level of the congested entity; and sending a ?rst feedback
feedback signal based, at least in part, on information in the ?rst feedback message; and adjusting a How rate of tra?ic
addressed to the congestion point according to the feedback
signal. The ?rst feedback message may identify a particular How.
message to a ?rst reaction point of the netWork. The reaction point is associated With one or more tra?ic ?oWs causing the
The congested entity may comprise a queue. The calculating step may involve calculating the feedback signal based on the instantaneous measure of congestion and the congestion
congestion, at least in part. The feedback message includes the feedback information and identity data for the congested
change information for the congestion point.
change information may be determined With reference to a
The ?rst feedback message may also comprise an indica tion that the occupancy of a buffer of the congestion point is above a buffer congestion threshold. If so, the calculating step may involve calculating the maximum negative value of the
predetermined threshold of a queue. The predetermined
threshold may decrease as a number of active virtual output
The method may include the step of adding a tag to each frame sent to the congestion point. The tag includes data responsive to the ?rst feedback message. All of the foregoing methods, along With other methods of
The feedback information may comprise an instantaneous
measure of congestion and congestion change information. The instantaneous measure of congestion and the congestion
queues (“VOQs”) in a buffer of a congestion point increases and the ?rst predetermined threshold may increase as the number of active VOQs in the buffer decreases.
The ?rst feedback message may be an indication to sloW doWn a tra?ic How, an indication to speed up a tra?ic ?oW or an indication to stop a tra?ic How. The ?rst feedback message
preferably identi?es a particular How. The congested entity
the present invention, may be implemented by softWare, ?rm Ware and/ or hardWare. For example, at least some methods of
the present invention may be implemented by computer pro 30
may be a queue.
The detecting step may involve sampling a frame and determining Whether a sampled frame includes data that is
device or an egress port of a host device’s netWork interface
responsive to a feedback message. When it is determined that
the sampled frame includes responsive data, the method may also include these steps: determining that the responsive data identify the ?rst congested entity; determining that the occu
currently above a ?rst predetermined threshold; and sending 40
frame. The second feedback message comprises an indication
back message to a source address of the sampled frame. The second feedback message comprises an indication to stop a
FIG. 7 illustrates an exemplary data path structure of a FIG. 8 illustrates an example of timeout and restart at a
reaction point. 55
and sending a second feedback message to a source address of
FIG. 9 depicts an alternative implementation for conges tion points having input buffers that are shared by a number of output queues. FIG. 10 is a netWork device that may be con?gured to implement some aspects of the invention.
an indication that the occupancy of the buffer is above the 60
When it is determined that the sampled frame does not
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
include responsive data, the method may further comprise the steps of determining that the occupancy of a queue to Which the sampled frame Will be added is currently beloW a ?rst predetermined threshold and determining not to send a sec ond feedback message to a source address of the sampled frame.
FIG. 6 illustrates exemplary processes of congestion detec tion and message generation at a congestion point.
the sampled frame. The second feedback message comprises
buffer congestion threshold.
sages betWeen a congestion point and a reaction point. FIG. 3 illustrates an exemplary BCN frame format. FIG. 4 illustrates an exemplary Rate Limited Tag (“RLT”) frame format. FIG. 5 illustrates an exemplary BCN frame format With
tra?ic ?oW. When it is determined that the sampled frame includes
responsive data, the method may also include these steps: determining that the responsive data identify the ?rst con gested entity; determining that the occupancy of a buffer of the congestion point is above a buffer congestion threshold;
FIG. 1A is a graph illustrating the RED algorithm. FIG. 1B is a netWork diagram illustrating netWork conges tion. FIGS. 2A and 2B illustrate different types of BCN mes
to sloW doWn a tra?ic ?oW.
When it is determined that the sampled frame includes responsive data, the method may also include these steps: determining that the responsive data identify the ?rst con gested entity; determining that the occupancy of a queue to Which the sampled frame Will be added is currently above a second predetermined threshold; and sending a second feed
card. BRIEF DESCRIPTION OF THE DRAWINGS
pancy of a queue to Which the sampled frame Will be added is a second feedback message to a source address of the sampled
grams embodied in machine-readable media. Some aspects of the invention can be implemented by netWork devices or portions thereof, such as an ingress port of an edge netWork
In this application, numerous speci?c details are set forth in 65
order to provide a thorough understanding of the present invention. It Will be obvious, hoWever, to one skilled in the art, that the present invention may be practiced Without some or
US 7,961 ,621 B2 5
all of these speci?c details. In other instances, Well known
220 to a reaction point (edge sWitch 110), indicating that edge
process steps have not been described in detail in order not to
sWitch 110 should sloW doWn its rate of transmission. Pref
obscure the present invention.
erably, negative BCN feedback message 220 includes su?i cient detail to alloW edge sWitch 110 to identify a particular traf?c ?oW (i.e., a layer 2 How, a layer 3 How, or a layer 4 How)
The present invention provides congestion management methods and devices that are particularly suitable for netWork devices, such as sWitches and routers. Some aspects of the present invention are particularly suitable for implementing a
that needs to be sloWed. A BCN frame is generated by a
congestion point by sampling incoming traf?c, e.g., as
Data Center Ethernet (“DCE”) solution, Which simpli?es the connectivity of data centers and provides a high bandWidth, loW latency netWork for carrying Ethernet and storage traf?c.
described beloW. In this example, core sWitch 140 has subse quently sent a “stop” BCN message 230 to edge sWitch 110. As described in more detail beloW, a “stop” BCN message 230 Will cause a reaction point to stop transmitting data (pref erably on a speci?ed data How) for a period of time. One exemplary BCN frame is depicted in FIG. 3. BCN frame 305 has a DestinationAddress (“DA”) 310 that is equal to the Source Address of the sampled frame. BCN frame 305 also has a Source Address (“SA”) 315 equal to an address
Some exemplary DCE methods and devices are described in
the DCE Applications, Which have been incorporated by ref erence herein. HoWever, the present invention has Wide appli cability outside of the DCE context and is suitable for Fibre
Channel netWorks, IP netWorks, etc, potentially any kind of packet sWitched netWork.
(here a MAC address) associated With the congestion point.
FIG. 1B shoWs a DCE netWork 105 that includes core
sWitch 140, edge sWitches 110, 120 and 130 and correspond ing end nodes 115, 125 and 135. End nodes 115 and 135 are simultaneously sending tra?ic at a line rate (10 Gbps) to end node 125. Because the aggregate tra?ic rate from links 150
This alloWs BCN Frame 220 to be routed back to the source of
the tra?ic causing congestion (in this example, to edge sWitch 20
110 ) With a valid source address.
and 170 are merely illustrative and that in some netWorks there may be many more links, core devices, etc., disposed betWeen the edge sWitches and the core sWitch shoWn in FIG. 1B. In this example, core sWitch 140 is a “congestion point”
In this example, ?eld 320 is an IEEE 802. 1Q tag that carries the VLAN of the sampled frame and the Priority ?eld indi cating the highest priority. Field 320 Will indicate a null VLAN in tWo instances: (1) if the sampled frame did not carry an 802.1Q tag or (2) if the VLAN ?eld of such tag indicated
that detects the congestion condition. According to preferred
and 160 exceeds the capacity of link 170, link 170 is subject to congestion and the queue(s) associated With it start ?lling up. Those of skill in the art Will appreciate that links 150, 160
a null value. Field 325 identi?es the frame as being a BCN
feedback message, in this example by indicating a predeter mined EtherType This EtherType could be any of the cur
rently unassigned EtherTypes, e.g., as per http://www.i
ana.org/assignments/ethemet-numbers. These EtherTypes
implementations of the invention, as soon as a congestion
are assignable by the IEEE Standards Department, 445 Hoes
point detects congestion, it starts sending explicit feedback
Lane, PO. Box 1331, PiscataWay, NJ. 08855-1331.
messages to the reaction points associated With the tra?ic
Version ?eld 330 indicates the version of the BCN proto col. In this example, three bits folloWing version ?eld 330 change the semantics of the BCN message When they are set. The meaning of these bits Will be described beloW. Q bit 331 indicates that Qdelta is saturated. In the example described beloW, Qdelta is saturated When its value is either equal to —2
?oWs causing such congestion. Such feedback messages Will sometimes be referenced herein as backWards congestion noti?cation (“BCN”) messages, BCN frames, or the like. In
some such implementations, the feedback message is an Eth ernet frame, Which may have a format similar to that of the
frame depicted in FIG. 3. In this example, core sWitch 140 causes “sloW-doWn” BCN messages 180 and 190 to be sent toWards end nodes 115 and 135. These messages Will also be referred to herein as a “negative BCN feedback messages” or the like. Such mes sages (and other BCN messages that are described beloW) are
processed at “reaction points,” Where congestion mitigation
ignored on reception. Future versions of the BCN protocol may rede?ne all or some of the reserved bits.
measures are put into place. The reaction points couldbe edge sWitches 110 and 130, or, in some implementations, end nodes 115 and 135. The processing of a negative BCN feedback message Will result in the instantiation of a ?lter/rate limiter (or a further
Qeq or —2 Qeq. M bit 332 indicates a condition of mild congestion, Whereas S bit 333 indicates a condition of severe congestion. Reserved bits in ?eld 335 are not used in this example. Instead, they are set to Zero on transmission and
Field 340 indicates a congestion point identi?er (“CPID”). A primary purpose of the CPID is to identify a congested
entity in the netWork. In this example, the congested entity is a queue of core sWitch 140. This information is sent to a
reaction point in order to create an association betWeen the 50
congested entity and the reaction point.
sloW doWn of the one(s) already instantiated, if any) at the
The contents of timestamp ?eld 350 and unit ?eld 352 are
reaction point. The purpose of the rate limiter is to sloW doWn a congesting tra?ic How to mitigate congestion at the core
copied from the homonymous ?elds of a Rate Limited Tag (“RLT”) of the sampled frame. RLTs Will be described beloW With reference to FIGS. 2B and 4. If the sampled frame does
sWitch. If congestion should improve (or dissipate com pletely), “speed-up” messages (also referred to herein as “positive BCN feedback messages” or the like) Will cause the
set to Zero.
Qoff ?eld 355 and Qdelta ?eld 360 contain quantitative feedback information conveyed by the congestion point to the
rate limiters to increase their rate to avoid Wasting bandWidth
at the congestion point. FIGS. 2A and 2B illustrate exemplary exchanges of mes sages betWeen a congestion point and a reaction point. In this example, the congestion point is core sWitch 140 and the
reaction point is edge sWitch 110. In FIG. 2A, edge sWitch 110 is sending untagged data frames 210 to core sWitch 140,
indicating that edge sWitch 110 has not yet received (or has not recently received) a BCN feedback message. HoWever, core sWitch 140 has detected congestion. First, core sWitch 140 has sent negative BCN feedback message
not carry such a tag, timestamp ?eld 350 and unit ?eld 352 are
reaction point. The use of such ?elds Will be described beloW With reference to FIG. 6.
Field 365 of BCN frame 305 consists of the ?rst N bytes of the sampled frame. N is a con?gurable parameter, and it has a minimum value is such that the resulting BCN frame is alWays guaranteed to be as large as, or larger than, a mini mum-siZed frame of the type used to implement the invention (e.g., a minimum-sized Ethernet frame of 64 bytes). For example, in the case of BCN frame 305 of FIG. 3, the mini
US 7,961,621 B2 7
mum value of N has to be 26 in order to ensure the length of
a congestion point Will generate BCN feedback messages
BCN frame 305 to be 64 bytes or larger. The information in ?eld 365 conveys to the reaction point enough information to
only on ?oWs Whose frames that carry an RLT tag With a CPID matching its oWn ID. As noted above, When a reaction point receives a BCN frame from a congestion point and such message causes a congestion mitigation action to be under taken on a particular data How, the reaction point associates
exert highly focused congestion mitigation actions. For example, a reaction point may use source and/or destination
IP addresses and TCP ports from ?eld 365 to identify speci?c tra?ic ?oWs and alter the corresponding transmission rates. Field 370 is the Frame Check Sequence or CRC of the BCN frame 305.
the CPID With the data How, eg by saving the CPID in a local register associated With such data How. Field 420 is reserved. Timestamp ?eld 425 may be used to estimate the round trip
FIG. 5 illustrates an example of an extended BCN frame
time betWeen the reaction point and the congestion point With
505 that may be used in netWorks employing MAC-in-MAC
Which it is associated. Each time a reaction point inserts an RLT tag in a frame it is going to transmit, the current value of
encapsulation. Such methods may be implemented, for
a local free running timer is copied into timestamp ?eld 425. Unit ?eld 427 indicates the time units used by the free running timer. The resolution of this free running timer may be, for
example, according to a conventional MAC-in-MAC scheme as described in IEEE standard draft 802.1ah or according to
novel methods described in Us. patent application Ser. No 11/152,991, entitled “FORWARDING TABLE REDUC
example, a value in the range 1 us to 100 us. As noted above, When a frame having an RLT tag is sampled by a congestion point, the contents of timestamp ?eld 425 and unit ?eld 427 are copied and inserted in timestamp ?eld 350 and unit ?eld
TION AND MULTIPATH NETWORK FORWARDING”
and ?led on Jun. 14, 2005, both of Which are hereby incorpo
rated by reference. BCN frame 505 includes outer destination address ?eld 510, Which indicates the outer source address of the sampled
erating BCN frames at a congestion point Will noW be
frame. Field 515 indicates the outer source address of the
congestion point, Which is a hierarchical MAC address in this example. Field 520 indicates the outer S-Tag (the outer IEEE
802.1Q tag) of the sampled frame. Field 525 indicates that
described With reference to FIG. 6. Queue 605 is a queue of a 25
Field 530 indicates the inner destination address, Which is the inner source address of the sampled frame. Fields 535
through 580 correspond generally With ?elds 315 through 30
congestion conditions. Incoming frames are sampled With a certain probability P 620. P 620 is a con?gurable parameter, the selection of Which
S-Tags 540 and 520 (a.k.a. B-Tag in 802.1ah) should be the same as the VLAN ?eld of the 802.1Q ?eld of the sampled frame. The priority ?eld of the outer S-Tag 520 should be set to the highest level of priority, While the same ?eld of the inner
S-Tag 540 is the priority ?eld of the sampled packet. FIG. 2B illustrates exemplary exchanges of messages that
congestion point. An equilibrium threshold Qeq 610 de?nes a desired operating point of a queue under congestion condi tions. In other Words, Qeq 610 establishes a target level around Which the length of queue 605 should oscillate When congestion arises. A severe congestion threshold Qsc 615 de?nes the level at Which the queue is subject to extreme
frame 505 is a MAC-in-MAC frame.
370 of BCN frame 3 05. The VLAN ?eld of the inner and outer
352 of a BCN frame generated by the congestion point. Exemplary methods for congestion detection and for gen
is a tradeoffbetWeen the usefulness of more frequent conges
tion detection and the overhead required for more frequent 35
sampling and computation. In some preferred implementa tions, P 620 is in the range of 0.001 to 0.1; in some such
may occur When a reaction point has already received one or
implementations, P 620 is 0.01. The values of Qeq 610, Qsc
more BCN frames from a congestionpoint. Here, edge sWitch 110 has previously received BCN frames from core sWitch 140. Additional BCN frames are en route, including positive BCN feedback message 250 and another negative BCN feed back message 220.
615 and P 620 should be established before the other steps shoWn in FIG. 6 are performed. In step 625, a congestion point determines Whether or not to sample a frame. If no frame is sampled, no BCN frame Will be generated at that moment. When a frame is sampled, the
When edge sWitch 110 receives a BCN frame from con gestion point 140 and such message is intended to cause a congestion mitigation action to be undertaken on a particular data How (e. g., the installation of a rate limiter or the sloWing doWn of an existing one), edge sWitch 110 stores a CPID in a
local register associated With such data How. All the frames 240 belonging to that How that are subsequently injected by edge sWitch 110 in the netWork Will carry a Rate Limited Tag
process continues to step 635, Wherein the sampled frame is evaluated. In this example, When the length of queue 605 is beloW Qeq, the treatment of sampled frames Will differ according to Whether the sampled frame carries an RLT tag having a CPID that identi?es the congestion point. If the sampled frame does not carry such an RLT tag and the length of the queue beloW
(“RLT”) containing the CPID.
Qeq, no BCN Frame is generated (message generation scheme 640) and sent (step 642). HoWever, if the sampled
One exemplary rate-limited frame 400 is illustrated in FIG. 4. Fields 402 and 405 indicate the destination address and source address, respectively, of rate-limited frame 400. Field 407 indicates the S-Tag value of rate-limited frame 400.
beloW Qeq, a BCN Frame is generated (message generation scheme 660) and sent (step 642). In other Words, in this implementation, if the sampled
frame does carry such an RLT tag and the length of the queue
Fields 410 through 427, shoWn in bold in FIG. 4, comprise
frame carries an RLT tag the congestion point generates a BCN frame irrespective of the current queue length if and
an RLT in this example. Field 410 indicates that the tag is an
only if its congestion point identi?er matches the CPID ?eld
RLT. In this example, the RLT tag is identi?ed by a predeter mined value in EtherType ?eld 410. Version ?eld, 412 and Reserved ?eld 414 have the same meaning as the 330 and 335,
respectively, of BCN frame 305. CPID ?eld 415 indicates the congestion point to Which the RLT pertains. This information may be used to complete the association betWeen a reaction point and the corresponding congestion point. One important purpose of this association is to prevent a reaction point from receiving positive feedback from multiple congestion points for the same ?oW. Preferably,
in the RLT tag. When such a match occurs, the timestamp 60
?eld of the RLT tag is copied into the corresponding ?eld of the BCN Frame.
In this implementation, When the queue length is above Qeq, the Congestion Point Will generate either a regular BCN feedback message or a “stop” BCN feedback message irre 65
spective of the CPID ?eld in the RLT tag. In this example, if the length of queue 605 is ZQeq and is éQsc, a negative BCN feedback message is generated Whether or not the packet
US 7,961,621 B2 10 carries an RLT tag, and Whether or not the CPID of the RLT
respect to the offset component Qoff(Which is also referred to
tag (if any) matches the congestion point ID. A “stop” BCN
herein as the instantaneous measure of congestion or the like).
feedback message is generated When the length of the queue
The values of Qoff and Qdelta are determined from BCN frames received by a reaction point. Based on the sign of the Feedback Signal Fb, in some implementations of the inven
is >Qsc. In this example, a BCN feedback message includes tWo
?elds, Qoff and Qdelta. Qoff is an instantaneous measure of congestion, Which in this example is the offset of the current queue length With respect to the equilibrium threshold Qeq. Here, Qoff is saturated at +Qeq and —Qeq. Here, a BCN feedback message also includes congestion change informa
tion the rate R is increased or decreased as folloWs:
tion. Here, the congestion change information is Qdelta,
If FbIO, R is unchanged. Here, Gi and Gd are the Increase Gain and Decrease Gain respectively, and Ru is the Rate Unit
IfFb>OR :R +Gi-Fb-Ru
Which is the change in length of the queue since the last sampled frame. In this example, Qdelta is saturated at +2 Qeq and —2 Qeq. When Qdelta saturates, the Q bit in the BCN Frame is set. A “stop” BCN feedback message is indicated by Zero values for Qoff and Qdelta. In fact, since a BCN message is not generated When a frame is sampled and Qoff and Qdelta
(i.e., the granularity of the rate adjustment) employed by the rate limiters. In one example, GiIl, Ru:8 Mbps and Gd:1/64. HoWever, these values are merely exemplary and the vari
ables of Equations (2) and (3) may be optimiZed according to the implementation. The calculations are preferably done in the reaction point. In alternative implementations, the calcu lations are done elseWhere, e.g., in the detection point. HoW
are both Zero, this combination may be used to identify a
“stop” BCN message. Qdelta may be calculated according to at least tWo meth ods. In the ?rst method, Qdelta is the difference betWeen the current queue length and the queue length at the previous time of sampling. In a second method, Qdelta is the difference betWeen the number of packets (or other data units) added to the queue and the number of packets (or other data units) removed from the queue since the last time of sampling. The
tions in the general form of Equations (2) and (3) to control 25
changes in R, the rates are decreased more aggressively When
Fb<0 (a multiplicative decrease) than the rates are increased When Fb>0 (an additive increase). This is desirable in order to avoid ?lling the buffers of a congestion point too quickly due
the previous queue length be stored in memory. The second method requires a smaller amount of state to be kept, but may
to a sloW response to detected congestion or due to a too-rapid 30
increase in How When congestion is abating. A limited number of ?lters/rate limiters may be available. There may be cases When all the ?lters have been used and a BCN message is received Which should cause the instantia
FIG. 7 illustrates the structure of the data paths of a reaction
point according to some implementations of the invention. This process may be implemented, for example, in an ingress port of an edge sWitch or in an egress port of the netWork
interface card (“NIC”) of a host device. Data path 705 repre sents a condition of the reaction point before any BCN frames
ever, if the calculations are performed in a location other than
the reaction point, the most effective use of timestamps Will be inhibited. It Will be observed that in implementations that use equa
?rst method is more accurate but requires that an indication of
be prone to error accumulation.
tion of a neW ?lter/rate limiter pair. In such cases, a number of 35
actions may be taken, e.g.: (l) aggregate all the ?lters/rate limiters in a single ?lter/rate limiter that controls the entire
have been received indicating congestion that pertains to this
traf?c originated by and end system; (2) aggregate ?lters/rate
reaction point, e. g., as in the state of edge device 110 in FIG.
limiters in an “intelligent” Way, e.g., use the same ?lter/rate
2A. In data path 705, un-tagged data frames, like those of data frames 210 of FIG. 2A, are transmitted by the reaction point. After BCN frames have been received indicating conges
limiter for all the traf?c ?oWs sharing the same destination address, etc; or (3) aggregate ?lters/rate limiters in a “less intelligent” Way, e.g., use the same ?lter/rate limiter for all the
tion of the frame header. When a reaction point receives a BCN Frame, the differ ence betWeen the current time and the time indicated in the timestamp ?eld of the BCN Frame is calculated. This differ ence is the last measure of the round trip time betWeen the
tion that pertains to this reaction point (e.g., as in the state of edge device 110 in FIG. 2B), a set of ?lters 720, F1 through Fn, divert the traf?c that matches a particular ?ltering crite
rion (e.g., L2 SA-DA, L3 SA-DA, etc.) from data path 705 to
traf?c ?oWs sharing the same bucket based on an hash func
a set of queues. Tral?c is drained from such queues by a set of
corresponding rate limiters 740, R1 through Rn, Whose rate is controlled by the BCN Frames coming from congestion points. Besides controlling the rate of traf?c, in this imple mentation the rate limiters also cause an RLT tag to be added
reaction point and the congestion point. This measure may be
averaged out, for example using an Exponential Weighted 50
to all the frames they transmit in order to elicit feedback from the congestion points. To ensure that the feedback is gener
ated only by the congestion point that originally caused the instantiation of the ?lter, the RLT tag contains the identity of
such congestion point (“CPID”). Congestion points should
include their identity in every BCN Frame they generate, so that each of ?lters 720 may be associated With individual
congestion points. According to some implementations of the invention, the rate control algorithm used by rate limiters 740 Works accord ing to a Feedback Signal Fb that is calculated, e.g., according to Equation(l): Fb:(QOff-W'Qdelta)
Once a rate limiter has been instantiated, it may be reclaimed once tWo conditions are satis?ed: (l) the queue of the rate limiter is empty, and (2) its rate is at or above the line-rate. These tWo conditions are necessary to avoid out of
order packet delivery.
In Equation (1), W is a parameter used to Weight the deriva tive component Qdelta (Which is also referred to herein as the congestion change component or the like) more or less With
Moving Average similar to the one used by WRED, and used to dynamically adjust the value of some of the reaction parameters. For example, a reaction point may have a number of tables containing different values of the W, Gi, and Gd parameters precalculated based on different round-trip times. The current value of the averaged round-trip time may be used to select the table of parameters that best suite the current loop
Each rate limiter is associated With a timer that is reset every time a BCN Frame is received. If this timer expires, it means that the corresponding rate limiter has not received BCN Frames for the entire duration of the timeout period.
This may happen, for example, because the traf?c stream that