The post title alone is cause for fighting in some circles (it’s just an invitation for argument and I know it’s more of a marketing thing than a technically accurate description), but work with me here. On one level or another, there is growing interest and marketing around the concept of being able to eliminate Spanning Tree Protocol (STP) in layer 2 networks and enabling multipathing in bridged networks. It’s hard to have missed Cisco’s plugging of their FabricPath technology, and underneath all the marketing, routing frames from A to B is pretty much what it is about.
For the purposes of this post, we’ll look at why STP is such a beast to begin with (and let’s face it, that could be a multi-part post on its own), then I’ll look in a series of posts at three competing options that would allow you to get rid of it:
- TRILL (in this post)
- Shortest Path Bridging / 802.1aq
- Juniper’s QFabric
Let’s dive straight in then.
Why Spanning Tree Sucks
When Radia Perlman designed the Spanning Tree algorithm one fateful night in 1985, she created a protocol that is arguably both a huge networking enabler and the most evil beast in existence. The fact that it came with a free “Algorhyme” pretty much guaranteed its acceptance of course:
I think that I shall never see
A graph as lovely as a tree.
A tree which must be sure to span.
So packets can reach every LAN.
First the root must be selected.
By ID, it is elected.
Least cost paths from Root are traced.
In the tree these paths are placed.
A mesh is made by folks like me.
Then bridges find a spanning tree.
As a quick overview here’s a very abbreviated pros/cons evaluation of STP as I see it:
Pros:
- You can join Ethernet segments using Bridges and deliberately introduce loops for resiliency, but without traffic looping around infinitely;
- STP is a distributed algorithm that scales in terms of CPU/memory requirements regardless of network size;
- STP is plug and pray (within reason);
- STP has allowed campus and datacenter networks to scale with resilient links.
Cons:
- STP recalculations => potentially long data interruption;
- Many ‘enhancements’ and proprietary variations over the years;
- Resiliency yes, but hideously inefficient utilization of the available bandwidth – no way to load balance traffic.
There was a great diagram I saw in the slide deck for Cisco Live 2010 session BRKDCT-2079 demonstrating the last point above. To some extent it’s a case of reductio ad absurdum, but the point was made that STP “takes a perfectly good meshed network and reduces it to a tree”. With thanks (and apologies) I’m going to shamelessly borrow the illustration used as it’s a rather handy visualization of the issues with STP. The Cisco Live slide had a diagram something along these lines (assuming 10Gbps links):
The bold lines represent the paths that have not been blocked by STP, i.e. actively forwarding links. In this case, a VLAN that is trunked across all these wonderfully meshed links is ultimately reduced from a potential of 160Gbps capacity at the access layer to 20Gbps through the Core. When you consider what is often referred to as “East-West” switching (i.e. servers talking to other servers on different access switches), a server on the left-most access switch talking to a server on the right-most access switch would see its frames going up to the Core and all the way back down, and it would be bottlenecked by the fact that this ‘tree’ structure has to have one – and only one – root node at the top. Ok, this also assumes that we trunk through our core – and we wouldn’t do that, would we? – but the point is well made. In order to ensure a loop-free topology, Spanning Tree effectively wastes much of the potential bandwidth available to you, and doesn’t necessarily make the best decision in terms of the “best” path between any two given points.
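To put some numbers on that, here’s a minimal Python sketch of what a spanning tree does to a mesh. The topology (two core, four distribution and four access switches, fully meshed between layers) is my own hypothetical stand-in, not the one from the Cisco Live slide, and the breadth-first search is a simplification of real STP, which elects the root by bridge ID and chooses paths by cost. The end result is the same kind of thing though: a loop-free tree with every other link blocked.

```python
from collections import deque


def spanning_tree(links, root):
    """Return the links kept by a breadth-first tree rooted at `root`.

    Simplification: real STP uses bridge IDs and path costs, but it also
    ends up with exactly one loop-free tree and blocks everything else.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    forwarding = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for peer in sorted(neighbors[node]):
            if peer not in visited:
                visited.add(peer)
                forwarding.add(frozenset((node, peer)))
                queue.append(peer)
    return forwarding


# Hypothetical three-tier mesh (names and sizes are made up for illustration).
CORE = ["core1", "core2"]
DIST = ["dist1", "dist2", "dist3", "dist4"]
ACCESS = ["acc1", "acc2", "acc3", "acc4"]

links = [("core1", "core2")]
links += [(c, d) for c in CORE for d in DIST]    # every core to every dist
links += [(d, a) for d in DIST for a in ACCESS]  # every dist to every access

tree = spanning_tree(links, "core1")
blocked = len(links) - len(tree)
print(f"{len(tree)} of {len(links)} links forwarding, {blocked} blocked")
```

In this made-up topology 25 links go in and only 9 stay forwarding, which is the “perfectly good meshed network reduced to a tree” complaint in miniature.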
Over the years, STP was deemed too slow to converge. For ports connected to end devices like PCs, the whole blocking, listening, learning, forwarding sequence meant that end stations found themselves unable to complete DHCP requests (for example) because the port was not yet forwarding when they sent the request. Similar issues arise in the wider network, as topology changes cause STP recalculations during which traffic is again frozen while the network re-converges to a loop-free state. As a result, Cisco in particular drove a number of very handy enhancements like PortFast, BackboneFast and UplinkFast. Trunking between switches (ISL and later on 802.1Q) saw vendors taking different approaches to STP for multiple VLANs – some chose to run a single STP instance, and Cisco chose to go down the Per-VLAN Spanning Tree (PVST) path – more complex in many ways, but more controllable, and certainly more effective where different VLANs were trunked over different devices to each other and needed a unique tree. Of course that in turn didn’t scale so well when you started talking about thousands of VLANs (and if you have a Cisco Catalyst 6500 or 7600 switch with a few hundred VLANs being trunked in and out of a given linecard, you’ve most likely hit this warning):
%PM-SP-4-LIMITS: The number of vlan-port instances on module 1 exceeded the recommended limit of 1800
The solution? Multiple Spanning Tree (MST) – a nice compromise that sits somewhere between PVST (which has an instance per VLAN), and a single-STP solution, allowing you to map multiple VLANs into a single STP instance, but still allowing you to have multiple STP instances and thus multiple trees. In theory this does allow you to create a primitive form of load sharing between VLANs, but in the case of link failures I’m not sure I’d like to try and predict the net effect in a complex network.
So let’s assume then that while functional, STP is fundamentally evil. Even Perlman is on record saying that she didn’t think STP was the right solution, and that she believed that traffic should always have been routed between segments, not bridged. So where do we go if we want to maintain the crazy level of meshing that was shown in the earlier diagram, but also to avoid letting STP have its wicked way with your links?
TRILL / L2MP / FabricPath
First let’s get the definitions out the way:
- TRILL = TRansparent Interconnection of Lots of Links (surely vying for a spot at the Most Contrived Acronym awards), an IETF protocol;
- L2MP = Layer 2 Multi-Pathing;
- FabricPath = what happens when Cisco marketing wants to sell a technology that doesn’t have a cool enough name. I wonder sometimes what would happen to a technology like BGP if it were new today?
What’s the difference between these three? Let’s disambiguate L2MP and FabricPath first; FabricPath appears to be the Cisco marketing name for L2MP. In turn, L2MP is basically TRILL with – inevitably – some extra Cisco proprietary knobs and features which they claim you will want. The bottom line is though that the underlying technology for these is pretty much TRILL, so if you understand TRILL, you’ll have a pretty good idea what’s going on with FabricPath. So let’s do that.
The “Problem and Applicability Statement” for TRILL is published in RFC5556, authored by Joe Touch (ISI) and Radia Perlman (Sun):
Interesting to see Radia Perlman there again, isn’t it? I like to think that she’s doing this in order to make up for the scourge of STP, but whatever her reasons I’m not complaining. If I might be irreverent for a moment, as I looked at Perlman’s picture I did notice an uncanny resemblance to another famous person and wondered if they were in some way related:
Oddly, I feel like a lightning bolt is about to strike in my vicinity (perhaps reaching out through the Ethernet cabling). Anyhow, Radia’s son, Ray Perlner, was kind enough to write Algorhyme V2 in an attempt to once again ensure acceptance of Radia’s baby:
I hope that we shall one day see
A graph more lovely than a tree.
A graph to boost efficiency
While still configuration-free.
A network where RBridges can
Route packets to their target LAN.
The paths they find, to our elation,
Are least cost paths to destination!
With packet hop counts we now see,
The network need not be loop-free!
RBridges work transparently,
Without a common spanning tree.
– Ray Perlner
You really can’t argue against a poem like that, but before you go deploying TRILL with those rhyming couplets dancing around in your head we should probably look at how it works so you know what you’re in for. In terms of headline features, TRILL is:
- perhaps most simply described as “layer 2 routing” (controversy alert!);
- capable of ECMP (Equal Cost Multi Path) delivery of frames;
  - the protocol allows “unlimited” equal cost paths;
  - unclear if there may be hardware dependent limits;
- a replacement for the Spanning Tree Protocol;
- transparent to end stations;
- a “zero configuration” protocol;
- compatible with existing bridges (can do incremental deployment).
When I say Layer 2 routing, in order to avoid the inevitable backlash, it’s only true up to a point – in theory, frames are routed from end to end across the layer 2 domain based on a knowledge of where a remote MAC address is located. Reality though means that there is still a need to flood frames at times, and that’s a behavior you don’t see with true routing protocols, right? Imagine a routing protocol where if you don’t have a route, you simply send the packet everywhere in the hopes that eventually it gets to the right destination? Ok, so not quite truly routing then, but certainly within the layer 2 domain, we’re going to try to route as many frames as possible. Maybe it’s just Smart Bridging?
“Routing” frames means you can make smart decisions and use ECMP to load share traffic, meaning that the network diagram we saw earlier with Core bandwidth limited to 20Gbps (which affected East-West traffic flows), now looks like this:
Hooray, we can trunk VLANs through our Core with impunity! Of course, that East-West traffic will likely never go to the Core now anyway, as all paths from Access to Distribution are now available.
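How does a switch choose between several equally good paths? The usual ECMP answer is to hash something that identifies the flow, so that frames from one conversation stay in order on one path while different conversations spread out. The sketch below only illustrates that idea: the hash inputs (source and destination MAC), the use of CRC32 and the switch names are my assumptions, not something TRILL mandates, and real hardware will do its own thing.

```python
import zlib


def pick_next_hop(src_mac, dst_mac, equal_cost_next_hops):
    """Choose one of several equal-cost next hops for a given flow."""
    if not equal_cost_next_hops:
        raise ValueError("no next hop available")
    # Sort first so the answer does not depend on the order we were handed.
    candidates = sorted(equal_cost_next_hops)
    flow_key = f"{src_mac.lower()}|{dst_mac.lower()}".encode()
    return candidates[zlib.crc32(flow_key) % len(candidates)]


# Hypothetical distribution switches, all at equal cost from an access switch.
paths = ["dist1", "dist2", "dist3", "dist4"]

first = pick_next_hop("00-00-fe-aa-bb-01", "00-00-fe-11-22-33", paths)
again = pick_next_hop("00-00-fe-aa-bb-01", "00-00-fe-11-22-33", paths)
```

The same pair of MAC addresses always lands on the same next hop (`first == again`), which is what keeps a single flow from being reordered.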
A TRILL capable switch is called an RBridge (Routing Bridge). Each RBridge runs IS-IS in order to communicate with other RBridges and establish a routed topology. IS-IS is chosen for a few reasons:
- IS-IS runs at Layer 2, so switches require no Layer 3 configuration to make this work
  - Compare with OSPF, which requires IP addresses to be configured, and uses IP Multicast to communicate!
- IS-IS is extensible (consider for example Integrated IS-IS where IP prefix information is piggy-backed on the underlying IS-IS connectivity).
- IS-IS is a Link State Protocol, so it’s pretty fast, efficient and scalable.
What if you don’t know IS-IS, or your engineers only know OSPF? It doesn’t matter! TRILL is intended to be plug and play – you don’t need to configure anything to make it work, you just need to enable it and let it discover its own neighbors and do the hard work automatically. It’s unlikely that you would ever need to even see anything other than who your neighbors are, and I sincerely doubt that there will ever be commands implemented to tweak TRILL IS-IS parameters.
But wait – if I already have IS-IS running on my network and I enable TRILL (which uses IS-IS too) on the same VLAN as one of my IS-IS routers, won’t they find each other and get confused? No; the TRILL team thought of that one and use a different multicast MAC for TRILL’s IS-IS implementation, so they will be like ships in the night. A new range of multicast MAC addresses has been assigned for TRILL (01-80-C2-00-00-40 through 01-80-C2-00-00-4F) with, for example, 01-80-C2-00-00-41 assigned for “All IS-IS RBridges”. There’s a new Network Layer Protocol Identifier (NLPID) defined – 0xC0 – and it looks like the TLV Code Points have been approved. IS-IS is basically being extended to support the unique requirements of TRILL, and to ensure that there is no conflict between regular IS-IS and TRILL’s use of IS-IS. You can read more about TRILL’s use of IS-IS in the IETF Internet-Draft “TRILL Use of IS-IS (draft-ietf-isis-trill-05)”.
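The “ships in the night” behavior boils down to a simple address check, which can be sketched in a few lines of Python. The range and the “All IS-IS RBridges” address are the ones quoted above; the function names are mine and purely illustrative.

```python
# Multicast MAC range assigned to TRILL, as quoted in the text above.
TRILL_MCAST_FIRST = 0x0180C2000040
TRILL_MCAST_LAST = 0x0180C200004F
ALL_ISIS_RBRIDGES = 0x0180C2000041


def mac_to_int(mac):
    """Convert '01-80-C2-00-00-41' (or ':' separated) to an integer."""
    digits = mac.replace("-", "").replace(":", "")
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return int(digits, 16)


def is_trill_multicast(mac):
    """True if the destination falls inside the TRILL multicast range."""
    return TRILL_MCAST_FIRST <= mac_to_int(mac) <= TRILL_MCAST_LAST


def is_all_isis_rbridges(mac):
    """True if the frame is addressed to TRILL's IS-IS instance."""
    return mac_to_int(mac) == ALL_ISIS_RBRIDGES
```

A regular IS-IS router listening on, say, 01-80-C2-00-00-14 (which I believe is the address classic Level 1 IS-IS uses) sits outside that range, so neither side ever processes the other’s hellos.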
Lots of theory for sure, but I’m guessing you’re still wondering how TRILL actually works, so let’s go there next.
How TRILL Works
We know that TRILL uses IS-IS as its underlying protocol; this is used to ensure that every RBridge has a picture of the network so that an optimal routing tree can be created, including support for Equal Cost Multi Path (ECMP) to a destination. The tree that’s created isn’t based on the actual MAC addresses, but rather on the RBridge IDs that are exchanged. IS-IS is used to figure out the best path or paths to get frames from one RBridge to another through any number of intermediate RBridges.
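For the curious, here’s roughly what “best path or paths” means in practice: a shortest-path calculation that remembers every equal-cost first hop instead of just one. This is a plain Dijkstra sketch in Python under my own assumptions (a four-RBridge diamond with made-up names and link costs of 10); it is not IS-IS itself, which also has to flood and synchronize the link-state database before it can run anything like this.

```python
import heapq


def ecmp_next_hops(graph, source):
    """Dijkstra that keeps every equal-cost first hop.

    graph: {node: {neighbor: link_cost}} with positive costs.
    Returns {destination: (total_cost, set_of_first_hops)}.
    """
    best = {source: (0, set())}
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > best[node][0]:
            continue  # stale heap entry, a cheaper path was found already
        for peer, link_cost in graph[node].items():
            new_cost = cost + link_cost
            # Leaving the source, the first hop is the peer itself;
            # further out, inherit the first hops used to reach `node`.
            hops = {peer} if node == source else best[node][1]
            if peer not in best or new_cost < best[peer][0]:
                best[peer] = (new_cost, set(hops))
                heapq.heappush(heap, (new_cost, peer))
            elif new_cost == best[peer][0]:
                best[peer][1].update(hops)  # another equally good path
    del best[source]
    return best


# Hypothetical diamond: rb1 reaches rb3 through either rb2a or rb2b.
graph = {
    "rb1": {"rb2a": 10, "rb2b": 10},
    "rb2a": {"rb1": 10, "rb3": 10},
    "rb2b": {"rb1": 10, "rb3": 10},
    "rb3": {"rb2a": 10, "rb2b": 10},
}
routes = ecmp_next_hops(graph, "rb1")
```

Here `routes["rb3"]` comes out as cost 20 with both `rb2a` and `rb2b` as usable first hops, which is exactly the information an ECMP-capable forwarding table needs.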
When an RBridge stores a mapping (my terminology) for a destination MAC address, it stores the MAC address along with the ID (or ‘Nickname’) of the RBridge that that MAC is connected to (i.e. the destination RBridge):
This level of abstraction means that when there’s a topology change in the network, only the RBridge tree needs to be recalculated and a new path to the other RBridges installed in the forwarding tables; MAC entries don’t necessarily need to change, as they point to the RBridge ID and are not directly part of the tree. By way of analogy, this is similar to a redistributed External OSPF route advertised with a forwarding address set to 0.0.0.0 – the routing decision is made based on recursive lookup of the Advertising Router’s IP; consequently a Shortest Path does not need to be calculated for the prefix contained in the External route, just to the Advertising Router IP.
Incidentally, with Spanning Tree Protocol a network link failure leads to a Spanning Tree recalculation, and to accelerated CAM table aging (i.e. the bridging / MAC address forwarding tables are flushed). With TRILL if a network link fails, the underlying topology is recalculated in IS-IS, but assuming that there is still an alternate path available to get to the egress RBridge, the MAC mappings do not need to be flushed. This helps reduce flooding after a failure, and because of the rapidity with which IS-IS recalculates paths, interruption to traffic flow is minimal.
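The two-level lookup is easy to picture as a pair of dictionaries. In this Python sketch the first MAC address and its egress RBridge come from the example in this post; the second station, RBridge4 and the next-hop lists are hypothetical values I’ve added to make the point.

```python
# Table 1: learned MAC mappings (destination MAC -> egress RBridge).
mac_to_egress = {
    "00-00-fe-11-22-33": "RBridge1",
    "00-00-fe-44-55-66": "RBridge2",  # hypothetical second station
}

# Table 2: built by IS-IS (egress RBridge -> usable next hops).
nickname_next_hops = {
    "RBridge1": ["RBridge2", "RBridge4"],  # hypothetical equal-cost paths
    "RBridge2": ["RBridge2"],
}


def resolve(dst_mac):
    """Two-stage lookup: MAC -> egress RBridge -> next hop(s)."""
    egress = mac_to_egress.get(dst_mac)
    if egress is None:
        return None  # unknown destination: the frame would be flooded
    return egress, nickname_next_hops[egress]


before = resolve("00-00-fe-11-22-33")

# The path via RBridge4 fails: IS-IS recalculates and only table 2 changes.
nickname_next_hops["RBridge1"] = ["RBridge2"]
after = resolve("00-00-fe-11-22-33")
```

After the failure the MAC still maps to RBridge1; only the list of next hops behind that name has shrunk, so nothing in the MAC table needed flushing.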
Let’s assume that I have an RBridge, RBridge3, which has the MAC mappings and IS-IS forwarding tables illustrated above. From the perspective of RBridge3, to get to 00-00-fe-11-22-33, it knows it needs to send traffic to RBridge1. If I have other RBridges in between RBridge1 and RBridge3, there are two ways to ensure that the traffic gets to the other end as planned. One way is to ensure that every RBridge in the network has a full list of MAC mappings, and can therefore make its own forwarding decision. This is good, but means that RBridges in the network Core may face a large burden in terms of the size of tables they have to maintain – kind of the opposite of how we normally like to utilize a Core. The alternative is to encapsulate the original frame in a new frame whose destination is RBridge1. Intermediate RBridges don’t need to know the destination MAC, they just need to know how to forward traffic to RBridge1. Well that’s easy – IS-IS has already ensured that the RBridges all know about each other! And so that’s the choice that was made – the ingress RBridge encapsulates the frame using a special TRILL Ethertype (0x22F3) and sends it to the destination RBridge. It’s still a “Hop by Hop” routing decision – each RBridge makes its own decision how best to get to the destination RBridge – but the choice of egress RBridge was determined by the ingress RBridge.
As a TRILL-encapsulated frame traverses the network, the destination address of the frame must change at each hop (just like it does when routing IP), and inside the frame there must be a record of where the frame is ultimately going (the egress RBridge), again kind of like the destination IP address in an IP packet. This is accomplished using an outer (Transport) header and an inner (TRILL) header prepended to the original frame. You can read in more sickening detail and with fewer crass generalizations about the frame formats in the IETF Internet-Draft “RBridges: Base Protocol Specification”.
Here’s a representation of a frame that ingressed at RBridge1, is ultimately destined for RBridge3, and is encapsulated by TRILL to traverse RBridge2 on the way:
The frame as shown between RBridge1 and RBridge2 can be decoded thus:
- The egress RBridge – i.e. destination – (from TRILL Header) is RBridge3.
- The ingress RBridge (from TRILL Header) is RBridge1.
- The sending device for this link (from Transport Header) is RBridge1.
- The destination device for this link – i.e. the next hop – (from Transport Header) is RBridge2.
Once RBridge2 receives it and forwards it on towards RBridge3, the TRILL Header still names the same ingress and egress RBridges (only its hop count is decremented), and the Transport Header will show a source of RBridge2 and a destination on that link of RBridge3. This is all suspiciously similar to watching an IP packet encapsulated in Ethernet traversing some routers.
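The hop-by-hop rewrite can be modeled in a few lines of Python. This is a sketch of the behavior described above, not the real frame format: the field names are mine, RBridge names stand in for the actual MAC addresses and nicknames, and the starting hop count of 16 is an arbitrary value I picked for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrillFrame:
    outer_src: str      # Transport header: sender on this link
    outer_dst: str      # Transport header: next hop on this link
    ingress: str        # TRILL header: where the frame entered
    egress: str         # TRILL header: where the frame will leave
    hop_count: int      # TRILL header: decremented at every RBridge
    inner_frame: bytes  # the original Ethernet frame, carried untouched


def encapsulate(inner_frame, ingress, egress, next_hop, hop_count=16):
    """Done once, by the ingress RBridge (hop_count default is hypothetical)."""
    return TrillFrame(ingress, next_hop, ingress, egress, hop_count, inner_frame)


def forward(frame, this_rbridge, next_hop):
    """Done by every transit RBridge: rewrite the outer header only."""
    if frame.hop_count == 0:
        raise ValueError("hop count exhausted; frame is dropped")
    return replace(
        frame,
        outer_src=this_rbridge,
        outer_dst=next_hop,
        hop_count=frame.hop_count - 1,
    )


original = b"original Ethernet frame"
hop1 = encapsulate(original, "RBridge1", "RBridge3", next_hop="RBridge2")
hop2 = forward(hop1, this_rbridge="RBridge2", next_hop="RBridge3")
```

Comparing `hop1` and `hop2`, the ingress, egress and inner frame are identical; only the outer source and destination (and the hop count) differ, just as the MAC header changes around an unchanged IP packet at each router.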
This is all good for unicast MACs, but TRILL also must support the transport of Broadcast/Multicast frames. TRILL creates optimal Broadcast/Multicast trees, and uses Reverse Path Forwarding checks to ensure that there are no loops. I’ve honestly not seen much more information about the nature of those trees, so I’m going to leave it there and try to be content with the rather skimpy knowledge that they are supported as efficiently as possible.
We now know:
- that IS-IS is used to build a topology of RBridges;
- that frames are TRILL encapsulated from ingress to egress in a TRILL network;
- that RBridges make a hop-by-hop routing decision;
- that MAC addresses map to egress RBridges;
- and that a MAC address mapping is only required at ingress RBridges that need to talk to the destination MAC;
- that broadcast/multicast MAC flooding is supported.
So far so good, but how do we learn those MAC addresses in the first place?
Learning MAC Addresses
This is, amusingly enough, the part of the protocol that seems to have got the least attention when it’s being presented, which is a little odd when you consider that without the ability to learn MAC mappings, nothing will work. Since again we have a rather important part of the protocol with only scant information available, I will share what I know with a little bit of extrapolation and hopefully we’ll be in the right ballpark.
RBridges learn MAC addresses through four basic mechanisms:
- Locally received native (non-TRILL) frames. RBridges have to deal with end stations just like regular bridges, so they will build a regular MAC address forwarding table for locally connected devices.
- Static (manual) configuration. Yep, you can force a MAC address mapping.
- From received TRILL frames. Remember the TRILL header within the frame format I showed earlier? It includes the source (ingress) RBridge ID and the destination (egress) RBridge ID. If an RBridge receives a TRILL encapsulated frame, it can extract the ingress RBridge ID and the Source MAC address, and it will add a MAC mapping for that source MAC pointing back to the ingress. This means that (a) return traffic won’t flood initially, and (b) the path is assured to be symmetrical in terms of ingress/egress RBridges.
- (optional) Explicitly distribute MAC addresses using the TRILL ESADI (End Station Address Distribution Information) protocol. In effect, TRILL can use a special protocol to distribute the known local (native) MAC addresses to all other RBridges so that they pre-populate their MAC mappings. This is similar in some ways to the way OTV (Overlay Transport Virtualization) shares MAC addresses between sites.
I’m not convinced that ESADI is something I would want to use, but it’s certainly an option that can reduce the amount of flooding in the TRILL domain. And once we have MAC mappings, TRILL encapsulation takes over.
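Here’s how I picture the first three learning mechanisms fitting together, as a small Python sketch. It’s my own extrapolation rather than anything lifted from the drafts: the class and method names are invented, ESADI is left out entirely, and the rule that a learned entry never overrides a static one is an assumption borrowed from how ordinary bridges usually behave.

```python
STATIC, LOCAL, REMOTE = "static", "local", "remote"


class MacMappings:
    """Toy MAC mapping table for one RBridge."""

    def __init__(self):
        self.entries = {}  # mac -> (kind, port name or RBridge nickname)

    def add_static(self, mac, where):
        self.entries[mac] = (STATIC, where)

    def _learn(self, mac, kind, where):
        current = self.entries.get(mac)
        if current is not None and current[0] == STATIC:
            return  # assumption: learned entries never replace static ones
        self.entries[mac] = (kind, where)

    def learn_native(self, src_mac, port):
        """Native frame from an end station: remember the local port."""
        self._learn(src_mac, LOCAL, port)

    def learn_trill(self, inner_src_mac, ingress_nickname):
        """TRILL frame: map the inner source MAC back to the ingress RBridge."""
        self._learn(inner_src_mac, REMOTE, ingress_nickname)

    def lookup(self, dst_mac):
        """Return (kind, where), or None when the frame must be flooded."""
        return self.entries.get(dst_mac)


table = MacMappings()
table.learn_native("00-00-fe-aa-bb-01", "Ethernet1/5")    # hypothetical port
table.learn_trill("00-00-fe-11-22-33", "RBridge1")
```

Once the TRILL frame from RBridge1 has been seen, return traffic to 00-00-fe-11-22-33 can be encapsulated straight towards RBridge1 instead of being flooded.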
TRILL is all very well, but how will it fit into your network? First, whether you are looking at TRILL or FabricPath, both have seamless integration with existing STP networks. The idea is that TRILL can be installed in stages – maybe with the Core first, then expand outwards towards the access layer over time?
Multiple RBridges on a LAN Segment
Having multiple RBridges active on a LAN segment could be an issue if they all start forwarding traffic over the TRILL network, as this would cause both traffic duplication and also confusion in terms of the appropriate return path with which to populate the MAC mapping tables. Consequently, RBridges on a VLAN see each other and elect a Designated RBridge (DRB) for the segment, which in turn normally becomes the Appointed Forwarder that is exclusively responsible for sending/receiving frames on that shared segment while all other RBridges effectively are in a kind of standby mode. Technically (i.e. in the protocol specifications) it is possible for a DRB to make other RBridges Appointed Forwarders, but I am not aware of this being implemented yet, and the likelihood is that the DRB will do the AF job itself.
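As a rough illustration of that election, here’s a Python sketch. I’m assuming the election follows the classic IS-IS designated-router rule (highest priority wins, highest MAC address breaks ties), so check the current draft before relying on that; the priorities and MAC addresses are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RBridgePort:
    nickname: str
    priority: int  # hypothetical values; higher is assumed to be better
    mac: str


def elect_drb(ports):
    """Pick the Designated RBridge for a shared segment.

    Assumed rule: highest priority wins, highest MAC breaks ties.
    """
    def rank(port):
        return (port.priority, int(port.mac.replace("-", ""), 16))

    return max(ports, key=rank)


segment = [
    RBridgePort("RBridge1", 64, "00-00-fe-00-00-01"),
    RBridgePort("RBridge2", 64, "00-00-fe-00-00-02"),
    RBridgePort("RBridge3", 32, "00-00-fe-00-00-03"),
]

drb = elect_drb(segment)
# In the simple case described above the DRB does the forwarding job itself.
appointed_forwarder = drb
```

With these values RBridge2 wins: it ties with RBridge1 on priority and has the higher MAC address, while RBridge3 loses on priority despite having the highest MAC of the three.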
Obviously Cisco is a player here with FabricPath (which they describe as a pre-standard superset of TRILL). However, the only hardware currently supporting TRILL is the Nexus 7000, and only on some linecards. You may have heard the saying around lakes and marinas that the word “boat” is in fact an acronym for “Bust Out Another Thousand”; I have a feeling that the word “Nexus” may fall into a similar category – but this time something about power and cooling. Still, if you are following Cisco’s desired upgrade path and you are forklifting your old 6500/7600 platforms and replacing them with Nexus 7ks, then you are good to go!
TRILL is not compatible with FabricPath (despite the similarities) – so if you choose FabricPath, you are locked into Cisco hardware going forward. Cisco does offer a TRILL-compatibility mode though, which should interoperate with other vendors’ TRILL equipment, although that would then lose you the ‘superset’ features of FabricPath.
Brocade offers VCS (Virtual Cluster Switching) – which is a multi-pathing fabric based on TRILL – in their VDX product line. As noted in the comments by Omar Sultan, I will clarify that while the data plane in VCS is TRILL, the control plane (i.e. what would be IS-IS) instead uses Fabric Shortest Path First (FSPF) – perhaps not a surprise given the rest of Brocade’s product line, and a nice way to reuse some existing code.
Currently you will not see vendors like Avaya, Nortel, HP (H3C/3Com) and Juniper supporting TRILL; we’ll look at what they are supporting as this series continues.
Do I Need TRILL?
That, my friend, is the million dollar question. Theoretically, it could help reduce failure recovery times after link failures; it can allow you to better utilize your bandwidth by allowing multipathing, and (one hand clapping) it can get rid of STP for you. On the other hand, it’s a fairly new technology that is supported by a small number of vendors with proprietary twists, and may lock you into something you didn’t want to be locked into. Opponents (or at least, proponents of alternative solutions – particularly Shortest Path Bridging, SPB) argue that TRILL’s new protocols/encapsulations mean redeveloping OA&M tools, which means a long lead time and high expense before you can manage your network properly. They also argue that TRILL may require new ASICs in routers to be supported (perhaps why it’s not being offered in the 7600/6500 platforms), and thus may mean requiring a full forklift exercise in order to implement. On the other hand, if you just implemented a Nexus7k network, it’s of interest, don’t you think? That and OTV… which should be coming in a later post.
For interfaces that connect to other FabricPath RBridges, this is the rather challenging configuration required. Since this is NX-OS, the first task is to install the feature-set so that you’ll have the fabricpath commands available, then enable the feature-set. In the default VDC:
N7K(config)# install feature-set fabricpath
Then in the VDC in which you want to run FabricPath, enable the feature-set:
N7K-CORE(config)# feature-set fabricpath
Next up, define the vlan(s) that you’d like to use to carry the FabricPath traffic:
N7K-CORE(config)# vlan xxx
N7K-CORE(config-vlan)# mode fabricpath
Then finally, configure the interface (or port-channel) that connects to the other FabricPath device(s):
N7K-CORE(config)# interface Ethernet1/1
N7K-CORE(config-if)# description Fabricpath uplink
N7K-CORE(config-if)# switchport mode fabricpath
And… that’s about it. For monitoring, you need the “show fabricpath” commands.
TRILL is a fast-moving beast to keep track of, and I’m positive that something I’ve said will turn out to be “old information” that has since been superseded. There’s also limited detailed information out there (especially as it isn’t completely standardized at this time – note the Internet-Drafts that were referenced) so if I have misinterpreted any part of TRILL’s functioning, again please let me know. Basically, keep me honest here please!
I’d love to hear your thoughts on whether you plan to deploy FabricPath or VCS fabric, and if you have already, what your experience is from both a functional and operational perspective, good or bad.
Next time, I’ll be making a hash of Shortest Path Bridging.
Updated 2011-05-18 @11:40am: Clarified use of FSPF in VCS TRILL implementation following Omar’s comment below.