First of all I largely agree with Mike's email,
but then that's not going to surprise anyone :)
Alia Atlas wrote:
At 10:35 AM 4/26/2005, mike shand wrote:
At 15:07 25/03/2005 -0500, Alia Atlas wrote:
Second is the list of downsides with the approach. The main concern
is that the mechanism becomes too complex such that the trade-off
between its complexity and the full coverage is not desirable.
1. This requires a large number of additional IP addresses in
the IGP. The same number of additional FECs is required to support LDP.
Yes, it does. In the simplest case of link and node protection, and
ignoring LANs it requires 2 addresses per protected link. It is
expected that these would come out of a "private" address space, and
hence wouldn't consume real addresses. Indeed for security reasons it
is preferable that they are private addresses.
I don't think this number is "too many". The question is how does this
number increase when we add LANs and SRLGs.
It would be useful to hear some additional opinions on the impact of
adding a large number of addresses. The other question is what is the
boundary when it becomes a serious concern.
Also to understand whether the issue is the number of addresses per se
or the inflation of the routing protocol message size.
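To make the scale of the question concrete, here is a rough sketch of the address count in the simplest case discussed above (two not-via addresses per protected point-to-point link, ignoring LANs and SRLGs). The topology is purely illustrative, not from the draft:

```python
# Sketch: count the extra not-via addresses needed for simple
# link/node protection: one per direction of each protected
# point-to-point link, i.e. two per bidirectional link.
# The link list is a hypothetical example topology.
links = [("S", "E"), ("S", "A"), ("A", "B"), ("B", "E"), ("A", "E")]

def notvia_address_count(links):
    # Two not-via addresses per protected bidirectional link.
    return 2 * len(links)

print(notvia_address_count(links))  # 10 for this 5-link example
```

Each of those addresses also inflates the LSP/LSA that carries it, which is why the message-size question matters as much as the raw count.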
2. Explicit tunnels are needed, which means that targeted LDP
sessions are necessary to have this support LDP traffic.
Yes. In the case of node protection we could also use Naiming's
scheme of next-next hop LDP advertisement.
True - but I'd want to think about the implications in terms of
additional communication & periods of instability/inaccuracy of
knowledge. It also doesn't handle the multi-homed prefix case for the
case when the path isn't via the next-next-hop.
OK. I think that we need to work on some state transition description
to make sure that all the bases are covered, and that we have
a common view of the states.
The complexity of MHP is really the complexity of MHP per se, rather
than the complexity of NV.
We have four options:
1) Restrict the reach of the repair to max two hops and
maybe use Naiming's LDP extension.
2) Tunnel the packet (using NV, PQ or whatever) and learn the label
at the far end.
3) Tunnel the packet and then strip all labels and do an IP lookup.
4) Figure out some other method of delivering the packet n hops away
in the base topology (such as n-hop u-turn).
Each of these approaches seems to have its issues, and it's
a question of picking the least unpalatable.
This is a particular concern for multi-homed prefixes; I'll describe
my concerns on this later.
Yes. This is a concern for LDP. I don't like the idea of targeted LDP
sessions. Two possibilities come to mind:
a) each node with an attached MHP distributes an additional label for
that prefix which has the semantics that when you pop that label you
MUST forward the underlying IP packet "directly".
b) an alternative which doesn't require additional labels, but DOES
require a new "well known" label with the above semantics.
Neither are very attractive, but perhaps more attractive than the
directed LDP sessions.
Both of these presume the ability to route based on the nested addresses
of the packet. In general, I don't think that this is a valid
assumption. Consider, for instance, the case of a BGP-free core.
Traffic is directed towards an ASBR in a different area (that is
multi-homed to the one being considered). In that case, the ABR may not
have the BGP routes to be able to correctly forward the packet based on
its IP address. There are also a number of scenarios where what is
underneath the top LDP label is another MPLS label & not routable at all.
In which case we either have to:
a) Run the directed LDP session
b) Give up
c) Think of something else.
One "something else" might be domain-wide labels, but I remember the
last time that was proposed in the MPLS WG :)
Are there any other "something else"s that are better?
3. Substantial IGP changes are required to handle the additional
Notvia addresses.
Substantial is perhaps a bit strong. We need to advertise the not-via
address and its association. For IS-IS it's pretty straightforward.
OSPF, by its very nature, may be a little more tricky.
More substantial than a few bits :-) The main issue here is just the
interop and migration concerns.
I don't understand. The IGP will flood the TLVs we have in mind.
Non-NV routers will be excluded from base. Could you expand?
5. The management of the Notvia addresses & of the tunnels can
create longer time periods where protection isn't available for a
part of the network (the new link or node, etc.).
I don't think the tunnels add to the time at all. They are after all
just FIB entries. Distributing the notvia addresses for a new
node/link will occur at the same time as distributing the information
about the link/node in the first place. I don't think it significantly
increases the delay.
There is of course the time it takes to recompute notvia routes, but
we think this will be well under a second.
These aspects certainly need thinking about, but they don't seem to
pose insurmountable issues.
I agree that they're not insurmountable - but just require consideration.
Third, there are a number of issues that I feel need considerable
discussion to try and resolve. I will try to go through each in turn
and explain what I think the various aspects of each are. Each of
these issues has the possibility to resolve in such a way that the
Notvia Addresses approach becomes overly complex.
Yes. That is a temptation we need to resist!
It's frequently a coverage versus complexity trade-off, where each
decision is along the slippery slope - alas!
b. It is desirable to have some dampening on the withdrawal of
Notvia addresses to minimize thrashing.
The allocation of notvia addresses to links certainly shouldn't be
changed as a result of not "needing" the notvia address when the
object with which it is associated goes away. It should also get back
the same notvia address when it comes back. But I don't think there
are any particular issues associated with them disappearing and
reappearing in the LSPs.
Do you have any specific issues in mind?
Only keeping the notvia addresses around until after the network has
converged... If the notvia address is withdrawn with the link that's
failed, then traffic may still be using that alternate.
Since we are going to use controlled rather than uncontrolled
convergence we can include managing the NV entries in the FIB.
You are right to point out that we have not described how to do this.
c. If configured in blocks, it would be extremely desirable to
have the same Notvia address mean the same thing through multiple
reboots, etc. It'd be good to have some means of consistent
association. This is for easy manageability.
For the case where the notvia address is for a neighbor, it's not always
that straightforward - unless one ends up advertising multiple notvia
addresses for the same neighbor, depending on the number of parallel
links. This is mostly engineering, I think.
d. When a new link or neighbor comes up, there will be a longer
period of time when an alternate isn't available because the Notvia
address hasn't been advertised yet. These periods without protection
need to be clearly understood and minimized.
Yes. I'm not convinced there is a particular problem here, but it does
need thinking through carefully.
e. There may be scalability concerns based on the number of
Notvia addresses and LDP FECs required. For instance, as described
in the draft, it is basically the number of uni-directional links in
the topology. This is ignoring the extras for broadcast links. To
fully & certainly provide SRLG protection, if at all feasible, would
require that each router advertise a Notvia address for every
uni-directional link into every neighbor of that router. This would
result in K*L additional addresses, where K is the average number of
neighbors & L is the number of uni-directional links in the topology.
Yes. This is a major concern, and we need to devise ways of solving
SRLGs etc. which minimize the potential proliferation of addresses.
We need to get the right tradeoff here between optimal solutions and
complexity.
Agreed. We need to understand the impact of additional addresses to
know the complexity cost of that versus the reduced coverage of
selecting a less complete approach to broadcast link and SRLG protection.
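The K*L estimate above is easy to check on a toy topology. The following is only an illustration of the arithmetic, with a made-up adjacency map (not any topology from the draft):

```python
# Sketch: the K*L address-count estimate for full SRLG protection,
# where K is the average neighbor count and L is the number of
# uni-directional links. The topology here is hypothetical.
adj = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B"],
    "D": ["B"],
}

def srlg_notvia_estimate(adj):
    L = sum(len(nbrs) for nbrs in adj.values())  # uni-directional links
    K = L / len(adj)                             # average degree
    return K * L

print(srlg_notvia_estimate(adj))  # 8 links, average degree 2.0 -> 16.0
```

Even on this tiny 4-router example the count doubles the base requirement, which is the proliferation concern in a nutshell.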
2. Insufficiently diverse topology: It is possible that a
network topology cannot provide an alternate that suffices for link,
node and SRLG protection. It isn't clear to me how to compute a
"best-available" alternate using this approach. For instance, if one
can get link protection, but not node protection, how would that be
determined, computed and assigned? This becomes much more of a
concern for SRLG protection & for topologies where failures have
already occurred and the network has converged for those & needs
protection in the event of an additional failure.
Clearly it is always possible to create a topology which contains
single points of failure and is inherently irreparable. This is part
of the tradeoff we need to address when thinking about SRLGs, since
taking a simple but pessimistic approach to SRLG can result in this
sort of failure. This seems to be a property of the problem rather
than any particular solution.
Let me try to explain this a bit better. Say there's a topology that,
for a particular next-hop & next-next-hop, can only provide an alternate
that gives link and node protection but not SRLG protection. Now, how
does the notvia addresses method compute an alternate? If the method is
pruning the topology of the relevant link, node & SRLGs, no alternate
will be found. However, it was possible to compute & use an alternate
that gives the link & node protection.
I need to think about this.
A similar case can easily
occur with link & node protection. Say S has two parallel links to E;
if the first fails, S could use the other to get link protection - but
there is no node-protecting alternate. How does S determine this? What
is the fall-back strategy in the case that no "full-protection"
alternate is available?
In this case
S fails E, and computes the NV paths to its neighbors.
If any or all of these are unreachable it uses a link
repair to E_!S to reach them as described in Section 4.2
of the draft. If E_!S does not exist as in the case above,
S then looks to see if the parallel link exists.
Of course in the absence of SRLGs, this topology contains
a single point of failure for node protection and will always be
expected to have limited repair coverage.
You are correct that this all needs describing in detail.
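As a rough sketch of the computation being described (prune the protected node, run SPF, and check reachability), the following uses a plain Dijkstra over a hypothetical topology. Names and costs are illustrative only; this is not the draft's algorithm, just the pruned-SPF idea:

```python
import heapq

def dijkstra(adj, src, pruned=frozenset()):
    """Shortest-path distances from src, ignoring pruned nodes."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, {}).items():
            if v in pruned:
                continue
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical topology: S's repair for next hop E must avoid E.
adj = {
    "S": {"E": 1, "A": 1},
    "E": {"S": 1, "F": 1},
    "A": {"S": 1, "B": 1},
    "B": {"A": 1, "F": 1},
    "F": {"E": 1, "B": 1},
}

# Path for F_!E: run SPF with E pruned; if F is reachable, a
# node-protecting repair exists (here S-A-B-F).
dist = dijkstra(adj, "S", pruned={"E"})
print("F" in dist)  # True
```

If F were unreachable in the pruned topology, S would fall back to the link repair (E_!S) or, as in the parallel-link case above, to the other link.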
3. Failure Diagnosis versus Pessimism: As written, the draft
discusses the idea of doing failure diagnosis using BFD. As Stewart,
Mike & I have discussed, this isn't possible for SRLG failures,
although it is possible for broadcast links.
Yes, and this relates to (2) above.
a. I am concerned about adding the failure diagnosis. This is
yet another level of complexity for implementation. It also has
ramifications for the forwarding plane, because of the need to store
multiple alternates to use & have multiple states to check to decide
what to use.
Yes. It would be nice not to have to do it, but that is back to the
complexity vs. coverage trade-off.
Complexity vs. coverage? I'm very fond, unsurprisingly, of options that
don't require hardware changes... so I judge that the complexity is
rather high to support this one - as well as more error-prone (see
comment on unreliable diagnosis).
b. An example of a concern with the BFD diagnosis is that all
interfaces on a node that has failed are not certain to fail exactly
simultaneously or even within a sub-50ms bounded window. It is
entirely possible that BFD sessions are terminated on different
line-cards that detect the router failure at slightly different
times and stop forwarding traffic, therefore, at slightly different
times.
Yes. There is the possibility of misdiagnosis in this case if the
second failure occurs too long after the first. I suppose this then
looks like two separate failures. Clearly an unreliable diagnosis is
probably worse than no diagnosis at all. We need to get some handle on
how realistic or not this scenario is.
Well, I think it is exceedingly realistic :-)
For a non-power related failure, routers with separate forwarding &
control planes may take varying amounts of time for the line-cards to
all realize that the route controller is down.
Well maybe for power-failures as well :)
The pathology of this sort of failure is highly implementation dependent.
Say BFD was running on the LC, but the switch fabric was down.
You could end up with the neighbors thinking that the router was still
up, but it was non-functional. Eventually routing would notice the
absence of routing hellos, unless of course, these had also been
delegated :) Perhaps we need to run BFD to the neighbor's neighbors
on the direct path?
The problem is that we rapidly get on a complexity spiral that
we need to avoid.
We clearly need to write down a set of project scoping rules for the
types of failure that we will and will not deal with.
c. The other approach is to pessimistically eliminate all
routers connected to the broadcast link as well as the broadcast
link; this may not provide an alternate.
Yes. While simple, it runs into the problem of being a single (albeit
large) point of failure. It's the same trade-off as above.
Don't they all reduce to that?
It also needs to be thought through what issues might exist if the
topologies used for the SPF vary slightly for each router that is on
the broadcast link, since each will, as described, not prune itself
out when doing the computation; of course, there could be an approach
where the same topology can be used everywhere.
I'm not really sure what you mean here.
Let me try and explain it a bit. Perhaps I'm missing something. In
the case where a notvia topology results in pruning the router doing the
computation, what forms the root of the SPT? Say routers A, B and C
are all connected to a broadcast link X and want to compute a notvia X
address as described in (c) by pruning the pseudo-node related to X as
well as A, B, and C. Now, router A prunes the pseudo-node, A, B and C
from the topology; what does A use as the root? If A only prunes the
pseudo-node, B and C to compute notvia X, B only prunes the pseudo-node,
A, and C, and C only prunes the pseudo-node, A and B, and all other
routers prune the pseudo-node, A, B, and C, can there be any issues with
a consistently computed & non-looping path for notvia X?
I think it may not be an issue - b/c once the traffic leaves A, B or C,
it will never return - but it at least needs some thought, since this is
a bit different from what's traditionally been done.
Agreed, we need to write down the algorithm and subject it to review.
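Just to pin down the pruning rule being discussed: each router on the LAN prunes the pseudo-node and the *other* attached routers, but not itself, while everyone else prunes all of them. A trivial sketch (router and pseudo-node names hypothetical):

```python
# Sketch of the per-router pruning described above for a broadcast
# link: router R attached to LAN X prunes the pseudo-node and every
# other attached router, but never itself.
lan_members = {"A", "B", "C"}
pseudo = "X"

def pruned_set(computing_router, lan_members, pseudo):
    # Everything attached to the LAN except the computing router.
    return ({pseudo} | lan_members) - {computing_router}

for r in sorted(lan_members):
    print(r, "prunes", sorted(pruned_set(r, lan_members, pseudo)))
# A prunes ['B', 'C', 'X'], B prunes ['A', 'C', 'X'], and so on;
# routers not on the LAN prune all of ['A', 'B', 'C', 'X'].
```

The consistency question above is then whether these slightly different pruned topologies can yield a looping path for the same notvia X address.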
It isn't clear to me what Notvia addresses would be needed to
express "don't go through this pseudo-node or any nodes attached to
it"; I don't think that it is simply the Notvia address for avoiding
a particular node.
No, it would need a specific notvia address bound to the LAN interface.
4. Multi-homed Prefixes: I am quite concerned about the
mechanisms suggested in the draft.
a. First, I really do not like the idea of having separate
forwarding for "local" prefixes that come out of a tunnel. What is a
local prefix? For instance, does this mean that an ABR has to
forward traffic different depending on which area traffic from the
tunnel has come from? I am concerned about how this would scale;
maybe only 2 FIBs are needed (one for backbone & one for other), but
it may be worse to handle AS external routes. I know that Stewart,
Mike, Joel, Albert and I had discussed/agreed to put this idea out of
scope at least for the moment.
Clearly the problem needs solving, especially since prefixes which are
multihomed are frequently the most important prefixes (which is WHY
they are multihomed in the first place).
b. I am quite concerned about having tunnels to the advertisers
of the prefixes.
i. There needs to be a mechanism to determine whether the
advertiser of a prefix will forward the packet in a loop-free fashion
to avoid the failure point. The separate forwarding for "local"
prefixes avoided the need for this determination, but at more
cost.
There seem to be two aspects to this.
a) we need the ability to get the packet to the "second-best"
attachment point for the prefix without it being "sucked back" to the
failure. This in general requires a tunnel, except for the cases where
a neighbor of the node detecting the failure has an LFA to the second
best attachment point. Clearly this could be used in preference to a
tunnel where available, but at the expense of additional complexity.
However this is really just an extension of the general principle that
we should use "basic" (i.e. LFA) repair to cream off traffic which
doesn't NEED to be tunnelled.
Yes - though the tunnels bring out the LDP issues with targeted sessions
- of course.
b) we need (in a very limited number of cases), the ability to force
the packet to the locally attached prefix. This only occurs where the
local cost is high compared to the cost back to the failed attachment
point. But when we DO need it, the use of a tunnel is a convenient
means of signalling this. I'm not sure how else to do it, other than
using a label.
Of course ONE "solution" would be to REQUIRE the costs to be set
such that this case never arises.
I like that one :-) but you knew I would! We may need to define what
that means better or have a way of determining that it is the case.
ii. To support LDP, every tunnel requires a targeted LDP
session. If multi-homed prefixes are common, then this becomes a
full mesh for LDP. That isn't acceptable.
Of course, multi-homed prefixes may be much more infrequent for LDP
than for IP; for example, there is no reason to advertise a separate
FEC for the subnet of a link. However, multi-homed prefixes are a
concern for LDP for at least the inter-area, AS External, and BGP
cases.
iii. If traffic is encapsulated to a node's regular address,
because that traffic is destined to a prefix advertised by the node,
how does the receiving node know to remove the encapsulation and
forward the packet inside, all in the fast path? Is this just a
question of different handling based on the header type inside the
outer encapsulation (for GRE)?
OK. The traffic wouldn't be directed up to the control plane because it
was GRE encapsulated??
GRE always pops the header at the tunnel endpoint. That is how it
works.
And had a special header type for this purpose?
Certainly I can see something like this working with an LDP LSP, b/c the
label would just get it to that router & then be popped & the packet
forwarded based on what's underneath.
Perhaps an MPLS label of some sort the way we thought of doing directed
forwarding and the way that Mark Townsley proposed doing IP VPN?
iv. Perhaps these issues could be handled by determining a
next-next-hop that avoids the failure to reach an appropriate
advertiser. Of course, this is a different set/type of computation.
Could you explain that suggestion please?
Well, if there is a neighbor's neighbor whose path to the multi-homed
prefix doesn't go through the failure & this can be determined, then the
traffic could be tunneled to that neighbor's neighbor & then normally
forwarded from there.
Yes, but you only get two hop reachability. Perhaps you do this,
and then do directed LDP for the remaining (perhaps 2%) of cases.
The problem I have with this is the added complexity.
Basically, if one knows the SPT from each neighbor's neighbor & can
reach all of those neighbor's neighbors without going through the
failure, then it might provide an alternate. The issue there is first
that the path to a neighbor's neighbor might go via the failed element &
not have an appropriate notvia & second the effort of computing and
considering the different SPTs.
Does that make more sense?
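A minimal sketch of that check (whether a neighbor's neighbor's shortest path to the prefix avoids the failed element) might look like the following. The topology, names, and costs are all hypothetical; this only illustrates the shape of the computation, not a proposed mechanism:

```python
import heapq

def spf_paths(adj, src):
    """Dijkstra from src; returns a function giving the set of
    nodes on the shortest path to any destination."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj.get(u, {}).items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))

    def path(dst):
        nodes = []
        while dst != src:
            nodes.append(dst)
            dst = prev[dst]
        return set(nodes) | {src}

    return path

# Hypothetical topology: E has failed, P is the multi-homed prefix's
# other attachment point, B is S's neighbor's neighbor.
adj = {
    "S": {"A": 1, "E": 1},
    "A": {"S": 1, "B": 1},
    "B": {"A": 1, "P": 1, "E": 3},
    "E": {"S": 1, "B": 3, "P": 5},
    "P": {"B": 1, "E": 5},
}
path = spf_paths(adj, "B")
print("E" not in path("P"))  # True: B's path to P avoids E
```

The two costs Alia mentions show up directly: running this per neighbor's neighbor is extra SPF work, and the repair still needs a way to reach B that itself avoids E.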
5. SRLGs and Broadcast Links: There seem to be a number of
possible ways to handle SRLGs and broadcast links, each of which
provides a different trade-off in terms of coverage, computation,
and extra Notvia addresses.
There are basically 4 approaches at this point.
a. First, In order to compute a notvia alternate that avoids a
link, the primary neighbor, and all SRLGs that the link is part of,
it is necessary to have a separate topology and associated SPF
computation for each link that is a member of an SRLG or a broadcast
link. This requires also a substantially larger number of Notvia
addresses and the corresponding mechanisms to determine how and when
to allocate and de-allocate them.
.. and could potentially result in a combinatorial explosion if we
weren't very careful.
I do think of this as having the highest potential to be the "too
complicated & run-away" scenario.
b. Second, one could use a topology that removed the primary
neighbor and see whether SRLG protection can be obtained either along
S's path or along any path of a neighbor of S that is also loop-free.
Could you explain that a bit more please?
This is the concept of looking for a loop-free neighbor to the notvia
address whose path there happens to give SRLG protection. We'd
discussed this on the last day at IETF.
c. Third, when a Notvia address indicates to avoid a node, one
could remove not merely the node & the uni-directional links to and
from that node, but also any other links that are in a common SRLG
with any of the links to or from the removed node. This is
pessimistic but allows some SRLG protection without increased
computation or Notvia addresses.
Yes. This is nice and simple, but as you have pointed out above, could
easily result in an inability to find a viable repair.
At the risk of additional complication, one could have it configurable
as to the specific handling of SRLGs. For instance, most link/node/SRLG
would be handled with a single notvia per node - and then there could
be configured specific links that required their own notvia. Of
course, that adds substantial extra complexity. Is it necessary?
d. Fourth, one could simply track the SRLGs encountered along
the Notvia path; this just reports whether the alternate provides
SRLG protection without any effort to obtain it.
Yes. Interesting. I wonder how useful this would be.
Well, it could feedback to a network design at least - to give an
indication of coverage. Also, not all SRLGs have the same likelihood to
fail. So, if avoiding all doesn't work, perhaps one can avoid the most
risky ones - and then report the protection against the failures of the
remaining ones.
This is part of my concern/question about what notvia alternate is
computed if the best protection isn't possible.
6. Implementability: Clearly, the draft describes the basic
idea for Notvia addresses, but there are a fair number of
implementation/protocol decisions that need to be made before this
can become anything more than an interesting idea.
Sure. There are quite a few design decisions and tradeoffs as
indicated above that need tying down.
7. There is a definite need to describe the convergence case
better. This is how the transition from using the alternate to the
network being converged happens, such that the alternate remains
available until it is no longer needed.
a. For instance, if the node E fails, then the Notvia address
E_!S will no longer be advertised. If S was getting link protection
(because that was all that was possible, for instance) by tunneling
traffic to E_!S, it is important that this traffic be properly
discarded when E's addresses go away. This implies that there needs
to be a default blackhole for Notvia addresses.
I don't quite understand your concern here. If E goes away and S is
sending to E_!S, then the neighbors of E will drop the packets because
we don't repair a notvia address.
I'm thinking of this as the more specific prefix goes away. Without a
specific blackhole for the group of prefixes, why wouldn't the packets
take that instead? I.e., if the notvia address is 10.1.1.1 and there's
a default route for 10.1/16 (or for 0.0.0.0/0), then the packet would
pick up the latter when the notvia address is removed.
Yes, you are quite right. We need a NV black hole.
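The fallback Alia describes is just longest-prefix match in action. A sketch with illustrative addresses (the prefixes and next-hop names are made up):

```python
import ipaddress

# Sketch: once the /32 not-via route is withdrawn, a packet to that
# address matches the covering aggregate instead of being dropped,
# which is why an explicit NV black hole is needed.
def lpm(fib, addr):
    """Longest-prefix-match lookup over the FIB's prefixes."""
    a = ipaddress.ip_address(addr)
    matches = [p for p in fib if a in ipaddress.ip_network(p)]
    return max(matches,
               key=lambda p: ipaddress.ip_network(p).prefixlen,
               default=None)

fib = {"10.1.1.1/32": "notvia-tunnel", "10.1.0.0/16": "default-path"}
print(fib[lpm(fib, "10.1.1.1")])  # notvia-tunnel

del fib["10.1.1.1/32"]            # not-via address withdrawn
print(fib[lpm(fib, "10.1.1.1")])  # default-path, not a drop
```

With a covering black-hole route for the not-via block installed everywhere, the second lookup would hit the black hole instead of the aggregate.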
Or are you concerned that after convergence, there will be nodes which
don't even have a forwarding entry for E_!S. By this time I don't think
that S (or anyone else) should still be using that address, but even
if it were, the absence of a forwarding entry would (SHOULD) cause the
packet to be dropped. Is this all you are saying?
I was also worrying about the above in reference to the notvia
addresses. While it is possible to say that changes to notvia
addresses shouldn't be installed until after the network has otherwise
converged, that sort of detail needs to be clarified.
b. Another example is when node E fails, the next-next-hop B
must continue to advertise the Notvia address B_!E until the network
converges so that S can continue to tunnel traffic to B_!E as the
alternate.
Yes. Our view was that no changes would be made to notvia
advertisement or more specifically notvia FIB entries until after
convergence is over. Of course there is an issue as to how you tell
when that has happened, but the timers associated with loop free
convergence probably give a good indication.
Conceptually, I agree. Having a primary topology to forward the traffic
on while the backup one reconverges helps.
c. It is possible to get a micro-forwarding loop affecting a
Notvia address as a result of a less severe failure than
anticipated. For instance, consider the following topology.
[ASCII topology diagram garbled in the archive; it showed link costs
of 1, 1, 5 and 10.]
Link S->E and Link H->F are in SRLG R
When node E fails, if I converges before H, there will be a loop
affecting the Notvia address being used to reach F without going
through any of Link S->E, E or SRLG R.
We discussed this privately, and I still don't see how loops could
arise even if the notvia FIB were recomputed before normal
convergence is complete. But I think it is better to delay the notvia
FIB changes anyway.
Just for clarity (hopefully), before the failure, H computes the path
for F_!E, the address of F's that is notvia E, to go via I and then to
F. After the failure of E, if H installs the changed notvia address
F_!E the path is directly to F, b/c node E no longer has SRLG R
associated with any of E's up links.
I think that the core issue here is the case of a failure during the
reconvergence of the repair topology. Is that in scope?
d. How do exceptions work? Particularly in regards to an
IP-in-IP encapsulation such as GRE, it doesn't seem like MTU exceeded
cases can be handled cleanly either by use of DF or by doing IP
fragmentation and then the reassembly at the end of the tunnel. This
seems like a problem for all ICMP packets; how could a source
understand the header inside for a TTL expired, for instance?
I'll leave this for Stewart (tunnel) Bryant!
For LDP, there are mechanisms (layer violations though they are) to
handle exceptions generating ICMP packets.
The interesting question is who needs to know of the MTU problem?
If you tell the host, then by the time it adjusts its MTU the
network will likely have reconverged anyway.
However if you tell the repairing router (which is what will happen
with a tunnelled packet), it can alarm and let the network
administrators know that there is a problem with the IPFRR config.
For this to work, the MTU at the edges needs to be lower than the
MTU in the core.
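That edge-vs-core condition is simple arithmetic. As an illustration only (the 24-byte figure assumes IP-in-GRE: a 20-byte outer IPv4 header plus the 4-byte base GRE header; other encapsulations differ):

```python
# Sketch of the edge-vs-core MTU condition above. Overhead figures
# are assumptions for illustration: IP-in-GRE adds a 20-byte outer
# IPv4 header plus a 4-byte base GRE header.
GRE_OVERHEAD = 20 + 4

def edge_mtu_ok(edge_mtu, core_mtu, overhead=GRE_OVERHEAD):
    # A full-size edge packet must still fit in the core once it has
    # been encapsulated for repair.
    return edge_mtu + overhead <= core_mtu

print(edge_mtu_ok(1500, 1500))  # False: equal MTUs break repairs
print(edge_mtu_ok(1476, 1500))  # True
```

So "edge MTU lower than core MTU" concretely means lower by at least the encapsulation overhead.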
e. For IP-in-IP tunnels, another concern is flow diversity. The
IP source and destination addresses are used to determine a flow;
this flow identification may then be used for a variety of purposes,
including ECMP. By putting all the traffic to a variety of
destinations inside the same header, the ability to take advantage of
flow diversity appears to have disappeared. This could possibly be
solved by putting the original source address into the encapsulating
header? Are there other approaches?
Again, for an LDP tunnel, many routers can look under the label and
consider the IP packet inside for flow identification.
I was going to say:
Given that basic cuts in before NV, I think that the only case where
this is a problem is when you have a router with max ECMP = say 2 which
selects two from more than two, and the next hop on one of them fails.
This is surely a corner case?
Then Mike pointed out that we had said that we would use ECMP in the
draft, and yes there is a problem. Again we need to think about the
implications, because it's not clear what we should do.
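The flow-diversity problem is easy to see with a toy ECMP hash. Everything here is illustrative (made-up addresses, a trivial hash), but it shows why hashing only the outer header collapses all repaired traffic onto one link:

```python
# Sketch: an ECMP hash over the outer (src, dst) pair maps every
# encapsulated flow to the same link, because all repaired packets
# share one outer header (repairing router -> not-via address).
def ecmp_link(src, dst, n_links=2):
    # Toy ECMP: hash the (source, destination) pair onto a link.
    return hash((src, dst)) % n_links

# Eight distinct native flows may spread across both links...
flows = [("10.0.%d.1" % i, "10.9.%d.1" % i) for i in range(8)]
native_links = {ecmp_link(s, d) for s, d in flows}

# ...but encapsulated, every flow carries the same outer header.
tunneled_links = {ecmp_link("192.0.2.1", "192.0.2.99") for _ in flows}

print(len(tunneled_links))  # 1
```

Hashing on the inner header (as many routers already do under an LDP label) or copying the original source address into the outer header would restore the diversity.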
Rtgwg mailing list