At 10:51 AM 4/27/2005, Stewart Bryant wrote:
First of all I largely agree with Mike's email,
but then that's not going to surprise anyone :)
Simply shocking :-)
Alia Atlas wrote:
At 10:35 AM 4/26/2005, mike shand wrote:
At 15:07 25/03/2005 -0500, Alia Atlas wrote:
Second is the list of downsides with the approach. The main concern is
that the mechanism becomes too complex such that the trade-off between
its complexity and the full coverage is not desirable.
1. This requires a large number of additional IP addresses in the
IGP. The same number of additional FECs is required to support LDP.
Yes, it does. In the simplest case of link and node protection, and
ignoring LANs it requires 2 addresses per protected link. It is expected
that these would come out of a "private" address space, and hence
wouldn't consume real addresses. Indeed for security reasons it is
preferable that they are private addresses.
I don't think this number is "too many". The question is how does this
number increase when we add LANs and SRLGs.
It would be useful to hear some additional opinions on the impact of
adding a large number of addresses. The other question is what is the
boundary when it becomes a serious concern.
Also to understand whether the issue is the number of addresses per se
or the inflation of the routing protocol message size.
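To put a rough number on the simplest case above (link and node protection, no LANs or SRLGs), here is a minimal Python sketch that hands out two not-via addresses per protected point-to-point link, one per direction, from a private block. The four-link topology, the 10.255.0.0/16 block and the function name are assumptions made up for illustration; nothing here comes from the draft.

```python
import ipaddress

def allocate_notvia(links, block="10.255.0.0/16"):
    """Assign one not-via address per link direction (hypothetical scheme)."""
    hosts = iter(ipaddress.ip_network(block).hosts())
    table = {}
    for a, b in links:
        table[f"{b}_!{a}"] = next(hosts)  # reach b not via a
        table[f"{a}_!{b}"] = next(hosts)  # reach a not via b
    return table

# Hypothetical four-link topology.
links = [("S", "E"), ("E", "F"), ("S", "H"), ("H", "F")]
notvia = allocate_notvia(links)
for name, addr in notvia.items():
    print(name, addr)
print(len(notvia), "not-via addresses for", len(links), "links")
```

The count grows as 2 x links here; how it grows once LANs and SRLGs are added is exactly the open question and is not modelled.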
Sure. We need to understand clearly the potential issues causing the
2. Explicit tunnels are needed, which means that targeted LDP
sessions are necessary to have this support LDP traffic.
Yes. In the case of node protection we could also use Naiming's scheme
of next-next hop LDP advertisement.
True - but I'd want to think about the implications in terms of
additional communication & periods of instability/inaccuracy of
knowledge. It also doesn't handle the multi-homed prefix case for the
case when the path isn't via the next-next-hop.
OK. I think that we need to work on some state transition description
to make sure that all the bases are covered, and that we have
a common view of the states.
The complexity of MHP is really the complexity of MHP per se, rather
than the complexity of NV.
We have four options:
1) Restrict the reach of the repair to max two hops and
maybe use Naiming's LDP extension.
yes - maybe it's good enough? It'd be good to get the multi-homed prefixes
considered in the network topology & simulations so we can see.
2) Tunnel the packet (using NV, PQ or whatever) and learn the label
at the far end.
In the general worst case, it doesn't scale. How realistic that worst case
is isn't as clear.
3) Tunnel the packet and then strip all labels and do an IP lookup.
Not really an option - assuming we want to have things like pseudo-wires
running under the LDP :-)
4) Figure out some other method of delivering the packet n hops away
in the base topology (such as n-hop u-turn).
Possible, but it does get uglier to compute.
Each of these approaches seems to have its issues, and it's
a question of picking the least unpalatable.
This is a particular concern for multi-homed prefixes; I'll describe
my concerns on this later.
Yes. This is a concern for LDP. I don't like the idea of targeted LDP
sessions. Two possibilities come to mind
a) each node with an attached MHP distributes an additional label for
that prefix which has the semantics that when you pop that address you
MUST forward the underlying IP packet "directly".
b) an alternative which doesn't require additional labels, but DOES
require a new "well known" label with the above semantics.
Neither is very attractive, but perhaps more attractive than the
directed LDP sessions.
Both of these presume the ability to route based on the nested addresses
of the packet. In general, I don't think that this is a valid
assumption. Consider, for instance, the case of a BGP-free core.
Traffic is directed towards an ASBR in a different area (that is
multi-homed to the one being considered). In that case, the ABR may not
have the BGP routes to be able to correctly forward the packet based on
its IP address. There are also a number of scenarios where what is
underneath the top LDP label is another MPLS label & not routable at all.
In which case we either have to:
a) Run the directed LDP session
See above for my thoughts on the options.
b) Give up
nope - too stubborn :-)
c) Think of something else.
A for-else might be domain-wide labels, but I remember the last
time that was proposed in MPLS WG:)
sure, but we're seeing upstream label distribution being proposed again too :-)
It is true that anything else seems likely to require some change/addition
to the MPLS semantics - and that would be more challenging to consider.
Are there any other for-else's that are better?
3. Substantial IGP changes are required to handle the additional
Substantial is perhaps a bit strong. We need to advertise the not-via
address and its association. For IS-IS it's pretty straightforward. OSPF,
by its very nature, may be a little more tricky.
More substantial than a few bits :-) The main issue here is just the
interop and migration concerns.
I don't understand. The IGP will flood the TLVs we have in mind.
Non-NV routers will be excluded from base. Could you expand?
Oh, this is just the more general worry of getting interoperability without
very clear specification. I've nothing specific here yet - just a
background concern that it be thought about.
I do agree that excluding those routers that don't advertise the capability
takes care of a large section of the issues.
b. It is desirable to have some dampening on the withdrawal of
Notvia addresses to minimize thrashing.
The allocation of notvia addresses to links certainly shouldn't be
changed as a result of not "needing" the notvia address when the object
with which it is associated goes away. It should also get back the same
notvia address when it comes back. But I don't think there are any
particular issues associated with them disappearing and reappearing in
Do you have any specific issues in mind?
Only keeping the notvia addresses around until after the network has
converged... If the notvia address is withdrawn with the link that's
failed, then traffic may still be using that alternate.
Since we are going to use controlled rather than uncontrolled
convergence we can include managing the NV entries in the FIB.
You are right to point out that we have not described how to do this.
What are you defining as controlled convergence? Is this the wait until
the network's done otherwise & then do the notvia addresses? That'll work
- but it's a bit different from the general controlled convergence for the
2. Insufficiently diverse topology: It is possible that a network
topology cannot provide an alternate that suffices for link, node and
SRLG protection. It isn't clear to me how to compute a
"best-available" alternate using this approach. For instance, if one
can get link protection, but not node protection, how would that be
determined, computed and assigned? This becomes much more of a concern
for SRLG protection & for topologies where failures have already
occurred and the network has converged for those & needs protection in
the event of an additional failure.
Clearly it is always possible to create a topology which contains single
points of failure and is inherently irreparable. This is part of the
tradeoff we need to address when thinking about SRLGs, since taking a
simple but pessimistic approach to SRLG can result in this sort of
failure. This seems to be a property of the problem rather than any
Let me try to explain this a bit better. Say there's a topology that,
for a particular next-hop & next-next-hop, can only provide an alternate
that gives link and node protection but not SRLG protection. Now, how
does the notvia addresses method compute an alternate? If the method is
pruning the topology of the relevant link, node & SRLGs, no alternate
will be found. However, it was possible to compute & use an alternate
that gives the link & node protection.
I need to think about this.
The similar case can easily occur with link & node protection. Say S has
two parallel links to E; if the first fails, S could use the other to get
link protection - but there is no node-protecting alternate. How does S
determine this? What is the fall-back strategy in the case that no
"full-protection" alternate is available?
In this case
S fails E, and computes the NV paths to its neighbors.
If any or all of these are unreachable it uses a link
repair to E_!S to reach them as described in Section 4.2
of the draft. If E_!S does not exist as in the case above,
S then looks to see if the parallel link exists.
Of course in the absence of SRLG, this topology contains
a single point of failure for node protection and will always be expected to have
limited repair coverage.
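To make the fall-back question concrete, here is a minimal Python sketch of a "strictest first" search: try for a node-protecting alternate with E pruned, and only if that fails try for a link-protecting one with just the failed link pruned. This is a simplification for discussion, not the Section 4.2 procedure; the topologies, link names (L1, L2, ...) and function names are all hypothetical, and SRLGs and LANs are ignored.

```python
import heapq

def shortest_path(graph, src, dst, bad_nodes=frozenset(), bad_links=frozenset()):
    """Dijkstra over {node: [(neighbor, cost, link_id), ...]}, skipping pruned items."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            path = [u]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return path[::-1]
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost, link in graph.get(u, []):
            if v in bad_nodes or link in bad_links:
                continue
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return None

def best_available(graph, s, e, next_next_hop, primary_link):
    # Strictest first: avoid node E entirely and reach the next-next-hop.
    path = shortest_path(graph, s, next_next_hop, bad_nodes={e})
    if path:
        return "node", path
    # Fall back: avoid only the failed link and reach E itself.
    path = shortest_path(graph, s, e, bad_links={primary_link})
    if path:
        return "link", path
    return "none", None

# S has two parallel links to E; D is reachable only through E.
parallel = {
    "S": [("E", 1, "L1"), ("E", 1, "L2")],
    "E": [("S", 1, "L1"), ("S", 1, "L2"), ("D", 1, "L3")],
    "D": [("E", 1, "L3")],
}
# Same topology plus a detour S-H-D that avoids E.
detour = {
    "S": parallel["S"] + [("H", 1, "L4")],
    "E": parallel["E"],
    "D": parallel["D"] + [("H", 1, "L5")],
    "H": [("S", 1, "L4"), ("D", 1, "L5")],
}
result_parallel = best_available(parallel, "S", "E", "D", "L1")
result_detour = best_available(detour, "S", "E", "D", "L1")
print(result_parallel)
print(result_detour)
```

In the parallel-link case only link protection is found; with the detour the node-protecting path via H wins. Extending the order of attempts to SRLGs is where it gets less obvious.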
Thanks. This helps clarify the behavior when the link is pt-to-pt.
For the broadcast link case, I can see greater difficulties - because it
looks like a local SRLG to a large extent. Any thoughts there?
You are correct that this all needs describing in detail.
That's what revisions are for :-)
b. An example of a concern with the BFD diagnosis is that all
interfaces on a node that has failed are not certain to fail exactly
simultaneously or even within a sub-50ms bounded window. It is
entirely possible that BFD sessions are terminated on different
line-cards, that detect the router failure at slightly different times
and stop forwarding traffic, therefore, at slightly different times.
Yes. There is the possibility of misdiagnosis in this case if the second
failure occurs too long after the first. I suppose this then looks like
two separate failures. Clearly an unreliable diagnosis is probably worse
than no diagnosis at all. We need to get some handle on how realistic or
not this scenario is.
Well, I think it is exceedingly realistic :-)
For a non-power related failure, routers with separate forwarding &
control planes may take varying amounts of time for the line-cards to all
realize that the route controller is down.
Well maybe for power-failures as well :)
The pathology of this sort of failure is highly implementation dependent.
Say BFD was running on the LC, but the switch fabric was down.
You could end up with the neighbors thinking that the router was still
up, but it was non-functional. Eventually routing would notice the
absence of routing hellos, unless of course, these had also been
delegated :) Perhaps we need to run BFD to the neighbor's neighbors
on the direct path?
This is a large part of what I see to be the problem with thinking of BFD
as being a mechanism for detecting router failure. Perhaps there are those
with more BFD experience who can point out how it could work?
To my mind, either the BFD session is running on the line-card, in which
case one can possibly imagine it being implemented in a scalable way such
that packets are sent out at least every 5-10ms, such that a failure could
be detected within 20ms & then repaired in the remaining 30ms, or the BFD
session isn't running on the line-card, in which case generating packets
every 5-10ms per interface reliably seems a bit of a stretch.
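The arithmetic behind those numbers can be sketched in a few lines of Python. It assumes the usual BFD model in which a session is declared down after a detect multiplier's worth of transmit intervals with nothing received; the 50ms budget is the figure used in this thread, while the multipliers and the session count are assumptions for illustration.

```python
def detection_time_ms(tx_interval_ms, detect_multiplier):
    """Approximate time to declare a session down once packets stop arriving."""
    return tx_interval_ms * detect_multiplier

def packets_per_second(tx_interval_ms, sessions):
    """Transmit load for a number of sessions at a given interval."""
    return sessions * 1000 // tx_interval_ms

BUDGET_MS = 50  # overall target discussed in the thread
for interval in (5, 10):
    for multiplier in (2, 3):  # hypothetical detect multipliers
        detect = detection_time_ms(interval, multiplier)
        print(f"interval={interval}ms x{multiplier}: detect<={detect}ms, "
              f"{BUDGET_MS - detect}ms left for repair")
print(packets_per_second(5, sessions=100), "pps for 100 sessions at 5ms")
```

A 10ms interval with a multiplier of 2 gives the 20ms detection / 30ms repair split mentioned above; the packet rate is why doing this off the line-card looks like a stretch.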
It seems to me to get the desired speed for failure detection that BFD
would need to be done on the line-card & preferably in hardware. This
makes BFD good for detecting link failures, but not so much for detecting
As for the line-card being up, but the switch fabric down, I'd hope that
the router internals would have a way of detecting that & bringing the
line-card down when appropriate. That's internal implementation anyhow.
BFD to neighbor's neighbor would at least validate the forwarding path all
the way to that neighbor's neighbor - but it gives a larger number of BFD
sessions & there is also the difficulty of interpreting the failure back to
the affected routes. For instance, what about the case where the
downstream neighbor may (or may not?) ECMP to two of its neighbors. If the
BFD session to one of those goes down, does S do repair?
The problem is that we rapidly get on a complexity spiral that
This is why I really do not like the idea of trying to do failure diagnosis
via BFD. I don't think it is realistic to believe that router failure can
be identified as such by BFD within 20-30ms.
We clearly need to write down a set of project scoping rules for the
types of failure that we will and will not deal with.
I don't think the issue here is the type of failure that is being handled -
that is determined by the avoidance of the alternate. The question here is
the ability to diagnose failure types on the fly.
It also needs to be thought through what issues might exist if the
topologies used for the SPF vary slightly for each router that is on
the broadcast link, since each will, as described, not prune itself
out when doing the computation; of course, there could be an approach
where the same topology can be used everywhere.
I'm not really sure what you mean here.
Let me try and explain it a bit. Perhaps I'm missing something. In the
case where a notvia topology results in pruning the router doing the
computation, what forms the root of the SPT? Say routers A, B and C are
all connected to a broadcast link X and want to compute a notvia X
address as described in (c) by pruning the pseudo-node related to X as
well as A, B, and C. Now, router A prunes the pseudo-node, A, B and C
from the topology; what does A use as the root? If A only prunes the
pseudo-node, B and C to compute notvia X, B only prunes the pseudo-node,
A, and C, and C only prunes the pseudo-node, A and B, and all other
routers prune the pseudo-node, A, B, and C, can there be any issues with
a consistently computed & non-looping path for notvia X?
I think it may not be an issue - b/c once the traffic leaves A, B or C,
it will never return - but it at least needs some thought, since this is
a bit different from what's traditionally been done.
Agreed, we need to write down the algorithm and subject it to review.
The more I think about it, the more comfortable I am - but as we've agreed,
it needs clear description & thought about the differences & their
Of course, multi-homed prefixes may be much more infrequent for LDP
than for IP; for example, there is no reason to advertise a separate
FEC for the subnet of a link. However, multi-homed prefixes are a
concern for LDP for at least the inter-area, AS External, and BGP routes.
iii. If traffic is encapsulated to a node's regular address, because
that traffic is destined to a prefix advertised by the node, how does
the receiving node know to remove the encapsulation and forward the
packet inside all in the fast path? Is this just a question of
different handling based on the header type inside the outer
encapsulation (for GRE)?
OK. The traffic wouldn't be directed up to the control plane because it
was GRE encapsulated??
GRE always pops the header at the tunnel endpoint. That is how it
OK. Thanks. I'm (obviously) not that familiar with GRE.
And had a special header type for this purpose?
Certainly I can see something like this working with an LDP LSP, b/c the
label would just get it to that router & then be popped & the packet
forwarded based on what's underneath.
Perhaps an MPLS label of some sort the way we thought of doing directed
forwarding and the way that Mark Townsley proposed doing IP VPN?
Could you explain more? I'm not sure what you're picturing the label being
iv. Perhaps these issues could be handled by determining a
next-next-hop that avoids the failure to reach an appropriate
advertiser. Of course, this is a different set/type of computation.
Could you explain that suggestion please?
Well, if there is a neighbor's neighbor whose path to the multi-homed
prefix doesn't go through the failure & this can be determined, then the
traffic could be tunneled to that neighbor's neighbor & then normally
forwarded from there.
Yes, but you only get two hop reachability. Perhaps you do this,
and then do directed LDP for the remaining (perhaps 2%) of cases.
The problem I have with this is the added complexity.
Yes, it would require additional computation & anytime you have two ways to
do something, it gets more complicated. On the other hand, a
(potentially) full mesh of LDP sessions isn't simple either!
If the frequency of multi-homed prefixes that can't be protected via this
is small - and the network design to gain protection isn't too complicated
to understand, then it'd help. For, say, the inter-area case, a typical
cross-hatch connection probably works fine.
7. There is a definite need to describe the convergence case
better. This is how the transition from using the alternate to the
network being converged happens, such that the alternate remains functional.
a. For instance, if the node E fails, then the Notvia address E_!S
will no longer be advertised. If S was getting link protection
(because that was all that was possible, for instance) by tunneling
traffic to E_!S, it is important that this traffic be properly
discarded when E's addresses go away. This implies that there needs
to be a default blackhole for Notvia addresses.
I don't quite understand your concern here. If E goes away and S is
sending to E_!S, then the neighbors of E will drop the packets because
we don't repair a notvia address.
I'm thinking of this as the more specific prefix goes away. Without a
specific blackhole for the group of prefixes, why wouldn't the packets
take that instead? I.e., if the notvia address is 10.1.1.1 and there's a
default route for 10.1/16 (or for 0.0.0.0/0), then the packet would pick
up the latter when the notvia address is removed.
Yes, you are quite right. We need a NV black hole.
I think this is just a detail to be written down then.
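For anyone who wants to see the longest-prefix-match effect spelled out, here is a minimal Python sketch using the addresses from the example above (10.1.1.1, 10.1/16 and 0.0.0.0/0). The 10.1.1.0/24 not-via block used for the black hole, the action strings and the function name are assumptions for illustration only.

```python
import ipaddress

def lookup(fib, dst):
    """Longest-prefix match over a {prefix: action} table."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, action in fib.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else "no route"

fib = {
    "0.0.0.0/0": "forward via default",
    "10.1.0.0/16": "forward via aggregate",
    "10.1.1.1/32": "forward via not-via repair path",
}
before = lookup(fib, "10.1.1.1")

# E fails and its not-via address is withdrawn: the /16 now catches the packet.
del fib["10.1.1.1/32"]
after_withdrawal = lookup(fib, "10.1.1.1")

# Hypothetical not-via block with a standing discard route.
fib["10.1.1.0/24"] = "discard (NV black hole)"
with_black_hole = lookup(fib, "10.1.1.1")

print(before)
print(after_withdrawal)
print(with_black_hole)
```

Without the black hole the repaired packet follows the aggregate; with it, the packet is dropped as intended.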
c. It is possible to get a micro-forwarding loop affecting a
Notvia address as a result of a less severe failure than
anticipated. For instance, consider the following topology.
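[Topology diagram garbled in the archive. The surviving fragments show link costs of 1, 1, 5 and 10, with the two cost-1 links marked R.]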
Link S->E and Link H->F are in SRLG R
When node E fails, if I converges before H, there will be a loop
affecting the Notvia address being used to reach F without going
through any of Link S->E, E or SRLG R.
We discussed this privately, and I still don't see how loops could
arise even if the notvia FIB were recomputed before normal convergence
is complete. But I think it is better to delay the notvia FIB changes anyway.
Just for clarity (hopefully), before the failure, H computes the path
for F_!E, the address of F's that is notvia E, to go via I and then to
F. After the failure of E, if H installs the changed notvia address F_!E
the path is directly to F, b/c node E no longer has SRLG R associated
with any of E's up links.
I think that the core issue here is the case of a failure during the
reconvergence of the repair topology. Is that in scope?
?? No, I think this is the issue of whether the repair topology
reconverges at the same time as the primary topology or it is delayed. If
it is at the same time, the problem above happens. I think we're all
agreed that delaying the repair topology convergence until the primary
topology convergence is complete is the way to go.
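A minimal Python sketch of that agreed ordering, assuming the router has some signal that primary convergence is complete (how it gets that signal is not addressed here). The class and method names are hypothetical; the example entries follow H's path for F_!E described above.

```python
class NotViaFib:
    """Holds not-via entries and defers changes while the primary topology converges."""

    def __init__(self, entries):
        self.active = dict(entries)    # what forwarding uses right now
        self.pending = {}              # recomputed entries held back
        self.primary_converged = True

    def failure_detected(self):
        self.primary_converged = False

    def update(self, notvia, next_hop):
        if self.primary_converged:
            self.active[notvia] = next_hop
        else:
            self.pending[notvia] = next_hop  # do not disturb repair paths in use

    def primary_convergence_complete(self):
        self.primary_converged = True
        self.active.update(self.pending)
        self.pending.clear()

# H's view: before the failure, F_!E goes via I.
fib_h = NotViaFib({"F_!E": "I"})
fib_h.failure_detected()            # node E fails
fib_h.update("F_!E", "F")           # recomputed path is direct to F
during = fib_h.active["F_!E"]       # still the pre-failure entry
fib_h.primary_convergence_complete()
afterwards = fib_h.active["F_!E"]
print(during, afterwards)
```

Withdrawals of not-via addresses would need the same deferral; that is left out of the sketch.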
d. How do exceptions work? Particularly in regards to an IP-in-IP
encapsulation such as GRE, it doesn't seem like MTU exceeded cases can
be handled cleanly either by use of DF or by doing IP fragmentation
and then the reassembly at the end of the tunnel. This seems like a
problem for all ICMP packets; how could a source understand the header
inside for a TTL expired, for instance.
I'll leave this for Stewart (tunnel) Bryant!
For LDP, there are mechanisms (layer violations though they are) to
handle exceptions generating ICMP packets.
The interesting question is who needs to know of the MTU problem?
If you tell the host, then by the time it adjusts its MTU the
network will likely have reconverged anyway.
It depends on how long the network is taking to get off the
alternates. That could be 10s or so - so there's a chance the host will
be able to adjust its MTU usefully.
However, if you tell the repairing router (which is what will happen
with a tunneled packet), it can raise an alarm and let the network
administrators know that there is a problem with the IPFRR config.
But why does the ICMP packet need to travel back to the repairing
router? Why couldn't the router which sees an MTU exceeded addressed to a
notvia address let the network administrators know?
For this to work, the MTU at the edges needs to be lower than the
MTU in the core.
Sure. My concern isn't so much for the MTU exceeded case as for the TTL
expired case. That's a common network debugging tool - and we'd want it to
continue to work during repair.
e. For IP-in-IP tunnels, another concern is flow diversity. The
IP source and destination addresses are used to determine a flow; this
flow identification may then be used for a variety of purposes,
including ECMP. By putting all the traffic to a variety of
destinations inside the same header, the ability to take advantage of
flow diversity appears to have disappeared. This could possibly be
solved by putting the original source address into the encapsulating
header? Are there other approaches?
Again, for an LDP tunnel, many routers can look under the label and
consider the IP packet inside for flow identification.
I was going to say:
Given that basic cuts in before NV, I think that the only case where
this is a problem is when you have a router with max ECMP = say 2 which
selects two from more than two, and the next hop on one of them fails.
This is surely a corner case?
Then Mike pointed out that we had said that we would use ECMP in the
draft, and yes there is a problem. Again we need to think about the
implications, because it's not clear what we should do.
If the paths for notvia addresses can be ECMP, then it becomes a
concern. Now, maybe it isn't necessary for those to be ECMP - or that
likely. But maybe it could be solved by putting the real source IP address
in the encapsulating packet header?
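A toy Python sketch of the flow-diversity effect, for illustration only. The hash below is deliberately simplistic (real routers use implementation-specific hashes over more fields), and the repairing router's address, the source range and the two-way ECMP are assumptions; only the 10.1.1.1 not-via address is borrowed from the earlier example.

```python
import ipaddress

def ecmp_path(src, dst, n_paths):
    """Toy flow hash on the outer (source, destination) pair."""
    key = int(ipaddress.ip_address(src)) + int(ipaddress.ip_address(dst))
    return key % n_paths

N_PATHS = 2
REPAIRING_ROUTER = "192.0.2.1"   # hypothetical address of S
NOTVIA_ADDR = "10.1.1.1"         # tunnel endpoint
sources = [f"198.51.100.{i}" for i in range(1, 9)]  # hypothetical original sources

# Outer header is (S, not-via): every repaired flow hashes identically.
collapsed = {ecmp_path(REPAIRING_ROUTER, NOTVIA_ADDR, N_PATHS) for _ in sources}

# Outer header is (original source, not-via): diversity comes back.
spread = {ecmp_path(src, NOTVIA_ADDR, N_PATHS) for src in sources}

print("paths used with S as outer source:", sorted(collapsed))
print("paths used with original source:", sorted(spread))
```

Whether routers downstream would accept an outer source address that is not the encapsulating router's own is a separate question that the sketch does not touch.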
Rtgwg mailing list