[email protected]
[Top] [All Lists]

Re: thoughts on draft-bryant-shand-ipfrr-notvia-addresses-00.txt

Subject: Re: thoughts on draft-bryant-shand-ipfrr-notvia-addresses-00.txt
From: Stewart Bryant
Date: Wed, 27 Apr 2005 15:51:19 +0100
First of all I largely agree with Mike's email,
but then that's not going to surprise anyone :)

Alia Atlas wrote:


At 10:35 AM 4/26/2005, mike shand wrote:

At 15:07 25/03/2005 -0500, Alia Atlas wrote:

Second is the list of downsides with the approach. The main concern is that the mechanism becomes too complex such that the trade-off between its complexity and the full coverage is not desirable. 1. This requires a large number of additional IP addresses in the IGP. The same number of additional FECs is required to support LDP.

Yes, it does. In the simplest case of link and node protection, and ignoring LANs it requires 2 addresses per protected link. It is expected that these would come out of a "private" address space, and hence wouldn't consume real addresses. Indeed for security reasons it is preferable that they are private addresses.
I don't think this number is "too many". The question is how does this
number increase when we add LANs and SRLGs.

It would be useful to hear some additional opinions on the impact of adding a large number of addresses. The other question is what is the boundary when it becomes a serious concern.
Also to understand whether the issue is the number of addresses per se
or the inflation of the routing protocol message size.

2. Explicit tunnels are needed, which means that targeted LDP sessions are necessary to have this support LDP traffic.

Yes. In the case of node protection we could also use Naiming's scheme of next-next hop LDP advertisement.

True - but I'd want to think about the implications in terms of additional communication & periods of instability/inaccuracy of knowledge. It also doesn't handle the multi-homed prefix case for the case when the path isn't via the next-next-hop.
OK. I think that we need to work on some state transition description
to make sure that all the bases are covered, and that we have
a common view of the states.

The complexity of MHP is really the complexity of MHP per se, rather
than the complexity of NV.

We have four options:

1) Restrict the reach of the repair to max two hops and
maybe use Naiming's LDP extension.
2) Tunnel the packet (using NV, PQ or whatever) and learn the label
at the far end.
3) Tunnel the packet and then strip all labels and do an IP lookup.
4) Figure out some other method of delivering the packet n hops away
in the base topology (such as n-hop u-turn).

Each of these approaches seems to have its issues, and it's
a question of picking the least unpalatable.

This is a particular concern for multi-homed prefixes; I'll describe my concerns on this later.

Yes. This is a concern for LDP. I don't like the idea of targeted LDP sessions. Two possibilities come to mind
a) each node with an attached MHP distributes an additional label for
that prefix which has the semantics that when you pop that address you
MUST forward the underlying IP packet "directly".
b) an alternative which doesn't require additional labels, but DOES
require a new "well known" label with the above semantics.
Neither are very attractive, but perhaps more attractive than the
directed LDP sessions.

Both of these presume the ability to route based on the nested addresses of the packet. In general, I don't think that this is a valid assumption. Consider, for instance, the case of a BGP-free core. Traffic is directed towards an ASBR in a different area (that is multi-homed to the one being considered). In that case, the ABR may not have the BGP routes to be able to correctly forward the packet based on its IP address. There are also a number of scenarios where what is underneath the top LDP label is another MPLS label & not routable at all.
In which case we either have to:

a) Run the directed LDP session
b) Give up
c) Think of something else.

A "something else" might be domain-wide labels, but I remember the last
time that was proposed in the MPLS WG :)

Are there any other "something else"s that are better?

3. Substantial IGP changes are required to handle the additional Notvia addresses.

Substantial is perhaps a bit strong. We need to advertise the not-via address and its association. For IS-IS it's pretty straightforward. OSPF, by its very nature, may be a little more tricky.

More substantial than a few bits :-) The main issue here is just the interop and migration concerns.
I don't understand. The IGP will flood the TLVs we have in mind.
Non-NV routers will be excluded from base. Could you expand?

5. The management of the Notvia addresses & of the tunnels can create longer time periods where protection isn't available for a part of the network (the new link or node, etc.).

I don't think the tunnels add to the time at all. They are after all just FIB entries. Distributing the notvia addresses for a new node/link will occur at the same time as distributing the information about the link/node in the first place. I don't think it significantly increases the delay.
There is of course the time it takes to recompute notvia routes, but
we think this will be well under a second.
These aspects certainly need thinking about, but they don't seem to
pose insurmountable issues.

I agree that they're not insurmountable - but just require consideration.

Third, there are a number of issues that I feel need considerable discussion to try and resolve. I will try to go through each in turn and explain what I think the various aspects of each are. Each of these issues has the possibility to resolve in such a way that the Notvia Addresses approach becomes overly complex.

Yes. That is a temptation we need to resist!

It's frequently a coverage versus complexity trade-off, where each decision is along the slippery slope - alas!
b. It is desirable to have some dampening on the withdrawal of Notvia addresses to minimize thrashing.

The allocation of notvia addresses to links certainly shouldn't be changed as a result of not "needing" the notvia address when the object with which it is associated goes away. It should also get back the same notvia address when it comes back. But I don't think there are any particular issues associated with them disappearing and reappearing in the LSPs.
Do you have any specific issues in mind?

Only keeping the notvia addresses around until after the network has converged... If the notvia address is withdrawn with the link that's failed, then traffic may still be using that alternate.
Since we are going to use controlled rather than uncontrolled
convergence we can include managing the NV entries in the FIB.
You are right to point out that we have not described how to do this.

c. If configured in blocks, it would be extremely desirable to have the same Notvia address mean the same thing through multiple reboots, etc. It'd be good to have some means of consistent association. This is for easy manageability.

Yes, definitely.

For the case where the notvia address is for a neighbor, it's not always that straightforward - unless one ends up advertising multiple notvia addresses for the same neighbor, depending on the number of parallel links. This is mostly engineering, I think.
d. When a new link or neighbor comes up, there will be a longer period of time when an alternate isn't available because the Notvia address hasn't been advertised yet. These periods without protection need to be clearly understood and minimized.

Yes. I'm not convinced there is a particular problem here, but it does need thinking through carefully.


e. There may be scalability concerns based on the number of Notvia addresses and LDP FECs required. For instance, as described in the draft, it is basically the number of uni-directional links in the topology. This is ignoring the extras for broadcast links. Fully and certainly providing SRLG protection, if at all feasible, would require that each router advertise a Notvia address for every uni-directional link into every neighbor of that router. This would result in K*L additional addresses, where K is the average number of neighbors & L is the number of uni-directional links in the topology.

Yes. This is a major concern, and we need to devise ways of solving SRLGs etc. which minimize the potential proliferation of addresses. We need to get the right tradeoff here between optimal solutions and complexity.

Agreed. We need to understand the impact of additional addresses to know the complexity cost of that versus the reduced coverage of selecting a less complete approach to broadcast link and SRLG protection.
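The K*L scaling discussed above is straightforward to put numbers to; a rough sketch, with hypothetical network figures:

```python
def notvia_count_basic(unidirectional_links: int) -> int:
    # Base scheme: one not-via address per protected uni-directional link.
    return unidirectional_links

def notvia_count_full_srlg(avg_neighbors: int, unidirectional_links: int) -> int:
    # Full SRLG coverage: each router advertises a not-via address for
    # every uni-directional link into every neighbor -> roughly K * L.
    return avg_neighbors * unidirectional_links

# Hypothetical network: 500 routers with average degree 4.
L = 500 * 4            # 2000 uni-directional links
print(notvia_count_basic(L))         # 2000
print(notvia_count_full_srlg(4, L))  # 8000
```

Even for a modest topology, the full-SRLG variant multiplies the address count by the average degree, which is the proliferation both sides are worried about.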
2. Insufficiently diverse topology: It is possible that a network topology cannot provide an alternate that suffices for link, node and SRLG protection. It isn't clear to me how to compute a "best-available" alternate using this approach. For instance, if one can get link protection, but not node protection, how would that be determined, computed and assigned? This becomes much more of a concern for SRLG protection & for topologies where failures have already occurred and the network has converged for those & needs protection in the event of an additional failure.

Clearly it is always possible to create a topology which contains single points of failure and is inherently irreparable. This is part of the tradeoff we need to address when thinking about SRLGs, since taking a simple but pessimistic approach to SRLG can result in this sort of failure. This seems to be a property of the problem rather than any particular solution.

Let me try to explain this a bit better. Say there's a topology that, for a particular next-hop & next-next-hop, can only provide an alternate that gives link and node protection but not SRLG protection. Now, how does the notvia addresses method compute an alternate? If the method is pruning the topology of the relevant link, node & SRLGs, no alternate will be found. However, it was possible to compute & use an alternate that gives the link & node protection.
I need to think about this.
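The pruning computation described above amounts to running SPF over the topology with the protected elements removed; a minimal sketch, with a hypothetical topology and names:

```python
import heapq

def dijkstra(graph, src, pruned):
    # graph: {node: {neighbor: cost}}; `pruned` holds the protected
    # elements removed from the topology for this not-via computation.
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float('inf')):
            continue
        for v, cost in graph[u].items():
            if v in pruned:
                continue
            nd = d + cost
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def notvia_reachable(graph, src, dst, avoid):
    # Prune the protected element(s), run SPF, and check whether dst is
    # still reachable; if not, no fully protecting alternate exists.
    return dst in dijkstra(graph, src, set(avoid))

# Hypothetical topology: S can reach D via E, or around it via A and B.
g = {
    'S': {'E': 1, 'A': 1},
    'E': {'S': 1, 'D': 1},
    'A': {'S': 1, 'B': 1},
    'B': {'A': 1, 'D': 1},
    'D': {'E': 1, 'B': 1},
}
print(notvia_reachable(g, 'S', 'D', ['E']))       # True: S-A-B-D avoids E
print(notvia_reachable(g, 'S', 'D', ['E', 'B']))  # False: no repair path
```

The second call shows Alia's concern: prune too much (node plus SRLG) and the computation finds nothing, even though pruning only E would have yielded a usable, partially protecting alternate.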

The similar case can easily occur with link & node protection. Say S has two parallel links to E; if the first fails, S could use the other to get link protection - but there is no node-protecting alternate. How does S determine this? What is the fall-back strategy in the case that no "full-protection" alternate is available?
In this case:

S fails E (i.e. computes with E removed from the topology) and
computes the NV paths to its neighbors. If any or all of these are
unreachable it uses a link repair to E_!S to reach them, as described
in Section 4.2 of the draft. If E_!S does not exist, as in the case
above, S then looks to see if the parallel link exists.

Of course in the absence of SRLGs, this topology contains
a single point of failure for node protection and will always be
expected to have limited repair coverage.

You are correct that this all needs describing in detail.

3. Failure Diagnosis versus Pessimism: As written, the draft discusses the idea of doing failure diagnosis using BFD. As Stewart, Mike & I have discussed, this isn't possible for SRLG failures, although it is possible for broadcast links.

Yes, and this relates to (2) above.

a. I am concerned about adding the failure diagnosis. This is yet another level of complexity for implementation. It also has ramifications for the forwarding plane, because of the need to store multiple alternates to use & have multiple states to check to decide what to use.

Yes. It would be nice not to have to do it, but that is back to the tradeoff above.

Complexity vs. coverage? I'm very fond, unsurprisingly, of options that don't require hardware changes... so I judge that the complexity is rather high to support this one - as well as more error-prone (see comment on unreliable diagnosis).
b. An example of a concern with the BFD diagnosis is that all interfaces on a node that has failed are not certain to fail exactly simultaneously, or even within a bounded sub-50ms window. It is entirely possible that BFD sessions are terminated on different line-cards, which detect the router failure at slightly different times and, therefore, stop forwarding traffic at slightly different times.

Yes. There is the possibility of misdiagnosis in this case if the second failure occurs too long after the first. I suppose this then looks like two separate failures. Clearly an unreliable diagnosis is probably worse than no diagnosis at all. We need to get some handle on how realistic or not this scenario is.

Well, I think it is exceedingly realistic :-)
For a non-power related failure, routers with separate forwarding & control planes may take varying amounts of time for the line-cards to all realize that the route controller is down.
Well maybe for power-failures as well :)

The pathology of this sort of failure is highly implementation dependent.
Say BFD was running on the LC, but the switch fabric was down.
You could end up with the neighbors thinking that the router was still
up, but it was non-functional. Eventually routing would notice the
absence of routing hellos, unless of course, these had also been
delegated :) Perhaps we need to run BFD to the neighbor's neighbors
on the direct path?

The problem is that we rapidly get on a complexity spiral that
becomes intractable.

We clearly need to write down a set of project scoping rules for the
types of failure that we will and will not deal with.

c. The other approach is to pessimistically eliminate all routers connected to the broadcast link as well as the broadcast link; this may not provide an alternate.

Yes. While simple, it runs into the problem of being a single (albeit large) point of failure. It's the same trade-off as above.

Don't they all reduce to that?

It also needs to be thought through what issues might exist if the topologies used for the SPF vary slightly for each router that is on the broadcast link, since each will, as described, not prune itself out when doing the computation; of course, there could be an approach where the same topology can be used everywhere.

I'm not really sure what you mean here.

Let me try and explain it a bit. Perhaps I'm missing something. In the case where a notvia topology results in pruning the router doing the computation, what forms the root of the SPT? Say routers A, B and C are all connected to a broadcast link X and want to compute a notvia X address as described in (c) by pruning the pseudo-node related to X as well as A, B, and C. Now, router A prunes the pseudo-node, A, B and C from the topology; what does A use as the root? IF A only prunes the pseudo-node, B and C to compute notvia X, B only prunes the pseudo-node, A, and C, and C only prunes the pseudo-node, A and B, and all other routers prune the pseudo-node, A, B, and C, can there be any issues with a consistently computed & non-looping path for notvia X?
I think it may not be an issue - b/c once the traffic leaves A, B or C,
it will never return - but it at least needs some thought, since this is
a bit different from what's traditionally been done.
Agreed, we need to write down the algorithm and subject it to review.
It isn't clear to me what Notvia addresses would be needed to express "don't go through this pseudo-node or any nodes attached to it"; I don't think that it is simply the Notvia address for avoiding a particular node.

No, it would need a specific notvia address bound to the LAN interface.


4. Multi-homed Prefixes: I am quite concerned about the mechanisms suggested in the draft.

a. First, I really do not like the idea of having separate forwarding for "local" prefixes that come out of a tunnel. What is a local prefix? For instance, does this mean that an ABR has to forward traffic differently depending on which area the traffic from the tunnel has come from? I am concerned about how this would scale; maybe only 2 FIBs are needed (one for backbone & one for other), but it may be worse to handle AS external routes. I know that Stewart, Mike, Joel, Albert and I had discussed/agreed to put this idea out of scope at least for the moment.

Clearly the problem needs solving, especially since prefixes which are multihomed are frequently the most important prefixes (which is WHY they are multihomed in the first place).

inconvenient that!

b. I am quite concerned about having tunnels to the advertisers of the prefixes.

i. There needs to be a mechanism to determine whether the advertiser of a prefix will forward the packet in a loop-free fashion to avoid the failure point. The separate forwarding for "local" prefixes avoided the need for this determination, but at more substantial cost.

There seem to be two aspects to this.

a) we need the ability to get the packet to the "second-best" attachment point for the prefix without it being "sucked back" to the failure. This in general requires a tunnel, except for the cases where a neighbor of the node detecting the failure has an LFA to the second best attachment point. Clearly this could be used in preference to a tunnel where available, but at the expense of additional complexity. However this is really just an extension of the general principle that we should use "basic" (i.e. LFA) repair to cream off traffic which doesn't NEED to be tunnelled.
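The LFA check Mike refers to is the standard basic-IPFRR loop-free condition; a minimal sketch, with hypothetical distances:

```python
def is_loop_free_alternate(d_n_d: int, d_n_s: int, d_s_d: int) -> bool:
    # Neighbor N of S is loop-free for destination D iff
    #   dist(N, D) < dist(N, S) + dist(S, D)
    # i.e. N's shortest path to D does not come back through S.
    return d_n_d < d_n_s + d_s_d

print(is_loop_free_alternate(2, 1, 2))  # True: N reaches D without S
print(is_loop_free_alternate(4, 1, 2))  # False: N would route via S
```

Where a neighbor satisfies this inequality for the second-best attachment point, the traffic can be creamed off without a tunnel, as suggested above.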

Yes - though the tunnels bring out the LDP issues with targeted sessions - of course.
b) we need (in a very limited number of cases), the ability to force the packet to the locally attached prefix. This only occurs where the local cost is high compared to the cost back to the failed attachment point. But when we DO need it, the use of a tunnel is a convenient means of signalling this. I'm not sure how else to do it, other than using a label.
Of course ONE "solution" would be to REQUIRE the costs to be set
sensibly :-)

I like that one :-) but you knew I would! We may need to define what that means better or have a way of determining that it is the case.
ii. To support LDP, every tunnel requires a targeted LDP session. If multi-homed prefixes are common, then this becomes a full mesh for LDP. That isn't acceptable.


Of course, multi-homed prefixes may be much more infrequent for LDP than for IP; for example, there is no reason to advertise a separate FEC for the subnet of a link. However, multi-homed prefixes are a concern for LDP for at least the inter-area, AS External, and BGP routes.

iii. If traffic is encapsulated to a node's regular address, because that traffic is destined to a prefix advertised by the node, how does the receiving node know to remove the encapsulation and forward the packet inside, all in the fast path? Is this just a question of different handling based on the header type inside the outer encapsulation (for GRE)?


OK. The traffic wouldn't be directed up to the control plane because it was GRE encapsulated??
GRE always pops the header at the tunnel endpoint. That is how it works.
And had a special header type for this purpose? Certainly I can see something like this working with an LDP LSP, b/c the label would just get it to that router & then be popped & the packet forwarded based on what's underneath.
Perhaps an MPLS label of some sort the way we thought of doing directed
forwarding and the way that Mark Townsley proposed doing IP VPN?

iv. Perhaps these issues could be handled by determining a next-next-hop that avoids the failure to reach an appropriate advertiser. Of course, this is a different set/type of computation.

Could you explain that suggestion please?

Well, if there is a neighbor's neighbor whose path to the multi-homed prefix doesn't go through the failure & this can be determined, then the traffic could be tunneled to that neighbor's neighbor & then normally forwarded from there.
Yes, but you only get two hop reachability. Perhaps you do this,
and then do directed LDP for the remaining (perhaps 2%) of cases.
The problem I have with this is the added complexity.

Basically, if one knows the SPT from each neighbor's neighbor & can
reach all of those neighbor's neighbors without going through the
failure, then it might provide an alternate. The issue there is first
that the path to a neighbor's neighbor might go via the failed element &
not have an appropriate notvia & second the effort of computing and
considering the different SPTs.
Does that make more sense?

5. SRLGs and Broadcast Links: There seem to be a number of possible ways to handle SRLGs and broadcast links, each of which provides a different trade-off in terms of coverage, computation, and extra Notvia addresses.


  There are basically 4 approaches at this point.
a. First, in order to compute a notvia alternate that avoids a link, the primary neighbor, and all SRLGs that the link is part of, it is necessary to have a separate topology and associated SPF computation for each link that is a member of an SRLG or a broadcast link. This also requires a substantially larger number of Notvia addresses and the corresponding mechanisms to determine how and when to allocate and de-allocate them.

.. and could potentially result in a combinatorial explosion if we weren't very careful.

I do think of this as having the highest potential to be the "too complicated & run-away" scenario.
b. Second, one could use a topology that removed the primary neighbor and see whether SRLG protection can be obtained either along S's path or along any path of a neighbor of S that is also loop-free.

Could you explain that a bit more please?

This is the concept of looking for a loop-free neighbor to the notvia address whose path there happens to give SRLG protection. We'd discussed this one on the last day at IETF.
c. Third, when a Notvia address indicates to avoid a node, one could remove not merely the node & the uni-directional links to and from that node, but also any other links that are in a common SRLG with any of the links to or from the removed node. This is pessimistic but allows some SRLG protection without increased computation or Notvia addresses.

Yes. This is nice and simple, but as you have pointed out above, could easily result in an inability to find a viable repair.

At the risk of additional complication, one could have it configurable as to the specific handling of SRLGs. For instance, most link/node/SRLG would be handled with a single notvia per node - and then there could be configured specific links that required their own notvia. Of course, that adds substantial extra complexity. Is it necessary?
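Option (c)'s pessimistic prune set can be sketched as follows; the link names and the tiny SRLG database are hypothetical:

```python
def pessimistic_prune_set(node, links_of, srlg_of, members_of):
    # When avoiding `node`, also prune every link that shares an SRLG
    # with any link into or out of that node (option (c) above).
    pruned = set(links_of[node])
    for link in links_of[node]:
        for srlg in srlg_of.get(link, ()):
            pruned |= set(members_of[srlg])
    return pruned

links_of = {'E': ['S-E', 'E-F']}      # links incident on node E
srlg_of = {'S-E': ['R']}              # link S-E belongs to SRLG R
members_of = {'R': ['S-E', 'H-F']}    # R also contains link H-F

print(sorted(pessimistic_prune_set('E', links_of, srlg_of, members_of)))
# ['E-F', 'H-F', 'S-E']
```

The attraction is that one prune set (and hence one not-via address) covers link, node, and SRLG failure at once; the cost, as noted above, is that the larger the prune set, the likelier that no viable repair remains.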
d. Fourth, one could simply track the SRLGs encountered along the Notvia path; this just reports whether the alternate provides SRLG protection without any effort to obtain it.

Yes. Interesting. I wonder how useful this would be.

Well, it could feedback to a network design at least - to give an indication of coverage. Also, not all SRLGs have the same likelihood to fail. So, if avoiding all doesn't work, perhaps one can avoid the most risky ones - and then report the protection against the failures of the others.
This is part of my concern/question about what notvia alternate is
computed if the best protection isn't possible.
6. Implementability: Clearly, the draft describes the basic idea for Notvia addresses, but there are a fair number of implementation/protocol decisions that need to be made before this can become anything more than an interesting idea.

Sure. There are quite a few design decisions and tradeoffs as indicated above that need tying down.
7. There is a definite need to describe the convergence case better. This is how the transition from using the alternate to the network being converged happens, such that the alternate remains functional.

a. For instance, if the node E fails, then the Notvia address E_!S will no longer be advertised. If S was getting link protection (because that was all that was possible, for instance) by tunneling traffic to E_!S, it is important that this traffic be properly discarded when E's addresses go away. This implies that there needs to be a default blackhole for Notvia addresses.

I don't quite understand your concern here. If E goes away and S is sending to E_!S, then the neighbors of E will drop the packets because we don't repair a notvia address.

I'm thinking of this as the more specific prefix goes away. Without a specific blackhole for the group of prefixes, why wouldn't the packets take a less specific route instead? I.e., if the notvia address falls under a covering route such as 10.1/16, then the packet would pick up the latter when the notvia address is removed.
Yes, you are quite right. We need a NV black hole.

Or are you concerned that after convergence, there will be nodes which don't even have a forwarding entry for E_!S. By this time I don't think that S (or anyone else) should still be using that address, but even if it were, the absence of a forwarding entry would (SHOULD) cause the packet to be dropped. Is this all you are saying?

I was also worrying about the above in reference to the notvia addresses. While it is possible to say that changes to notvia addresses shouldn't be installed until after the network has otherwise converged, that sort of detail needs to be clarified.
b. Another example is when node E fails, the next-next-hop B must continue to advertise the Notvia address B_!E until the network converges so that S can continue to tunnel traffic to B_!E as the alternate.

Yes. Our view was that no changes would be made to notvia advertisement or more specifically notvia FIB entries until after convergence is over. Of course there is an issue as to how you tell when that has happened, but the timers associated with loop free convergence probably give a good indication.

Conceptually, I agree. Having a primary topology to forward the traffic on while the backup one reconverges helps.
c. It is possible to get a micro-forwarding loop affecting a Notvia address as a result of a less severe failure than anticipated. For instance, consider the following topology.
[ASCII topology diagram not preserved in the archive]

     Link S->E and Link H->F are in SRLG R

When node E fails, if I converges before H, there will be a loop affecting the Notvia address being used to reach F without going through any of Link S->E, E or SRLG R.

We discussed this privately, and I still don't see how loops could arise even if the notvia FIB were recomputed before normal convergence is complete. But I think it is better to delay the notvia FIB changes anyway.

Just for clarity (hopefully), before the failure, H computes the path for F_!E, the address of F's that is notvia E, to go via I and then to F. After the failure of E, if H installs the changed notvia address F_!E the path is directly to F, b/c node E no longer has SRLG R associated with any of E's up links.
I think that the core issue here is the case of a failure during the
reconvergence of the repair topology. Is that in scope?

d. How do exceptions work? Particularly in regards to an IP-in-IP encapsulation such as GRE, it doesn't seem like MTU exceeded cases can be handled cleanly either by use of DF or by doing IP fragmentation and then the reassembly at the end of the tunnel. This seems like a problem for all ICMP packets; how could a source understand the header inside for a TTL expired, for instance.

I'll leave this for Stewart (tunnel) Bryant!

For LDP, there are mechanisms (layer violations though they are) to handle exceptions generating ICMP packets.
The interesting question is who needs to know of the MTU problem?

If you tell the host, then by the time it adjusts its MTU the
network will likely have reconverged anyway.

However if you tell the repairing router (which is what will happen
with a tunnelled packet), it can raise an alarm and let the network
administrators know that there is a problem with the IPFRR config.

For this to work, the MTU at the edges needs to be lower than the
MTU in the core.

e. For IP-in-IP tunnels, another concern is flow diversity. The IP source and destination addresses are used to determine a flow; this flow identification may then be used for a variety of purposes, including ECMP. By putting all the traffic to a variety of destinations inside the same header, the ability to take advantage of flow diversity appears to have disappeared. This could possibly be solved by putting the original source address into the encapsulating header? Are there other approaches?

and this.

Again, for an LDP tunnel, many routers can look under the label and consider the IP packet inside for flow identification.
I was going to say:

Given that basic cuts in before NV, I think that the only case where
this is a problem is when you have a router with max ECMP = say 2 which
selects two from more than two, and the next hop on one of them fails.
This is surely a corner case?

Then Mike pointed out that we had said that we would use ECMP in the
draft, and yes there is a problem. Again we need to think about the
implications, because it's not clear what we should do.
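The flow-diversity point can be illustrated with a toy ECMP hash: once distinct inner flows share one outer repair-tunnel header, they all hash to the same path (addresses hypothetical):

```python
def ecmp_bucket(src: str, dst: str, n_paths: int) -> int:
    # Toy ECMP: choose a path from a hash over (source, destination).
    return hash((src, dst)) % n_paths

inner_flows = [('', ''),
               ('', '')]

# Encapsulated to the not-via address, every flow carries the same
# outer (S, E_!S) header, so all repaired traffic lands on one path.
tunnelled = {ecmp_bucket('', '', 4) for _ in inner_flows}
print(len(tunnelled))  # 1
```

Copying the original source address into the encapsulating header, as suggested above, would restore some of the lost entropy; looking under an LDP label is the other option Alia mentions.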

- Stewart


Rtgwg mailing list
[email protected]
