[email protected]
[Top] [All Lists]


Subject: Re: Fwd: I-D ACTION:draft-atlas-ip-local-protect-uturn-01.txt
From: Curtis Villamizar
Date: Thu, 28 Oct 2004 13:21:59 -0400
In message <[email protected]x>
"Naidu, Venkata" writes:
> -> Alia, we have talked briefly at the IETF a couple of times. 
> -> I am interested 
> -> in the IP fast reroute concept and hope that we can use it 
> -> in our network 
> -> some time in the future. I have a concern however. It is 
> -> regarding the 
> -> complexity of designing and keeping the topology so that 
> -> uturn or even 
> -> better, loop-free approach gives 100% coverage. Is my 
> -> feeling correct that 
> -> you need to keep a  very dense topology to reach the 100% goal?
>   Good question. 100% coverage is specific to the number of
>   failures. An algorithm/approach designed for 100% coverage
>   of 1-failure may not give 100% coverage for 2-failures
>   (simultaneous failures), even if the same topology is
>   sufficient to cover 2-failures.
>   For example, a 1-failure 100% coverage topology on any V
>   vertices needs at least V edges, while V-1 edges are enough
>   for a merely connected graph. V edges is still O(V), which
>   is sparse. So a sparse topology is sufficient for 1-failure
>   100% coverage. In the real world, these types of topologies
>   are very common: look at token ring, SONET rings, etc. All
>   these ring topologies cover 1-failure very well, because the
>   graph is still connected if a node/link fails in a ring.
>   Coming to k-failure (simultaneous) 100% coverage topologies,
>   the graph becomes denser and denser. Moreover, such graphs
>   must have very particular properties, such as t-spanners.
>   It is a very interesting research topic to find out when
>   the topology goes from sparse to dense (i.e., E=O(V) to
>   E=O(V^2)). As the number of failures increases by one, the
>   number of edges increases exponentially. So 2-failure 100%
>   coverage would need a dense graph, IMO.
> Venkata.
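The quoted ring argument (V edges on V vertices give 1-failure coverage, while V-1 edges give only connectivity) can be sanity-checked with a short sketch. This is an illustrative toy, not any standard tool; the function names are made up.

```python
def connected(nodes, edges):
    """BFS reachability: True if the graph on `nodes` with the
    undirected `edges` is a single connected component."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [nodes[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == set(nodes)

def survives_any_single_link_failure(nodes, edges):
    """1-failure 100% coverage: the graph stays connected after
    removing any one link."""
    return all(connected(nodes, [e for e in edges if e != failed])
               for failed in edges)

nodes = [0, 1, 2, 3, 4]
ring = [(i, (i + 1) % 5) for i in range(5)]   # V edges: sparse, O(V)
tree = [(0, 1), (1, 2), (2, 3), (3, 4)]       # V-1 edges: merely connected

print(survives_any_single_link_failure(nodes, ring))  # True
print(survives_any_single_link_failure(nodes, tree))  # False
```

Removing any one ring link leaves a path, which is still connected; removing any tree link partitions the graph, which is why a ring is the minimal 1-failure-covering topology.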


A more practical way of looking at the operational problem is to
consider single failures plus any correlated failures that are likely
to occur.

For example:

  Node protection can be thought of as a simultaneous failure of all
  edges that terminate at a given node.

  Links (edges) that travel over a common length of fiber (DWDM) may
  be considered to have a high probability of correlated failure.

  Links that rely on other common equipment or resources, including
  power, may experience correlated failure.  Whether to consider the
  probability of failure of any given resource significant enough is
  up to the operator.

  In some cases fiber along the same physical path can fail.  Classic
  examples are fiber on the same bridge that collapsed, fiber on both
  sides of the same railroad track, fiber in gas pipes dug up by a
  contractor digging up the wrong unused gas pipes, and fiber along
  the same (San Diego) earthquake fault.  The operator needs to
  decide if any case is sufficiently probable.
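The cases above can all be modeled the same way: each failure event, whether a node failure or a correlated fiber cut, is just a set of links assumed to fail together. A minimal sketch, with a made-up topology and a hypothetical shared-conduit grouping:

```python
def node_failure(node, edges):
    """Node protection viewed as the simultaneous failure of every
    edge that terminates at the given node."""
    return {e for e in edges if node in e}

# A small square topology with one diagonal (illustrative only).
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("B", "D")]

# Correlated failure: links assumed to ride the same fiber span.
shared_conduit = {("A", "B"), ("D", "A")}

# Both kinds of event reduce to "remove this edge set, re-check":
failure_cases = [node_failure("B", edges), shared_conduit]
for case in failure_cases:
    print(sorted(case))
```

Treating every correlated case as an edge set lets one coverage check handle node failures and fiber cuts uniformly.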

Many causes of correlated failure are so improbable that they need
not be considered.  For example, redundant power is not supposed to
fail.  In the late 1980s the FAA lost connectivity on the eastern
corridor when a backup generator used by both NYNEX and AT&T (in the
same building) failed and their circuits through NYC from two
different carriers failed simultaneously due to power.  In the
mid-1990s a rat exploded in a San Jose power transfer switch and took
out grid and backup power.  These are low-probability events, but not
zero probability.  The same goes for fiber on the same earthquake
fault line.

FRR provides fast transient failover, not the only failover.  The
question is then economic, not technical: at what point is the
probability of a given correlated failure small enough, and the risk
of revenue loss from violating an SLA or of customer ill will low
enough, that the cost of protecting against that particular failure
is no longer justified?  If the risk is that a correlated failure
will result in failover in under 1 second, or a few seconds, rather
than under 50 msec, and the probability of occurrence is very low and
the cost to provide protection is high, then not protecting against
the low-probability correlated failure is a justifiable risk.

There are many ways that a provider can do this analysis.  A
first-order analysis (which I think was the original question) is to
consider just the minimal topology changes needed to cover single
failures.  Based solely on probability estimates and a chosen
probability threshold, a better approximation includes the "likely"
correlated failures by including SRLGs in the analysis and making
sure they are supported in the recovery mechanism.  This analysis can
be done knowing just the topology and the set of SRLGs that need to
be considered (if it is not the null set), so a vendor can do it for
the provider given that information.
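Given just the topology and the SRLG set, that coverage check can be sketched as follows. This is a toy model, not a vendor tool: each SRLG is represented as a set of links taken to fail together, and coverage means the topology survives every single link failure and every declared SRLG.

```python
def connected(nodes, edges):
    """BFS reachability check over an undirected edge list."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = set(), [nodes[0]]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n] - seen)
    return seen == set(nodes)

def covered(nodes, edges, srlgs):
    """True if the topology survives every single link failure and
    every declared SRLG (a set of links that fail together)."""
    cases = [{e} for e in edges] + [set(g) for g in srlgs]
    return all(connected(nodes, [e for e in edges if e not in case])
               for case in cases)

nodes = [0, 1, 2, 3]
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
srlg = [{(0, 1), (2, 3)}]  # two ring links sharing a fiber path

print(covered(nodes, ring, []))               # True: ring covers singles
print(covered(nodes, ring, srlg))             # False: SRLG partitions it
print(covered(nodes, ring + [(0, 2)], srlg))  # True: a chord restores coverage
```

The last two lines show the point of the paragraph: a topology that is fine for single failures can fail an SRLG check, and the analysis tells you the minimal change (here, one chord) that restores coverage.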

Only the provider can determine the true risk by estimating the
revenue loss if a given correlated failure is unprotected, the
probability of that failure, and the cost of the candidate remedies.
That would allow a more thorough cost vs risk analysis.
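As first-order arithmetic, that cost-vs-risk comparison reduces to expected annual loss versus the cost of the remedy. All figures below are made-up placeholders, since only the provider knows the real estimates:

```python
# Hypothetical inputs -- the provider's own estimates go here.
p_failure_per_year = 0.02       # probability the correlated failure occurs in a year
loss_if_unprotected = 500_000   # SLA penalties plus customer ill will, in dollars
remedy_cost_per_year = 30_000   # e.g. amortized cost of a diverse fiber path

expected_loss = p_failure_per_year * loss_if_unprotected
print(expected_loss)                          # 10000.0
print(expected_loss < remedy_cost_per_year)   # True: not protecting is defensible here
```

With these (invented) numbers the expected loss is well under the remedy cost, which is exactly the "justifiable risk" case described earlier; with a higher failure probability or a cheaper remedy the inequality flips.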

The important thing to remember is that the risk in question is the
risk of not having protection during the convergence transient (which
should be sub-second to a few seconds).


ps - Or you can declare SRLG (or SRG?) to be a non-goal.

Rtgwg mailing list
[email protected]
