
Re: FW: composite link - candidate for respin, maybe

Subject: Re: FW: composite link - candidate for respin, maybe
From: Curtis Villamizar
Date: Mon, 29 Mar 2010 00:55:37 -0400

See inline.  Assume that no response on my part means full agreement
(or you didn't add [LY] and I missed it, since that is what I searched
for).


In message <[email protected]>
Yong Lucy writes:
> Curtis,
> Thank you for this work. You have a quick and good start.
> See inline.
> Regards,
> Lucy
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On Behalf Of
> > Curtis Villamizar
> > Sent: Friday, March 26, 2010 2:02 AM
> > To: [email protected]
> > Subject: composite link - candidate for respin, maybe
> > 
> > 
> > This email is to the authors of the CL drafts and the WG as a whole.
> > 
> > I started out writing a quick few bullets on what we talked about in
> > today's WG meeting that we seemed to agree on and that made some sense
> > to me.  The latter part may be a filter that introduces artifacts to
> > the signal (or noise).
> > 
> > That didn't seem to stand alone so I added some terms up front and
> > then some introductory text, trying to keep it concise.
> > 
> > What I would like to ask the WG is whether I'm on a reasonable track
> > here.  [hopefully we won't have continued dead silence on the list.]
> > 
> > What I would like to ask the authors is whether they would consider
> > this a reasonable restart point.
> [LY] This is reasonable to me. 
> >
> > I've tried to keep in mind the chair's instructions and what I sensed
> > to be the feelings in the audience (though we didn't formally hum - is
> > formal WG hum an oxymoron?).
> > 
> >   keep it short - 5 pages including boilerplate if possible
> [LY] We understand what that implies. It is not necessary to limit it to 5
> pages; concise is the goal.
> > 
> >   include clear requirements
> > 
> >   do not mandate implementation details
> > 
> > I'm not sure if the usage scenarios at the end are needed or if this
> > could stand alone.
> > 
> > There was a lot of chatter in the room so I apologize if I missed
> > something.  This is based on my recollection which is known to have a
> > fairly high bit error rate even on better days.
> > 
> > Curtis
> > 
> > 
> > 
> > btw - this is obviously not an I-D as is for more reasons than missing
> > boilerplate.  It's also late and I'm tired so I made no attempt to get
> > citation right.
> > 
> > 
> > 
> > 
> > Key terms:
> > 
> >   flow - A flow in the context of this document is an aggregate of
> >     traffic for which packets should not be reordered.  A flow is
> >     similar to a microflow or ordered aggregate as defined in
> >     [diffserv framework].  The term "flow" is used here for brevity.
> >     This definition of flow should not be interpreted to have broader
> >     scope than this document.
> > 
> >   flow identification - The means of identifying a flow or a group of
> >     flows may be specific to a type of payload.
> [LY] IMO: in the context of this draft, flow identification identifies a
> flow. Not a group of flows. A flow has to be transported over a single
> component link in order to preserve the ordering.

A particular flow identification method may isolate a group of one.
That is neither precluded nor required.  If a group of flows is bound
to a component link, all the flows in that group are bound.
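A minimal sketch of that binding, assuming a hash on IP source and
destination address picks the group. The names (flow_group,
group_to_link, rebind_group), the group count, and the use of CRC32
are all hypothetical and chosen only for illustration:

```python
import zlib

NUM_GROUPS = 8  # hypothetical number of flow groups (hash buckets)

def flow_group(src_ip: str, dst_ip: str) -> int:
    """Map a flow to a group; every packet of one flow gets the same group."""
    key = f"{src_ip}>{dst_ip}".encode()
    return zlib.crc32(key) % NUM_GROUPS  # CRC32 is only a stand-in hash

# The group, not the individual flow, is what gets bound to a component link.
# Here two component links (0 and 1) share the groups.
group_to_link = {g: g % 2 for g in range(NUM_GROUPS)}

def component_link(src_ip: str, dst_ip: str) -> int:
    """Component link currently carrying this flow."""
    return group_to_link[flow_group(src_ip, dst_ip)]

def rebind_group(group: int, link: int) -> None:
    """Move a whole group; every flow in it moves together."""
    group_to_link[group] = link
```

Nothing here stops a group from holding exactly one flow. Ordering
within each flow is preserved because a flow never straddles two
groups.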

I think where you are confused is that if a LSP is self identified
(through signaling) or identified through management plane as requiring
a strict ordering over the entire LSP, then it is treated as an
individual LSP.

[begin aside]

BTW- Existing techniques isolate groups of flows and do a good job of
load balancing.  IP traffic can have millions of flows, most of which
are short lived.  If IP hash is the basis for an entropy label there
would be millions of values (up to 20 bits worth - minus a few if
reserved label values are avoided, though 19 bits would be easier to
implement than a modulo).

Even if all the flows are LSP, if a customer PTP Ethernet is carried
as a PW there is no way to know when that mostly idle Ethernet is
about to start the nightly database backup and suddenly peak and stay
there for a while.  Lots of applications are like this.  Some grow
gradually, some spike suddenly.

[end aside]
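
One possible reading of the "19 bits versus modulo" remark in the
aside, as a sketch. The function names are hypothetical, and treating
"19 bits" as "force the top bit of the 20-bit label and fill the rest
from the hash" is an assumption, not something stated above. The
reserved range 0-15 is the standard MPLS reserved label range:

```python
LABEL_BITS = 20
RESERVED_LABELS = 16  # MPLS label values 0-15 are reserved

def label_by_modulo(h: int) -> int:
    """Use nearly the whole 20-bit space; needs a modulo by a non power of two."""
    return RESERVED_LABELS + h % ((1 << LABEL_BITS) - RESERVED_LABELS)

def label_by_19_bits(h: int) -> int:
    """Keep 19 hash bits and set the top bit; only a mask and an OR are needed."""
    return (1 << 19) | (h & ((1 << 19) - 1))
```

Both forms stay clear of the reserved values. The second gives up half
of the label space in exchange for avoiding the division.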

> >   top label entry - In MPLS the top label entry contains the label on
> >     which an initial forwarding decision is made.  This label may be
> >     popped and the forwarding decision may involve further labels but
> >     that is immaterial to this discussion.
> > 
> >   label stack - In MPLS the label stack includes all of the MPLS
> >     labels from the top of the stack to the label marked with the
> >     S-bit (Bottom of Stack bit) set.
> > 
> >   outer and inner LSP - The LSP associated with labels in the outer
> >     encapsulation are called outer LSP.  Those LSP which are
> >     associated with inner encapsulation (closer to the label entry
> >     containing the S-bit) are called inner LSP.  These are not called
> >     top and bottom LSP since MPLS and PWE draw the label stack in
> >     opposite directions with PWE putting the outermost label on the
> >     bottom of diagrams (and confusing people in doing so).
> > 
> >   component link - See RFC4201.
> [LY] IMO: component link in RFC 4201 applies to component links with the same
> TE metrics. The component links in a composite link support different
> characteristics.

Fine.  Propose a definition of component link.

> >   composite link - [pull from existing I-D]
> > 
> > Introduction:
> > 
> >   There is often a need to provide large aggregates of bandwidth that
> >   are best provided using parallel links between routers or MPLS LSR.
> >   In core networks there is often no alternative since the aggregate
> >   capacities of core networks today far exceed the capacity of a
> >   single physical link or single packet processing element.
> > 
> >   Today this requirement can be handled by Ethernet Link Aggregation
> >   [IEEE802.1AX], link bundling [RFC4201], or other aggregation
> >   techniques some of which may be vendor specific.  Each has strengths
> >   and weaknesses.
> > 
> >   The term composite link is more general than terms such as link
> >   aggregate which is generally considered to be specific to Ethernet
> >   and its use here is consistent with the broad definition in [ITU
> >   8xxx].
> > 
> >   Large aggregates of IP traffic do not provide explicit signaling to
> >   indicate the expected traffic loads.  Large aggregates of MPLS
> >   traffic are carried in MPLS tunnels supported by MPLS LSP.  LSP
> >   which are signaled using RSVP-TE extensions do provide explicit
> >   signaling which includes the expected traffic load for the
> >   aggregate.  LSP which are signaled using LDP do not provide an
> >   expected traffic load.
> > 
> >   MPLS LSP may contain other MPLS LSP arranged hierarchically.  When
> >   an MPLS LSR serves as a midpoint LSR in an LSP carrying other LSP as
> >   payload, there is no signaling associated with these inner LSP.
> >   Therefore even when using RSVP-TE signaling there may be
> >   insufficient information provided by signaling to adequately
> >   distribute load across a composite link.
> [LY] FRR is a good example.

I take it that no change is needed and you are just pointing out an
example.

> >   Generally a set of label stack entries that is unique across the
> >   ordered set of label numbers can safely be assumed to contain a
> >   group of flows.  The reordering of traffic can therefore be
> >   considered to be acceptable unless reordering occurs within traffic
> >   containing a common unique set of label stack entries.  Existing
> >   load splitting techniques take advantage of this property in
> >   addition to looking beyond the bottom of the label stack and
> >   determining if the payload is IPv4 or IPv6 to load balance traffic
> >   accordingly.
> [LY] Does the draft aim at MPLS networks? In an MPLS network, there are IP
> packets from the control plane. Should we limit to this scope for now?

No one has demonstrated any technical reason to limit scope to MPLS
only, let alone a strong reason.  This is a paragraph about what is
known to be "generally safe" to load split on and what existing
implementations do.  It is generally safe except now it is unsafe for
MPLS-TP OAM LM only.  I should add citations to the PWE CW and the
motivation for CW which is based on this behaviour.

> >   For example a large aggregate of IP traffic may be subdivided into a
> >   large number of groups of flows using a hash on the IP source and
> >   destination addresses.  This is as described in [diffserv
> >   framework].  For MPLS traffic carrying IP, a similar hash can be
> >   performed on the set of labels in the label stack.  These techniques
> >   are both examples of means to subdivide traffic into groups of flows
> >   for the purpose of load balancing traffic across aggregated link
> >   capacity.  The means of identifying a flow should not be confused
> >   with the definition of a flow.
> > 
> >   Discussion of whether a hash based approach provides a sufficiently
> >   even load balance using any particular hashing algorithm or method
> >   of distributing traffic across a set of component links is outside
> >   of the scope of this document.
> > 
> >   The use of three hash based approaches is defined in RFCxxxx.  The
> >   use of hash based approaches is mentioned as an example of an
> >   existing set of techniques to distribute traffic over a set of
> >   component links.  Other techniques are not precluded.
> [LY] Not sure why mention hash here. 

I am citing what little we have in documentation on hash based methods
(RFC2991 and RFC2992).

Hash based methods that snoop past BOS to IP is why we have a PWE CW.
We need to acknowledge very widely deployed behaviour.  In this case
it is every core router/LSR in the Internet for the last 15 years
(unless some vendor that I don't know about has missed this but if
Cisco and Juniper own 90%++ of the market it is in at least 90%).
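
A rough sketch of the widely deployed behaviour referred to above:
hash over the label stack, then peek at the first nibble after the
bottom of stack to guess whether the payload is IPv4 or IPv6. The
function names and the CRC32 hash are hypothetical stand-ins; real
forwarding hardware does its own thing. The label stack entry layout
(20-bit label, 3-bit TC, S-bit, 8-bit TTL) and the all-zero first
nibble of the PWE control word are as standardized:

```python
import struct
import zlib

def mpls_entry(label: int, bos: bool, ttl: int = 64) -> bytes:
    """Encode one label stack entry: label(20) | TC(3) | S(1) | TTL(8)."""
    return struct.pack("!I", (label << 12) | (int(bos) << 8) | ttl)

def parse_label_stack(packet: bytes):
    """Return (labels, payload), walking entries until the S-bit is set."""
    labels, offset = [], 0
    while True:
        (entry,) = struct.unpack_from("!I", packet, offset)
        offset += 4
        labels.append(entry >> 12)
        if (entry >> 8) & 1:  # S-bit marks the bottom of stack
            return labels, packet[offset:]

def hash_key(packet: bytes) -> bytes:
    """Labels always feed the hash; IP addresses too if the payload looks like IP."""
    labels, payload = parse_label_stack(packet)
    key = b"".join(struct.pack("!I", label) for label in labels)
    nibble = payload[0] >> 4 if payload else None
    if nibble == 4 and len(payload) >= 20:
        key += payload[12:20]   # IPv4 source and destination addresses
    elif nibble == 6 and len(payload) >= 40:
        key += payload[8:40]    # IPv6 source and destination addresses
    return key                  # any other nibble: hash on labels only

def component_link(packet: bytes, num_links: int) -> int:
    return zlib.crc32(hash_key(packet)) % num_links
```

A PW payload that happens to start with 0x4 or 0x6 (an Ethernet MAC
address, for instance) gets misread as IP and spread across component
links. With the control word in front, the first nibble is 0, the
sketch falls back to labels only, and the PW stays on one link.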

> > Requirements:
> > 
> >   These requirements refer to link bundling solely to provide a frame
> >   of reference.  This requirements document does not intend to
> >   constrain a solution to build upon link bundling.  Meeting these
> >   requirements using extensions to link bundling is not precluded, if
> >   doing so is determined by later IETF work to be the best solution.
> > 
> >   The first few requirements listed here are met or partially met by
> >   existing link bundling behavior including common behaviour that is
> >   implemented when the all ones address (for example 0xFFFFFFFF for
> >   IPv4) is used.  This common behaviour today makes use of a hashing
> >   technique as described in the introduction, though other behaviours
> >   are not precluded.
> [LY] Why mention hash here?

Because it is in link bundle and link bundle is used as the frame of
reference.  I did say that we are not bound to extending link bundle.
But we must consider it as a possible option because IETF requires
that we not reinvent the wheel.  To do so we need to discuss what link
bundle does in the first place so that we can decide.  If we decide
that we can't reuse link bundle we have to justify that decision.

> >   1.  Aggregated control information which summarizes multiple
> >       parallel links into a single advertisement is required to reduce
> >       information load and improve scalability.
> > 
> >   2.  A means to support very large LSP is needed, including LSP whose
> >       total bandwidth exceeds the size of a single component link but
> >       whose traffic has no single flow greater than the component links.
> >       In link bundling this is supported by many implementations using
> >       the all ones address component addressing and hash based
> >       techniques.
> [LY] IMO: This is not a requirement for composite link. The original draft
> requires that a flow's BW (LSP) be less than single component link capacity.
> Opinion from other co-authors? The original draft requires a TE based method
> to handle both RSVP-TE LSPs and LDP LSPs.

This is real world today.  Apparently Dave thinks it is a requirement.
Probably because his network wouldn't work today without it.  All of
the customers I've spoken to (at least the IP people) think supporting
LSP larger than a component link is a hard requirement.  This is most
important for IP core where (bundle/aggregate/composite) links are
rapidly approaching Tb/s and we haven't seen the first commercial
100GbE yet (except trade show demos, and ODU4 will come even later).

Once again, we can ask for consensus on this.  Generally we can't
remove widely deployed capability (or we can but we can't expect
support or market success if we do).

The new requirement is to add LSP which *don't* behave this way (don't
load split by separating flows that exist within an LSP).

BTW - this applies to both TE and LDP LSP, and TE LSP carrying LDP
LSP, etc.

> >       Note: some implementations impose further restrictions regarding
> >       the distribution of traffic across the set of identifiers used
> >       in flow identification.  Discussion of algorithms and
> >       limitations of existing implementations is out of scope for this
> >       requirements document.
> > 
> >   The remaining requirements are not met by existing link bundling.
> > 
> >   3.  In some cases more than one set of metrics is needed to accommodate a
> >       mix of capacity with different characteristics, particularly a
> >       bundle where a subset of component links have shorter delay.
> > 
> >   4.  A mechanism is needed to signal an LSP such that a component
> >       link with specific characteristics is chosen, if a preference
> >       exists.  For example, the shortest delay may be required for
> >       some LSP, but not required for others.
> > 
> >   5.  LSP signaling is needed to indicate a preference for placement
> >       on a single component link and to specifically forbid spreading
> >       that LSP over multiple component links based on flow
> >       identification beyond the outermost label entry.
> > 
> >   6.  A means to support non-disruptive reallocation of an existing
> >       LSP to another component link is needed.
> > 
> >   7.  A means to populate the TE-LSDB with information regarding which
> >       links (per end) can support distribution of large LSP across
> >       multiple component links based on the component flows and the
> >       characteristics of this capability.  Key characteristics are:
> > 
> >     a.  The largest single flow that can be supported.  This may
> >         or may not be related to the size of component links.
> > 
> >     b.  Characteristics of the flow identification method.  [These
> >         can be enumerated in this document or a later document. ]
> > 
> >     c.  Other characteristics?  [ Not sure if I got everything
> >         mentioned in the WG meeting. ]
> > 
> >   8.  Some means is needed for an LSP which allows distribution of
> >       flows across member links to indicate characteristics of flow
> >       distribution.  These characteristics include:
> > 
> >     a.  The largest flow expected.
> > 
> >     b.  Other? [ Did we identify any other?  There was some
> >         chatter about distribution of flows but no specific
> >         characteristic was called for - AFAIK ]
> > 
> >   9.  In some cases it may be useful to measure link parameters
> >       and reflect these in metrics.  Link delay is an example.
> > 
> >   10. Some uses require an ability to bound the sum of delay metrics
> >       along a path while otherwise taking the shortest path related to
> >       another metric.  [This was mentioned but seems a bit orthogonal
> >       to all but #3.]
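
Requirement 10 amounts to a delay-constrained shortest path. A minimal
sketch, assuming each link carries a non-negative cost (the "other
metric") and a non-negative delay; the function name, the graph layout
and the label-setting search are illustrative assumptions, not
something the list above mandates:

```python
import heapq

def bounded_delay_path(graph, src, dst, max_delay):
    """Cheapest path from src to dst whose summed delay is <= max_delay.

    graph: {node: [(neighbor, cost, delay), ...]} (hypothetical layout).
    Returns (cost, delay, path) or None if no path meets the bound.
    """
    heap = [(0, 0, src, [src])]
    best_delay = {}  # lowest delay already expanded at each node
    while heap:
        cost, delay, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, delay, path
        # States pop in cost order, so an earlier visit with delay no
        # worse than this one dominates it.
        if node in best_delay and best_delay[node] <= delay:
            continue
        best_delay[node] = delay
        for nbr, link_cost, link_delay in graph.get(node, []):
            if delay + link_delay <= max_delay:
                heapq.heappush(
                    heap,
                    (cost + link_cost, delay + link_delay, nbr, path + [nbr]),
                )
    return None
```

With a loose bound this returns the ordinary shortest path. As the
bound tightens it falls back to costlier but faster paths, and returns
None once nothing fits.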
> > 
> > Purpose:
> > 
> >   [ A set of example scenarios were discussed.  We may want to capture
> >     them here (and maybe refine the examples). ]
rtgwg mailing list
[email protected]
