"Mcdysan, David E" writes:
> Hi Curtis,
> Thanks for taking the initiative to write this up.
> Also thanks to the rtgwg participants who provided many insightful
> questions and suggestions at the mike as well as the helpful direction
> from the chairs.
> The definition of "flow" is not agreed to. The current scope is MPLS
> only and not the Diffserv definition you propose unless the WG agrees to
> extend the scope. I had asked the wg chairs to call for a resolution to
> this decision on the list, and hopefully we can accomplish this soon.
As Tony's email pointed out, we may be in agreement. I might have
> A few detailed comments in line below.
> If the other folks who took notes could publish their lists of things we
> identified as requirements and/or problem statements.
Further comments inline.
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Curtis Villamizar
> > Sent: Friday, March 26, 2010 3:02 AM
> > To: [email protected]
> > Subject: composite link - candidate for respin, maybe
> > This email is to the authors of the CL drafts and the WG as a whole.
> > I started out writing a quick few bullets on what we talked
> > about in today's WG meeting that we seemed to agree on and
> > that made some sense to me. The latter part may be a filter
> > that introduces artifacts to the signal (or noise).
> > That didn't seem to stand alone so I added some terms up
> > front and then some introductory text, trying to keep it concise.
> > What I would like to ask the WG is whether I'm on a
> > reasonable track here. [hopefully we won't have continued
> > dead silence on the list.]
> > What I would like to ask the authors is whether they would
> > consider this a reasonable restart point.
> > I've tried to keep in mind the chair's instructions and what
> > I sensed to be the feelings in the audience (though we didn't
> > formally hum - is formal WG hum an oxymoron?).
> > keep it short - 5 pages including boilerplate if possible
> I heard 7 pages, including boilerplate, and Alex's response to my
> question regarding "Acknowledgement of Prior work" material was that it
> could be moved to an Appendix and not count against this quota. Alex,
> please confirm.
The request was 5 pages. The compromise was 7 pages. The real point
was to keep it concise. If clarity is improved with a little more,
then I'm sure it would be OK. If the document rambles (as I am doing
now), clarity is reduced with length.
> > include clear requirements
> > do not mandate implementation details
> > I'm not sure if the usage scenarios at the end are needed or
> > if this could stand alone.
> > There was a lot of chatter in the room so I apologize if I
> > missed something. This is based on my recollection which is
> > known to have a fairly high bit error rate even on better days.
> > Curtis
> > btw - this is obviously not an I-D as is for more reasons
> > than missing boilerplate. It's also late and I'm tired so I
> > made no attempt to get citation right.
> > Key terms:
> > flow - A flow in the context of this document is an aggregate of
> > traffic for which packets should not be reordered. A flow is
> > similar to a microflow or ordered aggregate as defined in
> > [diffserv framework]. The term "flow" is used here for brevity.
> > This definition of flow should not be interpreted to have broader
> > scope than this document.
> Note that Diffserv microflow includes source address, source port,
> destination address, destination port and protocol id. See scope
Tony suggested (in email) that I remove all but the first sentence.
I'd like to keep the example but improve clarity as follows.
flow - A flow in the context of this document is an aggregate of
traffic for which packets should not be reordered. A flow in this
context should not be confused with a microflow or ordered
aggregate as defined in [diffserv framework], which shares the
similarity of requiring that reordering be avoided but is specific
to IP. The term "flow" is used here for brevity. This definition
of flow should not be interpreted to have broader scope than this
document.
> > flow identification - The means of identifying a flow or a group of
> > flows may be specific to a type of payload.
> > top label entry - In MPLS the top label entry contains the label on
> > which an initial forwarding decision is made. This label may be
> > popped and the forwarding decision may involve further labels but
> > that is immaterial to this discussion.
> > label stack - In MPLS the label stack includes all of the MPLS
> > labels from the top of the stack to the label marked with the
> > S-bit (Bottom of Stack bit) set.
> > outer and inner LSP - The LSP associated with labels in the outer
> > encapsulation are called outer LSP. Those LSP which are
> > associated with inner encapsulation (closer to the label entry
> > containing the S-bit) are called inner LSP. These are not called
> > top and bottom LSP since MPLS and PWE draw the label stack in
> > opposite directions with PWE putting the outermost label on the
> > bottom of diagrams (and confusing people in doing so).
> Need to cover case with more than two LSPs. Would using inner LSPs. (For
> brevity is last sentence necessary?
Outer and inner are directions. Outermost and innermost are top and
the one marked with BOS. This allows us to say "top label" and inner
labels (plural). Outer are the label(s) used for forwarding. If PHP
is used, numerous outer labels may be popped, requiring the next label
to be examined before the packet is sent on its merry way.
According to his email Tony was also confused and thought inner meant
innermost. Let me try again.
outer LSP(s) and inner LSP(s) - The LSP(s) associated with labels in
the outer encapsulation are called outer LSP(s). The outer label
stack entries are used for forwarding. The remaining LSP(s) which
are associated with inner encapsulation (closer to the label entry
containing the S-bit) are called inner LSP(s). There is a single
outermost LSP and a single innermost LSP, but there may be multiple
outer and inner LSP(s). These are not called top and bottom LSP
since MPLS and PWE draw the label stack in opposite directions with
PWE putting the outermost label on the bottom of diagrams (and
confusing people in doing so).
That is getting rather long.
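
To make the outer/inner terminology concrete, here is a minimal sketch in Python. It assumes the standard 32-bit label stack entry layout (20-bit label, 3-bit TC, 1-bit S, 8-bit TTL). The names parse_label_stack, split_outer_inner and encode_entry are hypothetical, and outer_count (how many entries forwarding consumed) is an illustrative parameter, not something defined in this thread.

```python
import struct
from typing import List, NamedTuple, Tuple


class LabelStackEntry(NamedTuple):
    label: int  # 20-bit label value
    tc: int     # 3-bit Traffic Class (formerly EXP)
    s: int      # 1-bit Bottom of Stack flag
    ttl: int    # 8-bit TTL


def encode_entry(label: int, tc: int, s: int, ttl: int) -> bytes:
    """Pack one label stack entry into 4 bytes (helper for the example)."""
    return struct.pack("!I", (label << 12) | (tc << 9) | (s << 8) | ttl)


def parse_label_stack(data: bytes) -> List[LabelStackEntry]:
    """Read 4-byte entries from the top until the entry with the S-bit set."""
    entries = []
    for offset in range(0, len(data) - 3, 4):
        (word,) = struct.unpack_from("!I", data, offset)
        entry = LabelStackEntry(word >> 12, (word >> 9) & 0x7,
                                (word >> 8) & 0x1, word & 0xFF)
        entries.append(entry)
        if entry.s:
            return entries
    raise ValueError("no entry with the S-bit set")


def split_outer_inner(stack: List[LabelStackEntry], outer_count: int = 1
                      ) -> Tuple[List[LabelStackEntry], List[LabelStackEntry]]:
    """Assumption: the first outer_count entries were used for forwarding."""
    return stack[:outer_count], stack[outer_count:]


packet = (encode_entry(100, 0, 0, 64) + encode_entry(200, 0, 0, 64)
          + encode_entry(300, 0, 1, 64) + b"payload")
stack = parse_label_stack(packet)
outer, inner = split_outer_inner(stack)
```

In this example the outermost (top) entry carries label 100, the innermost entry (S-bit set) carries label 300, and everything after the outer entries counts as inner.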
> > component link - See RFC4201.
> Not clear, component link is not the bundled link, use existing I-D?
A component link is a single link in the link bundle. We can pull a
definition from elsewhere if you prefer. Ethernet LAG uses "member".
Link bundle uses "component".
> > composite link - [pull from existing I-D]
We can pull component from there too if there is a clear definition.
> > Introduction:
> > There is often a need to provide large aggregates of bandwidth that
> > is best provided using parallel links between routers or MPLS LSR.
> > In core networks there is often no alternative since the aggregate
> > capacities of core networks today far exceed the capacity of a
> > single physical link or single packet processing element.
> Move following to Appendix on summary of Existing approaches per
> direction from Alex and merge with existing material and your comments,
> inputs to the list in the thread "Acknowledgement of Prior work" .
> > Today this requirement can be handled by Ethernet Link Aggregation
> > [IEEE802.1AX], link bundling [RFC4201], or other aggregation
> > techniques some of which may be vendor specific. Each has strengths
> > and weaknesses.
> > The term composite link is more general than terms such as link
> > aggregate which is generally considered to be specific to Ethernet
> > and its use here is consistent with the broad definition in [ITU
> > 8xxx].
> > Large aggregates of IP traffic do not provide explicit signaling to
> > indicate the expected traffic loads. Large aggregates of MPLS
> > traffic are carried in MPLS tunnels supported by MPLS LSP. LSP
> > which are signaled using RSVP-TE extensions do provide explicit
> > signaling which includes the expected traffic load for the
> > aggregate. LSP which are signaled using LDP do not provide an
> > expected traffic load.
> > MPLS LSP may contain other MPLS LSP arranged hierarchically. When
> > an MPLS LSR serves as a midpoint LSR in an LSP carrying other LSP as
> > payload, there is no signaling associated with these inner LSP.
> > Therefore even when using RSVP-TE signaling there may be
> > insufficient information provided by signaling to adequately
> > distribute load across a composite link.
> > Generally a set of label stack entries that is unique across the
> > ordered set of label numbers can safely be assumed to contain a
> > group of flows. The reordering of traffic can therefore be
> > considered to be acceptable unless reordering occurs within traffic
> > containing a common unique set of label stack entries. Existing
> > load splitting techniques take advantage of this property in
> > addition to looking beyond the bottom of the label stack and
> > determining if the payload is IPv4 or IPv6 to load balance traffic
> > accordingly.
> > For example a large aggregate of IP traffic may be subdivided into a
> > large number of groups of flows using a hash on the IP source and
> > destination addresses. This is as described in [diffserv
> > framework]. For MPLS traffic carrying IP, a similar hash can be
> > performed on the set of labels in the label stack. These techniques
> > are both examples of means to subdivide traffic into groups of flows
> > for the purpose of load balancing traffic across aggregated link
> > capacity. The means of identifying a flow should not be confused
> > with the definition of a flow.
> > Discussion of whether a hash based approach provides a sufficiently
> > even load balance using any particular hashing algorithm or method
> > of distributing traffic across a set of component links is outside
> > of the scope of this document.
> > The use of three hash based approaches is defined in RFCxxxx. The
> > use of hash based approaches is mentioned as an example of an
> > existing set of techniques to distribute traffic over a set of
> > component links. Other techniques are not precluded.
> End of Material Moved to Appendix.
I think this belongs up front. We need to define the problem and it
doesn't hurt to spend a page describing what exists today. The extent
to which some requirements are already met *is* relevant.
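
For readers who have not seen the hash based techniques from the quoted introduction, a minimal sketch in Python follows. CRC32 is used only because it is in the standard library; it is an assumption for illustration and not the hash any particular implementation uses. The function names are hypothetical.

```python
import zlib
from typing import Sequence


def component_for_ip(src: str, dst: str, n_components: int) -> int:
    """Pick a component link from a hash on IP source and destination."""
    key = f"{src}|{dst}".encode()
    return zlib.crc32(key) % n_components


def component_for_labels(labels: Sequence[int], n_components: int) -> int:
    """Pick a component link from a hash on the ordered set of labels."""
    # A label is 20 bits, so 3 bytes per label is enough for the key.
    key = b"".join(label.to_bytes(3, "big") for label in labels)
    return zlib.crc32(key) % n_components


# Packets with the same addresses (or the same label stack) always map
# to the same component link, so no reordering occurs within that group.
link_a = component_for_ip("192.0.2.1", "198.51.100.7", 4)
link_b = component_for_labels([100, 200], 4)
```

Note that the hash only identifies groups of flows. It says nothing about how evenly the groups load the component links, which the quoted text leaves out of scope.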
> I had envisioned a section summarizing network operator problems to be
> solved preceding the requirements. I plan to draft some text and send it
> out for comment on the list.
I was going to put story telling time at the end. :-)
> > Requirements:
> > These requirements refer to link bundling solely to provide a frame
> > of reference. This requirements document does not intend to
> > constrain a solution to build upon link bundling. Meeting these
> > requirements using extensions to link bundling is not precluded, if
> > doing so is determined by later IETF work to be the best solution.
> > The first few requirements listed here are met or partially met by
> > existing link bundling behavior including common behaviour that is
> > implemented when the all ones address (for example 0xFFFFFFFF for
> > IPv4) is used. This common behaviour today makes use of a hashing
> > technique as described in the introduction, though other behaviours
> > are not precluded.
> > 1. Aggregated control information which summarizes multiple
> > parallel links into a single advertisement is required to reduce
> > information load and improve scalability.
> Based upon Tony's comments, I think that the objective is to convey the
> same information as a set of parallel links with different
> characteristics more efficiently and in a more scalable manner than
> making an advertisement for each parallel link. As discussed,
> summarization (e.g., more than one value for latency) is a specific
> solution. Comments, other recollections?
I think you mean Tony's comment at the mic.
This requirement #1 is to reduce the size of the TE-LSDB. This is
what link bundle already does get right. You seem to be commenting on
item 3 after reading only item 1.
> > 2. A means to support very large LSP is needed, including LSP whose
> > total bandwidth exceeds the size of a single component link but
> > whose traffic has no single flow greater ^ the ^ component
> Above is an important requirement you mentioned.
Again, link bundling does this but there is no way for the ingress to
pick which behaviour it wants. That is item #5.
> Move following to Appendix
> > In link bundling this is supported by many implementations using
> > the all ones address component addressing and hash based
> > techniques.
> > Note: some implementations impose further restrictions regarding
> > the distribution of traffic across the set of identifiers used
> > in flow identification. Discussion of algorithms and
> > limitations of existing implementations is out of scope for this
> > requirements document.
> End Move to Appendix.
I do think this is relevant and belongs here. Items #1 and #2 are
requirements met by link bundling. The rest are requirements not met.
> > The remaining requirements are not met by existing link bundling.
> > 3. In some cases more than one set of metrics is needed to accommodate a
> > mix of capacity with different characteristics, particularly a
> > bundle where a subset of component links have shorter delay.
> I would avoid use of metric to avoid confusion with current solutions.
> Proposed rewording as follows:
Metric is used in the LS routing protocol sense.
> A means in control and data plane protocols is needed to accommodate a
> composite link composed of component links with different
> characteristics, including at least: capacity, current latency,
> indication of whether latency can change, ... others?
That is item #4. Please read to the end before citing omissions.
> > 4. A mechanism is needed to signal an LSP such that a component
> > link with specific characteristics is chosen, if a preference
> > exists. For example, the shortest delay may be required for
> > some LSP, but not required for others.
> As discussed in the meeting, picking the shortest delay per composite
> link is one requirement as you state above.
> We need to add the other service provider requirement described in the
> meeting where certain LSPs have a latency that is less than a specified
> end-end value.
Item #10. Again, read to the end before citing omissions. It wasn't
> In my view, these are separate requirements, which may have different
That is why one is item #4 and the other is item #10.
> > 5. LSP signaling is needed to indicate a preference for placement
> > on a single component link and to specifically forbid spreading
> > that LSP over multiple component links based on flow
> > identification beyond the outermost label entry.
> Need to clarify whether this applies to outer LSP and/or Inner LSP(s).
> As discussed, we need to add description that Composite Link end point
> routers participate in outer LSP signaling, may "snoop" signaling for
> inner LSP(s), or may be able to determine that a label may be used for
> component link assignment decisions (e.g., entropy label).
What I had in mind was two choices, outer only or all. See revised
definition of outer and inner above.
I really don't want to get into counting a certain depth into the
stack as no hardware today does that.
If we did add that, we'd need a capability indication for 1) can use
outer only, 2) can also use entire stack up to N, 3) can use lesser of
M (if M is specified in LSP) or entire stack up to N. Specifying
that is not a problem. No existing equipment or emerging merchant
Ethernet silicon does this but we're not cast in silicon yet so OK
with me. :-) Others might object.
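
A minimal sketch of those three capability options in Python, purely to show that specifying them is simple. HashCapability and labels_for_hash are hypothetical names, and "outer only" is simplified here to mean the top label entry alone, which is an assumption.

```python
from enum import Enum
from typing import List, Optional, Sequence


class HashCapability(Enum):
    OUTER_ONLY = 1        # 1) can use outer only
    STACK_UP_TO_N = 2     # 2) can also use entire stack up to N
    LESSER_OF_M_OR_N = 3  # 3) lesser of M (if the LSP specifies M) or N


def labels_for_hash(stack: Sequence[int], capability: HashCapability,
                    n: int = 1, m: Optional[int] = None) -> List[int]:
    """Return the labels, counted from the top, that may feed the hash."""
    if capability is HashCapability.OUTER_ONLY:
        depth = 1  # assumption: outer is just the top entry
    elif capability is HashCapability.STACK_UP_TO_N:
        depth = n
    else:
        depth = n if m is None else min(m, n)
    return list(stack[:depth])


stack = [100, 200, 300, 400]
outer_only = labels_for_hash(stack, HashCapability.OUTER_ONLY)
limited = labels_for_hash(stack, HashCapability.LESSER_OF_M_OR_N, n=3, m=2)
```

Here n would come from the link's advertised capability and m from the LSP's signaling, if a depth parameter were ever added.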
> > 6. A means to support non-disruptive reallocation of an existing
> > LSP to another component link is needed.
> Need to include the control of change frequency from 22.214.171.124.3 the
> existing I-D to this requirement.
OK. That belongs with new LSP parameters (Item #8).
> In the current draft, the use of MPLS TC (aka EXP) and DSCP bits are
> only specified for this purpose. Do we want to describe use in other
> cases? Also should describe fact that most operators do not modify DSCP
> and map this to EXP bits so that DSCP is transparent to customers.
Are you suggesting that we move some TC values and not others? That
could be done today by creating two LSP one that is more "moveable"
than the other among the set of component links.
No hardware today routes based on both label and EXP at a midpoint.
However some hardware today will allow you to use DSCP or TC to
direct traffic into different LSP.
> > 7. A means to populate the TE-LSDB with information regarding which
> > links (per end) can support distribution of large LSP across
> > multiple component links based on the component flows and the
> > characteristics of this capability. Key characteristics are:
> > a. The largest single flow that can be supported. This may
> > or may not be related to the size of component links.
> > b. Characteristics of the flow identification method. [These
> > can be enumerated in this document or a later document. ]
> Would this be a place where MPLS and IP requirements would be
Yes. IP would typically be just src/dst hash. MPLS would be how the
label stack is used, and whether the implementation looks past the
label stack to see if traffic is IP. This is not the place for
protocol details but each of these characteristics need to be encoded
in a compact encoding.
> > c. Other characteristics? [ Not sure if I got everything
> > mentioned in the WG meeting. ]
> We ran out of time in the wg, but support for LDP is an important
> operator requirement. The vast majority of L3VPNs run over LDP
> "tunnels." I think that stating the why instead of how for the material
> on this subject from section 4.2.3 is something we need to do.
This (Item #7) is a discussion relevant to link LSA or LSPDU TLVs. The
above comment is relevant to item #8 but I'll respond here.
The existing load balance implementations are control plane agnostic.
The reserved bandwidth is used for admission control to the aggregate
capacity when the load balance is enabled.
Of the bag of tricks the link has to offer the control plane picks
what the LSP wants. We can certainly extend LDP as well as RSVP-TE as
described in item #8.
> Also need to cover case mentioned by Ning that both LDP and RSVP-TE will
> be present on the same composite link. I think you are proposing to add
> unlabeled IP traffic to this (data plane) set as well.
We can just as easily include RSVP-TE, LDP, and IP. Both LDP and IP
traffic have to be measured. The means to identify groups of flows
differs and IP behavior will need to be configured (ie: management
plane) but otherwise techniques are the same for IP and LDP.
> Also, traffic measurement based support at the composite link is
Item #9 covered this. The sum of LDP and IP can be used to reduce the
reservable bandwidth. There is no way to "preempt" IP or LDP without
causing oscillations so only RSVP-TE can be preempted if measurement
is done. I pointed this out in an earlier email.
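
The arithmetic is simple enough to sketch in Python. This is only an illustration under stated assumptions: the function names are hypothetical, bandwidth is in Gb/s, and the victim ordering assumes the RSVP-TE convention that holding priority 0 is best and 7 is worst, so the numerically highest priority is preempted first.

```python
from typing import Dict, List, Tuple


def reservable_bandwidth(capacity: float, measured_ldp: float,
                         measured_ip: float) -> float:
    """Capacity left for RSVP-TE after subtracting measured LDP and IP."""
    return max(0.0, capacity - (measured_ldp + measured_ip))


def lsps_to_preempt(reserved: Dict[str, Tuple[int, float]],
                    reservable: float) -> List[str]:
    """reserved maps LSP name -> (holding priority, reserved bandwidth).

    Only RSVP-TE LSPs appear here; LDP and IP are never preempted.
    """
    excess = sum(bw for _, bw in reserved.values()) - reservable
    victims = []
    # Worst holding priority (numerically highest) goes first.
    for name, (_, bw) in sorted(reserved.items(), key=lambda kv: -kv[1][0]):
        if excess <= 0:
            break
        victims.append(name)
        excess -= bw
    return victims


reservable = reservable_bandwidth(40.0, measured_ldp=6.0, measured_ip=4.0)
victims = lsps_to_preempt(
    {"lsp-a": (0, 15.0), "lsp-b": (7, 10.0), "lsp-c": (4, 10.0)}, reservable)
```

With 40G of capacity and 10G of measured LDP plus IP, 30G remains reservable. The three LSPs reserve 35G, so the one with the worst holding priority is preempted.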
> Need to describe the operator requirements on what needs to happen in
> the event of "Bandwidth Shortage Events" See section 126.96.36.199.
Preemption works. LDP and IP can't be preempted. Maybe MUST NOT
result in oscillations can be added and put an end to this.
> I was going to describe "auto-bandwidth" for RSVP-TE as a current method
> used by operators that the composite link should still support.
See above mention of "the sum of LDP and IP can be used to reduce the
reservable bandwidth." The existing "auto-bandwidth" is an ingress
function where the requested reservation is changed.
> How control (routing, signaling) and management (OAM) packets are
> directed to component links and how this needs to be done so that no
> impacts to liveliness, adjacency and/or OAM occurs needs to be stated.
For control "nothing breaks" would be a sufficient requirement.
We can mention that LM measurement is lost on any LSP that is not kept
atomic (is allowed to be load balanced to multiple components) despite
the characteristic of having no reordering on flows. Otherwise OAM
A requirement would be "impacts on OAM must be documented". A
dissertation on the topic is not needed in the requirements document.
> Add dynamic signaling and advertisement of lower layer component links,
> and feedback from the lower layer regarding latency. See 188.8.131.52.
Are we aggregating information or advertising individual link? This
is another case of "which is it". I thought we agreed to aggregating
information except for circumstances where some subset of the links
had different delay.
> Backward compatibility as you proposed on previous thread (merge with
> text in 184.108.40.206)
I thought the WG meeting consensus was to toss the original. BTW- It's
interesting for a link to advertise its desired maximum latency.
Advertising delay is one thing. We don't want to get carried away and
specify mean, stddev, N-th norm, or percentile histograms. Let's
stick with a single number for delay.
> Automatic derivation of routing metrics based upon signaled (or
> measured) latency changes (220.127.116.11).
If we add a delay metric, then the delay metric can depend on measured
delay. Adjusting general routing metrics according to load takes us
down the road of provably stable mechanisms. "MUST NOT oscillate" is
a strong requirement.
> Would look to others who were keeping notes to add to the above list.
> > 8. Some means is needed for an LSP which allows distribution of
> > flows across member links to indicate characteristics of flow
> > distribution. These characteristics include:
> > a. The largest flow expected.
> Please say more, not sure I understand this point.
If CL component links are 10G each, the CL can't handle a single flow
of 20G that came in on a 100G link. If the LSP claims to have up to
20G flows, the LSP has to be rejected at setup time.
This is not all that hard to do. If the largest customer attachment
is a 10G rate limited to 4G, then the largest possible flow is 4G. If
a customer has a 100G attachment then the largest possible flow can be
a negotiated (ie: at contract time) parameter. If it is a content
provider, traffic is going to a million eyeballs and the largest flow
may be much smaller than the attachment circuit.
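
A minimal sketch of that setup-time check in Python. CompositeLink and admit_lsp are hypothetical names, bandwidth is in Gb/s, and max_flow stands in for the "largest single flow that can be supported" value from item #7a, which need not equal the component link size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompositeLink:
    component_capacity: List[float]  # capacity of each component link
    max_flow: float                  # advertised largest supportable flow

    @property
    def total_capacity(self) -> float:
        return sum(self.component_capacity)


def admit_lsp(link: CompositeLink, lsp_bandwidth: float,
              lsp_largest_flow: float) -> bool:
    """Reject at setup time if the LSP's largest flow cannot be carried."""
    if lsp_largest_flow > link.max_flow:
        return False
    # Simplification: ignores bandwidth already reserved by other LSPs.
    return lsp_bandwidth <= link.total_capacity


cl = CompositeLink(component_capacity=[10.0] * 4, max_flow=10.0)
ok = admit_lsp(cl, lsp_bandwidth=30.0, lsp_largest_flow=4.0)
rejected = admit_lsp(cl, lsp_bandwidth=30.0, lsp_largest_flow=20.0)
```

The 30G LSP whose largest flow is 4G fits on four 10G components even though it exceeds any single one. The same LSP claiming 20G flows does not.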
> > b. Other? [ Did we identify any other? There was some
> > chatter about distribution of flows but no specific
> > characteristics were called for - AFAIK ]
> See above; change control frequency,
If we base this on hash, then the control frequency can't easily be
held constant over the entire set of LSP. My assumption is that LSP
that were put on a single component would only be moved if link bundle
style LSP allocation had created a "bin packing" problem that had to
be resolved by moving something and the remaining "spread spectrum"
LSP would fill in the remaining capacity (which in practice would most
likely be most of the bandwidth).
> Also, "pinning" in some way (see 18.104.22.168) based upon snooped signaling
> for inner LSPs and configuration (e.g., of FEC (ranges)).
Pinning has to be on the outer labels or not at all. The LSR is not
going to be made parapsychic by these requirements.
> > 9. In some cases it may be useful to measure link parameters
> > and reflect these in metrics. Link delay is an example.
> I think this is one of the most important requirements. Editorially, it
> should be toward the top of the list.
Order is not intended to indicate importance. It was supposed to be a
logical grouping with capabilities building on prior ones.
> As commented by Dimitri; need to state requirements on the frequency of
> latency measurements and the precision required.
We can save that for later. Requirements are just that the
frequency of sampling be configurable.
> > 10. Some uses require an ability to bound the sum of delay metrics
> > along a path while otherwise taking the shortest path related to
> > another metric. [This was mentioned but seems a bit orthogonal
> > to all but #3.]
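
To illustrate what item 10 asks of path computation, here is a minimal sketch in Python: find the path with the lowest cost metric among those whose summed delay metric stays within a bound. The function name and graph encoding are hypothetical, both metrics are assumed non-negative, and this is one simple way to do it rather than a proposed mechanism.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, int, int]]]  # node -> (neighbor, cost, delay)


def delay_bounded_path(graph: Graph, src: str, dst: str, max_delay: int
                       ) -> Optional[Tuple[int, int, List[str]]]:
    """Return (cost, delay, path) of the cheapest path within max_delay."""
    heap = [(0, 0, src, [src])]
    best_delay: Dict[str, int] = {}
    while heap:
        cost, delay, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, delay, path
        # States pop in cost order, so an earlier state at this node with
        # delay no larger than ours is at least as good in both metrics.
        if node in best_delay and best_delay[node] <= delay:
            continue
        best_delay[node] = delay
        for nbr, link_cost, link_delay in graph.get(node, []):
            new_delay = delay + link_delay
            if new_delay <= max_delay:
                heapq.heappush(
                    heap, (cost + link_cost, new_delay, nbr, path + [nbr]))
    return None


graph: Graph = {
    "A": [("B", 1, 10), ("C", 5, 2)],
    "B": [("D", 1, 10)],
    "C": [("D", 5, 2)],
}
loose = delay_bounded_path(graph, "A", "D", max_delay=100)
tight = delay_bounded_path(graph, "A", "D", max_delay=10)
```

With a loose bound the cheaper path through B is chosen. With a bound of 10 that path's delay of 20 is excluded and the path through C is taken instead.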
> See 4 above as well. We think the characteristics of 22.214.171.124 and 126.96.36.199
> are important, but need to focus the text on the semantics of why these
> are important instead of the current descriptions which are a how
> assuming crankback style signaling as the solution approach. A
> discussion of the how alternatives view could be moved to the framework
I think we agreed at the meeting to toss the existing draft. For
example, how does an LSP signal "delay variation". A link would
signal jitter if anything did. The ingress can try to predict
cumulative jitter based on link jitter along the path. I don't think
there is consensus on adding jitter though.
> Also as discussed in the meeting, a means to implement Diffserv for
> MPLS-TE when composite links are present in the network is also
Before we add a requirement, you need to explain how support for
diffserv would ever be lost by load balancing.
> > Purpose:
> > [ A set of example scenarios were discussed. We may want to capture
> > them here (and maybe refine the examples). ]
> IMO, example scenarios, or solution sketches may be more appropriate for
> a framework document.
If they are properly specified as a set of cases, then yes.
Let's get past the requirements first.