|
|
A ChangeLog entry was included in the original email. Here it is again. Does it
look OK?
2009-06-02 Ghassan Shobaki <ghassan.shobaki@xxxxxxx>
* tree-ssa-loop-prefetch.c
(gather_memory_references): Introduced a counter for the number of
memory references.
(anything_to_prefetch_p): Introduced a counter for the number of
prefetches.
(is_loop_prefetching_profitable): New function with a cost model
for prefetching.
(loop_prefetch_arrays): Use the new cost model to determine if
prefetching is profitable.
* params.def (MIN_INSN_TO_PREFETCH_RATIO,
PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
* params.h (MIN_INSN_TO_PREFETCH_RATIO,
PREFETCH_MIN_INSN_TO_MEM_RATIO): New parameters.
Thanks
-Ghassan
-----Original Message-----
From: Richard Guenther [mailto:richard.guenther@xxxxxxxxx]
Sent: Friday, June 05, 2009 2:01 AM
To: Shobaki, Ghassan
Cc: Zdenek Dvorak; gcc-patches@xxxxxxxxxxx
Subject: Re: Patch to Avoid Bad Prefetching
On Fri, Jun 5, 2009 at 4:29 AM, Shobaki, Ghassan
<Ghassan.Shobaki@xxxxxxx> wrote:
> Here is the revised patch. I made the changes requested by Zdenek and
> changed the default parameter value to 3. I have also added entries for
> the new parameters to doc/invoke.texi. The patch bootstrapped and passed
> "make check". Is it now OK for check in?
Ok with a proper ChangeLog entry.
Thanks,
Richard.
> Thanks
> -Ghassan
>
>
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi (revision 148159)
> +++ doc/invoke.texi (working copy)
> @@ -7839,6 +7839,15 @@ The size of L1 cache, in kilobytes.
> @item l2-cache-size
> The size of L2 cache, in kilobytes.
>
> +@item min-insn-to-prefetch-ratio
> +The minimum ratio between the number of instructions and the
> +number of prefetches to enable prefetching in a loop with an
> +unknown trip count.
> +
> +@item prefetch-min-insn-to-mem-ratio
> +The minimum ratio between the number of instructions and the
> +number of memory references to enable prefetching in a loop.
> +
> @item use-canonical-types
> Whether the compiler should use the ``canonical'' type system. By
> default, this should always be 1, which uses a more efficient internal
> Index: params.h
> ===================================================================
> --- params.h (revision 148159)
> +++ params.h (working copy)
> @@ -166,4 +166,8 @@ typedef enum compiler_param
> PARAM_VALUE (PARAM_LOOP_INVARIANT_MAX_BBS_IN_LOOP)
> #define SLP_MAX_INSNS_IN_BB \
> PARAM_VALUE (PARAM_SLP_MAX_INSNS_IN_BB)
> +#define MIN_INSN_TO_PREFETCH_RATIO \
> + PARAM_VALUE (PARAM_MIN_INSN_TO_PREFETCH_RATIO)
> +#define PREFETCH_MIN_INSN_TO_MEM_RATIO \
> + PARAM_VALUE (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO)
> #endif /* ! GCC_PARAMS_H */
> Index: tree-ssa-loop-prefetch.c
> ===================================================================
> --- tree-ssa-loop-prefetch.c (revision 148159)
> +++ tree-ssa-loop-prefetch.c (working copy)
> @@ -109,6 +109,23 @@ along with GCC; see the file COPYING3.
> prefetch instructions with guards in cases where 5) was not
> sufficient
> to satisfy the constraints?
>
> + The function is_loop_prefetching_profitable() implements a cost
> model
> + to determine if prefetching is profitable for a given loop. The cost
> + model has two heuristcs:
> + 1. A heuristic that determines whether the given loop has enough CPU
> + ops that can be overlapped with cache missing memory ops.
> + If not, the loop won't benefit from prefetching. This is
> implemented
> + by requirung the ratio between the instruction count and the mem
> ref
> + count to be above a certain minimum.
> + 2. A heuristic that disables prefetching in a loop with an unknown
> trip
> + count if the prefetching cost is above a certain limit. The
> relative
> + prefetching cost is estimated by taking the ratio between the
> + prefetch count and the total intruction count (this models the
> I-cache
> + cost).
> + The limits used in these heuristics are defined as parameters with
> + reasonable default values. Machine-specific default values will be
> + added later.
> +
> Some other TODO:
> -- write and use more general reuse analysis (that could be also
> used
> in other cache aimed loop optimizations)
> @@ -476,7 +493,7 @@ gather_memory_references_ref (struct loo
> true if there are no other memory references inside the loop. */
>
> static struct mem_ref_group *
> -gather_memory_references (struct loop *loop, bool *no_other_refs)
> +gather_memory_references (struct loop *loop, bool *no_other_refs,
> unsigned *ref_count)
> {
> basic_block *body = get_loop_body_in_dom_order (loop);
> basic_block bb;
> @@ -487,6 +504,7 @@ gather_memory_references (struct loop *l
> struct mem_ref_group *refs = NULL;
>
> *no_other_refs = true;
> + *ref_count = 0;
>
> /* Scan the loop body in order, so that the former references precede
> the
> later ones. */
> @@ -513,11 +531,17 @@ gather_memory_references (struct loop *l
> rhs = gimple_assign_rhs1 (stmt);
>
> if (REFERENCE_CLASS_P (rhs))
> + {
> *no_other_refs &= gather_memory_references_ref (loop, &refs,
> rhs, false,
> stmt);
> + *ref_count += 1;
> + }
> if (REFERENCE_CLASS_P (lhs))
> + {
> *no_other_refs &= gather_memory_references_ref (loop, &refs,
> lhs, true,
> stmt);
> + *ref_count += 1;
> + }
> }
> }
> free (body);
> @@ -846,20 +870,20 @@ schedule_prefetches (struct mem_ref_grou
> return any;
> }
>
> -/* Determine whether there is any reference suitable for prefetching
> - in GROUPS. */
> +/* Estimate the number of prefetches in the given GROUPS. */
>
> -static bool
> -anything_to_prefetch_p (struct mem_ref_group *groups)
> +static int
> +estimate_prefetch_count (struct mem_ref_group *groups)
> {
> struct mem_ref *ref;
> + int prefetch_count = 0;
>
> for (; groups; groups = groups->next)
> for (ref = groups->refs; ref; ref = ref->next)
> if (should_issue_prefetch_p (ref))
> - return true;
> + prefetch_count++;
>
> - return false;
> + return prefetch_count;
> }
>
> /* Issue prefetches for the reference REF into loop as decided before.
> @@ -1449,6 +1473,71 @@ determine_loop_nest_reuse (struct loop *
> }
> }
>
> +/* Do a cost-benefit analysis to determine if prefetching is profitable
> + for the current loop given the following parameters:
> + AHEAD: the iteration ahead distance,
> + EST_NITER: the estimated trip count,
> + NINSNS: estimated number of instructions in the loop,
> + PREFETCH_COUNT: an estimate of the number of prefetches
> + MEM_REF_COUNT: total number of memory references in the loop. */
> +
> +static bool
> +is_loop_prefetching_profitable (unsigned ahead, HOST_WIDE_INT
> est_niter,
> + unsigned ninsns, unsigned
> prefetch_count,
> + unsigned mem_ref_count)
> +{
> + int insn_to_mem_ratio, insn_to_prefetch_ratio;
> +
> + if (mem_ref_count == 0)
> + return false;
> +
> + /* Prefetching improves performance by overlapping cache missing
> + memory accesses with CPU operations. If the loop does not have
> + enough CPU operations to overlap with memory operations,
> prefetching
> + won't give a significant benefit. One approximate way of checking
>
> + this is to require the ratio of instructions to memory references
> to
> + be above a certain limit. This approximation works well in
> practice.
> + TODO: Implement a more precise computation by estimating the time
> + for each CPU or memory op in the loop. Time estimates for memory
> ops
> + should account for cache misses. */
> + insn_to_mem_ratio = ninsns / mem_ref_count;
> +
> + if (insn_to_mem_ratio < PREFETCH_MIN_INSN_TO_MEM_RATIO)
> + return false;
> +
> + /* Profitability of prefetching is highly dependent on the trip
> count.
> + For a given AHEAD distance, the first AHEAD iterations do not
> benefit
> + from prefetching, and the last AHEAD iterations execute useless
> + prefetches. So, if the trip count is not large enough relative to
> AHEAD,
> + prefetching may cause serious performance degradation. To avoid
> this
> + problem when the trip count is not known at compile time, we
> + conservatively skip loops with high prefetching costs. For now,
> only
> + the I-cache cost is considered. The relative I-cache cost is
> estimated
> + by taking the ratio between the number of prefetches and the total
> + number of instructions. Since we are using integer arithmetic, we
> + compute the reciprocal of this ratio.
> + TODO: Account for loop unrolling, which may reduce the costs of
> + shorter stride prefetches. Note that not accounting for loop
> + unrolling over-estimates the cost and hence gives more
> conservative
> + results. */
> + if (est_niter < 0)
> + {
> + insn_to_prefetch_ratio = ninsns / prefetch_count;
> + return insn_to_prefetch_ratio >= MIN_INSN_TO_PREFETCH_RATIO;
> + }
> +
> + if (est_niter <= (HOST_WIDE_INT) ahead)
> + {
> + if (dump_file && (dump_flags & TDF_DETAILS))
> + fprintf (dump_file,
> + "Not prefetching -- loop estimated to roll only %d
> times\n",
> + (int) est_niter);
> + return false;
> + }
> + return true;
> +}
> +
> +
> /* Issue prefetch instructions for array references in LOOP. Returns
> true if the LOOP was unrolled. */
>
> @@ -1460,6 +1549,8 @@ loop_prefetch_arrays (struct loop *loop)
> HOST_WIDE_INT est_niter;
> struct tree_niter_desc desc;
> bool unrolled = false, no_other_refs;
> + unsigned prefetch_count;
> + unsigned mem_ref_count;
>
> if (optimize_loop_nest_for_size_p (loop))
> {
> @@ -1469,12 +1560,13 @@ loop_prefetch_arrays (struct loop *loop)
> }
>
> /* Step 1: gather the memory references. */
> - refs = gather_memory_references (loop, &no_other_refs);
> + refs = gather_memory_references (loop, &no_other_refs,
> &mem_ref_count);
>
> /* Step 2: estimate the reuse effects. */
> prune_by_reuse (refs);
>
> - if (!anything_to_prefetch_p (refs))
> + prefetch_count = estimate_prefetch_count (refs);
> + if (prefetch_count == 0)
> goto fail;
>
> determine_loop_nest_reuse (loop, refs, no_other_refs);
> @@ -1485,27 +1577,22 @@ loop_prefetch_arrays (struct loop *loop)
> the loop body. */
> time = tree_num_loop_insns (loop, &eni_time_weights);
> ahead = (PREFETCH_LATENCY + time - 1) / time;
> - est_niter = estimated_loop_iterations_int (loop, false);
> -
> - /* The prefetches will run for AHEAD iterations of the original loop.
> Unless
> - the loop rolls at least AHEAD times, prefetching the references
> does not
> - make sense. */
> - if (est_niter >= 0 && est_niter <= (HOST_WIDE_INT) ahead)
> - {
> - if (dump_file && (dump_flags & TDF_DETAILS))
> - fprintf (dump_file,
> - "Not prefetching -- loop estimated to roll only %d
> times\n",
> - (int) est_niter);
> - goto fail;
> - }
> -
> - mark_nontemporal_stores (loop, refs);
> + est_niter = estimated_loop_iterations_int (loop, false);
>
> ninsns = tree_num_loop_insns (loop, &eni_size_weights);
> unroll_factor = determine_unroll_factor (loop, refs, ninsns, &desc,
> est_niter);
> if (dump_file && (dump_flags & TDF_DETAILS))
> - fprintf (dump_file, "Ahead %d, unroll factor %d\n", ahead,
> unroll_factor);
> + fprintf (dump_file, "Ahead %d, unroll factor %d, trip count %ld\n"
> + "insn count %d, mem ref count %d, prefetch count %d\n",
> + ahead, unroll_factor, est_niter, ninsns, mem_ref_count,
> + prefetch_count);
> +
> + if (!is_loop_prefetching_profitable (ahead, est_niter, ninsns,
> + prefetch_count, mem_ref_count))
> + goto fail;
> +
> + mark_nontemporal_stores (loop, refs);
>
> /* Step 4: what to prefetch? */
> if (!schedule_prefetches (refs, unroll_factor, ahead))
> @@ -1556,7 +1643,11 @@ tree_ssa_prefetch_arrays (void)
> fprintf (dump_file, " L1 cache size: %d lines, %d kB\n",
> L1_CACHE_SIZE_BYTES / L1_CACHE_LINE_SIZE, L1_CACHE_SIZE);
> fprintf (dump_file, " L1 cache line size: %d\n",
> L1_CACHE_LINE_SIZE);
> - fprintf (dump_file, " L2 cache size: %d kB\n", L2_CACHE_SIZE);
> + fprintf (dump_file, " L2 cache size: %d kB\n", L2_CACHE_SIZE);
>
> + fprintf (dump_file, " min insn-to-prefetch ratio: %d \n",
> + MIN_INSN_TO_PREFETCH_RATIO);
> + fprintf (dump_file, " min insn-to-mem ratio: %d \n",
> + PREFETCH_MIN_INSN_TO_MEM_RATIO);
> fprintf (dump_file, "\n");
> }
>
> Index: params.def
> ===================================================================
> --- params.def (revision 148159)
> +++ params.def (working copy)
> @@ -741,6 +741,17 @@ DEFPARAM (PARAM_SLP_MAX_INSNS_IN_BB,
> "Maximum number of instructions in basic block to be
> considered for SLP vectorization",
> 1000, 0, 0)
>
> +DEFPARAM (PARAM_MIN_INSN_TO_PREFETCH_RATIO,
> + "min-insn-to-prefetch-ratio",
> + "min. ratio of insns to prefetches to enable prefetching for "
> + "a loop with an unknown trip count",
> + 10, 0, 0)
> +
> +DEFPARAM (PARAM_PREFETCH_MIN_INSN_TO_MEM_RATIO,
> + "prefetch-min-insn-to-mem-ratio",
> + "min. ratio of insns to mem ops to enable prefetching in a
> loop",
> + 3, 0, 0)
> +
> /*
> Local variables:
> mode:c
>
> -----Original Message-----
> From: Shobaki, Ghassan
> Sent: Thursday, June 04, 2009 2:54 PM
> To: 'Richard Guenther'
> Cc: Zdenek Dvorak; gcc-patches@xxxxxxxxxxx
> Subject: RE: Patch to Avoid Bad Prefetching
>
>
> I meant the time estimate.
>
> For example, consider a loop with 2 cache missing memory ops and 4 CPU
> ops. Assuming that we set the min_insn_to_mem_ratio to 3, the current
> *imprecise* heuristic will calculate an insn-to-mem ratio of (2+4)/2=3
> and conclude that this loop will benefit from prfetching whether the 4
> CPU ops are integer adds or FP divides. A more precise calculation will
> go like this:
>
> Time for memory ops = 200 cycle (assuming a cache-miss latency of 200
> and that the machine can do 2 mem ops simultaneously)
>
> If the 4 CPU ops are integer arithmetic:
> Time for CPU ops = 4*1=4 (assuming that int arith ops take one cycle
> each)
> Max potential benefit from prefetching = 4/204 = 2% (probably not
> significant enough to pay off for the prefething cost)
>
> On the other hand,
> If the 4 CPU ops are FP divides:
> Time for CPU ops = 4*20=80 (assuming that FP divides take 20 cycles
> each)
> Max potential benefit from prefetching = 80/(200+80) = 29% (most likely
> significant enough to pay off for the prefething cost and deliver a good
> performance gain)
>
> As you can see, this is much more precise than simply looking at the
> ratio, but it requires a good time estimate for each operation. I assume
> that the function tree_num_loop_insns() internally computes such time
> estimates if we pass it time weights. Of course, I know that middle-end
> time estimates will not be very precise compared to backend estimates.
> BTW, are the middle-end time estimates machine dependent?
>
> Are you OK with checking in the patch with the ratio for now?
>
> Thanks
> -Ghassan
>
> -----Original Message-----
> From: Richard Guenther [mailto:richard.guenther@xxxxxxxxx]
> Sent: Thursday, June 04, 2009 1:57 PM
> To: Shobaki, Ghassan
> Cc: Zdenek Dvorak; gcc-patches@xxxxxxxxxxx
> Subject: Re: Patch to Avoid Bad Prefetching
>
> On Thu, Jun 4, 2009 at 9:24 PM, Shobaki, Ghassan
> <Ghassan.Shobaki@xxxxxxx> wrote:
>>
>> I agree with Zdenek that the first heuristic is imprecise and needs
> improvement. At least it has to distinguish between simple integer
> arithmetic ops and expensive FP ops. I am planning on doing this very
> soon. As I explained in the previous email, that will require computing
> an instruction-by-instruction time estimate using the estimates that GCC
> already computes but adapting them to properly account for the costs of
> cache missing memory ops.
>
> GCC has both size and time estimates for code - I guess this is where
> you use
> the size estimate but really ment the time estimate, no?
>
> Richard.
>
>
>
|
|