On 11/19/2009 08:30 PM, Uros Bizjak wrote:
> Note, that combine will see a nonimmediate_operand, where memory
> operand will be later fixed in a reload pass. Reload will
> automatically move memory operand to SSE register to satisfy operand3
> constraint.

Actually, looking a bit deeper into the splitting logic of the FMA
instructions, I think that this whole business of using the
ix86_fma4_valid_op_p () predicate and splitting depending on the type
of operand is flawed.

Let me illustrate this with the vector FMA insn, where we split invalid
instructions by looking at the operands, using the
ix86_expand_fma4_multiple_memory fix-up function:

(define_insn "fma4_fmadd<mode>4256"
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "=x,x,x")
        (plus:FMA4MODEF4
         (mult:FMA4MODEF4
          (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "x,x,xm")
          (match_operand:FMA4MODEF4 2 "nonimmediate_operand" "x,xm,x"))
         (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "xm,x,x")))]
  "TARGET_FMA4
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)"
  "vfmadd<fma4modesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])

;; Split fmadd with two memory operands into a load and the fmadd.
(define_split
  [(set (match_operand:FMA4MODEF4 0 "register_operand" "")
        (plus:FMA4MODEF4
         (mult:FMA4MODEF4
          (match_operand:FMA4MODEF4 1 "nonimmediate_operand" "")
          (match_operand:FMA4MODEF4 2 "nonimmediate_operand" ""))
         (match_operand:FMA4MODEF4 3 "nonimmediate_operand" "")))]
  "TARGET_FMA4
   && !ix86_fma4_valid_op_p (operands, insn, 4, true, 1, true)
   && ix86_fma4_valid_op_p (operands, insn, 4, true, 2, true)
   && !reg_mentioned_p (operands[0], operands[1])
   && !reg_mentioned_p (operands[0], operands[2])
   && !reg_mentioned_p (operands[0], operands[3])"
  [(const_int 0)]
{
  ix86_expand_fma4_multiple_memory (operands, 4, <MODE>mode);
  emit_insn (gen_fma4_fmadd<mode>4256 (operands[0], operands[1],
                                       operands[2], operands[3]));
  DONE;
})

This generates the following vectorized loop:

.L2:
        vmovaps b(%rax), %xmm1
        vmovaps d(%rax), %xmm0
        vfmaddps %xmm0, c(%rax), %xmm1, %xmm0
        vmovaps %xmm0, a(%rax)
        addq $16, %rax
        cmpq $40960, %rax
        jne .L2

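For reference, the kind of source loop that produces such a sequence can be sketched in C as follows. This is only a sketch: the array names a, b, c and d are taken from the symbols in the assembly, while the float element type (inferred from the "ps" suffix) and the element count of 10240 (inferred from the 40960-byte bound) are assumptions. The actual testcase is not reproduced here.

```c
#include <stddef.h>

/* Assumed element count: 40960 bytes / sizeof (float).  */
#define N 10240

/* Array names follow the symbols in the assembly; the float element
   type is an assumption based on the "ps" instruction suffix.  */
float a[N], b[N], c[N], d[N];

/* One multiply-add per element: a = b * c + d.  A vectorizer that
   targets 128-bit registers processes four floats per iteration,
   which matches the 16-byte stride in the loop above.  */
void fma_loop(void)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] * c[i] + d[i];
}
```
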
However, the same result can be obtained by carefully placing operand
constraints in alternatives:

(define_insn "fma4_fmadd<mode>4"
  [(set (match_operand:SSEMODEF4 0 "register_operand" "=x,x")
        (plus:SSEMODEF4
         (mult:SSEMODEF4
          (match_operand:SSEMODEF4 1 "nonimmediate_operand" "%x,x")
          (match_operand:SSEMODEF4 2 "nonimmediate_operand" " x,m"))
         (match_operand:SSEMODEF4 3 "nonimmediate_operand" "xm,x")))]
  "TARGET_FMA4"
  "___vfmadd<ssemodesuffixf4>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
  [(set_attr "type" "ssemuladd")
   (set_attr "mode" "<MODE>")])

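To make the effect of the two alternatives concrete, here is a toy model in C of which operand placements the pattern above accepts directly. It is not GCC code and does not model how reload costs or chooses alternatives; the struct alt type and both function names are hypothetical. It only checks whether the source operands 1..3 fit "x,x" / "x,m" / "xm,x" as written, or after swapping operands 1 and 2 as the "%" modifier permits.

```c
#include <stdbool.h>

/* Toy model, not GCC code: for each source operand (1..3, stored at
   index 0..2) record whether an alternative accepts a register
   and/or a memory operand.  */
struct alt {
    bool reg_ok[3];
    bool mem_ok[3];
};

/* The two alternatives of the proposed pattern:
   alternative 0: op1 "x", op2 "x", op3 "xm"
   alternative 1: op1 "x", op2 "m", op3 "x"  */
static const struct alt alts[2] = {
    { .reg_ok = { true, true,  true }, .mem_ok = { false, false, true  } },
    { .reg_ok = { true, false, true }, .mem_ok = { false, true,  false } },
};

/* True if every operand placement is accepted by this alternative.  */
static bool alt_matches(const struct alt *a, const bool in_mem[3])
{
    for (int i = 0; i < 3; i++)
        if (in_mem[i] ? !a->mem_ok[i] : !a->reg_ok[i])
            return false;
    return true;
}

/* True if the operands fit some alternative without any extra load,
   either as given or with operands 1 and 2 exchanged ("%").  */
bool matches_without_reload(const bool in_mem[3])
{
    const bool swapped[3] = { in_mem[1], in_mem[0], in_mem[2] };

    for (int k = 0; k < 2; k++)
        if (alt_matches(&alts[k], in_mem) || alt_matches(&alts[k], swapped))
            return true;
    return false;
}
```

In this model a single memory operand in any source position is accepted (in position 1 only thanks to the swap), while two memory operands never match, so one of them has to be loaded into a register first.
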
Please note that all other insn predicates, splitters and fixups for
this instruction were disabled. This pattern still correctly vectorizes
the operation with three arrays, and for gcc.target/i386/fma4-vector.c
it generates a similar vectorized sequence:

.L2:
        vmovaps b(%rax), %xmm1
        vmovaps c(%rax), %xmm2
        ___vfmaddps d(%rax), %xmm2, %xmm1, %xmm0
        vmovaps %xmm0, a(%rax)
        addq $16, %rax
        cmpq $40960, %rax
        jne .L2

Please note that by introducing "%" into the operand constraint (which
marks operands 1 and 2 as commutative), we also increased the number of
possible reloads.

Uros.