On Sun, Mar 29, 2009 at 04:42:50PM +0200, Aurelien Jarno wrote:
> On Sun, Mar 29, 2009 at 03:34:53PM +0200, Aurelien Jarno wrote:
> > On Sat, Mar 28, 2009 at 05:18:34PM -0700, Nathan Froyd wrote:
> > > On Sat, Mar 28, 2009 at 11:54:43PM +0100, Aurelien Jarno wrote:
> > > > On Sat, Mar 28, 2009 at 02:30:13PM -0700, Nathan Froyd wrote:
> > > > > I am not a TCG expert, but there are several loops in TCG over all
> > > > > globals and it seems like those loops would go faster if they didn't
> > > > > have to consider registers that would never be touched. If this patch
> > > > > series makes no difference in TCG's performance, then I'd be glad to
> > > > > have an explanation of why that's the case.
> > > >
> > > > Have you actually run a benchmark with those changes? TCG is
> > > > sometimes a bit strange, and some optimizations do not change the
> > > > execution speed, while others improve it a lot. It is very difficult to
> > > > predict what will give a gain or not.
> > > >
> > > > Suggestions of benchmarks: gzip/bzip2 on a big file using user emulation
> > > > or a compilation in system emulation.
> > >
> > > Benchmarking? Pffft. ;)
> > >
> > > A benchmarking session with qemu-ppc and bzip2/bunzip2 on ~400MB files
> > > and a 603e emulated CPU suggests that these changes are not terribly
> > > beneficial (maybe 1% improvement, if that). I don't imagine that a
> > > similarly stressful benchmark in system emulation would be much
> > > different. Consider the patch series withdrawn.
> > >
> > I have done some profiling on qemu-system-ppc and qemu-system-mips. You
> > are actually right that the loop over the TCG variable lists takes time.
> > This is mainly due to the call of save_globals() for TCG functions marked
> > as TCG_OPF_CALL_CLOBBER.
> > However, it looks like it would be better to address this comment first
> > before trying to reduce the number of TCG variables:
> > /* XXX: for load/store we could do that only for the slow path
> > (i.e. when a memory callback is called) */
> Thinking about it a bit more, I think we should avoid mapping FPU registers as
> global TCG variables. Those variables are mostly modified by helpers
> (except for move and load/store), and they will be written back to
> memory before the call to the helper. This means TCG can't delay the
> memory accesses, so there is very little (or no) difference in the
> generated code if the FPU register is accessed through a global TCG
> variable or through tcg_gen_ld_tl().
> I have done the test with qemu-system-mips, and I have found a gain
> of around 1% in speed.
My measurements were wrong; the gain is around 9%.
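
To illustrate the reasoning, here is a toy model of the cost, not QEMU
code. It assumes that every helper call is preceded by a scan over all
globals, as save_globals() does, and that exactly one global is dirty at
that point. ToyGlobal, ToyStats, toy_save_globals() and toy_run() are
hypothetical names made up for this sketch, and the 32/64 register counts
are only an example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define TOY_MAX_GLOBALS 64

/* Hypothetical stand-in for a TCG global variable. */
typedef struct {
    bool dirty; /* value must be written back before a helper call */
} ToyGlobal;

typedef struct {
    unsigned long scanned; /* globals visited by the write-back loop */
    unsigned long stores;  /* write-backs actually performed */
} ToyStats;

/* Models the loop over all globals that runs before each helper call. */
static void toy_save_globals(ToyGlobal *globals, int nb_globals, ToyStats *st)
{
    for (int i = 0; i < nb_globals; i++) {
        st->scanned++;
        if (globals[i].dirty) {
            st->stores++;
            globals[i].dirty = false;
        }
    }
}

/* Runs nb_calls helper calls, dirtying one global before each call. */
static ToyStats toy_run(int nb_globals, int nb_calls)
{
    ToyGlobal globals[TOY_MAX_GLOBALS] = {0};
    ToyStats st = {0, 0};

    assert(nb_globals > 0 && nb_globals <= TOY_MAX_GLOBALS);
    for (int c = 0; c < nb_calls; c++) {
        globals[c % nb_globals].dirty = true;
        toy_save_globals(globals, nb_globals, &st);
    }
    return st;
}

#ifdef TOY_STANDALONE
int main(void)
{
    /* 64: integer and FPU registers as globals; 32: integer registers only. */
    ToyStats with_fpu = toy_run(64, 1000);
    ToyStats without_fpu = toy_run(32, 1000);

    printf("with FPU globals:    scanned=%lu stores=%lu\n",
           with_fpu.scanned, with_fpu.stores);
    printf("without FPU globals: scanned=%lu stores=%lu\n",
           without_fpu.scanned, without_fpu.stores);
    return 0;
}
#endif
```

Built with -DTOY_STANDALONE, this prints the two sets of counters. In this
model the number of write-backs is the same in both cases, while the number
of globals scanned is proportional to the number of globals. It says
nothing about how large the effect is in real QEMU.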
Aurelien Jarno GPG: 1024D/F1BCDB73