Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?

On Tue, 25 Mar 2014, David Brown wrote:

> On 25/03/14 04:31, Xinrong Fu wrote:
>> Hi guys:
>>    What does the number of stalled cycles in the CPU pipeline frontend
>> means? Why is the stalled frontend cycles of 32bit program more than
>> 64bit program's stalled cycles when they running on same 64bit system?
>> Is there any gcc options to fix it?

If the question is: why is a 32-bit program a lot slower when compiled as 64-bit:

There can be several reasons; let me name a few.

a) If you use signed 32-bit indexing, for example:

int i, x, array[64];

i = ...;
x = array[i];

this is very fast on a 32-bit processor in 32-bit mode, yet a lot slower in 64-bit mode, as i needs a sign extension to 64 bits.
So the compiler generates 1 additional instruction in 64-bit mode
to sign-extend i from 32 bits to 64 bits.

b) Some processors can 'issue' more 32-bit instructions per clock than 64-bit instructions. This can have many reasons; for example, the processor can only decode a limited number of bytes per clock, and as 32-bit instructions occupy less space, it might decode 4 instructions in 32-bit mode yet just 3 in 64-bit mode. Please note: I am not taking vector instructions into account here; I am counting each instruction as one instruction, regardless of how wide the register is upon which it operates.

Agner Fog has more exact measurements of how few bytes modern processors can actually decode per clock.

My chess program Diep, which is deterministic integer code (so no vector code), is about 10%-12% slower compiled as 64-bit than as 32-bit. This even though it does use a few 64-bit data types (very few, though). In 64-bit mode the data size doesn't grow, but instruction-wise it grows immensely, of course.

Besides the above reasons, another reason why 32-bit programs compiled as 64-bit can be a lot slower, in the case of Diep, is:

c) the larger code size causes more L1 instruction cache misses.

And that is a major problem, especially as those L1i caches are already so tiny on modern processors.

d) gcc is terrible at optimizing branches. Where a compiler like Intel C++ easily gets 20%-25% performance out of PGO (profile-guided optimization), gcc gets total peanuts out of the PGO phase for my chess program: 3% or so.

This all has to do with how it deals with branches and the horrible optimizations that get triggered.

Now these horrors are something you could either benefit from by going to 64 bits, as the horrible code is no longer used there, or get an additional penalty from when moving to 64 bits. The latter happens for example on an older generation of AMD processors, when a jump suddenly lands outside of what the processor sees in its lookahead window, suddenly causing a huge branch-misprediction penalty.

It largely depends upon the processor you have; especially older types of AMD processors suffer there.

So moving to 64 bits there could speed you up occasionally, even when not using any optimization at all, just because some of the old FUBAR code generated for 32 bits no longer gets triggered.

> Are you asking why the same program runs faster when compiled as 64-bit
> rather than 32-bit?  There are /many/ reasons why 64-bit x86 code can be
> faster than 32-bit x86 code - without having any idea about your code,
> we can only make general points.  In comparison to 32-bit x86, the
> 64-bit mode has access to more registers,

Usually processors are optimized to use just a few registers, and they already use all sorts of tricks (where additional physical registers get used behind the scenes) to make up for it, so the additional registers are hardly an advantage of any kind in 64 bits, not even in algorithmic code here.

In tests performed with assembler code that uses more registers, the processors actually slow down. So there is a performance benefit in reusing the same few registers over and over again.

This performance penalty of using more registers is not only there on x86-64; it was already the case on x86 processors. In fact, it was easy to measure on the Pentium from two decades ago already.

Hopefully that'll change a tad in the future - yet I consider that unlikely, as it would also involve changes in the Intel C++ compiler.

> has wider registers (which

Exactly:

If you use 64-bit data types like "long long", then 64 bits is obviously a huge advantage over 32 bits. This can easily give a factor-2 speed improvement for integer code that really is 64-bit.

> speeds data movement), less complicated instruction decoding and
> instruction prefixes, more efficient floating point, and much more
> efficient calling conventions.  It has the disadvantage that pointers
> take up twice as much data cache and memory bandwidth, as they are twice
> the size.

Seen from a distance you're totally correct here that caches are the problem. To zoom in: the larger pointer is more of a problem for the instruction part of the cache.

In itself the larger pointer doesn't mean the size the data occupies in the data cache grows.

Yet the compiler in 64-bit mode needs more instructions to get at the 32-bit data, and such 64-bit pointer instructions are simply larger, laying more stress upon instruction decoding/transport, whereas we already know the processor can only decode a handful of bytes per clock.

Now for a lot of programs this isn't a big problem, as another bitwise AND is a very fast instruction. Yet my software is pretty well optimized, so I do feel the additional instructions, not least because they make the already super-tiny L1 instruction cache overflow even more, and because the IPC is already very high :)

> As for gcc options to "fix" it, there is no problem to fix - it is
> normal that 64-bit code is a bit more efficient than 32-bit code from
> the same program, but details vary according to the code in question.
>
> One thing I notice from your post is that you are compiling without
> enabling optimisation, which cripples the compiler's performance.
> Enabling "-O2" will probably make your code several times faster (again,
> without information on the program, I can only make general statements).
> Different optimisation settings like "-Os", "-O3", and individual
> optimisation flags may or may not make the code faster, but "-O2" is a
> good start.

A good tip with gcc is to never go further than -O2.

Going further is at your own risk :)

In the past 20 years or so, gcc has actually never generated faster code for my chess software with -O3; usually it causes problems and slows things down.

Kind Regards,
Vincent



