assembly - Massive performance improvement: g++ 4.6 vs 4.7 vs 4.8 with AES-NI
I've been working for some time on a comparison between CPU AES-NI and GPU AES. I updated my g++ compiler (from 4.6 to 4.8) and saw a significant increase in performance (~2x) for CPU AES-NI.
I have simplified C code that "simulates" AES encryption using AES-NI instructions (listed below).
    __m128i cipher_128i;
    __attribute__((aligned(16))) unsigned char in_alligned[16];
    __attribute__((aligned(16))) unsigned char out_alligned[16];

    // store the plaintext in the cipher variable to encrypt
    memcpy(in_alligned, buf_in, 16);
    cipher_128i = _mm_load_si128((__m128i *) in_alligned);

    cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i);

    /* 9 rounds of aesenc, using the associated key parts */
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);

    /* 1 aesenclast round */
    cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i);

    // store the register and copy to the destination
    _mm_store_si128((__m128i *) out_alligned, cipher_128i);
    memcpy(buf_out, out_alligned, 16);
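For context, here is a minimal sketch of the kind of harness this kernel might run in; the buffer size, the dummy round key, and the encrypt_block wrapper are illustrative assumptions, not the exact benchmark code:

    #include <wmmintrin.h>   // AES-NI intrinsics (_mm_aesenc_si128, ...), needs -maes
    #include <emmintrin.h>   // SSE2 intrinsics (_mm_load_si128, ...)
    #include <cstring>
    #include <cstdlib>

    // Hypothetical wrapper around the 16-byte kernel shown above
    // (the rounds are written as a loop here; the original unrolls them).
    static void encrypt_block(unsigned char *buf_out, const unsigned char *buf_in,
                              __m128i key_exp_128i)
    {
        __attribute__((aligned(16))) unsigned char in_alligned[16];
        __attribute__((aligned(16))) unsigned char out_alligned[16];

        memcpy(in_alligned, buf_in, 16);
        __m128i cipher_128i = _mm_load_si128((__m128i *) in_alligned);

        cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i);
        for (int r = 0; r < 9; ++r)
            cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
        cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i);

        _mm_store_si128((__m128i *) out_alligned, cipher_128i);
        memcpy(buf_out, out_alligned, 16);
    }

    int main()
    {
        const size_t size = 1UL << 30;                  // 1 GB of plaintext
        unsigned char *buf_in  = (unsigned char *) malloc(size);
        unsigned char *buf_out = (unsigned char *) malloc(size);
        if (!buf_in || !buf_out)
            return 1;
        memset(buf_in, 0xAB, size);                     // arbitrary test data

        __m128i key_exp_128i = _mm_set1_epi8(0x42);     // dummy "round key"

        for (size_t off = 0; off < size; off += 16)     // serial, block by block
            encrypt_block(buf_out + off, buf_in + off, key_exp_128i);

        free(buf_in);
        free(buf_out);
        return 0;
    }

Compiled with something like g++ -O2 -maes and run under time, this is the sort of setup that produces the numbers below.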
Running this code over 1 GB of buf_in data on an AMD 5400K (serial execution) yields the following:
- g++-4.6 | real 0m2.982s, user 0m2.466s, sys 0m0.433s
- g++-4.7 | real 0m1.453s, user 0m0.877s, sys 0m0.512s
- g++-4.8 | real 0m1.157s, user 0m0.592s, sys 0m0.468s
I generated the assembly for each version of g++ (4.6, 4.7, 4.8) and found the compiler replacing sets of instructions of the movdqa/movq type with movdqu (see the picture below). http://postimg.org/image/q6j8qwyol/
Is it safe to assume this is the improvement? Does it make sense? And why did g++ 4.6 not consider that instruction in the first place?
Three things I noticed affecting performance between the three:
1) Better copying of data. In the old gcc, it appears to be breaking the 16B copies into two 8B loads/stores. That is because unaligned instructions used to have terrible performance (they were microcoded) on older architectures. Since Intel's Nehalem processor, unaligned instructions have been as fast as aligned instructions, assuming no cache-line split. Compilers therefore try to take advantage of this by being more liberal in their use of unaligned instructions (a sketch contrasting the two load styles follows after this list).
2) It looks like gcc optimized away a buffer overrun check, which had contributed to the overhead. I haven't looked in detail into why.
3) It looks like it also optimized away the need to dynamically align the stack pointer to 32B (needed in the first case to use movdqa, not needed in the second case, hence a perf bug there, and optimized away in the third case).
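To make point 1 concrete, here is a minimal sketch of the two load styles being contrasted; the helper names are mine, and the exact instructions emitted depend on the compiler version and flags:

    #include <wmmintrin.h>
    #include <emmintrin.h>
    #include <cstring>

    // Variant A: bounce through a 16B-aligned buffer, as in the code above.
    // Older gcc tends to implement the memcpy as two 8-byte moves (movq),
    // followed by an aligned load (movdqa).
    static __m128i load_via_aligned_copy(const unsigned char *buf_in)
    {
        __attribute__((aligned(16))) unsigned char in_alligned[16];
        memcpy(in_alligned, buf_in, 16);
        return _mm_load_si128((__m128i *) in_alligned);
    }

    // Variant B: load the possibly-unaligned source directly.
    // This maps to a single movdqu, which on Nehalem and later is as fast as
    // movdqa as long as the access does not split a cache line.
    static __m128i load_unaligned(const unsigned char *buf_in)
    {
        return _mm_loadu_si128((const __m128i *) buf_in);
    }

Newer gcc effectively turns variant A into variant B (plus the matching movdqu store on the way out), which is consistent with the movdqa/movq to movdqu replacement seen in the generated assembly.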