assembly - Massive performance improvement: g++ 4.6 vs 4.7 vs 4.8 with AES-NI
I've been working for some time on a comparison between CPU AES-NI and GPU AES. I updated my g++ compiler (from 4.6 to 4.8) and saw a significant increase in performance (~2x) for CPU AES-NI.
I have simplified C code that "simulates" AES encryption using AES-NI instructions (listed below).
    __m128i cipher_128i;
    __attribute__((aligned(16))) unsigned char in_alligned[16];
    __attribute__((aligned(16))) unsigned char out_alligned[16];

    // store the plaintext in the cipher variable to encrypt
    memcpy(in_alligned, buf_in, 16);
    cipher_128i = _mm_load_si128((__m128i *) in_alligned);

    cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i);

    /* 9 rounds of aesenc, using the associated key parts */
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
    cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);

    /* 1 aesenclast round */
    cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i);

    // store the register and copy to the destination
    _mm_store_si128((__m128i *) out_alligned, cipher_128i);
    memcpy(buf_out, out_alligned, 16);
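For context, here is a minimal sketch of the kind of harness this kernel might run in; the buffer size, the dummy round key, and the encrypt_block wrapper are illustrative assumptions, not the exact benchmark code:

    #include <wmmintrin.h>   // AES-NI intrinsics (_mm_aesenc_si128, ...), needs -maes
    #include <emmintrin.h>   // SSE2 intrinsics (_mm_load_si128, ...)
    #include <cstring>
    #include <cstdlib>

    // Hypothetical wrapper around the 16-byte kernel shown above
    // (the rounds are written as a loop here; the original unrolls them).
    static void encrypt_block(unsigned char *buf_out, const unsigned char *buf_in,
                              __m128i key_exp_128i)
    {
        __attribute__((aligned(16))) unsigned char in_alligned[16];
        __attribute__((aligned(16))) unsigned char out_alligned[16];

        memcpy(in_alligned, buf_in, 16);
        __m128i cipher_128i = _mm_load_si128((__m128i *) in_alligned);

        cipher_128i = _mm_xor_si128(cipher_128i, key_exp_128i);
        for (int r = 0; r < 9; ++r)
            cipher_128i = _mm_aesenc_si128(cipher_128i, key_exp_128i);
        cipher_128i = _mm_aesenclast_si128(cipher_128i, key_exp_128i);

        _mm_store_si128((__m128i *) out_alligned, cipher_128i);
        memcpy(buf_out, out_alligned, 16);
    }

    int main()
    {
        const size_t size = 1UL << 30;                  // 1 GB of plaintext
        unsigned char *buf_in  = (unsigned char *) malloc(size);
        unsigned char *buf_out = (unsigned char *) malloc(size);
        if (!buf_in || !buf_out)
            return 1;
        memset(buf_in, 0xAB, size);                     // arbitrary test data

        __m128i key_exp_128i = _mm_set1_epi8(0x42);     // dummy "round key"

        for (size_t off = 0; off < size; off += 16)     // serial, block by block
            encrypt_block(buf_out + off, buf_in + off, key_exp_128i);

        free(buf_in);
        free(buf_out);
        return 0;
    }

Compiled with something like g++ -O2 -maes and run under time, this is the sort of setup that produces the numbers below.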
Running this code over 1 GB of buf_in data on an AMD 5400K (serial execution) yields the following:
- g++-4.6 | real 0m2.982s, user 0m2.466s, sys 0m0.433s
- g++-4.7 | real 0m1.453s, user 0m0.877s, sys 0m0.512s
- g++-4.8 | real 0m1.157s, user 0m0.592s, sys 0m0.468s
I generated the assembly for each version of g++ (4.6, 4.7, 4.8) and found the compiler replacing sets of instructions of the movdqa/movq type with movdqu (see the picture below). http://postimg.org/image/q6j8qwyol/
Is it safe to assume this is the improvement? Does it make sense? And why did g++ 4.6 not consider that instruction in the first place?
Three things I noticed affecting performance between the three:
1) Better copying of data. In the old gcc, it appears to be breaking the 16B copies into two 8B loads/stores. That is because unaligned instructions used to have terrible performance (they were microcoded) on older architectures. Since Intel's Nehalem processor, unaligned instructions have been as fast as aligned instructions, assuming no cache-line split. Compilers therefore try to take advantage of this by being more liberal in their use of unaligned instructions (a sketch contrasting the two load styles follows after this list).
2) It looks like gcc optimized away a buffer overrun check, which had contributed to the overhead. I haven't looked in detail into why.
3) It looks like it also optimized away the need to dynamically align the stack pointer to 32B (needed in the first case to use movdqa, not needed in the second case, hence a perf bug there, and optimized away in the third case).
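To make point 1 concrete, here is a minimal sketch of the two load styles being contrasted; the helper names are mine, and the exact instructions emitted depend on the compiler version and flags:

    #include <wmmintrin.h>
    #include <emmintrin.h>
    #include <cstring>

    // Variant A: bounce through a 16B-aligned buffer, as in the code above.
    // Older gcc tends to implement the memcpy as two 8-byte moves (movq),
    // followed by an aligned load (movdqa).
    static __m128i load_via_aligned_copy(const unsigned char *buf_in)
    {
        __attribute__((aligned(16))) unsigned char in_alligned[16];
        memcpy(in_alligned, buf_in, 16);
        return _mm_load_si128((__m128i *) in_alligned);
    }

    // Variant B: load the possibly-unaligned source directly.
    // This maps to a single movdqu, which on Nehalem and later is as fast as
    // movdqa as long as the access does not split a cache line.
    static __m128i load_unaligned(const unsigned char *buf_in)
    {
        return _mm_loadu_si128((const __m128i *) buf_in);
    }

Newer gcc effectively turns variant A into variant B (plus the matching movdqu store on the way out), which is consistent with the movdqa/movq to movdqu replacement seen in the generated assembly.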