c - How to vectorize this kernel?
I asked this question in the middle of another one, and it seems nobody answered it in the previous topic.
My question is the following. I have been vectorizing one application with success. However, I have trouble with this particular kernel:
    inline int cfunction(node** a, node* b){
        long dot = 0, i;
        for(i=0;i<size;i++)
            dot += (*a)->data[i] * b->data[i];
        if(abs(2 * dot) <= b->norm)
            return 0;
        long q = round((double) dot / b->norm);
        for(i=0;i<size;i++)
            (*a)->data[i] -= q * b->data[i];
        (*a)->norm = (*a)->norm + q * q * b->norm - 2 * q * dot;
        return 1;
    }
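For anyone who wants to reproduce the kernel in isolation, here is a minimal self-contained sketch of the scalar version. The `node` layout, the `SIZE` constant, and the names `reduce_scalar` and `squared_norm` are assumptions made for illustration, since the post does not show the real definitions; the sketch also takes a plain `node *` instead of `node **` and rounds by hand so that it links without `-lm`. The norm update relies on the identity ||a - q*b||^2 = ||a||^2 + q^2*||b||^2 - 2*q*<a,b>, which `squared_norm` lets you check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SIZE 8  /* assumed vector length; the post's `size` is not shown */

/* Assumed layout: the post does not show the real `node` definition. */
typedef struct {
    int32_t data[SIZE];
    long norm;              /* squared Euclidean norm of data */
} node;

/* Recomputes the squared norm from scratch, to check the incremental update. */
static long squared_norm(const node *n) {
    long s = 0;
    for (int i = 0; i < SIZE; i++)
        s += (long)n->data[i] * n->data[i];
    return s;
}

/* Scalar reference: subtracts the rounded multiple q of b from a.
   Returns 0 if a is left unchanged, 1 otherwise. */
static int reduce_scalar(node *a, const node *b) {
    long dot = 0;
    assert(b->norm > 0);    /* the quotient below divides by b->norm */

    /* First loop: dot product <a, b>. */
    for (int i = 0; i < SIZE; i++)
        dot += (long)a->data[i] * b->data[i];

    if (labs(2 * dot) <= b->norm)
        return 0;

    /* Hand-written round-half-away-from-zero, standing in for round(). */
    double x = (double)dot / (double)b->norm;
    long q = (long)(x >= 0.0 ? x + 0.5 : x - 0.5);

    /* Second loop: a -= q * b. */
    for (int i = 0; i < SIZE; i++)
        a->data[i] -= (int32_t)(q * b->data[i]);

    /* ||a - q*b||^2 = ||a||^2 + q^2 * ||b||^2 - 2 * q * <a, b> */
    a->norm = a->norm + q * q * b->norm - 2 * q * dot;
    return 1;
}
```

After a call that returns 1, `squared_norm(a)` should equal `a->norm`, which gives a cheap correctness check for any vectorized variant.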
I am able to vectorize the first loop. If I build the code with icpc, using:
icpc *.c *.h -g -O2 -msse4.2 -vec-report=1
I get the following report:
main.c(558): (col. 9) remark: loop vectorized.
main.c(566): (col. 2) remark: loop vectorized.
This tells me that icpc vectorizes the code. Now, if I hand-vectorize the first loop, I get a (perfect?) speedup factor for ints. That tells me the compiler is not doing a good job at vectorizing it (especially because the performance is the same if I use shorts). For the second loop, however, I get no performance gain whatsoever if I hand-vectorize it like this:
    const int q = round((double) dot / b->norm);
    int32_t * pa = (*a)->data;
    int32_t * const pb = b->data;
    const __m128i vecqi = _mm_set1_epi32(q);
    __m128i vecresi, vecpi, vecci, vecqci;
    for(i=0;i<size-3;i+=4){
        vecpi = _mm_load_si128((__m128i *)&(pa)[i] );
        vecci = _mm_load_si128((__m128i *)&(pb)[i] );
        vecqci = _mm_mullo_epi32(vecqi,vecci);
        vecresi = _mm_sub_epi32(vecpi,vecqci);
        _mm_store_si128((__m128i *) ((pa) + i), vecresi );
    }
    for(;i<size;i++)
        pa[i] -= q * pb[i];
    (*a)->norm = (*a)->norm + q * q * b->norm - 2 * q * dot;
Does anyone have a clue why I am not getting any performance gain from vectorizing the second kernel?
Thanks.
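Before comparing timings, it may help to confirm that the hand-vectorized second loop and the plain loop produce identical results. The sketch below is one way to set that up, under several assumptions: the function names `update_scalar` and `update_sse` are hypothetical, unaligned loads and stores (`_mm_loadu_si128` / `_mm_storeu_si128`) replace the aligned ones from the post because the alignment of `data` is not shown, and the `target("sse4.1")` attribute is a GCC/Clang feature used here so the file builds without `-msse4.1`. On other compilers or architectures the sketch falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h>      /* SSE4.1, needed for _mm_mullo_epi32 */
#define HAVE_SSE41_SKETCH 1
#endif

/* Scalar reference for the second loop: pa[i] -= q * pb[i]. */
static void update_scalar(int32_t *pa, const int32_t *pb, int32_t q, size_t n) {
    assert(pa != NULL && pb != NULL);
    for (size_t i = 0; i < n; i++)
        pa[i] -= q * pb[i];
}

#ifdef HAVE_SSE41_SKETCH
/* Hand-vectorized variant, four 32-bit lanes per iteration. */
__attribute__((target("sse4.1")))
static void update_sse(int32_t *pa, const int32_t *pb, int32_t q, size_t n) {
    assert(pa != NULL && pb != NULL);
    const __m128i vq = _mm_set1_epi32(q);   /* q broadcast to all lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(pa + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(pb + i));
        va = _mm_sub_epi32(va, _mm_mullo_epi32(vq, vb));
        _mm_storeu_si128((__m128i *)(pa + i), va);
    }
    /* Remainder elements that do not fill a whole vector. */
    for (; i < n; i++)
        pa[i] -= q * pb[i];
}
#else
/* Fallback so the sketch still builds where SSE4.1 is unavailable. */
static void update_sse(int32_t *pa, const int32_t *pb, int32_t q, size_t n) {
    update_scalar(pa, pb, q, n);
}
#endif
```

Running both functions on copies of the same input and comparing the arrays element by element separates correctness from the performance question asked above.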