C++ - Efficiency of STL algorithms with fixed size arrays
In general, I assume that the STL implementation of an algorithm is at least as efficient as anything I could come up with myself (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.

Let's assume we want to calculate the inner product of two fixed-size arrays. A naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
// fill with arbitrary numbers
double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
    sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time and the operations can be directly mapped to native processor instructions, the compiler should be able to generate "optimal" machine code (loop unrolling, vectorization / FMA instructions ...).
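For illustration, here is a minimal hand-unrolled sketch of the kind of code I would hope the optimizer effectively produces (four independent accumulators, building on v1/v2 from the snippet above). Note that splitting the sum like this reorders the floating-point additions, which the compiler itself is normally only allowed to do under fast-math-style flags, and it assumes the array size is a multiple of 4, which holds for 100000:

double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
// four independent partial sums, so the additions can live in separate
// registers / SIMD lanes instead of forming one long dependency chain
for (size_t i = 0; i < v1.size(); i += 4) {
    s0 += v1[i]     * v2[i];
    s1 += v1[i + 1] * v2[i + 1];
    s2 += v1[i + 2] * v2[i + 2];
    s3 += v1[i + 3] * v2[i + 3];
}
double sum = (s0 + s1) + (s2 + s3);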
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds additional indirections, and even if it is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where that region lies. While this is certainly possible in principle, I wonder whether a typical C++ compiler will actually do it.
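Since std::array stores its elements contiguously, one way to make that knowledge explicit is to hand the algorithm raw pointers obtained via data() instead of the array iterators. This is just a sketch of the idea; whether it actually changes the generated code depends on the compiler and standard library:

const double* p1 = v1.data();
const double* p2 = v2.data();
// the algorithm is now instantiated directly for double*, so the "iterator"
// really is a plain pointer into a known contiguous block
double sum = std::inner_product(p1, p1 + v1.size(), p2, 0.0);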
So the question is: Do you think there can be a performance benefit in implementing standard algorithms that operate on fixed-size arrays on your own, or will the STL version always outperform a manual implementation?
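To make concrete what I mean by a manual implementation for fixed-size arrays, here is a minimal sketch where the trip count is baked into the signature as a compile-time constant (dot is just a name I made up for illustration):

#include <array>
#include <cstddef>

// hand-rolled inner product for two fixed-size arrays of the same length N
template <std::size_t N>
double dot(const std::array<double, N>& a, const std::array<double, N>& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

It would be called as dot(v1, v2), so that N = 100000 is deduced from the argument types.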
As suggested, I did some measurements, and
- for the code below,
- compiled with VS2013 for x64 in Release mode,
- executed on a Win8.1 machine with an i7-2640M,

the algorithm version is consistently slower by about 20% (15.6-15.7 s vs. 12.9-13.1 s). The relative difference stays constant over two orders of magnitude for n and reps.
So I guess the answer is: using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or whether it is specific to my platform, compiler and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t n = 10000000;  // size of the arrays
static const size_t reps = 1000;   // number of repetitions

array<double, n> a1;
array<double, n> a2;

int main() {
    srand(10);
    for (size_t i = 0; i < n; ++i) {
        a1[i] = static_cast<double>(rand()) * 0.01;
        a2[i] = static_cast<double>(rand()) * 0.01;
    }

    double res = 0.0;
    auto start = high_resolution_clock::now();
    for (size_t z = 0; z < reps; z++) {
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
        for (size_t t = 0; t < n; ++t) {
            res += a1[t] * a2[t];
        }
#endif
    }
    auto end = high_resolution_clock::now();

    std::cout << res << " ";  // <-- necessary, so the loop isn't optimized away
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}

/*
 * Update: results (Ubuntu 14.04, Haswell)
 *
 * STL algorithm:
 * g++-4.8.2   -march=native -std=c++11 -O3 main.cpp             : 1.15287e+24 3551 ms
 * g++-4.8.2   -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
 * clang++-3.5 -march=native -std=c++11 -O3 main.cpp             : 1.15287e+24 9378 ms
 * clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
 *
 * Loop:
 * g++-4.8.2   -march=native -std=c++11 -O3 main.cpp             : 1.15287e+24 3543 ms
 * g++-4.8.2   -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
 * clang++-3.5 -march=native -std=c++11 -O3 main.cpp             : 1.15287e+24 9613 ms
 * clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
 */
Edit:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is the same for both versions). However, gcc's version is twice as fast as clang's (7.5 s vs. 14 s).