Why is Python's hashlib.md5 faster than Linux coreutils md5sum?
I found that Python's hashlib.md5 may be faster than coreutils md5sum.
Python hashlib:

    import hashlib
    import os
    from subprocess import Popen, PIPE

    def get_hash(fpath, algorithm='md5', block=32768):
        # Unknown algorithm name: give up early.
        if not hasattr(hashlib, algorithm):
            return ''
        m = getattr(hashlib, algorithm)()
        if not os.path.isfile(fpath):
            return ''
        # Open in binary mode so the raw bytes are hashed unchanged.
        with open(fpath, 'rb') as f:
            while True:
                data = f.read(block)
                if not data:
                    break
                m.update(data)
        return m.hexdigest()

coreutils md5sum:

    def shell_hash(fpath, method='md5sum'):
        if not os.path.isfile(fpath):
            return ''
        cmd = [method, fpath]  # delete shlex
        p = Popen(cmd, stdout=PIPE)
        output, _ = p.communicate()
        if p.returncode:
            return ''
        # The command prints "<digest>  <filename>"; keep only the digest.
        return output.split()[0].decode('ascii')

There are 4 columns of test results, giving the time taken to calculate md5 and sha1.
1st column: calculation time of coreutils md5sum or sha1sum.
2nd column: calculation time of Python hashlib md5 or sha1, reading in 1048576-byte chunks.
3rd column: calculation time of Python hashlib md5 or sha1, reading in 32768-byte chunks.
4th column: calculation time of Python hashlib md5 or sha1, reading in 512-byte chunks.
    md5sum/sha1sum  hashlib 1048576  hashlib 32768  hashlib 512
    4.08805298805   3.81827783585    3.72585606575  5.72505903244
    6.28456497192   3.69725108147    3.59885907173  5.69266486168
    4.08003306389   3.82310700417    3.74562311172  5.74706888199
    6.25473690033   3.70099711418    3.60972714424  5.70108985901
    4.07995700836   3.83335709572    3.74854302406  5.74988412857
    6.26068210602   3.72050404549    3.60864400864  5.69080018997
    4.08979201317   3.83872914314    3.75350999832  5.79242300987
    6.28977203369   3.69586396217    3.60469412804  5.68853116035
    4.0824379921    3.83340883255    3.74298214912  5.73846316338
    6.27566385269   3.6986720562     3.6079480648   5.68188500404
    4.10092496872   3.82357311249    3.73044300079  5.7778570652
    6.25675201416   3.78636980057    3.62911510468  5.71392583847
    4.09579920769   3.83730792999    3.73345088959  5.73320293427
    6.26580905914   3.69428491592    3.61320495605  5.69155502319
    4.09030103683   3.82516098022    3.73244214058  5.72749185562
    6.26151800156   3.6951239109     3.60320997238  5.70400810242
    4.07977604866   3.81951498985    3.73287010193  5.73037815094
    6.26691818237   3.72077894211    3.60203289986  5.71795105934
    4.08536100388   3.83897590637    3.73681998253  5.73614501953
    6.2943251133    3.72131896019    3.61498594284  5.69963502884

(My computer has a 4-core i3-2120 CPU @ 3.30GHz and 4 GB of memory. The file hashed by these programs is 2 GB in size. Odd rows are md5 and even rows are sha1. Times in the table are in seconds.)

After more than 100 test runs, I found that Python's hashlib is faster than md5sum or sha1sum.
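The chunk-size comparison above could be reproduced with a small harness along these lines. This is only a sketch in Python 3, not the script used for the table: the names `hash_file` and `time_block_sizes` are hypothetical, and the demo uses a small temporary file rather than a 2 GB one.

```python
import hashlib
import os
import tempfile
import time


def hash_file(path, algorithm="md5", block=32768):
    """Hash a file in fixed-size chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()


def time_block_sizes(path, algorithm="md5", blocks=(1048576, 32768, 512)):
    """Return {block_size: seconds} for one pass over the file per block size."""
    timings = {}
    for block in blocks:
        start = time.perf_counter()
        hash_file(path, algorithm, block)
        timings[block] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Hypothetical small test file; the question used a 2 GB file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for algorithm in ("md5", "sha1"):
            for block, seconds in time_block_sizes(tmp.name, algorithm).items():
                print(f"{algorithm} block={block}: {seconds:.4f}s")
    finally:
        os.unlink(tmp.name)
```

On a file this small the page cache dominates, so the absolute numbers say little; the point is only the shape of the measurement, with one timed pass per chunk size.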
I read the docs and the source code in Python2.7/Modules/{md5.c,md5.h,md5module.c} and gnulib lib/{md5.c,md5.h}. Both are implementations of MD5 (RFC 1321).
In gnulib, md5 reads the file in chunks of 32768 bytes.
I don't know much about MD5 or C source code. Could someone help me explain these results?
The other reason I want to ask this question is that many people take it for granted that md5sum is faster than Python's hashlib, and prefer to use md5sum when writing Python code. That seems to be wrong.
coreutils had its own C implementation, whereas Python calls out to libcrypto, which has architecture-specific assembly implementations. The difference is even greater with sha1. This has been fixed in coreutils-8.22 (when configured with --with-openssl), and is enabled in newer distros such as Fedora 21, RHEL 7, and Arch.
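One rough way to see whether a given Python build takes the libcrypto path is to compare the public constructor with the OpenSSL-backed one. This is a sketch that relies on a CPython implementation detail: `_hashlib` is a private module and may be missing or laid out differently in other builds, and the helper name `openssl_backed` is hypothetical.

```python
import hashlib


def openssl_backed(algorithm="md5"):
    """Best-effort check: is hashlib.<algorithm> the OpenSSL-provided constructor?"""
    try:
        import _hashlib  # private CPython module wrapping OpenSSL's libcrypto
    except ImportError:
        # Built without OpenSSL: hashlib falls back to its bundled implementations.
        return False
    constructor = getattr(hashlib, algorithm, None)
    openssl_constructor = getattr(_hashlib, "openssl_" + algorithm, None)
    return constructor is not None and constructor is openssl_constructor


if __name__ == "__main__":
    for name in ("md5", "sha1"):
        print(name, "uses OpenSSL:", openssl_backed(name))
```

A `False` result means only that this interpreter is using its own bundled implementation for that algorithm; it says nothing about how coreutils was built.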
Note that calling out to the command, though slower on some systems, is a better long-term strategy, as one can take advantage of the logic encapsulated within the separate commands rather than reimplementing it. For example, in coreutils there is pending support for improved reading of sparse files, so that zeros are not redundantly read from the kernel. It is better to take advantage of that transparently if possible.
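A minimal sketch of that strategy in Python 3, assuming the external command prints its digest as the first whitespace-separated token (as GNU md5sum does): use the command when it is installed and fall back to hashlib otherwise. The function name `file_md5` and its parameters are hypothetical.

```python
import hashlib
import os
import shutil
import subprocess


def file_md5(path, command="md5sum", block=1048576):
    """Return the MD5 hex digest, preferring the external command if present."""
    exe = shutil.which(command)
    if exe is not None:
        # An absolute path never starts with "-", so it cannot be taken as an option.
        result = subprocess.run(
            [exe, os.path.abspath(path)], capture_output=True, text=True
        )
        if result.returncode == 0 and result.stdout:
            # GNU md5sum prefixes the line with a backslash when it escapes the
            # file name; strip it so only the digest remains.
            return result.stdout.split()[0].lstrip("\\")
    # Fallback: hash in-process, chunk by chunk, as get_hash does.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()
```

Both paths return the same hex string, so callers do not need to know which one ran.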