C# - Hash contents of MS Office documents without metadata

August 15, 2010

I am trying to identify files with duplicate contents. I decided to do the comparison using a hashing mechanism (MD5, SHA-1, or other). This works fine for ".txt" files. However, for MS Office files (.doc, .docx, .xls, etc.) it fails.
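For reference, the whole-file comparison described here can be sketched roughly as follows. This is only an illustration: the class and method names (`FileHasher`, `HashFile`) are made up for the example, and SHA-1 is just one of the algorithms mentioned. It hashes every byte of the file, so anything the application stores besides the visible text goes into the hash as well.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper for illustration; not part of any library.
static class FileHasher
{
    // Returns the SHA-1 hash of the raw bytes of a file as a lowercase hex string.
    public static string HashFile(string path)
    {
        using (var sha1 = SHA1.Create())
        using (var stream = File.OpenRead(path))
        {
            byte[] digest = sha1.ComputeHash(stream);
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Two files are then treated as duplicates when `FileHasher.HashFile(a) == FileHasher.HashFile(b)`.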

The MD5/SHA-1 hash is not constant for MS Office files, even if they have the same "text" content. I assume MS Office stores some kind of metadata in the file, which changes each time you save the file. Hence the hash is different.

E.g., I have a file abc.doc and compute a hash (hash1) for it. I open it, change one word, and save the file. Then I undo the change I made, save again, and compute the hash (hash2). hash1 != hash2 in this case. The hashes are the same if I try this on a ".txt" file.

Is there a way to de-dupe MS Office documents based on hashing their contents? Can I hash the contents of the file and not the metadata?

I don't think this can be done without extracting the text of the document using some tool and then hashing the text. I can recommend Stellent Outside In, now owned by Oracle. It may be an overkill solution for your needs. They provide a tool to extract text from many types of files, including Office files and their various versions.
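To make the "extract the text, then hash the text" idea concrete, here is a minimal sketch for the .docx case only. It does not use Outside In. Instead it relies on the fact that a .docx file is a ZIP package and assumes the body text lives in the `word/document.xml` part, which is the usual location but not guaranteed. The names `DocxTextHasher`, `ExtractText` and `HashText` are invented for the example. Headers, footers, footnotes and nested structures such as text boxes are not handled, and the older binary formats (.doc, .xls) would still need a separate extraction tool.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper for illustration; handles .docx only.
static class DocxTextHasher
{
    // WordprocessingML namespace used by elements in word/document.xml.
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Pulls the visible text out of the main document part, one line per paragraph.
    // Assumption: the main part is stored at "word/document.xml".
    public static string ExtractText(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Concatenate the text runs (w:t) inside each paragraph (w:p).
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(
                        p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join("\n", paragraphs);
            }
        }
    }

    // Hashes the extracted text (UTF-8 encoded) and returns lowercase hex.
    public static string HashText(string text)
    {
        using (var sha256 = SHA256.Create())
        {
            byte[] digest = sha256.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Convenience wrapper: hash of the text content of a .docx file on disk.
    public static string HashDocx(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        {
            return HashText(ExtractText(fs));
        }
    }
}
```

With this approach, two saves of the same document should produce the same value from `DocxTextHasher.HashDocx(path)` as long as the text is unchanged, because the document properties and the ZIP container details never reach the hash. The trade-off is that formatting-only differences are also ignored, which may or may not be what a de-dupe job wants.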
