C# - Hash contents of MS Office documents without metadata
I am trying to identify files with duplicate contents. I decided to do the comparison using a hashing mechanism (MD5, SHA-1 or other). This works fine for ".txt" files. However, for MS Office files (.doc, .docx, .xls, etc.) it fails.
The MD5/SHA-1 hash is not constant for MS Office files, even if they have the same "text" content. I assume MS Office stores some kind of metadata in the file, which changes each time you save the file, so the hash is different.
E.g. I have a file abc.doc, and I compute the hash (hash1) for it. I open it, change one word, and save the file. Then I undo the change I made, save again, and compute the hash (hash2). hash1 != hash2 in this case. The hashes are the same if I try this on a ".txt" file.
Is there a way to de-dupe MS Office documents based on hashing their contents? Can I hash just the contents of the file, and not the metadata?
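For reference, the whole-file hashing described in the question could look roughly like the sketch below. The class and method names (`FileHasher`, `Sha1OfFile`) are made up for illustration. Because it hashes every byte of the file, anything Office rewrites on save changes the result.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class FileHasher
{
    // Hypothetical helper: hashes the raw bytes of the file, metadata included.
    public static string Sha1OfFile(string path)
    {
        using (FileStream fs = File.OpenRead(path))
        using (SHA1 sha1 = SHA1.Create())
        {
            byte[] digest = sha1.ComputeHash(fs);
            // Render the digest as a lowercase hex string.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```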
I don't think this can be done without extracting the text of the document using some tool and then hashing the text. I can recommend Stellent Outside In, now owned by Oracle. It may be an overkill solution for your needs, but it provides a tool to extract text from many types of files, including Office files of all versions.
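As a rough illustration of the "extract the text, then hash it" idea, here is a minimal sketch that does not use Outside In. It assumes the file is a .docx, which is a ZIP package whose main text lives in `word/document.xml`. Legacy binary formats such as .doc and .xls need a real extraction tool. The names `DocxContentHasher`, `ExtractText`, `HashText` and `HashDocxContent` are hypothetical, and SHA-256 is just one choice of hash.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Security.Cryptography;
using System.Text;
using System.Xml.Linq;

public static class DocxContentHasher
{
    // WordprocessingML main namespace used by elements in word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of the main document part, one line per paragraph.
    // Assumption: only word/document.xml matters; headers, footers and
    // footnotes are stored in other parts and are ignored here.
    public static string ExtractText(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a .docx package?");

            using (Stream part = entry.Open())
            {
                XDocument xml = XDocument.Load(part);
                // Concatenate the text runs (w:t) inside each paragraph (w:p),
                // so the way Word splits a paragraph into runs does not matter.
                var paragraphs = xml.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
                return string.Join("\n", paragraphs);
            }
        }
    }

    // Hashes the extracted text (UTF-8 encoded) and returns lowercase hex.
    public static string HashText(string text)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }

    // Convenience wrapper: content hash of a .docx file on disk.
    public static string HashDocxContent(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return HashText(ExtractText(fs));
    }
}
```

Two .docx files would then be treated as duplicates when `HashDocxContent` returns the same string for both. Formatting, embedded images and document properties are deliberately ignored, so decide whether "same text" is really the definition of duplicate you want.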