C# - Remove All Duplicates in a Large Text File

January 15, 2013

I'm stumped by this problem, and as a result I have stopped working for a while. I work with very large pieces of data: I get approximately 200 GB of .txt data every week. The data can range up to 500 million lines, and a lot of these are duplicates. I'd guess only 20 GB is unique. I have had several custom programs made, including hash-based duplicate removal and external duplicate removal, but none seem to work. The latest one used a temp database but took several days to remove the data.

The problem with all of these programs is that they crash after a certain point, and after spending a large amount of money on them I thought I would come online and see if anyone can help. I understand this has been answered on here before, and I have spent the last 3 hours reading about 50 threads on here, but none seem to have the same problem as me, i.e. huge datasets.

Can anyone recommend anything for me? It needs to be super accurate and fast, and preferably not memory-based, as I only have 32 GB of RAM to work with.

The standard way to remove duplicates is to sort the file and then do a sequential pass to remove duplicates. Sorting 500 million lines isn't trivial, but it's certainly doable. A few years ago I had a daily process that would sort 50 to 100 gigabytes on a 16 GB machine.
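The sequential pass is the easy half. As a minimal sketch in Python (the function names are my own, and it assumes the input is already sorted and that every line, including the last, ends with a newline), it only has to compare each line with the one before it:

```python
from typing import Iterable, Iterator


def unique_sorted(lines: Iterable[str]) -> Iterator[str]:
    """Yield each distinct line once, assuming equal lines are adjacent."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
            previous = line


def dedupe_sorted_file(src_path: str, dst_path: str) -> None:
    """Copy an already sorted file to dst_path, skipping repeated lines."""
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        # The file is streamed line by line, so memory use stays small.
        dst.writelines(unique_sorted(src))
```

Only one line is held in memory at a time, so this pass is bounded by disk speed rather than by RAM.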

By the way, you might be able to do this with an off-the-shelf program. The GNU sort utility can sort a file larger than memory. I've never tried it on a 500 GB file, but you might give it a shot. You can download it along with the rest of the GNU Core Utilities. The utility has a --unique option, so you should be able to just run sort --unique input-file > output-file. It uses a technique similar to the one I describe below. I'd suggest trying it on a 100 megabyte file first, then working up to larger files.

With GNU sort, and with the technique I describe below, it will perform a lot better if the input and temporary directories are on separate physical disks. Put the output either on a third physical disk or on the same physical disk as the input. You want to reduce I/O contention as much as possible.
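If you want to drive GNU sort from a script, something like the following Python sketch would do it. The helper names and all paths are hypothetical. The --unique, --temporary-directory and --output options are standard GNU sort options, but I haven't run this at anything like the scale in the question. On Windows a bare "sort" may resolve to the built-in sort.exe rather than the GNU one, so you may need to pass the full path to the GNU binary.

```python
import subprocess
from typing import List


def build_sort_command(input_path: str, output_path: str,
                       temp_dir: str, sort_exe: str = "sort") -> List[str]:
    """Build a GNU sort command line that sorts and drops duplicate lines."""
    return [
        sort_exe,
        "--unique",                            # emit each distinct line once
        f"--temporary-directory={temp_dir}",   # ideally a different physical disk
        f"--output={output_path}",
        input_path,
    ]


def sort_unique(input_path: str, output_path: str, temp_dir: str) -> None:
    """Run GNU sort and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_sort_command(input_path, output_path, temp_dir),
                   check=True)
```

Pointing the temporary directory at a separate disk is how you would apply the I/O advice above when using the off-the-shelf tool.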

There might also be a commercial (i.e. pay) program that will do the sorting. Developing a program that will sort a huge text file efficiently is a non-trivial task. If you can buy something for a few hundred dollars, you're probably money ahead if your time is worth anything.

If you can't use a ready-made program, then . . .

If your text is in multiple smaller files, the problem is easier to solve. You start by sorting each file, removing duplicates from those files, and writing the sorted temporary files that have the duplicates removed. Then run a simple n-way merge to merge the files into a single output file that has the duplicates removed.

If you have a single file, you start by reading as many lines as you can into memory, sorting those, removing duplicates, and writing a temporary file. You keep doing that for the entire large file. When you're done, you have some number of sorted temporary files that you can then merge.

In pseudocode, it looks something like this:

filenumber = 0
while not end-of-input
    load as many lines as you can into a list
    sort the list
    filename = "file"+filenumber
    write sorted list to filename, optionally removing duplicates
    filenumber = filenumber + 1
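Here is roughly how that loop might look in Python. This is a sketch, not the code from my articles: the function name, the temp_dir parameter and the max_lines default are placeholders I made up, and you would tune max_lines to whatever fits in your RAM.

```python
import itertools
import os
from typing import Iterable, List


def write_sorted_chunks(lines: Iterable[str], temp_dir: str,
                        max_lines: int = 1_000_000) -> List[str]:
    """Split lines into sorted, duplicate-free temporary files."""
    paths = []
    source = iter(lines)
    for file_number in itertools.count():
        # Load at most max_lines lines into memory.
        chunk = list(itertools.islice(source, max_lines))
        if not chunk:
            break
        # Give an unterminated final line a newline so it compares
        # equal to its terminated duplicates.
        chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
        path = os.path.join(temp_dir, f"file{file_number}")
        with open(path, "w", encoding="utf-8") as out:
            # set() drops duplicates within the chunk; sorted() orders the rest.
            out.writelines(sorted(set(chunk)))
        paths.append(path)
    return paths
```

You would call it with an open file object as `lines`. Duplicates that fall in different chunks survive this step and are removed later, during the merge.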

You don't have to remove the duplicates from the temporary files, but if your unique data is really only 10% of the total, you'll save a huge amount of time by not outputting duplicates to the temporary files.

Once all of your temporary files are written, you need to merge them. From your description, I figure each chunk that you read from the file will contain somewhere around 20 million lines. So you'll have maybe 25 temporary files to work with.

You now need to do a k-way merge. That's done by creating a priority queue. You open each file, read the first line from each file, and put it into the queue along with a reference to the file that it came from. Then you take the smallest item from the queue and write it to the output file. To remove duplicates, you keep track of the previous line that you output, and you don't output the new line if it's identical to the previous one.

Once you've output the line, you read the next line from the file that the one you just output came from, and add that line to the priority queue. You continue this way until you've emptied all of the files.
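A minimal Python sketch of that merge, using the standard heapq module as the priority queue. The function name is mine. It assumes each temporary file is already sorted in the same order Python uses to compare strings, and that every line ends with a newline.

```python
import heapq
from typing import List


def merge_unique(chunk_paths: List[str], out_path: str) -> None:
    """K-way merge of sorted files into out_path, dropping duplicate lines."""
    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        # Seed the queue with the first line of each non-empty file,
        # paired with the index of the file it came from.
        heap = []
        for index, f in enumerate(files):
            line = f.readline()
            if line:
                heap.append((line, index))
        heapq.heapify(heap)

        previous = None
        with open(out_path, "w", encoding="utf-8") as out:
            while heap:
                line, index = heapq.heappop(heap)   # smallest remaining line
                if line != previous:                # skip repeats of the last output
                    out.write(line)
                    previous = line
                next_line = files[index].readline() # refill from the same file
                if next_line:
                    heapq.heappush(heap, (next_line, index))
    finally:
        for f in files:
            f.close()
```

With around 25 files the queue never holds more than 25 entries, so memory use is tiny and the run time is dominated by reading and writing. Python's standard library also has heapq.merge, which performs the same kind of merge if you'd rather not manage the queue yourself.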

I published a series of articles some time ago about sorting a very large text file. It uses the technique I described above. The only thing it doesn't do is remove duplicates, but that's a simple modification to the methods that output the temporary files and to the final output method. Even without optimizations, the program performs quite well. It won't set any speed records, but it should be able to sort and remove duplicates from 500 million lines in less than 12 hours. Probably much less, considering that the second pass is only working with a small percentage of the total data (because you removed duplicates from the temporary files).

One thing you can do to speed up the program is to operate on smaller chunks, sorting one chunk in a background thread while you're loading the next chunk into memory. You end up having to deal with more temporary files, but that's really not a problem. The heap operations are slightly slower, but that time is more than recaptured by overlapping the input and output with the sorting. You end up getting the I/O essentially for free. At typical hard drive speeds, loading 500 gigabytes will take somewhere in the neighborhood of two and a half to three hours.

Take a look at the article series. It's many different, mostly small, articles that take you through the entire process that I describe, and it presents working code. I'm happy to answer any questions you might have about it.
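The overlap can be sketched in Python with a single background worker. The names here are hypothetical and the max_lines default is a placeholder. It assumes every line ends with a newline. How much time you actually win depends on the runtime: in CPython the sort holds the interpreter lock, so the gain comes mainly from time the reader spends waiting on the disk.

```python
import itertools
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def _sort_and_write(chunk: List[str], path: str) -> str:
    """Sort one chunk, drop its duplicates, and write it to path."""
    with open(path, "w", encoding="utf-8") as out:
        out.writelines(sorted(set(chunk)))
    return path


def write_chunks_overlapped(lines: Iterable[str], temp_dir: str,
                            max_lines: int = 1_000_000) -> List[str]:
    """Like a plain chunk sort, but sorts chunk N while chunk N+1 is loading."""
    source = iter(lines)
    paths = []
    pending = None  # future for the chunk currently being sorted
    with ThreadPoolExecutor(max_workers=1) as pool:
        for file_number in itertools.count():
            # Load the next chunk while the previous one sorts in the background.
            chunk = list(itertools.islice(source, max_lines))
            if pending is not None:
                # Wait here so at most two chunks are in memory at once.
                paths.append(pending.result())
                pending = None
            if not chunk:
                break
            path = os.path.join(temp_dir, f"file{file_number}")
            pending = pool.submit(_sort_and_write, chunk, path)
    return paths
```

Because two chunks are alive at the same time, each chunk has to be about half the size you would use for the single-threaded version, which is why you end up with more temporary files.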
