pandas - Better way to compare all items in a dataframe and replace similar items with fuzzy matching in Python
I'm wondering if there's a better way to compare the items in a dataframe column to each other and replace items that have a high fuzzy set matching score. I ended up using combinations, which feels memory-intensive and inefficient. My code is below.
To clarify: the central question here is not the fuzzy matching aspect, but the aspect of comparing the items in a list to each other and replacing the items that match.
import pandas as pd
from itertools import combinations
from fuzzywuzzy import fuzz

newl = list(true_df2.name.unique())

def remove_duplicate_names(newl, name, origdf, namesave):
    """
    Removes duplicate names by replacing longer names with shorter ones.

    Takes in:
    (1) newl: a list of unique names from which generic words have been stripped out.
    (2) name: the name of the dataframe column.
    (3) origdf: the original dataframe being rewritten.
    (4) namesave: the name of the saved matchedwords file, e.g. 'save1'.
        Created (4) because the file takes a long time to run.

    Returns a dataframe.
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[name].unique())
    if isinstance(newl, list):
        cnl = list(combinations(newl, 2))
    matchword = []
    for i in cnl:
        fp = fuzz.partial_ratio(i[0], i[1])
        if len(i[0]) > 3 and len(i[1]) > 3:
            if not i[0] == i[1]:
                # if i[0] or i[1] == 'york university':
                #     continue
                # I can edit these conditions to make matches more or less strict:
                # higher values mean more strict, and using more criteria with
                # 'and' means more strict.
                if fp >= 98:
                    shortstr = min(i, key=len)
                    longstr = max(i, key=len)
                    matchword.append((shortstr, longstr))
    for pair in matchword:
        # In each spot where the longer string appears, replace it with the shorter string.
        print('pair', pair)
        print(origdf[name][origdf[name].str.contains(pair[1])])
        # origdf[name][origdf[name].str.contains(pair[1])] = pair[0].strip()
        origdf.loc[origdf[name].str.contains(pair[1]), 'name'] = pair[0]
    return origdf
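For reference, here is a minimal sketch of one way to reduce the memory footprint: consume combinations() lazily as a generator instead of materializing the full pair list, collect the (short, long) matches into a dict, and then do the replacements in a single pass over the dataframe. true_df2 and the 'name' column are carried over from the code above; build_replacements is a hypothetical helper, and the length and score thresholds mirror the ones in the question.

from itertools import combinations
from fuzzywuzzy import fuzz

def build_replacements(names, min_len=4, threshold=98):
    # Hypothetical helper: lazily compare unique names pairwise and map
    # each longer variant to its shorter match. Iterating combinations()
    # directly means the full list of pairs is never held in memory.
    mapping = {}
    for a, b in combinations(names, 2):
        if len(a) >= min_len and len(b) >= min_len and a != b:
            if fuzz.partial_ratio(a, b) >= threshold:
                short, long_ = min((a, b), key=len), max((a, b), key=len)
                mapping[long_] = short
    return mapping

mapping = build_replacements(true_df2['name'].unique())
for long_, short in mapping.items():
    # Same substring-based replacement as above, via .loc instead of .ix.
    true_df2.loc[true_df2['name'].str.contains(long_, regex=False), 'name'] = short

The pairwise comparison is still O(n^2) in the number of unique names; the saving here is mostly memory, since only the matching pairs are kept rather than every combination.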