pandas - Better way to compare all items in a dataframe and replace similar items with fuzzy matching in Python
I'm wondering if there's a better way to compare the items in a dataframe column to each other and replace items that have a high fuzzy set matching score. I ended up using combinations, which feels memory-intensive and inefficient. My code is below.
To clarify: the central question here is not the fuzzy matching aspect, but the aspect of comparing the items in a list to each other and replacing the items that match.
import pandas as pd
from itertools import combinations
from fuzzywuzzy import fuzz

newl = list(true_df2.name.unique())

def remove_duplicate_names(newl, name, origdf, namesave):
    """
    Removes duplicate names by replacing longer names with shorter ones.

    Takes in:
    (1) newl: a list of unique names from which generic words have been stripped out.
    (2) name: the name of the dataframe column.
    (3) origdf: the original dataframe being rewritten.
    (4) namesave: the name of the saved matchedwords file, e.g. 'save1'.
        Created (4) because the file takes a long time to run.

    Returns a dataframe.
    """
    if isinstance(newl, pd.DataFrame):
        newl = list(newl[name].unique())
    if isinstance(newl, list):
        cnl = list(combinations(newl, 2))
    matchword = []
    for i in cnl:
        fp = fuzz.partial_ratio(i[0], i[1])
        if len(i[0]) > 3 and len(i[1]) > 3:
            if not i[0] == i[1]:
                # if i[0] or i[1] == 'york university':
                #     continue
                # I can edit these conditions to make matches more or less strict:
                # higher values mean more strict, and using more criteria with
                # 'and' means more strict.
                if fp >= 98:
                    shortstr = min(i, key=len)
                    longstr = max(i, key=len)
                    matchword.append((shortstr, longstr))
    for pair in matchword:
        # In each spot where the longer string appears, replace it with the shorter string.
        print('pair', pair)
        print(origdf[name][origdf[name].str.contains(pair[1])])
        # origdf[name][origdf[name].str.contains(pair[1])] = pair[0].strip()
        origdf.loc[origdf[name].str.contains(pair[1]), 'name'] = pair[0]
    return origdf
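For reference, here is a minimal sketch of one way to reduce the memory footprint: consume combinations() lazily as a generator instead of materializing the full pair list, collect the (short, long) matches into a dict, and then do the replacements in a single pass over the dataframe. true_df2 and the 'name' column are carried over from the code above; build_replacements is a hypothetical helper, and the length and score thresholds mirror the ones in the question.

from itertools import combinations
from fuzzywuzzy import fuzz

def build_replacements(names, min_len=4, threshold=98):
    # Hypothetical helper: lazily compare unique names pairwise and map
    # each longer variant to its shorter match. Iterating combinations()
    # directly means the full list of pairs is never held in memory.
    mapping = {}
    for a, b in combinations(names, 2):
        if len(a) >= min_len and len(b) >= min_len and a != b:
            if fuzz.partial_ratio(a, b) >= threshold:
                short, long_ = min((a, b), key=len), max((a, b), key=len)
                mapping[long_] = short
    return mapping

mapping = build_replacements(true_df2['name'].unique())
for long_, short in mapping.items():
    # Same substring-based replacement as above, via .loc instead of .ix.
    true_df2.loc[true_df2['name'].str.contains(long_, regex=False), 'name'] = short

The pairwise comparison is still O(n^2) in the number of unique names; the saving here is mostly memory, since only the matching pairs are kept rather than every combination.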