python 2.7 - Fetching top n records in a pandas pivot, based on multiple criteria, and plotting them with matplotlib
Use case: extending the pivot functionality of pandas. Fetch the top n records and plot, for each name, its own "click %" against its number of records.
    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'],
                        'click': [1, 1, 0, 1, 1, 0]})

       click name
    0      1    a
    1      1    a
    2      0    b
    3      1    b
    4      1    c
    5      0    a

    [6 rows x 2 columns]

    # fraction of records present & clicks as a fraction of its own records present
    # (rows= is the keyword in older pandas; newer versions use index=)
    f = df1.pivot_table(rows='name', aggfunc=[len, np.sum])
    f['len']['click']/sum(f['len']['click']), f['sum']['click']/sum(f['sum']['click'])

    (name
    a    0.500000
    b    0.333333
    c    0.166667
    Name: click, dtype: float64, name
    a    0.50
    b    0.25
    c    0.25
    Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object supported by matplotlib. I tried storing the "top names" (a, b, c, etc.) by creating a dict from the output of f['len']['click']/sum(f['len']['click']) and sorting its values. After that I stored the "click %" [a -> 0.50, b -> 0.25, c -> 0.25] in the same dictionary.
**Since this is overkill, I am wondering if there is a more Pythonic way?**
I tried head with a groupby clause, but it doesn't give me what I am looking for. I am looking for a DataFrame like the output above (a 0.500000, b 0.333333, c 0.166667 and a 0.50, b 0.25, c 0.25), except that the top n logic should be embedded. head(n) does not work because n depends on the data set; I guess I need to use "apply"? After that, the resulting object needs to be recognised by matplotlib with its own labels (the top n "name" values here).
Here's the dict function implementation:

    # overkill to fetch top n by the custom criteria above
    def freq_counts(df_var, n):
        # df_var is df1.name, to make the top n logic generic for each column name
        perct_freq = dict((df_var.value_counts() * 100) / len(df_var))
        vec = []
        # note: n acts as a percentage cutoff here, not as a count of records
        for key, value in perct_freq.items():
            if value >= n:
                vec.append([key, value])
        return vec

    freq_counts(df1.name, 3)  # e.g. top 3 freq counts of names; vec[i][0] has the corresponding keys

    # In this example, when I calculate "perct_freq" it is a Series object;
    # ideally I want to avoid converting it to a dict - overkill!
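One way the dict conversion might be avoided is to keep everything as a Series and filter it directly. The following is only a sketch under assumptions: `freq_share`, `min_pct` and `top` are hypothetical names, not part of the original post. It relies on `value_counts(normalize=True)` and `Series.nlargest`, which exist in current pandas but may not be available in the older version used above.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'],
                    'click': [1, 1, 0, 1, 1, 0]})

def freq_share(col, min_pct=None, top=None):
    """Percentage share of each value in `col`, kept as a Series.

    Hypothetical helper: `min_pct` mirrors the `value >= n` cutoff of
    freq_counts, while `top` keeps a fixed number of the largest shares.
    """
    share = col.value_counts(normalize=True) * 100  # sorted, largest first
    if min_pct is not None:
        share = share[share >= min_pct]  # boolean mask instead of a dict loop
    if top is not None:
        share = share.nlargest(top)      # keep only the `top` largest shares
    return share

by_cutoff = freq_share(df1['name'], min_pct=20)  # names holding at least 20% of rows
by_count = freq_share(df1['name'], top=2)        # the two most frequent names
```

Because the result is still a Series indexed by name, its index can serve directly as plot labels, so no separate key list is needed.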
- (1) Store the actual occurrences (len of names), and find the fraction of each "name" in the population.
- (2) Against this, find the "success outcome", and find its fraction of its own population.
- Finally, plot the top n name(s) with the output of (1) & (2) in the same plot. The criterion for top n should be based on the (1) percentage, i.e. (1) & (2) should use DataFrames to support a plot with the name labels on the x axis, (1) on the primary y axis and (2) on the secondary y axis.
PPS: in the code above, (1) is f['len']['click']/sum(f['len']['click']) and (2) is f['sum']['click']/sum(f['sum']['click']).
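The three steps could be sketched roughly as below, assuming a recent pandas where `groupby(...).agg` and `DataFrame.nlargest` are available. The function names `top_n_summary` and `plot_top_n` and the column names `record_share` and `click_share` are made up for illustration. It uses `groupby` rather than `pivot_table`, which is a swap of technique and not what the post's code does.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'],
                    'click': [1, 1, 0, 1, 1, 0]})

def top_n_summary(df, n, key='name', outcome='click'):
    """Return the n groups with the largest record share (hypothetical helper)."""
    grouped = df.groupby(key)[outcome].agg(['size', 'sum'])
    table = pd.DataFrame({
        # (1) fraction of all records that belong to each name
        'record_share': grouped['size'] / grouped['size'].sum(),
        # (2) fraction of all clicks that belong to each name
        'click_share': grouped['sum'] / grouped['sum'].sum(),
    })
    # the top n criterion is based on (1)
    return table.nlargest(n, 'record_share')

def plot_top_n(table):
    """Bar chart of (1) on the primary y axis, line of (2) on the secondary."""
    import matplotlib.pyplot as plt  # imported here so the table code runs without it
    positions = range(len(table))
    fig, ax = plt.subplots()
    ax.bar(positions, table['record_share'])
    ax.set_xticks(list(positions))
    ax.set_xticklabels(table.index)  # name labels on the x axis
    ax.set_ylabel('share of records')
    ax2 = ax.twinx()                 # secondary y axis sharing the same x axis
    ax2.plot(positions, table['click_share'], 'o-', color='darkorange')
    ax2.set_ylabel('share of clicks')
    return fig

top = top_n_summary(df1, 2)
```

Calling `plot_top_n(top)` followed by `matplotlib.pyplot.show()` should draw both measures for the selected names. Keeping the summary as a DataFrame indexed by name means the labels travel with the data, which is what the dict approach had to track by hand.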