python 2.7 - Fetching top n records in a pandas pivot, based on multiple criteria, and plotting them with matplotlib

September 15, 2011

Use case: extending the pivot functionality of pandas. I want to fetch the top n records and plot them, showing each name's own "click %" against the number of records for that name.

    import pandas as pd
    import numpy as np

    df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'], 'click': [1, 1, 0, 1, 1, 0]})

       click name
    0      1    a
    1      1    a
    2      0    b
    3      1    b
    4      1    c
    5      0    a

    [6 rows x 2 columns]

    # fraction of records present & clicks as a fraction of its own records present
    # (rows= is the keyword in older pandas; newer releases call it index=)
    f = df1.pivot_table(rows='name', aggfunc=[len, np.sum])
    f['len']['click']/sum(f['len']['click']), f['sum']['click']/sum(f['sum']['click'])

    (name
    a       0.500000
    b       0.333333
    c       0.166667
    Name: click, dtype: float64, name
    a       0.50
    b       0.25
    c       0.25
    Name: click, dtype: float64)
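For readers on a newer pandas, where `pivot_table` no longer accepts `rows=`, the same two fractions can probably be obtained with a plain `groupby`. This is only a sketch under that assumption; the column labels `record_share` and `click_share` are hypothetical names chosen for illustration and are not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'],
                    'click': [1, 1, 0, 1, 1, 0]})

# Per-name record count ('size') and click total ('sum').
stats = df1.groupby('name')['click'].agg(['size', 'sum'])

# Dividing each column by its own total gives the two fractions.
fractions = stats / stats.sum()
fractions.columns = ['record_share', 'click_share']  # hypothetical labels

print(fractions)
```

Both fractions end up in one dataframe indexed by name, which is convenient for sorting and plotting later.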

But to be able to plot them, I need to store the top n records in an object supported by matplotlib. I tried storing the "top names" (a, b, c, etc.) by creating a dict from the output of f['len']['click']/sum(f['len']['click']), sorted by values, and after that I stored the "click %" [a -> 0.50, b -> 0.25, c -> 0.25] in the same dictionary.

**Since this is overkill, I am wondering if there's a more pythonic way?**

I tried head and the groupby clause, but that doesn't give me what I am looking for. I am looking for a dataframe like the output above (a 0.500000, b 0.333333, c 0.166667 for the record fraction and a 0.50, b 0.25, c 0.25 for the click fraction), except that the top n logic should be embedded. head(n) does not work because n depends on the data set, so I guess I need to use "apply"? After that, the resulting object needs to be identified by matplotlib with its own labels (the top n "name" values here).

Here's the dict function implementation, which is overkill for fetching the top n by the custom criteria above:

    def freq_counts(df_var, n):
        # df_var is df1.name; this makes the top n logic generic for each column name
        perct_freq = dict((df_var.value_counts()*100)/len(df_var))
        vec = []
        # note: n is compared against the percentage, so it acts as a threshold
        for key, value in perct_freq.items():
            if value >= n:
                vec.append([key, value])
        return vec

    freq_counts(df1.name, 3)  # e.g. top 3 freq counts - names; vec[i][0] has the corresponding keys
    # In this example "perct_freq" is a Series object when calculated;
    # ideally I want to avoid converting it to a dict - overkill!
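One way to stay with a Series instead of converting to a dict might look like the sketch below. It assumes `value_counts(normalize=True)` is available in the installed pandas. The helper name `top_by_share` and its `min_pct` and `n` parameters are hypothetical; the threshold filter and the count-based cut are kept separate because the dict version compares `n` against a percentage.

```python
import pandas as pd

def top_by_share(col, min_pct=None, n=None):
    """Hypothetical helper: percentage share per value, kept as a Series."""
    # value_counts sorts in descending order, so the largest shares come first.
    pct = col.value_counts(normalize=True) * 100
    if min_pct is not None:
        pct = pct[pct >= min_pct]   # keep values whose share reaches the threshold
    if n is not None:
        pct = pct.head(n)           # keep only the n largest shares
    return pct

names = pd.Series(['a', 'a', 'b', 'b', 'c', 'a'], name='name')

print(top_by_share(names, min_pct=20))  # threshold-based selection
print(top_by_share(names, n=2))         # count-based selection
```

The result keeps the names as its index, so it can be passed to a plotting call without rebuilding the labels.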

(1) Store the actual occurrences (len of names) and find the fraction of each "name" in the population.
(2) Against this, find the "success outcome" and find its fraction of its own population.
(3) Finally, plot the top n name(s) with the output of (1) & (2) in the same plot. The criteria for top n should be based on the (1) percentage, i.e. (1) & (2) should use dataframes that support plotting with name labels on the x axis, (1) on the primary y axis and (2) on the secondary y axis.

PPS: in the code above, (1) is f['len']['click']/sum(f['len']['click']) and
(2) is f['sum']['click']/sum(f['sum']['click']).
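A possible way to put (1) and (2) on one plot is sketched below, with names on the x axis, (1) as bars on the primary y axis and (2) as a line on a secondary y axis created with `twinx()`. It assumes a pandas version that provides `DataFrame.nlargest`. The value of `n`, the column labels and the output file name are all hypothetical, and the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove for an interactive session
import matplotlib.pyplot as plt
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'a', 'b', 'b', 'c', 'a'],
                    'click': [1, 1, 0, 1, 1, 0]})
n = 2  # hypothetical choice of how many names to keep

stats = df1.groupby('name')['click'].agg(['size', 'sum'])
shares = pd.DataFrame({
    'record_share': stats['size'] / stats['size'].sum(),  # (1)
    'click_share': stats['sum'] / stats['sum'].sum(),     # (2)
})

# Top n is decided by (1), the record share.
top = shares.nlargest(n, 'record_share')

fig, ax = plt.subplots()
top['record_share'].plot(kind='bar', ax=ax, color='steelblue')
ax.set_ylabel('fraction of records (1)')

# Secondary y axis for (2); bars sit at positions 0..n-1.
ax2 = ax.twinx()
ax2.plot(range(len(top)), top['click_share'], color='darkorange', marker='o')
ax2.set_ylabel('fraction of clicks (2)')

fig.savefig('top_n_names.png')  # hypothetical output file
```

Because `top` keeps the names as its index, the bar plot labels the x axis with them directly.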
