java - Labeled Latent Dirichlet Allocation input values
I am doing tag prediction and keyword extraction on StackExchange posts. I have ~36,000 posts consisting of title, body, and tags. I process them by filtering out noisy elements. After that, I perform Labeled Latent Dirichlet Allocation (LLDA) with the implementation obtained here.
When looking at the output, the majority of the first half of the topic-keyword assignments are pretty good, for example:
topic 0: hardware
    hardware 0.01417490938078998
    apple 0.007714736647543383
    macbook 0.004179344296774437
    mac 0.003794235182959134
topic 1: mac
    mac 0.09533364420104305
    os 0.02075003721054881
    mini 0.00682593613383348
    macs 0.00435445224274711
topic 2: powerpc
    powerpc 0.010548590021130589
    ppc 0.007893573342376935
    mac 0.0039821054483700795
    ibook 0.003731934198917873
    os 0.003471650527888505
However, the closer I come to the end of the output file, the weirder the topic-keyword assignments get:
topic 976: shopping-recommendation
    difference 7.5409094336777e-5
    intel 7.5409094336777e-5
    ppc 7.5409094336777e-5
    turn 7.5409094336777e-5
topic 977: pci-card
    difference 7.5409094336777e-5
    intel 7.5409094336777e-5
    ppc 7.5409094336777e-5
    turn 7.5409094336777e-5
topic 978: tmux
    difference 7.5409094336777e-5
    intel 7.5409094336777e-5
    ppc 7.5409094336777e-5
    turn 7.5409094336777e-5
topic 979:
    difference 7.5409094336777e-5
    intel 7.5409094336777e-5
    ppc 7.5409094336777e-5
    turn 7.5409094336777e-5
Can someone please explain why there are such wrong assignments at the end? And also, why are the values extremely low?
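One possible reading of the identical values, offered as a sketch and not a confirmed diagnosis: if the sampler estimates topic-word probabilities with the usual smoothed formula (count + beta) / (topic total + V * beta), then a topic that never received any tokens gets exactly 1/V for every word. The value 7.5409094336777e-5 equals 1/13261, so the code below assumes a vocabulary of 13261 words. That vocabulary size, the class name, and the `phi` helper are assumptions made for illustration; none of them come from the LLDA library itself.

```java
public class EmptyTopicPhi {

    // Standard smoothed topic-word estimate for collapsed Gibbs LDA:
    // (n_kw + beta) / (n_k + V * beta).
    // Assumption: the LLDA implementation uses an estimator of this form.
    static double phi(int wordTopicCount, int topicTotal, int vocabSize, double beta) {
        return (wordTopicCount + beta) / (topicTotal + vocabSize * beta);
    }

    public static void main(String[] args) {
        int vocabSize = 13261;  // assumption: inferred from 1 / 7.5409094336777e-5
        double beta = 0.1;      // value from the settings in the question

        // A topic with no tokens assigned: both counts are zero,
        // so only the smoothing term remains.
        double emptyTopic = phi(0, 0, vocabSize, beta);

        System.out.println("phi for an empty topic: " + emptyTopic);
        System.out.println("1 / V:                  " + (1.0 / vocabSize));
    }
}
```

If this reading applies, the repeated words (difference, intel, ppc, turn) would just be whichever words come first among equally weighted entries, and the labels in question would be tags that ended up with no tokens in the trained model.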
As I said before, I have ~36,000 posts, and these are the values I use to perform LLDA:
option.est = true;
option.alpha = 50/920;  // 920 is the number of topics
option.beta = 0.1;
option.niters = 3000;
option.twords = 15;
option.nburnin = 350;
option.samplinglag = 256;
I found little to no documentation for the previous values, so through trial and error I found that these fit best out of what I have managed to get. However, maybe someone with a better understanding can explain them to me and/or suggest which values are best?
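One detail in the settings that may be worth checking: if the `option.alpha = 50/920` line is written literally like that in Java, both operands are `int`, so the division is done in integer arithmetic and yields 0 before it is stored. The sketch below shows the difference. It assumes `alpha` is a `double` field, which the question does not confirm, and the class and method names are made up for illustration.

```java
public class AlphaDivision {

    // Integer division happens first, then the int result is widened to double.
    static double alphaFromIntDivision(int numTopics) {
        return 50 / numTopics;
    }

    // Making one operand a double forces floating-point division.
    static double alphaFromDoubleDivision(int numTopics) {
        return 50.0 / numTopics;
    }

    public static void main(String[] args) {
        int numTopics = 920;  // number of topics from the question

        System.out.println("50 / 920   = " + alphaFromIntDivision(numTopics));
        System.out.println("50.0 / 920 = " + alphaFromDoubleDivision(numTopics));
    }
}
```

The first call gives 0.0 and the second gives roughly 0.0543. How the library treats an alpha of 0 is not something the question tells us, so writing `50.0/920` only removes one possible source of surprise.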