
Probit Normal Correlated Topic Model

Read full paper at:
http://www.scirp.org/journal/PaperInformation.aspx?PaperID=52713#.VKH8NcCAM4

Author(s): X. Yu and E. Fokoué
The logistic normal distribution has recently been adapted, via the transformation of multivariate Gaussian variables, to model the topical distribution of documents in the presence of correlations among topics. In this paper, we propose a probit normal alternative approach to modeling correlated topical structures. Our use of the probit model in the context of topic discovery is novel, as many authors have so far concentrated solely on the logistic model, partly because of the formidable inefficiency of the multinomial probit model even in the case of very small topical spaces. We herein circumvent the inefficiency of multinomial probit estimation by adapting the diagonal orthant multinomial probit to the topic modeling context, which allows our topic modeling scheme to handle corpora with a large number of latent topics. An additional and very important benefit of our method is that, unlike the logistic normal model, whose non-conjugacy calls for sophisticated sampling schemes, our approach exploits the natural conjugacy inherent in the auxiliary formulation of the probit model to achieve greater simplicity. Applying our proposed scheme to a well-known Associated Press corpus not only uncovers a large number of meaningful topics but also captures compellingly intuitive correlations among certain topics. In addition, our proposed approach lends itself to even further scalability thanks to various existing high-performance algorithms and architectures capable of handling millions of documents.
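To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how one correlated Gaussian draw can be mapped to topic proportions in two ways: through the softmax map of the logistic normal model, and through category probabilities of the diagonal orthant probit form, assumed here to be proportional to Φ(η_k) ∏_{j≠k} (1 − Φ(η_j)). The number of topics `K`, the covariance `Sigma`, and all function names are hypothetical choices for illustration, and the exact parameterization used in the paper may differ.

```python
import math
import numpy as np


def std_normal_cdf(x):
    """Standard normal CDF, elementwise, via the error function."""
    x = np.asarray(x, dtype=float)
    return 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))


def logistic_normal_proportions(eta):
    """Softmax map from a Gaussian vector to the probability simplex."""
    e = np.exp(eta - np.max(eta))  # subtract the max for numerical stability
    return e / e.sum()


def diagonal_orthant_probit_proportions(eta):
    """Assumed diagonal orthant form:
    p_k proportional to Phi(eta_k) * prod_{j != k} (1 - Phi(eta_j)).
    Dividing every weight by prod_j (1 - Phi(eta_j)) leaves
    p_k proportional to Phi(eta_k) / (1 - Phi(eta_k)).
    """
    phi = std_normal_cdf(eta)
    log_w = np.log(phi) - np.log1p(-phi)
    w = np.exp(log_w - np.max(log_w))
    return w / w.sum()


rng = np.random.default_rng(0)

K = 4                 # hypothetical number of topics
mu = np.zeros(K)      # hypothetical mean of the Gaussian layer
Sigma = np.eye(K)     # hypothetical covariance:
Sigma[0, 1] = Sigma[1, 0] = 0.8  # topics 0 and 1 positively correlated

eta = rng.multivariate_normal(mu, Sigma)  # one document's latent vector
theta_logit = logistic_normal_proportions(eta)
theta_probit = diagonal_orthant_probit_proportions(eta)

print("logistic normal proportions:", np.round(theta_logit, 3))
print("probit normal proportions:  ", np.round(theta_probit, 3))
```

Both maps are monotone in each component of the Gaussian vector, so they rank the topics identically for a given draw and differ only in how mass is spread across them. The sketch covers only this forward map; the conjugate sampling that the auxiliary-variable formulation makes possible is not shown.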
Cite this paper
Yu, X. and Fokoué, E. (2014) Probit Normal Correlated Topic Model. Open Journal of Statistics, 4, 879-888. doi: 10.4236/ojs.2014.411083.
 

