APPLIED MATH SEMINAR

Speaker: David M. Blei, Princeton University

Title: Scalable Topic Modeling

When/where: Tuesday, March 1st, 4:15 PM, AKW 200

Abstract: Probabilistic topic modeling provides a suite of tools for the
unsupervised analysis of large collections of documents.  Topic
modeling algorithms can uncover the underlying themes of a collection
and decompose its documents according to those themes.  The resulting
analysis can be used for corpus exploration, document search, and a
variety of prediction problems.

In this talk, I will review the state of the art in probabilistic
topic models.  I will describe the ideas behind latent Dirichlet
allocation and its extensions, and I'll point to some of the software
available for topic modeling.

I will then describe our online strategy for fitting topic models,
which lets us analyze both massive document collections and
collections that arrive as a stream.  Specifically, we develop a
stochastic optimization algorithm for the variational objective
function.  Our algorithm can fit models to millions of articles in a
matter of hours.  I will present a study of 3.3M articles from
Wikipedia.  These results show that the online approach finds topic
models that are as good as or better than those found with
traditional inference algorithms.

(Joint work with Matthew Hoffman and Francis Bach)