CS Colloquium
September 13, 2012
4:00 p.m., AKW 200
Refreshments will be available at 3:45 p.m.
Speaker: David Mimno
Title: Finding thousands of topics in millions of books
Abstract: Statistical topic models have become popular in
domains as distinct as biomedical research, political science, and literary
scholarship. These methods represent text documents as combinations of
themes, or topics, which are themselves probability distributions over
a vocabulary. This low-dimensional topic representation is robust to variation
in word choice and ambiguity in word sense, allowing users to analyze
trends in large text collections. Existing methods for training topic
models, however, have not kept pace with the size of today's document
corpora. In this talk I will describe a new method that combines the best
aspects of two inference methods, stochastic online inference and Markov
chain Monte Carlo. I will demonstrate the scalability of this algorithm
on a corpus of 1.2 million out-of-copyright books.
Bio: David Mimno is a postdoctoral researcher in the
Computer Science department at Princeton University. He received his PhD
from the University of Massachusetts, Amherst. Before graduate school,
he served as Head Programmer at the Perseus Project, a digital library
for cultural heritage materials, at Tufts University. He is supported
by a CRA Computing Innovation Fellowship.