APPLIED MATH SEMINAR
Title: On the Dirichlet Process for Analyzing and Sorting Large Databases of Sequential Data
Speaker: Lawrence Carin, Duke University
When/where: Tuesday, April 22nd, 4:15 p.m., AKW 200
Abstract:
A new hierarchical nonparametric Bayesian framework is presented for analyzing and sorting large databases of sequential data, such as music. The models for multiple datasets, each characterized by sequential data, are learned jointly, and the inter-dataset relationships are obtained simultaneously. Within each dataset, we represent the sequential data with an infinite hidden Markov model (iHMM), avoiding the problem of model selection (selecting the number of states). Across the datasets, the multiple iHMMs are learned jointly, employing a nested Dirichlet process (nDP) representation. Beyond improving the learning of each model via appropriate data sharing, the learned sharing mechanisms are used to infer inter-dataset relationships of interest for data search. Specifically, they define the affinity matrix in a graph-diffusion sorting framework. To speed up inference for large databases, the nDP-iHMM is truncated to yield a nested Dirichlet-distribution-based HMM representation, which accommodates fast variational Bayesian (VB) analysis. The effectiveness of the framework is demonstrated using a database of 2500 digital music pieces.