#### John Langford: Imperative Learning

**Abstract:** Machine learning is often applied deep in the bowels of a process. Its application is particularly difficult when there is a complex decision process requiring multiple predictions. I will discuss a new method for machine learning that makes blending programming and machine learning much more natural. In essence, you can freely mix prediction and programming (whether functional or imperative) in complex structured prediction problems.

#### Hugo Larochelle: The Neural Autoregressive Distribution Estimator: Distribution Modeling as Sequential Prediction

**Abstract:** The problem of estimating the distribution of multivariate data is one of the most general problems addressed by machine learning research. Good models of multivariate distributions can be used to extract meaningful features from unlabeled data, to provide a good prior over some output space in a structured prediction task (e.g. image denoising) or to provide regularization cues to a supervised model (e.g. a neural network classifier).

The most successful distribution/density models are usually based on graphical models that represent the joint distribution p(x) of observations x using latent variables that explain the statistical structure within x. Mixture models are a well-known example and rely on a latent representation with mutually exclusive components. For high-dimensional data, however, models that rely on a distributed representation, i.e. a representation where components can vary separately, are usually preferred. These include the very successful restricted Boltzmann machine (RBM) model. However, unlike mixture models, the RBM is not a tractable distribution estimator, in the sense that the probability of observations p(x) can only be computed up to a multiplicative constant and must otherwise be approximated.

In this talk, I’ll describe the Neural Autoregressive Distribution Estimator (NADE), which combines the tractability of mixture models with the modeling power of the RBM, by relying on distributed representations. The main idea in NADE is to use the probability product rule to decompose the joint distribution p(x) into the sequential product of its conditionals, and model all these conditionals using a single neural network. Thus, the task of distribution modeling is now tackled as the task of predicting the observations within x sequentially, in some given (usually random) order.
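The product-rule decomposition can be made concrete with a small sketch for binary observations. It assumes a fixed left-to-right ordering, sigmoid units, and a single weight matrix shared across all conditionals; the parameter names `W`, `V`, `b`, `c` and the random, untrained weights are illustrative assumptions that go beyond what the abstract states. Because each conditional is a valid Bernoulli distribution, the probabilities of all binary vectors sum to one, which is the tractability the abstract contrasts with the RBM.

```python
import itertools
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def nade_log_prob(x, W, V, b, c):
    """Return log p(x) = sum_d log p(x_d | x_<d) for a binary vector x.

    W: H x D input-to-hidden weights, shared by every conditional (assumed)
    V: D x H hidden-to-output weights
    b: D output biases, c: H hidden biases
    """
    H = len(c)
    a = list(c)  # hidden pre-activations before any input has been seen
    log_p = 0.0
    for d, x_d in enumerate(x):
        h = [sigmoid(a_j) for a_j in a]  # depends only on x_<d
        p_d = sigmoid(b[d] + sum(V[d][j] * h[j] for j in range(H)))
        log_p += math.log(p_d if x_d == 1 else 1.0 - p_d)
        for j in range(H):  # fold x_d in for the next conditional
            a[j] += W[j][d] * x_d
    return log_p


# Random (untrained) parameters, purely for illustration.
random.seed(0)
D, H = 4, 3
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(H)]
V = [[random.gauss(0, 1) for _ in range(H)] for _ in range(D)]
b = [random.gauss(0, 1) for _ in range(D)]
c = [random.gauss(0, 1) for _ in range(H)]

# Exact normalization check: enumerate all 2^D binary vectors.
total = sum(
    math.exp(nade_log_prob(x, W, V, b, c))
    for x in itertools.product([0, 1], repeat=D)
)
print(f"sum of p(x) over all {2 ** D} binary vectors: {total:.6f}")
```

Updating the hidden pre-activations incrementally keeps the cost of evaluating all D conditionals at O(DH) rather than recomputing each hidden layer from scratch.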

I’ll describe how to model binary, real-valued and categorical observations with NADE. We’ll see that NADE can achieve state-of-the-art performance on many datasets, including datasets of document bags-of-words and speech spectrograms. I’ll also discuss how NADE can be used successfully for document information retrieval, image classification and scene annotation.

#### Csaba Szepesvári: Online Learning with Costly Features and Labels

**Abstract:** In this talk, we consider the online probing problem: in each round, the learner is able to purchase the values of a subset of features. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying to see the loss that he is evaluated against. Either way, the learner pays for the imperfections of his predictions and for whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether or not the learner can observe the label for free. We provide algorithms, along with upper and lower bounds on the regret, for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem.
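The round structure of the online probing problem can be sketched as a simple accounting loop. This is only an illustration of the protocol, not the algorithms from the talk: the function name `run_probing_protocol`, the synthetic data, the squared loss, and the placeholder learner (which buys features and labels at random) are all assumptions made for the example.

```python
import random


def run_probing_protocol(rounds, feature_cost, label_cost, seed=0):
    """Simulate the probing protocol and return the learner's total cost.

    feature_cost: per-feature purchase prices
    label_cost:   price of observing the label in a round (0 means free)
    """
    rng = random.Random(seed)
    n = len(feature_cost)
    weights = [0.0] * n
    total_cost = 0.0
    for _ in range(rounds):
        # Environment draws an instance (synthetic data, assumed here).
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        y = sum(x) / n

        # 1. Learner buys a subset of features (placeholder: random subset).
        bought = [i for i in range(n) if rng.random() < 0.5]
        total_cost += sum(feature_cost[i] for i in bought)

        # 2. Learner predicts from the purchased features only.
        y_hat = sum(weights[i] * x[i] for i in bought)

        # 3. The prediction loss is always incurred, observed or not.
        total_cost += (y_hat - y) ** 2

        # 4. Learner may pay to see the label and update (placeholder rule).
        if rng.random() < 0.5:
            total_cost += label_cost
            for i in bought:
                weights[i] -= 0.1 * 2.0 * (y_hat - y) * x[i]
    return total_cost


free_labels = run_probing_protocol(200, [0.05, 0.05, 0.05], label_cost=0.0)
paid_labels = run_probing_protocol(200, [0.05, 0.05, 0.05], label_cost=0.2)
print(free_labels, paid_labels)
```

With the same seed, the two runs follow identical trajectories, so the difference between them is exactly the total price paid for labels. A real learner would adapt its purchases to the prices, which is where the regret analysis in the talk comes in.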

This is joint work with Navid Zolghadr, Andras Gyorgy and Russ Greiner.