VIRTUAL ONLY: Workshop on Advances in Theory and Algorithms for Deep Reinforcement Learning
Institute for Computational and Experimental Research in Mathematics (ICERM)
August 2 – 4, 2021
Monday, August 2, 2021

10:00 – 10:45 am EDT: Gathertown Morning Coffee (Coffee Break, Virtual)

10:45 – 11:00 am EDT: Welcome (Virtual)
 Brendan Hassett, ICERM/Brown University

11:00 – 11:30 am EDT: A Boosting Approach to Reinforcement Learning (Virtual)
 Elad Hazan, Princeton University
Abstract
We will describe an algorithmic approach for learning in large Markov decision processes whose complexity is independent of the number of states. This task is in general computationally hard. We will present a boosting-inspired methodology that gives rise to provably efficient methods under certain weak learning conditions. No background in boosting or reinforcement learning is required for the talk.

11:45 am – 12:15 pm EDT: Algorithms for metric elicitation via the geometry of classifier statistics (Virtual)
 Sanmi Koyejo, University of Illinois at Urbana-Champaign
Abstract
Selecting a suitable metric for real-world machine learning applications remains an open problem, as default metrics such as classification accuracy often do not capture trade-offs relevant to the downstream decision-making. Unfortunately, there is limited formal guidance in the machine learning literature on how to select appropriate metrics. We are developing formal interactive strategies by which a practitioner may discover which metric to optimize, such that it recovers user or expert preferences. I will outline our current work on metric elicitation, including some open problems.

12:30 – 1:30 pm EDT: Lunch/Free Time (Virtual)

1:30 – 2:00 pm EDT: Towards a Theory of Representation Learning for Reinforcement Learning (Virtual)
 Alekh Agarwal, Microsoft
Abstract
Provably sample-efficient reinforcement learning from rich observational inputs remains a key open challenge in research. While impressive recent advances have allowed the use of linear modelling while carrying out sample-efficient exploration and learning, the handling of more general non-linear models remains limited. In this talk, we study reinforcement learning using linear models, where the features underlying the linear model are learned, rather than specified a priori. While the broader question of representation learning for useful embeddings of complex data has seen tremendous progress, doing so in reinforcement learning presents additional challenges: good representations cannot be discovered without adequate exploration, but effective exploration is challenging in the absence of good representations. Concretely, we study this question in the context of low-rank MDPs [Jiang et al., 2017, Jin et al., 2019, Yang and Wang, 2019], where the features underlying a state-action pair are not assumed to be known, unlike most prior works. We develop two styles of methods, model-based and model-free. For the model-based method, we learn an approximate factorization of the transition model, plan within the model to obtain a fresh exploratory policy, and then update our factorization with additional data. In the model-free technique, we learn features so that quantities such as value functions at subsequent states can be predicted linearly in those features. In both approaches, we address the intricate coupling between exploration and representation learning, and provide sample complexity guarantees. More details can be found at https://arxiv.org/abs/2006.10814 and https://arxiv.org/abs/2102.07035. [Based on joint work with Jinglin Chen, Nan Jiang, Sham Kakade, Akshay Krishnamurthy, Aditya Modi and Wen Sun]

2:15 – 2:45 pm EDT: Frank-Wolfe Methods in Probability Space (Virtual)
 Jose Blanchet, Stanford University
Abstract
We study the problem of minimizing a smooth convex functional of a probability measure. This formulation can be used to encompass a wide range of problems and algorithms of interest in diverse areas such as reinforcement learning, variational inference, deconvolution and adversarial training. We introduce and study a class of Frank-Wolfe algorithms for solving this problem, together with associated convergence guarantees that match finite-dimensional optimization results. We illustrate our results in the context of Wasserstein barycenter relaxations with unconstrained support and optimal deconvolution, among others.
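For intuition, here is a minimal finite-dimensional analogue of the method: Frank-Wolfe over the probability simplex, where the linear minimization oracle simply returns a vertex. The quadratic objective, dimension, and step-size schedule below are illustrative choices for the sketch, not taken from the paper.

```python
import numpy as np

def frank_wolfe_simplex(grad, dim, iters=500):
    """Frank-Wolfe on the probability simplex.

    The linear minimization oracle over the simplex is trivial: it
    returns the vertex (one-hot vector) of the coordinate with the
    smallest partial derivative.
    """
    p = np.full(dim, 1.0 / dim)          # start at the uniform distribution
    for k in range(iters):
        g = grad(p)
        vertex = np.zeros(dim)
        vertex[np.argmin(g)] = 1.0       # LMO: best vertex of the simplex
        step = 2.0 / (k + 2.0)           # classical O(1/k) step size
        p = (1 - step) * p + step * vertex
    return p

# Toy objective: recover a target distribution q,
# f(p) = 0.5 * ||p - q||^2, so grad f(p) = p - q.
q = np.array([0.7, 0.2, 0.1])
p = frank_wolfe_simplex(lambda p: p - q, dim=3)
```

Because every iterate is a convex combination of simplex vertices, feasibility is maintained for free, which is the main appeal of Frank-Wolfe in constrained (and here, probability-measure) settings.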

3:00 – 3:30 pm EDT: Coffee Break (Virtual)

3:30 – 4:00 pm EDT: Preference-based RL with finite-time guarantees (Virtual)
 Aarti Singh, Carnegie Mellon University
Abstract
As reinforcement learning is used for solving increasingly complex problems, eliciting meaningful labels and rewards for supervision becomes challenging. Preferences in the form of pairwise comparisons have emerged as an alternative feedback mechanism that is often easier to elicit and more accurate. Despite promising results in applications, the theoretical understanding of preference-based RL is still in its infancy. This talk will outline our efforts in understanding the fundamental limits of learning when given access to both preferences and labels, algorithms that achieve those limits, and some open questions.

4:15 – 4:45 pm EDT: Minimum complexity interpolation in random features models (Virtual)
 Andrea Montanari, Stanford University
Abstract
Despite their many appealing properties, kernel methods are heavily affected by the curse of dimensionality. For instance, in the case of inner product kernels in ℝ^d, the Reproducing Kernel Hilbert Space (RKHS) norm is often very large for functions that depend strongly on a small subset of directions (ridge functions). Correspondingly, such functions are difficult to learn using kernel methods. This observation has motivated the study of generalizations of kernel methods, whereby the RKHS norm, which is equivalent to a weighted ℓ_2 norm, is replaced by a weighted functional ℓ_p norm, which we refer to as the p-norm. Unfortunately, tractability of these approaches is unclear. The kernel trick is not available, and minimizing these norms requires solving an infinite-dimensional convex problem. We study random features approximations to these norms and show that, for p>1, the number of random features required to approximate the original learning problem is upper bounded by a polynomial in the sample size. Hence, learning with p-norms is tractable in these cases. We introduce a proof technique based on uniform concentration in the dual, which can be of broader interest in the study of overparametrized models.
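As a concrete (and much simpler) illustration of learning with random features, here is an ℓ_2-regularized regression on random Fourier features, which approximates the RBF kernel in expectation. This is not the weighted p-norm estimator studied in the talk; all sizes, the target function, and the regularization constant are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: a smooth 1-D target function.
x = rng.uniform(-3, 3, size=200)
y = np.sin(x)

# Random Fourier features: phi_j(x) = sqrt(2/N) * cos(w_j * x + b_j),
# with w_j ~ N(0, 1) and b_j ~ Unif[0, 2*pi], approximate the Gaussian
# (RBF) kernel in expectation.
n_feat = 400
w = rng.normal(0.0, 1.0, size=n_feat)
b = rng.uniform(0.0, 2 * np.pi, size=n_feat)
phi = lambda x: np.sqrt(2.0 / n_feat) * np.cos(np.outer(x, w) + b)

# Ridge regression in feature space (the ell_2 / RKHS-norm analogue).
lam = 1e-6
Phi = phi(x)
theta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_feat), Phi.T @ y)

pred = Phi @ theta
mse = np.mean((pred - y) ** 2)
```

With more features than samples and a tiny ridge, the fit is a near-interpolant; the talk asks what happens when the ℓ_2 penalty on `theta` is replaced by a weighted ℓ_p penalty.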

4:45 – 5:30 pm EDT: Gathertown Reception (Virtual)
Tuesday, August 3, 2021

10:30 – 11:00 am EDT: Gathertown Morning Coffee (Coffee Break, Virtual)

11:00 – 11:30 am EDT: Planning and Learning from Interventions (Virtual)
 Caroline Uhler, MIT
Abstract
Massive data collection holds the promise of a better understanding of complex phenomena and ultimately, of better decisions. An exciting opportunity in this regard stems from the growing availability of perturbation / intervention data (manufacturing, advertisement, education, genomics, etc.). In order to obtain mechanistic insights from such data, a major challenge is the development of a framework that integrates observational and interventional data. I will present such a causal framework and discuss how it allows predicting the effect of yet unseen interventions and identifying the optimal interventions to perform.

11:45 am – 12:15 pm EDT: Batch Value Function Tournament (Virtual)
 Nan Jiang, University of Illinois Urbana-Champaign
Abstract
Offline RL has attracted significant attention from the community as it offers the possibility of applying RL when active data collection is difficult. A key missing ingredient, however, is a reliable model-selection procedure that enables hyperparameter tuning; reduction to off-policy evaluation either suffers exponential variance or relies on additional hyperparameters, creating a chicken-and-egg problem. In this talk I will discuss our recent progress on a version of this problem, where we need to identify Q* from a large set of candidate functions using a polynomial-sized exploratory dataset. The question is also a long-standing open problem about the information-theoretic nature of batch RL, and many suspected that the task is simply impossible. In our recent work, we provide a solution to this seemingly impossible task via (1) a tournament procedure that performs pairwise comparisons, and (2) a clever trick that partitions the large state-action space adaptively according to the compared functions. The resulting algorithm, BVFT, is very simple and can be readily applied to cross-validation, with preliminary empirical results showing promising performance.
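As a heavily simplified sketch, the round-robin structure of such a tournament can be written as follows. The `pairwise_loss` callable here is an abstraction: BVFT's actual comparison loss is computed from data over a partition of the state-action space induced by the pair of functions, whereas the toy loss below is just the mutual distance between candidate vectors.

```python
import numpy as np

def tournament_select(candidates, pairwise_loss):
    """Pick the candidate whose worst pairwise-comparison loss is smallest.

    This is only the selection skeleton; in BVFT, pairwise_loss(f, g)
    would be a data-driven Bellman-error estimate over the partition
    induced by the pair (f, g).
    """
    n = len(candidates)
    worst = np.full(n, -np.inf)
    for i in range(n):
        for j in range(n):
            if i != j:
                worst[i] = max(worst[i],
                               pairwise_loss(candidates[i], candidates[j]))
    return int(np.argmin(worst))

# Toy illustration: three candidate "Q-vectors", one of which is an
# outlier; the min-max rule favors a candidate consistent with the rest.
truth = np.array([1.0, 2.0, 3.0])
cands = [truth + 0.10, truth + 0.15, truth + 2.0]
best = tournament_select(cands, lambda f, g: np.linalg.norm(f - g))
```

The min-over-max rule is what makes the procedure robust: a candidate only needs to survive its worst comparison, so a single bad opponent cannot promote an outlier.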

12:30 – 1:30 pm EDT: Lunch/Free Time (Virtual)

1:30 – 2:00 pm EDT: A Lyapunov approach for finite-sample convergence bounds with off-policy RL (Virtual)
 Sanjay Shakkottai, University of Texas at Austin
Abstract
In this talk, we derive finite-sample bounds for Markovian stochastic approximation, using the generalized Moreau envelope as a Lyapunov function. This result, we show, enables us to derive finite-sample bounds for a large class of value-based asynchronous reinforcement learning (RL) algorithms. Specifically, we show finite-sample mean-square convergence bounds for asynchronous RL algorithms such as Q-learning, n-step TD, TD(λ), and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of n-step TD and TD(λ), we provide theoretical insights into the bias-variance trade-off, i.e., the efficiency of bootstrapping in RL. Based on joint work with Zaiwei Chen, Siva Theja Maguluri and Karthikeyan Shanmugam.
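As a small illustration of the class of algorithms these bounds cover, here is tabular n-step TD on a two-state Markov reward process whose value function is known in closed form. The chain, step-size schedule, and choice of n are arbitrary choices for the demo, not from the talk's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state Markov reward process with uniform transitions.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
r = np.array([1.0, 0.0])
gamma = 0.9
# Closed form: V = (I - gamma * P)^{-1} r  ->  V = [5.5, 4.5].
V_true = np.linalg.solve(np.eye(2) - gamma * P, r)

def n_step_td(n=3, steps=200_000):
    """Asynchronous n-step TD value estimation from one long trajectory."""
    V = np.zeros(2)
    s = 0
    traj = [s]
    rewards = []
    for t in range(steps):
        rewards.append(r[s])
        s = int(rng.choice(2, p=P[s]))
        traj.append(s)
        if len(rewards) >= n:
            # n-step return: n discounted rewards, then bootstrap from V.
            tau = len(rewards) - n
            G = sum(gamma**k * rewards[tau + k] for k in range(n))
            G += gamma**n * V[traj[tau + n]]
            alpha = 0.5 / (1 + t / 1000)   # slowly decaying step size
            V[traj[tau]] += alpha * (G - V[traj[tau]])
    return V

V_est = n_step_td()
```

Larger n shifts the update toward Monte Carlo returns (less bias from the bootstrap term, more variance from the reward sum), which is exactly the bias-variance trade-off the abstract refers to.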

2:15 – 2:45 pm EDT: Reinforcement learning with factorization (Virtual)
 Devavrat Shah, Massachusetts Institute of Technology
Abstract
In the setup of single-agent reinforcement learning viewed through the framework of a Markov decision process, there are typically two primary challenges: (a) given access to the model (or a simulator), learning a good or optimal policy, and (b) identifying the model using limited observed data, potentially generated under a suboptimal and unknown policy.
Like the singular value decomposition of a matrix, the spectral decomposition or factorization of a "nice" multivariate function suggests that it can be represented as a finite or countably infinite sum of products of functions of individual variables. In this talk, we shall discuss how factorization of the Q-function can help design sample-efficient learning with access to a model simulator, and how factorization of the transition kernel can help learn the model from a single trajectory per agent in the setting of offline reinforcement learning with heterogeneous agents.
This is based on the joint works:
Q-function learning: https://arxiv.org/abs/2006.06135
Offline personalized model learning: https://arxiv.org/abs/2102.06961
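To make the factorization idea concrete, here is a toy in which a rank-1 "Q-matrix" (states by actions) is recovered from noisy observations by truncating its SVD. This is only a cartoon of the spectral viewpoint; the sizes, noise level, and rank are made up, and the linked papers' algorithms are substantially more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

# A rank-1 Q-function: Q(s, a) = u(s) * v(a), as a 50 x 10 matrix.
u = rng.uniform(1, 2, size=50)    # state factors
v = rng.uniform(0, 1, size=10)    # action factors
Q = np.outer(u, v)

# Observe Q with entrywise noise, then denoise by keeping only the
# top singular pair -- the low-rank structure filters the noise.
noisy = Q + 0.1 * rng.normal(size=Q.shape)
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
Q_hat = s[0] * np.outer(U[:, 0], Vt[0])

rel_err = np.linalg.norm(Q_hat - Q) / np.linalg.norm(Q)
```

The point of the toy: a rank-r matrix has only about r*(|S| + |A|) degrees of freedom rather than |S|*|A|, which is the source of the sample-efficiency gains the abstract describes.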
3:00 – 3:30 pm EDT: Coffee Break (Virtual)

3:30 – 4:00 pm EDT: Reinforcement Learning in High Dimensional Systems (and why "reward" is not enough...) (Virtual)
 Sham Kakade, University of Washington
Abstract
A fundamental question in the theory of reinforcement learning is what properties govern our ability to generalize and avoid the curse of dimensionality. With regard to supervised learning, these questions are well understood theoretically, and, practically speaking, we have overwhelming evidence on the value of representation learning (say, through modern deep networks) as a means for sample-efficient learning. Providing an analogous theory for reinforcement learning is far more challenging, where even characterizing the representational conditions which support sample-efficient generalization is far less well understood. This talk will highlight recent advances towards characterizing when generalization is possible in reinforcement learning, focusing on both lower bounds (addressing issues of what constitutes a good representation) along with upper bounds (where we consider a broad set of sufficient conditions).

4:15 – 4:45 pm EDT: Linear Convergence of Entropy-Regularized Natural Policy Gradient with Linear Function Approximation (Virtual)
 R. Srikant, University of Illinois at Urbana-Champaign
Abstract
We will present some recent results on the finite-time performance of the natural policy gradient method with linear function approximation and softmax parameterization. Our main result uses a Lyapunov drift analysis to show that entropy regularization leads to fast convergence.
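The linear-convergence phenomenon is easiest to see in the simplest single-state (bandit) instance, where entropy-regularized NPG with softmax parameterization reduces to a multiplicative-weights style update. The rewards, regularization weight, and step size below are made-up illustrative choices, and this tabular toy is only a specialization of the talk's function-approximation setting.

```python
import numpy as np

# Single-state ("bandit") instance with entropy regularization.
r = np.array([1.0, 0.5, 0.0])   # illustrative per-action rewards
lam = 0.1                        # entropy-regularization weight
eta = 2.0                        # NPG step size (need eta * lam <= 1)

pi = np.ones(3) / 3              # start from the uniform policy
for _ in range(200):
    # With softmax parameterization, entropy-regularized NPG reduces to
    # the multiplicative update pi <- pi^(1 - eta*lam) * exp(eta * r) / Z,
    # whose log-space error contracts by a factor (1 - eta*lam) per step.
    logits = (1 - eta * lam) * np.log(pi) + eta * r
    pi = np.exp(logits - logits.max())   # stabilized softmax
    pi /= pi.sum()

# The fixed point is the entropy-regularized optimum: softmax(r / lam).
pi_star = np.exp(r / lam - (r / lam).max())
pi_star /= pi_star.sum()
```

The geometric contraction factor (1 - ηλ) is visible directly in the update: without the entropy term (λ = 0) the exponent on `pi` is 1 and the contraction disappears, which is why regularization buys linear convergence.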

4:45 – 5:30 pm EDT: Gathertown Reception (Virtual)
Wednesday, August 4, 2021

10:30 – 11:00 am EDT: Gathertown Morning Coffee (Coffee Break, Virtual)

11:00 – 11:30 am EDT: TBA (Virtual)
 Mengdi Wang, Princeton

11:45 am – 12:15 pm EDT: A generalized policy gradient analysis (Virtual)
 Matus Telgarsky, University of Illinois Urbana-Champaign

12:30 – 1:30 pm EDT: Lunch/Free Time (Virtual)

1:30 – 2:00 pm EDT: Is Pessimism Provably Efficient for Offline RL? (Virtual)
 Zhaoran Wang, Northwestern University
Abstract
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori. Due to the lack of further interactions with the environment, offline RL suffers from the insufficient coverage of the dataset, which eludes most existing theoretical analysis. In this paper, we propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function. Such a penalty function simply flips the sign of the bonus function for promoting exploration in online RL, which makes it easily implementable and compatible with general function approximations.
Without assuming sufficient coverage of the dataset, we establish a data-dependent upper bound on the suboptimality of PEVI for general Markov decision processes (MDPs). When specialized to linear MDPs, it matches the information-theoretic lower bound up to multiplicative factors of the dimension and horizon. In other words, pessimism is not only provably efficient but also minimax optimal. In particular, given the dataset, the learned policy serves as the "best effort" among all policies, as no other policy can do better. Our theoretical analysis identifies the critical role of pessimism in eliminating a notion of spurious correlation, which emerges from the "irrelevant" trajectories that are less covered by the dataset and not informative for the optimal policy.
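The sign-flip principle is easy to see in a one-step (bandit) toy: subtract the uncertainty quantifier instead of adding it. This is only a stand-in for the full value-iteration algorithm PEVI; the counts, means, and the β/√n quantifier below are made-up illustrative choices.

```python
import numpy as np

# Offline data summary: per-action sample counts and empirical means.
counts = np.array([500, 500, 2])      # action 2 is barely covered
means = np.array([0.60, 0.55, 0.90])  # its high mean could be a fluke

beta = 1.0
bonus = beta / np.sqrt(counts)        # uncertainty quantifier

# Online exploration would ADD the bonus; offline pessimism SUBTRACTS it,
# penalizing actions the dataset says little about.
greedy = int(np.argmax(means))                # ignores coverage -> action 2
pessimistic = int(np.argmax(means - bonus))   # prefers well-covered action 0
```

The greedy rule is fooled by the two lucky samples on action 2, while the pessimistic rule falls back on the well-covered action, which is the "best effort" behavior the abstract describes.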
2:15 – 2:45 pm EDT: Reinforcement Learning, Bit by Bit (Virtual)
 Benjamin Van Roy, Stanford University
Abstract
Data efficiency poses an impediment to carrying the success of reinforcement learning agents over from simulated to real environments. The design of data-efficient agents calls for a deeper understanding of information acquisition and representation. I will discuss concepts and a regret bound that together offer principled guidance. The bound sheds light on questions of what information to seek, how to seek that information, and what information to retain. To illustrate concepts, I will also share results generated by simple agents that build on them.

3:00 – 3:30 pm EDT: Coffee Break (Virtual)

3:30 – 4:00 pm EDT: Careful Pessimism (Virtual)
 Emma Brunskill, Stanford University
Abstract
Pessimism with respect to quantified uncertainty has received significant recent interest in offline (batch) RL methods. I'll present some of our recent work in this direction, including model-free and policy-search approaches to offline RL.

4:15 – 4:45 pm EDT: Optimization for overparameterized systems: from convexity to the PL* condition (Virtual)
 Mikhail Belkin, UCSD
Abstract
The success of deep learning is due, to a large extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this talk I will discuss some general mathematical principles allowing for efficient optimization in overparameterized non-linear systems, a setting that includes deep neural networks. I will argue that optimization problems corresponding to these systems are not convex, even locally, but instead satisfy the Polyak-Lojasiewicz (PL) condition on most of the parameter space, allowing for efficient optimization by gradient descent or SGD. We connect the PL condition of these systems to the condition number associated to the tangent kernel and show how a non-linear theory for those systems parallels classical analyses of overparameterized linear equations. In a related but conceptually separate development, I will discuss a new perspective on the remarkable recently discovered phenomenon of transition to linearity (constancy of NTK) in certain classes of large neural networks. I will show how this transition to linearity results from the scaling of the Hessian with the size of the network. Combining these ideas, I will show how the transition to linearity can be used to demonstrate the PL condition and convergence for a large class of wide neural networks. Joint work with Chaoyue Liu and Libin Zhu.
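A one-dimensional illustration of the PL phenomenon: f(x) = x² + 3 sin²(x) is a standard example of a nonconvex function that satisfies the PL inequality everywhere, so plain gradient descent reaches the global minimum at a geometric rate. The smoothness constant L = 8, the starting point, and the iteration count are specific to this toy, not to the talk.

```python
import math

def f(x):
    # Nonconvex (the Hessian 2 + 6*cos(2x) changes sign), yet PL:
    # 0.5 * f'(x)^2 >= mu * (f(x) - f*) holds with f* = f(0) = 0.
    return x * x + 3 * math.sin(x) ** 2

def grad(x):
    # f'(x) = 2x + 6 sin(x) cos(x) = 2x + 3 sin(2x).
    return 2 * x + 3 * math.sin(2 * x)

# Gradient descent with step 1/L, where L = sup |f''| = 8. Under PL,
# f(x_k) - f* decays geometrically even without convexity.
x = 3.0
for _ in range(10_000):
    x -= (1.0 / 8.0) * grad(x)
```

The key consequence of PL used here is that every critical point is a global minimizer, so gradient descent cannot get trapped despite the nonconvexity.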

4:45 – 5:00 pm EDT: Closing Remarks (Virtual)
All event times are listed in ICERM local time in Providence, RI (Eastern Daylight Time / UTC4).