On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning

ABSTRACT:
Model-based Reinforcement Learning (MBRL) is a promising framework for learning control in a data-efficient manner. MBRL algorithms can be fairly complex due to the separate dynamics modeling and the subsequent planning algorithm, and as a result, they often possess tens of hyperparameters and architectural choices. For this reason, MBRL typically requires significant human expertise before it can be applied to new problems and domains. To alleviate this problem, we propose to use automatic hyperparameter optimization (HPO). We demonstrate that this problem can be tackled effectively with automated HPO, which we demonstrate to yield significantly improved performance compared to human experts. In addition, we show that tuning of several MBRL hyperparameters dynamically, i.e. during the training itself, further improves the performance compared to using static hyperparameters which are kept fixed for the whole training. Finally, our experiments provide valuable insights into the effects of several hyperparameters, such as plan horizon or learning rate and their influence on the stability of training and resulting rewards.

What you need to know:

This is one of those papers that changes your views on things. I knew hyperparameters were important to deep RL, and I knew that model-based RL is a more complicated system to build and deploy, but the magnitude of the effects of tuning hyperparameters in MBRL is mind-boggling. By breaking the simulator with a really simple MBRL algorithm, you can learn a lot more about the state of the field.

Full disclaimer: I was not the lead researcher on this project. I helped run experiments and understand the meaning of the experiments in the broader field.

AutoML

Automated Machine Learning (AutoML) is a field dedicated to the study of using machine learning algorithms to tune our machine learning tools. Humans are really bad at internalizing high-dimensional relationships, so let's let a computer do it for us. A harder problem is dynamic hyperparameter tuning (where parameters can change within a run), but more on that later.
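To make the static case concrete, here is a minimal sketch of HPO via random search. Everything here is illustrative: `evaluate` is a hypothetical stand-in for running one full MBRL training trial with a given configuration and returning its final reward, not anything from the paper or a real library.

```python
import random

def evaluate(config):
    # Hypothetical stand-in for a full training run: returns the
    # final reward achieved with this hyperparameter configuration.
    lr, horizon = config["lr"], config["horizon"]
    return -(lr - 1e-3) ** 2 * 1e6 - (horizon - 25) ** 2 * 0.01

def random_search(n_trials=50, seed=0):
    """Sample configurations at random and keep the best one found."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-5, -2),   # log-uniform learning rate
            "horizon": rng.randint(5, 50),     # planning horizon
        }
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Real AutoML tools replace the uniform sampling with smarter strategies (Bayesian optimization, population-based training, etc.), but the loop structure is the same: propose a configuration, run a trial, keep what works.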

MBRL

From another post of mine.

Model-based reinforcement learning (MBRL) follows the framework of an agent interacting in an environment, learning a model of said environment, and then leveraging the model for control. Specifically, the agent acts in a Markov Decision Process (MDP) governed by a transition function x_{t+1} = f(x_t, u_t), returning a reward r(x_t, u_t) at each step. With a collected dataset D := {(x_i, u_i, x_{i+1}, r_i)}, the agent learns a model x_{t+1} = f_θ(x_t, u_t) by minimizing the negative log-likelihood of the transitions. We employ sample-based model-predictive control (MPC) using the learned dynamics model, which optimizes the expected reward over a finite, recursively predicted horizon τ, from a set of actions sampled from a uniform distribution U(a) (see the references in the paper).
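A minimal sketch of sample-based MPC using the simplest "random shooting" variant (PETS itself uses CEM with an ensemble of probabilistic networks, which is omitted here). The `dynamics` and `reward` arguments stand in for the learned model f_θ and the reward function; they are assumptions of this sketch, not the paper's API.

```python
import numpy as np

def random_shooting_mpc(x, dynamics, reward, horizon=20, n_samples=500,
                        action_low=-1.0, action_high=1.0, action_dim=1,
                        rng=None):
    """One MPC step: sample action sequences uniformly, roll each out
    through the learned dynamics model, and return the first action of
    the best-scoring sequence."""
    rng = rng or np.random.default_rng(0)
    # Candidate action sequences: (n_samples, horizon, action_dim)
    actions = rng.uniform(action_low, action_high,
                          size=(n_samples, horizon, action_dim))
    returns = np.zeros(n_samples)
    states = np.tile(x, (n_samples, 1))
    for t in range(horizon):
        returns += reward(states, actions[:, t])
        states = dynamics(states, actions[:, t])  # recursive prediction
    best = np.argmax(returns)
    return actions[best, 0]
```

Only the first action of the best sequence is executed; the optimization is then re-run from the next state, which is what makes MPC robust to model error over long horizons.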

MBRL as a candidate for AutoML

Why might we see an outsized impact of AutoML in MBRL? It has a ton of moving parts. First off, more machine learning components means a harder tuning problem, but there is a bigger reason that compounds to make hyperparameter tuning far more impactful in MBRL. Normally, a graduate student tuning parameters works on one learning problem at a time, but in MBRL there are two strangely coupled systems with mismatched objectives (the model minimizes prediction error while the controller maximizes reward), so no human will find the perfect parameters except by luck.

Deep RL and Mujoco

MuJoCo became a favorite of deep RL because it was available when a massive growth phase came through. MuJoCo is expensive, restricted, and not an accurate portrayal of the real world. It is a decent simulator, relatively lightweight, and easy enough to use. That makes it relatively good for individual researchers trying to prove themselves, but not necessarily great for the long-term health of the field. Now that state-of-the-art (SOTA) racing on these benchmarks has reached a new level, real intellectual breakthroughs in ML are stagnating. I suspect that over the next 5 years the simulators used by deep RL researchers will change substantially, but the jury is still out on whether that change in sim will also translate to an improvement in research practices. (Note: back when I was on Medium, I wrote a post about baselines in RL research, and it is even more true now.)

Importance of Hyperparameter Optimization for MBRL

In short, the results of this paper are astounding. With sufficient hyperparameter tuning, the MBRL algorithm (PETS) literally breaks MuJoCo. The famous "halfcheetah" task degenerates into a glorious spiral of data-driven method heaven.

Normally, the cheetah is supposed to run. The paper has a much, much wider range of results on multiple environments, but I leave that to the reader. It explores interesting tradeoffs between optimizing the model (learning the dynamics) and the controller (solving the reward-maximization problem). Additionally, it shows how dynamically changing hyperparameters throughout a trial can be useful, such as increasing the planning horizon as the algorithm collects data and the model becomes more accurate.
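As a toy illustration of dynamic tuning, a hyperparameter can follow a schedule across trials instead of staying fixed for the whole run. This linear horizon schedule is my own illustrative example under the intuition above (short horizons while the model is inaccurate, longer ones as it improves), not the actual mechanism studied in the paper, which learns the schedule automatically.

```python
def horizon_schedule(trial, start=5, end=40, n_trials=50):
    """Linearly grow the planning horizon over training trials:
    a crude stand-in for dynamic hyperparameter tuning."""
    frac = min(trial / max(n_trials - 1, 1), 1.0)
    return int(round(start + frac * (end - start)))
```

The paper's point is that even schedules like this, when discovered automatically rather than hand-designed, can beat any single fixed setting.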

Next steps for MBRL

I am really confident about the future of MBRL these days. First, this paper shows how much more capable our existing algorithms are in terms of optimality than their default configurations suggest. Second, I feel like all the pieces of this "MBRL system" the field now uses regularly have a lot of future directions to exploit:

  1. Something other than vanilla one-step predictions for the dynamics model,
  2. Moving away from sample-based model predictive control,
  3. Connecting model to optimizer with gradients,
  4. Meta-MBRL and other variants towards generalization & interpretability.

Thanks for reading, and thanks again to the authors who did the majority of the work on this paper. I will be adding a video attachment of the poster presentation when we get it done.

Citation

@article{zhang2021importance,
 title={On the Importance of Hyperparameter Optimization for Model-based Reinforcement Learning},
 author={Zhang, Baohe and Rajan, Raghu and Pineda, Luis and Lambert, Nathan and Biedenkapp, Andr{\'e} and Chua, Kurtland and Hutter, Frank and Calandra, Roberto},
 journal={AISTATS},
 year={2021}
}