[Audio] Dear attendees of BuildSys 2024, I am [Place holder]. Today I am going to present the paper "Adaptive Policy Regularization for Offline-to-Online Reinforcement Learning in HVAC Control" on behalf of the authors: Hsin-Yu Liu of UCSD, Bharathan Balaji of Amazon, Rajesh Gupta of UCSD, and Dezhi Hong of Amazon.
[Audio] First, I will start with the introduction: why use reinforcement learning (RL) for building control, followed by a brief recap of RL. Then, in the background of this study, we discuss the problem statement and the RL settings of our experiments. Next, the methodology: our method WISMAQ (pronounced "wis-mac"), the Weighted Increased Simple Moving Average of Q-value, and how it adapts to the distribution drift in offline-to-online reinforcement learning settings. Then the experimental results: we have conducted a series of experiments, including benchmark, data efficiency, sensitivity, and scalability experiments, to ensure the robustness and stability of our methodology. Finally, the future work and discussion.
[Audio] First, the introduction of this paper: why RL for building control, and a recap of reinforcement learning.
[Audio] Reinforcement learning (RL) provides a dynamic and adaptive solution for optimizing building energy efficiency, learning directly from interactions with real-time environmental data to balance energy usage and occupant comfort more effectively than traditional rule-based systems. To recap reinforcement learning: it is a sequential decision-making process that satisfies the Markov decision process property. The goal is to learn a policy that selects actions maximizing long-term reward, measured as the cumulative discounted reward in Q-learning.
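As a side illustration of the recap above, here is a minimal sketch of the discounted return that a Q-function estimates; the discount factor and the toy reward sequence are illustrative values, not taken from the paper.

```python
# Minimal sketch: the cumulative discounted reward G_t = sum_k gamma^k * r_{t+k}
# that Q-learning estimates. gamma and the rewards below are illustrative only.

def discounted_return(rewards, gamma=0.99):
    """Return the discounted sum of a reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: a short episode of (negative) energy + comfort rewards.
print(discounted_return([-1.2, -0.8, -1.0], gamma=0.99))
```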
[Audio] Then we will discuss the background of this work.
[Audio] There are several approaches to RL. If we categorize them in terms of offline and online learning, they include traditional online RL; offline RL, which learns purely from data; and offline-to-online RL, where we first learn a pre-trained offline model and then improve it with online interaction. The last might be the most practical approach for real-world applications, since online learning requires an accurate simulator and/or model of the environment, while offline learning is limited by the quality of the dataset and the diversity of its state-action space visitation. However, some requirements limit the power of existing offline-to-online RL methods: R.1, they require information on the absolute scores of expert and random agents, which is typically not available for buildings; R.2, many suffer policy collapse at the very beginning of the transition from offline mode to online mode; R.3, they introduce compute overhead with additional models and/or replay buffers. Our method tries to avoid these requirements and to be more general.
[Audio] Here we talk about the RL settings in our study. Environment: its surface area is 463.6 m², and it is equipped with a VAV package (DX cooling coil and gas heating coils) with fully auto-sized input as the HVAC system to be controlled. The details regarding the building models can be found on Sinergym's GitHub repository. State: site outdoor air dry bulb temperature, site outdoor air relative humidity, site wind speed, site wind direction, site diffuse solar radiation rate per area, site direct solar radiation rate per area, zone thermostat heating setpoint temperature, zone thermostat cooling setpoint temperature, zone air temperature, zone thermal comfort mean radiant temperature, zone air relative humidity, zone thermal comfort clothing value, zone thermal comfort Fanger model PPD (predicted percentage of dissatisfied), zone people occupant count, people air temperature, facility total HVAC electricity demand rate, current day, current month, and current hour. Action: heating setpoint and cooling setpoint in continuous settings for the interior zones. Reward: the reward function is a linear combination of the power consumption and the thermal comfort penalty; the scaling constants are 10^-4 for energy consumption and 1.0 for comfort, respectively.
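A rough sketch of the reward described above follows. The weights (10^-4 for energy, 1.0 for comfort) come from the slide; the comfort-band temperatures and the violation computation are simplified placeholders, not Sinergym's exact implementation.

```python
# Hedged sketch of the reward on this slide: a linear combination of power
# consumption and a thermal-comfort penalty. Weights are from the slide; the
# comfort band and violation logic below are assumptions for illustration.

W_ENERGY = 1e-4
W_COMFORT = 1.0

def reward(power_w, zone_temp_c, comfort_low_c=20.0, comfort_high_c=23.5):
    # Penalize energy use.
    energy_term = W_ENERGY * power_w
    # Penalize degrees outside an assumed comfort band.
    comfort_violation = max(0.0, comfort_low_c - zone_temp_c) + \
                        max(0.0, zone_temp_c - comfort_high_c)
    comfort_term = W_COMFORT * comfort_violation
    return -(energy_term + comfort_term)

print(reward(power_w=5000.0, zone_temp_c=25.0))
```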
[Audio] Flow chart of offline-to-online RL: (1) the offline model learns from the existing dataset; (2) after pre-training, the agent interacts with the environment online; (3) the generated transitions are saved in the replay buffer(s) for further learning; (4) offline-to-online fine-tuning improves the agent's performance continuously.
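The loop on this slide could be sketched roughly as below. All names here (pretrain_offline, ReplayBuffer-style buffer, agent.act/update, env) are hypothetical placeholders, not the authors' actual API.

```python
# Hedged sketch of the offline-to-online flow on this slide.
# agent, env, buffer, and their methods are placeholders for illustration.

def offline_to_online(agent, env, offline_dataset, buffer, online_steps=10_000):
    # (1) Pre-train on the existing (offline) dataset.
    agent.pretrain_offline(offline_dataset)

    obs = env.reset()
    for step in range(online_steps):
        # (2) Interact with the environment online.
        action = agent.act(obs)
        next_obs, rew, done, info = env.step(action)
        # (3) Store the generated transition in the replay buffer.
        buffer.add(obs, action, rew, next_obs, done)
        # (4) Fine-tune continuously from the growing buffer.
        agent.update(buffer.sample(batch_size=256))
        obs = env.reset() if done else next_obs
    return agent
```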
[Audio] In this section we will discuss the motivation for developing our method and how it works in detail.
[Audio] Here is the preliminary experiment we conducted: we start from a pre-trained offline TD3+BC model and then fine-tune it online as a TD3 model. On the left, if we pre-train the offline model with the RBC buffer, after further online training the agent cannot learn a better policy, so the mean Q-value converges. In contrast, on the right, if we train the model with the random buffer, the agent can easily learn a better policy, so the mean Q-value keeps increasing. The main reason we use a simple moving average of the mean Q-value sampled from the batches is that, as we can observe in these two figures, the solid blue lines fluctuate and are noisy throughout training, so we cannot tell whether the mean Q-value is increasing or decreasing. However, if we use the SMA, shown as the red curves in these figures, we can see an obvious trend as training goes on.
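The de-noising step described here amounts to a simple moving average over the per-batch mean Q-values; a minimal sketch follows, where the window length is an illustrative choice rather than the paper's setting.

```python
from collections import deque

# Minimal sketch of the simple moving average (SMA) of the batch-mean Q-value
# used to de-noise the training curve. The window length is illustrative.

class QValueSMA:
    def __init__(self, window=100):
        self.values = deque(maxlen=window)

    def update(self, batch_mean_q):
        self.values.append(float(batch_mean_q))
        return sum(self.values) / len(self.values)

sma = QValueSMA(window=100)
for q in [10.2, 9.8, 11.5, 10.9]:  # noisy batch-mean Q-values
    smoothed = sma.update(q)
print(smoothed)
```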
[Audio] Our method, WISMAQ (Weighted Increased Simple Moving Average of Q-value), uses the de-noised trend of the mean Q-value of the sampled mini-batch as the metric to decide whether the policy should learn with more exploration or more exploitation. We add an auto-tuned regularization term to the actor loss, denoted L_WISMAQ. A rectified linear unit (ReLU) activation is used to auto-tune the loss term: Q^SMA_t is the current SMA mean Q-value, and Q^SMA_{t-d} is the reference point, a previous SMA Q-value. That means if the current value is higher than the reference, we encourage the agent to explore further; conversely, if the reference is higher, the loss term becomes zero and the agent learns like an offline agent with almost no exploration. In the denominator, we use the sum of these two values to bound the loss term between zero and one. The most important hyperparameter in our method is ξ, which controls the weight of the reference point's SMA Q-value. We have also added two add-on methods: for more accurate value estimation, a bootstrapped ensemble; and for adapting to distribution drift, combined experience replay (CER), which forces the latest transition to be included in the sampled mini-batch.
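From the description above, the regularization term could look roughly like the sketch below: a ReLU of the difference between the current SMA Q-value and a ξ-weighted reference SMA Q-value from d steps earlier, normalized by their sum to stay within [0, 1]. This is one reading of the slide, not the authors' exact equation; in particular, placing the ξ weight inside the ReLU is an assumption.

```python
# Hedged sketch of the WISMAQ regularization term as described on this slide.
# Assumed form (not the authors' exact equation):
#   L_WISMAQ = ReLU(Q_SMA(t) - xi * Q_SMA(t - d)) / (Q_SMA(t) + Q_SMA(t - d))
# If the current SMA Q-value exceeds the weighted reference, the term is positive
# and encourages further exploration; otherwise it is zero and the actor learns
# conservatively, like an offline agent.

def wismaq_term(q_sma_now, q_sma_ref, xi=1.0, eps=1e-8):
    numerator = max(0.0, q_sma_now - xi * q_sma_ref)  # ReLU
    denominator = q_sma_now + q_sma_ref + eps          # bounds the term in [0, 1]
    return numerator / denominator

# Rising SMA trend -> positive term (explore more).
print(wismaq_term(q_sma_now=12.0, q_sma_ref=10.0, xi=1.0))
# Falling SMA trend -> zero (learn conservatively).
print(wismaq_term(q_sma_now=9.0, q_sma_ref=10.0, xi=1.0))
```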
[Audio] In this section we will discuss our experimental results.
[Audio] Here are the results of our main experiments. We can see that our method WISMAQ, shown in red, is able to improve the existing offline models and stabilize them by the end of training, with a substantial reduction in variation across random initializations compared with other state-of-the-art methods. Although our method is greedy, it does not deteriorate the models' performance through overfitting, because we use bootstrapped ensemble learning, where we randomly sample a Q-network to estimate the value during training.
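The bootstrapped-ensemble idea mentioned here, randomly sampling one Q-network from an ensemble to provide the value estimate, could be sketched as follows; the ensemble size and the make_q_network constructor are assumptions, not the authors' implementation.

```python
import random

# Hedged sketch of bootstrapped-ensemble value estimation: keep several
# independently initialized Q-networks and randomly sample one of them to
# estimate values for each update. Names and sizes are placeholders.

class BootstrappedQEnsemble:
    def __init__(self, make_q_network, ensemble_size=5):
        self.q_networks = [make_q_network() for _ in range(ensemble_size)]

    def sample_member(self):
        # Randomly pick one Q-network for this update.
        return random.choice(self.q_networks)

    def estimate(self, state, action):
        return self.sample_member()(state, action)

# Toy usage with a dummy Q-function that always returns 0.0.
ensemble = BootstrappedQEnsemble(lambda: (lambda s, a: 0.0), ensemble_size=5)
print(ensemble.estimate(state=None, action=None))
```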
[Audio] Data efficiency experiment (left): with pure offline WISMAQ training, we simulate scenarios with different amounts of accessible data. It is intuitive that with smaller buffers the agent learns faster, since as training continues the better policies generate higher-quality experience replay. However, this comes with a less stable policy and can lead to catastrophic forgetting (1 week). With too much data, the model learns from the old distribution, which might damage performance (1 year). Finding the optimal buffer size is therefore a crucial factor in off-policy learning. Sensitivity experiment (right): with a smaller weight ξ = 1, the model is unable to learn a better policy because the behavioral cloning term dominates, while a higher ξ value leads to greedier policy learning that might suffer from inaccurate value estimation. Thus, it is recommended to optimize this hyperparameter for each environment.
[Audio] Scalability experiment (left): furthermore, we want to examine the scalability of our model when deployed to different environment settings. To ensure the generalization ability of WISMAQ, we conducted experiments with another environment, a datacenter. As the experimental results demonstrate, WISMAQ can learn a policy similar to the expert policy, while other methods cannot adapt to the distribution drift as training goes on. Ablation experiment (right): to demonstrate the value of our add-on methods, we conducted a series of ablation experiments to validate their necessity. We run WISMAQ, "no_CER", and "no_WISMAQ" on the same cool-weather task with RBC buffers. The results indicate that without WISMAQ the model learns similarly to an offline model, and without CER the model cannot adapt to the latest distribution drift and eventually fails to improve itself in the long run. Combining these methods boosts the overall learning ability more than applying them separately.
[Audio] Finally, our future work and discussion.
[Audio] We have developed a novel approach, WISMAQ, to regularize the agent's policy in offline-to-online RL for use in HVAC control. The use of a simple moving average of the mean Q-value is key to our algorithm and reflects the actual trend of policy learning and its corresponding value estimation. A limitation of our method is that with higher-dimensional state-action spaces, the effect of the curse of dimensionality increases, which might lead to less accurate value estimation. We hope our study encourages domain experts to explore the possibility of offline-to-online reinforcement learning applied to energy systems. We open-source our code for research purposes, aiming to accelerate the deployment of offline-to-online RL in building control.
[Audio] Thank you for your attention; any input and questions are welcome.