Entropy-Regularized Deep Reinforcement Learning with Linear Programming

A study of entropy regularization through linear programming formulations of MDPs and its feasibility in large-scale settings

As part of my undergraduate thesis, I am conducting research on several state-of-the-art regularized reinforcement learning algorithms, as well as closely related variants of them. The thesis centers on an alternative approach to traditional deep RL, rooted in the linear programming (LP) reformulation of Markov decision processes (MDPs). I implemented Q-REPS from the “Logistic q-learning” paper using neural networks and demonstrated that this framework can also be used to build new deep RL algorithms that compete with well-known methods such as DQN, SAC, and PPO, without tricks like gradient clipping, target networks, or double Q-networks. I also propose a novel algorithm, Primal-Dual Approximate Policy Iteration, rooted in the same LP reformulation, and show that it can be a strong alternative in large-scale settings as well. For a detailed explanation of everything, see the thesis PDF linked below.

Q-REPS

Algorithms

MinMax-QREPS practical implementation
ELBE QREPS practical implementation
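
To give a flavor of the objective behind the ELBE variant, here is a minimal PyTorch sketch of the empirical logistic Bellman error as I understand it from the Logistic q-learning paper: a log-mean-exp of TD errors plus an initial-state value term, in place of the usual squared TD loss. The function names, the uniform reference policy, and the q_net interface (a batch of states in, one Q-value per action out) are illustrative assumptions; the notebooks above contain the full implementations.

```python
import math
import torch

def soft_value(q_values, alpha):
    # V(x) = (1/alpha) * log mean_a exp(alpha * Q(x, a)): a soft maximum of the
    # Q-values under a uniform reference policy (illustrative choice)
    n_actions = q_values.shape[1]
    return (torch.logsumexp(alpha * q_values, dim=1) - math.log(n_actions)) / alpha

def elbe_loss(q_net, obs, actions, rewards, next_obs, init_obs,
              eta=2.0, alpha=2.0, gamma=0.99):
    # TD error with the soft value of the next state as the bootstrap target
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    delta = rewards + gamma * soft_value(q_net(next_obs), alpha) - q
    # log-mean-exp over the batch replaces the mean of squared TD errors
    log_mean_exp = (torch.logsumexp(eta * delta, dim=0) - math.log(delta.shape[0])) / eta
    # plus the (1 - gamma)-weighted value of initial states
    return log_mean_exp + (1 - gamma) * soft_value(q_net(init_obs), alpha).mean()
```

Note that the same network provides both the prediction and the bootstrap target: no target network appears, in line with the claim above.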

Reward Curves

QREPS performance on benchmark environments

Q-REPS Gameplay videos

Learning CartPole with QREPS
Learning LunarLander-v2 with QREPS
Learning Acrobot-v1 with QREPS

Q-REPS Extensions to continuous environments

With a small modification to the original algorithm, it can be made to work with continuous action spaces as well. Below is an example on the HalfCheetah environment.

QREPS practical implementation on HalfCheetah
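
One way to realize such an extension is a REPS-style weighted maximum-likelihood policy fit: sample actions from the current Gaussian policy, weight them by exp(alpha * Q), and refit the Gaussian to the weighted samples. The sketch below is my own illustration under assumed interfaces (policy(obs) returning a Gaussian mean and standard deviation, q_net(obs, actions) returning scalar Q-values), not necessarily the exact modification used in the notebook.

```python
import torch
from torch.distributions import Normal

def continuous_policy_update(policy, optimizer, q_net, obs, alpha=2.0, n_samples=16):
    mean, std = policy(obs)                 # Gaussian policy head, assumed API
    dist = Normal(mean, std)
    actions = dist.sample((n_samples,))     # (n_samples, batch, act_dim)
    with torch.no_grad():
        q = q_net(obs.expand(n_samples, *obs.shape), actions)
        weights = torch.softmax(alpha * q, dim=0)  # normalize over samples per state
    # weighted maximum likelihood: refit the Gaussian toward high-Q actions
    log_prob = dist.log_prob(actions).sum(-1)
    loss = -(weights * log_prob).sum(0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```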

Primal-dual Approximate Policy Iteration

Algorithms

Primal-Dual API practical implementation
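
For intuition about what “primal-dual” means here, below is a generic, textbook-style sketch of saddle-point updates on the occupancy-measure LP of a small tabular MDP: exponentiated-gradient ascent on the occupancy measure and gradient descent on the value function (the dual variables). This is only an illustration of the general idea, not the exact algorithm from the thesis; see the PDF and the notebook above for the real thing.

```python
import numpy as np

def primal_dual_lp(P, r, nu0, gamma=0.9, eta_mu=0.5, eta_v=0.5, iters=2000):
    # P: transitions (S, A, S); r: rewards (S, A); nu0: initial distribution (S,)
    S, A = r.shape
    logits = np.zeros((S, A))   # log-weights of the occupancy measure mu(s, a)
    V = np.zeros(S)             # dual variables: one value per state
    for _ in range(iters):
        mu = np.exp(logits - logits.max())
        mu /= mu.sum()          # keep mu a distribution over state-action pairs
        # gradient of the Lagrangian in mu: r(s, a) + gamma * E[V(s')] - V(s)
        adv = r + gamma * (P @ V) - V[:, None]
        logits += eta_mu * adv  # mirror (exponentiated-gradient) ascent on mu
        # gradient of the Lagrangian in V: violation of the flow constraints
        flow = (1 - gamma) * nu0 + gamma * np.einsum('sa,sat->t', mu, P) - mu.sum(axis=1)
        V -= eta_v * flow       # gradient descent on the dual
    policy = mu / mu.sum(axis=1, keepdims=True)
    return policy, V
```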

Tabular Results

Primal-Dual API performance on 5x5 grid
Primal-Dual API performance on 8x8 grid

Thesis PDF

Thesis presentation

Other work outside of the main thesis

X-QL modifications

Modifications I made to both the SAC and TD3 versions of X-QL improve the final reward obtained.
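
The modifications themselves are described in the thesis; for context, the loss at the core of X-QL is Gumbel regression, which replaces squared-error regression when fitting values. A minimal sketch (the clipping threshold is my own stabilization choice):

```python
import torch

def gumbel_loss(pred, target, beta=1.0, clip=7.0):
    # Minimizing E[exp(z) - z - 1] with z = (target - pred) / beta fits pred to a
    # soft maximum of the targets; clamping z keeps exp() numerically stable
    z = ((target - pred) / beta).clamp(max=clip)
    return (torch.exp(z) - z - 1.0).mean()
```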

PPO modifications using Gumbel loss to learn the value function

I used the X-QL (Gumbel) loss function to optimize the value function in the PPO algorithm. This modification needs further study.
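
A minimal sketch of what this swap could look like, with value_net and the return targets as illustrative names: PPO's usual mean-squared value loss is replaced by the same Gumbel objective. Large beta recovers squared error up to scale, while small beta pushes the value estimate toward a soft maximum of the returns.

```python
import torch

def ppo_value_loss(value_net, obs, returns, beta=1.0, clip=7.0):
    # Gumbel loss in place of the standard 0.5 * (V(s) - R)^2 value loss
    z = ((returns - value_net(obs).squeeze(-1)) / beta).clamp(max=clip)
    return (torch.exp(z) - z - 1.0).mean()
```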