Accelerating Reinforcement Learning with Learned Skill Priors (SPiRL)
Authors: Karl Pertsch, Youngwoon Lee, Joseph J. Lim
Code: https://github.com/clvrai/spirl
Website: https://clvrai.github.io/spirl/
Motivation
Previous Work
- Extracting skills from unstructured data and leveraging them for downstream learning tasks
- Meta-learning approaches - learn online and require a predefined set of training tasks
- Learning skill embeddings
Contribution
Leveraging skill embeddings extracted from large offline datasets, and learning a prior over them, for efficient downstream learning

Methodology
Learning the embedding space and the prior over the embeddings

- Skill embeddings $z$ are regularized toward a unit Gaussian prior, $p(z) = \mathcal{N}(0, I)$

- A KL divergence loss trains the skill prior $p_a(z \mid s_t)$ to match the encoder's posterior over $z$ (with gradients stopped through the posterior), sketched below
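
A minimal PyTorch sketch of this training step, under assumptions: all module names, layer sizes, and the flat MLP encoder/decoder are placeholders (the linked repo uses sequence models over the $H$ actions), and `beta` weights the VAE regularizer as in a $\beta$-VAE:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

# Hypothetical sizes; the paper uses H = 10 action steps per skill.
H, ACT_DIM, STATE_DIM, Z_DIM = 10, 4, 60, 10

class SkillVAE(nn.Module):
    """Sketch of the skill embedding model: a VAE over H-step action
    sequences, plus a state-conditioned skill prior network."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(  # posterior q(z | a_{t:t+H-1})
            nn.Linear(H * ACT_DIM, 128), nn.ReLU(), nn.Linear(128, 2 * Z_DIM))
        self.decoder = nn.Sequential(  # reconstructs the action sequence from z
            nn.Linear(Z_DIM, 128), nn.ReLU(), nn.Linear(128, H * ACT_DIM))
        self.prior = nn.Sequential(    # learned skill prior p_a(z | s_t)
            nn.Linear(STATE_DIM, 128), nn.ReLU(), nn.Linear(128, 2 * Z_DIM))

    @staticmethod
    def _gaussian(params):
        mean, log_std = params.chunk(2, dim=-1)
        return Normal(mean, log_std.exp())

    def loss(self, action_seq, s0, beta=0.01):
        # action_seq: (B, H * ACT_DIM) flattened actions; s0: (B, STATE_DIM)
        q = self._gaussian(self.encoder(action_seq))
        z = q.rsample()
        recon = F.mse_loss(self.decoder(z), action_seq)
        unit = Normal(torch.zeros_like(z), torch.ones_like(z))
        vae_kl = kl_divergence(q, unit).sum(-1).mean()   # pull q toward N(0, I)
        # Prior training: match the detached posterior, so the prior loss
        # does not shape the embedding space itself.
        q_sg = Normal(q.mean.detach(), q.stddev.detach())
        prior_kl = kl_divergence(q_sg, self._gaussian(self.prior(s0))).sum(-1).mean()
        return recon + beta * vae_kl + prior_kl
```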

Using the prior over the skills for downstream reinforcement learning tasks
- Replacing the one-step reward with an $H$-step reward, and single-step transitions with $H$-step transitions
- Learning the policy over the embeddings instead of the actions for model-free reinforcement learning: $\pi_\theta(z \mid s_t)$

- Integration with the maximum-entropy RL framework: SAC's entropy term is replaced by a KL divergence to the learned skill prior (see the sketch below)
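
Concretely, the policy maximizes $\mathbb{E}\big[\tilde{r}(s_t, z_t) - \alpha D_{KL}\big(\pi_\theta(z \mid s_t) \,\|\, p_a(z \mid s_t)\big)\big]$, where $\tilde{r}$ is the summed $H$-step reward. A minimal sketch of the resulting actor and critic updates, under assumptions: `policy`, `prior`, `critic`, and `target_critic` are hypothetical callables, with `policy` and `prior` returning `torch.distributions.Normal` objects over $z$:

```python
import torch
from torch.distributions import kl_divergence

def actor_loss(policy, critic, prior, states, alpha=0.1):
    """SAC-style policy update over skills z, with the entropy bonus
    replaced by a KL penalty against the learned skill prior."""
    pi = policy(states)                            # pi_theta(z | s_t)
    z = pi.rsample()                               # reparameterized sample
    kl = kl_divergence(pi, prior(states)).sum(-1)  # D_KL(pi || p_a)
    return (alpha * kl - critic(states, z)).mean() # maximize Q, stay near prior

def critic_target(policy, prior, target_critic, reward_H, done, next_states,
                  alpha=0.1, gamma=0.99):
    """Bellman target over skill-level (H-step) transitions: reward_H is the
    sum of the H environment rewards collected while executing one skill;
    gamma discounts per skill step here, a simplification."""
    with torch.no_grad():
        pi = policy(next_states)
        z = pi.rsample()
        kl = kl_divergence(pi, prior(next_states)).sum(-1)
        next_q = target_critic(next_states, z) - alpha * kl
        return reward_H + (1.0 - done) * gamma * next_q
```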


Experiments
- Results are reported with Soft Actor-Critic (SAC) as the base RL algorithm
- Three environments
- Maze navigation (D4RL)
- Block stacking
- Kitchen environment (D4RL benchmark)

- Block stacking environment (figure in the paper)

- Performance comparison, methods evaluated:
- SPiRL - Skill-Prior RL, the full method (learned skills + learned skill prior)
- Flat prior - a single-step action prior, without temporal abstraction
- SSP w/o prior - policy over the learned skill space, but without the learned prior
- BC + SAC - a behavior-cloning policy trained on the offline data and fine-tuned on the downstream task with SAC
- SAC - vanilla Soft Actor-Critic, trained from scratch

- Ablation studies: experiments over different skill horizons $H$

Further work
- Experimenting with skill priors conditioned on other inputs; the paper only reports results with priors conditioned on the current state
- How does this approach scale to multi-task learning (continual-learning agents)?