How the Deep Q-learning method works and discuss the new ideas that makes the algorithm work

Overview

Deep Q-Learning

Recommend papers

The first step is to read and understand the method that you will implement. It was first introduced in a 2013 paper and further improved and elaborated upon in the Nature DQN paper in 2015. We suggest reading both. In your final report, we want you to briefly describe how the Deep Q-learning method works and discuss the new ideas that makes the algorithm work.

Environment

We will use OpenAI gyms Atari-environments. To test that your installation include these you can use

import gym
env = gym.make('Pong-v0')

If this does not work, you can install it with

pip install gym[atari]

Implement and test DQN

DQN can be tricky to implement because it's difficult to debug and sensitive to the choice of hyperparameters. For this reason, it is advisable to start testing on a simple environment where it is clear if it works within minutes rather than hours.

You will be implementing DQN to solve CartPole.

For different reward functions, the convergence of models at different speeds varies greatly. We have customized a function, when the angle of the joystick is closer to 90 degrees and the position of the trolley is closer to the center of mass, the reward is higher, the covergece speed is higher than we simple define the reward as -1 when the situation done.

As you can see in experiment 1 and *1, the hyperparameters are the same but with different reward functions. In experiment 1, the reward function is simple, the agent gets reward 1 when the game was not done, otherwise, the reward is -1. But in experiment *1, we changed the reward function which is based on the state. When the car is closer to the midpoint, the reward is higher. When the angle between the flag and the horizontal line is closer to 90 degrees, the reward is higher, and vice versa. The results revealed that a good reward function can make a huge difference in performance when it comes to Reinforcement Learning, which can speed up the process of agent learning.

Learn to play Pong

Preprocessing frames

A convenient way to deal with preprocessing is to wrap the environment with AtariPreprocessing from gym.wrappers as follows:

env = AtariPreprocessing(env, screen_size=84, grayscale_obs=True, frame_skip=1, noop_max=30)

You should also rescale the observations from 0-255 to 0-1.

Stacking observations

The current frame doesn't provide any information about the velocity of the ball, so DQN takes multiple frames as input. At the start of each episode, you can initialize a frame stack tensor

obs_stack = torch.cat(obs_stack_size * [obs]).unsqueeze(0).to(device)

When you receive a new observation, you can update the frame stack with and store it in the replay buffer as usual.

next_obs_stack = torch.cat((obs_stack[:, 1:, ...], obs.unsqueeze(1)), dim=1).to(device)

Policy network architecture

We recommend using the convolutional neural network (CNN) architecture described in the Nature DQN paper (Links to an external site.). The layers can be initialized with

self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4, padding=0)
self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=0)
self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=0)
self.fc1 = nn.Linear(3136, 512)
self.fc2 = nn.Linear(512, self.n_actions)

and we use ReLU activation functions as previously. nn.Flatten() may be helpful to flatten the outputs before the fully-connected layers.

Hyperparameters

We suggest starting with the following hyperparameters:

Observation stack size: 4 Replay memory capacity: 10000 Batch size: 32 Target update frequency: 1000 Training frequency: 4 Discount factor: 0.99 Learning rate: 1e-4 Initial epsilon: 1.0 Final epsilon: 0.01 Anneal length: 10**6

While these should work, they are not optimal and you may play around with hyperparameters if you want.

Results of Pong

Note: The more detail analysis can be viewed in analysis folder.

All the experiments are implemented in Google Colab with 2.5 million frames. The parameters are explained as follows.

Discussion

The curve in the resulting figures may not be a good description of the performance of the current model, because we take the average of the most recent 10 episodes as the score of the current model. So when the experiment is over, we re-evaluated the average value ten times with the saved model. This result will be more representative.

We implement multiple experiments based on the environment Pong-v0. In general, the results are basically satisfactory. The configuration of the model and its performance(Column Average reward) are displayed as Table 2.

Replay Memory Size

Figure 3 visualizes the results of Experiment 1, 2 and 3. It can be observed from 3a that when the replay memory size is 10000, the performance of the model is unstable, comparing with the averaged reward trend in Experiment 3. The reason for the differences is that the larger the experience replay, the less likely you will sample correlated elements, hence the more stable the training of the NN will be. However, a large experience replay requires a lot of memory so the training process is slower. Therefore, there is a trade-off between training stability (of the NN) and memory requirements. In these three experiments, the gamma valued 1, so the model is unbiased but with high variance, and also we have done the Experiment 2 twice, second time is basically satisfactory (as you can see in the graph), but first Experiment 2 were really poor which is almost same with Experiment 3. The result varies a lot among these two experiment due to the gamma equals to 1.

Learning Rate

Now we discuss how learning rate affects the averaged reward. It is found from Figure 4 that a high learning rate has relatively large volatility on the overall curve, and the learning ability is not stable enough, but the learning ability will be stronger.

Win Replay Memory

Here we try a new way to train our model and create a win replay memory for those frames that our agent gets reward 1. After 0.4 million frames, we start to randomly pick 5 samples from this win memory and then train the model every 5 thousand frames. The idea is for this kind of memory, the loss may vary a lot, so the model will tune the parameters more. But the results show that the performance is basically the same or even worse than that of learning rate = 0.0002.

Summary

Each experiment takes 4h on Google Colab. We achieve 10-time average reward of 7.9 as the best result that is better than Experiment 1(suggested configuration on Studium), although the result is somewhat random and may be unreproducible. It seems that the models with higher learning rate(0.002) perform better, but its reward influtuates more sharply.

This repository contains the code for the paper ``Identifiable VAEs via Sparse Decoding''.

Sparse VAE This repository contains the code for the paper ``Identifiable VAEs via Sparse Decoding''. Data Sources The datasets used in this paper wer

Gemma Moran 17 Dec 12, 2022
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context Code in both PyTorch and TensorFlow

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context This repository contains the code in both PyTorch and TensorFlow for our paper

Zhilin Yang 3.3k Jan 06, 2023
An executor that performs image segmentation on fashion items

ClothingSegmenter U2NET fashion image/clothing segmenter based on https://github.com/levindabhi/cloth-segmentation Overview The ClothingSegmenter exec

Jina AI 5 Mar 30, 2022
CSD: Consistency-based Semi-supervised learning for object Detection

CSD: Consistency-based Semi-supervised learning for object Detection (NeurIPS 2019) By Jisoo Jeong, Seungeui Lee, Jee-soo Kim, Nojun Kwak Installation

80 Dec 15, 2022
Unofficial PyTorch Implementation of "Augmenting Convolutional networks with attention-based aggregation"

Pytorch Implementation of Augmenting Convolutional networks with attention-based aggregation This is the unofficial PyTorch Implementation of "Augment

DK 20 Sep 09, 2022
Implementation of Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis

acLSTM_motion This folder contains an implementation of acRNN for the CMU motion database written in Pytorch. See the following links for more backgro

Yi_Zhou 61 Sep 07, 2022
This repository provides the official implementation of 'Learning to ignore: rethinking attention in CNNs' accepted in BMVC 2021.

inverse_attention This repository provides the official implementation of 'Learning to ignore: rethinking attention in CNNs' accepted in BMVC 2021. Le

Firas Laakom 5 Jul 08, 2022
Source code for Zalo AI 2021 submission

zalo_ltr_2021 Source code for Zalo AI 2021 submission Solution: Pipeline We use the pipepline in the picture below: Our pipeline is combination of BM2

128 Dec 27, 2022
Turi Create simplifies the development of custom machine learning models.

Quick Links: Installation | Documentation | WWDC 2019 | WWDC 2018 Turi Create Check out our talks at WWDC 2019 and at WWDC 2018! Turi Create simplifie

Apple 10.9k Jan 01, 2023
Tweesent-back - Tweesent backend uses fastAPI as the web framework

TweeSent Backend Tweesent backend. This repo uses fastAPI as the web framework.

0 Mar 26, 2022
Multi-Stage Spatial-Temporal Convolutional Neural Network (MS-GCN)

Multi-Stage Spatial-Temporal Convolutional Neural Network (MS-GCN) This code implements the skeleton-based action segmentation MS-GCN model from Autom

Benjamin Filtjens 8 Nov 29, 2022
The code for the NSDI'21 paper "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing".

BMC The code for the NSDI'21 paper "BMC: Accelerating Memcached using Safe In-kernel Caching and Pre-stack Processing". BibTex entry available here. B

Orange 383 Dec 16, 2022
Powerful and efficient Computer Vision Annotation Tool (CVAT)

Computer Vision Annotation Tool (CVAT) CVAT is free, online, interactive video and image annotation tool for computer vision. It is being used by our

OpenVINO Toolkit 8.6k Jan 01, 2023
An MQA (Studio, originalSampleRate) identifier for lossless flac files written in Python.

An MQA (Studio, originalSampleRate) identifier for "lossless" flac files written in Python.

Daniel 10 Oct 03, 2022
Build fully-functioning computer vision models with PyTorch

Detecto is a Python package that allows you to build fully-functioning computer vision and object detection models with just 5 lines of code. Inferenc

Alan Bi 576 Dec 29, 2022
Deep Dual Consecutive Network for Human Pose Estimation (CVPR2021)

Beanie - is an asynchronous ODM for MongoDB, based on Motor and Pydantic. It uses an abstraction over Pydantic models and Motor collections to work wi

295 Dec 29, 2022
CONditionals for Ordinal Regression and classification in tensorflow

Condor Ordinal regression in Tensorflow Keras Tensorflow Keras implementation of CONDOR Ordinal Regression (aka ordinal classification) by Garrett Jen

9 Jul 31, 2022
code associated with ACL 2021 DExperts paper

DExperts Hi! This repository contains code for the paper DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts to appear at

Alisa Liu 68 Dec 15, 2022
The code of paper "Block Modeling-Guided Graph Convolutional Neural Networks".

Block Modeling-Guided Graph Convolutional Neural Networks This repository contains the demo code of the paper: Block Modeling-Guided Graph Convolution

22 Dec 08, 2022
Contour-guided image completion with perceptual grouping (BMVC 2021 publication)

Contour-guided Image Completion with Perceptual Grouping Authors Morteza Rezanejad*, Sidharth Gupta*, Chandra Gummaluru, Ryan Marten, John Wilder, Mic

Sid Gupta 6 Dec 27, 2022