📖 Experience Replay
# Drawbacks of Naive DQN/TD Learning
1. Waste of experience
- A transition: $\left(s_t, a_t, r_t, s_{t+1}\right)$.
- Experience: the collection of all transitions, for $t=1,2, \cdots$.
- Previously, we discarded $\left(s_t, a_t, r_t, s_{t+1}\right)$ right after using it once.
- Throwing away transitions that could still improve $\mathbf{w}$ is wasteful.
2. Correlated updates
- Previously, we used $\left(s_t, a_t, r_t, s_{t+1}\right)$ sequentially, for $t=1,2, \cdots$, to update $\mathbf{w}$.
- Consecutive states, $s_t$ and $s_{t+1}$, are strongly correlated, which hurts SGD: the gradient estimates are far from independent.
# Experience Replay
- A transition: $\left(s_t, a_t, r_t, s_{t+1}\right)$.
- Store recent $n$ transitions in a replay buffer.
- Remove old transitions so that the buffer has at most $n$ transitions.
- Buffer capacity $n$ is a tuning hyper-parameter [1, 2].
- $n$ is typically large, e.g., $10^5 \sim 10^6$.
- The best setting of $n$ is application-specific; a minimal buffer sketch is given after this list.
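
The replay buffer can be implemented as a fixed-capacity FIFO queue. Below is a minimal sketch, assuming single-transition uniform sampling; the class and method names (`ReplayBuffer`, `push`, `sample`) are illustrative and not from the original notes.

```python
# Minimal replay-buffer sketch: a FIFO queue of at most n transitions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque(maxlen=n) automatically drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        # store one transition (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((s, a, r, s_next))

    def sample(self):
        # uniform sampling of a single transition
        return random.choice(self.buffer)

    def __len__(self):
        return len(self.buffer)
```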

# TD with Experience Replay
- Find $\mathbf{w}$ by minimizing $L(\mathbf{w})=\frac{1}{T} \sum_{t=1}^T \frac{\delta_t^2}{2}$.
- Stochastic gradient descent (SGD):
- Randomly sample a transition, $\left(s_i, a_i, r_i, s_{i+1}\right)$, from the buffer.
- Compute the TD error, $\delta_i=Q\left(s_i, a_i ; \mathbf{w}\right)-\left(r_i+\gamma \cdot \max _a Q\left(s_{i+1}, a ; \mathbf{w}\right)\right)$.
- Stochastic gradient: $\mathbf{g}_i=\frac{\partial\, \delta_i^2 / 2}{\partial \mathbf{w}}=\delta_i \cdot \frac{\partial Q\left(s_i, a_i ; \mathbf{w}\right)}{\partial \mathbf{w}}$ (the TD target is treated as a constant).
- SGD: $\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \mathbf{g}_i$. A worked sketch of one update follows this list.
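
As a concrete illustration, here is one such update written with PyTorch, assuming a Q-network `q_net(state) -> Q-values over actions` and a buffer with the `sample()` interface sketched above; all names and hyper-parameter values are illustrative.

```python
# One TD update on a transition sampled uniformly from the replay buffer.
import torch

def td_update(q_net, buffer, alpha=1e-3, gamma=0.99):
    s, a, r, s_next = buffer.sample()                       # (s_i, a_i, r_i, s_{i+1})
    q = q_net(torch.as_tensor(s, dtype=torch.float32))[a]   # Q(s_i, a_i; w)
    with torch.no_grad():                                    # TD target is held constant
        y = r + gamma * q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
    delta = q - y                                            # TD error delta_i
    loss = 0.5 * delta ** 2                                  # delta_i^2 / 2
    q_net.zero_grad()
    loss.backward()                                          # g_i = delta_i * dQ/dw
    with torch.no_grad():
        for p in q_net.parameters():
            p -= alpha * p.grad                              # w <- w - alpha * g_i
    return float(delta)                                      # delta_i (useful later for prioritization)
```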
# Benefits of Experience Replay
1. Random sampling breaks the correlation between consecutive updates.
2. Each collected transition can be reused many times, improving sample efficiency.
# Prioritized Experience Replay
## Basic Idea
- Not all transitions are equally important.
- How do we know which transition is important?
- If a transition has high TD error $\left|\delta_t\right|$, it will be given high priority.

## Importance Sampling
- Use importance sampling instead of uniform sampling.
- Option 1: Sampling probability $p_t \propto\left|\delta_t\right|+\epsilon$.
- Option 2: Sampling probability $p_t \propto \frac{1}{\text{rank}(t)}$.
- Transitions are sorted by $\left|\delta_t\right|$ in descending order.
- $\text{rank}(t)$ is the position of the $t$-th transition in this order.
- In both options, transitions with large $\left|\delta_t\right|$ get high sampling priority; a sketch of the two schemes follows.
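
Here is a small sketch of how the two sampling probabilities could be computed from the buffer's stored TD errors; the function names and NumPy-based implementation are illustrative.

```python
# Sampling probabilities for prioritized replay, given the stored TD errors.
import numpy as np

def proportional_probs(td_errors, eps=1e-6):
    # Option 1: p_t proportional to |delta_t| + eps
    p = np.abs(td_errors) + eps
    return p / p.sum()

def rank_based_probs(td_errors):
    # Option 2: p_t proportional to 1 / rank(t), where rank 1 = largest |delta_t|
    order = np.argsort(-np.abs(td_errors))            # indices sorted by descending |delta|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)   # rank(t) for each transition t
    p = 1.0 / ranks
    return p / p.sum()
```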
## Scaling Learning Rate
- SGD: $\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \mathbf{g}$, where $\alpha$ is the learning rate.
- If uniform sampling is used, $\alpha$ is the same for all transitions.
- If importance sampling is used, $\alpha$ should be adjusted to compensate for the non-uniform sampling probabilities.
- Scale the learning rate by $\left(n p_t\right)^{-\beta}$, where $\beta \in(0,1)$.
- If $p_1=\cdots=p_n=\frac{1}{n}$ (uniform sampling), the scaling factor equals $1$.
- High-priority transitions (with high $p_t$) get smaller learning rates; this corrects for the bias of being sampled more often.
- In the beginning, set $\beta$ small; anneal $\beta$ towards $1$ over time. A small sketch of the scaling follows this list.
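
A sketch of this per-transition learning-rate correction; the function name and NumPy implementation are illustrative.

```python
# Per-transition learning-rate scaling (n * p_t)^(-beta).
import numpy as np

def scaled_learning_rates(alpha, probs, beta):
    # alpha: base learning rate; probs: sampling probabilities p_1..p_n; beta in (0, 1].
    n = len(probs)
    weights = (n * np.asarray(probs)) ** (-beta)   # equals 1 everywhere when p_t = 1/n
    return alpha * weights                          # high-p_t transitions take smaller steps
```

With `probs = np.full(n, 1.0 / n)`, every returned entry equals `alpha`, matching the uniform-sampling case above.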
## Update TD-Error
- Associate each transition, $\left(s_t, a_t, r_t, s_{t+1}\right)$, with a TD error, $\delta_t$.
- If a transition is newly collected, we do not know its $\delta_t$.
- Simply set its $\delta_t$ to the current maximum in the buffer.
- This gives the new transition the highest priority.
- Each time $\left(s_t, a_t, r_t, s_{t+1}\right)$ is selected from the buffer, its $\delta_t$ is recomputed and updated; a buffer sketch with this bookkeeping follows.
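
The sketch below ties the priority bookkeeping together: new transitions enter at the maximum priority, and priorities are refreshed after a transition is used. It stores priorities in a plain list for clarity (real implementations typically use a sum-tree); all names are illustrative.

```python
# Prioritized buffer sketch: list-based priorities, proportional sampling.
import numpy as np

class PrioritizedBuffer:
    def __init__(self, capacity, eps=1e-6):
        self.capacity, self.eps = capacity, eps
        self.transitions, self.priorities = [], []

    def push(self, transition):
        # A new transition gets the current maximum priority (its TD error is unknown).
        p_max = max(self.priorities, default=1.0)
        if len(self.transitions) >= self.capacity:   # evict the oldest transition
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(p_max)

    def sample(self):
        # Sampling probability p_t proportional to the stored priority.
        p = np.asarray(self.priorities)
        probs = p / p.sum()
        i = np.random.choice(len(self.transitions), p=probs)
        return i, self.transitions[i], probs[i]

    def update_priority(self, i, td_error):
        # Called after the transition is used, with its freshly computed TD error.
        self.priorities[i] = abs(td_error) + self.eps
```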

# References
1. Zhang & Sutton. A deeper look at experience replay. In NIPS Workshop, 2017.
2. Fedus et al. Revisiting fundamentals of experience replay. In ICML, 2020.