📖 Experience Replay
# Drawbacks of Naive DQN/TD Learning
1. Waste of experience
- A transition: $\left(s_t, a_t, r_t, s_{t+1}\right)$.
- Experience: the collection of all transitions, for $t=1,2, \cdots$.
- Previously, we discarded $\left(s_t, a_t, r_t, s_{t+1}\right)$ right after using it once.
- Throwing away transitions that could still improve $\mathbf{w}$ is wasteful.
2. Correlated updates
- Previously, we used $\left(s_t, a_t, r_t, s_{t+1}\right)$ sequentially, for $t=1,2, \cdots$, to update $\mathbf{w}$.
- Consecutive states, $s_t$ and $s_{t+1}$, are strongly correlated, which hurts SGD: the gradient estimates are far from independent.
# Experience Replay
- A transition: $\left(s_t, a_t, r_t, s_{t+1}\right)$.
- Store recent $n$ transitions in a replay buffer.
- Remove old transitions so that the buffer has at most $n$ transitions.
- Buffer capacity $n$ is a tuning hyper-parameter [1, 2].
- $n$ is typically large, e.g., $10^5 \sim 10^6$.
- The best setting of $n$ is application-specific; a minimal buffer sketch is given after this list.
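
The replay buffer can be implemented as a fixed-capacity FIFO queue. Below is a minimal sketch, assuming single-transition uniform sampling; the class and method names (`ReplayBuffer`, `push`, `sample`) are illustrative and not from the original notes.

```python
# Minimal replay-buffer sketch: a FIFO queue of at most n transitions.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # deque(maxlen=n) automatically drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        # store one transition (s_t, a_t, r_t, s_{t+1})
        self.buffer.append((s, a, r, s_next))

    def sample(self):
        # uniform sampling of a single transition
        return random.choice(self.buffer)

    def __len__(self):
        return len(self.buffer)
```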

# TD with Experience Replay
- Find $\mathbf{w}$ by minimizing $L(\mathbf{w})=\frac{1}{T} \sum_{t=1}^T \frac{\delta_t^2}{2}$.
- Stochastic gradient descent (SGD):
- Randomly sample a transition, $\left(s_i, a_i, r_i, s_{i+1}\right)$, from the buffer.
- Compute the TD error, $\delta_i=Q\left(s_i, a_i ; \mathbf{w}\right)-\left(r_i+\gamma \cdot \max _a Q\left(s_{i+1}, a ; \mathbf{w}\right)\right)$.
- Stochastic gradient: $\mathbf{g}_i=\frac{\partial\, \delta_i^2 / 2}{\partial \mathbf{w}}=\delta_i \cdot \frac{\partial Q\left(s_i, a_i ; \mathbf{w}\right)}{\partial \mathbf{w}}$ (the TD target is treated as a constant).
- SGD: $\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \mathbf{g}_i$. A worked sketch of one update follows this list.
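
As a concrete illustration, here is one such update written with PyTorch, assuming a Q-network `q_net(state) -> Q-values over actions` and a buffer with the `sample()` interface sketched above; all names and hyper-parameter values are illustrative.

```python
# One TD update on a transition sampled uniformly from the replay buffer.
import torch

def td_update(q_net, buffer, alpha=1e-3, gamma=0.99):
    s, a, r, s_next = buffer.sample()                       # (s_i, a_i, r_i, s_{i+1})
    q = q_net(torch.as_tensor(s, dtype=torch.float32))[a]   # Q(s_i, a_i; w)
    with torch.no_grad():                                    # TD target is held constant
        y = r + gamma * q_net(torch.as_tensor(s_next, dtype=torch.float32)).max()
    delta = q - y                                            # TD error delta_i
    loss = 0.5 * delta ** 2                                  # delta_i^2 / 2
    q_net.zero_grad()
    loss.backward()                                          # g_i = delta_i * dQ/dw
    with torch.no_grad():
        for p in q_net.parameters():
            p -= alpha * p.grad                              # w <- w - alpha * g_i
    return float(delta)                                      # delta_i (useful later for prioritization)
```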
# Benefits of Experience Replay
1. Random sampling breaks the correlation between consecutive updates.
2. Each collected transition can be reused many times, improving sample efficiency.
# Prioritized Experience Replay
## Basic Idea
- Not all transitions are equally important.
- How do we know which transition is important?
- If a transition has high TD error $\left|\delta_t\right|$, it will be given high priority.

## Importance Sampling
- Use importance sampling instead of uniform sampling.
- Option 1: Sampling probability $p_t \propto\left|\delta_t\right|+\epsilon$.
- Option 2: Sampling probability $p_t \propto \frac{1}{\text{rank}(t)}$.
- Transitions are sorted by $\left|\delta_t\right|$ in descending order.
- $\text{rank}(t)$ is the position of the $t$-th transition in this order.
- In both options, transitions with large $\left|\delta_t\right|$ get high sampling priority; a sketch of the two schemes follows.
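
Here is a small sketch of how the two sampling probabilities could be computed from the buffer's stored TD errors; the function names and NumPy-based implementation are illustrative.

```python
# Sampling probabilities for prioritized replay, given the stored TD errors.
import numpy as np

def proportional_probs(td_errors, eps=1e-6):
    # Option 1: p_t proportional to |delta_t| + eps
    p = np.abs(td_errors) + eps
    return p / p.sum()

def rank_based_probs(td_errors):
    # Option 2: p_t proportional to 1 / rank(t), where rank 1 = largest |delta_t|
    order = np.argsort(-np.abs(td_errors))            # indices sorted by descending |delta|
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(td_errors) + 1)   # rank(t) for each transition t
    p = 1.0 / ranks
    return p / p.sum()
```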
## Scaling Learning Rate
- SGD: $\mathbf{w} \leftarrow \mathbf{w}-\alpha \cdot \mathbf{g}$, where $\alpha$ is the learning rate.
- If uniform sampling is used, $\alpha$ is the same for all transitions.
- If importance sampling is used, $\alpha$ should be adjusted to compensate for the non-uniform sampling probabilities.
- Scale the learning rate by $\left(n p_t\right)^{-\beta}$, where $\beta \in(0,1)$.
- If $p_1=\cdots=p_n=\frac{1}{n}$ (uniform sampling), the scaling factor equals $1$.
- High-priority transitions (with high $p_t$) get smaller learning rates; this corrects for the bias of being sampled more often.
- In the beginning, set $\beta$ small; anneal $\beta$ towards $1$ over time. A small sketch of the scaling follows this list.
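
A sketch of this per-transition learning-rate correction; the function name and NumPy implementation are illustrative.

```python
# Per-transition learning-rate scaling (n * p_t)^(-beta).
import numpy as np

def scaled_learning_rates(alpha, probs, beta):
    # alpha: base learning rate; probs: sampling probabilities p_1..p_n; beta in (0, 1].
    n = len(probs)
    weights = (n * np.asarray(probs)) ** (-beta)   # equals 1 everywhere when p_t = 1/n
    return alpha * weights                          # high-p_t transitions take smaller steps
```

With `probs = np.full(n, 1.0 / n)`, every returned entry equals `alpha`, matching the uniform-sampling case above.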
## Update TD-Error
- Associate each transition, $\left(s_t, a_t, r_t, s_{t+1}\right)$, with a TD error, $\delta_t$.
- If a transition is newly collected, we do not know its $\delta_t$.
- Simply set its $\delta_t$ to the current maximum in the buffer.
- This gives the new transition the highest priority.
- Each time $\left(s_t, a_t, r_t, s_{t+1}\right)$ is selected from the buffer, its $\delta_t$ is recomputed and updated; a buffer sketch with this bookkeeping follows.
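
The sketch below ties the priority bookkeeping together: new transitions enter at the maximum priority, and priorities are refreshed after a transition is used. It stores priorities in a plain list for clarity (real implementations typically use a sum-tree); all names are illustrative.

```python
# Prioritized buffer sketch: list-based priorities, proportional sampling.
import numpy as np

class PrioritizedBuffer:
    def __init__(self, capacity, eps=1e-6):
        self.capacity, self.eps = capacity, eps
        self.transitions, self.priorities = [], []

    def push(self, transition):
        # A new transition gets the current maximum priority (its TD error is unknown).
        p_max = max(self.priorities, default=1.0)
        if len(self.transitions) >= self.capacity:   # evict the oldest transition
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append(p_max)

    def sample(self):
        # Sampling probability p_t proportional to the stored priority.
        p = np.asarray(self.priorities)
        probs = p / p.sum()
        i = np.random.choice(len(self.transitions), p=probs)
        return i, self.transitions[i], probs[i]

    def update_priority(self, i, td_error):
        # Called after the transition is used, with its freshly computed TD error.
        self.priorities[i] = abs(td_error) + self.eps
```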

# References
1. Zhang & Sutton. A deeper look at experience replay. In NIPS Workshop, 2017.
2. Fedus et al. Revisiting fundamentals of experience replay. In ICML, 2020.