📖 Multi-Step TD Target
# Multi-Step Return
Since $U_t=R_t+\gamma \cdot U_{t+1}$ and $U_{t+1}=R_{t+1}+\gamma \cdot U_{t+2}$, we can derive the following identity:

**Identity**: $U_t=\sum_{i=0}^{m-1} \gamma^i \cdot R_{t+i}+\gamma^m \cdot U_{t+m}$.
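The identity follows by unrolling the recursion; the two-step case already shows the pattern:
$$
U_t=R_t+\gamma \cdot U_{t+1}=R_t+\gamma \cdot\left(R_{t+1}+\gamma \cdot U_{t+2}\right)=R_t+\gamma \cdot R_{t+1}+\gamma^2 \cdot U_{t+2} .
$$
Repeating the substitution $m$ times yields the identity above.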
- $m$-step TD target for Sarsa (a code sketch follows this list):
$$
y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot Q_\pi\left(s_{t+m}, a_{t+m}\right) .
$$
- One-step TD target for Sarsa:
$$
y_t=r_t+\gamma \cdot Q_\pi\left(s_{t+1}, a_{t+1}\right) .
$$
- $m$-step TD target for Q-learning:
$$
y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot \max _a Q^{\star}\left(s_{t+m}, a\right) .
$$
- One-step TD target for Q-learning:
$$
y_t=r_t+\gamma \cdot \max _a Q^{\star}\left(s_{t+1}, a\right) .
$$
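A minimal sketch of the two $m$-step targets, assuming a tabular action-value table `Q` of shape `(num_states, num_actions)`; the function names and the convention that `rewards` holds $r_t, \dots, r_{t+m-1}$ are illustrative choices, not from the original text.

```python
import numpy as np

def m_step_sarsa_target(rewards, s_next, a_next, Q, gamma):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * Q[s_{t+m}, a_{t+m}],
    where m = len(rewards) and (s_next, a_next) = (s_{t+m}, a_{t+m})."""
    m = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**m * Q[s_next, a_next]

def m_step_q_learning_target(rewards, s_next, Q, gamma):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * max_a Q[s_{t+m}, a]."""
    m = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**m * np.max(Q[s_next])
```

With `rewards` of length 1, both functions reduce to the one-step targets above.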
# One-Step vs. Multi-Step
- One-step TD target uses only one reward: $r_t$.
- $m$-step TD target uses $m$ rewards: $r_t, r_{t+1}, r_{t+2}, \cdots, r_{t+m-1}$.
- If $m$ is suitably tuned, the $m$-step target works better than the one-step target [1]; a sketch of the corresponding update follows this list.
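The sketch below shows how the $m$-step Sarsa target drives a tabular update, with $m$ as a tunable hyperparameter. The trajectory format, the learning rate `alpha`, and the default `gamma` are assumptions for illustration, not specified in the text.

```python
import numpy as np

def m_step_sarsa_update(Q, traj, t, m, gamma=0.99, alpha=0.1):
    """Update Q[s_t, a_t] toward the m-step Sarsa target y_t.

    `traj` is a list of (state, action, reward) tuples and must contain
    entries up to index t + m, so that (s_{t+m}, a_{t+m}) is available.
    """
    s_t, a_t, _ = traj[t]
    rewards = [traj[t + i][2] for i in range(m)]            # r_t, ..., r_{t+m-1}
    s_m, a_m, _ = traj[t + m]                               # s_{t+m}, a_{t+m}
    y_t = sum(gamma**i * r for i, r in enumerate(rewards)) + gamma**m * Q[s_m, a_m]
    Q[s_t, a_t] += alpha * (y_t - Q[s_t, a_t])              # TD update toward y_t
    return Q

# Toy usage: 5 states, 2 actions, a short trajectory of (state, action, reward).
Q = np.zeros((5, 2))
traj = [(0, 1, 1.0), (1, 0, 0.5), (2, 1, -0.2), (3, 0, 0.0)]
Q = m_step_sarsa_update(Q, traj, t=0, m=2)                  # m=2 uses r_0, r_1 and Q[s_2, a_2]
```

Setting `m=1` recovers the one-step Sarsa update; only the choice of $m$ changes between one-step and multi-step learning.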
# Reference
1. Hessel et al. Rainbow: Combining Improvements in Deep Reinforcement Learning. In AAAI, 2018.