📖 Multi-Step TD Target

# Multi-Step Return

Since $U_t=R_t+\gamma \cdot U_{t+1}$ and $U_{t+1}=R_{t+1}+\gamma \cdot U_{t+2}$, we can unroll the recursion for $m$ steps and derive the following identity:

![image.png](https://cos.easydoc.net/46811466/files/l84gcphx.png)

**Identity**: $U_t=\sum_{i=0}^{m-1} \gamma^i \cdot R_{t+i}+\gamma^m \cdot U_{t+m}$.

- $m$-step TD target for Sarsa:

$$
y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot Q_\pi\left(s_{t+m}, a_{t+m}\right) .
$$

- One-step TD target for Sarsa:

$$
y_t=r_t+\gamma \cdot Q_\pi\left(s_{t+1}, a_{t+1}\right) .
$$

- $m$-step TD target for Q-learning:

$$
y_t=\sum_{i=0}^{m-1} \gamma^i \cdot r_{t+i}+\gamma^m \cdot \max _a Q^{\star}\left(s_{t+m}, a\right) .
$$

- One-step TD target for Q-learning:

$$
y_t=r_t+\gamma \cdot \max _a Q^{\star}\left(s_{t+1}, a\right) .
$$

# One-Step vs. Multi-Step

- The one-step TD target uses only one reward: $r_t$.
- The $m$-step TD target uses $m$ rewards: $r_t, r_{t+1}, r_{t+2}, \cdots, r_{t+m-1}$ (a code sketch of both $m$-step targets is given after the reference).
- If $m$ is suitably tuned, the $m$-step target works better than the one-step target [1].

# Reference

1. Hessel et al. Rainbow: Combining improvements in deep reinforcement learning. In AAAI, 2018.
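The following is a minimal sketch (not part of the original notes) of how the $m$-step TD targets above could be computed. It assumes a hypothetical setup where `rewards` holds the observed $r_t, \dots, r_{t+m-1}$, `q_next` holds the Q-values at state $s_{t+m}$ for each action, and `a_next` is the action $a_{t+m}$ sampled by the policy.

```python
import numpy as np


def m_step_sarsa_target(rewards, q_next, a_next, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * Q_pi(s_{t+m}, a_{t+m})."""
    m = len(rewards)
    discounts = gamma ** np.arange(m)          # [1, gamma, ..., gamma^{m-1}]
    return float(np.dot(discounts, rewards) + gamma ** m * q_next[a_next])


def m_step_q_learning_target(rewards, q_next, gamma=0.99):
    """y_t = sum_{i=0}^{m-1} gamma^i * r_{t+i} + gamma^m * max_a Q*(s_{t+m}, a)."""
    m = len(rewards)
    discounts = gamma ** np.arange(m)
    return float(np.dot(discounts, rewards) + gamma ** m * np.max(q_next))


# Hypothetical example: m = 3 rewards, 2 actions at s_{t+m}.
rewards = np.array([1.0, 0.0, 2.0])
q_next = np.array([0.5, 1.5])
print(m_step_sarsa_target(rewards, q_next, a_next=0))   # Sarsa target
print(m_step_q_learning_target(rewards, q_next))        # Q-learning target
```

With `m = 1` (a single reward), both functions reduce to the one-step targets shown above.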