directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy gradients