Online Markov Decision Processes under Bandit Feedback
published: March 25, 2011, recorded: December 2010, views: 3094
Description
We consider online learning in finite stochastic Markovian environments where, in each time step, a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in terms of the total reward received. In each time step the agent observes the current state and the reward associated with the last transition, but not the rewards associated with other state-action pairs; the transition probabilities are assumed to be known. The state-of-the-art result for this setting is a no-regret algorithm. In this paper we propose a new learning algorithm and, assuming that stationary policies mix uniformly fast, we show that after T time steps the expected regret of the new algorithm is O(T^{2/3} (ln T)^{1/3}), giving the first rigorously proved convergence rate for this problem.
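The interaction protocol described above can be sketched in a few lines of Python. This is only an illustration of the setting, not the paper's algorithm: the state space, transition kernel `P`, adversarial reward sequence, and the uniformly random placeholder learner are all invented for the example. The key features it demonstrates are that the whole reward sequence is fixed in advance (oblivious adversary), the agent knows `P`, and at each step it observes only the current state and the reward of its own last transition (bandit feedback).

```python
import random

random.seed(0)

states, actions = [0, 1], [0, 1]
# Known transition probabilities: P[s][a] is a distribution over next states.
P = {0: {0: [0.9, 0.1], 1: [0.1, 0.9]},
     1: {0: [0.5, 0.5], 1: [0.8, 0.2]}}

T = 1000
# Oblivious adversary: the reward functions r_t(s, a) for all T steps are
# fixed before the interaction begins and do not depend on the agent's play.
reward_sequence = [{(s, a): random.random() for s in states for a in actions}
                   for _ in range(T)]

def step(s, a, t):
    """Sample the next state from P and return the (hidden) reward r_t(s, a)."""
    s_next = random.choices(states, weights=P[s][a])[0]
    return s_next, reward_sequence[t][(s, a)]

# Placeholder learner: uniformly random actions. The paper's algorithm would
# instead update its policy using only the single reward observed per step.
s, total = 0, 0.0
for t in range(T):
    a = random.choice(actions)
    s, r = step(s, a, t)   # bandit feedback: only r_t(s, a) is revealed,
    total += r             # never r_t at other state-action pairs

print(f"total reward of the random policy over T={T} steps: {total:.1f}")
```

Regret is then the gap between `total` and the total reward the best stationary policy (chosen in hindsight for this fixed reward sequence) would have collected.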
See Also:
Download slides: nips2010_neu_omd_01.pdf (172.5 KB)
Download article: nips2010_1311.pdf (232.5 KB)