An MDP Take-away
Last updated on August 13, 2024
A Markov decision process take-away.
Problem Formulation
- `states`: a set of all possible states
- `start_state`: the starting state
- `actions(state)`: given a state, return a list of all valid actions
- `T(state, action, new_state)`: return the probability of reaching `s'` when taking action `a` at state `s`
- `reward(state, action, new_state)`: return the reward for the given transition
- `is_end(state)`: check whether the state is a terminal state
- `gamma` / `discount`: discount factor for future rewards (between 0 and 1)
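To make the formulation concrete, here is a minimal Python sketch of this interface for a tiny, made-up "walk to the goal" MDP; the class name, states, probabilities, and rewards are all illustrative assumptions, not from the original notes.

```python
class TinyWalkMDP:
    """Hypothetical example MDP: walk from position 0 to a goal position."""

    def __init__(self, length=4, discount=0.9):
        self.states = list(range(length + 1))  # positions 0..length
        self.start_state = 0
        self.discount = discount               # gamma, between 0 and 1
        self._goal = length

    def actions(self, state):
        # all valid actions in the given state (none at the terminal state)
        return [] if self.is_end(state) else ["stay", "step"]

    def T(self, state, action, new_state):
        # probability of landing in new_state after taking action at state
        if action == "step":
            if new_state == state + 1:
                return 0.8                     # the step succeeds
            if new_state == state:
                return 0.2                     # the step fails, stay put
            return 0.0
        return 1.0 if new_state == state else 0.0  # "stay" is deterministic

    def reward(self, state, action, new_state):
        # reward for the given transition: only reaching the goal pays off
        return 10.0 if new_state == self._goal and state != self._goal else 0.0

    def is_end(self, state):
        # whether the state is terminal
        return state == self._goal
```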
Value Iteration
Value iteration converges quite fast: each update is a contraction, so the values approach their fixed point geometrically at a rate set by the discount factor.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - for `a` in `actions(s)`
      - get the q value `Q(s, a)`
    - take the maximum of all q values as `max_q`
    - update `v` of `s` with `max_q`
- return all `v`
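A minimal Python sketch of this loop, assuming an MDP object with the interface sketched in Problem Formulation (`states`, `actions(s)`, `T(s, a, s')`, `reward(s, a, s')`, `is_end(s)`, `discount`); the function name and tolerance parameter are my own choices.

```python
def value_iteration(mdp, max_iters=1000, tol=1e-6):
    v = {s: 0.0 for s in mdp.states}           # initialize v of all states to 0
    for _ in range(max_iters):
        new_v = {}
        for s in mdp.states:
            if mdp.is_end(s):
                new_v[s] = 0.0                 # terminal states keep value 0
                continue
            # q value of each action: sum over s' of T * (reward + discount * v[s'])
            q_values = [
                sum(
                    mdp.T(s, a, sp) * (mdp.reward(s, a, sp) + mdp.discount * v[sp])
                    for sp in mdp.states
                )
                for a in mdp.actions(s)
            ]
            new_v[s] = max(q_values)           # update v of s with max_q
        if max(abs(new_v[s] - v[s]) for s in mdp.states) < tol:
            return new_v                       # converged
        v = new_v
    return v

# e.g. value_iteration(TinyWalkMDP()) with the hypothetical MDP sketched earlier
```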
Q Value
`Q(s, a) = Σ_{s'} T(s, a, s') * (reward(s, a, s') + discount * V(s'))`
With given state `s` and action `a`:

- for all possible new states `new_state`
  - take the probability `T(state, action, new_state)` as `p`
  - take the reward as `r`
  - compute `item = p * (r + discount * v)`, where `v` is the value of `new_state`
- return the sum of all `item`s
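The same steps as a small Python helper; it assumes the MDP interface from the Problem Formulation sketch and a dict `v` of current value estimates.

```python
def q_value(mdp, v, state, action):
    total = 0.0
    for new_state in mdp.states:               # for all possible new states
        p = mdp.T(state, action, new_state)    # transition probability
        if p == 0.0:
            continue
        r = mdp.reward(state, action, new_state)
        total += p * (r + mdp.discount * v[new_state])  # item = p * (r + discount * v)
    return total                               # sum of all items
```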
Policy Evaluation
A policy is a function that maps a state to an action, i.e., it specifies which action should be applied in a given state.
Use policy evaluation to determine whether a policy `pi` is good enough or outperforms other policies.
- initialize `v` of all states `s` to 0
- for iteration `t` in `1..` (until convergence or max iteration depth)
  - for each state `s`
    - the action `a` comes from the evaluated policy `pi`
    - update `v` of `s` with `Q(s, a)`
This is value iteration with the set of candidate actions at each state cut down to the single action given by the evaluated policy `pi`, as in the sketch below.
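A sketch of policy evaluation under these assumptions, reusing the `q_value` helper above; the policy `pi` is represented as a dict mapping each non-terminal state to an action.

```python
def policy_evaluation(mdp, pi, max_iters=1000, tol=1e-6):
    v = {s: 0.0 for s in mdp.states}           # initialize v of all states to 0
    for _ in range(max_iters):
        new_v = {}
        for s in mdp.states:
            if mdp.is_end(s):
                new_v[s] = 0.0
                continue
            a = pi[s]                          # the action comes from the evaluated policy
            new_v[s] = q_value(mdp, v, s, a)   # update v of s with Q(s, a)
        if max(abs(new_v[s] - v[s]) for s in mdp.states) < tol:
            return new_v                       # converged
        v = new_v
    return v
```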
Policy Iteration
Policy iteration finds the best policy by repeatedly evaluating and improving it: once every action given by the policy is optimal in its state (i.e., achieves the highest q value), the policy itself is optimal.
- randomly initialize the policy `pi`
- for iteration `t` in `1..`, until convergence (`pi` does not change) or max iteration depth
  - get `v` of all states by policy evaluation
  - improve the policy
    - for `s` in all states
      - for `a` in all valid actions
        - compute `q = Q(s, a)`
      - get the maximum `q` as `max_q`
      - update the policy at `s` to the `a` with `max_q`
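Putting the pieces together, a sketch of the policy iteration loop; it assumes the `policy_evaluation` and `q_value` helpers above and the MDP interface from the Problem Formulation sketch.

```python
import random

def policy_iteration(mdp, max_iters=100):
    # randomly initialize the policy for non-terminal states
    pi = {s: random.choice(mdp.actions(s)) for s in mdp.states if not mdp.is_end(s)}
    for _ in range(max_iters):
        v = policy_evaluation(mdp, pi)         # get v of states by policy evaluation
        changed = False
        for s in pi:                           # improve the policy greedily
            # pick the action with the maximum q value in state s
            best_a = max(mdp.actions(s), key=lambda a: q_value(mdp, v, s, a))
            if best_a != pi[s]:
                pi[s] = best_a                 # update the policy of s
                changed = True
        if not changed:
            break                              # convergence: pi did not change
    return pi

# e.g. policy_iteration(TinyWalkMDP()) with the hypothetical MDP sketched earlier
```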