The Markov property = independence of path property
A decision making process is said to have the Markov property when the decision is made independent of the previous states with the exception of the current state. Such a process is called the Markov Process.
For example, playing chess is a Markov Process. Because it does not matter how you got to the current board position. In order to make the next move, all you need, is not - if you have taken their pawns first, or how you got the opponent's queen - but the current board's layout. So, if the state is the board's layout, then chess is a Markov environment.
However, this is a somewhat deceiving example. Because
When you setup the same game differently - it can be modeled Markov or otherwise.
Most of the real life examples aren't 100% Markov.
Whether your process is Markov or not is a spectrum, not binary.
Consider the case of teaching a model airplane to fly from destination A to B. (This has been done many years ago: https://www.youtube.com/watch?v=M-QUkgk3HyE&t=93s). In this case, the controller action space would be:
how fast the motor should be spinning at?
what's the aileron angle that's ideal (controls rolls)?
what's the rudder angle (controls yaw)?
what's the horizontal elevator (controls pitch) angle?
You can model the decision making process as Markov if you think of the state as everything that's needed so that you can make the next control decision necessary to be closer to destination B. Ok, that's such a tautology. :P
If you think of the state as just your current position, then it can't be Markov. Because the current position of the airplane alone is not enough to tell you how you should adjust the controls - you also need the velocity, acceleration, which wing has more lift etc.
But wait, it's almost impossible! Because if you expand this list, you can get to things like “the current wind velocity and direction, the current temperature/dew point, the current pressure of the environment, nearby animals etc". Because, if you were to be pedantic, all of these have some effect on its control's effectiveness, but they are not likely to significantly affect its trajectory to destination B.
It doesn't seem 100% correct to model this problem as a Markov in this process but we do in Reinforcement Learning.
Reasons we prefer to assume Markov when in reality tbhe might not satisfy 100% the Markov-ness.
We can see that:
If the state is Markov. Equation 1 can be simplified to equation 2. From that, we can deduce that all future states and expected next reward from the current state and action - meaning we not only can deduce the next one, but all future state and reward pair.
Pr here stands for the transition probability.
Following from the above bullet point, by having this Markov Property we can significantly simplify the math, e.g., To solve the the Bellman equation: https://twitter.com/ytsheng/status/963981598755504128, I was able to cross off some terms in the conditionals.
Markov property guarantees that the best policy for choosing actions as a function of Markov State is just as good as the best policy for choosing actions as a function of complete histories.
If the Markov property does hold, then the environment defines a Markov Decision Process.