The Difference between RL and Supervised Learning - Part 2

In part one, I briefly highlighted why Reinforcement Learning is so exciting: it is actually a fundamental concept that spans many fields. In this post, we are going to examine the difference between RL and Supervised Learning and explore some edge cases.

In his lecture series, David Silver mentions that the main difference between Supervised Learning and Reinforcement Learning is this:

  • Feedback in Reinforcement Learning could be delayed.
    • I believe his reasoning is that in episodic settings, you won't know the result of your actions until the end of the episode. (When I say episodic, think of Atari games: each game is an episode, and you won't know whether your actions were good until the game ends and you learn whether you won or lost.) In supervised learning, by contrast, you get the result with every batch. See the sketch after this list.
    • My critique: it is confusing to frame the distinction that way, because delayed feedback is also common in supervised learning. Survival analysis, for example, is a subfield of ML that models the time until an event of interest, so the label may not be observed until long after the features are collected.
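
Below is a minimal sketch of that contrast. The toy environment, the policy, and the win/lose reward rule are all hypothetical; the point is only that in an episodic setting the feedback for every step arrives once the episode ends, whereas a supervised batch carries its labels up front.

```python
import random

def play_episode(policy, max_steps=20):
    """Roll out one episode; no feedback is available until the terminal step."""
    trajectory = []                                  # (state, action) pairs, no signal yet
    state = 0
    for _ in range(max_steps):
        action = policy(state)
        trajectory.append((state, action))
        state += action                              # toy transition
    final_reward = 1.0 if state > 0 else -1.0        # "win/lose" is known only here
    # Every step in the trajectory shares the same delayed signal.
    return [(s, a, final_reward) for s, a in trajectory]

episode = play_episode(lambda s: random.choice([-1, 1]))

# Contrast: in supervised learning the feedback (the label) is attached to
# each sample before training even starts.
labeled_batch = [(x, x % 2) for x in range(8)]
```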

Some people would say that:

  • Reinforcement Learning is a sequential decision-making process, where the current prediction determines the future sampling space, which in turn determines future predictions.
  • My critique: an RNN is a sequence model that takes the previous step's hidden state into consideration when generating the next prediction. You could argue that, in some ways, sequence-to-sequence models (RNNs) [1] also fit this bucket, so this does not uniquely differentiate RL from Supervised Learning. However, if you zoom out one level, each (sample, label) pair is independent of the others in the RNN setting, so you could view an RNN as non-sequential between samples. A sketch of this contrast follows this list.
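
To make the zoom-out concrete, here is a minimal sketch with a hypothetical random-walk environment and a fixed labeled dataset: in RL the action chosen now determines which states can even be visited next, while in the supervised (RNN) setting each (sample, label) pair comes from a dataset that never changes in response to the model's predictions.

```python
import random

def rl_rollout(policy, start_state=0, steps=10):
    """Current prediction (action) determines the future sampling space (states)."""
    state, visited = start_state, []
    for _ in range(steps):
        action = policy(state)
        state = state + action           # the next state depends on this action
        visited.append(state)
    return visited

def supervised_epoch(update_fn, dataset):
    """Samples are fixed up front; predictions never change what gets sampled."""
    for x, y in dataset:
        update_fn(x, y)

print(rl_rollout(lambda s: random.choice([-1, 1])))
```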

To summarize my findings, the main difference between Supervised Learning and Reinforcement Learning is twofold:

  • First, feedback type. Reinforcement Learning's feedback is weaker than feedback in a supervised learning context, i.e., evaluative feedback vs. instructive feedback. Evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. Think of instructive feedback as a teacher grading your math homework and telling you which answers are right or wrong. Think of evaluative feedback as a cook tasting the sauce they made and deciding whether to add more salt. The second is more subjective than the first. (A minimal sketch of this contrast appears after this list.)
    • Feedback can also be sampled and non-exhaustive in RL. Think of a performance on stage: you might only look at the first couple of rows to gauge how well you did, rather than surveying the entire audience one by one.
    • Feedback can be sequential instead of one-shot. (Note that an RNN's feedback is one-shot if we look at it at the sample level rather than the word-prediction level.)
    • Some critics might say: what about off-policy learning? You don't necessarily decide your next action based on the action you just took. Q-Learning is one example: "Q-Learning estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed, despite the fact that it may not be following a greedy policy." You don't always take the greedy action, because of exploration steps, so the update direction can be independent of the action actually taken in some cases. But this is a nuanced argument, because on average, off-policy methods such as Q-Learning still evaluate the action space with respect to the greedy choice most of the time. (A sketch of the Q-Learning update appears after this list.)
  • Second, data availability. In supervised learning, sample data is gathered independently of the model and the learning algorithm, the prediction goal is well defined, and the labels (ground truth) are known before the model is developed. In reinforcement learning, however, no sample data is given prior to training; the RL algorithm gathers its sample data during the training phase. Design choices such as the definition of rewards (think of eating an apple being worth +5 happiness) and optimization choices such as policy-based vs. Q-value methods have a direct impact on predictions of future actions. In other words, the sampling space depends on the design choices of the model and algorithm. If the sampling space is biased for whatever reason, say an autonomous driving algorithm prefers sampling highway routes, it won't be able to gather enough data on local roads to learn how to handle local traffic. (A toy illustration of this sampling bias appears after this list.)
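
On the first point, here is a minimal sketch of instructive vs. evaluative feedback; the action names and reward values are made up for illustration. With instructive feedback the correct answer is revealed regardless of what you predicted; with evaluative feedback you only learn a score for the action you actually took.

```python
import random

def instructive_feedback(prediction, label):
    """Supervised: the label reveals the right answer no matter what you predicted."""
    return {"correct_answer": label, "was_correct": prediction == label}

def evaluative_feedback(action, reward_fn):
    """RL / bandit: you only get a score for the action you took; the rest stays unknown."""
    return {"action": action, "reward": reward_fn(action)}

# Toy "cook tasting the sauce" setup with made-up reward values.
true_rewards = {"add_salt": 0.2, "leave_as_is": 0.9}
chosen = random.choice(list(true_rewards))
print(evaluative_feedback(chosen, lambda a: true_rewards[a]))
```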
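
On the off-policy point, here is a minimal sketch of a tabular Q-Learning update (toy state and action spaces, illustrative hyperparameters): the behavior policy is epsilon-greedy, but the update target always assumes the greedy action would be taken in the next state.

```python
import random
from collections import defaultdict

Q = defaultdict(float)                 # Q[(state, action)] -> estimated return
ACTIONS = [0, 1]
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1

def epsilon_greedy(state):
    """Behavior policy: sometimes explores, so it is not purely greedy."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def q_update(state, action, reward, next_state):
    """Target assumes the greedy action in next_state, whatever we actually do next."""
    greedy_value = max(Q[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * greedy_value
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])
```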
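
On the second point, here is a toy illustration of the sampling bias concern; the route names and the 90/10 preference are made up. Because the training data is generated by the agent's own policy, a policy that prefers one kind of state starves the other of samples.

```python
import random
from collections import Counter

def biased_policy():
    """Hypothetical preference: choose the highway 90% of the time."""
    return "highway" if random.random() < 0.9 else "local_road"

# The "dataset" is whatever the policy happens to visit.
visit_counts = Counter(biased_policy() for _ in range(10_000))
print(visit_counts)   # roughly 9,000 highway samples vs. 1,000 local-road samples
```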

In the next post in this series, I want to talk about this terminology graph and go through the terms one by one.

RL landscape