Reinforcement Learning Isomorphisms - Part 1

This series is inspired by the Feynman learning method and motivated by my own struggle navigating the sea of online information about reinforcement learning (RL). It is my best effort at summarizing the field, including a list of RL terms and concepts and how they relate to each other.

Part 1 is about isomorphisms of RL across different fields. Part 2 will explain the intricacies of RL terminology.

The brain is an agent. The globe is the world.

  • Agent: Receives signals from the world, chooses and performs actions, and receives a reward (or estimates one).
  • Environment: Takes the action from the agent and emits signals the agent can observe.
  • Internal state of an agent in an MDP: a summary of everything the agent needs to know to decide its next action
    • Ht = A1, O1, R1, A2, O2, R2, …, At, Ot, Rt (could be the full past history)
    • Could be a function of Ht (because the agent may not keep a complete memory of the past history)
  • The difference between reward and value
    • Reward is the short-term, immediate desirability of an action
    • Value is the long-term desirability: the total amount of reward that can accumulate following an action
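
To make the loop above concrete, here is a minimal sketch in Python. Everything in it is my own illustration, not a standard API: the coin-guessing environment, the class names, and the discount factor gamma are all assumptions. The agent keeps the full history Ht but decides using only a compressed summary of it, and discounted_return shows one common way to turn a stream of short-term rewards into a long-term, value-style number.

```python
import random


class CoinFlipEnvironment:
    """Hypothetical toy environment: the agent guesses the side of a biased coin."""

    def __init__(self, heads_probability=0.7, seed=0):
        self.heads_probability = heads_probability
        self.rng = random.Random(seed)

    def step(self, action):
        # Take the agent's action and return (observation, reward).
        outcome = "heads" if self.rng.random() < self.heads_probability else "tails"
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward


class CountingAgent:
    """Hypothetical agent whose internal state is a function of the history Ht."""

    def __init__(self):
        self.history = []                        # Ht = A1, O1, R1, ..., At, Ot, Rt
        self.counts = {"heads": 0, "tails": 0}   # compressed summary of Ht

    def choose_action(self):
        # Decide from the summary only, not from the full history.
        return "heads" if self.counts["heads"] >= self.counts["tails"] else "tails"

    def observe(self, action, observation, reward):
        self.history.append((action, observation, reward))
        self.counts[observation] += 1


def discounted_return(rewards, gamma=0.9):
    # Long-term desirability: sum over k of gamma**k * rewards[k].
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def run_episode(steps=20):
    env = CoinFlipEnvironment()
    agent = CountingAgent()
    for _ in range(steps):
        action = agent.choose_action()           # agent chooses an action
        observation, reward = env.step(action)   # environment responds
        agent.observe(action, observation, reward)
    return agent
```

Calling run_episode() returns the agent, so you can inspect agent.history (the raw Ht) next to agent.counts (the summary it actually uses), and pass the rewards from the history to discounted_return to see the gap between a single reward and the accumulated value.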

I will elaborate on the intricacies of this terminology in Part 2 of this series.

Parallel concepts!

While this series focuses primarily on the machine learning side of RL, it's important to highlight how RL's ideas have isomorphic counterparts in other fields.

  • Neuroscience:

Neuroscientists have been studying how the brain generates behaviors for decades. At the neural level, reinforcement allows for the strengthening of synaptic associations between pathways carrying conditioned and unconditioned stimulus information. [1] Or, if you'd like to think in terms of the reward function: loosely speaking, the human brain uses dopamine as its reward signal.

  • Psychology:

Classical conditioning is learning new behaviors through a series of associations (think covariance matrices and Bayesian inference). Operant conditioning is a learning process through which the strength of a behavior is modified by reward and punishment. RL is more closely related to operant conditioning, because that's literally how you train your little agent (human or mechanical): you reward them for doing things you like so as to encourage them to repeat similar actions.
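
Here is a rough sketch of that reward-and-punishment idea in Python. The action names, the reward values, the learning rate, and the update rule are all assumptions I made up for illustration; this is not a model from the psychology literature. Each behavior has a "strength", behaviors are picked in proportion to their strength, and the trainer's reward or punishment nudges the strength of whatever was just done.

```python
import random


def train_by_reward(rewarded_action="sit", actions=("sit", "bark", "jump"),
                    trials=200, learning_rate=0.1, seed=0):
    """Hypothetical operant-conditioning loop: reward strengthens a behavior."""
    rng = random.Random(seed)
    strength = {action: 1.0 for action in actions}
    for _ in range(trials):
        # Pick a behavior with probability proportional to its current strength.
        chosen = rng.choices(list(strength), weights=list(strength.values()))[0]
        # The trainer rewards the desired behavior and mildly punishes the rest.
        reward = 1.0 if chosen == rewarded_action else -0.5
        # Keep a small floor so every behavior stays possible.
        strength[chosen] = max(0.05, strength[chosen] + learning_rate * reward)
    return strength
```

After training, the rewarded behavior can only have gained strength and the others can only have lost it, which is the "encourage them to repeat similar actions" effect in miniature.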

  • Economics: 

Economic agents were traditionally portrayed as fully rational Bayesian maximizers of subjective utility. However, studies have shown that the agents (us humans) aren't fully rational. We frequently settle for an outcome that is satisfactory rather than optimal. In other words, due to our limited resources, our rationality is bounded. [2] This is a problem that is well studied in Reinforcement Learning: an agent navigating through a maze with only limited information about the world. Also, behavioral economics is an entire field focused on how agents (either individuals or organizations) make decisions, which runs parallel to instructing an RL agent to make rational decisions. I'd imagine you could borrow a lot of concepts across the two fields.
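
As a toy illustration of bounded rationality, here is a small Python sketch contrasting a fully rational maximizer with an agent that settles for "good enough". The function names, the good_enough threshold, and the budget parameter are hypothetical names of my own, used only to show the idea.

```python
def optimize(options, utility):
    """Fully rational agent: examines every option and picks the best one."""
    return max(options, key=utility)


def satisfice(options, utility, good_enough, budget):
    """Boundedly rational agent: stops at the first acceptable option,
    and can only afford to examine `budget` options in total."""
    best = None
    for examined, option in enumerate(options):
        if examined >= budget:
            break
        if best is None or utility(option) > utility(best):
            best = option
        if utility(option) >= good_enough:
            return option
    # Nothing cleared the bar in time, so fall back to the best option seen.
    return best
```

For example, with options [3, 7, 5, 9, 8] and utility equal to the number itself, optimize picks 9, while satisfice with good_enough=6 and budget=3 stops early at 7.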

  • Mathematics:

Operations Research is a field that focuses on using analytical methods to make better business decisions. How do you efficiently and accurately simulate a system so that you can optimize on top of it to minimize cost, maximize reward, and so on?

Which business decision should you make given the business situation? (This is similar to asking which action you should take given the signals from the environment in RL.) This is a question with lots of $$ involved!

  • Engineering:

From Wikipedia: "Optimal Control is a research area where it is focused on finding a control law for a given system such that a certain optimality criterion is achieved. A control problem includes a cost function [3] that is a function of state and control variables. An optimal control is a set of differential equations describing the paths of the control variables that minimize the cost function."

In plain terms, we have a set of optimization objectives and a set of constraints. We are trying to find the best values to assign to our variables so that we maximize reward (or minimize cost) while still satisfying the constraints.
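
Written out, one common textbook form of the optimal control problem looks like the first expression below, with the usual RL objective next to it for comparison. The symbols are my own notation, not taken from the quote above: x(t) is the state, u(t) the control, L the running cost, Φ the terminal cost, f the system dynamics, π the policy, γ the discount factor, and R the reward.

```latex
% Optimal control: choose the control u to minimize total cost,
% subject to the system dynamics acting as the constraint.
\[
\min_{u(\cdot)} \; J = \Phi\bigl(x(T)\bigr) + \int_{0}^{T} L\bigl(x(t), u(t), t\bigr)\,dt
\quad \text{subject to} \quad
\dot{x}(t) = f\bigl(x(t), u(t), t\bigr), \qquad x(0) = x_0 .
\]

% Reinforcement learning: choose the policy \pi to maximize expected
% discounted reward, with the environment's transitions as the constraint.
\[
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right],
\qquad 0 \le \gamma < 1 .
\]
```

Side by side, minimizing a cost and maximizing a reward are the same problem with the sign flipped, and the control law plays the role of the policy.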


RL is actually a very fundamental concept. Maybe it can lend a hand in helping us humans learn a more efficient way of living our lives. I love this weird and elegant parallel - because we are actually learning from machines that are learning from humans that are learning from machines that are learning from humans to be better humans and machines.

You get the idea. :P



[1] Computational models of reinforcement learning: the role of dopamine as a reward signal

[2] What is bounded rationality?

[3] Cost Function

Stay tuned for Part 2!