Temporal difference (TD) learning is a large part of what really goes on under the hood even as we think we are making ‘free choices’. Technically speaking, it is a way of learning by bootstrapping: each new estimate of the benefit–cost value of the current situation is built on the estimates already held. The process samples the environment and updates those estimates as it goes. TD learning fine-tunes its predictions to match later, more accurate predictions about the future before the ultimate outcome becomes known. Now, let me break this down a bit…
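For readers who like to see the mechanics, here is a minimal sketch of the simplest form of this update, usually called TD(0). It is only an illustration, not a model of the brain: the three-step chain, the state names, the reward of 1.0, and the `alpha` (learning rate) and `gamma` (discount) values are all assumptions made up for the example.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Nudge the estimate for `state` toward reward + discounted estimate of `next_state`."""
    td_target = reward + gamma * values[next_state]  # a later, better-informed guess
    td_error = td_target - values[state]             # how far off the earlier guess was
    values[state] += alpha * td_error                # move part of the way toward the target
    return td_error

# Hypothetical chain: "start" -> "middle" -> "end", with a reward of 1.0 on the last step only.
values = {"start": 0.0, "middle": 0.0, "end": 0.0}
for _ in range(200):
    td0_update(values, "start", 0.0, "middle")
    td0_update(values, "middle", 1.0, "end")

print(values)
```

Notice that the estimate for "start" improves by leaning on the estimate for "middle", which is itself still a guess. That is the bootstrapping described above: with these assumed settings the two estimates settle near 0.9 and 1.0 without "start" ever seeing the reward directly.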
We instinctively feel that we deliberate over issues and then make suitable decisions… free choice, so we believe. However, the biological evidence conclusively shows that these decisions actually originate in an ancient brain system called the basal ganglia. Notably, this area is inaccessible to conscious thought. As a result, we quite naturally conjure up reasonable explanations for the decisions we think we make.
More specifically, a small group of neurons deep in the brains of all vertebrates is responsible for making decisions by releasing the neurotransmitter dopamine. Dopamine strongly influences our behavior, both current and future actions. Indeed, this deep brain process can predict the reward for future actions, which makes it a central factor in TD learning.
TD learning is responsible for finding the shortest path to a goal. It learns by searching out the benefit–cost ratio of each intermediate choice made on the way to the goal. The brain can then use this to predict the outcome of current and future actions. The dopamine neurons assess the current situation and update the brain about the most favorable course of action from there. In many cases this update is mostly a guess, but a guess that further updates can improve. Over time, this constant updating makes it increasingly likely that the guess is optimal. This means that the longer we (or any vertebrates) live, the wiser we naturally become, relatively speaking. TD learning relies on the totality of your life experiences, gleaning whatever is significant from them long after their particulars are forgotten.
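The idea of guesses that improve until they point along the shortest path can be sketched with a TD-style update known as Q-learning. Everything in this toy example is an assumption for illustration: the five-position corridor, the cost of 1 per move, and the `ALPHA` and `GAMMA` settings. To keep it deterministic, the sketch tries every choice in turn rather than exploring at random as an animal would.

```python
# Hypothetical corridor: positions 0..4, with the goal at position 4.
N_STATES = 5
GOAL = 4
ACTIONS = (-1, +1)   # step left, step right
STEP_COST = -1.0     # every move costs 1, so shorter routes score higher
ALPHA = 0.5          # learning rate (assumed)
GAMMA = 1.0          # no discounting (assumed)

def step(state, action):
    """Move along the corridor, staying inside its walls."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    return next_state, STEP_COST

def learn(sweeps=50):
    """Repeatedly update the benefit-cost guess for every (position, move) pair."""
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s == GOAL:
                continue
            for a in ACTIONS:
                s2, r = step(s, a)
                # Bootstrap from the best current guess at the next position.
                best_next = 0.0 if s2 == GOAL else max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] += ALPHA * (r + GAMMA * best_next - q[(s, a)])
    return q

def greedy_path(q, start=0):
    """Follow the highest-valued move from each position until the goal."""
    path = [start]
    while path[-1] != GOAL and len(path) <= N_STATES:
        s = path[-1]
        best_action = max(ACTIONS, key=lambda a: q[(s, a)])
        path.append(step(s, best_action)[0])
    return path

q = learn()
path = greedy_path(q)
print(path)
```

Each early guess is poor, but because every update borrows from the guesses further along the corridor, the values eventually favor stepping right at every position, which is the shortest route to the goal.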
When you ponder a range of choices, the brain evaluates each, and the momentary level of dopamine signals the benefit–cost ratio of each choice. The amount of dopamine also depends on how motivated you are, and more dopamine in turn increases your motivation. This feedback results in either a virtuous or a vicious cycle… addiction being an example of a vicious cycle. A healthier, sustainable diet would be an example of a virtuous cycle.
TD learning is effective because it merges benefit–cost information from diverse facets of life. This avoids the extreme difficulty of thinking through myriad variables and unknowns. This innate ability to quickly deliver better guesses often makes the difference between life and death in the wild. The circumstances of civilization probably make this innate ability problematic at times.
“So what?”, we may wonder.
Virtually all the harm we see in the world arises from the feeling that other people have free will, i.e., that they are responsible for their actions. We blame ‘them’ and react, setting in motion an often-snowballing series of actions and reactions. Fully realizing that no one has free will enables one to be a more peaceful actor in the world. Why? Knowing that no one has any fundamental choice in life fosters a deep and universal sense of forgiveness… if that is the right word for it. In other words, sincerely blaming anyone for anything becomes impossible. As a result, there can be no flame of blame to spark a reaction.