Reinforcement Learning - letting the machine learn through interaction
Learning and understanding are the basic instincts that have driven humans to where we are today. We wouldn’t be as advanced in science and technology if our ancestors hadn’t observed and learned from the environment around them. Today, we are trying to make machines learn like humans. Of course, we can train a machine using a paired dataset, i.e., by telling it that a particular input yields a specific kind of output. But is that really how humans learn all the time?
Learning from interaction with our surrounding environment is a foundational idea underlying nearly all theories of learning and intelligence. This is where Reinforcement Learning comes in: it is much more focused on goal-directed learning from interaction than other approaches to machine learning.
What is Reinforcement Learning?
Reinforcement learning is about learning what to do: mapping situations to actions. The learner is told which actions are available, but not which to take; it is responsible for discovering which actions yield the most reward by trying them. In many cases, the current action affects not only the immediate reward but also the next situation and, through it, all subsequent situations and their rewards. So the choices the learner makes matter here. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.
Reinforcement Learning is different from supervised learning. Supervised learning is learning from a training set of labeled examples in order to predict specific outcomes (classification, linear regression, etc.). This is an important kind of learning, but in practice it is often hard to obtain labeled examples of the desired behaviour that cover all the situations in which the learner has to act.
Reinforcement learning is also different from unsupervised learning, which deals with finding patterns and structure hidden in unlabeled data. Uncovering hidden structure can certainly be useful to a learner, but it does not by itself address the reinforcement learning problem: the end goal here is to maximise a reward signal, which is quite different from the objectives of supervised and unsupervised learning.
Exploration vs Exploitation
One of the challenges that arises in Reinforcement Learning is the trade-off between exploration and exploitation.
To maximize its reward, a learner must prefer the actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before.
So here the learner has to exploit what it has already tried in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that exploration and exploitation cannot both be pursued on any single action; the learner must balance how much time it spends exploring and how much time it spends exploiting the actions it has already discovered.
This entire problem of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, where the learning objective is fixed and does not depend on sequential decision-making. Reinforcement Learning is a different type of learning.
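To make the trade-off concrete, here is a minimal sketch of an epsilon-greedy strategy on a toy multi-armed bandit. The arm payout probabilities, the epsilon value, and the helper names are illustrative assumptions for this post, not part of any library or the book.

```python
import random

# A toy multi-armed bandit: each arm pays out 1 with some hidden probability.
# These probabilities are made up purely for illustration.
ARM_PROBS = [0.2, 0.5, 0.7]

def pull(arm):
    """Return a reward of 1 or 0 for the chosen arm."""
    return 1 if random.random() < ARM_PROBS[arm] else 0

def epsilon_greedy(epsilon=0.1, steps=10_000):
    counts = [0] * len(ARM_PROBS)    # how often each arm has been tried
    values = [0.0] * len(ARM_PROBS)  # running average reward per arm
    total = 0
    for _ in range(steps):
        if random.random() < epsilon:
            # Explore: pick a random arm to gather new information.
            arm = random.randrange(len(ARM_PROBS))
        else:
            # Exploit: pick the arm that currently looks best.
            arm = max(range(len(ARM_PROBS)), key=lambda a: values[a])
        reward = pull(arm)
        counts[arm] += 1
        # Incrementally update the average reward estimate for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return values, total

if __name__ == "__main__":
    estimates, total_reward = epsilon_greedy()
    print("Estimated arm values:", estimates)
    print("Total reward:", total_reward)
```

With epsilon set to 0.1, the learner spends roughly 10% of its steps exploring and the rest exploiting; changing that single number shifts the balance between the two.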
Key components of Reinforcement Learning
Beyond the learner (also known as the agent) and the environment, the four main components of a reinforcement learning system are:
- Policy - A policy defines the learning agent’s way of behaving at a given time. We can say that a policy determines which action the agent should take in a given situation. In human terms, it corresponds to a set of stimulus-response rules or associations. In some cases the policy may be a simple function or a lookup table, whereas in others it may involve extensive computation, such as a search process. The policy is the core of a reinforcement learning agent in the sense that it alone is sufficient to determine behaviour.
- Reward Signal - A reward signal defines the goal of the reinforcement learning problem. At each time step, the environment sends the agent a single number called the reward. As discussed above, the agent’s objective is to maximize the total reward it receives over the long run, so the reward signal defines what the good and bad events are for the agent. In human terms, rewards are similar to the experiences of pleasure or pain. We learn from pain to maximize the pleasure in our lives. The reward signal is also responsible for altering the policy: if an action selected by the policy is followed by a low reward, the policy may be changed to select some other action in that situation in the future.
- Value Function - The reward signal indicates what is good in an immediate sense, but a value function tells us what is good in the long run. The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
Why the need for a value function? A state might always yield a low immediate reward but still have a high value because it is regularly followed by other states that yield high rewards. Or the reverse could be true. Again, in human analogy, rewards are pleasure and pain, whereas values correspond to a far-sighted judgement of how pleased or displeased we are that our environment is in a particular state.
Without rewards, there could be no values, and the only purpose of estimating values is to achieve more rewards. But it is values with which we make and evaluate future decisions. We seek actions that bring states of highest value, not highest reward, because these actions obtain the greatest amount of reward over the long run.
- Model - A model is something that mimics the behaviour of the environment or, more generally, allows inferences to be made about how the environment will behave. For example, given a state and an action, the model might predict the resulting next state and next reward. Models are used for planning, that is, deciding on a course of action by considering possible future situations before they are actually experienced.
Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners and can be viewed as almost the opposite of planning. A minimal code sketch after this list shows how a policy, a reward signal, and a value function come together in a simple model-free learner.
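Here is that sketch: a tiny, self-contained model-free learner (tabular Q-learning) on a made-up "chain" environment. The environment, the reward values, and the learning parameters are all assumptions invented for illustration; they are not from the book or any library.

```python
import random

# A tiny "chain" environment invented for illustration: states 0..4, actions
# 0 (left) and 1 (right). Reaching state 4 gives reward 1 and ends the episode;
# every other step gives reward 0.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else min(state + 1, GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done

# Value function: Q[s][a] estimates the long-run return of taking action a in state s.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # step size, discount, exploration rate

for episode in range(500):
    state, done = 0, False
    while not done:
        # Policy: epsilon-greedy with respect to the current value estimates.
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)                            # explore
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])       # exploit
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward the immediate reward plus
        # the discounted value of the best action in the next state.
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state

# The greedy policy derived from Q should now prefer "right" (action 1)
# in every non-terminal state.
print([max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(GOAL)])
```

Notice that the learner never builds a model of the chain: it only updates value estimates from experienced transitions, which is exactly what makes it a model-free, trial-and-error method.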
Scope
Reinforcement Learning emphasizes learning through direct interaction with the environment: the agent continuously observes states, takes actions, and receives feedback in the form of rewards. In doing so, it learns which actions work well in which situations by observing state transitions and the rewards that follow over time.
By using value functions and policies that map states to actions, reinforcement learning algorithms can adapt during an agent’s lifetime, improving its behaviour incrementally. This ability to learn online keeps the agent responsive to changes in the environment.
Also, Reinforcement Learning naturally fits problems where actions influence future states and rewards. By modeling the environment as a sequence of interactions rather than a static dataset, it provides a framework for sequential decision-making and planning under uncertainty.
In short, Reinforcement learning is well-suited for problems that require adaptive, state-aware, and goal-directed behaviour learned through continuous interaction with the environment.
Some Examples
Reinforcement Learning is best understood through problems where an agent must take a sequence of actions by interacting with the environment and learning from the feedback received. Let’s look at some applications/examples.
- Games - In games such as Chess, Go, or video games, the agent observes the current game state, chooses a move (an action), and receives rewards based on the outcome. The rewards are often delayed, since a single move may not determine a win or a loss (as in Chess), so the agent must learn strategies that maximise long-term reward.
- Robotics - In robotics, an agent can learn tasks such as walking, grasping objects, or balancing. The robot observes its state (joint angles, velocities, sensor readings), takes actions (motor commands), and receives rewards based on task performance. Through trial and error, the robot learns control policies without being explicitly programmed for every situation.
- Resource management - Reinforcement learning is applied to problems such as CPU scheduling, cache replacement, network routing, and power management. The agent learns policies that balance competing objectives, such as performance and energy consumption, by observing system behavior and adapting its decisions over time.
These examples highlight the strength of reinforcement learning in problems where decisions are sequential, feedback is delayed, and actions influence future outcomes, making it fundamentally different from traditional supervised or unsupervised learning approaches.
Thank you for reading!!
The content of this blog is referenced from Reinforcement Learning: An Introduction (2nd Edition) by Richard S. Sutton and Andrew G. Barto. If you are curious to learn more about Reinforcement Learning, I recommend this book. Also, try out Gymnasium to develop and test reinforcement learning algorithms; a small interaction-loop sketch is included below to get you started.
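As a starting point, here is a minimal sketch of the agent-environment loop using Gymnasium’s CartPole environment with a purely random policy (this assumes Gymnasium is installed, e.g. via `pip install gymnasium`). Replacing the random action with a learned policy is where reinforcement learning comes in.

```python
import gymnasium as gym

# Create a classic control environment; CartPole-v1 ships with Gymnasium.
env = gym.make("CartPole-v1")

for episode in range(3):
    observation, info = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        # A real agent would map the observation to an action via its policy;
        # here we simply sample a random action for illustration.
        action = env.action_space.sample()
        observation, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Episode {episode}: total reward = {total_reward}")

env.close()
```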
Also, if I have missed anything or if you have any thoughts to share, feel free to contact me:)