When it comes to game-playing AI, reinforcement learning (RL) is one of the most popular ways to train models. It is insanely cool as a concept because the inputs are effectively the pixel buffer and some established game state. Rewards are defined, and the algorithm ultimately learns how to act to earn them.
There's obviously a sequential element to this problem: when an action is taken, why, and how it plays out into the future. To support this, games need an interface that lets us acquire the relevant atomic state of the game and progress the game programmatically, so we can control inputs and measure outputs. In the middle of it all sits our intelligence.
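To make that loop concrete, here's a minimal sketch assuming a Gymnasium-style environment API; the environment name and the random policy are just placeholders, not part of any particular game setup.

```python
import gymnasium as gym

# Placeholder environment; a real game-playing setup would wrap the actual
# game (e.g. an emulator) behind the same reset/step interface.
env = gym.make("CartPole-v1")

obs, info = env.reset(seed=0)
total_reward = 0.0
done = False

while not done:
    # Our "intelligence" goes here; for now, pick a random legal action.
    action = env.action_space.sample()

    # Progress the game programmatically and measure the outcome.
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Episode finished with total reward {total_reward}")
```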
When we play a game, our actions are constrained to a fixed set of controls. In Super Mario Bros., for example, we have the directional pad for left and right, plus the run and jump buttons.
The output of a game-playing AI model is ultimately one of these input controls: the number of controls effectively becomes the number of neurons in the final layer of the neural network.
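Here's a rough sketch of what that looks like in PyTorch, loosely following the classic DQN convolutional layout. The 84x84 four-frame input and the six-action set are illustrative assumptions, not values from any specific game.

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 6  # e.g. left, right, run, jump, run+jump, no-op (placeholder set)

class GameNet(nn.Module):
    """Maps a stack of game frames to one score per controller action."""

    def __init__(self, num_actions: int = NUM_ACTIONS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),  # one output neuron per control
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # Expects pixel values in 0-255; scale to 0-1 before the conv stack.
        return self.head(self.conv(frames / 255.0))

# A batch of one 84x84 four-frame stack -> one score per action.
dummy = torch.zeros(1, 4, 84, 84)
print(GameNet()(dummy).shape)  # torch.Size([1, 6])
```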
One of the most influential architectures to date is the Deep Q-Network (DQN); its descendants, culminating in DeepMind's Agent57, surpassed the human baseline across the full 57-game Atari benchmark.
Deep Q-Network (DQN) learning is a method used in artificial intelligence to teach a computer how to make decisions and maximize its rewards, often in the context of video games.
At its core, DQN focuses on estimating the expected cumulative reward for taking a specific action in a given situation (the Q-value). It does this by considering the immediate reward the action provides and estimating the potential future rewards that might follow.
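Written out, this is the standard Bellman form of the action value, where s is the current state, a the chosen action, r the immediate reward, s' the next state, and γ the discount factor discussed next:

$$ Q(s, a) = \mathbb{E}\left[\, r + \gamma \max_{a'} Q(s', a') \,\right] $$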
One critical aspect of DQN learning is that it acknowledges that the value of future rewards diminishes as we look further into the future. In other words, the closer in time a reward is, the more weight it carries in the decision-making process. Actions that lead to immediate rewards are favored over actions that might result in rewards far in the future.
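A tiny worked example of that diminishing weight, using a discount factor of 0.99; the rewards here are made-up numbers purely for illustration.

```python
# Discounted return: each future reward is scaled by gamma ** steps_into_future.
gamma = 0.99
rewards = [1.0, 1.0, 1.0, 10.0]  # hypothetical rewards over the next few steps

discounted = [(gamma ** t) * r for t, r in enumerate(rewards)]
print(discounted)       # [1.0, 0.99, 0.9801, 9.70299]
print(sum(discounted))  # 12.67309 -> the reward 3 steps away is worth ~3% less
```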
DQN employs a deep neural network to model and predict the best actions to take. This network is trained through trial and error, with the AI agent learning from its experiences. It updates its understanding of which actions are more valuable in different situations as it accumulates data over time.
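Here's a compact sketch of what one such update typically looks like, assuming PyTorch, a replay buffer of (state, action, reward, next_state, done) tuples, and the GameNet class sketched above. This is illustrative of the general technique, not a faithful reproduction of any specific paper's code.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

GAMMA = 0.99
BATCH_SIZE = 32

replay_buffer = deque(maxlen=100_000)   # stores (s, a, r, s_next, done) tuples
online_net = GameNet()                  # network being trained
target_net = GameNet()                  # slow-moving copy for stable targets
target_net.load_state_dict(online_net.state_dict())
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)

def train_step():
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    states = torch.stack([b[0] for b in batch])
    actions = torch.tensor([b[1] for b in batch])
    rewards = torch.tensor([b[2] for b in batch])
    next_states = torch.stack([b[3] for b in batch])
    dones = torch.tensor([b[4] for b in batch], dtype=torch.float32)

    # Q-value the online network currently assigns to the action actually taken.
    q_taken = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: immediate reward plus discounted best future Q-value.
    with torch.no_grad():
        best_next = target_net(next_states).max(dim=1).values
        target = rewards + GAMMA * (1.0 - dones) * best_next

    loss = F.smooth_l1_loss(q_taken, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every few thousand steps the target network's weights are refreshed from the online network, which keeps the bootstrapped targets from chasing a moving estimate.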
So, DQN excels at balancing the trade-off between immediate rewards and the potential for greater rewards in the future. It's a crucial concept in reinforcement learning, which has applications not only in gaming but also in various other fields, including robotics, finance, and optimization problems.
In summary, Deep Q-Network learning is a technique that leverages the diminishing weight of rewards further into the future to help AI systems make decisions that maximize their long-run reward, making it a valuable tool in a wide range of applications.
Balancing Exploration and Exploitation with Epsilon-Greedy in Deep Q-Networks.
In the realm of reinforcement learning, particularly when we're dealing with algorithms like Deep Q-Networks (DQN), there's a critical concept known as "epsilon-greedy." It's a strategy that plays a crucial role in how AI agents make decisions in uncertain environments.
Epsilon-greedy is all about balancing exploration and exploitation. Let me break it down:
Exploration: When an AI is exploring, it's trying out different actions to learn more about the environment. It doesn't necessarily choose the actions that it believes will yield the maximum immediate reward. Instead, it experiments to gather more information about the consequences of different choices.
Exploitation: On the other hand, exploitation is about choosing actions that the AI believes will lead to the best immediate rewards based on its current knowledge. It's the AI's way of capitalizing on what it has already learned.
Epsilon-greedy combines these two approaches. The parameter "epsilon" (ε) determines the balance. Here's how it works:
With probability ε, the AI agent chooses to explore. This means it selects a random action from the set of all possible actions. This randomness is essential for discovering potentially better strategies.
With probability 1-ε, the AI agent chooses to exploit. In this case, it selects the action that is currently believed to be the best based on its existing knowledge.
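As a rough sketch of that selection rule, assuming a Q-network like the one above and an integer action index:

```python
import random

import torch

def select_action(q_net, state: torch.Tensor, epsilon: float, num_actions: int) -> int:
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(num_actions)      # explore: random action
    with torch.no_grad():
        q_values = q_net(state.unsqueeze(0))      # add a batch dimension
    return int(q_values.argmax(dim=1).item())     # exploit: best-known action
```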
The value of ε is a critical factor. If ε is set to a high value, like 0.9, the agent will explore frequently, which can be beneficial for learning in complex environments. However, if ε is set to a low value, like 0.1, the agent will mostly exploit its current knowledge, which can be useful when it has already learned a good strategy.
The key idea behind epsilon-greedy is to strike a balance. At the beginning of training, you may want more exploration to learn about the environment. As the agent gains more knowledge and expertise, you can gradually decrease ε to shift the balance more toward exploitation.
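A common way to implement that shift is a simple annealing schedule; the start value, end value, and decay horizon below are illustrative choices, not figures from a specific paper.

```python
EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.05, 100_000

def epsilon_at(step: int) -> float:
    """Linearly anneal epsilon from EPS_START to EPS_END over EPS_DECAY_STEPS."""
    fraction = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)

print(epsilon_at(0))        # 1.0   -> almost all exploration early on
print(epsilon_at(50_000))   # 0.525 -> a mix of both midway through
print(epsilon_at(200_000))  # 0.05  -> mostly exploitation once trained
```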
Epsilon-greedy is a simple yet powerful concept that helps AI agents, like those using DQN, learn effectively in dynamic and uncertain situations, making it a fundamental part of reinforcement learning.
Below is a quick summary I put together of other research papers on deep reinforcement learning for game-playing AI, along with my comments on their architectures.