Implementing a Dynamic Pricing Strategy in Python — Part 3 — Reinforcement Learning

Induraj
5 min read · Mar 2, 2023


Reinforcement learning (RL) is a type of machine learning where an agent learns to take actions in an environment to maximize a reward signal over time.

It is inspired by how humans learn through trial and error.

  • In RL, the agent interacts with the environment by taking actions and observing the resulting state and reward.
  • The goal of the agent is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative reward over time.
  • The agent uses a trial-and-error approach to learn the policy, and it does so by updating its policy based on the rewards it receives.

The RL process can be broken down into four main components (a minimal code sketch of this loop follows the list):

  1. Environment: This is the world in which the agent operates; it provides the state of the world to the agent.
  2. Agent: The one that takes an action in response to the state and receives a reward from the environment.
  3. State: This is a description of the environment at a particular time.
  4. Reward: This is a scalar signal that the agent receives as a result of its action.
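
To make these four components concrete, here is a minimal sketch of the agent-environment loop in Python. Everything in it is a placeholder (a toy one-number "world" and a random agent), not part of the pricing problem we build later.

import random

# Toy environment: the state is a number, and the reward is higher the
# closer the agent's action is to that number.
def step(state, action):
    reward = -abs(state - action)        # reward returned by the environment
    next_state = random.randint(0, 10)   # the environment moves to a new state
    return next_state, reward

state = random.randint(0, 10)            # initial state
for t in range(5):
    action = random.choice(range(11))    # the agent picks an action (randomly here)
    state, reward = step(state, action)  # observe the next state and the reward
    print(f"step {t}: action={action}, reward={reward}, next state={state}")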

A real-world example:

Imagine you have a robot, and you want it to learn how to pick up a ball and put it in a basket. The robot doesn’t know how to do this, but you can give it a reward when it does something right.

  • At first, the robot tries to pick up the ball and put it in the basket randomly. Sometimes it succeeds, and sometimes it fails. When it succeeds, you give it a reward (like a piece of candy), and when it fails, you don’t give it a reward.
  • The robot tries to remember what it did when it succeeded and tries to do the same thing again next time. If it fails, it tries something else. Over time, the robot gets better at picking up the ball and putting it in the basket because it learns from its successes and failures.

This is called reinforcement learning. The robot learns through trial and error, and it gets better by receiving rewards (positive reinforcement) when it succeeds and adjusting its actions based on its experiences.

Types of reinforcement learning:

RL algorithms are classified based on whether the agent learns using value-based or policy-based methods; a toy sketch contrasting the two follows the list.

  • In value-based methods, the agent learns to estimate the value of each action in each state and selects the action with the highest estimated value.
  • In policy-based methods, the agent directly learns the policy by searching for the best action in each state.
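
As a rough illustration of the difference (a toy sketch, unrelated to the pricing problem below): a value-based agent keeps value estimates and picks the highest-valued action, while a policy-based agent keeps action probabilities and samples from them.

import random

actions = ["raise_price", "keep_price", "lower_price"]

# Value-based: pick the action with the highest estimated value in this state.
Q = {"raise_price": 1.2, "keep_price": 0.7, "lower_price": -0.3}      # toy value estimates
value_based_choice = max(actions, key=lambda a: Q[a])

# Policy-based: the policy itself gives action probabilities; sample from them.
policy = {"raise_price": 0.5, "keep_price": 0.3, "lower_price": 0.2}  # toy probabilities
policy_based_choice = random.choices(actions, weights=[policy[a] for a in actions])[0]

print(value_based_choice, policy_based_choice)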

RL has many applications, including robotics, game-playing, and autonomous driving. It has been used to train agents to play games like Go, chess, and poker at a superhuman level, and to control complex robotic systems in manufacturing and healthcare. Here we will look at it being applied to dynamic pricing.

Implementation:

The ultimate objective is to dynamically change the price based on the demand, current price, and cost.

First, we define the PricingEnvironment class, which represents the environment in which our reinforcement learning agent will operate. It has three methods: __init__, get_profit, and get_demand.

  • The __init__ method initializes the environment with a demand function, a maximum price, and costs.
  • The get_profit method takes a price and a demand as input and returns the profit as revenue minus costs.
  • The get_demand method takes a price as input and returns the demand based on the demand function.

# Define the pricing environment
class PricingEnvironment:
    def __init__(self, demand_fn, max_price, costs):
        self.demand_fn = demand_fn
        self.max_price = max_price
        self.costs = costs

    def get_profit(self, price, demand):
        revenue = price * demand
        cost = self.costs
        return revenue - cost

    def get_demand(self, price):
        return self.demand_fn(price)
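
A quick check of the environment in isolation (a minimal sketch; the linear demand function used here is the same one we define further below):

# Toy instance just to sanity-check the class
toy_env = PricingEnvironment(demand_fn=lambda price: 100 - price, max_price=100, costs=50)

demand = toy_env.get_demand(30)          # 100 - 30 = 70 units demanded at a price of 30
profit = toy_env.get_profit(30, demand)  # 30 * 70 - 50 = 2050
print(demand, profit)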

Second, we define the QLearningAgent class, which represents the reinforcement learning agent. It has four methods: __init__, get_Q_value, choose_action, and update_Q_value.

  • The __init__ method initializes the agent with a list of actions, a learning rate (alpha), a discount factor (discount), and an exploration rate (epsilon).
  • The get_Q_value method takes a state and an action as input and returns the Q-value for that state-action pair.
  • The choose_action method takes a state as input and returns an action using an epsilon-greedy policy.
  • The update_Q_value method takes a state, an action, the next state, and a reward as input and updates the Q-value for the state-action pair based on the Bellman equation.

# Define the reinforcement learning agent
import random

class QLearningAgent:
    def __init__(self, actions):
        self.Q = {}
        self.actions = actions
        self.alpha = 0.2      # learning rate
        self.discount = 0.9   # discount factor
        self.epsilon = 0.1    # exploration rate

    def get_Q_value(self, state, action):
        if (state, action) not in self.Q:
            self.Q[(state, action)] = 0.0
        return self.Q[(state, action)]

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        else:
            values = [self.get_Q_value(state, a) for a in self.actions]
            max_value = max(values)
            if values.count(max_value) > 1:
                # Break ties between equally good actions at random.
                best_actions = [i for i in range(len(self.actions)) if values[i] == max_value]
                i = random.choice(best_actions)
            else:
                i = values.index(max_value)
            return self.actions[i]

    def update_Q_value(self, state, action, next_state, reward):
        # Q-learning update toward reward + discounted best next value.
        old_value = self.get_Q_value(state, action)
        next_max = max([self.get_Q_value(next_state, a) for a in self.actions])
        new_value = (1 - self.alpha) * old_value + self.alpha * (reward + self.discount * next_max)
        self.Q[(state, action)] = new_value
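
Written out, the update implemented in update_Q_value is the standard Q-learning rule: Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + discount * max over a' of Q(s', a')), with alpha = 0.2 and discount = 0.9 here. A small worked example with made-up numbers, just to trace the arithmetic:

alpha, discount = 0.2, 0.9
old_value = 10.0   # hypothetical current Q(s, a)
reward = 50.0      # hypothetical reward for taking a in s
next_max = 20.0    # hypothetical max over a' of Q(s', a')

new_value = (1 - alpha) * old_value + alpha * (reward + discount * next_max)
print(new_value)   # 0.8 * 10 + 0.2 * (50 + 0.9 * 20) = 8 + 13.6 = 21.6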

Third, we define what is needed for our dynamic pricing: the demand function, the cost, and the pricing environment.

# Define the demand function
def demand_fn(price):
    return 100 - price

# Define the cost
costs = 50

# Define the pricing environment
env = PricingEnvironment(demand_fn, 100, costs)

# Define the reinforcement learning agent
agent = QLearningAgent(range(env.max_price + 1))
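
Before training, it helps to know what the best answer should look like in this toy setup: with demand = 100 - price and a fixed cost of 50, profit is price * (100 - price) - 50, which is maximized at a price of 50 for a profit of 2450. A quick brute-force check against the environment we just built:

best_price = max(range(env.max_price + 1), key=lambda p: env.get_profit(p, env.get_demand(p)))
print(best_price, env.get_profit(best_price, env.get_demand(best_price)))  # 50 2450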

Finally, we train and test our agent.

# Train the agent
num_episodes = 1000
for i in range(num_episodes):
    state = (env.get_demand(0), 0)
    for t in range(30):
        price = agent.choose_action(state)
        demand = env.get_demand(price)
        profit = env.get_profit(price, demand)
        next_state = (env.get_demand(price), min(price + 1, env.max_price))
        agent.update_Q_value(state, price, next_state, profit)
        state = next_state

# Test the agent
state = (env.get_demand(0), 0)
for t in range(30):
    price = max(agent.actions, key=lambda x: agent.get_Q_value(state, x))
    demand = env.get_demand(price)
    print("Price:", price, "Demand:", demand)
    state = (demand, price)
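
One simple way to check that the agent is actually learning is to log the total profit per episode during training and see whether it trends toward the analytic optimum computed above. A sketch of such a logging run (the same training logic as before, with bookkeeping added and a fresh agent so the comparison is clean):

logging_agent = QLearningAgent(range(env.max_price + 1))  # fresh agent, trained from scratch

episode_profits = []
for i in range(num_episodes):
    state = (env.get_demand(0), 0)
    total_profit = 0
    for t in range(30):
        price = logging_agent.choose_action(state)
        demand = env.get_demand(price)
        profit = env.get_profit(price, demand)
        total_profit += profit
        next_state = (env.get_demand(price), min(price + 1, env.max_price))
        logging_agent.update_Q_value(state, price, next_state, profit)
        state = next_state
    episode_profits.append(total_profit)

# Average episode profit early vs. late in training; the later average should be higher.
print(sum(episode_profits[:100]) / 100, sum(episode_profits[-100:]) / 100)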
