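I am experimenting with deep reinforcement learning and have created the following environment, which runs a simulation of purchasing a raw material. The start quantity (start_qty) is the amount of material I have on hand going into the next 12 weeks of purchasing (sim_weeks). I have to purchase in multiples of 195,000 pounds, and I expect to use 45,000 pounds of material per week.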
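import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000      # material on hand at the start (lb)
sim_weeks = 12          # length of one episode, in weeks
purchase_mult = 195000  # purchases come in multiples of this (lb)
# days on hand cost =
forecast_qty = 45000    # expected usage per week (lb)


class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: buy 0, buy 1x
        self.action_space = Discrete(2)
        # purchase array space...
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length
        self.purchase_length = sim_weeks
        # self.current_step = 1

    def step(self, action):
        # Apply action
        # This gives us qty_available at the end of the week
        self.state -= forecast_qty
        # See if we need to buy
        self.state += action * purchase_mult
        # Now calculate the days on hand from this.
        # forecast_qty is weekly usage, so daily usage is forecast_qty / 7.
        days = self.state / (forecast_qty / 7)
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1
        # self.current_step += 1
        # Calculate reward: reward is the negative of days on hand,
        # with a large penalty if the quantity goes negative
        if self.state < 0:
            reward = -10000
        else:
            reward = -days
        # Check if the simulation is done
        done = self.purchase_length <= 0
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass

    def reset(self):
        # Reset qty
        self.state = start_qty
        self.purchase_length = sim_weeks
        return self.state

I am debating whether the reward function is sufficient. What I am trying to do is minimize the sum of the days on hand across all steps, where the days on hand for a given step is defined by the variable days in the code. I decided that, since the goal is to maximize the reward function, I could convert the days on hand to a negative number and use that negative number as the reward (so that maximizing the reward minimizes the days on hand). I then added a harsh penalty for letting the quantity available go negative in any week.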
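To make the reward concrete, here is a minimal sketch that replays the same weekly update outside of gym and totals the reward for three hand-written policies. The helper episode_return and the policies never_buy, buy_when_needed and always_buy are hypothetical names I made up for this check, not part of the environment, and the sketch assumes days on hand means stock divided by daily usage (forecast_qty / 7).

```python
# Constants copied from the question.
START_QTY = 100000
SIM_WEEKS = 12
PURCHASE_MULT = 195000
FORECAST_QTY = 45000


def episode_return(policy):
    """Sum of rewards over one episode for policy(state) -> 0 or 1."""
    state = START_QTY
    total = 0.0
    for _ in range(SIM_WEEKS):
        action = policy(state)
        # Same update as step(): consume the forecast, then add any purchase.
        state += action * PURCHASE_MULT - FORECAST_QTY
        if state < 0:
            total += -10000  # stockout penalty
        else:
            # Assumption: days on hand = stock / daily usage.
            total += -(state / (FORECAST_QTY / 7))
    return total


def never_buy(state):
    return 0


def buy_when_needed(state):
    # Hypothetical baseline: buy only when this week's usage
    # would otherwise push the stock below zero.
    return 1 if state < FORECAST_QTY else 0


def always_buy(state):
    return 1


if __name__ == "__main__":
    for policy in (never_buy, buy_when_needed, always_buy):
        print(policy.__name__, round(episode_return(policy), 2))
```

With these numbers, buying only when needed totals roughly -187, buying every week roughly -2007, and never buying roughly -100010, so a single stockout week costs far more than any realistic amount of excess stock. That ordering is at least what I intended, though it says nothing about whether an agent can learn it easily.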
Is there a better way to do this? I am new to this subject and also new to Python. Any advice is greatly appreciated!
https://stackoverflow.com/questions/65692797