I have a problem I need to solve that, as far as I can tell, doesn't fit very well into most of the existing RL literature.

Essentially the task is to create an optimal plan over a time horizon extending a flexible number of steps into the future. The action space is both discrete and continuous: there are multiple distinct actions available, some of which need to be given continuous (but constrained) parameters.

In this problem, however, the state of the environment is known ahead of time for all future time steps, and the updated state of the agent after each action can be calculated deterministically given the action and the environment state.

Modelling the entire problem as a MILP is not feasible due to the size of the action and state space, and we have a very large data set for agent and environment state to play with. Does anyone have any suggestions for papers or models that might be appropriate for this scenario?

Comments

UnusualClimberBear t1_j7lvpz8 wrote on February 7, 2023 at 7:14 PM

Looks like an optimal control problem rather than an RL one. RL is there for situations with no good model available. If stochasticity is present, but you still have a good model once the uncertainty is known, then model predictive control is a good way to go.
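
The receding-horizon idea behind predictive control can be sketched roughly as follows: plan over a short lookahead window with the known model, apply only the first action, then re-plan. This is a toy under loud assumptions, not the OP's problem. The discretised `ACTIONS`, the bound `LEVEL_MAX`, the price-like `env` series and all function names are made up for illustration, and the exhaustive enumeration only works because the window is tiny.

```python
from itertools import product

# Hypothetical discretised hybrid actions: (kind, amount) pairs.
ACTIONS = [("hold", 0.0), ("up", 1.0), ("up", 2.0), ("down", 1.0), ("down", 2.0)]
LEVEL_MAX = 10.0  # assumed upper bound on the agent's internal state


def step(level, env_t, action):
    """Deterministic model: returns (new_level, cost), or None if infeasible."""
    kind, amount = action
    delta = amount if kind == "up" else -amount if kind == "down" else 0.0
    new_level = level + delta
    if not 0.0 <= new_level <= LEVEL_MAX:
        return None
    return new_level, env_t * delta  # raising costs money, lowering earns it


def best_first_action(level, env_window):
    """Enumerate every plan over the window and return the best plan's first action."""
    best_cost, best_action = float("inf"), ("hold", 0.0)
    for plan in product(ACTIONS, repeat=len(env_window)):
        cur, total, feasible = level, 0.0, True
        for env_t, action in zip(env_window, plan):
            out = step(cur, env_t, action)
            if out is None:
                feasible = False
                break
            cur, cost = out
            total += cost
        if feasible and total < best_cost:
            best_cost, best_action = total, plan[0]
    return best_action


def receding_horizon(level, env, lookahead=3):
    """Re-plan at every step, applying only the first action of each plan."""
    total, applied = 0.0, []
    for t in range(len(env)):
        action = best_first_action(level, env[t:t + lookahead])
        level, cost = step(level, env[t], action)
        total += cost
        applied.append(action)
    return applied, total
```

For example, `receding_horizon(0.0, [1.0, 5.0, 1.0, 5.0])` raises the level when the environment value is low and lowers it when it is high. A real problem would swap the enumeration for a proper solver or a sampling-based planner.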

UnusualClimberBear t1_j7opc2r wrote on February 8, 2023 at 8:55 AM

Also, if your world is deterministic but you cannot build a good model of it, you may be close to the situation of games such as Go, and Monte Carlo tree search algorithms are an option to consider (variants of UCT with or without function approximation).
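
A bare-bones UCT sketch on a toy deterministic problem, to show the shape of the idea. Everything here is an assumption for illustration: the three unit-step `ACTIONS`, the integer `level` bounded by `LEVEL_MAX`, the price-like `env` list, and the function names. There is no function approximation, rewards are not normalised for the exploration constant, and continuous action parameters would have to be discretised or sampled per node.

```python
import math
import random

ACTIONS = ("hold", "up", "down")  # hypothetical discrete actions, unit step
LEVEL_MAX = 3                     # assumed bound on the integer internal state


def step(level, env_t, action):
    """Deterministic model: (new_level, reward). Infeasible moves act as hold."""
    delta = {"hold": 0, "up": 1, "down": -1}[action]
    if not 0 <= level + delta <= LEVEL_MAX:
        delta = 0
    return level + delta, -env_t * delta  # reward is negative cost


class Node:
    def __init__(self, level, t):
        self.level, self.t = level, t
        self.children = {}                # action -> (reward, child Node)
        self.visits, self.value = 0, 0.0  # value = mean return-to-go


def random_rollout(node, env, rng):
    """Return of a uniformly random policy from `node` to the end of the horizon."""
    level, total = node.level, 0.0
    for t in range(node.t, len(env)):
        level, reward = step(level, env[t], rng.choice(ACTIONS))
        total += reward
    return total


def search(node, env, rng, c=1.4):
    """One UCT iteration starting at `node`; returns the sampled return-to-go."""
    if node.t == len(env):  # end of horizon: nothing left to collect
        node.visits += 1
        return 0.0
    untried = [a for a in ACTIONS if a not in node.children]
    if untried:  # expansion followed by a random rollout
        action = rng.choice(untried)
        level, reward = step(node.level, env[node.t], action)
        child = Node(level, node.t + 1)
        child.visits, child.value = 1, random_rollout(child, env, rng)
        node.children[action] = (reward, child)
        ret = reward + child.value
    else:        # selection by upper confidence bound
        def ucb(a):
            r, ch = node.children[a]
            return r + ch.value + c * math.sqrt(math.log(node.visits) / ch.visits)
        action = max(ACTIONS, key=ucb)
        reward, child = node.children[action]
        ret = reward + search(child, env, rng, c)
    node.visits += 1
    node.value += (ret - node.value) / node.visits  # running mean
    return ret


def uct_first_action(level, env, iterations=500, seed=0):
    """Run UCT from the current state and return the action with the best mean return."""
    rng = random.Random(seed)
    root = Node(level, 0)
    for _ in range(iterations):
        search(root, env, rng)
    return max(root.children,
               key=lambda a: root.children[a][0] + root.children[a][1].value)
```

The per-step cost is dominated by `iterations`, so the budget can be tuned to whatever the hardware allows.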

EmbarrassedFuel OP t1_j7p40eo wrote on February 8, 2023 at 12:13 PM

oh also the model needs to run at inference time in a relatively short period of time on cheap hardware :)

EmbarrassedFuel OP t1_j7p3xc1 wrote on February 8, 2023 at 12:13 PM

I haven't been able to find anything about optimal control with all of:

- non-linear dynamics/model
- non-linear constraints
- both discrete and continuously parameterized actions in the output space

but in general, discovery of papers/techniques in control theory seems to be much harder for some reason

UnusualClimberBear t1_j7pdue6 wrote on February 8, 2023 at 1:45 PM

This is because the information is in the books.

(free online) http://www.cds.caltech.edu/~murray/amwiki/index.php/Main_Page

https://www.amazon.com/Modern-Control-Systems-12th-Edition/dp/0136024580

Nonlinearity breaks everything there, though. The usual approach is to linearize at well-chosen operating points and compute the control using the closest linearization.
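
A minimal sketch of that linearize-and-pick-the-closest idea, for a scalar system. The dynamics `f`, the operating points and all helper names are invented for illustration. A real controller would typically put an LQR or MPC design on top of each linear model rather than the one-step inversion used here.

```python
import math


def f(x, u):
    """Hypothetical nonlinear scalar dynamics: x_next = f(x, u)."""
    return x + 0.1 * (math.sin(x) + u)


def linearize(x0, u0, eps=1e-6):
    """Central finite differences: f(x, u) ~= c + a*(x - x0) + b*(u - u0)."""
    a = (f(x0 + eps, u0) - f(x0 - eps, u0)) / (2 * eps)
    b = (f(x0, u0 + eps) - f(x0, u0 - eps)) / (2 * eps)
    return {"x0": x0, "u0": u0, "a": a, "b": b, "c": f(x0, u0)}


# Linear models precomputed offline at a few assumed operating points.
OPERATING_POINTS = [linearize(x0, 0.0) for x0 in (-2.0, -1.0, 0.0, 1.0, 2.0)]


def nearest_model(x):
    """Pick the linearization whose operating point is closest to the state."""
    return min(OPERATING_POINTS, key=lambda m: abs(x - m["x0"]))


def control_to_reach(x, target):
    """Solve the nearest linear model for the u whose predicted next state is target."""
    m = nearest_model(x)
    return m["u0"] + (target - m["c"] - m["a"] * (x - m["x0"])) / m["b"]
```

The further the state sits from the chosen operating point, the worse the prediction, which is why the choice of points matters.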

blackhole077 t1_j7l4yc9 wrote on February 7, 2023 at 4:21 PM

Perhaps the semi-Markov decision process paper by Sutton, Precup and Singh ("Between MDPs and semi-MDPs") would be a good start.

This should give you the paper: http://www-anw.cs.umass.edu/~barto/courses/cs687/Sutton-Precup-Singh-AIJ99.pdf

It sounds like you're looking for "options" in reinforcement learning, so any papers that cover that idea may be of interest to you.

BasedAcid t1_j7x88ly wrote on February 10, 2023 at 1:14 AM

Search for the keyword “deterministic MDP.” This is a relatively well-studied area.

jimmymvp t1_j7oybk9 wrote on February 8, 2023 at 11:05 AM

Ok, first off, I'm very curious what's the actual problem that you're solving. Can you describe it a bit more in detail or give a link?

If you have a perfect model that's cheap to compute, you can go with sampling approaches, though I don't know what your constraints look like. If your state/action space is too big, you might want to reduce it somehow by learning an embedding.
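
One simple reading of "sampling approaches" is random shooting: sample many candidate plans, score each with the model, drop the ones that violate constraints, and keep the best. The sketch below assumes a made-up toy model; the action kinds, bounds, price-like `env` series and function names are not from the thread.

```python
import random

KINDS = ("up", "down", "hold")  # hypothetical discrete action choices
MAX_AMOUNT = 2.0                # assumed bound on the continuous parameter
LEVEL_MAX = 10.0                # assumed bound on the internal state


def simulate(level, env, plan):
    """Perfect model: total cost of a plan, or None if a constraint is violated."""
    total = 0.0
    for env_t, (kind, amount) in zip(env, plan):
        if kind == "up":
            level += amount
            total += env_t * amount
        elif kind == "down":
            level -= amount
            total -= env_t * amount
        # "hold" ignores its amount and leaves the state unchanged
        if not 0.0 <= level <= LEVEL_MAX:
            return None
    return total


def random_shooting(level, env, n_samples=2000, seed=0):
    """Sample hybrid plans uniformly and keep the cheapest feasible one."""
    rng = random.Random(seed)
    best_plan = [("hold", 0.0)] * len(env)  # feasible fallback with zero cost
    best_cost = 0.0
    for _ in range(n_samples):
        plan = [(rng.choice(KINDS), rng.uniform(0.0, MAX_AMOUNT)) for _ in env]
        cost = simulate(level, env, plan)
        if cost is not None and cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost
```

Uniform sampling scales badly with the horizon, so in practice the sampling distribution is usually refined iteratively (the cross-entropy method is one common choice) or guided by a learned policy.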

Is the model differentiable? I guess it is if you're using a MILP approach.

I guess some combination of MCTS with value function learning is plausible if your search space is big, as is done in AlphaZero etc. I find the hybrid aspect of it very interesting though. It sounds like if you want to do amortized search, you need to combine MCTS with search in the continuous space (sampling). Should be simple enough with a perfect model. Probably some ideas from MuZero would come in handy.

EmbarrassedFuel OP t1_j7p519o wrote on February 8, 2023 at 12:24 PM

Basically, given some predicted environment state going forward for, say, 100 time steps, we need to find a minimum-cost course of action. Although the environment state has been predicted, for the purposes of this task the agent can consider it deterministic. The agent has one variable of internal state and can take actions to increase or decrease this value based on interactions with the environment. We can then calculate the new cost over the given time horizon by simulating the actions chosen at each step, but this simulation is fundamentally sequential and wouldn't allow backpropagation of gradients.
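
To make the setup concrete, here is a rough sketch of that kind of sequential cost simulation. It is only a stand-in: the `Action` type, the "up"/"down"/"hold" kinds, the bounds and the idea that the environment is a single price-like number per step are all assumptions for illustration, not the real system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Action:
    kind: str      # discrete choice: "up", "down" or "hold"
    amount: float  # continuous parameter, ignored for "hold"


LEVEL_MIN, LEVEL_MAX = 0.0, 10.0  # assumed bounds on the internal state
MAX_AMOUNT = 2.0                  # assumed bound on the continuous parameter


def step(level: float, env_t: float, action: Action) -> Tuple[float, float]:
    """Deterministic transition: returns (new_level, step_cost)."""
    amount = min(max(action.amount, 0.0), MAX_AMOUNT)  # clip to the allowed range
    if action.kind == "up":
        amount = min(amount, LEVEL_MAX - level)        # cannot exceed the upper bound
        return level + amount, env_t * amount
    if action.kind == "down":
        amount = min(amount, level - LEVEL_MIN)        # cannot go below the lower bound
        return level - amount, -env_t * amount
    return level, 0.0


def rollout_cost(level0: float, env: List[float], plan: List[Action]) -> Tuple[float, float]:
    """Simulate a whole plan step by step; returns (total_cost, final_level)."""
    level, total = level0, 0.0
    for env_t, action in zip(env, plan):
        level, cost = step(level, env_t, action)
        total += cost
    return total, level
```

For instance, `rollout_cost(0.0, [1.0, 3.0], [Action("up", 2.0), Action("down", 2.0)])` pays 2.0 in the first step and recovers 6.0 in the second.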

>you can go with sampling approaches

What exactly do you mean by this? something like REINFORCE?

> I guess it is if you're using a MILP approach.

Not sure I follow here, but I'm not using a MILP (as in mixed integer linear program). At the moment I'm using a linear programming approximation and heuristics, which doesn't generalize well.

> some combination of MCTS with value function learning

I think this could work, however without looking into it I'm not sure that it would work at inference time in my resource-constrained setting