verbigratia OP t1_j1cxswc wrote

Thanks all.

I've only just started experimenting, but my approach so far has been to discretise the observation space into Q tables with shapes ranging from [8, 8, 8, 8, 8, 8, 2, 2, action_space.n] to [20, 20, 20, 20, 20, 20, 2, 2, action_space.n]. I've trained for 5k-10k episodes with a learning rate of 0.1 and a discount factor of 0.99.
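For anyone wanting to try the same thing, here is a rough sketch of the discretise-then-update step in Python with NumPy. It assumes the 8-dimensional LunarLander observation (six continuous values followed by the two leg-contact flags, which is why those dimensions have size 2), 4 discrete actions, equal-width bins and made-up clip ranges (`LOW`, `HIGH`). The names `discretise` and `q_update` are illustrative only, not from any library.

```python
import numpy as np

# Hypothetical clip ranges for x, y, vx, vy, angle, angular velocity
# (illustrative values only; tune them to the ranges you actually observe).
LOW = np.array([-1.5, -0.5, -2.0, -2.0, -3.14, -5.0])
HIGH = np.array([1.5, 1.5, 2.0, 2.0, 3.14, 5.0])
N_BINS = 8
N_ACTIONS = 4  # action_space.n for the discrete lander (assumed)

# N_BINS - 1 interior edges per dimension give N_BINS bins; values outside
# [LOW, HIGH] simply fall into the first or last bin.
EDGES = [np.linspace(lo, hi, N_BINS + 1)[1:-1] for lo, hi in zip(LOW, HIGH)]


def discretise(obs):
    """Map an 8-dim observation to a tuple of Q-table indices."""
    continuous = [int(np.digitize(v, e)) for v, e in zip(obs[:6], EDGES)]
    legs = [int(obs[6]), int(obs[7])]  # leg-contact flags are already 0 or 1
    return tuple(continuous + legs)


# Shape (8, 8, 8, 8, 8, 8, 2, 2, 4), matching the smaller table above.
Q = np.zeros((N_BINS,) * 6 + (2, 2, N_ACTIONS))


def q_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s, a) += alpha * (target - Q(s, a))."""
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state + (action,)] += alpha * (target - Q[state + (action,)])
```

Worth noting on table size: with 8 bins that is 8^6 × 2 × 2 × 4 ≈ 4.2M values, but with 20 bins it grows to about 1.0 billion (roughly 8 GB as float64), so most of those cells will never be visited.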

The results won't win any SpaceX contracts just yet, but they do produce soft-ish landings between the flags more often than not.

I found hovering to be a problem, so I added some handling to end the episode after around 500 steps.
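A minimal sketch of that kind of step cap, assuming a Gymnasium-style API where `reset()` returns `(obs, info)` and `step()` returns `(obs, reward, terminated, truncated, info)`. `run_episode`, `MAX_STEPS` and `HoverEnv` (a toy env that never terminates, standing in for a lander that hovers forever) are hypothetical names used only for illustration.

```python
MAX_STEPS = 500  # cap from the post; hovering episodes get cut off here


def run_episode(env, policy, max_steps=MAX_STEPS):
    """Roll out one episode, stopping once max_steps is reached.

    Returns (total_reward, steps_taken, hit_cap).
    """
    obs, _ = env.reset()
    total = 0.0
    for step in range(1, max_steps + 1):
        obs, reward, terminated, truncated, _ = env.step(policy(obs))
        total += reward
        if terminated or truncated:
            return total, step, False  # episode ended on its own
    return total, max_steps, True  # our cap stopped it


class HoverEnv:
    """Toy env that never terminates, mimicking endless hovering."""

    def reset(self):
        return 0, {}

    def step(self, action):
        return 0, -0.1, False, False, {}


total, steps, capped = run_episode(HoverEnv(), policy=lambda obs: 0)
print(steps, capped)  # 500 True
```

If you're using Gymnasium directly, `gym.make(env_id, max_episode_steps=500)` should give the same cutoff through its `TimeLimit` wrapper, which reports it as `truncated=True`. Either way, it's worth treating a capped episode as truncated rather than terminal in the Q update, so the final step still bootstraps from the next state's value instead of assuming zero future reward.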

At this point I normally start looking at what others have done, and I was surprised not to find more examples of tabular Q-learning in this scenario (despite the issues with the continuous observation space).

I'll look at deep RL next, but I found it interesting to try the tabular approach first.

Edit: grammar