Decision Making and Reinforcement Learning in Complex Environments

Many outdoor environments are unstructured and vary over both space and time, which can make robot behavior uncertain. For example, the motion of aquatic vehicles in water or aerial vehicles in air can be highly stochastic. Decision making (planning under uncertainty) allows a robot to counteract the action stochasticity caused by external disturbances.

Time-Varying Markov Decision Processes:

We developed a framework called the time-varying Markov Decision Process (TVMDP). The framework does not need to enlarge the state space or discretize time. Specifically, the TVMDP is built upon an extended transition model that varies both spatially and temporally, with an underlying computing mechanism that can be pictured as value iteration combining spatial "expansion" and temporal "evolution". The framework integrates a future horizon of environmental dynamics and produces highly accurate action policies under spatiotemporal disturbances caused by, e.g., tidal currents and/or air turbulence.
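
To make the idea of a transition model that changes over time concrete, here is a minimal Python sketch. It is not the TVMDP algorithm: it uses plain finite-horizon backward induction with discretized time steps, which is exactly the simplification the TVMDP is designed to avoid. The corridor world, the `slip_probs` disturbance schedule, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def corridor_transitions(n_states, slip_probs):
    """Hypothetical 1-D corridor: actions are 0 = left, 1 = right.
    At step t the disturbance reverses the intended move with
    probability slip_probs[t], so the model varies over time."""
    T = len(slip_probs)
    P = np.zeros((T, 2, n_states, n_states))
    for t, slip in enumerate(slip_probs):
        for a, step in enumerate((-1, +1)):
            for s in range(n_states):
                intended = min(max(s + step, 0), n_states - 1)
                pushed = min(max(s - step, 0), n_states - 1)
                P[t, a, s, intended] += 1.0 - slip
                P[t, a, s, pushed] += slip
    return P

def backward_induction(P, R, gamma=0.95):
    """P[t, a, s, s2] = Pr(s2 | s, a) at step t; R[s] = state reward.
    Returns values V of shape (T + 1, S) and a policy of shape (T, S)."""
    T, A, S, _ = P.shape
    V = np.zeros((T + 1, S))
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        # Q[a, s] = R[s] + gamma * sum_s2 P[t, a, s, s2] * V[t + 1, s2]
        Q = R[None, :] + gamma * (P[t] @ V[t + 1])
        policy[t] = Q.argmax(axis=0)
        V[t] = Q.max(axis=0)
    return V, policy

# Assumed example: weak disturbance at steps 0 and 2, strong reversal at step 1.
P = corridor_transitions(n_states=5, slip_probs=[0.1, 0.9, 0.1])
R = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # reward only at the right end
V, policy = backward_induction(P, R)
print(policy)
```

In this toy setup the best action next to the goal flips from "right" to "left" during the step where the disturbance is strong, because commanding the opposite direction is then more likely to carry the robot toward the goal. A planner that ignores the time variation cannot capture this.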

Left: flow pattern near San Francisco Bay (by F. Baart et al). Middle: without considering the time-varying aspect of disturbance, the robot's trajectory makes unnecessary detours. Right: trajectory produced by a decision-making strategy that has integrated prediction of ocean currents near southern California.

State-Continuity Approximation of Markov Decision Processes:

We developed a solution to the MDP-based decision-theoretic planning problem using a continuous approximation of the underlying discrete value functions. This approach yields an accurate, continuous value function even from a small number of states in a very low-resolution state space. We achieve this by using a second-order Taylor expansion to approximate the value function, which lets us model the value function as the solution of a boundary-conditioned partial differential equation that can be solved naturally with a finite element method. Our extensive simulations and evaluations show that the solution provides continuous value functions, leading to better paths in terms of smoothness, travel distance, and time cost, even with a smaller state space.
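
The step from the Bellman equation to a partial differential equation can be sketched as follows. This is a generic derivation under stated assumptions (a fixed policy \pi, a discount factor \gamma, and a continuous state s), not necessarily the exact formulation used in our papers; the symbols \mu and M are introduced here only for illustration.

```latex
% Fixed-policy Bellman equation (assumed discounted form)
V(s) = R(s) + \gamma \sum_{s'} T(s' \mid s, \pi(s))\, V(s')

% Second-order Taylor expansion around s, with \Delta = s' - s
V(s') \approx V(s) + \nabla V(s)^{\top} \Delta
      + \tfrac{1}{2}\, \Delta^{\top} \nabla^{2} V(s)\, \Delta

% Take the expectation over s', writing
% \mu(s) = E[\Delta] and M(s) = E[\Delta \Delta^{\top}]
(1 - \gamma)\, V(s) \approx R(s) + \gamma \left( \mu(s)^{\top} \nabla V(s)
      + \tfrac{1}{2}\, \operatorname{tr}\!\big( M(s)\, \nabla^{2} V(s) \big) \right)
```

The last line is a second-order PDE in V. Together with boundary conditions (for example, fixed values at goal and obstacle states), it is the kind of problem a finite element solver handles on a coarse mesh.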

Left: MDP policy iteration with continuous value approximation by finite element analysis. Middle: exact discrete MDP policy iteration on high-resolution states (traditional). Right: Goal-oriented planner without motion uncertainty (policy) optimization.

Action Learning with Visual Perception:

We recently developed a method for exploring and monitoring coral reef habitats using an autonomous underwater vehicle (AUV) equipped with an onboard camera. To accomplish this task, the vehicle needs to learn to detect and classify different coral species, and also to make motion decisions that explore larger unknown areas while detecting as many corals (with species labels) as possible. We propose a systematic framework that integrates object detection, occupancy grid mapping, and reinforcement learning. To enable the vehicle to balance exploration of larger spaces against exploitation of promising areas, we propose a reward function that combines an information-theoretic objective for spatial coverage of the environment with a term that encourages coral detection. We have validated the proposed method through extensive simulations, and the results show that our approach performs well even when trained on a small number of images (50 images in total) collected in the simulator.
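
A reward of this kind can be sketched in Python as below. This is an illustrative assumption, not the reward used in our paper: it measures coverage as the reduction in Shannon entropy of the occupancy grid and adds a weighted count of coral detections. The function names and the `detection_weight` parameter are hypothetical.

```python
import numpy as np

def cell_entropy(p):
    """Binary Shannon entropy (in bits) of each occupancy probability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # avoid log(0)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def exploration_reward(grid_before, grid_after, n_detections,
                       detection_weight=0.5):
    """Information gain from the map update plus a detection bonus.
    detection_weight is an assumed trade-off parameter."""
    info_gain = cell_entropy(grid_before).sum() - cell_entropy(grid_after).sum()
    return info_gain + detection_weight * n_detections

# Assumed example: a 2x2 map starts fully unknown (p = 0.5 everywhere);
# one camera frame resolves two cells and yields two coral detections.
before = np.full((2, 2), 0.5)
after = np.array([[0.0, 1.0],
                  [0.5, 0.5]])
reward = exploration_reward(before, after, n_detections=2)
print(round(reward, 3))
```

Raising `detection_weight` biases the vehicle toward areas where corals have already been found, while lowering it favors covering unexplored cells.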

Left: An illustration of the system that enables the AUV to autonomously explore and monitor the coral habitats in unknown environments. The YOLOv3 detector processes the images captured by an onboard camera. The obtained coral species distribution is used to update the occupancy grid map of the environment. The AUV's motion decision is then computed based on the updated map and the current pose of the vehicle. Right: An illustration of the camera setup. The camera is mounted underneath the AUV. We set the camera to look downward (indicated by the red arrow) so that each image pixel has the same depth value. The red and gray areas denote the grids that are inside and outside the camera's field of view, respectively.

Evaluation of the YOLOv3 coral detection and recognition (different coral species).

A short video demonstrating the AUV's learned monitoring trajectory:

Related Papers:

  • "Reachable Space Characterization of Markov Decision Processes with Time Variability". Junhong Xu, Kai Yin, Lantao Liu. Robotics: Science and Systems (RSS). Messe Freiburg, Germany. June, 2019.
  • "A Solution to Time-Varying Markov Decision Processes". Lantao Liu, Gaurav S. Sukhatme. IEEE Robotics and Automation Letters (RA-L). vol. 3, no. 3, pp. 1631-1638, 2018.
  • "Action Learning for Coral Detection and Species Classification". Junhong Xu, Lantao Liu. The OCEANS Conference. Seattle, WA, 2019.
  • "Learning Partially Structured Environmental Dynamics for Marine Robotic Navigation". Chen Huang, Kai Yin, Lantao Liu. The OCEANS Conference. 2018.
  • "Reachability and Differential based Heuristics for Solving Markov Decision Processes". Shoubhik Debnath, Lantao Liu, Gaurav Sukhatme. International Symposium on Robotics Research (ISRR). Chile, 2017.
  • "Solving Markov Decision Processes with Reachability Characterization from Mean First Passage Times". Shoubhik Debnath, Lantao Liu, Gaurav Sukhatme. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Madrid, Spain, 2018.