Imitation and reinforcement learning to control soft robots: a perspective

Soft robotics has become one of the most active research topics of recent years, owing to the deformability, safe interaction, and dexterity that soft platforms can exhibit. Despite this, we see few applications of this class of robots around us. We believe the reason lies in the gap between the ability to design and fabricate soft robots and the ability to control them, and that closing this gap requires a substantial advance in control capability. Control, however, is challenging: the virtually infinite degrees of freedom of a soft robot make it difficult to model the robot's behavior, its interaction with the environment, and the environment's effect on the robot. These challenges create a pressing need for solutions that can either work with the inaccurate models obtained or avoid models altogether. To address these limitations, we present two classes of machine learning algorithms capable of performing well under exactly these conditions: reinforcement learning (RL), which can provide a control policy for a platform with sophisticated dynamics even when the environment used to train that policy is of questionable accuracy, and imitation learning (IL), which learns a policy from the demonstrations of an expert and thereby avoids the need for a model or its constituent parts altogether.


Introduction
The motivation for soft robotics stems from biological systems, whose soft structures allow them to exist and function efficiently in difficult and highly unstructured environments [1]. Robots designed and fabricated under this inspiration therefore offer the compliance, deformability, and adaptability best suited, as originally intended, to unpredictable environments. Yet it is uncommon to find such a platform playing a major role in our society, even though sophisticated platforms that imitate the biological systems that inspired their design have already been designed and fabricated [2], and even though the field has existed for almost two decades. The hindrance, arguably, lies in the current capability to model and control these robots.
To that end, various researchers have presented modelling techniques able to capture approximate behavior, or behavior under constrained settings. The most notable are constant curvature [3] and piecewise constant curvature models [4], models based on Cosserat rod theory [5], FEM-based models with slow computation [6] or the SOFA framework [7] with good accuracy [8] and low computational cost through model order reduction [9], and finally machine-learning-based data-driven models [10], [11]. While most of these techniques capture the behavior exhibited by the underlying soft robot, the accuracy of that behavior remains somewhat questionable. For example, imposing a constant curvature constraint, although useful, leaves the solution indecisive under non-negligible external loading. Cosserat rod theory improves accuracy, but the underlying partial differential equations are sensitive to noise in the model parameters, and their computational cost increases drastically as modularity is added to the soft robot. A similar trend is seen with FEM-based approaches: they work much better than the previously mentioned methods, but their computational cost hinders real-time control and the implementation of learning-based control. Data-driven approaches are more amenable to learning-based control, yet they pose challenges of their own: modelling the interaction with the environment requires datasets that are not easily available, and the artificial neural networks trained to capture a soft robot's behavior tend to accumulate error over a finite horizon when operated in closed loop.
Although the presented approaches deal with the modelling problem in different ways, a single aspect is missing in all of them: capturing the global behavior of soft robots, which have virtually infinite degrees of freedom. This strongly limits the number of solutions proposed to control them. All of the above models have been used for control purposes. Piecewise constant curvature models are used in [4] to achieve curvature and bending control, [12] uses a Cosserat-theory-based simulation environment with RL to follow a trajectory or track a goal in 3D space, [13] and [14] use SOFA (an FEM-based simulator) for position control with and without model order reduction, respectively, and data-driven models of soft robots have been used for trajectory control via learning-based adaptive control [15], an open-loop dynamic controller [16], and dynamics-model-based RL [17, 18].
These proposed solutions are typically effective in controlled environments. In this context, a controlled environment refers to a situation in which the actual performance of the soft robot closely matches its modelled behavior. However, this assumption is not entirely practical, as it requires the model to capture the global behavior of the soft robot, including any factors that may affect its performance over time. Several factors can alter a soft robot's performance, such as changes in the elastomeric material properties, variations in the external environment, ageing of the material, and wear and tear. More often than not, these factors are either ignored or assumed to remain consistent from the control-derivation step to the application step.
To ensure the practical applicability of any control solution for soft robots, it is essential to account for these factors. To address these challenges and obtain controllers with the desired accuracy, we may look to alternative solutions that are either independent of the modelled behavior or robust enough to adapt when there is a significant difference between the modelled and actual behavior of the robot. We highlight two classes of algorithms able to address these issues: imitation learning (IL) and reinforcement learning (RL).

Imitation Learning for Soft Robots
Imitation learning is a class of algorithms in which a desired task is demonstrated by an expert [19]. The expert can be a human, a pre-trained or incrementally improving policy, a synthetic learning agent, a demonstration by another robot or a biological system, or even a video. The obtained demonstrations are then used to learn a policy for the demonstrated task, to be executed on the platform of choice.
In the literature, this class has been used extensively across a variety of applications: learning to imitate repetitive human arm movements on a rigid humanoid robot [20]; autonomous and agile navigation of a ground vehicle [21], learned online by imitating the mapping from expert states to continuous steering and throttle commands (here the expert is a model-predictive controller with access to models and expensive, advanced sensors); learning to manipulate objects [22] in a sophisticated environment with rigid manipulators, using demonstrations acquired in virtual reality via teleoperation; and many more. This class is known to achieve a great deal with a limited dataset.
While a few demonstrations tend to suffice for rigid robots to learn the underlying task and still generalize over a slightly varying environment (as in the single-shot learning schemes of [23, 24]), the same may not hold for soft robots, for the following reasons: 1) where human experts are involved, recorded demonstrations of the same task may vary significantly from one another because of the robot's flexible shape and the absence of a fixed reference frame, unlike rigid robots; 2) soft robots undergo varying kinematics upon interaction with external objects; and 3) soft robots exhibit stochastic behavior due to varying material properties. These variations steadily erode generalization capability. We consider all these factors when proposing solutions in the following sections.
There are two practical ways to implement imitation learning on soft robots: scenario 01, kinesthetic learning, and scenario 02, learning from observation. In kinesthetic learning, the expert physically manipulates the soft robot. Expert demonstrations, in terms of state-action pairs, are recorded and used to train a policy. The policy, in return, predicts actions for the soft robot that replicate the expert states. The expert task demonstrations are reproduced on the platform of choice with [25] or without expert intervention [26], depending on the difficulty of the task and the platform at hand. The general algorithmic flow is shown in Fig. 1. Examples of this approach include a human teaching a soft robot to fish, perform a surgery, or trace a certain pattern or trajectory in 3D space or on an object.
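To make scenario 01 concrete, the sketch below trains a behavior-cloning policy by regressing the recorded expert actions on the corresponding robot states. It is a minimal, illustrative example: the 12-dimensional state (e.g., motion-capture markers), the 4-dimensional pressure-command action, and the network architecture are placeholder assumptions rather than recommendations for any specific platform.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dimensions: 12-D robot state (e.g., mocap markers),
# 4-D action (e.g., chamber pressure commands).
STATE_DIM, ACTION_DIM = 12, 4

def behavior_cloning(states, actions, epochs=100):
    """states: (N, STATE_DIM), actions: (N, ACTION_DIM) float tensors
    gathered from kinesthetic expert demonstrations."""
    policy = nn.Sequential(
        nn.Linear(STATE_DIM, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, ACTION_DIM),
    )
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    loader = DataLoader(TensorDataset(states, actions), batch_size=64, shuffle=True)
    for _ in range(epochs):
        for s, a in loader:
            loss = nn.functional.mse_loss(policy(s), a)  # regress expert actions
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```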
In the second scenario, the robot learns a task by observing the expert; the expert does not physically interact with the soft robot. Features are extracted from the expert demonstrations of the desired task. It is important to note that the demonstrations may live in a space different from the robot's own configuration, action, or operation space. Examples include a soft robot learning to flip burgers from a video demonstration by a human, or a soft robotic arm following a sequence of motions observed from a biological system, such as octopus arm movements replicated on an octopus-inspired soft robot or elephant trunk movements replicated on a trunk-inspired soft arm. The general algorithmic flow of this approach is shown in Fig. 2.
In the first scenario of using IL for soft robots, the gathered observations and associated actions can be used to derive a policy directly, using behavior cloning techniques as described in Chapter 3 of the book by Osa et al. [27]. This policy can then be improved online using policy optimization, without the need to explicitly model the subject platform, the operating environment, or the task itself. In the second scenario, however, an additional mapping, also known as a domain adaptation map [28], is required to transform the expert demonstrations into a space or frame that the subject platform can replicate. This is because the soft robot and the expert may not only lie in different frames but may also have different morphologies and action spaces. The expert demonstrations therefore need to be transformed to match the specific requirements of the subject platform before the IL algorithm can learn and improve the soft robot's performance.
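One simple way to realize such a domain adaptation map, assuming a small set of paired examples is available (e.g., from a calibration phase in which the robot is driven through configurations matching recorded expert features), is to learn a regression from the expert space to the robot's configuration space. The sketch below is purely illustrative; the dimensions and the availability of paired data are assumptions, and in practice the map may need to be learned without such pairing.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: expert features (e.g., keypoints tracked in a video
# of an octopus arm) and the soft robot's configuration space.
EXPERT_DIM, ROBOT_CONFIG_DIM = 16, 6

class DomainAdaptationMap(nn.Module):
    """Maps expert-space features to robot-replicable configurations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EXPERT_DIM, 64), nn.Tanh(),
            nn.Linear(64, ROBOT_CONFIG_DIM),
        )

    def forward(self, expert_features):
        return self.net(expert_features)

def fit_map(expert_feats, robot_configs, epochs=200):
    """expert_feats: (N, EXPERT_DIM), robot_configs: (N, ROBOT_CONFIG_DIM)
    paired examples, e.g., collected during a calibration phase."""
    f = DomainAdaptationMap()
    opt = torch.optim.Adam(f.parameters(), lr=1e-3)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(f(expert_feats), robot_configs)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return f
```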
In the context of soft robotics, the second scenario is considered more favorable. We, as a soft robotics community, strive to mimic biological systems, which requires knowledge transfer from one system to another, such as transferring the motion of a real octopus to an octopus-inspired robot, as discussed by Kim et al. [1]. While the general appearance of the soft robot and the biological system may be similar, their actuation capabilities can differ significantly. Returning to the first scenario, it appears more feasible to run this scheme on rigid robots, as they have a fixed shape, a reference frame to start from, and repeatable joint movement, ignoring the influence of human factors. The demonstrations thus accumulated tend to agree closely, without significant variability. If such a scheme is applied to soft robots, however, the absence of a fixed shape or reference frame and the virtually infinite degrees of freedom, exacerbated by the influence of human experts, result in a widely varying distribution of demonstrations of the desired task. Vanilla behavior cloning may not be sufficient to learn a policy for the underlying task. Furthermore, the kinematics or geometry of a rigid robot does not change when it comes into contact with another object. This assumption does not necessarily hold for soft robots, whose properties can be affected by external factors, leading to deviations from the expected behavior.
The challenges posed by scenarios 01 and 02 combined complicate policy derivation for soft robots, even after domain adaptation in scenario 02. Vanilla behavior cloning is not sophisticated enough to handle them, and a more incremental, data-driven approach is required, such as Dataset Aggregation (DAgger). Both scenarios can implement this approach with appropriate modifications, as highlighted in Figs. 1 and 2. Although such a learning scheme may learn the underlying task from the distribution of demonstrations, it tends to under-perform when faced with a state it has never seen. The policy only has knowledge from the expert demonstrations, and it is impossible to cover all situations that the policy may face during deployment, especially when a high degree of accuracy is desired. Given that IL is not known for robust behavior in changing environmental settings, we move on to the class of algorithms best known for robust behavior under varying conditions: RL.
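Before moving on, the following sketch outlines one DAgger-style loop, assuming the expert can be queried for corrective actions on the states that the learner itself visits (e.g., through kinesthetic correction). The environment and expert interfaces are hypothetical, and behavior_cloning refers to the supervised routine sketched earlier.

```python
import numpy as np
import torch

def dagger(env, expert, states, actions, n_iters=10, rollout_len=200):
    """Dataset Aggregation (DAgger): repeatedly roll out the current policy,
    relabel the visited states with expert actions, and retrain.

    env     - hypothetical environment with reset() and step(action) -> next_state
    expert  - callable returning the expert's corrective action for a state
    states, actions - initial kinesthetic demonstration dataset (float tensors)
    behavior_cloning - the supervised routine sketched in the previous section
    """
    policy = behavior_cloning(states, actions)
    for _ in range(n_iters):
        s = env.reset()
        visited, labels = [], []
        for _ in range(rollout_len):
            a = policy(torch.as_tensor(s, dtype=torch.float32)).detach().numpy()
            visited.append(s)
            labels.append(expert(s))   # expert relabels the state the learner reached
            s = env.step(a)            # but the learner follows its own action
        # aggregate the new labels and retrain on the growing dataset
        states = torch.cat([states, torch.as_tensor(np.asarray(visited), dtype=torch.float32)])
        actions = torch.cat([actions, torch.as_tensor(np.asarray(labels), dtype=torch.float32)])
        policy = behavior_cloning(states, actions)
    return policy
```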

Reinforcement Learning for Soft Robots
This class of machine learning algorithms provides a policy that is learned from a training environment by trial and error. The training environment can be a simulation engine, a mathematical formulation, or a data-driven machine-learning-based model that describes the approximate behavior of the platform under consideration, including the encompassing modules that affect or complement the robot's behavior. These encompassing modules may include, but are not limited to, the reward/penalty/objective function that describes the task at hand, the interaction of the subject platform with objects, humans, or robots in the environment, and physical constraints that affect or hinder the desired task.
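In practice, such a training environment is typically wrapped behind a standard reset/step interface, regardless of whether a simulator, an analytical model, or a data-driven model sits underneath. The skeleton below is a minimal sketch of this idea; the forward model, the 12-dimensional state, and the goal-reaching reward are placeholders, not a description of any particular system.

```python
import numpy as np

class SoftArmEnv:
    """Minimal training-environment skeleton for a soft arm (illustrative only)."""

    def __init__(self, model, goal, max_steps=200):
        self.model = model            # any forward model: (state, action) -> next state
        self.goal = np.asarray(goal)  # target tip position in task space
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.state = np.zeros(12)     # hypothetical 12-D state (e.g., tip pose + pressures)
        return self.state

    def step(self, action):
        self.state = self.model(self.state, action)
        self.t += 1
        tip = self.state[:3]                        # assume first 3 entries are tip position
        reward = -np.linalg.norm(tip - self.goal)   # dense goal-reaching reward
        done = self.t >= self.max_steps or np.linalg.norm(tip - self.goal) < 1e-2
        return self.state, reward, done
```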
A learning agent, on the other hand, is an algorithm (under the RL umbrella) that attempts to learn a policy capable of performing the desired task by taking random actions and observing how the training environment responds in terms of a reward or penalty. These signals, in turn, guide future actions [29] so as to maximize the reward or minimize the penalty, and the actions eventually converge to those that accomplish the desired task. Such a policy, however, is acquired only after a significant number of iterations. This makes the rather obvious point that the scheme cannot be executed directly on the robot, as performing tens of thousands of iterations to reach the desired result is not practical from a safety or resource perspective. The training environment therefore needs to be created in a non-physical domain, as highlighted in Fig. 3.
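A minimal sketch of this trial-and-error loop, using a basic policy-gradient (REINFORCE) update against the environment interface sketched above, is given below. It is only one of many possible RL algorithms, and the dimensions and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def reinforce(env, state_dim=12, action_dim=4, episodes=5000, gamma=0.99):
    """Minimal policy-gradient loop against an environment exposing
    reset()/step() as in the sketch above. Illustrative only."""
    policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                           nn.Linear(64, action_dim))
    log_std = torch.zeros(action_dim, requires_grad=True)   # exploration noise
    opt = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=3e-4)

    for _ in range(episodes):
        s, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            mean = policy(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Normal(mean, log_std.exp())
            a = dist.sample()                       # random exploration around the mean
            log_probs.append(dist.log_prob(a).sum())
            s, r, done = env.step(a.numpy())
            rewards.append(r)
        # discounted returns, then the policy-gradient update
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```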
When it comes to the various components that make up the training environment, such as the robot behavior model, the interaction model, and the model of the external environment's effect on the robot, researchers have proposed solutions over the past decade, as found in the literature [3, 4, 5, 6, 7, 8, 9, 10, 11, 15, 18]. However, the level of accuracy achieved remains a challenge and an ongoing topic of interest. This introduces an additional hurdle when the policy acquired from the training environment is deployed in the real environment, as the training environment holds only an approximation of the real platform's behavior. The discrepancy between the simulation environment (the proposed models) and the real world is famously known as the simulation-to-reality (sim2real) gap, and it will persist until the behavioral gap is closed. Depending on the training environment, this gap may go by a different name, such as the training2real gap or machine2real gap, but it presents the same hurdle nonetheless. Additional optimization is thus required to deploy the offline-trained policy in the real environment. The complete map for RL is shown in Fig. 3.
At the same time, the time and computational resources required for this optimization depend on several factors, such as the complexity of the robot or the task at hand, the technique being used, and the level of discrepancy between the simulation and real environments. The requirement intensifies as performance degrades, particularly when the soft robot exhibits stochastic behavior or changes due to its material properties. Nevertheless, because the learning agent explores its environment through random search while learning the optimal sequence of actions for the desired task, the resulting solution tends to be quite robust under varying conditions.
On a different note, it may not always be feasible to obtain all the models needed for the offline training environment, such as a model of the interaction with an external object of unknown properties, or a model of the impact of external constraints on the robot's behavior. Additionally, crafting a reward function for every type of desired task introduces challenges of its own. Consider, for instance, deriving a reward/penalty/objective function for tying a knot, fishing, or flipping a burger. Such tasks may be learned more effectively from demonstrations.
The conundrum of choosing between a time-consuming and a fast solution, a computationally exhaustive and a relatively cheap approach, a robust approach and one that fails under unseen (eluded) states, and a complicated hand-crafted reward function versus demonstrations alone, can be mitigated by combining the two approaches. This leads us to the proposed scheme of combining IL and RL for a robust and fast solution.

Combination of IL and RL for Soft Robots
Looking back, we note that RL introduces robustness while IL produces a fast solution for quite sophisticated tasks. A combination of both schemes would therefore produce a preliminary policy from the expert demonstrations and then optimize that policy online using exploration-based optimization (RL) to withstand unstructured and varying environmental conditions. A similar solution is proposed in [25], where the authors introduce a synthetic agent based on DQN (an off-policy RL algorithm) to support the policy in eluded states where it may fail, as shown in Fig. 4a.
A question arises regarding the exploration-based search: where does the objective function come from for tasks in which hand-crafted reward/penalty functions are challenging to design? In such situations there is another class of algorithms, inverse reinforcement learning (IRL), surveyed in [30]. IRL algorithms learn a reward/objective function from expert demonstrations, observed behavior, a learned policy, or directly from an expert; the obtained objective function can then be used for policy optimization. There is, however, an underlying assumption that the expert demonstrations are the optimal way to perform the task, which may not always be the case. Nonetheless, the objective/reward function obtained from the demonstrations can produce a solution for the desired task that is as good as the demonstrations themselves, and it can also be robustified for unstructured environments (as discussed in Chapter 4 of [27]).
The RL agent, as made clear in Sec. 3, introduces robustness to the policy for a desired task by using a reward function and search-based algorithms. So why limit such a solution to merely exploring and intervening during eluded states when it is used with IL? At the same time, as indicated in Sec. 2, a policy learned purely from demonstrations may lack generalization capability, owing to the high-dimensional observation space and continuous action space of the platform under consideration (soft robots in this case), or may suffer distribution shift due to error accumulation [31] when learned solely through supervised learning on the demonstrations. Under such conditions, an RL agent can be encouraged to train a policy that matches the expert demonstrations over a long finite horizon on the basis of comparative incentives. It is important to note that no reward function is hand-crafted for the task or extracted from the demonstrations; instead, the incentives are based on how closely the agent matches the expert demonstrations over a finite period (as presented in [31]).
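One way (among several) to implement the demonstration-matching incentive described above is to train a discriminator to distinguish expert state-action pairs from the agent's, and to use its output as a surrogate reward for the RL update, in the adversarial-imitation spirit of [31]. The sketch below shows only this reward machinery; the dimensions are placeholders, and the policy update itself would proceed as in the RL loop sketched earlier.

```python
import torch
import torch.nn as nn

# Placeholder dimensions, as in the earlier sketches.
STATE_DIM, ACTION_DIM = 12, 4

disc = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.Tanh(),
                     nn.Linear(64, 1))
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def update_discriminator(expert_sa, agent_sa):
    """expert_sa, agent_sa: (N, STATE_DIM + ACTION_DIM) batches of concatenated
    state-action pairs from the demonstrations and from the agent's rollouts."""
    logits_e, logits_a = disc(expert_sa), disc(agent_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + \
           bce(logits_a, torch.zeros_like(logits_a))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

def imitation_reward(state, action):
    """Surrogate reward: high when the pair looks expert-like. No hand-crafted
    task reward is needed; this signal drives the RL policy update."""
    with torch.no_grad():
        sa = torch.cat([state, action], dim=-1)
        return -nn.functional.logsigmoid(-disc(sa))   # -log(1 - D(s, a))
```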
In addition to learning a single task quickly and making it resilient to a varying and unstructured environment, it is also possible to train a single policy to perform multiple tasks. For example, consider Fig. 4a, where a trained DQN agent serves as the expert for a student policy training on a single task. Instead of a single DQN agent for a specific task, we can have multiple experts, each associated with an independent task, as shown in Fig. 4b. The demonstrations from each expert are gathered in a separate buffer, and the student policy is trained episode by episode, with episodes drawn from the different buffers sequentially (a sketch follows below). This approach can result in a policy that performs multiple tasks and is robust enough to work under varying conditions, thanks to the intervention of the independent RL agents, as also demonstrated in [25].
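A minimal sketch of this multi-expert idea is given below, assuming each task's DQN expert has already filled a buffer of episodes with its Q-values; the interfaces and the KL-based loss on temperature-softened Q-values are in the spirit of the policy-distillation approach of [25], but the details here are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

def distill_multitask(expert_buffers, state_dim=12, n_actions=8, epochs=50):
    """Train one student policy from several task-specific expert buffers,
    interleaving episodes from the different buffers (cf. Fig. 4b).

    expert_buffers: list (one per task) of episodes; each episode is a list of
    (state, expert_q_values) pairs produced by that task's DQN expert.
    """
    student = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                            nn.Linear(128, n_actions))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    tau = 0.01   # softening temperature for the expert's Q-values
    for _ in range(epochs):
        for buffer in expert_buffers:            # task by task, episode by episode
            episode = random.choice(buffer)
            states = torch.stack([s for s, _ in episode])
            expert_q = torch.stack([q for _, q in episode])
            target = torch.softmax(expert_q / tau, dim=-1)
            log_pred = torch.log_softmax(student(states), dim=-1)
            loss = nn.functional.kl_div(log_pred, target, reduction='batchmean')
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```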

Challenges and Future Work
Soft robots tend to exhibit stochasticity in their performance, so the viability of the IL-RL combination remains a topic of interest and a subject of our ongoing research. We believe that the policy obtained from the IL-RL solution can work under varying conditions. However, if stochasticity is added to this variation, the adaptability of the solution may degrade, because we are relying solely on the search capabilities of an RL algorithm that was trained on a single training signal from a continuous-state environment. A knowledge base such as the one suggested in Fig. 4b, where multiple signals are used to train a single policy, could introduce adaptability to significantly varying environments, but the forever-exploring RL agents would again require more interactions to detect that the environment has changed, since their training runs in parallel.
If we instead train sequentially, first in one type of environment and then in another, we run into the problem known as catastrophic forgetting: the policy learns a new task but underperforms on, or forgets, the previous task as its weights are overwritten. A previously learned task in a very different environment may also be construed as a new task. For such scenarios, we need to explore intelligently or use the gathered knowledge base effectively when learning a new task or adapting previously learned tasks to a very new environment. Continual learning is a class of algorithms that deals with such issues and may prove an exciting direction for future research on this problem.
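As one example from this class, the sketch below shows a regularization-based continual-learning penalty in the style of elastic weight consolidation; it is not the only option, and nothing in the present work commits to this particular technique.

```python
import torch
import torch.nn as nn

def ewc_penalty(policy, old_params, fisher, lam=100.0):
    """Elastic-weight-consolidation-style penalty: one example of a
    continual-learning regularizer that discourages overwriting weights that
    were important for previously learned tasks/environments.

    old_params: dict name -> tensor of weights after the previous task.
    fisher:     dict name -> tensor of per-weight importance estimates
                (e.g., squared gradients averaged over the old task's data).
    """
    penalty = torch.tensor(0.0)
    for name, p in policy.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty

# During training on a new environment, the total loss would become
#   loss = task_loss + ewc_penalty(policy, old_params, fisher)
# so the new task is learned while weights important to earlier tasks are preserved.
```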

Conclusion
In this paper, we have discussed the potential of two main classes of machine learning algorithms, IL and RL, to find optimal control solutions for soft robots. We have shown that each approach has its own strengths and limitations, and that combining them can provide a more comprehensive solution that is both efficient and robust under varying and unstructured environments. We have further highlighted the importance of addressing stochasticity in soft robot control, which can challenge the adaptability of IL-RL solutions, and we propose employing continual learning for the adaptability problem, as it can help overcome this challenge and enable soft robots to learn and adapt to new tasks and environments effectively. Overall, our perspective on the combination of IL and RL for soft robot control highlights their potential to transform the applicability of soft robots to sophisticated tasks. We hope that this paper will inspire further research in this exciting area and lead to the development of new and innovative approaches for controlling soft robots.

Figure 1 :
Figure 1: Scenario 01 - Kinesthetic expert demonstrations are recorded using a motion capture system. Behavior cloning can be applied to the expert's state-action pairs to train a preliminary policy, which can be further optimized using policy optimization approaches.

Figure 2 :
Figure 2: Scenario 02 - The desired behavior to replicate is observed from an expert, which could be a video, a different robot, a human agent, any other biological system performing the desired behavior, or even a hand-crafted idea. Next, features are extracted from these observations and mapped into a domain that can be replicated by the robot in question, using a domain adaptation map. Finally, a policy can be trained and optimized as in Fig. 1.

Figure 3 :
Figure 3: This figure highlights the various components involved in the training environment for a policy trained using RL-based algorithms. Policies obtained from offline training environments often underperform when tested on the actual robot, particularly for robots with complex dynamics, such as high-DoF soft robots. This discrepancy between the simulation/training environment and the actual environment is commonly referred to as the simulation-to-reality gap and typically requires an additional optimization step with the physical robot.

Figure 4 :
Figure 4: The block diagrams in (a) and (b) illustrate the systemic flow of the algorithm proposed by Rusu et al. in [25]. The goal is to train a student policy using policy distillation, where knowledge is distilled from a DQN agent in online mode. The DQN agent acts as an expert for the under-training student policy and provides demonstrations for each encountered state. In (b), N DQN agents act as experts for N different tasks and provide demonstrations for each task in an associated buffer. The knowledge is then distilled from all the DQN agents in online mode to train a student policy capable of performing N tasks.