Task-Oriented Adaptive Learning of Robot Manipulation Skills

Kexin Jin1, Guohui Tian2, Bin Huang2, Yongcheng Cui3, Xiaoyu Zheng4
1Shandong University
Corresponding authors: kx.jin@mail.sdu.edu.cn; g.h.tian@sdu.edu.cn; huangbin@sdu.edu.cn; cuiyc@mail.sdu.edu.cn; 202334997@mail.sdu.edu.cn



Abstract

In industrial environments, robots are confronted with constantly changing working conditions and manipulation tasks. Traditional approaches that require robots to learn skills from scratch for new tasks are often slow and heavily dependent on human intervention. To address this inefficiency and enhance robots' adaptability, we propose a general Intelligent Transfer System (ITS) that enables autonomous and rapid new skill learning in dynamic environments. ITS integrates Large Language Models (LLMs) with transfer reinforcement learning, harnessing both the advanced comprehension and generative capabilities of LLMs and the pre-acquired skill knowledge of robotic systems. First, to achieve comprehensive understanding of unseen task commands and enable autonomous skill learning in robots, we propose a reward function generation method based on task-specific reward components. This approach significantly improves time efficiency and accuracy while eliminating the need for manual design. Second, to accelerate the learning of new robotic skills, we propose an Intelligent Transfer Network (ITN) within the ITS. Unlike traditional methods that merely reuse or adapt existing skills, ITN intelligently integrates related skill features, significantly enhancing learning efficiency through synergistic knowledge fusion. We systematically evaluate our method in simulation. The results demonstrate that the system can efficiently learn unseen skills without relying on pre-programmed behaviors, achieving significant improvements in time efficiency: specifically, 72.22% and 65.17% faster learning for two major tasks compared to learning from scratch. Supplementary materials are accessible via our project page: https://jkx-yy.github.io/

Algorithm

ITS generates a process-oriented reward function for the new task and accelerates learning.
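To make the overall flow concrete, the sketch below outlines the ITS pipeline as we describe it (task command → reward_components from the fine-tuned LLaMA → process-oriented reward function from the generator → transfer learning with ITN). It is an illustrative sketch only: the helper names generate_reward_components, generate_reward_function, and ITN are hypothetical placeholders, not the released implementation.

# Illustrative sketch of the ITS pipeline; all helper names below are hypothetical
# placeholders, not the released implementation.
def its_learn_new_skill(task_command, env, source_skills):
    # 1. The fine-tuned LLaMA turns the task command into step-by-step
    #    reward_components (distance terms listed in execution order).
    reward_components = generate_reward_components(task_command)

    # 2. The generator writes a process-oriented reward function from the
    #    reward_components and the environment observation code, then refines
    #    it over several epochs using policy feedback.
    reward_fn = generate_reward_function(reward_components, env)

    # 3. ITN fuses features of related source skills and trains the target
    #    policy with transfer reinforcement learning.
    itn = ITN(source_skills)
    policy = itn.train(env, reward_fn)
    return policy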

Dataset


We have constructed a dataset of 10,000 task-to-reward_components instructions. Each instruction contains the task command and the corresponding reward_components, as shown in the example below. The dataset is available in the code release.

Example of a Single Data Instruction in JSON Code Block



{
    "instruction": "For the continuous operation task of the robot, according to the input task command, all the distance calculation components involved are listed step by step according to the execution order.",
    "input": "The three small, medium and large gears on the table are pushed to the specified position in turn.",
    "output": "1. Calculate the distance between the jaws and the small gear: distance_finger_to_small_gear\n2. Calculate the distance between the small gear and the small gear's target location: distance_small_gear_to_small_location\n3. Calculate the distance between the jaws and the medium gear: distance_finger_to_medium_gear\n4. Calculate the distance between the medium gear and the target location of the medium gear: distance_medium_gear_to_medium_location\n5. Calculate the distance between the jaws and the large gear: distance_finger_to_large_gear\n6. Calculate the distance between the large gear and the target location of the large gear: distance_large_gear_to_large_location"
}
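A minimal sketch of how such instructions could be loaded and formatted into fine-tuning prompts is shown below. The file name reward_components_dataset.json and the prompt template are assumptions for illustration, not part of the released dataset tooling.

import json

# Hypothetical file name; the actual dataset is provided with the code release.
DATASET_PATH = "reward_components_dataset.json"

def format_prompt(sample: dict) -> str:
    """Concatenate instruction and task command into a single fine-tuning prompt."""
    return f"{sample['instruction']}\n\nTask command: {sample['input']}\n\nReward components:"

with open(DATASET_PATH, "r", encoding="utf-8") as f:
    samples = json.load(f)  # list of {"instruction", "input", "output"} dicts

prompts = [format_prompt(s) for s in samples]
targets = [s["output"] for s in samples]
print(prompts[0])
print(targets[0])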
                

Reward_components Generation Comparison: LLaMA without Dataset Fine-tuning vs. LLaMA with Dataset Fine-tuning


Not Fine-tuned

Fine-tuned


Reward_components generated by LLaMA without fine-tuning are merely descriptive and cannot be used directly for reward function generation. In contrast, LLaMA fine-tuned on our carefully designed reward_components dataset accurately generates reward_components that can be used directly for reward function generation.

Reward Function Generation Experiment

In this experiment, we present the generator's reward optimization process and compare the reward function produced by the generator with the reward function generated by Eureka and with a manually designed reward function.


Effective prompt for using reward_components: To help the generator make full use of the reward_components produced by the fine-tuned LLaMA, we include a simple usage example in the prompt; the reward_components can be replaced according to the task at hand.

import torch
from torch import Tensor
from typing import Dict, Tuple

@torch.jit.script
def compute_reward(fingertip_midpoint_pos: Tensor, nut_pos: Tensor, bolt_tip_pos: Tensor) -> Tuple[Tensor, Dict[str, Tensor]]:
    # Temperature parameters for the reward transformations
    temp_pick = torch.tensor(0.1)   # Increased to make initial progress more rewarding
    temp_place = torch.tensor(0.4)  # Increased to make this step more rewarding

    # Calculate distances based on the reward_components
    distance_finger_to_nut = torch.norm(fingertip_midpoint_pos - nut_pos, p=2, dim=-1)
    distance_nut_to_bolttip = torch.norm(nut_pos - bolt_tip_pos, p=2, dim=-1)

    # Rewards for each stage of the task (closer is better)
    reward_pick = -distance_finger_to_nut
    reward_place = -distance_nut_to_bolttip

    # Transform rewards with an exponential function to normalize and control scale
    reward_pick_exp = torch.exp(reward_pick / temp_pick)
    reward_place_exp = torch.exp(reward_place / temp_place)

    # Combine rewards into the total reward; multiplication encourages sequential completion
    total_reward = reward_pick_exp * reward_place_exp

    # Dictionary of the individual reward_components
    rewards_dict = {
        "reward_pick": reward_pick_exp,
        "reward_place": reward_place_exp,
    }

    return total_reward, rewards_dict

      
                                    

1. The generator's reward function generation and optimization process for simple skills:

Task description : There is an M12 nut on the table; the goal of the Franka robotic arm is to pick up the M12 nut from the table and raise it above the table to the height of the bolt.

Environmental observation code : The environment observation code includes the pose information of the Panda arm's gripper and the object's pose information.

class FactoryTaskNutBoltPick(FactoryEnvNutBolt, FactoryABCTask):
    """Rest of the environment definition omitted."""

    def compute_observations(self):
        """Compute observations."""
        obs_tensors = [
            self.fingertip_midpoint_pos,
            self.fingertip_midpoint_quat,
            self.fingertip_midpoint_linvel,
            self.fingertip_midpoint_angvel,
            self.bolt_tip_pos,
            self.bolt_tip_quat,
            self.nut_pos,
            self.nut_quat,
        ]
        self.obs_buf = torch.cat(obs_tensors, dim=-1)
        # Zero-pad the observation buffer to the fixed observation dimension
        pos = self.obs_buf.size()
        com_vector = torch.full([pos[0], self.num_observations - pos[1]], 0.0).to(self.device)
        self.obs_buf = torch.cat((self.obs_buf, com_vector), -1)
        return self.obs_buf
                

Reward function optimization process : Taking the NutBolt_Pick task as an example, we show the process by which the generator optimizes the reward function.
Epoch1 Task success rate : 52.3 %
Epoch2 Task success rate : 60.9 %
Epoch3 Task success rate : 88.7 %
Epoch4 Task success rate : 91.5 %

We trained an RL policy using the provided reward function code and tracked the values of the individual components of the reward function, as well as global policy metrics such as success rates and episode lengths, after every 11 epochs, together with the maximum, mean, and minimum values encountered:
reward_pick: ['0.05', '0.06', '0.17', '0.40', '0.68', '0.72', '0.77', '0.79', '0.78', '0.79', '0.79'], Max: 0.80, Mean: 0.57, Min: 0.05 
reward_lift: ['0.83', '0.83', '0.83', '0.83', '0.83', '0.83', '0.83', '0.84', '0.84', '0.86', '0.85'], Max: 0.86, Mean: 0.84, Min: 0.83 
task_score: ['0.00', '0.00', '0.00', '0.02', '0.30', '0.53', '0.67', '0.77', '0.80', '0.87', '0.77'], Max: 0.91, Mean: 0.46, Min: 0.00 
episode_lengths: ['119.00', '119.00', '119.00', '119.00', '119.00', '119.00', '119.00', '119.00', '119.00', '119.00', '119.00'], Max: 119.00, Mean: 119.00, Min: 119.00 
Please carefully analyze the policy feedback and provide a new, improved reward function that can better solve the task. Some helpful tips for analyzing the policy feedback:
    (1) If the success rate is always close to zero, the sequential execution has failed to reach the last step; by analyzing the rewards of each subtask, you can infer at which step the task execution failed.
    (2) If the values of a certain reward_components are nearly identical throughout, RL is not able to optimize this component as it is written. You may consider
        (a) Changing its scale or the value of its temperature parameter
        (b) Re-writing the reward_components
        (c) Discarding the reward_components
    (3) If some reward_components' magnitude is significantly larger than the others, you must re-scale its value to a proper range.
Please analyze each existing reward_components in the suggested manner above first, and then write the reward function code. The output of the reward function should consist of two items:
    (1) the total reward,
    (2) a dictionary of each individual reward_components (a reward dictionary for each sub-process).
The code output should be formatted as a python code string: "```python ... ```".
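As an illustration of how such policy feedback could be assembled, the sketch below computes per-component statistics from logged values and formats them in the style shown above. The function name and logging structure are hypothetical, not taken from the released code.

from typing import Dict, List

def format_policy_feedback(metric_logs: Dict[str, List[float]]) -> str:
    """Summarize logged reward components and policy metrics as feedback text.
    Max/Mean/Min here are over the sampled values only; in a full pipeline they
    would be tracked over all training steps."""
    lines = []
    for name, values in metric_logs.items():
        sampled = ", ".join(f"'{v:.2f}'" for v in values)
        lines.append(
            f"{name}: [{sampled}], Max: {max(values):.2f}, "
            f"Mean: {sum(values) / len(values):.2f}, Min: {min(values):.2f}"
        )
    return "\n".join(lines)

# Example usage with two of the metrics listed above
feedback = format_policy_feedback({
    "reward_pick": [0.05, 0.06, 0.17, 0.40, 0.68, 0.72, 0.77, 0.79, 0.78, 0.79, 0.79],
    "task_score": [0.00, 0.00, 0.00, 0.02, 0.30, 0.53, 0.67, 0.77, 0.80, 0.87, 0.77],
})
print(feedback)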
                

2. Complex Task Reward Generation (Generator) and Component Analysis:

Task description : Push the three small, medium, and large sized gears on the table in sequence to their designated positions.

Environmental observation code :

obs_tensors = [self.fingertip_midpoint_pos,
               self.fingertip_midpoint_quat,
               self.fingertip_midpoint_linvel,
               self.fingertip_midpoint_angvel,
               self.gear_small_pos,
               self.gear_small_quat,
               self.gear_medium_pos,
               self.gear_medium_quat,
               self.gear_large_pos,
               self.gear_large_quat,
               self.small_location_pos,
               self.small_location_quat,
               self.medium_location_pos,
               self.medium_location_quat,
               self.large_location_pos,
               self.large_location_quat]
      

Generated reward_components :

1. Calculate the distance between the jaws and the small gear: distance_finger_to_small_gear
2. Calculate the distance between the small gear and the small gear’s target location: distance_small_gear_to_small_location
3. Calculate the distance between the jaws and the medium gear: distance_finger_to_medium_gear
4. Calculate the distance between the medium gear and the target location of the medium gear: distance_medium_gear_to_medium_location
5. Calculate the distance between the jaws and the large gear: distance_finger_to_large_gear
6. Calculate the distance between the large gear and the target location of the large gear: distance_large_gear_to_large_location
      

Generated reward function:

    ...
        temp_pick = torch.tensor(0.1)
        temp_carry = torch.tensor(0.2)
        temp_place = torch.tensor(0.3)

        # Calculate distances
        distance_finger_to_small_gear = torch.norm(gear_small_pos - fingertip_midpoint_pos, p=2, dim=-1)
        distance_small_gear_to_small_location = torch.norm(gear_small_pos - self.small_location_pos, p=2, dim=-1)
        distance_finger_to_medium_gear = torch.norm(gear_medium_pos - fingertip_midpoint_pos, p=2, dim=-1)
        distance_medium_gear_to_medium_location = torch.norm(gear_medium_pos - self.medium_location_pos, p=2, dim=-1)
        distance_finger_to_large_gear = torch.norm(gear_large_pos - fingertip_midpoint_pos, p=2, dim=-1)
        distance_large_gear_to_large_location = torch.norm(gear_large_pos - self.large_location_pos, p=2, dim=-1)

        # Define rewards for the small gear
        reward_pick_small = -distance_finger_to_small_gear
        reward_carry_small = distance_small_gear_to_small_location
        reward_place_small = -torch.abs(distance_small_gear_to_small_location - torch.tensor(0.0))  # Ideally, this distance is zero

        # Transform small gear rewards
        reward_pick_small_exp = torch.exp(reward_pick_small / temp_pick)
        reward_carry_small_exp = torch.exp(reward_carry_small / temp_carry)
        reward_place_small_exp = torch.exp(reward_place_small / temp_place)

        # Define rewards for the medium gear
        reward_pick_medium = -distance_finger_to_medium_gear
        reward_carry_medium = distance_medium_gear_to_medium_location
        reward_place_medium = -torch.abs(distance_medium_gear_to_medium_location - torch.tensor(0.0))  # Ideally, this distance is zero

        # Transform medium gear rewards
        reward_pick_medium_exp = torch.exp(reward_pick_medium / temp_pick)
        reward_carry_medium_exp = torch.exp(reward_carry_medium / temp_carry)
        reward_place_medium_exp = torch.exp(reward_place_medium / temp_place)

        # Define rewards for the large gear
        reward_pick_large = -distance_finger_to_large_gear
        reward_carry_large = distance_large_gear_to_large_location
        reward_place_large = -torch.abs(distance_large_gear_to_large_location - torch.tensor(0.0))  # Ideally, this distance is zero

        # Transform large gear rewards
        reward_pick_large_exp = torch.exp(reward_pick_large / temp_pick)
        reward_carry_large_exp = torch.exp(reward_carry_large / temp_carry)
        reward_place_large_exp = torch.exp(reward_place_large / temp_place)
    ...
 

Reward function generation failure : The generator may include attributes that are not present in the observation space when designing the reward function. In this case, the cause of the erroneous generation needs to be identified and fed back promptly, and a new reward function must be regenerated.

 self.rew_buf[:], self.rew_dict = compute_reward(self.fingertip_midpoint_pos, self.gear_small_pos, self.gear_small_target_pos, self.gear_medium_pos, self.gear_medium_target_pos, self.gear_large_pos, self.gear_large_target_pos)
AttributeError: 'FactoryTaskGearsPickPlaceGPT' object has no attribute 'gear_small_target_pos'
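One way to detect this failure mode and feed the error back to the generator is sketched below. The wrapper, its argument-name list, and the regenerate call are hypothetical placeholders for illustration, not the released implementation.

import traceback

def try_compute_reward(task_env, compute_reward_fn, arg_names):
    """Gather the generated reward function's arguments from the environment and
    execute it, capturing errors such as references to attributes that are not
    present in the observation space."""
    try:
        args = [getattr(task_env, name) for name in arg_names]  # raises AttributeError if an attribute is missing
        task_env.rew_buf[:], task_env.rew_dict = compute_reward_fn(*args)
        return None  # success: keep this reward function
    except AttributeError:
        return traceback.format_exc()

# Hypothetical feedback loop: on failure, the traceback is appended to the prompt
# and the generator is asked to produce a corrected reward function.
# error = try_compute_reward(env, compute_reward, ["fingertip_midpoint_pos", "gear_small_pos", "gear_small_target_pos"])
# if error is not None:
#     new_reward_code = generator.regenerate(prompt + "\nExecution error:\n" + error)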
                

3. Compare the reward function generated by the generator with the reward function generated by Eureka and the manually designed reward function:


Both automatic methods outperform the reward function designed by humans. Comparing the performance of Eureka and the generator, our method introduces reward_components that encode human-level experience, which is more reliable than Eureka's unguided reward generation, and the generated code is more readable. We therefore call our approach process-oriented reward function generation guided by reward_components.

Transfer Learning Experiment Settings

The experiments are composed of four basic skills and several complex skills drawn from robotic industrial assembly. The sequential complex skills are combinations of the basic skills.


Pick

Place

Screw

Insert


Table 1 lists the basic skills as well as the complex skills.
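One possible way to express how complex skills are composed from basic skills is sketched below. The dictionary is illustrative only; the task names follow those used in the experiments, but the exact composition in the released code may differ.

# Illustrative mapping from complex skills to ordered sequences of basic skills.
BASIC_SKILLS = ["Pick", "Place", "Screw", "Insert"]

COMPLEX_SKILLS = {
    "NutBolt_PickPlace": ["Pick", "Place"],
    "Round_PegHole_PickPlaceInsert": ["Pick", "Place", "Insert"],
    "Rectangular_PegHole_PickPlaceInsert": ["Pick", "Place", "Insert"],
}

for task, sequence in COMPLEX_SKILLS.items():
    assert all(skill in BASIC_SKILLS for skill in sequence)
    print(f"{task}: {' -> '.join(sequence)}")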

Basic Skill Experiments

In the videos below, we demonstrate each of the four basic skills operating on different models of mechanical parts, using the reward functions generated by the generator.


NutBolt_Pick
NutBolt_Place
Peghole_Insert
NutBolt_Screw

Complex Experiments

In the following videos, we show two complex skills composed of basic skills: the NutBolt task and the PegHole task. The robot operates parts of different types and sizes. In each task, ITS uses the reward function generated by the generator and transfers source-domain skills to the target task. The entire process, from receiving a new task command to the robot autonomously and quickly completing the task, was performed without human involvement.


NutBolt_PickPlace
Round_PegHole_PickPlaceInsert
Rectangular_PegHole_PickPlaceInsert

Comparative Experiment

In the following videos, we present the results of No_Transfer vs. ITN (ours) on two tasks, NutBolt_PickPlace and PegHole_PickPlaceInsert. Learning new skills with ITN is much more efficient.


NutBolt_PickPlace (NO_Transfer VS ITN)
PegHole_PickPlaceInsert (NO_Transfer VS ITN)

Attention allocation. This graph illustrates how the attention values that ITN assigns to different subskills change over a cycle in the NutBolt and PegHole tasks.


The figure shows how ITN allocates attention across source policies. The data, taken from a single interaction cycle after training had stabilized, reveal that Intelligent Attention allocates attention to the source skills differently at different stages of the target task. This distribution is consistent with human experience. ITN successfully adapted to various tasks by adjusting the attention values of the source skills, enabling robots to master new tasks quickly.
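For intuition, a minimal sketch of attention-weighted fusion over source-skill features, in the spirit of ITN's Intelligent Attention, is given below. The module structure, dimensions, and names are assumptions for illustration, not the released ITN architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Toy attention module: weight source-skill features by their relevance
    to the current target-task observation. Not the released ITN architecture."""

    def __init__(self, obs_dim: int, feat_dim: int, num_source_skills: int):
        super().__init__()
        self.query = nn.Linear(obs_dim, feat_dim)
        self.num_source_skills = num_source_skills

    def forward(self, obs: torch.Tensor, source_feats: torch.Tensor):
        # obs: (batch, obs_dim); source_feats: (batch, num_source_skills, feat_dim)
        q = self.query(obs).unsqueeze(1)                     # (batch, 1, feat_dim)
        scores = (q * source_feats).sum(-1)                  # (batch, num_source_skills)
        attn = F.softmax(scores, dim=-1)                     # attention over source skills
        fused = (attn.unsqueeze(-1) * source_feats).sum(1)   # (batch, feat_dim)
        return fused, attn

# Example: three source skills (e.g. Pick, Place, Insert) fused for a target observation
module = AttentionFusion(obs_dim=32, feat_dim=64, num_source_skills=3)
obs = torch.randn(4, 32)
source_feats = torch.randn(4, 3, 64)
fused, attn = module(obs, source_feats)
print(attn.shape)  # torch.Size([4, 3]): one attention value per source skill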