Building artificial intelligence embodied in the conversation: How did we teach a robot to understand, move and interact

News FetcherMarch 9, 2025

0 2 2 minutes read

Building artificial intelligence embodied in the conversation: How did we teach a robot to understand, move and interact

Imagine a robot question: “Hey, picked the red cup from the kitchen and brought it here.”

It looks simple? But for artificial intelligence, this includes language understanding, moving in space, identifying things, and providing notes in actual time.

This is exactly what you dealt with Alexa Simbot Award Challenge Where we built EMbdied Agent who can understand instructions, move through his environment, interact with organisms, and communicate again.

Here is the way we made it work using BERT, learning to reinforce, and multimedia learning. Let’s pass the various problems and how we dealt with each of them.

Understanding language with Bert

The natural language is chaotic and can become very complicated. We humans say Go to the refrigerator But it can also say Find the refrigerator and open it. The robot should be extracted from different foals.

To do this, we used BERT (two -way encryption representations) to convert text instructions into organized orders, so that it is easier for them to implement them in a successive way.

How to work

The user or the types of instructions speak.
Bert treats the text And the intention is extracted.
This artificial intelligence translates into implementable procedures Love Mobility _TO (refrigerator) or Choose (Red_cup).

Below is the essence Bart -based education analyst:

import torch
import torch.nn as nn
import torch.optim as optim
from transformers import BertTokenizer, BertModel

class InstructionEncoder(nn.Module):
    """
    Fine-tunes BERT on domain-specific instructions, outputs a command distribution.
    """
    def __init__(self, num_commands=10, dropout=0.1):
        super(InstructionEncoder, self).__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_commands)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        return logits

#Suppose we have some labeled data: (text -> command_id)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = InstructionEncoder(num_commands=12)
model.train()

instructions = ["Go to the fridge", "Pick up the red cup", "Turn left"]
labels = [2, 5, 1]  

input_encodings = tokenizer(instructions, padding=True, truncation=True, return_tensors="pt")
labels_tensor = torch.tensor(labels)

optimizer = optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

The main results and results

Achieve 92 % accuracy In setting user instructions to robot tasks.
I dealt with the differences in complex formulation Better than the rules -based NLP.
Adaptation to adaptation It improved the understanding of the terms of the environment (“refrigerator”, “meter”, “sofa”).
Strong for synonyms And the differences in the construction of the simple sentence (“GRAB”, “Pick”, “Take”).
Allow real time command analysis (<100 milliliters per query).

Once the robot understands where To go, it needs a way to get there. We used A* Searches For organized environments (such as maps) and Reinforce learning (RL) for dynamic spaces.

A* Look for a fixed path: pre -calculated ways in the organized spaces.
RL for dynamic movementLearn the robot from experience and error using rewards.

This is how we implemented in the A* Search application.

import heapq

def a_star(grid, start, goal):
    def heuristic(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    open_list = []
    heapq.heappush(open_list, (0, start))
    last = {}
    cost_so_far = {start: 0}

    while open_list:
        _, current = heapq.heappop(open_list)

        if current == goal:
            break

        for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:  #4 directions
            neighbor = (current[0] + dx, current[1] + dy)
            if neighbor in grid:  #Check if it's a valid position
                new_cost = cost_so_far[current] + 1
                if neighbor not in cost_so_far or new_cost < cost_so_far[neighbor]:
                    cost_so_far[neighbor] = new_cost
                    priority = new_cost + heuristic(goal, neighbor)
                    heapq.heappush(open_list, (priority, neighbor))
                    last[neighbor] = current

    return last

This is to implement how we use RL for dynamic movement.

import gym
import numpy as np
from stable_baselines3 import PPO

class RobotNavEnv(gym.Env):
    """
    A simplified environment mixing a partial grid with dynamic obstacles.
    Observations might include LiDAR scans or collision sensors.
    """
    def __init__(self):
        super(RobotNavEnv, self).__init__()
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(360,), dtype=np.float32) 
        self.action_space = gym.spaces.Discrete(3) 
        self.state = np.zeros((360,), dtype=np.float32)

    def reset(self):
        self.state = np.random.rand(360).astype(np.float32) 
        return self.state

    def step(self, action):
        #Reward function: negative if collision, positive if progress to goal
        reward = 0.0
        done = False
        if action == 2 and np.random.rand() < 0.1:
            reward = -5.0
            done = True
        else:
            reward = 1.0

        self.state = np.random.rand(360).astype(np.float32)
        return self.state, reward, done, {}

env = RobotNavEnv()
model = PPO("MlpPolicy", env, verbose=1).learn(total_timesteps=5000)

The main results and results

A* SEARCH worked well in its controlled environments.
RL navigation is adapted to obstacles in actual time.
The speed of mobility has improved by 40 % on standard algorithms

Learn about organisms and interaction

Once the destination is reached, the robot should see and interact with organisms. This requires seeing the computer to localize the objects.

We trained a Yolov8 A model for identifying things like cups, doors and devices.

import torch
from ultralytics import YOLO
import numpy as np

#load a base YOLOv8 model
model = YOLO("yolov8s.pt")

#embeddings
object_categories = {
    "cup": np.array([0.22, 0.88, 0.53]),
    "mug": np.array([0.21, 0.85, 0.50]),
    "bottle": np.array([0.75, 0.10, 0.35]),
}

def classify_object(label, embeddings=object_categories):
    """
    If YOLOv8 doesn't have the exact label, we map it to the closest known category 
    by embedding similarity. 
    """
    if label in embeddings:
        return label
    else:
        best_label = None
        best_sim = -1
        for cat, emb in embeddings.items():
            sim = np.random.rand() 
            if sim > best_sim:
                best_label, best_sim = cat, sim
        return best_label

results = model("kitchen_scene.jpg")
for r in results:
    for box, cls_id in zip(r.boxes.xyxy, r.boxes.cls):
        label = r.names[int(cls_id)]
        mapped_label = classify_object(label)

The main results and results

Real time detection in 30 frames per second.
97 % accuracy in determining common home objects.
Empowering natural reactions such as “Blue Book”

Close the episode – from artificial intelligence in the natural language

Now that a robot:

Understand instructions (Bert)
It moves to the destination (A / RL)
It finds and interacts with objects (Yolov8)

It needs to understand how to respond to the user. This feedback loop also helps user experience; To achieve this, we used a Get the text based on GPT For dynamic responses.

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model_gpt = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B").cuda()

def generate_feedback(task_status):
    """
    Composes a user-friendly message based on the robot's internal status or outcome.
    """
    prompt = (f"You are a helpful home robot. A user gave you a task. Current status: {task_status}.\n"
              f"Please provide a short, friendly response to the user:\n")
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model_gpt.generate(**inputs, max_length=60, do_sample=True, temperature=0.7)
    response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response_text.split("\n")[-1]  

print(generate_feedback("I have arrived at the kitchen. I see a red cup."))

The main results and results

The feedback from artificial intelligence improves user participation.
98 % of the test users found natural responses
The completion rate of the reinforced mission is 35 %

conclusion

Open advanced NLP synergy, strong path planning, real -time detection, and obstetric language new boundaries in cooperative robots. Our agents can explain accurate orders, mobility in dynamic environments, define objects with remarkable accuracy, and provide responses that feel natural.

In addition to carrying out the simple task, these robots are involved in real contact, illustration questions, explaining procedures and adapting to flying. It is a glimpse of the future where machines work more than service: They cooperate, learn and speak as real partners in our daily procedures.

Building artificial intelligence embodied in the conversation: How did we teach a robot to understand, move and interact

Understanding language with Bert

The main results and results

**Mobility with path planning (A* and learning reinforcement)**

How we trained the navigation system

The main results and results

Learn about organisms and interaction

The main results and results

Close the episode – from artificial intelligence in the natural language

The main results and results

conclusion

More reading on some technologies

News Fetcher

Leave a Reply Cancel reply

Understanding language with Bert

The main results and results

Mobility with path planning (A* and learning reinforcement)

How we trained the navigation system

The main results and results

Learn about organisms and interaction

The main results and results

Close the episode – from artificial intelligence in the natural language

The main results and results

conclusion

More reading on some technologies

News Fetcher

Related Articles

Crypto Pros Scoop Dogizen as Bitcoin, Litecoin returns to the right track

Bitcoin Directory wins the support of the revolutionary Neo Pepe strategies

The Czech Republic, and its quiet customers of Skoda cars, deceive an economic contraction that reveals in its crucial ally Germany.

Trump, the Distinguished Code – The insiders are planning the accident?

Leave a Reply Cancel reply

**Mobility with path planning (A* and learning reinforcement)**