robometric_frame.task_performance

Task Performance Metrics for robotics policies.

This module contains metrics for evaluating task execution performance, including:

  • Success Rate (SR)

  • Task Completion Rate (TCR)

  • Action Accuracy (MSE, AMSE, NAMSE)

class robometric_frame.task_performance.ActionAccuracy(normalize=False, action_variance=None, **kwargs)[source]

Compute Action Accuracy metrics (MSE, AMSE, NAMSE) for robotics policy evaluation.

This metric computes three related measures of action prediction accuracy:

  • MSE: Mean Squared Error per trajectory

  • AMSE: Average MSE across multiple trajectories

  • NAMSE: Normalized AMSE (scaled by action variance)

\[MSE = \frac{1}{T} \sum_{t=1}^{T} \|\mathbf{a}_t - \hat{\mathbf{a}}_t\|_2^2\]
\[AMSE = \frac{1}{K} \sum_{k=1}^{K} MSE_k\]
\[NAMSE = \frac{AMSE}{\sigma^2_{\text{action}}}\]

where \(\mathbf{a}_t\) is the ground truth action at timestep \(t\), \(\hat{\mathbf{a}}_t\) is the predicted action, \(T\) is the number of timesteps in a trajectory, \(K\) is the number of trajectories, and \(\sigma^2_{\text{action}}\) is the variance of ground truth actions.
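The three formulas above can be sketched directly in plain torch, independently of the ActionAccuracy class (the function names here are illustrative, not part of the library's API):

```python
import torch

def mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # MSE: mean over timesteps of the squared L2 norm of the action error
    return ((pred - target) ** 2).sum(dim=-1).mean()

def amse(preds: list, targets: list) -> torch.Tensor:
    # AMSE: average of the per-trajectory MSEs
    return torch.stack([mse(p, t) for p, t in zip(preds, targets)]).mean()

def namse(preds: list, targets: list) -> torch.Tensor:
    # NAMSE: AMSE scaled by the variance of all ground-truth actions
    all_actions = torch.cat(targets, dim=0)
    return amse(preds, targets) / all_actions.var()
```

Note that NAMSE is only meaningful when the ground-truth actions have non-zero variance.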

Parameters:
  • normalize (bool) – Whether to compute NAMSE. If True, action variance is computed from the data. If False, only MSE and AMSE are computed. Default: False.

  • action_variance (Optional[float]) – Pre-computed action variance for normalization. If provided, this value is used instead of computing from data. Default: None.

  • **kwargs (Any) – Additional keyword arguments passed to the base Metric class.

Example

>>> from robometric_frame import ActionAccuracy
>>> import torch
>>> metric = ActionAccuracy()
>>>
>>> # Single trajectory
>>> predictions = torch.randn(10, 4)  # 10 timesteps, 4-dim actions
>>> targets = torch.randn(10, 4)
>>> metric.update(predictions, targets)
>>> results = metric.compute()
>>> print(f"MSE: {results['mse']:.4f}, AMSE: {results['amse']:.4f}")
>>>
>>> # With normalization
>>> metric = ActionAccuracy(normalize=True)
>>> metric.update(predictions, targets)
>>> results = metric.compute()
>>> print(f"NAMSE: {results['namse']:.4f}")
Example (multiple trajectories):
>>> metric = ActionAccuracy()
>>> # Trajectory 1
>>> metric.update(torch.randn(10, 4), torch.randn(10, 4))
>>> # Trajectory 2
>>> metric.update(torch.randn(15, 4), torch.randn(15, 4))
>>> results = metric.compute()
>>> # AMSE is averaged across both trajectories
full_state_update: bool = False
total_mse: Tensor
total_trajectories: Tensor
total_squared_actions: Tensor
total_actions: Tensor
total_action_count: Tensor
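The running-sum state tensors above suggest how the action variance for NAMSE could be recovered without storing every action: from the sum of actions, the sum of squared actions, and the element count. This is a hypothetical sketch of that identity, not the library's actual implementation:

```python
import torch

def variance_from_sums(sum_a: torch.Tensor, sum_a2: torch.Tensor, n: int) -> torch.Tensor:
    # Population variance via the identity Var[a] = E[a^2] - E[a]^2,
    # computed from running sums (hypothetical state layout)
    mean = sum_a / n
    return sum_a2 / n - mean ** 2
```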
__init__(normalize=False, action_variance=None, **kwargs)[source]

Initialize the ActionAccuracy metric.

update(predictions, targets)[source]

Update metric state with predicted and target actions.

Parameters:
  • predictions (Tensor) – Predicted actions of shape (T, D) where T is the number of timesteps and D is the action dimension.

  • targets (Tensor) – Ground truth actions of shape (T, D).

Raises:

ValueError – If predictions and targets have different shapes or are empty.

Return type:

None

compute()[source]

Compute the final Action Accuracy metrics.

Returns:

Dictionary containing:

  • ‘mse’: Mean Squared Error of the last trajectory

  • ‘amse’: Average MSE across all trajectories

  • ‘namse’: Normalized AMSE (only if normalize=True)

Return type:

Dict[str, Tensor]

Raises:

RuntimeError – If no trajectories have been recorded.

class robometric_frame.task_performance.SuccessRate(threshold=None, ignore_index=None, **kwargs)[source]

Compute Success Rate for robotics policy task evaluation.

Success Rate is calculated as:

\[SR = \frac{N_{\text{success}}}{N_{\text{total}}}\]

where \(N_{\text{success}}\) is the number of successfully completed tasks and \(N_{\text{total}}\) is the total number of tasks attempted.

This metric supports both binary success indicators and continuous success scores with an optional threshold.

Parameters:
  • threshold (Optional[float]) – Threshold for binary classification when using continuous scores. If None, assumes binary inputs (0 or 1). Default: None.

  • ignore_index (Optional[int]) – Value to ignore in the success tensor. Default: None.

  • **kwargs (Any) – Additional keyword arguments passed to the base Metric class.

Example

>>> from robometric_frame import SuccessRate
>>> import torch
>>> metric = SuccessRate()
>>> # Binary success indicators
>>> success = torch.tensor([1, 1, 0, 1, 0, 0, 1])
>>> metric(success)
tensor(0.5714)
>>> # With continuous scores and threshold
>>> metric = SuccessRate(threshold=0.8)
>>> scores = torch.tensor([0.9, 0.7, 0.85, 0.6, 0.95])
>>> metric(scores)
tensor(0.6000)
Example (distributed):
>>> # In distributed training, metrics are automatically synced
>>> metric = SuccessRate()
>>> # On GPU 0
>>> success_gpu0 = torch.tensor([1, 1, 0])
>>> metric(success_gpu0)
>>> # On GPU 1
>>> success_gpu1 = torch.tensor([1, 0, 1])
>>> metric(success_gpu1)
>>> # Final result aggregates across all GPUs
>>> result = metric.compute()  # Returns aggregated success rate
full_state_update: bool = False
total_success: Tensor
total_tasks: Tensor
__init__(threshold=None, ignore_index=None, **kwargs)[source]

Initialize the SuccessRate metric.

update(success)[source]

Update metric state with new success indicators.

Parameters:

success (Tensor) – Tensor of shape (N,) containing binary success indicators (0 or 1) or continuous success scores if threshold is set. Values can be int, float, or bool.

Raises:

ValueError – If success tensor is empty or contains invalid values.

Return type:

None

compute()[source]

Compute the final Success Rate.

Return type:

Tensor

Returns:

Success rate as a scalar tensor in range [0, 1].

Raises:

RuntimeError – If no tasks have been recorded (total_tasks == 0).

class robometric_frame.task_performance.TaskCompletionRate(threshold=None, ignore_index=None, **kwargs)[source]

Compute Task Completion Rate for robotics policy task chain evaluation.

Task Completion Rate is calculated as:

\[TCR = \frac{N_{\text{completed tasks}}}{N_{\text{task chains}}}\]

where \(N_{\text{completed tasks}}\) is the number of successfully completed task chains and \(N_{\text{task chains}}\) is the total number of task chains attempted.

This metric evaluates multi-step task sequences, measuring completion across sequential steps. Empirically, success rates tend to drop sharply from one step to the next in such chains, making TCR a stricter measure than per-step success rate and highlighting the difficulty of long-horizon instruction following.
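As an illustration, per-chain completion can be derived from per-step success indicators before applying the TCR formula above. This sketch assumes a chain counts as completed only when every step succeeds; adapt the reduction to your own task definition:

```python
import torch

def chain_completion(step_success: torch.Tensor) -> torch.Tensor:
    # step_success: (num_chains, num_steps) binary tensor;
    # a chain is completed only if all of its steps succeed
    return step_success.bool().all(dim=1)

def task_completion_rate(completed: torch.Tensor) -> torch.Tensor:
    # TCR = N_completed / N_chains
    return completed.float().mean()
```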

Parameters:
  • threshold (Optional[float]) – Threshold for binary classification when using continuous scores. If None, assumes binary inputs (0 or 1). Default: None.

  • ignore_index (Optional[int]) – Value to ignore in the completion tensor. Default: None.

  • **kwargs (Any) – Additional keyword arguments passed to the base Metric class.

Example

>>> from robometric_frame import TaskCompletionRate
>>> import torch
>>> metric = TaskCompletionRate()
>>> # Binary completion indicators for task chains
>>> completion = torch.tensor([1, 0, 1, 1, 0])
>>> metric(completion)
tensor(0.6000)
>>> # With continuous scores and threshold
>>> metric = TaskCompletionRate(threshold=0.8)
>>> scores = torch.tensor([0.9, 0.7, 0.85, 0.95])
>>> metric(scores)
tensor(0.7500)
Example (multi-step evaluation):
>>> # Evaluate task chains over multiple batches
>>> metric = TaskCompletionRate()
>>> # First batch: 3 task chains, 2 completed
>>> batch1 = torch.tensor([1, 0, 1])
>>> metric.update(batch1)
>>> # Second batch: 2 task chains, 1 completed
>>> batch2 = torch.tensor([0, 1])
>>> metric.update(batch2)
>>> # Overall completion rate
>>> metric.compute()
tensor(0.6000)
full_state_update: bool = False
total_completed: Tensor
total_chains: Tensor
__init__(threshold=None, ignore_index=None, **kwargs)[source]

Initialize the TaskCompletionRate metric.

update(completion)[source]

Update metric state with new task chain completion indicators.

Parameters:

completion (Tensor) – Tensor of shape (N,) containing binary completion indicators (0 or 1) or continuous completion scores if threshold is set. Values can be int, float, or bool.

Raises:

ValueError – If completion tensor is empty or contains invalid values.

Return type:

None

compute()[source]

Compute the final Task Completion Rate.

Return type:

Tensor

Returns:

Task completion rate as a scalar tensor in range [0, 1].

Raises:

RuntimeError – If no task chains have been recorded (total_chains == 0).

Modules

action_accuracy

Action Accuracy metrics for robotics policy evaluation.

success_rate

Success Rate metric for robotics policy evaluation.

task_completion_rate

Task Completion Rate metric for robotics policy evaluation.