Self-CriTeach

LLM Self-Teaching and Self-Critiquing for
Improving Robotic Planning

ICLR 2026 Workshop on AI with Recursive Self-Improvement
Jinbang Huang¹, Zhiyuan Li¹,², Yuanzhao Hu¹,³, Zhanguang Zhang¹, Mark Coates⁴, Xingyue Quan¹, Yingxue Zhang¹
¹Huawei Noah's Ark Lab, ²University of Toronto, ³University of British Columbia, ⁴McGill University

Abstract

Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through automatic planning domain generation. However, planning domains are brittle under imperfect logical states and perception noise. Prior approaches largely treat generated planning domains as planning utilities, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges in reward engineering.

We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (i) enabling large-scale generation of robotic planning problem–plan pairs, and (ii) providing structured reward functions. First, the self-written domains enable large-scale generation of symbolic task plans, which are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the self-written domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and resistance to imperfect logical states.

Key Contributions

🧠

Self-Teaching Framework

LLMs generate chain-of-thought reasoning from symbolic plans, creating scalable training data without human annotation.

7,476 samples
📈

Performance Improvement

Supervised fine-tuning followed by preference optimization raises Pass@1 from 25.1% (Qwen3-4B-Instruct baseline) to 32.9% (SCT-4B).

32.9% Pass@1

Token Efficiency

Structured CoT supervision reduces token consumption while improving accuracy compared to baseline approaches.

35% reduction

Method Overview

1. Automatic PDDL Domain Generation

The base LLM generates PDDL planning domains from task demonstrations, with iterative refinement through validation-repair-pruning loops to ensure domain correctness and compactness.
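The control flow of such a validation-repair-pruning loop could look roughly like the sketch below. It is not the authors' implementation: `Domain`, `refine_domain`, `toy_validate`, `toy_repair`, and `toy_prune` are hypothetical names, and the toy stand-ins replace what would in practice be a PDDL validator and LLM calls. Here "validation" only checks that every predicate used by an action is declared, and "pruning" drops declared predicates that no action uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Domain:
    """Toy stand-in for a PDDL domain (hypothetical structure)."""
    predicates: Set[str]
    # action name -> (precondition predicates, effect predicates)
    actions: Dict[str, Tuple[Set[str], Set[str]]]


def used_predicates(domain: Domain) -> Set[str]:
    """Collect every predicate referenced by any action."""
    used: Set[str] = set()
    for pre, eff in domain.actions.values():
        used |= pre | eff
    return used


def toy_validate(domain: Domain) -> List[str]:
    """Return undeclared predicates; an empty list means the domain is valid."""
    return sorted(used_predicates(domain) - domain.predicates)


def toy_repair(domain: Domain, errors: List[str]) -> Domain:
    """Stand-in for an LLM repair step: declare the missing predicates."""
    return Domain(domain.predicates | set(errors), domain.actions)


def toy_prune(domain: Domain) -> Domain:
    """Keep only the predicates that some action actually uses."""
    return Domain(domain.predicates & used_predicates(domain), domain.actions)


def refine_domain(
    domain: Domain,
    validate: Callable[[Domain], List[str]],
    repair: Callable[[Domain, List[str]], Domain],
    prune: Callable[[Domain], Domain],
    max_rounds: int = 5,
) -> Domain:
    """Repair the domain until it validates, then prune it for compactness."""
    for _ in range(max_rounds):
        errors = validate(domain)
        if not errors:
            return prune(domain)
        domain = repair(domain, errors)
    raise RuntimeError("domain still invalid after max_rounds repair attempts")


draft = Domain(
    predicates={"on", "clear", "weather"},          # "holding" missing, "weather" unused
    actions={"stack": ({"holding", "clear"}, {"on"})},
)
final = refine_domain(draft, toy_validate, toy_repair, toy_prune)
print(sorted(final.predicates))  # ['clear', 'holding', 'on']
```

Bounding the number of repair rounds keeps a domain that never validates from looping forever; the limit of 5 is an arbitrary choice for the sketch.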

2. Symbolic-to-CoT Transformation

Symbolic plans are converted into chain-of-thought representations by eliciting natural language reasoning that includes:

3. Supervised Fine-Tuning

The model is trained on tuples ⟨Problem, Plan, CoT⟩ using standard language modeling loss, learning to generate both action sequences and explanatory reasoning.
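As an illustration of how one ⟨Problem, Plan, CoT⟩ tuple might become a language-modeling example, the sketch below builds token IDs and labels for a single sample. Several details are assumptions rather than facts from the paper: the prompt template, the choice to compute the loss only on the reasoning and plan tokens, the toy whitespace tokenizer, and the name `build_sft_example`. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_sft_example(
    problem: str, cot: str, plan: str, vocab: Dict[str, int]
) -> Dict[str, List[int]]:
    """Turn one (problem, CoT, plan) tuple into input IDs and loss labels."""

    def encode(text: str) -> List[int]:
        # Toy whitespace tokenizer; a real pipeline would use the model's tokenizer.
        return [vocab.setdefault(token, len(vocab)) for token in text.split()]

    prompt_ids = encode(f"Problem: {problem}")
    target_ids = encode(f"Reasoning: {cot}\nPlan: {plan}")

    return {
        "input_ids": prompt_ids + target_ids,
        # Assumption: prompt tokens are masked out, so only the reasoning
        # and the plan contribute to the loss.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + target_ids,
    }


vocab: Dict[str, int] = {}
example = build_sft_example(
    problem="stack a on b",
    cot="b is clear so pick a then place it",
    plan="pick(a) stack(a,b)",
    vocab=vocab,
)
print(len(example["input_ids"]))  # 18
print(example["labels"][:5])      # [-100, -100, -100, -100, -100]
```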

4. Reinforcement Learning with Structured Rewards

The self-generated planning domain provides fine-grained reward signals:

Pipeline: Domain Generation → Problem-Plan Pairs → CoT Generation → SFT → RL with Domain Rewards
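To make the idea of a planning domain acting as a reward function concrete, the sketch below replays a candidate plan against STRIPS-style action definitions and scores it. The specific reward terms are assumptions made for illustration (half credit for the fraction of legal steps, half for reaching the goal), as are the toy domain and the names `Action` and `plan_reward`; the page does not specify the actual reward design.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Set


class Action(NamedTuple):
    pre: FrozenSet[str]     # facts that must hold before the action
    add: FrozenSet[str]     # facts made true by the action
    delete: FrozenSet[str]  # facts made false by the action


def plan_reward(
    plan: List[str], init: Set[str], goal: Set[str], domain: Dict[str, Action]
) -> float:
    """Score a plan in [0, 1] by simulating it in the symbolic domain."""
    state = set(init)
    legal_steps = 0
    for name in plan:
        action = domain.get(name)
        if action is None or not action.pre <= state:
            break  # unknown action or unmet precondition ends the rollout
        state = (state - action.delete) | action.add
        legal_steps += 1
    progress = legal_steps / len(plan) if plan else 0.0
    solved = 1.0 if legal_steps == len(plan) and goal <= state else 0.0
    return 0.5 * progress + 0.5 * solved


domain = {
    "pick_a": Action(
        pre=frozenset({"clear_a", "hand_empty"}),
        add=frozenset({"holding_a"}),
        delete=frozenset({"hand_empty"}),
    ),
    "stack_a_b": Action(
        pre=frozenset({"holding_a", "clear_b"}),
        add=frozenset({"on_a_b", "hand_empty"}),
        delete=frozenset({"holding_a", "clear_b"}),
    ),
}
init = {"clear_a", "clear_b", "hand_empty"}
goal = {"on_a_b"}

print(plan_reward(["pick_a", "stack_a_b"], init, goal, domain))  # 1.0
print(plan_reward(["pick_a", "pick_a"], init, goal, domain))     # 0.25
print(plan_reward(["stack_a_b", "pick_a"], init, goal, domain))  # 0.0
```

Partial credit for legal prefixes is what makes such a signal denser than a binary success flag: a plan that fails on its second step still scores higher than one that fails immediately.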

Experimental Results

Main Results: Model Performance

| Model | Size | Pass@1 | Pass@4 | Avg Tokens | Improvement |
|---|---|---|---|---|---|
| SCT-4B (Qwen3-4B-P2E) | 4B | 32.9% | 50.0% | 21M | +31% |
| Qwen3-4B-Instruct (baseline) | 4B | 25.1% | 46.8% | 33M | - |
| SCT-8B (Llama-3.1-8B-P2E) | 8B | 27.7% | 42.7% | - | - |
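The Improvement column appears to report the relative Pass@1 gain of SCT-4B over the Qwen3-4B-Instruct baseline; that reading is an assumption, but the arithmetic below is consistent with it. The same calculation on the rounded token counts gives a reduction of about 36%, close to the 35% quoted earlier, with the small gap plausibly due to rounding of the 21M and 33M figures.

```python
def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`."""
    return (new - old) / old


# Values copied from the SCT-4B and Qwen3-4B-Instruct rows of the table.
pass1_gain = relative_change(32.9, 25.1)
pass4_gain = relative_change(50.0, 46.8)
token_change = relative_change(21, 33)  # token counts in millions

print(f"Pass@1: {pass1_gain:+.1%}")    # Pass@1: +31.1%
print(f"Pass@4: {pass4_gain:+.1%}")    # Pass@4: +6.8%
print(f"Tokens: {token_change:+.1%}")  # Tokens: -36.4%
```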

Per-Domain Performance (Qwen3-4B-P2E)

| Dataset | Samples | Extraction | Legal Plans | Description |
|---|---|---|---|---|
| align_data | 400 | 97.0% | 97.0% | Object alignment tasks |
| stack_data | 400 | 88.8% | 88.8% | Block stacking |
| unstack_data | 400 | 89.5% | 89.0% | Block unstacking |
| reorder_data | 400 | 89.5% | 89.2% | Object reordering |
| pack_machining_parts | 400 | 96.8% | 96.2% | Factory assembly (unseen) |
| prepare_experiment | 400 | 97.0% | 96.0% | Lab setup (unseen) |
| reorganize_room | 400 | 97.0% | 95.0% | Household tasks (unseen) |
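A few aggregates can be read off this table with a short script. The sketch assumes that the "(unseen)" tag marks domains held out from training and that a plain unweighted mean is a fair summary, which holds here because every domain has 400 samples; `mean_legal` is a hypothetical helper name.

```python
# (dataset, samples, extraction %, legal plans %, tagged "(unseen)" in the table?)
rows = [
    ("align_data", 400, 97.0, 97.0, False),
    ("stack_data", 400, 88.8, 88.8, False),
    ("unstack_data", 400, 89.5, 89.0, False),
    ("reorder_data", 400, 89.5, 89.2, False),
    ("pack_machining_parts", 400, 96.8, 96.2, True),
    ("prepare_experiment", 400, 97.0, 96.0, True),
    ("reorganize_room", 400, 97.0, 95.0, True),
]

total_samples = sum(samples for _, samples, _, _, _ in rows)


def mean_legal(unseen: bool) -> float:
    """Mean Legal Plans rate (in %) over seen or unseen domains."""
    values = [legal for _, _, _, legal, flag in rows if flag == unseen]
    return round(sum(values) / len(values), 1)


print(total_samples)      # 2800
print(mean_legal(False))  # 91.0
print(mean_legal(True))   # 95.7
```

The total of 2,800 matches the evaluation set size listed under Dataset.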

Key Findings

Code & Resources

💻

GitHub Repository

Complete implementation with training scripts, evaluation pipeline, and documentation.

🤗

HuggingFace Models

Pre-trained checkpoints: SCT-4B (Qwen3-4B) and SCT-8B (Llama-3.1-8B)

📊

Dataset

7,476 training samples and 2,800 evaluation samples with CoT reasoning.

🎮

Interactive Demo

Try the planner online with custom PDDL problems.


Quick Start

# Install dependencies (run this in a shell, not in Python):
#   pip install -r requirements.txt

# Load model
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("self-criteach/SCT-4B")
tokenizer = AutoTokenizer.from_pretrained("self-criteach/SCT-4B")

# Generate plan
problem = "Static predicates: [[box, 1], [box, 2], [table, 0], [robot, r]]..."
inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens; outputs[0] also contains the prompt
prompt_length = inputs["input_ids"].shape[1]
plan = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

Citation

If you find this work useful, please cite:

@inproceedings{huang2026selfcriteach,
  title={Self-CriTeach: LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning via Automated Domain Generation},
  author={Huang, Jinbang and Li, Zhiyuan and Hu, Yuanzhao and Zhang, Zhanguang and Coates, Mark and Quan, Xingyue and Zhang, Yingxue},
  booktitle={ICLR 2026 Workshop on AI with Recursive Self-Improvement},
  year={2026},
  organization={Huawei Noah's Ark Lab}
}

Team

Jinbang Huang

Huawei Noah's Ark Lab

Zhiyuan Li

Huawei Noah's Ark Lab
University of Toronto

Yuanzhao Hu

Huawei Noah's Ark Lab
University of British Columbia

Zhanguang Zhang

Huawei Noah's Ark Lab

Mark Coates

McGill University

Xingyue Quan

Huawei Noah's Ark Lab

Yingxue Zhang

Huawei Noah's Ark Lab