LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning
Large Language Models (LLMs) have recently shown strong promise for robotic task planning, particularly through automatic planning domain generation. However, planning domains are brittle under imperfect logical states and perception noise. Prior approaches largely treat generated planning domains as planning utilities, overlooking their potential as scalable sources of reasoning supervision and structured reward signals. At the same time, reasoning LLMs depend on chain-of-thought (CoT) supervision that is expensive to collect for robotic tasks, and reinforcement learning (RL) faces challenges in reward engineering.
We propose Self-CriTeach, an LLM self-teaching and self-critiquing framework in which an LLM autonomously generates symbolic planning domains that serve a dual role: (i) enabling large-scale generation of robotic planning problem–plan pairs, and (ii) providing structured reward functions. First, the self-written domains enable large-scale generation of symbolic task plans, which are automatically transformed into extended CoT trajectories for supervised fine-tuning. Second, the self-written domains are reused as structured reward functions, providing dense feedback for reinforcement learning without manual reward engineering. This unified training pipeline yields a planning-enhanced LLM with higher planning success rates, stronger cross-task generalization, reduced inference cost, and resistance to imperfect logical states.
LLMs generate chain-of-thought reasoning from symbolic plans, creating scalable training data without human annotation.
Supervised fine-tuning followed by preference optimization significantly improves planning accuracy.
Structured CoT supervision reduces token consumption while improving accuracy compared to baseline approaches.
The base LLM generates PDDL planning domains from task demonstrations, with iterative refinement through validation-repair-pruning loops to ensure domain correctness and compactness.
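
The validation-repair-pruning loop described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: `refine_domain`, the three callbacks, and the `max_rounds` budget are hypothetical names, and the toy callbacks only check parenthesis balance rather than running a real PDDL validator or an LLM repair call.

```python
from typing import Callable, List, Tuple

Validator = Callable[[str], List[str]]      # returns error messages; empty list means valid
Repairer = Callable[[str, List[str]], str]  # e.g. an LLM call that is shown the errors
Pruner = Callable[[str], str]               # drops redundant content from a valid domain


def refine_domain(
    domain: str,
    validate: Validator,
    repair: Repairer,
    prune: Pruner,
    max_rounds: int = 5,
) -> Tuple[str, bool]:
    """Repair a domain until it validates, then prune it (hypothetical sketch)."""
    for _ in range(max_rounds):
        errors = validate(domain)
        if not errors:
            pruned = prune(domain)
            # Keep the pruned domain only if pruning did not break it.
            return (pruned if not validate(pruned) else domain), True
        domain = repair(domain, errors)
    # Budget exhausted: report whether the last repair happened to succeed.
    return domain, not validate(domain)


# Toy stand-ins for a PDDL validator, an LLM repair step, and a pruner.
def toy_validate(domain: str) -> List[str]:
    balanced = domain.count("(") == domain.count(")")
    return [] if balanced else ["unbalanced parentheses"]


def toy_repair(domain: str, errors: List[str]) -> str:
    return domain + ")" * max(domain.count("(") - domain.count(")"), 0)


def toy_prune(domain: str) -> str:
    return " ".join(domain.split())  # here "pruning" just collapses whitespace


broken = "(define (domain blocks)   (:predicates (on ?a ?b))"
fixed, ok = refine_domain(broken, toy_validate, toy_repair, toy_prune)
print(ok, fixed)
```

In a real pipeline the validator would presumably be a PDDL parser plus a planner run on the demonstrations, and the round budget keeps a domain that never validates from looping forever.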
Symbolic plans are converted into chain-of-thought representations by eliciting natural language reasoning that includes:
The model is trained on tuples ⟨Problem, Plan, CoT⟩ using standard language modeling loss, learning to generate both action sequences and explanatory reasoning.
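
As a rough illustration of how a ⟨Problem, Plan, CoT⟩ tuple might be turned into a supervised fine-tuning example, the sketch below builds a prompt/completion pair and a loss mask that restricts the language modeling loss to the completion. The prompt template, the field names, and the choice to mask prompt tokens are assumptions made for illustration; the source does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlanningSample:
    problem: str     # serialized planning problem
    plan: List[str]  # symbolic plan, one action per entry
    cot: str         # natural-language reasoning for the plan


def to_sft_example(sample: PlanningSample) -> Dict[str, str]:
    """Format one tuple as prompt and completion text (hypothetical template)."""
    prompt = f"Problem:\n{sample.problem}\n\nProduce a plan.\n"
    completion = f"Reasoning:\n{sample.cot}\n\nPlan:\n" + "\n".join(sample.plan)
    return {"prompt": prompt, "completion": completion}


def loss_mask(num_prompt_tokens: int, num_completion_tokens: int) -> List[int]:
    """1 where the token contributes to the loss, 0 for prompt tokens."""
    return [0] * num_prompt_tokens + [1] * num_completion_tokens


sample = PlanningSample(
    problem="Stack block a on block b.",
    plan=["pick a", "place a b"],
    cot="Block a is clear, so it can be picked up and placed on b.",
)
example = to_sft_example(sample)
print(example["prompt"] + example["completion"])
```

Placing the reasoning before the plan means the model learns to emit its explanation first and the action sequence last, which makes the plan easy to extract from the end of the output.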
The self-generated planning domain provides fine-grained reward signals:
Pipeline: Domain Generation → Problem-Plan Pairs → CoT Generation → SFT → RL with Domain Rewards
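
To make the "RL with Domain Rewards" stage concrete, here is one plausible way a symbolic domain could score a generated plan: execute it step by step against STRIPS-style preconditions and effects, then combine the fraction of executable steps with the fraction of goal atoms reached. The equal weighting, the `GroundAction` structure, and the toy blocks domain are illustrative assumptions, not the reward used in the paper.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set


@dataclass(frozen=True)
class GroundAction:
    pre: FrozenSet[str]     # atoms that must hold before the action
    add: FrozenSet[str]     # atoms made true by the action
    delete: FrozenSet[str]  # atoms made false by the action


def plan_reward(
    init: Set[str],
    goal: Set[str],
    plan: List[str],
    actions: Dict[str, GroundAction],
) -> float:
    """Dense reward in [0, 1]: half for executable steps, half for goal atoms."""
    state = set(init)
    executed = 0
    for name in plan:
        act = actions.get(name)
        if act is None or not act.pre <= state:
            break  # unknown action or violated precondition: stop executing
        state = (state - act.delete) | act.add
        executed += 1
    step_score = executed / len(plan) if plan else 0.0
    goal_score = len(goal & state) / len(goal) if goal else 1.0
    return 0.5 * step_score + 0.5 * goal_score


# Toy blocks-world instance (hypothetical).
actions = {
    "unstack a b": GroundAction(
        pre=frozenset({"clear a", "on a b", "handempty"}),
        add=frozenset({"holding a", "clear b"}),
        delete=frozenset({"clear a", "on a b", "handempty"}),
    ),
    "putdown a": GroundAction(
        pre=frozenset({"holding a"}),
        add=frozenset({"ontable a", "clear a", "handempty"}),
        delete=frozenset({"holding a"}),
    ),
}
init = {"clear a", "on a b", "handempty"}
goal = {"ontable a", "clear b"}

full = plan_reward(init, goal, ["unstack a b", "putdown a"], actions)
illegal = plan_reward(init, goal, ["putdown a"], actions)
partial = plan_reward(init, goal, ["unstack a b"], actions)
print(full, illegal, partial)
```

Unlike a binary success signal, this kind of score separates a plan that fails on its first action from one that makes partial progress toward the goal.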
| Model | Size | Pass@1 | Pass@4 | Avg Tokens | Relative Pass@1 Improvement |
|---|---|---|---|---|---|
| SCT-4B (Qwen3-4B-P2E) | 4B | 32.9% | 50.0% | 21M | +31% |
| Qwen3-4B-Instruct (baseline) | 4B | 25.1% | 46.8% | 33M | - |
| SCT-8B (Llama-3.1-8B-P2E) | 8B | 27.7% | 42.7% | - | - |
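
The source does not state how Pass@1 and Pass@4 were computed. A common choice is the unbiased pass@k estimator, which draws n samples per problem, counts the c correct ones, and estimates the chance that at least one of k samples succeeds as 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 4 sampled plans, 1 of them legal and goal-reaching.
print(pass_at_k(n=4, c=1, k=1))  # 0.25
print(pass_at_k(n=4, c=1, k=4))  # 1.0
```

Benchmark-level numbers are then the mean of this estimate over all problems.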
| Dataset | Samples | Extraction | Legal Plans | Description |
|---|---|---|---|---|
| align_data | 400 | 97.0% | 97.0% | Object alignment tasks |
| stack_data | 400 | 88.8% | 88.8% | Block stacking |
| unstack_data | 400 | 89.5% | 89.0% | Block unstacking |
| reorder_data | 400 | 89.5% | 89.2% | Object reordering |
| pack_machining_parts | 400 | 96.8% | 96.2% | Factory assembly (unseen) |
| prepare_experiment | 400 | 97.0% | 96.0% | Lab setup (unseen) |
| reorganize_room | 400 | 97.0% | 95.0% | Household tasks (unseen) |
Complete implementation with training scripts, evaluation pipeline, and documentation.
Pre-trained checkpoints: SCT-4B (Qwen3-4B) and SCT-8B (Llama-3.1-8B)
# Install dependencies (shell)
pip install -r requirements.txt
# Load the model and tokenizer (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("self-criteach/SCT-4B")
tokenizer = AutoTokenizer.from_pretrained("self-criteach/SCT-4B")
# Generate a plan for a problem given as serialized predicates
problem = "Static predicates: [[box, 1], [box, 2], [table, 0], [robot, r]]..."
inputs = tokenizer(problem, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens so the prompt is not echoed in the plan
prompt_len = inputs["input_ids"].shape[1]
plan = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
If you find this work useful, please cite:
@inproceedings{huang2026selfcriteach,
title={Self-CriTeach: LLM Self-Teaching and Self-Critiquing for Improving Robotic Planning via Automated Domain Generation},
author={Huang, Jinbang and Li, Zhiyuan and Hu, Yuanzhao and Zhang, Zhanguang and Coates, Mark and Quan, Xingyue and Zhang, Yingxue},
booktitle={ICLR 2026 Workshop on AI with Recursive Self-Improvement},
year={2026},
organization={Huawei Noah's Ark Lab}
}
Huawei Noah's Ark Lab
University of Toronto
University of British Columbia
McGill University