PDDLLM

One Demo Is All It Takes:
Planning Domain Derivation with LLMs from a Single Demonstration

Jinbang Huang¹, Yixin Xiao¹, Zhanguang Zhang¹, Mark Coates², Jianye Hao¹, Yingxue Zhang¹
¹Huawei Noah's Ark Lab    ²McGill University

Abstract

Pre-trained large language models (LLMs) show promise for robotic task planning but often struggle to guarantee correctness in long-horizon problems. Task and motion planning (TAMP) addresses this by grounding symbolic plans in low-level execution, yet it relies heavily on manually engineered planning domains. To improve long-horizon planning reliability and reduce human intervention, we present Planning Domain Derivation with LLMs (PDDLLM), a framework that automatically induces symbolic predicates and actions directly from demonstration trajectories by combining LLM reasoning with physical simulation roll-outs.

Unlike prior domain-inference methods that rely on partially predefined predicates or natural-language descriptions of planning domains, PDDLLM constructs domains with minimal manual initialization and automatically integrates them with motion planners to produce executable plans. Across 1,200 tasks in nine environments, PDDLLM outperforms six LLM-based planning baselines, achieving ≥ 20% higher success rates, lower token cost, and successful deployment on multiple physical robot platforms.

Key Contributions

🧩

One-Demo Domain Generation

An algorithm that combines LLM reasoning with physical simulation roll-outs to automatically generate a human-interpretable PDDL planning domain from a single demonstration.

🔌

Logical Constraint Adapter (LoCA)

A systematic interface that automatically grounds the generated symbolic domain into motion-planner constraints — no manual alignment between PDDL actions and low-level skills.

🤖

Large-Scale + Real-Robot Eval

Extensive evaluation on 1,200 tasks across nine environments shows superior long-horizon planning and token efficiency. PDDLLM was also deployed on three physical robot platforms.

Method Overview

PDDLLM framework overview.
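To make "symbolic predicates and actions" concrete, here is a minimal sketch of how a single STRIPS-style action, the kind of unit a PDDL planning domain is built from, can be represented and applied to a symbolic state. The action `stack(a, b)` and every predicate name are hand-written, hypothetical illustrations; they are not output of PDDLLM, and the class and function names are assumptions made for this example.

```python
from dataclasses import dataclass


# Hypothetical, hand-written action; names are illustrative assumptions only.
@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def applicable(state: frozenset, action: Action) -> bool:
    """An action is applicable when all of its preconditions hold in the state."""
    return action.preconditions <= state


def apply(state: frozenset, action: Action) -> frozenset:
    """STRIPS-style update: drop the delete effects, then add the add effects."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable in this state")
    return (state - action.del_effects) | action.add_effects


stack_a_on_b = Action(
    name="stack(a, b)",
    preconditions=frozenset({"holding(a)", "clear(b)"}),
    add_effects=frozenset({"on(a, b)", "clear(a)", "hand_empty"}),
    del_effects=frozenset({"holding(a)", "clear(b)"}),
)

state = frozenset({"holding(a)", "clear(b)", "on_table(b)"})
next_state = apply(state, stack_a_on_b)
print(sorted(next_state))
```

A symbolic planner searches over sequences of such state updates. Grounding each symbolic step into motion-planner constraints is the role the page attributes to LoCA and is not modeled in this sketch.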

Experimental Results

We compare PDDLLM against six LLM-based planning baselines across nine task families, measuring planning success rate and token cost. All baselines use a 50-second time limit unless noted.

Table 1 — Planning success rate (%) across tasks

PDDLLM matches or closely approaches expert-designed domains on most tasks and exceeds every LLM baseline by more than 20 percentage points overall.

Task                 | Expert | LLMTAMP | LLMTAMP-FF | LLMTAMP-FR | RuleAsMem | PDDLLM
Stack                | 98.5   | 41.7    | 70.8       | 64.2       | 85.5      | 97.5
Unstack              | 100    | 89.4    | 94.6       | 92.1       | 88.4      | 97.7
Color Classification | 100    | 18.1    | 36.4       | 49.0       | 88.7      | 100
Alignment            | 100    | 31.1    | 52.0       | 40.0       | 96.0      | 100
Parts Assembly       | 98.9   | 33.3    | 53.9       | 41.3       | 95.0      | 100
Rearrange            | 73.3   | 5.6     | 17.4       | 11.8       | 1.1       | 64.3
Burger Cooking       | 100    | 27.8    | 50.0       | 48.6       | 27.8      | 91.7
Bridge Building      | 100    | 43.3    | 53.3       | 51.7       | 20.0      | 87.2
Tower of Hanoi       | 100    | 14.3    | 14.3       | 14.3       | 14.3      | 100
Overall              | 95.7   | 35.7    | 52.5       | 48.6       | 69.9      | 93.3

Takeaway: PDDLLM reaches an overall 93.3% success rate, within 2.4 percentage points of expert-designed domains (95.7%). That is +23.4 points over the strongest LLM-based baseline overall (RuleAsMem at 69.9%) and +40.8 points over LLMTAMP-FF (52.5%).
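The margins behind this takeaway can be recomputed from the Overall row of Table 1. The short script below is only an arithmetic check on the reported numbers; the variable names are illustrative.

```python
# Overall success rates (%) copied from the "Overall" row of Table 1.
overall = {
    "Expert": 95.7,
    "LLMTAMP": 35.7,
    "LLMTAMP-FF": 52.5,
    "LLMTAMP-FR": 48.6,
    "RuleAsMem": 69.9,
    "PDDLLM": 93.3,
}

# LLM-based baselines are every column except the expert domain and PDDLLM.
baselines = {k: v for k, v in overall.items() if k not in ("Expert", "PDDLLM")}

# Absolute margin of PDDLLM over each baseline, in percentage points.
margins = {name: round(overall["PDDLLM"] - rate, 1) for name, rate in baselines.items()}
strongest = max(baselines, key=baselines.get)
gap_to_expert = round(overall["Expert"] - overall["PDDLLM"], 1)

for name, margin in margins.items():
    print(f"{name}: +{margin} points")
print(f"strongest baseline: {strongest}; gap to expert: {gap_to_expert} points")
```

The smallest margin is the one against the baseline with the highest overall success rate, which is the figure to compare against the "at least 20% higher" claim in the abstract.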

Table 2 — vs. reasoning LLMs (success rate + token cost)

Comparison on the three hardest tasks at an extended 500-second time limit. Tokens reported in thousands (k).

Task            | Success Rate (%) ↑                   | Token Cost (k) ↓
                | PDDLLM | LLMTAMP | o1-TAMP | R1-TAMP | PDDLLM | LLMTAMP | o1-TAMP | R1-TAMP
Rearrangement   | 73.8   | 5.6     | 70.8    | 40.0    | 334    | 212     | 1200    | 1460
Tower of Hanoi  | 100.0  | 14.3    | 33.3    | 14.3    | 535    | 36      | 529     | 353
Bridge Building | 87.2   | 44.3    | 51.7    | 40.0    | 375    | 50      | 270     | 363
Overall         | 80.5   | 13.9    | 61.5    | 35.9    | 415    | 99      | 666     | 725

Takeaway: PDDLLM beats even the o1- and DeepSeek-R1-based TAMP variants on the hardest tasks while using fewer tokens overall than the reasoning models (415k vs. 666k and 725k). Tokens are spent only during one-shot domain derivation; planning at deployment time is handled by a PDDL solver at zero token cost, which suits long-term deployment.
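As a sanity check on the token columns of Table 2, the sketch below assumes the reading in which the per-task costs average (unweighted) to the Overall row, and then illustrates why a one-off derivation cost matters for repeated use. The reuse count of 100 problems is a purely hypothetical number, and treating the reported cost as a one-time derivation cost follows the takeaway above rather than any additional data.

```python
# Token costs in thousands, read from Table 2 in the order
# Rearrangement, Tower of Hanoi, Bridge Building.
token_cost_k = {
    "PDDLLM": [334, 535, 375],
    "LLMTAMP": [212, 36, 50],
    "o1-TAMP": [1200, 529, 270],
    "R1-TAMP": [1460, 353, 363],
}

# "Overall" row of Table 2, used here only as a consistency check.
reported_overall_k = {"PDDLLM": 415, "LLMTAMP": 99, "o1-TAMP": 666, "R1-TAMP": 725}

# Unweighted mean over the three tasks, rounded to the nearest thousand tokens.
mean_cost_k = {m: round(sum(c) / len(c)) for m, c in token_cost_k.items()}
assert mean_cost_k == reported_overall_k


def amortized_cost_k(derivation_cost_k: float, n_problems: int) -> float:
    """Per-problem token cost when one derived domain is reused n_problems times."""
    if n_problems < 1:
        raise ValueError("n_problems must be at least 1")
    return derivation_cost_k / n_problems


# Hypothetical reuse count: 100 is an illustrative number, not a reported one.
print(amortized_cost_k(token_cost_k["PDDLLM"][0], 100))
```

Under that assumption, the per-problem cost of a derived domain shrinks as the domain is reused, whereas a planner that queries an LLM for every problem pays again each time.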

Table 3 — Domain quality (predicate-level errors vs. expert)

Percentage of missing or redundant predicates relative to expert-designed domains, averaged over three runs.

Task            | Missing | Redundant
Stack           | 4.2%    | 8.3%
Burger Cooking  | 22.2%   | 3.7%
Bridge Building | 22.2%   | 3.7%
Tower of Hanoi  | 0.0%    | 14.3%

Takeaway: Even on the most complex tasks (bridge, burger), PDDLLM misses only ~22% of predicates relative to a human expert — yet still attains 87–92% success, indicating the surviving logical structure is sound.
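One plausible way to compute the two percentages in Table 3 is with set differences between the expert and derived predicate sets. The sketch below rests on two assumptions that the page does not confirm: both percentages are normalized by the size of the expert set, and predicates are matched by exact name (a real comparison would likely need semantic matching). The predicate names are hypothetical.

```python
def predicate_errors(expert: set, derived: set) -> tuple:
    """Return (missing %, redundant %), both relative to the expert set size.

    Assumption: normalizing by the expert set is one possible convention,
    not necessarily the one used to produce Table 3.
    """
    if not expert:
        raise ValueError("expert predicate set must not be empty")
    missing = expert - derived      # expert predicates the derived domain lacks
    redundant = derived - expert    # derived predicates with no expert counterpart
    scale = 100.0 / len(expert)
    return (len(missing) * scale, len(redundant) * scale)


# Hypothetical predicate names, chosen only to exercise the function.
expert_predicates = {"on", "clear", "holding", "hand_empty"}
derived_predicates = {"on", "clear", "holding", "above"}

missing_pct, redundant_pct = predicate_errors(expert_predicates, derived_predicates)
print(f"missing: {missing_pct:.1f}%, redundant: {redundant_pct:.1f}%")
```

With these toy sets, one of four expert predicates is missing and one extra predicate is present, so both values come out at 25%.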

Real-Robot Deployment

PDDLLM was deployed on three physical platforms (a Franka Panda arm, an Agilex Piper arm, and a UR5e arm) across stacking, burger-cooking, bridge-building, and Tower-of-Hanoi tasks. Videos play at 1× speed unless labeled otherwise.

Agilex Piper · Cube Stacking. 8 cubes stacked from a single demonstration.
Agilex Piper · Making Burgers. Long-horizon assembly with ordered ingredients.
Franka Panda · Bridge Building. Coordinated align + stack skills composed from two demos.
Franka Panda · Tower of Hanoi (16× speed). Recursive long-horizon planning end to end.

Citation

If you find this work useful, please cite:

@article{huang2025pddllm,
  title         = {One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration},
  author        = {Huang, Jinbang and Xiao, Yixin and Zhang, Zhanguang and Coates, Mark and Hao, Jianye and Zhang, Yingxue},
  journal       = {arXiv preprint arXiv:2505.18382},
  year          = {2025},
  url           = {https://arxiv.org/abs/2505.18382}
}