PDDLLM: Planning Domain Derivation with LLMs from a Single Demonstration

Abstract

Pre-trained large language models (LLMs) show promise for robotic task planning but often struggle to guarantee correctness in long-horizon problems. Task and motion planning (TAMP) addresses this by grounding symbolic plans in low-level execution, yet it relies heavily on manually engineered planning domains. To improve long-horizon planning reliability and reduce human intervention, we present Planning Domain Derivation with LLMs (PDDLLM), a framework that automatically induces symbolic predicates and actions directly from demonstration trajectories by combining LLM reasoning with physical simulation roll-outs.

Unlike prior domain-inference methods that rely on partially predefined predicates or natural-language descriptions of planning domains, PDDLLM constructs domains with minimal manual initialization and automatically integrates them with motion planners to produce executable plans. Across 1,200 tasks in nine environments, PDDLLM outperforms six LLM-based planning baselines, achieving ≥ 20% higher success rates at lower token cost, and it has been deployed successfully on multiple physical robot platforms.

Key Contributions

🧩

One-Demo Domain Generation

An algorithm that combines LLM reasoning with physical simulation roll-outs to automatically generate a human-interpretable PDDL planning domain from a single demonstration.

🔌

Logical Constraint Adapter (LoCA)

A systematic interface that automatically grounds the generated symbolic domain into motion-planner constraints — no manual alignment between PDDL actions and low-level skills.

🤖

Large-Scale + Real-Robot Eval

Extensive evaluation on 1,200 tasks across 9 environments shows superior long-horizon planning and token efficiency. Successfully deployed on three physical robot platforms.

Method Overview

Demonstration In. A single manipulation trajectory plus a short natural-language task description.
Predicate Imagination. Thousands of parallel physics simulations probe object relations; the LLM abstracts the rollouts into first-order, then higher-order predicates.
Action Invention. The LLM summarizes logical state-transition patterns from the demonstration, grounded in the imagined predicates, into PDDL actions with preconditions and effects.
LoCA + Plan. The Logical Constraint Adapter compiles the predicate and action libraries into a planning domain and automatically interfaces it with a motion planner to solve new tasks.
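
The symbolic side of the steps above can be illustrated with a minimal sketch. It is not the PDDLLM implementation: the `is_on` geometric test, its tolerances, the pose format, and the `stack` action are all hypothetical names chosen for illustration. It shows how a relation observed in simulation could be abstracted into a predicate, and how an action with preconditions and effects transforms a symbolic state in the usual STRIPS style.

```python
from dataclasses import dataclass

# A ground atom such as ("on", "a", "b"); a symbolic state is a set of atoms.
Atom = tuple[str, ...]


@dataclass(frozen=True)
class Action:
    """STRIPS-style action: preconditions plus add and delete effects."""
    name: str
    preconditions: frozenset[Atom]
    add_effects: frozenset[Atom]
    del_effects: frozenset[Atom]

    def applicable(self, state: frozenset[Atom]) -> bool:
        # Every precondition must hold in the current state.
        return self.preconditions <= state

    def apply(self, state: frozenset[Atom]) -> frozenset[Atom]:
        if not self.applicable(state):
            raise ValueError(f"{self.name} is not applicable")
        return (state - self.del_effects) | self.add_effects


def is_on(top: dict, bottom: dict, xy_tol: float = 0.02, z_tol: float = 0.01) -> bool:
    """Hypothetical geometric test for on(top, bottom).

    Each object is a dict with center coordinates x, y, z and height h (meters).
    """
    aligned = abs(top["x"] - bottom["x"]) < xy_tol and abs(top["y"] - bottom["y"]) < xy_tol
    # The bottom face of `top` should touch the top face of `bottom`.
    gap = (top["z"] - top["h"] / 2) - (bottom["z"] + bottom["h"] / 2)
    return aligned and abs(gap) < z_tol


def abstract_on_atoms(poses: dict[str, dict]) -> frozenset[Atom]:
    """Turn continuous poses into the set of on(a, b) atoms that hold."""
    return frozenset(
        ("on", a, b)
        for a in poses for b in poses
        if a != b and is_on(poses[a], poses[b])
    )


# Illustrative ground action stack(a, b); the names are assumptions.
stack_a_b = Action(
    name="stack(a, b)",
    preconditions=frozenset({("holding", "a"), ("clear", "b")}),
    add_effects=frozenset({("on", "a", "b"), ("clear", "a"), ("handempty",)}),
    del_effects=frozenset({("holding", "a"), ("clear", "b")}),
)

before = frozenset({("holding", "a"), ("clear", "b")})
after = stack_a_b.apply(before)

# Poses that would be observed after executing the action in simulation.
poses_after = {
    "a": {"x": 0.0, "y": 0.0, "z": 0.075, "h": 0.05},
    "b": {"x": 0.0, "y": 0.0, "z": 0.025, "h": 0.05},
}
observed = abstract_on_atoms(poses_after)
```

In this toy case the atom `("on", "a", "b")` appears both in the symbolic successor state `after` and in the atoms abstracted from the poses, which is the kind of agreement between symbolic effects and physical outcomes that a derived domain needs.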

Experimental Results

We compare PDDLLM against six LLM-based planning baselines across nine task families, measuring planning success rate and token cost. All baselines use a 50-second time limit unless noted.

Table 1 — Planning success rate (%) across tasks

PDDLLM matches or closely trails expert-designed domains on most tasks and outperforms every LLM baseline in this table by more than 23 percentage points overall.

Task	Expert	LLMTAMP	LLMTAMP-FF	LLMTAMP-FR	RuleAsMem	PDDLLM
Stack	98.5	41.7	70.8	64.2	85.5	97.5
Unstack	100	89.4	94.6	92.1	88.4	97.7
Color Classification	100	18.1	36.4	49.0	88.7	100
Alignment	100	31.1	52.0	40.0	96.0	100
Parts Assembly	98.9	33.3	53.9	41.3	95.0	100
Rearrangement	73.3	5.6	17.4	11.8	1.1	64.3
Burger Cooking	100	27.8	50.0	48.6	27.8	91.7
Bridge Building	100	43.3	53.3	51.7	20.0	87.2
Tower of Hanoi	100	14.3	14.3	14.3	14.3	100
Overall	95.7	35.7	52.5	48.6	69.9	93.3

Takeaway: PDDLLM reaches an overall 93.3% success rate, 2.4 percentage points below expert-designed domains (95.7%) and 23.4 percentage points above the strongest LLM-based baseline in Table 1 (RuleAsMem at 69.9%).
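
As a quick sanity check on the Overall row of Table 1, the short sketch below recomputes which baseline is strongest and how far PDDLLM sits from it and from the expert domains. The numbers are copied from the table; the variable names are illustrative only.

```python
# Overall success rates (%) copied from the last row of Table 1.
overall = {
    "Expert": 95.7,
    "LLMTAMP": 35.7,
    "LLMTAMP-FF": 52.5,
    "LLMTAMP-FR": 48.6,
    "RuleAsMem": 69.9,
    "PDDLLM": 93.3,
}

# LLM-based baselines exclude the expert domain and PDDLLM itself.
baselines = {k: v for k, v in overall.items() if k not in ("Expert", "PDDLLM")}

strongest = max(baselines, key=baselines.get)
margin = round(overall["PDDLLM"] - baselines[strongest], 1)
gap_to_expert = round(overall["Expert"] - overall["PDDLLM"], 1)

print(f"strongest baseline: {strongest} ({baselines[strongest]}%)")
print(f"PDDLLM margin over it: {margin} points")
print(f"PDDLLM gap to expert: {gap_to_expert} points")
```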

Table 2 — vs. reasoning LLMs (success rate + token cost)

Comparison on the three hardest tasks at an extended 500-second time limit. Tokens reported in thousands (k).

Task	PDDLLM success (%) ↑	LLMTAMP success (%) ↑	o1-TAMP success (%) ↑	R1-TAMP success (%) ↑	PDDLLM tokens (k) ↓	LLMTAMP tokens (k) ↓	o1-TAMP tokens (k) ↓	R1-TAMP tokens (k) ↓
Rearrangement	73.8	5.6	70.8	40.0	334	212	1200	1460
Tower of Hanoi	100.0	14.3	33.3	14.3	535	36	529	353
Bridge Building	87.2	44.3	51.7	40.0	375	50	270	363
Overall	80.5	13.9	61.5	35.9	415	99	666	725

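Takeaway: On the hardest tasks, PDDLLM achieves a higher overall success rate than the o1- and DeepSeek-R1-based TAMP variants (80.5% vs. 61.5% and 35.9%) while using fewer tokens on average (415k vs. 666k and 725k). Only the non-reasoning LLMTAMP baseline is cheaper (99k), at a 13.9% success rate. Tokens are spent only during one-shot domain derivation; execution is handled by a PDDL solver at zero token cost, which suits long-term deployment.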
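
The amortization argument can be made concrete with a rough break-even calculation. This sketch rests on an assumption the page does not state explicitly: that each baseline's token figure is paid again for every new task, while PDDLLM's figure is a one-time domain-derivation cost. Under that reading, the function below (a hypothetical helper) returns the number of tasks after which cumulative baseline spending reaches PDDLLM's one-time cost.

```python
import math

# Average token costs in thousands, from the Overall row of Table 2.
avg_tokens_k = {"PDDLLM": 415, "LLMTAMP": 99, "o1-TAMP": 666, "R1-TAMP": 725}


def break_even_tasks(one_time_k: float, per_task_k: float) -> int:
    """Smallest task count N such that N * per_task_k >= one_time_k.

    Assumes (hypothetically) that the baseline pays per_task_k on every task
    and that the one-time cost is paid exactly once.
    """
    return math.ceil(one_time_k / per_task_k)


break_even = {
    name: break_even_tasks(avg_tokens_k["PDDLLM"], cost)
    for name, cost in avg_tokens_k.items()
    if name != "PDDLLM"
}
print(break_even)
```

Under these assumptions, the reasoning-model baselines already cost more on the first task, and the cheaper LLMTAMP baseline overtakes PDDLLM's one-time cost after a handful of tasks.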

Table 3 — Domain quality (predicate-level errors vs. expert)

Percentage of missing or redundant predicates relative to expert-designed domains, averaged over three runs.

Task	Missing	Redundant
Stack	4.2%	8.3%
Burger Cooking	22.2%	3.7%
Bridge Building	22.2%	3.7%
Tower of Hanoi	0.0%	14.3%

Takeaway: Even on the most complex tasks (Bridge Building and Burger Cooking), PDDLLM misses about 22% of the predicates in the expert-designed domain, yet it still attains 87.2% and 91.7% success respectively (Table 1), indicating that the surviving logical structure is sound.
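
One plausible way to read the Table 3 metric is as set differences between the expert and derived predicate sets. The sketch below assumes that both percentages are normalized by the number of expert predicates and that predicates can be matched by name; neither detail is stated on this page, and the predicate names are toy examples rather than the actual domains.

```python
def predicate_errors(expert: set[str], derived: set[str]) -> tuple[float, float]:
    """Return (missing %, redundant %) relative to the expert predicate count.

    Assumption: predicates are matched by exact name, and both percentages
    use the size of the expert set as the denominator.
    """
    if not expert:
        raise ValueError("expert predicate set must be non-empty")
    missing = expert - derived      # in the expert domain, absent from the derived one
    redundant = derived - expert    # in the derived domain, absent from the expert one
    n = len(expert)
    return 100 * len(missing) / n, 100 * len(redundant) / n


# Toy example with hypothetical predicate names.
expert_preds = {"on", "clear", "holding", "handempty"}
derived_preds = {"on", "clear", "holding", "above"}

missing_pct, redundant_pct = predicate_errors(expert_preds, derived_preds)
print(f"missing: {missing_pct:.1f}%, redundant: {redundant_pct:.1f}%")
```

Averaging these two percentages over several derivation runs would give figures of the same form as those reported in the table.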

Real-Robot Deployment

PDDLLM was deployed on three physical platforms (a Franka Panda arm, an Agilex Piper arm, and a UR5e arm) across stacking, burger-cooking, bridge-building, and Tower-of-Hanoi tasks. Videos play at 1× speed unless labeled otherwise.

Agilex Piper · Cube Stacking. 8 cubes stacked from a single demonstration.

Agilex Piper · Making Burgers. Long-horizon assembly with ordered ingredients.

Franka Panda · Bridge Building. Coordinated align + stack skills composed from two demos.

Franka Panda · Tower of Hanoi (16× speed). Recursive long-horizon planning end to end.

Citation

If you find this work useful, please cite:

@article{huang2025pddllm,
  title         = {One Demo Is All It Takes: Planning Domain Derivation with LLMs from A Single Demonstration},
  author        = {Huang, Jinbang and Xiao, Yixin and Zhang, Zhanguang and Coates, Mark and Hao, Jianye and Zhang, Yingxue},
  journal       = {arXiv preprint arXiv:2505.18382},
  year          = {2025},
  url           = {https://arxiv.org/abs/2505.18382}
}