# Working Paper

# A Layered Architectural Framework for AI Agent Systems

**AI Agent Workflow Design Stack (AAWDS) — A Technical Analysis**

*v0.2, for review by domain experts. Revision history at the end of the document.*

---

## Abstract

The AI agent literature has produced approximately forty named "patterns" with substantial duplication, layer confusion, and category inflation. Patterns at incompatible levels of abstraction — control-flow primitives (ReAct), workflow compositions (orchestrator-workers), capability augmentations (RAG, CodeAct), governance overlays (guardrails), and domain applications (digital twin agents) — are commonly listed as peer entries in the same taxonomy. The result is a vocabulary that cannot support unambiguous architectural specification.

This paper proposes a five-layer execution stack inside a deployment envelope as a structurally correct alternative. The framework specifies an agent system through a tuple ⟨E, L1, L2, L3, L4, L5⟩ — a deployment envelope (with two facets: trigger and adaptation) plus five execution layers — with explicit conditional dependencies, enabling unambiguous specification of any architecture and predicting failure modes by layer. Existing patterns are not rejected; they are relocated. The contribution is structural and analytical rather than empirical: validation proceeds by mapping 37 widely cited patterns into the framework and demonstrating that each resolves to a unique coordinate, a synonym of an existing coordinate, or a composition of coordinates at multiple layers. A subsequent adversarial test (§5.4) attempts to break the framework with patterns deliberately chosen to stress its assumptions; the framework absorbs most but not all such patterns cleanly, and the failure cases are reported.

The framework's limits are stated explicitly. It does not represent inter-system composition (agent-to-agent protocols across trust boundaries), dynamic execution traces (within-run layer traversal), self-modifying systems where the action space is itself a target of the agent's actions, or fine-grained governance interception. Each is a known gap, not a deferred refinement.

---

## 1. Motivation and problem statement

### 1.1 The state of agent-pattern literature

Catalogs of agent design patterns have proliferated in industry and academic sources since 2023. Common reference taxonomies include Anthropic's six-pattern compositional framework (Schluntz & Zhang, 2024), Andrew Ng's four agentic design patterns, and various derivative lists in technical blogs and framework documentation. These catalogs share three structural defects.

**First defect: layer conflation.** Named patterns are listed as peer entries despite operating at different levels of abstraction. ReAct (a single-agent reasoning loop) is listed alongside orchestrator-workers (a multi-component compositional pattern), which is in turn listed alongside RAG (a capability augmentation) and guardrails (a governance overlay). These are not alternatives to one another. Most production systems combine examples from each category simultaneously.

**Second defect: synonym multiplication.** The same architectural pattern appears under multiple names. Critic-Generator and Evaluator-Optimizer describe the same configuration. Meta-Agent and Supervisor-Worker describe the same configuration. Plan-Act-Reflect is a composition of Plan-and-Execute and Reflexion, not a primitive. Existing taxonomies rarely test for equivalence before adding entries.

**Third defect: missing structural dimensions.** Trigger model (synchronous, event-driven, batch, continuous) and adaptation horizon (stateless, memory-augmented, self-improving) are absent from most pattern lists despite determining production behavior more strongly than the choice between, e.g., ReAct and Plan-and-Execute. Where these dimensions appear, they are smuggled in as workflow types or memory features rather than acknowledged as base architectural choices.

### 1.2 The functional consequence

A taxonomy that cannot distinguish primitives from compositions, peers from layered choices, or architectural decisions from operational concerns cannot support architectural reasoning. Engineers using such taxonomies report a recurrent failure mode: they choose a "pattern" (typically ReAct or orchestrator-workers) and then discover the choice underdetermines the system, requiring further decisions the taxonomy did not surface. The result is implicit architecture — decisions made by default during implementation rather than deliberately during design.

### 1.3 Contribution

This paper specifies a layered framework that:

1. Identifies the architectural decisions that fully specify an agent system, structured as a tuple ⟨E, L1, L2, L3, L4, L5⟩ — a deployment envelope (decomposing into trigger and adaptation horizon) plus five execution layers.
2. Makes the dependency graph between decisions explicit (some decisions are conditional on others).
3. Treats governance as boundary interposition between layers, not as a free-floating overlay.
4. Locates trigger model and adaptation horizon outside the per-execution stack, in a deployment envelope.
5. Provides a compression test: every named pattern in current literature must map to a unique coordinate, a synonym, or a composition.
6. Submits to an adversarial stress test: patterns deliberately chosen to exceed the framework's assumptions are mapped, and failures are reported rather than concealed.

The framework is design-operational. Its goal is to support architectural reasoning before implementation, not to catalog every term in current usage.

---

## 2. Prior taxonomic work

### 2.1 Anthropic's compositional patterns

Schluntz and Zhang (2024) propose an *augmented LLM* as foundation, with six workflow patterns built on top: prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and a distinct category of autonomous agents. The taxonomy is internally coherent and represents the most rigorous practical framework currently in wide use. Its scope is deliberately limited to compositional patterns — control flow built around LLM calls. It does not address single-agent reasoning loop variants (ReAct, Reflexion, etc.) as peers, treating them implicitly as variations within the autonomous-agent category. This is a defensible scoping choice but leaves the broader taxonomic space unaddressed.

### 2.2 Single-agent reasoning pattern literature

The reasoning-pattern literature is anchored by ReAct (Yao et al., 2022) and Reflexion (Shinn et al., 2023), with subsequent extensions including Tree of Thoughts (Yao et al., 2023), Graph of Thoughts (Besta et al., 2023), ReWOO (Xu et al., 2023), and Chain-of-Verification (Dhuliawala et al., 2023). These works are individually rigorous. Aggregated taxonomies that list them alongside compositional or multi-agent patterns conflate axes — these patterns describe the *shape of the cognitive loop inside a single agent* and compose freely with any compositional or multi-agent topology.

### 2.3 Multi-agent topology taxonomies

Multi-agent coordination patterns inherit from earlier multi-agent systems literature. A recent survey (anonymous, 2025, arXiv:2508.12683) catalogs hierarchical, blackboard, peer-to-peer, market-based, and graph-orchestrated topologies, with extensive coordination-mechanism analysis. The literature is mature but rarely integrated with single-agent reasoning patterns or with capability-augmentation choices. The result is parallel literatures that do not compose.

### 2.4 The persistent failure: layer conflation

Across these literatures, the structural failure repeats: pattern lists treat decisions at different levels of architectural abstraction as peers. The corrective move is not to enumerate more patterns but to identify the levels and force the existing patterns into the right structural relationships.

---

## 3. The framework

### 3.1 Overview

The framework consists of:

- A **deployment envelope** specifying when the system runs and how it changes between runs.
- A **five-layer execution stack** inside the envelope, specifying the per-run architecture.
- Governance treated as **boundary interposition** at layer transitions, not as a separate overlay tier.

The stack is read bottom-up: lower layers constrain higher layers. A higher layer may only invoke capabilities exposed by the layer below.

![AI Agent Workflow Design Stack (AAWDS)](./aawds_diagram.png)

### 3.2 Envelope

The envelope contains two decisions made at deployment time. They are *outside the per-run execution stack* in the structural sense — they are properties of how the system is deployed, not of how a single run unfolds — but their consequences propagate into per-run behavior. A ReAct agent triggered synchronously and one triggered as a daemon are the same agent at L1–L5; they differ in deployment commitments. The relocation is structural, not temporal.

**Trigger model.** Possible values: synchronous (request-response), event-driven (external signal), batch (scheduled, often map-reducible), continuous (always-on daemon). Trigger model determines latency budgets, idempotency requirements, and state persistence.

**Adaptation horizon.** Possible values: stateless (no carry-over between runs), memory-augmented (state persists; behavior shape unchanged), self-improving (the system structurally changes as a result of past runs via fine-tuning, recursive prompt evolution, or reinforcement learning). The distinction between memory-augmented and self-improving is empirically thin in current production systems; most claimed "learning" agents are mislabeled memory-augmented systems. Self-improving systems remain primarily research-grade as of late 2025 / early 2026.
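The two envelope decisions can be written down as plain enumerations. A minimal sketch (the names `Trigger`, `Adaptation`, and `envelope` are illustrative, not part of the framework's vocabulary):

```python
from enum import Enum

class Trigger(Enum):
    SYNC = "sync"              # request-response
    EVENT = "event"            # external signal
    BATCH = "batch"            # scheduled, often map-reducible
    CONTINUOUS = "continuous"  # always-on daemon

class Adaptation(Enum):
    STATELESS = "stateless"            # no carry-over between runs
    MEMORY = "memory"                  # state persists; behavior shape unchanged
    SELF_IMPROVING = "self-improving"  # structural change from past runs

# The envelope is just the pair of deployment-time decisions, fixed
# before any single run begins.
envelope = (Trigger.EVENT, Adaptation.MEMORY)
```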

### 3.3 L1 — Capability surface

The capability surface defines the agent's *action space* and *observation space*: what the agent can do to the world and what feedback it receives. It is the substrate above which all higher layers operate.

Components:

- Tool / function calls (named procedures with typed signatures).
- Code execution (the action *is* arbitrary code; CodeAct is the special case where this is the primary action mode).
- Retrieval (vanilla RAG, agentic RAG, Self-RAG; retrieval can be a tool, a pre-fetched context, or a model-controlled subroutine).
- Memory (scratchpad within run, episodic across runs, semantic across users).
- External environment (browser, operating system, simulator, physical actuators).

A pattern law: *no capability surface, no agency*. ReAct without tools is chain-of-thought. Plan-and-execute without an executor surface is an outline. CodeAct differs from tool use only in that L1 includes a code interpreter and the action grammar is broader.

The framework places L1 below all other layers because every higher layer operates *through* it. This is the structural correction over earlier frameworks that treat capability as "augmentation" — a peer choice to control or cognition. Augmentation framing is incorrect: there is no agent without capability, and every cognitive or control choice is implicitly conditioned on what L1 exposes.

### 3.4 L2 — Cognition

The cognition layer specifies the shape of the model's reasoning loop *if there is one*. L2 is conditional: it is undefined when L3 specifies static control, because there is no model-controlled loop to shape.

Categories:

- **Linear** — direct execution, chain-of-thought. One forward pass; no observation feedback within the loop.
- **Reactive** — ReAct (Yao et al., 2022). Interleaved thought, action, observation; the loop continues until termination conditions are met.
- **Plan-first** — Plan-and-Execute, ReWOO (Xu et al., 2023). The loop is split into a planning phase (potentially LLM-controlled) and an execution phase (often non-LLM-controlled). Replanning is triggered on failure.
- **Reflective** — single-pass Reflection, Reflexion (Shinn et al., 2023), Chain-of-Verification (Dhuliawala et al., 2023). A critique step is interposed; in Reflexion specifically, critiques persist across attempts (which technically makes Reflexion a composition of L2 reflective and Envelope memory-augmented; see §5).
- **Branching** — Tree of Thoughts (Yao et al., 2023), Graph of Thoughts (Besta et al., 2023). The reasoning trace branches and evaluates partial states; backtracking and merging are admitted.
- **Stochastic-stabilized** — self-consistency, best-of-N. The same loop is sampled multiple times and aggregated. Fixes stochastic errors, not systematic ones.
- **Utility-driven** — explicit scoring functions select actions. Inherits from classical agent theory (BDI, utility-based agents); rare in current LLM systems but architecturally distinct.

A reduction: every L2 variant can be expressed as a triple `⟨when f is invoked, how many alternatives are explored per invocation, how feedback reshapes f⟩` where `f: (state, memory, objective) → next_action`. The cognitive variants differ in this triple, not in their fundamentals.
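The reduction can be made concrete. A hedged sketch in Python, encoding each L2 variant as the triple the text defines; the specific field values below are illustrative summaries of §3.4, not canonical parameters:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CognitionTriple:
    """⟨when f is invoked, alternatives per invocation, feedback effect⟩
    for f: (state, memory, objective) -> next_action."""
    invocation: str   # when f is called
    branching: int    # alternatives explored per invocation
    feedback: str     # how observations reshape f

# Illustrative encodings of the L2 variants from §3.4.
L2_VARIANTS = {
    "linear":     CognitionTriple("once",                    1, "none"),
    "reactive":   CognitionTriple("every step",              1, "observation appended to state"),
    "plan-first": CognitionTriple("plan phase, then on failure", 1, "replan on failure"),
    "reflective": CognitionTriple("after each attempt",      1, "critique appended"),
    "branching":  CognitionTriple("every step",              3, "partial-state evaluation prunes"),
    "stochastic": CognitionTriple("once, sampled N times",   5, "aggregation only, no reshaping"),
}
```

The point of the encoding is negative as much as positive: two variants with the same triple would be the same variant under a different name.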

### 3.5 L3 — Control locus

The control layer specifies *who decides the next step*: program code, the model, or both. This is the single most consequential architectural decision; it determines debuggability, drift risk, and cost profile more strongly than any other choice.

Three values:

- **Static** — control flow is a programmer-defined directed acyclic graph. The model is invoked at specific nodes but does not determine which node runs next. Examples at this setting include prompt chaining, deterministic routing, parallelization (sectioning, voting, map-reduce), and sequential pipelines.
- **Model-directed** — the model selects the next action inside an unbounded loop. ReAct in its pure form sits here.
- **Mixed** — the program defines boundaries and dispatch logic; the model decides locally within those boundaries. Orchestrator-workers, evaluator-optimizer, plan-and-execute (with LLM planner and programmatic executor), and graph-orchestrated agents (LangGraph default) sit here.

The choice has predictable cost and risk consequences:

| Property         | Static | Mixed   | Model-directed |
|------------------|--------|---------|----------------|
| Debuggability    | high   | medium  | low            |
| Flexibility      | low    | medium  | high           |
| Cost (tokens)    | low    | medium  | high           |
| Drift risk       | low    | medium  | high           |
| Trace complexity | low    | medium  | high           |

A frequently observed failure: teams escalate to model-directed control before exhausting static or mixed alternatives, then incur the full cost penalty without measurable benefit because the task did not require the flexibility.
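The three L3 settings can be contrasted as loop skeletons. A minimal sketch, assuming stand-in `llm` and `tools` callables (neither is a real API; the shapes of the loops are the point):

```python
def static_control(llm, steps, x):
    # Static: the program fixes the DAG; the model only fills in node outputs.
    for prompt in steps:            # order decided by the programmer
        x = llm(prompt, x)
    return x

def model_directed_control(llm, tools, x, max_steps=10):
    # Model-directed: the model picks the next action inside an open loop.
    for _ in range(max_steps):      # a budget, not a structure
        action, arg = llm("choose next action", x)
        if action == "finish":
            return arg
        x = x + [tools[action](arg)]
    return x

def mixed_control(llm, tools, subtasks, x):
    # Mixed: the program defines boundaries and dispatch; the model
    # decides locally within each boundary.
    results = []
    for task in subtasks:           # dispatch fixed by the program
        results.append(model_directed_control(llm, tools, [task, x]))
    return results
```

The table's cost and drift gradients follow from the skeletons: in `static_control` every trace has the same shape; in `model_directed_control` the trace shape is itself model output.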

### 3.6 L4 — Coordination

The coordination layer specifies how many agents participate and how they are wired. It is partially conditional: cardinality is always specified, but topology is undefined for cardinality 1.

A note on scope: L4 specifies *who talks to whom* (the message-passing graph). It does not specify *how multiple agent outputs are combined into a single system output*. Aggregation — voting, judging, weighted combination, last-writer-wins — is an admissibility decision and belongs to L5. This separation is a deliberate refinement over earlier framings of this framework, in which consensus was treated as a primitive topology while debate was treated as a topology+governance composition. The asymmetry was unjustified: both topologies require an aggregator. Under the present framing, *both* are compositions of an L4 topology choice with an L5 aggregator choice. Topologies that have only one terminal output by construction (pipeline, hierarchical with a designated root) do not need explicit L5 aggregation; topologies that produce multiple outputs (peer, debate-style, voting-style) always do.

Topologies for cardinality > 1:

- **Hierarchical (supervisor-worker)** — a coordinator dispatches to specialists. Default starting point for multi-agent systems. Output is the coordinator's; no L5 aggregator required.
- **Sequential pipeline** — agent A's output is agent B's input. Output is the terminal agent's; no L5 aggregator required.
- **Hub-and-spoke** — central hub mediates information flow but does not decide; specialists do not communicate directly. Output handling depends on hub design.
- **Peer-to-peer / round-robin** — equal agents converse with role specialization, no central authority. Output requires L5 aggregator (judge, vote, last-writer-wins).
- **Graph orchestration** — agents as nodes in a directed graph with explicit control-flow edges. LangGraph and Microsoft Agent Framework implement this canonically. Output handling is graph-design-specific.
- **Blackboard** — shared knowledge store; agents read and write opportunistically; coordination is stigmergic. Output requires explicit termination criterion (often L5).
- **Swarm / decentralized** — no central coordinator; behavior is emergent. Mostly research-grade for LLM agents.

Three patterns commonly listed as topologies are, under this refinement, compositions of a peer topology with a specific L5 aggregator:

- **Debate** ≡ peer topology + judge or aggregator at L5. The defining feature is adversarial role assignment within the peer structure plus a designated arbiter.
- **Consensus committee / voting** ≡ peer topology + voting rule at L5 (majority, weighted, confidence-scored). The defining feature is parallel independent execution plus a fixed combination rule.
- **Market-based / contract-net** ≡ peer topology + bid/allocation rule at L5. The defining feature is utility-based task allocation.
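The decomposition can be shown directly: the topology produces multiple outputs, and a separate L5 rule combines them. A minimal sketch, assuming the peer agents are simple callables (all names illustrative):

```python
from collections import Counter

def peer_topology(agents, task):
    # L4: parallel independent execution — yields multiple outputs.
    return [agent(task) for agent in agents]

def majority_vote(outputs):
    # L5: the aggregation rule lives outside the topology.
    return Counter(outputs).most_common(1)[0][0]

# "Consensus committee" ≡ peer topology + voting aggregator.
def committee(agents, task, aggregate=majority_vote):
    return aggregate(peer_topology(agents, task))
```

Swapping `majority_vote` for a judge agent yields debate; swapping it for a bid/allocation rule yields market-based coordination. The L4 half of the composition is unchanged in all three cases.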

A coordination law: *coordination cost > task decomposition benefit ⇒ system degrades*. Multi-agent is not a capability upgrade; it is complexity injection. The default should be single-agent until measured failure modes cannot be fixed at L1–L3.

### 3.7 L5 — Governance

Governance is the most frequently mismodeled layer in existing taxonomies. It is treated as a free-floating overlay or as a peer category to architectural patterns. Both framings are incorrect.

Governance is *interposition at layer boundaries*. Specific interception points:

- **Envelope → L1** — input filters, prompt-injection screens, schema validation on incoming data.
- **L2 → L1** — tool guardrails, action policy classifiers, confidence-gated execution, cost-aware routing.
- **L4 → output** — output validators, rule checks, hallucination screens, format enforcement.
- **Cross-cutting** — audit logging, traceability, retry/recovery, human-in-the-loop checkpointing, sandbox isolation.

A governance law: *governance does not change behavior generation; it constrains behavior admissibility*. It does not alter what the agent considers; it alters what the agent is permitted to emit or execute. Sizing should be matched to risk, not to architectural complexity.
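Interposition at a layer boundary can be sketched as a wrapper around the L1 action surface: generation is untouched, admissibility is constrained. A hedged illustration, assuming a hypothetical `execute` callable as the raw surface:

```python
class ActionDenied(Exception):
    pass

def guard(execute, policy):
    # L2 -> L1 interception: the model still proposes whatever action it
    # likes; the guard only decides whether the proposal may be executed.
    def guarded(action, args):
        if not policy(action, args):
            raise ActionDenied(f"blocked: {action}")
        return execute(action, args)
    return guarded

# Usage: wrap the raw surface once, at the boundary.
# safe_execute = guard(execute, lambda action, args: action != "delete_db")
```

The same wrapper shape applies at every interception point listed above; only the `policy` and the boundary differ.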

### 3.8 Layer dependency graph

Conditionality is structural, not incidental:

- L2 exists iff L3 ≠ static.
- Topology is defined iff L4 cardinality > 1.
- L3, L4, L5 always exist.
- L1 always exists (an agent without L1 is not an agent).
- Envelope always exists.

This dependency graph eliminates empty cells in the design space and clarifies why some patterns appear to span multiple layers (they don't; they are compositions).

**An honest caveat on the strict ordering.** The framework places L1 below all higher layers because every higher-layer choice operates *through* L1's action and observation surface. This is the right structural commitment for the dominant case where L1 is fixed at deploy time. It is partially strained in three real edge cases:

1. **Dynamically-extending capability.** Systems where L1 can be extended at runtime — agents that discover and bind new tools via Model Context Protocol, that synthesize new tools through code generation, or that incorporate retrieval results as new latent capabilities — partially invert the dependency. The model is, in effect, modifying its own substrate.
2. **Self-modifying agents.** Systems where the agent rewrites its own prompts, plans, or scaffolding mid-run — sometimes called self-modifying or self-reflective beyond Reflexion-style critique — also strain the ordering, because the cognition layer is reaching upward to modify the control or capability layers.
3. **Joint-state multi-agent systems.** In some multi-agent reinforcement-learning configurations, cognition is not located in any single agent; it emerges from joint state. L2 in such systems is not localized to one node, and the L4-above-L2 framing is awkward.

The framework absorbs cases (1) and (2) by using `Env.adaptation: self-improving` plus an explicit annotation that L1 has model-controlled extension points. Case (3) is genuinely outside the framework's representational vocabulary and is acknowledged as a gap in §8.6. Naming these edge cases now, rather than discovering them under stress, is the difference between a structural framework and a marketing abstraction.

---

## 4. Formal specification

### 4.1 Tuple form

An agent system *S* is specified by:

```
S = ⟨E, L1, L2, L3, L4, L5⟩
```

where:

```
E   = ⟨trigger, adaptation⟩
       trigger    ∈ {sync, event, batch, continuous}
       adaptation ∈ {stateless, memory, self-improving}

L1  ⊆ {tools, code, retrieval, memory_io, environment}      (non-empty)

L3  ∈ {static, mixed, model-directed}

L2  ∈ {⊥, linear, reactive, plan-first, reflective, branching, stochastic, utility}
       L2 = ⊥  iff  L3 = static

L4  = ⟨n, topology⟩
       n ∈ ℕ⁺
       topology ∈ {⊥, hierarchical, pipeline, hub-spoke, peer, graph,
                   blackboard, swarm}
       topology = ⊥  iff  n = 1

       Aggregation (combining multiple agent outputs into one system output)
       belongs to L5, not to topology. Patterns commonly named at L4 that
       require aggregation — debate, consensus committee, market-based — are
       compositions of a peer or graph topology with an L5 aggregator.

L5  ⊆ {pre-input, in-loop, post-output, cross-cutting, aggregation}
```
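The conditional dependencies in the tuple can be enforced mechanically. A minimal validator sketch (an illustration of the constraints, not a reference implementation; `None` stands for ⊥):

```python
def validate(S):
    """Check the structural constraints of S = ⟨E, L1, L2, L3, L4, L5⟩."""
    assert S["E"]["trigger"] in {"sync", "event", "batch", "continuous"}
    assert S["E"]["adaptation"] in {"stateless", "memory", "self-improving"}
    assert S["L1"], "L1 non-empty: no capability surface, no agency"
    assert S["L3"] in {"static", "mixed", "model-directed"}
    # L2 = ⊥ iff L3 = static
    assert (S["L2"] is None) == (S["L3"] == "static")
    n, topology = S["L4"]
    assert n >= 1
    # topology = ⊥ iff n = 1
    assert (topology is None) == (n == 1)
    return True

# ReAct as a tuple: L2 reactive, L3 model-directed, defaults elsewhere.
react = {
    "E": {"trigger": "sync", "adaptation": "stateless"},
    "L1": {"tools"}, "L2": "reactive", "L3": "model-directed",
    "L4": (1, None), "L5": set(),
}
```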

### 4.2 Equivalence relation

Two systems are *architecturally identical* iff their tuples are equal. This is the structural test the previous taxonomies fail. Under this equivalence:

- Critic-Generator ≡ Evaluator-Optimizer (same tuple).
- Meta-Agent ≡ Supervisor-Worker (same tuple).
- Hierarchical Task Decomposition ≡ Orchestrator-Workers (same tuple under "compositional" reading).
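Under the non-default notation of §5.2, the synonym test is literal equality of the constraint sets the two names denote. A minimal illustration (the dict encoding is an assumption of this sketch, not part of the framework):

```python
# Two names, one set of constrained coordinates — so one name is an alias.
critic_generator    = {"L3": "mixed", "L5": {"in-loop critic"}}
evaluator_optimizer = {"L3": "mixed", "L5": {"in-loop critic"}}
assert critic_generator == evaluator_optimizer

meta_agent        = {"L4": "hierarchical"}
supervisor_worker = {"L4": "hierarchical"}
assert meta_agent == supervisor_worker
```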

### 4.3 Composition

A pattern is *primitive* if it specifies a value at exactly one layer (with the others left as parameters). It is a *composition* if it constrains values at two or more layers simultaneously.

Examples:

- *ReAct* fixes L2 = reactive and L3 = model-directed. Strictly a two-layer constraint, but treated here as a canonical base configuration at the cognition/control intersection.
- *Reflexion* fixes L2 = reflective and Envelope.adaptation = memory. Composition.
- *Plan-Act-Reflect* fixes L2 = plan-first and L5 = in-loop critic (or Envelope.adaptation = memory if critique persists). Composition.
- *Self-RAG* fixes L1 ⊇ {retrieval} and L5 ⊇ {in-loop verification}. Composition.

### 4.4 Specification completeness claim

*Inferred claim, not proven.* The conjecture is that every architectural distinction between named agent patterns in the current literature is captured by some choice in this tuple. The validation in §5 is by enumeration over a sample of 37 widely cited patterns. A formal proof of completeness is not offered; the framework is open to falsification by exhibition of a pattern that requires a dimension not represented.

---

## 5. Validation by pattern compression

### 5.1 Method

Each named pattern in a representative sample is mapped to a tuple. Three outcomes are possible:

- **Unique tuple** — the pattern is a primitive variant or a base configuration.
- **Tuple matches an existing pattern** — the pattern is a synonym; one name should be retired or treated as alias.
- **Tuple combines values across multiple layers** — the pattern is a composition; it is not primitive and should be decomposed.

The validation succeeds if every pattern resolves to one of these three outcomes, and if no pattern requires a dimension outside the framework.

### 5.2 Worked compressions

Notation: only non-default values are listed. Defaults are sync trigger, stateless adaptation, single agent, minimal governance.

| Pattern                      | Compression                                                                                                  | Verdict     |
|------------------------------|---------------------------------------------------------------------------------------------------------------|-------------|
| Augmented LLM                | L1: tools/retrieval/memory                                                                                    | Primitive (L1 base) |
| Prompt chaining              | L3: static                                                                                                    | Primitive (L3 base) |
| Routing                      | L3: static (with classifier)                                                                                  | Primitive (L3 variant) |
| Parallelization (sectioning) | L3: static (DAG with parallel branches)                                                                       | Primitive (L3 variant) |
| Parallelization (voting)     | L3: static + L5: post-output aggregator                                                                       | Composition |
| Orchestrator-workers         | L3: mixed + L4: hierarchical                                                                                  | Composition |
| Evaluator-optimizer          | L3: mixed + L5: in-loop critic                                                                                | Composition |
| ReAct                        | L2: reactive + L3: model-directed                                                                             | Primitive (canonical config) |
| Plan-and-execute             | L2: plan-first + L3: mixed                                                                                    | Composition |
| ReWOO                        | L2: plan-first + L1: tools (batched)                                                                          | Variant of plan-first |
| Reflection (single-pass)     | L2: reflective                                                                                                | Primitive (L2 variant) |
| Reflexion                    | L2: reflective + Env.adaptation: memory                                                                       | Composition |
| Self-consistency / Best-of-N | L2: stochastic                                                                                                | Primitive (L2 variant) |
| Tree of Thoughts             | L2: branching                                                                                                 | Primitive (L2 variant) |
| Graph of Thoughts            | L2: branching (DAG)                                                                                           | Primitive (L2 variant) |
| Chain-of-Verification        | L2: reflective (verification-shaped)                                                                          | Variant of reflective |
| CodeAct                      | L1: code execution                                                                                            | Primitive (L1 variant) |
| Self-RAG                     | L1: retrieval + L5: in-loop verification                                                                      | Composition |
| Hierarchical / supervisor    | L4: hierarchical                                                                                              | Primitive (L4 base) |
| Sequential pipeline          | L4: pipeline                                                                                                  | Primitive (L4 variant) |
| Hub-and-spoke                | L4: hub-spoke                                                                                                 | Primitive (L4 variant) |
| Blackboard                   | L4: blackboard                                                                                                | Primitive (L4 variant) |
| Debate                       | L4: peer + L5: aggregator/judge                                                                               | Composition (revised — see §3.6) |
| Consensus committee          | L4: peer + L5: voting aggregator                                                                              | Composition (revised — see §3.6) |
| Market-based                 | L4: peer + L5: bid/allocation aggregator                                                                      | Composition (revised — see §3.6) |
| Graph orchestration          | L4: graph                                                                                                     | Primitive (L4 variant) |
| Swarm                        | L4: swarm                                                                                                     | Primitive (L4 variant) |
| Plan-Act-Reflect             | L2: plan-first + L5: in-loop critic                                                                           | Composition |
| Meta-Agent                   | L4: hierarchical                                                                                              | Synonym of supervisor |
| Critic-Generator             | L3: mixed + L5: in-loop critic                                                                                | Synonym of evaluator-optimizer |
| Role-Based Teams             | L4: hierarchical (with role-specialized workers)                                                              | Synonym of supervisor |
| Hybrid Human-Agent           | L5: cross-cutting HITL (continuous)                                                                           | Contested compression — see note below |
| Digital Twin agent           | L1: simulator environment                                                                                     | Application of L1 |
| Event-driven workflow        | Env.trigger: event + L3: static                                                                               | Composition |
| Map-reduce                   | L3: static (parallel decompose + aggregate)                                                                   | Synonym of parallelization with aggregator |
| RAG agent                    | L1: retrieval                                                                                                 | L1 augmentation, not a pattern |
| Self-improving agent         | Env.adaptation: self-improving                                                                                | Envelope choice |

**Note on Hybrid Human-Agent.** The compression treats continuous human-agent collaboration as HITL applied without gates. This is contestable. HITL is gate-shaped (intercept, approve, release); continuous collaboration is dialog-shaped (mutual influence over time). The framework currently lacks a clean way to distinguish these. A defensible alternative is to treat dialog-shaped collaboration as L4 cardinality 2 with the human as one node and an L5 cross-cutting interaction protocol — but this awkwardly classifies humans as agents. The classification is provisional pending a clearer treatment of human participation; the limit is acknowledged in §8.7.

### 5.3 Outcome of compression

Of the 37 patterns examined, every entry resolves to one of {primitive, synonym, composition, capability augmentation, envelope choice}. None require a dimension outside the framework. Approximately one-third are compositions or synonyms, supporting the claim that current pattern catalogs over-enumerate.

The single ambiguous case in this exercise was *utility-driven cognition*, which is well-defined in classical agent theory but rarely instantiated in current LLM agents. Its inclusion is conservative: it occupies a real coordinate even if production examples are sparse.

**A note on methodological honesty.** The 37 patterns above were selected by the author, and each was mapped by the author. This is non-adversarial validation. A compression exercise whose author chooses both the test set and the mappings cannot, on its own, falsify the framework — at most, it can demonstrate internal consistency. The next subsection addresses this by attempting an adversarial test: patterns chosen specifically because they are expected to stress the framework's assumptions. Where the framework absorbs them cleanly, the absorption is a meaningful (if still limited) confirmation. Where it does not, the failure is reported.

### 5.4 Adversarial stress test

The compression in §5.2 demonstrates that widely cited patterns fit the framework. A stronger test is whether patterns deliberately chosen to violate the framework's assumptions can also be expressed. Three candidates are examined.

**Test 1 — Self-modifying agents.** Class definition: agents that rewrite their own prompts, tools, or scaffolding mid-execution. Examples include agents that synthesize new tools at runtime via code generation, agents that edit their own system prompt based on observed performance, and agents that modify their own planning policy during a single user-facing session.

The framework's strict commitment is that L1 is below L3: the model decides actions through a fixed action surface. Self-modifying agents partially invert this — the model is acting on its own action surface.

*Absorption*: partial. The framework can represent self-modifying agents by allowing L1 to have model-controlled extension points and by setting `Env.adaptation: self-improving` if modifications persist. But the strict ordering claim ("higher layers operate through L1") becomes weaker — L3 is now operating on L1, not just through it. The framework's vocabulary survives; the strict structural claim does not. *This is a reportable weakness.*
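The "model-controlled extension points" treatment can be made concrete. A minimal sketch, assuming a split between a frozen base surface and a mutable extension set (that split, and the class below, are this sketch's conventions, not part of the §4.1 formal spec):

```python
# Sketch: an L1 action surface with a frozen base and a mutable
# extension set. Binding mutation is itself exposed as an action,
# which is exactly where the strict L1-below-L3 ordering weakens:
# the model can now act on its own action surface.
class ActionSurface:
    def __init__(self, base_tools):
        self._base = frozenset(base_tools)  # fixed at deploy time
        self._extensions = {}               # model-controlled at runtime

    def bind(self, name, fn):
        """Runtime tool synthesis: an L3 model-directed action that
        mutates L1 rather than merely acting through it."""
        if name in self._base:
            raise ValueError("base surface is immutable")
        self._extensions[name] = fn

    def available(self):
        return self._base | set(self._extensions)

surface = ActionSurface({"search", "calculator"})
surface.bind("unit_convert", lambda inches: inches * 2.54)  # synthesized mid-run
assert "unit_convert" in surface.available()
```

If extensions persist across runs, the same structure pairs with `Env.adaptation: self-improving`; if they are discarded at run end, the mutation stays within L1.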

**Test 2 — Streaming / incremental agents.** Class definition: agents that emit partial output continuously and refine it as new context arrives, rather than producing a complete output per run. Examples include live-transcription summarization agents, real-time translation agents with re-revision, and "thinking out loud" agents that publish reasoning continuously.

The framework's commitment is that a run has a clear input-output structure. Streaming systems have neither a discrete input event nor a discrete output event.

*Absorption*: partial. `Env.trigger: continuous` plus `Env.adaptation: memory-augmented` captures the deployment shape. But the cognition layer (L2) and governance layer (L5) are designed for run-bounded interception; they do not cleanly accommodate "interception within a stream of partial commitments." The framework can describe the system; it cannot describe the within-stream interception semantics that govern when output is committed and when it is revisable. *This is a second reportable weakness.*

**Test 3 — Joint-cognition multi-agent systems.** Class definition: multi-agent systems where reasoning is not located in any single agent but emerges from joint state. Examples include multi-agent reinforcement learning systems with shared policy networks, communicating agents with learned protocols, and swarm systems where individual agents have minimal cognition and the collective exhibits emergent reasoning.

The framework's commitment is that L2 cognition is per-agent. Joint cognition violates this: there is no single agent whose L2 fully captures the system's reasoning.

*Absorption*: fails. The framework can describe the topology (L4: swarm) and the augmentations (L1), but it cannot describe joint cognition as a first-class structural property. The framework's L2-per-agent assumption is wrong for this class. *This is a third reportable weakness, and the most fundamental of the three.*

**What the stress test shows.** The framework absorbs widely cited patterns cleanly and absorbs two adversarial cases (self-modifying, streaming) with acknowledged loss of structural strength. It fails to absorb joint-cognition multi-agent systems. This is consistent with the framework's stated scope (single-agent or independently-cognizing multi-agent systems). It is also a real boundary, and one that any future revision must address if the framework is to claim coverage of the full multi-agent literature.

The honest summary: the framework is a structurally correct decomposition for the dominant class of agent systems built today (sequential, single-cognizer, run-bounded). It strains at three known edges. Practitioners working at those edges should treat the framework as informative but partial.

---

## 6. Failure modes by layer

The framework predicts where systems fail by associating failure modes with layers. The diagnostic claim: an observed failure can usually be localized to one layer, and the corresponding fix is at that layer or below.

| Symptom                                | Most likely layer  | Fix direction                                                            |
|----------------------------------------|--------------------|---------------------------------------------------------------------------|
| Hallucinated facts in output           | L2 (cognition)     | Add reflective step or CoVe; or increase L1 retrieval grounding           |
| Wrong actions taken                    | L1 (capability)    | Tighten action space; remove or restrict tools; add schema validation     |
| Inconsistent behavior across runs      | L3 (control)       | Move from model-directed to mixed or static                                |
| Coordination chaos                     | L4 (coordination)  | Reduce cardinality or change topology toward hierarchical/graph            |
| Cost spiraling                         | L3 + L2            | Reduce model-directed control; replace reflective with single-pass check  |
| Latency spikes                         | L2 + L4            | Reduce loop depth; flatten coordination                                   |
| Policy violations in output            | L5 (governance)    | Add post-output validator; tighten in-loop guardrails                     |
| Repeated identical errors              | Env.adaptation     | Add memory augmentation (Reflexion-style)                                 |
| Unable to handle async load            | Env.trigger        | Move from sync to event-driven or batch                                   |
| Brittle planning                       | L2 plan-first      | Add replanning trigger; or move to reactive                                |
| Confidently wrong consensus            | L4 peer + L5 aggregator | Diversify samplers (L4); strengthen judge or aggregation rule (L5)         |

This is not a failure-mode catalog. It is a localization heuristic.
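The heuristic can be read as a lookup from symptom to layer and fix direction. A minimal sketch, with rows abbreviated from the table above (the symptom keys are this sketch's naming, not a defined vocabulary):

```python
# Sketch: the §6 localization heuristic as a symptom -> (layer, fix)
# lookup. Rows abbreviated from the table; keys are illustrative.
LOCALIZATION = {
    "hallucinated_facts": ("L2", "add reflective step or CoVe; increase L1 retrieval grounding"),
    "wrong_actions":      ("L1", "tighten action space; restrict tools; add schema validation"),
    "inconsistent_runs":  ("L3", "move from model-directed toward mixed or static"),
    "coordination_chaos": ("L4", "reduce cardinality; move toward hierarchical/graph"),
    "policy_violations":  ("L5", "add post-output validator; tighten in-loop guardrails"),
    "repeated_errors":    ("Env.adaptation", "add memory augmentation (Reflexion-style)"),
}

def localize(symptom: str) -> tuple[str, str]:
    """Return (most likely layer, fix direction); unknown symptoms
    fall outside the heuristic rather than being guessed at."""
    return LOCALIZATION.get(symptom, ("unknown", "symptom not covered by heuristic"))

assert localize("wrong_actions")[0] == "L1"
```

The diagnostic claim is only that the *first* place to look is the named layer; cross-layer failures (§6.1) deliberately fall outside this table.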

### 6.1 Cross-layer failures

Some failures cross layers and resist single-layer fixes:

- **Drift in long-horizon model-directed loops.** The agent loses track of objective. Cause: L2 (context bloat) and L3 (no programmatic re-anchoring). Fix typically involves both layers: re-anchor periodically (L3 mixed) or summarize context (L2).
- **Multi-agent stalling.** Agents in peer-to-peer or blackboard topologies fail to converge. Cause: L4 (no termination criterion) and L5 (no aggregator). Fix typically requires both.
- **Capability over-reach.** Agent uses a tool aggressively where a simpler approach would suffice. Cause: L1 (over-broad action space) and L2 (no cost signal in cognition). Fix at L1 first.

### 6.2 Failure law

A general law inferred from observation, stated as a falsifiable claim: *most agent-system failures in production trace to L1 misspecification or premature L3 escalation, not to L2 cognition choice*. Engineers focus on L2 because the literature focuses there, but the dominant failure mass is below.

---

## 7. Selection logic

A bottom-up build protocol follows directly from the dependency graph.

**Step 1. Pin the envelope.** What triggers the system? What persists between runs? These are deployment commitments and constrain everything downstream.

**Step 2. Pin L1.** What does the agent need to act on and observe? Specify the action space and observation space concretely. Most production failures trace to over- or under-specified capability surface, so this step deserves disproportionate attention.

**Step 3. Pin L3.** Default to static control. Escalate to mixed when programmatic decomposition stops fitting the task. Escalate to model-directed only when mixed cannot accommodate the task either. The progression is asymmetric: each step adds cost and reduces debuggability, so escalation should require a measured failure of the lower setting, not an a priori preference.

**Step 4. Pin L2 (if applicable).** Only if L3 admits a model loop. Default reactive (ReAct). Move to plan-first when planning horizon dominates cost. Add reflective when failures repeat. Branching and stochastic-stabilized are last resorts; their cost rarely justifies the gain in production.

**Step 5. Pin L4.** Stay single-agent until a measured failure cannot be fixed at L1–L3. If multi-agent is forced, default to hierarchical or graph topology. Treat blackboard, debate, swarm, and market-based as research-grade unless the problem genuinely matches their structure.

**Step 6. Apply L5.** Size governance to risk, not to architectural complexity. Place interception at the layer boundaries where the risk is — input boundary for prompt injection, action boundary for tool safety, output boundary for hallucination, cross-cutting for audit and HITL.

The order matters. L1 is below L3 in the dependency graph, so it must be pinned first. L2 is conditional on L3, so it follows. L4 is independent of L2 and L3 in principle but in practice should be deferred until single-agent capacity is measured. L5 is last because it interposes at boundaries that must already be defined.
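The six steps can be sketched as an ordered builder. The field names, defaults, and `validate` method below are this sketch's assumptions; the only commitments taken from the paper are the defaults (static, single-agent) and the dependency constraints:

```python
# Sketch: the §7 bottom-up protocol as an ordered specification.
# Each step pins one slot of the <E, L1..L5> tuple; the field order
# mirrors the dependency graph (envelope first, governance last).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentSpec:
    trigger: str = "sync"                  # Step 1: envelope commitments
    adaptation: str = "stateless"
    l1: set = field(default_factory=set)   # Step 2: capability surface (be concrete)
    l3: str = "static"                     # Step 3: default static; escalate on measured need
    l2: Optional[str] = None               # Step 4: only if L3 admits a model loop
    l4: tuple = (1, None)                  # Step 5: (cardinality, topology); default single-agent
    l5: set = field(default_factory=set)   # Step 6: governance sized to risk

    def validate(self):
        assert self.l1, "L1 must be non-empty"
        assert (self.l2 is None) == (self.l3 == "static"), "L2 is ⊥ iff L3 = static"
        n, topology = self.l4
        assert (topology is None) == (n == 1), "topology is ⊥ iff n = 1"

spec = AgentSpec(trigger="event", l1={"tools", "retrieval"},
                 l3="mixed", l2="reactive", l5={"post-output"})
spec.validate()
```

The conditionality falls out of the field order: `l2` cannot be meaningfully chosen before `l3`, and `l5` interposes at boundaries the earlier fields define.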

---

## 8. Limits and open questions

The framework is a structural specification. It does not solve every problem in agent-system design, and several known gaps deserve explicit acknowledgement.

### 8.1 Inter-system composition

The framework describes a single agent system. When two agent systems compose across organizational or trust boundaries — Model Context Protocol (MCP) federation, agent-to-agent (A2A) protocols, agent marketplaces — neither the layer stack nor the envelope captures the composition cleanly. *Speculative*: a federation tier external to the envelope may be needed if inter-agent protocols mature into a standard architectural layer. As of late 2025 / early 2026, the protocols themselves are still being defined; premature formalization would likely freeze the wrong abstractions.

### 8.2 Dynamic execution traces

The framework describes static architecture. Real runs cross layers within a single trace: an event triggers at the envelope, hits a static workflow at L3, which calls a model-directed subagent at L3 inside one node, which uses code execution at L1, whose output is validated at L5. The framework describes *components*, not *traces*. A complete account would require an execution-trace formalism — closer to typed dataflow or process algebra — layered on top. This is beyond the scope of taxonomic work.
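What such a trace record might look like can be sketched, though only as a hint of the missing formalism (the event names and tuple shape below are illustrative; the framework defines none of this):

```python
# Sketch: the example run above recorded as a layer-annotated event
# list — the kind of object an execution-trace formalism would type.
# Names are illustrative; the framework itself offers no trace syntax.
trace = [
    ("Env.trigger", "event received"),
    ("L3",          "static workflow node entered"),
    ("L3",          "model-directed subagent invoked inside one node"),
    ("L1",          "code execution tool called"),
    ("L5",          "post-output validation of the result"),
]

# A single run crosses layers in an order the static tuple cannot
# express: the tuple describes components, the trace describes flow.
layers_crossed = {layer for layer, _ in trace}
assert layers_crossed == {"Env.trigger", "L3", "L1", "L5"}
```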

### 8.3 Governance sub-stack

L5 is compressed. A serious governance taxonomy would itself decompose into input-boundary, action-boundary, output-boundary, audit, escalation, and recovery dimensions, each with its own interception semantics. Treating it as a single layer is a deliberate simplification that loses precision in regulated contexts (finance, healthcare, public sector). For such contexts, the framework requires extension into a governance sub-framework before operational use.

### 8.4 Empirical validation absent

This is a structural framework, validated by analytical compression of named patterns and by an adversarial stress test (§5.4). It is not validated by empirical measurement of failure-mode prediction accuracy or design-time cost reduction.

A clarifying note on what such validation could even look like: a structural framework is not directly falsifiable by measurement of system performance, because the framework does not predict performance. It is falsifiable in two more specific ways:

1. **Pattern non-absorption.** Exhibition of a pattern in actual production use that requires a dimension the framework does not represent. The §5.4 stress test attempted this internally; an external attempt by a different reviewer would be stronger evidence.
2. **Failure-mode localization accuracy.** The §6 claim that observed failures localize to specific layers is empirically testable: take a corpus of post-mortems from production agent systems, classify each failure under the framework's layers, and check whether the prescribed fix direction (§6 table) was the fix that worked. A high accuracy supports the framework; a low accuracy weakens the localization claim without necessarily weakening the structural decomposition.

Neither test has been conducted. The first requires adversarial submissions; the second requires a corpus of post-mortems with known root causes and resolutions. Both are tractable but non-trivial. Treating empirical validation as a generic "future work" item understates its difficulty — for structural frameworks, the validation methods themselves require careful design.

### 8.5 Boundary of L1 and the memory question

The boundary of the capability surface is not always clean. Memory in particular sits at the boundary between L1 (memory I/O as capability) and Envelope (adaptation horizon as memory persistence).

The decision rule used in this paper:

- **Within-run memory** (scratchpad, observation buffer, intermediate state used by the current run only) is L1.
- **Cross-run memory** (state that persists between user-facing requests; vector stores; episodic logs; long-term semantic memory) is Envelope.adaptation.

This rule has a known boundary case: **Reflexion's cross-attempt critique store**. Reflexion stores critiques across attempts within what may be a single user-facing session. If "session" and "run" are the same unit, this is L1; if they are different, this is Envelope.adaptation. The paper currently classifies it as Envelope.adaptation (§5.2 compression) on the grounds that the critique persists across what would otherwise be independent attempts. A reasonable alternative classification places it at L1, treating each attempt as a sub-step of one run. Both classifications are defensible. The framework does not resolve this ambiguity; users should pick a consistent convention and apply it across their system.
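The decision rule above consults exactly one property and can be stated as a predicate. A minimal sketch (the function name and boolean input are this sketch's conventions):

```python
# Sketch: the §8.5 decision rule as a predicate. The single input —
# whether the state survives the run boundary — is the only thing
# the rule consults.
def classify_memory(persists_across_runs: bool) -> str:
    """Within-run state (scratchpad, observation buffer) -> L1;
    cross-run state (vector store, episodic log) -> Envelope.adaptation."""
    return "Envelope.adaptation" if persists_across_runs else "L1"

assert classify_memory(False) == "L1"                   # scratchpad
assert classify_memory(True) == "Envelope.adaptation"   # vector store

# Reflexion's cross-attempt critique store is the boundary case: the
# input to this predicate depends on whether "attempt" and "run" are
# the same unit, which is precisely what the rule does not decide.
```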

A second boundary case: **dynamically-extending capability** (per §3.8 caveat). When new tools appear at runtime, are they "L1 augmented mid-run" or "Envelope.adaptation triggered within a run"? The cleanest treatment is to say L1 admits a mutable subset of bindings, and the binding mutation is itself an L3 model-directed action. This adds machinery the §4.1 formal spec does not yet include; the present paper acknowledges the gap rather than resolves it.

### 8.6 Joint cognition in multi-agent systems

The §5.4 stress test identified joint-cognition multi-agent systems — where reasoning is emergent rather than located in any single agent — as outside the framework's representational vocabulary. This is a real limit. The framework assumes L2 is per-agent and L4 is a topology over individually-cognizing agents. Multi-agent reinforcement learning, learned-protocol communication, and emergent swarm cognition all violate this assumption.

A future revision could address this by introducing a "collective cognition" property at L4 — a flag that says cognition is distributed rather than per-agent — but this would require a substantial reworking of the L2/L4 split. The present paper does not attempt this.

### 8.7 Human participation in agent systems

The classification of Hybrid Human-Agent collaboration in §5.2 is provisional. HITL is gate-shaped (intercept, approve, release); continuous human-agent collaboration is dialog-shaped (mutual influence over time, with neither party in fixed control). The framework currently lacks a clean way to distinguish these modes.

Three possible treatments, none fully satisfactory:

1. Treat the human as an L4 node (cardinality 2). This awkwardly classifies humans as agents but represents the dialog structure faithfully.
2. Treat continuous collaboration as L5 cross-cutting HITL without gates (the §5.2 compression). This represents the interception structure but loses the dialog dynamics.
3. Introduce a separate "co-agent" tier outside the L1–L5 stack. This adds machinery for a modest gain.

The framework's current treatment is option 2. Better treatments may emerge as human-AI collaboration becomes a more developed sub-field; for now, the limitation is acknowledged.

---

## 9. Comparison to prior frameworks

### 9.1 Anthropic (Schluntz & Zhang, 2024)

The Anthropic framework occupies L3 (static and mixed control) and partially L4 (orchestrator-workers as compositional pattern). It does not address L1, L2, L5, or the envelope as peer concerns. The frameworks are compatible: Anthropic's six patterns are L3 base configurations and one L4 base configuration in AAWDS terms. AAWDS extends Anthropic's framework along the dimensions Anthropic deliberately scoped out.

### 9.2 Classical agent theory (BDI, hierarchical agents)

Classical agent theory (Belief-Desire-Intention, hierarchical task networks) addresses L2 (cognition shapes including utility-driven and BDI loops) and L4 (hierarchical multi-agent systems). It predates LLM-based agents and operates with different L1 assumptions (symbolic action spaces, formal planning languages). AAWDS inherits L2 utility-driven cognition and L4 hierarchical topology from this lineage. The frameworks are largely compatible at L2 and L4; they differ at L1 (symbolic vs. natural-language interfaces) and at L3 (classical agents are typically model-directed in AAWDS terms, but the model is symbolic, not statistical).

### 9.3 Recent multi-agent surveys

The 2025 hierarchical multi-agent systems survey (anonymous, arXiv:2508.12683) provides extensive L4 detail — coordination mechanisms, communication protocols, hierarchical structures. AAWDS L4 is consistent with this work but compresses it. For deep multi-agent design, the survey's finer distinctions are appropriate; AAWDS's coarser L4 categories are sufficient for cross-layer architectural reasoning.

### 9.4 Industry pattern catalogs

Industry pattern catalogs (typically 5–15 patterns in technical blogs and framework documentation) consistently exhibit the layer conflation described in §1. AAWDS does not reject these catalogs but relocates their entries. The catalogs remain useful as terminology references; AAWDS provides the structural relationships between terms that the catalogs lack.

### 9.5 Worked cross-framework comparison: where does Reflection sit?

A general claim of compatibility is weak. A sharper test is whether the frameworks agree on where a *specific* pattern belongs.

Take Reflection (single-pass critique before output). Different frameworks classify it differently:

| Framework                  | Where Reflection sits                            | What this implies                              |
|----------------------------|--------------------------------------------------|------------------------------------------------|
| Anthropic compositional    | Evaluator-optimizer workflow (L3 mixed in AAWDS terms) | Reflection is a workflow shape — orchestrated externally |
| Andrew Ng's four patterns  | A standalone agentic design pattern              | Reflection is a primitive capability            |
| ReAct/Reflexion literature | An L2 cognitive loop variant                     | Reflection is a property of how the model thinks |
| AAWDS                      | L2 reflective (single-pass) or composition with L5 in-loop critic | Both placements are valid depending on control locus |

The disagreement is not cosmetic. It reflects different views on whether reflection is *something the program orchestrates around the model* (Anthropic), *something the model does inside its loop* (ReAct lineage), or *a category of agentic capability* (Ng). AAWDS reveals the disagreement as a control-locus question: when the program inserts a critique step at fixed points (L3 mixed + L5 in-loop critic), it is one configuration; when the model itself introduces critique inside an open loop (L2 reflective), it is a different configuration. The other frameworks conflate these.

This is the kind of disagreement AAWDS is designed to make visible. The prior frameworks each have an internally coherent answer; AAWDS shows that the answers are about different objects.
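The two placements can be written out as partial AAWDS coordinates to make the control-locus split explicit. A sketch (the dict encoding is this paper's Appendix A vocabulary; assigning `L2: linear` to the generator inside the orchestrated configuration is this sketch's assumption):

```python
# Sketch: the two Reflection placements §9.5 distinguishes, as
# partial configurations. Same surface behavior ("critique before
# output"), different control locus.
program_orchestrated = {  # Anthropic's evaluator-optimizer reading
    "L3": "mixed",        # the program inserts the critique step
    "L2": "linear",       # assumed: generator runs a simple pass per step
    "L5": {"in-loop"},    # critic as governance interposition
}
model_internal = {        # ReAct/Reflexion-lineage reading
    "L3": "model-directed",
    "L2": "reflective",   # critique arises inside the model's own loop
    "L5": set(),
}

# The disagreement is not cosmetic: the coordinates differ at every
# layer that carries the critique.
assert program_orchestrated["L3"] != model_internal["L3"]
assert program_orchestrated["L2"] != model_internal["L2"]
```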

### 9.6 What AAWDS does not improve on

The framework offers no advantage over Anthropic's framework for purely compositional workflow design (where AAWDS adds machinery without adding precision). It offers no advantage over the classical BDI literature for utility-driven or symbolic-planning agents (where AAWDS's L2 categories are coarser). It offers no advantage over the multi-agent survey literature for fine-grained coordination-mechanism analysis (where AAWDS compresses what specialists would want disaggregated). The framework's value is at the *intersection* of these literatures — when a system requires reasoning across layers that the specialist literatures treat in isolation.

---

## 10. Conclusion

The AI agent literature has produced a vocabulary that cannot support unambiguous architectural specification. The defect is structural: pattern catalogs treat decisions at incompatible levels of abstraction as peers and miss base dimensions (trigger, adaptation) that determine production behavior.

The framework specified here — a five-layer execution stack inside a deployment envelope, with governance as boundary interposition — corrects the structural defect by:

1. Forcing capability below cognition below control below coordination, in a strict dependency graph.
2. Locating trigger and adaptation outside the per-execution stack.
3. Treating governance as interposition at layer transitions, not as a separate overlay tier.
4. Making conditionality (L2 conditional on L3; topology conditional on cardinality) explicit rather than incidental.

Validation by compression of 37 widely cited patterns shows each resolving to a primitive variant, a synonym, a composition, a capability augmentation, or an envelope choice. None requires a dimension outside the framework.

The framework's contribution is structural and analytical, not empirical. It does not predict which configuration is best for a given problem; that requires measurement. It does provide a vocabulary in which architectural choices can be specified, compared, and falsified — which is the precondition for any empirical work to follow.

The framework's known limits — inter-system composition, dynamic execution traces, governance sub-stack granularity, joint cognition, human participation, and the absence of empirical validation — are stated explicitly. They are not deferred refinements. They mark the boundary of what a taxonomic framework of this scope can offer.

**Statement of position.** This is a hypothesis about the structure of the agent-system design space, offered for refinement and falsification. The strongest test is whether it survives the introduction of a pattern requiring a dimension it does not represent. Such a pattern would justify either an additional layer or a redrawing of layer boundaries. Until such a pattern is exhibited, the framework should be treated as a working specification, not as a settled standard.

---

## References and primary sources

The following are the canonical sources for the patterns and concepts referenced. Standard scholarly citations should be verified against primary publications; the attributions below reflect community consensus.

1. **Schluntz, E. & Zhang, B.** (2024). *Building Effective Agents.* Anthropic. Available: anthropic.com/research/building-effective-agents.
2. **Yao, S. et al.** (2022). *ReAct: Synergizing Reasoning and Acting in Language Models.* arXiv:2210.03629.
3. **Shinn, N. et al.** (2023). *Reflexion: Language Agents with Verbal Reinforcement Learning.* arXiv:2303.11366.
4. **Yao, S. et al.** (2023). *Tree of Thoughts: Deliberate Problem Solving with Large Language Models.* arXiv:2305.10601.
5. **Besta, M. et al.** (2023). *Graph of Thoughts: Solving Elaborate Problems with Large Language Models.* arXiv:2308.09687.
6. **Xu, B. et al.** (2023). *ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models.* arXiv:2305.18323.
7. **Dhuliawala, S. et al.** (2023). *Chain-of-Verification Reduces Hallucination in Large Language Models.* arXiv:2309.11495.
8. **Anonymous** (2025). *A Taxonomy of Hierarchical Multi-Agent Systems: Design Patterns, Coordination Mechanisms, and Industrial Applications.* arXiv:2508.12683.
9. **Anthropic** (2024). *Anthropic Cookbook: patterns/agents.* GitHub: anthropics/anthropic-cookbook.
10. **Wooldridge, M. & Jennings, N.** (1995). *Intelligent Agents: Theory and Practice.* The Knowledge Engineering Review, 10(2). [Classical agent theory background.]
11. **Russell, S. & Norvig, P.** *Artificial Intelligence: A Modern Approach.* [Standard reference for utility-based and BDI agent architectures.]

---

## Appendix A — Quick reference: the AAWDS tuple

```
S = ⟨E, L1, L2, L3, L4, L5⟩

E.trigger     ∈ {sync, event, batch, continuous}
E.adaptation  ∈ {stateless, memory, self-improving}

L1            ⊆ {tools, code, retrieval, memory_io, environment}    (non-empty)

L2            ∈ {⊥, linear, reactive, plan-first, reflective,
                 branching, stochastic, utility}                    (⊥ iff L3 = static)

L3            ∈ {static, mixed, model-directed}

L4            = ⟨n, topology⟩
                topology ∈ {⊥, hierarchical, pipeline, hub-spoke,
                            peer, graph, blackboard, swarm}        (⊥ iff n = 1)

L5            ⊆ {pre-input, in-loop, post-output, cross-cutting,
                 aggregation}

Note: aggregation patterns commonly named at L4 (debate, consensus,
market-based) are compositions: L4 peer or graph topology + L5
aggregation rule. See §3.6.
```
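The constraints in the spec above can be checked mechanically. A sketch, assuming a plain-dict encoding of the tuple (the key names are this sketch's convention; the domain sets and the two iff-constraints are copied from the spec):

```python
# Sketch: well-formedness check for an AAWDS tuple, implementing the
# Appendix A constraints: L1 non-empty; L2 = ⊥ iff L3 = static;
# topology = ⊥ iff n = 1. ⊥ is encoded as None.
TRIGGERS    = {"sync", "event", "batch", "continuous"}
ADAPTATIONS = {"stateless", "memory", "self-improving"}
L1_SET      = {"tools", "code", "retrieval", "memory_io", "environment"}
L2_SET      = {"linear", "reactive", "plan-first", "reflective",
               "branching", "stochastic", "utility"}
L3_SET      = {"static", "mixed", "model-directed"}
TOPOLOGIES  = {"hierarchical", "pipeline", "hub-spoke", "peer",
               "graph", "blackboard", "swarm"}
L5_SET      = {"pre-input", "in-loop", "post-output", "cross-cutting",
               "aggregation"}

def well_formed(s: dict) -> bool:
    n, topology = s["L4"]
    return (s["trigger"] in TRIGGERS
            and s["adaptation"] in ADAPTATIONS
            and bool(s["L1"]) and s["L1"] <= L1_SET          # non-empty subset
            and s["L3"] in L3_SET
            and (s["L2"] is None) == (s["L3"] == "static")   # ⊥ iff static
            and (s["L2"] is None or s["L2"] in L2_SET)
            and n >= 1 and (topology is None) == (n == 1)    # ⊥ iff n = 1
            and (topology is None or topology in TOPOLOGIES)
            and s["L5"] <= L5_SET)

# Example: a plain retrieval-augmented workflow, single-agent, static.
rag_workflow = {"trigger": "sync", "adaptation": "stateless",
                "L1": {"tools", "retrieval"}, "L2": None, "L3": "static",
                "L4": (1, None), "L5": {"post-output"}}
assert well_formed(rag_workflow)
```

Changing one slot without honoring the conditionality — say, setting `L3` to `model-directed` while leaving `L2` as `None` — makes the tuple ill-formed, which is the point of writing the constraints as code.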

## Appendix B — Selection protocol checklist

1. Envelope: trigger? adaptation horizon?
2. L1: action space? observation space? (be specific; list tools and environments)
3. L3: static, mixed, or model-directed? (default static; escalate only on measured need)
4. L2 (if applicable): linear / reactive / plan-first / reflective / branching / stochastic / utility?
5. L4: cardinality? topology if > 1? (default 1; escalate only on measured need)
6. L5: which interception points? sized to what risk?

## Appendix C — Common composition recipes

| If you need to...                                   | Compose...                                       |
|------------------------------------------------------|--------------------------------------------------|
| Reduce repeated identical failures                  | L2: reflective + Env.adaptation: memory         |
| Improve output quality with measurable criteria     | L3: mixed + L5: in-loop critic                   |
| Decompose dynamic, unpredictable subtasks           | L3: mixed + L4: hierarchical                     |
| Stabilize stochastic answers                        | L2: stochastic                                   |
| Long, well-defined plans                            | L2: plan-first + L3: mixed                       |
| High-stakes ambiguous decisions                     | L4: peer (n>1) + L5: voting aggregator + post-output validator |
| Reactive systems with external triggers             | Env.trigger: event + L3: static                  |
| Continuous monitoring                               | Env.trigger: continuous + Env.adaptation: memory |

---

*End of working paper.*

---

## Revision history

**v0.2 (current).** Revisions following analytical review of v0.1.

- **Pattern count reconciled.** Abstract previously cited "~25 widely cited patterns"; §5.3 cited "~38 patterns examined"; the §5.2 table contains 37 entries. All references aligned to 37.
- **Counting rule clarified.** §1.3's "six independent decisions" was inconsistent with §4.1's seven-slot tuple. Replaced with explicit description of the tuple structure (envelope plus five execution layers; envelope decomposes into trigger and adaptation).
- **Consensus / debate asymmetry fixed.** v0.1 treated consensus committee as a primitive L4 topology while treating debate as a composition (L4 peer + L5 aggregator). The asymmetry was unjustified — both topologies require an aggregator. §3.6 and §4.1 refined to make aggregation a consistent L5 concern; consensus, debate, and market-based are now all classified as compositions of a peer topology with an L5 aggregator. §5.2 compression table updated accordingly.
- **Strict dependency caveat added.** §3.8 now acknowledges three edge cases (dynamically-extending capability, self-modifying agents, joint-cognition multi-agent systems) where the strict L1-below-L3 ordering strains. Two are partially absorbed; one is a real gap (§8.6).
- **Adversarial stress test added.** New §5.4 attempts to break the framework with patterns deliberately chosen to violate its assumptions. Results: self-modifying agents and streaming agents are partially absorbed with loss of structural strength; joint-cognition multi-agent systems are not absorbed. v0.1's compression test was non-adversarial; this addition addresses that.
- **Empirical validation discussion sharpened.** §8.4 now specifies what empirical validation could mean for a structural framework (pattern non-absorption testing and failure-mode localization accuracy on post-mortem corpora) and acknowledges that both are non-trivial to design.
- **Memory boundary clarified.** §8.5 now states an explicit decision rule (within-run = L1; cross-run = Envelope) and acknowledges Reflexion's cross-attempt critique store as a genuine boundary case admitting two defensible classifications.
- **New limit sections added.** §8.6 (joint cognition), §8.7 (human participation, with Hybrid Human-Agent flagged as contested compression in §5.2).
- **Envelope framing tightened.** §3.2 reframed: the envelope is *deployment-level* rather than *temporally outside* execution. Decisions are made at deploy time but have within-run consequences.
- **Comparative critique sharpened.** New §9.5 with a worked cross-framework comparison on the placement of Reflection (Anthropic vs. Ng vs. ReAct lineage vs. AAWDS), demonstrating that AAWDS makes visible a disagreement the prior frameworks conceal. New §9.6 acknowledges where AAWDS does *not* improve on prior frameworks.

**Items deliberately not changed.** The five-layer + envelope structure is unchanged. The selection logic in §7 is unchanged. The failure-mode localization heuristic in §6 is unchanged. The validation method (pattern compression) is unchanged in spirit but supplemented with the adversarial test rather than replaced.

**v0.1.** Initial release.
