# Dot Technical Whitepaper v0.3

## Pseudonymous private agentic compute for builders

Version: 0.3  
Target surface: docs.usedot.xyz  
Audience: Base builders, privacy engineers, AI infrastructure operators, agent researchers, and pseudonymous users who expect technical answers instead of slogans.

## Abstract

Dot is designed as a pseudonymous private agentic compute network: a product and infrastructure stack where users can access direct open-weight AI models, search-augmented answers, and a frontier-style coding agent without converting their identity, prompts, codebase, or research intent into durable platform data.

Most consumer AI products collapse into the same pattern: a chat interface, a hosted frontier API, a policy layer controlled by the model provider, conversation retention for convenience, and account identity that gradually becomes an analytics profile. Dot rejects that shape. Dot separates identity from conversation, conversation from storage, model execution from upstream providers, and code generation from passive text. DotChat is the conversational interface. DotCode is the actuation layer. The infrastructure beneath them is optimized for no-retention private inference, narrow harm-boundary policy, wallet-native access, and verified local or isolated execution.

The thesis is that the next meaningful privacy frontier is not just "private chat." It is private capability. A model that answers sensitive questions but cannot inspect a repo, create files, run commands, test its work, or produce proof remains trapped behind glass. A coding agent that can execute but sends every prompt and repository trace to a cloud model provider is not private. Dot joins these surfaces: private direct AI for thought, DotCode for action, and a runtime that keeps user data out of training loops and durable logs.

This paper specifies Dot as a target production architecture. It intentionally avoids fabricated benchmark numbers. Where performance is discussed, the paper defines service-level objectives, evaluation methods, and external systems references. The central claims are architectural: on-prem model serving, pseudonymous identity, no server-side conversation storage, no training on user data, refusal audit without raw prompt retention, and an execution substrate that makes coding requests materially more useful than chat-only systems.

## Formal model and notation

This section defines Dot as a system rather than a brand surface. The notation is intentionally explicit so the privacy, routing, and agent claims can be evaluated as engineering constraints.

Let:

- `U` be the set of users.
- `I` be the set of identity handles: anonymous session ids, wallet addresses, provider ids, and usernames.
- `P` be the set of raw prompts.
- `Y` be the set of raw model completions.
- `M` be the set of available models.
- `R` be the set of routes: `chat`, `search`, `code`, `policy`, `eval`.
- `A` be the set of tool or agent actions.
- `D` be the set of durable storage systems.
- `E` be the set of runtime events emitted by the system.
- `C` be the set of policy classes: `allowed`, `self_harm`, `harm_others`.

A Dot request is a tuple:

```text
x = (u, i, p, r_hint, private_mode, tau)

where:
  u in U is the human user,
  i in I is the selected identity handle,
  p in P is the raw prompt body,
  r_hint in R union {unknown} is the client hint,
  private_mode in {0,1} selects the private path,
  tau is the request timestamp bucket.
```

The router computes:

```text
rho(x) -> (c, r, m, z)

where:
  c in C is the policy class,
  r in R is the route,
  m in M is the selected model,
  z is the tool plan or null.
```

The response is:

```text
Phi(x) -> (y, e_1, ..., e_k)

where:
  y in Y is the completion or final artifact summary,
  e_1..e_k in E are observable state events such as queued, searching, generating, tool_running, verified.
```

### Formal privacy invariant

Define `store(d, q)` as true when durable store `d in D` persists object `q`. Define `raw(q)` as true when `q` is raw user prompt text or raw model completion text.

Dot's no-retention target for private mode is:

```text
For every private request x = (..., p, ..., private_mode=1, ...):

  (1) for all d in D: store(d, p) = false
  (2) for all d in D: store(d, y) = false
  (3) for all d in D_train: store(d, p) = false and store(d, y) = false
```

This says raw prompts and completions are neither transcript data nor training data. The system may store content-free metadata:

```text
mu(x) = (request_id, route, model_id, policy_class, token_count,
         timestamp_bucket, latency_bucket, error_class)
```

The privacy boundary is therefore not "Dot never stores anything." That would make billing, rate limits, abuse control, and reliability impossible. The boundary is narrower and more precise: Dot should not store the raw conversational content by default.

### Linkability objective

A system can avoid storing raw prompts and still leak privacy by joining identity and sensitive metadata too aggressively. Define a linkage event:

```text
link(i, p) = 1 if there exists durable store d such that
             store(d, i) = true and store(d, p) = true and
             d admits a join between i and p.
```

Dot's private mode requires:

```text
E[link(i, p) | private_mode=1] = 0
```

For metadata, Dot should minimize joinability:

```text
minimize  J = sum_d w_d * I[d stores identity handle and fine-grained request metadata]
```

where `w_d` is a sensitivity weight for each durable system. Coarse timestamp buckets, salted prompt hashes, route ids, and aggregate counters reduce `J`; raw text, full IP logs, wallet-prompt joins, and per-message analytics increase it.

### System objective

Dot optimizes for useful capability under privacy and policy constraints. Let:

- `Q(x,y,a)` be user utility or task quality.
- `L(x)` be latency cost.
- `G(x)` be GPU and tool cost.
- `H(x)` be policy risk.
- `N(x)` be privacy exposure.
- `F(x)` be false refusal cost.

The target routing policy `pi` can be described as:

```text
pi* = argmax_pi E[ Q(x,y,a)
                   - lambda_L * L(x)
                   - lambda_G * G(x)
                   - lambda_H * H(x)
                   - lambda_N * N(x)
                   - lambda_F * F(x) ]

subject to:
  raw prompt retention = 0
  raw completion retention = 0
  no user data training by default
  refuse self-harm facilitation
  refuse material harm-to-others
```

This objective makes the tradeoff explicit. Dot should not maximize safety refusal at the cost of answering almost nothing. It should not maximize raw helpfulness at the cost of enabling harm. It should not maximize latency by forwarding to every cloud API. It should not maximize personalization by building a hidden transcript memory. The system target is useful private capability under hard retention constraints.

### Boundary classifier

Let a boundary classifier estimate:

```text
g_theta(p, context) = [Pr(allowed), Pr(self_harm), Pr(harm_others)]
```

The decision function is:

```text
decision(p) =
  refuse        if Pr(self_harm)   >= alpha_self
  refuse        if Pr(harm_others) >= alpha_harm
  allow         otherwise
```

The threshold pair `(alpha_self, alpha_harm)` should be tuned on two separate evaluation sets:

```text
S_harmful      = prompts that materially facilitate self-harm or harm to others
S_lawful_hard  = sensitive but lawful prompts that should be answered
```

The two error rates are:

```text
false_allow  = |{p in S_harmful     : decision(p) = allow}| / |S_harmful|
false_refuse = |{p in S_lawful_hard : decision(p) = refuse}| / |S_lawful_hard|
```

Mainstream refusal systems often minimize `false_allow` by tolerating a high `false_refuse` rate. Dot's policy target is harder: keep `false_allow` low while aggressively reducing `false_refuse` for lawful adult use. That is the technical meaning of a narrow boundary.

## Algorithms

### Algorithm 1: volatile private inference

```text
Input: request x = (u, i, p, r_hint, private_mode=1, tau)
Output: streamed completion y or structured refusal

1: request_id <- random_uuid()
2: emit(request_id, "received")
3: c <- boundary_classifier(p)
4: write_audit_metadata(request_id, c, route_hint, timestamp_bucket, hash_salted(p))
5: if c in {self_harm, harm_others} then
6:     emit(request_id, "policy_boundary")
7:     return safe_boundary_response(c)
8: end if
9: r <- route_classifier(p, r_hint)
10: m <- model_scheduler.select_model(route=r, private_mode=true)
11: emit(request_id, "queued", route=r, model=m.public_name)
12: y_stream <- inference_worker.generate(model=m, prompt_buffer=p)
13: for token in y_stream do
14:     emit(request_id, "token", token)
15: end for
16: zeroize(prompt_buffer)
17: discard(raw_completion_buffer)
18: write_metrics(request_id, route=r, model=m.id, token_count, latency_bucket)
19: return done(retained_prompt=false, retained_completion=false)
```

The critical lines are 16-18. The content buffers die; the metrics survive.

### Algorithm 2: model routing and admission control

```text
Input: request x, route r, model pool M
Output: admitted model m or retry/defer signal

1: candidates <- {m in M | m.private_mode_allowed and r in m.routes}
2: for each m in candidates do
3:     kv_required <- estimate_kv_blocks(prompt_tokens(x), max_new_tokens(x), m)
4:     latency_pred <- predict_ttft(m, queue_depth(m), kv_required)
5:     cost_pred <- estimate_cost(m, prompt_tokens(x), max_new_tokens(x))
6:     score[m] <- utility_model(r, m)
7:                 - beta_1 * latency_pred
8:                 - beta_2 * cost_pred
9:                 - beta_3 * saturation(m)
10: end for
11: m_star <- argmax_m score[m]
12: if kv_required(m_star) > free_kv_blocks(m_star) then
13:     return defer_or_fallback(route=r)
14: end if
15: admit(x, m_star)
16: return m_star
```

This is a systems policy, not just a model preference list. It prevents "use the biggest model for everything" from becoming an outage.

### Algorithm 3: DotCode observe-plan-act-verify loop

```text
Input: handoff h, workspace W
Output: artifact report R

1: read h from local handoff file
2: scope <- determine_workspace_scope(W, h)
3: obs_0 <- inspect_repo(scope)
4: plan <- generate_plan(h, obs_0)
5: emit("plan", plan)
6: repeat
7:     a_t <- choose_action(plan, current_observation)
8:     if a_t mutates files then
9:         apply_patch(a_t)
10:        record_changed_files()
11:    else if a_t runs command then
12:        result <- run_command(a_t)
13:        redact_secrets(result)
14:    else if a_t verifies browser then
15:        result <- capture_screenshot_and_console()
16:    end if
17:    current_observation <- inspect(result)
18:    plan <- repair_plan_if_needed(plan, current_observation)
19: until acceptance_criteria_met or blocked
20: R <- summarize(files_changed, commands_run, checks, screenshots, residual_risk)
21: return R
```

This loop is the direct descendant of the ReAct idea: reasoning and action interleaved. The difference is that DotCode's environment is a real workspace, not a toy API.

## 1. System thesis

Dot is not a model wrapper. Dot is a private execution runtime around models.

The model layer is replaceable. The durable product value sits in:

- Identity minimization.
- Volatile prompt handling.
- On-prem inference.
- Search and tool orchestration.
- Narrow policy containment.
- Agentic execution.
- Verification and artifact proof.
- Wallet-native access control.

In a conventional hosted AI product, the model provider often controls several layers at once: model weights, inference logs, account identity, safety policy, tool permissions, memory, and future training paths. That bundling is convenient for product velocity but weak for user sovereignty. Dot splits the layers so that each can be reasoned about independently.

Dot's design objective is:

```text
useful AI capability
  minus mandatory civil identity
  minus conversation retention
  minus upstream private prompt transit
  minus broad paternalistic refusal policy
  plus verified execution
  plus wallet-native access
```

The Base ecosystem is a natural audience because it already treats wallets as programmable identity, values credible neutrality, and contains builders who understand the difference between a surface-level app and a system with real trust boundaries. Dot maps AI usage onto that same design language: pseudonymous, composable, payment-aware, and technically inspectable.

## 2. Product surface

Dot exposes two first-class products.

### 2.1 DotChat

DotChat is the direct private conversation surface. It provides:

- Anonymous sessions.
- Wallet sign-in.
- Optional third-party provider sign-in.
- Username selection without mandatory legal identity.
- Open-weight uncensored chat models.
- Search-augmented answers when current facts are required.
- Transparent tool-state display: thinking, searching, routing, executing, waiting, streaming.
- Narrow harm-boundary policy.
- DotCode handoff when the request becomes a build task.

DotChat's role is not only to answer. It classifies intent, routes work, and exposes enough state that the user knows whether the model is answering from weights, retrieval, tool output, or a handoff.

### 2.2 DotCode

DotCode is the private coding agent. It exists because code is not a paragraph. Code is a living artifact that requires filesystem context, command execution, tests, screenshots, logs, and repair loops.

DotCode can:

- Inspect repository structure.
- Read package manifests, build scripts, configs, and source files.
- Create new project folders.
- Patch files in small reviewable changes.
- Run shell commands.
- Install dependencies when the project requires them.
- Run tests, builds, linters, type checks, and dev servers.
- Use browser verification for web products.
- Collect proof: command output, screenshots, changed files, and residual risk.

The product-level differentiation is direct: privacy-first chat competitors offer private chat and API surfaces; Dot adds a private coding agent that can produce working artifacts. Dot is not competing only on model personality. Dot competes on the execution loop.

## 3. Design principles

Dot is governed by five architectural principles.

### 3.1 Identity is optional

Users can use Dot anonymously, with a wallet, or through a third-party provider. The system should not require civil identity for ordinary AI access. A wallet can pay, unlock tiers, hold attestations, or provide continuity without forcing the user to reveal a legal profile.

### 3.2 Prompts are volatile execution material

Prompts are not product data. They are request bodies used to compute an answer. Dot should route them, execute against them, stream results, and discard them. If a user explicitly opts into local continuity, that continuity belongs client-side or in an encrypted user-controlled vault, not in a default server-side transcript store.

### 3.3 Models are swappable

The model is a component, not the company. Dot can route across Dolphin-style Mistral derivatives, Qwen-family models, DeepSeek-family models, Llama-family models, GLM-family models, and specialized coder models. The serving layer should assume model churn.

### 3.4 Policy is narrow and explicit

Dot blocks self-harm facilitation and content that materially helps harm others. Outside that boundary, the default behavior is to answer directly, especially for adult users asking for lawful research, personal analysis, technical education, privacy engineering, financial planning, security defense, or software development.

### 3.5 Capability must produce evidence

The strongest answer is not prose. It is a working artifact plus proof. DotCode should make artifacts observable: files changed, commands run, tests passed, screenshots captured, and limitations stated.

## 4. System architecture

Dot is divided into five planes.

```text
                 +-----------------------------+
                 |        Dot web client       |
                 |  anonymous / wallet / auth  |
                 +---------------+-------------+
                                 |
                                 v
+----------------+     +---------+----------+      +------------------+
| identity plane | --> | request/session    | ---> | policy boundary  |
| wallet/provider|     | envelope           |      | narrow classifier|
+----------------+     +---------+----------+      +---------+--------+
                                 |                           |
                                 v                           v
                       +---------+----------+      +---------+--------+
                       | model router       | ---> | refusal audit    |
                       | task, tier, health |      | no raw prompts   |
                       +----+-----------+---+      +------------------+
                            |           |
                            |           +----------------------+
                            v                                  v
                +-----------+-----------+          +-----------+-----------+
                | on-prem inference     |          | DotCode handoff      |
                | chat/code/search      |          | terminal/agent loop  |
                +-----------+-----------+          +-----------+-----------+
                            |                                  |
                            v                                  v
                 streamed answer                     verified artifact
```

### 4.1 Identity plane

The identity plane binds a request to access rights, not to a durable prompt profile. It supports:

- Anonymous sessions for immediate use.
- Wallet sessions for pseudonymous continuity and billing.
- Provider sessions for conventional users.
- Username selection as a product layer, not a legal identity requirement.

The core invariant:

```text
wallet address != civil identity
username != legal identity
session id != conversation archive
prompt body != analytics entity
```

Wallet-native identity matters for Base because it enables access without the familiar Web2 funnel: email, phone number, name, payment card, tracking scripts, and behavioral profile. A wallet can carry payment rights and reputation without forcing a real-world identity binding.

### 4.2 Inference plane

The inference plane executes open models on Dot-controlled infrastructure. The private mode is designed so that user prompts and model outputs do not leave Dot's servers for upstream model API processing.

The target inference plane contains:

- Chat workers for direct response models.
- Code workers for coding-specialized models and agent planning.
- Search-grounded answer workers.
- Embedding and rerank workers.
- Evaluation workers for offline synthetic benchmarks.
- Admission control and queue scheduling.
- Per-model health, latency, and saturation metrics.

### 4.3 Policy plane

The policy plane is intentionally narrow. It classifies whether a request falls into self-harm facilitation, harm-to-others assistance, or allowed content. It does not treat every sensitive domain as prohibited. The classifier should operate before and after generation when necessary, but the audit data must not retain raw prompts or raw completions by default.

### 4.4 Agent plane

The agent plane is DotCode. It converts coding intent into a structured task, opens a private local or isolated execution environment, inspects the workspace, edits files, runs commands, verifies output, and reports the result.

### 4.5 Observability plane

The observability plane measures the service without capturing user conversations. It tracks health, queue depth, latency, refusal-class rates, tool failures, search failures, and agent success metrics. It must never become a hidden transcript database.

## 5. Trust boundaries and no-retention privacy

The strongest privacy property is not encryption at rest. It is non-retention.

Encryption is useful, but encrypted archives are still archives. They can be exfiltrated, subpoenaed, misconfigured, or retained past their useful life. Dot's default stance is that conversation content should not become durable server-side state.

### 5.1 Request lifecycle

```text
1. user submits prompt
2. browser sends request envelope
3. router classifies task and policy boundary
4. model worker receives prompt in volatile memory
5. answer streams back to browser
6. request buffer is discarded
7. operational metrics remain without raw prompt or completion text
```

### 5.2 Data classes

| Data class | Example | Default retention | Purpose |
| --- | --- | --- | --- |
| Raw prompt | User message body | None | Inference input only |
| Raw completion | Model output text | None | Response stream only |
| Session envelope | Session id, route, tier | Short-lived | Routing and rate limits |
| Identity handle | Wallet, provider id, username | Account lifetime | Access and billing |
| Refusal audit | Policy class, model id, prompt hash | Aggregated | Boundary quality control |
| Ops metrics | Queue depth, token latency, errors | Aggregated | Reliability |
| User reports | Explicitly submitted issue | Explicit window | Debugging and appeals |

### 5.3 Refusal audit without prompt retention

Dot still needs to know whether the boundary classifier is failing. The audit path stores metadata, not message content.

```json
{
  "event_id": "random_uuid",
  "route": "chat | code | search",
  "model_id": "dot-chat-primary",
  "policy_class": "allowed | self_harm | harm_others",
  "decision": "allow | refuse | safe_complete",
  "prompt_hash": "blake3(prompt + rotating_salt)",
  "timestamp_bucket": "2026-05-28T01",
  "raw_prompt": null,
  "raw_completion": null
}
```

The prompt hash is useful for deduplicating repeated policy failures during a rotation window. It is not a conversation record. Salt rotation limits correlation over time. The audit store answers questions such as "did the classifier suddenly over-refuse financial questions?" without preserving what users asked.

### 5.4 Threat model

Dot designs against:

- Prompt disclosure through logs.
- Raw transcript retention by accident.
- Upstream model provider visibility.
- Over-broad analytics instrumentation.
- Wallet identity linkage to conversation text.
- Prompt injection from web retrieval.
- Agent secret leakage.
- Malicious dependency installation.
- Code-agent writes outside the intended workspace.
- Insider access to sensitive operational systems.

Dot does not claim mathematical anonymity against every network adversary. It reduces trust edges and durable data surfaces. The practical privacy improvement comes from fewer processors, fewer logs, less storage, and a stricter separation between identity and content.

### 5.5 Proof sketches for retention invariants

The following are proof sketches, not cryptographic proofs. They define the implementation conditions required for the privacy claim to be true.

**Invariant I1: raw prompt non-retention.**

Claim:

```text
For private request x with prompt p:
  for all durable stores d in D, store(d,p) = false.
```

Sufficient implementation conditions:

1. Edge ingress does not log request bodies.
2. Application logs do not serialize prompt bodies.
3. Queue payloads containing prompts are memory-resident or short-lived encrypted buffers with deletion on completion.
4. Inference workers do not write prompt buffers to disk.
5. Error reporters scrub request bodies.
6. Metrics labels cannot contain raw prompt substrings.
7. Debug capture is disabled by default and requires explicit user opt-in plus retention TTL.

Proof sketch:

The prompt can only persist if some component on the request path writes it to a durable store. The request path is finite: ingress, router, policy, queue, inference worker, tool worker, error reporter, metrics. If each component either never receives the raw prompt or receives it only in volatile memory and is configured not to write it to any durable store, then no `d in D` can satisfy `store(d,p)=true`. The invariant reduces to a coverage problem: every component that observes `p` must be in the no-write set.

**Invariant I2: prompt is not training data.**

Claim:

```text
For private request x with prompt p:
  p not in D_train.
```

Sufficient implementation conditions:

1. Training and fine-tuning datasets are built from synthetic tasks, licensed public data, internal eval traces, or explicit opt-in submissions.
2. No production prompt stream feeds dataset builders.
3. Dataset lineage records source type.
4. Dataset build jobs fail closed if source lineage is unknown.

Proof sketch:

If the only ingestion paths into `D_train` exclude production prompt streams, and every dataset item requires lineage metadata that is not `production_private_prompt`, then `p` cannot enter `D_train` unless an ingestion control is bypassed. The testable property is lineage completeness.

**Invariant I3: identity-content unlinkability by default.**

Claim:

```text
For private request x = (u,i,p,...):
  no durable store admits a direct join between i and p.
```

Sufficient implementation conditions:

1. Account store contains identity and billing data but no raw prompt.
2. Prompt body is not stored in transcript databases.
3. Refusal audit stores salted prompt hash and coarse timestamp, not raw prompt.
4. Salt rotation prevents long-horizon hash correlation.
5. Operational metrics use request ids without raw content.

Proof sketch:

A direct join requires a durable relation containing both identity and raw content, or two durable relations connected by a stable key where one contains identity and the other contains raw content. Conditions 1-5 remove raw content from durable stores and reduce stable cross-store joins. Therefore a default direct join between identity and raw prompt cannot be evaluated because the prompt relation is absent.

## 6. On-prem inference architecture

Dot's private mode runs model inference on Dot-controlled servers rather than forwarding private prompts to third-party frontier APIs. The phrase "on-prem" here means operational control over the serving stack: GPUs, model weights, scheduler, telemetry policy, network boundaries, and log configuration.

### 6.1 Serving topology

```text
            +---------------------+
            | edge ingress        |
            | TLS, rate limits    |
            +----------+----------+
                       |
                       v
            +----------+----------+
            | request router      |
            | task/policy/tier    |
            +-----+----------+----+
                  |          |
                  |          +-------------------+
                  v                              v
      +-----------+-----------+      +-----------+-----------+
      | chat scheduler        |      | code scheduler        |
      | continuous batching   |      | long horizon slots    |
      +-----------+-----------+      +-----------+-----------+
                  |                              |
                  v                              v
      +-----------+-----------+      +-----------+-----------+
      | GPU pool A             |     | GPU pool B             |
      | chat/search models     |     | coder/reasoning models |
      +-----------+-----------+      +-----------+-----------+
                  |                              |
                  v                              v
            stream tokens                 agent actions
```

### 6.2 Scheduler objectives

The scheduler optimizes for:

- Time to first token.
- Tokens per second per active user.
- KV-cache utilization.
- Queue fairness.
- Context length admission.
- Model health.
- User tier.
- Tool requirements.
- Isolation between chat and long-horizon code tasks.

The scheduler should prevent a single long coding request from starving short chat requests. Chat, search-grounded answers, and code-agent planning need different queue classes.

### 6.3 KV-cache capacity model

For a decoder-only transformer, each generated or prefilling token consumes key and value cache memory at every layer. A useful approximation is:

```text
bytes_per_token(m) ~= 2 * L_m * H_m * d_head_m * b_m

where:
  2          = key plus value tensors
  L_m        = number of transformer layers
  H_m        = number of attention heads or KV heads
  d_head_m   = head dimension
  b_m        = bytes per cache scalar
```

For a request `x` with `n_in` prompt tokens and `n_out` generated tokens:

```text
kv_bytes(x,m) ~= (n_in + n_out) * bytes_per_token(m)
```

For model `m` on GPU worker `g`, the admission constraint is:

```text
sum_{x in active(g)} kv_bytes(x,m)
  <= vram(g) - weights(m) - runtime_overhead(g) - reserve(g)
```

This is why context length and concurrency are coupled. A service cannot promise unlimited simultaneous long-context users on a fixed GPU. The scheduler must either admit fewer long requests, use smaller models, use quantized KV/cache modes where acceptable, spill/defer, or separate workloads by queue class.

PagedAttention-style serving makes this constraint less wasteful by allocating KV cache in blocks rather than requiring large contiguous slabs. Dot should use memory-aware serving because wasted KV cache is wasted concurrency.

### 6.4 Scheduling objective

Let `B_t` be the active batch at decode step `t`. Let:

- `TTFT(x)` be time to first token.
- `TPS(x)` be decode throughput for request `x`.
- `SLA(x)` be the tier-weighted latency target.
- `C_gpu(x)` be marginal GPU cost.
- `Q_route(x)` be route priority.

The scheduler can be framed as:

```text
maximize over admitted batch B_t:

  sum_{x in B_t} [ Q_route(x)
                   - gamma_1 * max(0, TTFT_pred(x) - SLA(x))
                   - gamma_2 * C_gpu(x)
                   - gamma_3 * starvation_age(x) ]

subject to:
  KV capacity constraint
  per-route concurrency limits
  policy queue latency bound
  code-agent isolation bound
```

The objective is not purely throughput. A system that maximizes aggregate tokens/sec can still feel broken if short chat requests sit behind long code runs. Dot needs throughput, fairness, and visible progress.

### 6.5 Serving engines

Dot's serving layer should be engine-agnostic:

- vLLM-style serving for high-throughput GPU batching, PagedAttention, prefix caching, and OpenAI-compatible serving.
- Hugging Face Text Generation Inference or Transformers continuous batching where its operational profile fits.
- llama.cpp/GGUF lanes for quantized models, smaller nodes, lower operational complexity, and fast experimentation.
- Specialized runtimes for embedding, reranking, and evaluation.

The vLLM/PagedAttention paper reports 2-4x throughput improvements over compared serving baselines at similar latency on evaluated workloads. Dot should treat this as evidence for memory-aware scheduling, not as a substitute for Dot's own benchmarks.

### 6.6 Hardware lanes

| Lane | Purpose | Typical hardware | Notes |
| --- | --- | --- | --- |
| Dev lane | Model experiments and regression tests | RTX 4090/5090 class nodes | Good for quantized models and local iteration |
| MVP lane | Early private chat and code routing | 2-4 GPU nodes, 24-48 GB VRAM each | Separate chat and code queues |
| Throughput lane | Larger models and concurrent sessions | L40S/A100/H100 class nodes | Continuous batching and model-parallel options |
| Isolation lane | Agent sandboxes and build execution | CPU + optional GPU workers | Protect inference workers from generated code |
| Eval lane | Offline replay with synthetic prompts | Any reproducible GPU/CPU mix | No user prompt replay by default |

The key operational point is separation. A GPU used for high-throughput chat should not also be an untrusted generated-code sandbox. Code execution belongs in a different trust zone.

## 7. Model layer and routing

Dot uses open models because open models can be hosted, quantized, inspected, swapped, benchmarked, and routed under Dot's own policy. Closed frontier APIs may be useful in non-private modes, but they cannot be the private path if prompts leave Dot's trust boundary.

### 7.1 Model families

Dot's target pool can include:

- Dolphin-style Mistral derivatives for direct uncensored instruction following.
- Qwen-family models for multilingual, code, and reasoning-heavy tasks.
- DeepSeek-family models for code synthesis and planning.
- Llama-family models for broad open-weight compatibility.
- GLM-family models where latency/capability tradeoffs fit.
- Small classifiers for policy, routing, and tool selection.
- Embedding and rerank models for retrieval.

The model name shown to a user is a product abstraction. The internal router may move between weights as long as the product contract remains stable: private path, no training, direct answers, and transparent tool state.

### 7.2 Router inputs

The router considers:

- Task class: chat, code, search, policy, summarization, long reasoning.
- Sensitivity: ordinary, sensitive lawful, self-harm, harm-to-others.
- Context length.
- Expected output length.
- Need for retrieval.
- Need for DotCode.
- User tier.
- Queue depth.
- Model health.
- Cost and latency budget.
- Whether private mode is required.

### 7.3 Routing policy

```text
if policy_class in {self_harm, harm_to_others}:
    return safe_boundary_response

if task_class == "code_build":
    create_dotcode_handoff()
    route_to_code_agent_planner()

if task_class == "current_fact":
    run_search()
    route_to_grounded_answer_model()

if task_class == "sensitive_lawful":
    route_to_direct_private_model()
    preserve narrow policy boundary

if context_tokens > model_context_limit:
    summarize_or_chunk_with_user_visible_state()

else:
    route_to_default_private_chat_model()
```

### 7.4 Model routing classes

| Route | Primary objective | Model profile | Tooling |
| --- | --- | --- | --- |
| Direct chat | Low-latency answer | General uncensored instruction model | None or light search |
| Grounded answer | Current factual accuracy | Chat model + retrieval | Search, citations, rerank |
| Sensitive lawful | Direct adult answer | Policy-aware direct model | Narrow boundary check |
| Code build | Working artifact | Coder/planner model | DotCode tools |
| Long planning | Coherent multi-step plan | Larger reasoning model | Optional scratchpad, no raw storage |
| Policy | Boundary classification | Compact classifier | Refusal audit metadata |

## 8. DotChat runtime

DotChat is not a static textarea. It is a routing and visibility layer.

### 8.1 Runtime states

DotChat should show:

- Model live/offline.
- Request queued.
- Policy boundary check.
- Search started.
- Search source count.
- Model streaming.
- DotCode handoff ready.
- Tool failure.
- Retry state.

Visibility is a trust feature. If a user waits 110 seconds with no status, the product feels broken even if the model is still generating. Dot should expose progress without exposing hidden reasoning text.

### 8.2 Search-augmented generation

DotChat decides to search when the prompt requires:

- Current events.
- Provider pricing.
- Legal or regulatory freshness.
- Software package versions.
- API documentation.
- Server availability.
- Recently published benchmarks.
- Market or ecosystem facts.

Search output must be treated as untrusted input. The answer model should cite sources and ignore instructions contained inside retrieved pages that attempt to control the assistant.

### 8.3 Conversation continuity without server transcripts

Dot can support continuity without default server-side transcripts:

- Browser-local encrypted history for users who opt in.
- User-controlled export/import.
- Wallet-linked encrypted vault as a future option.
- Server-side conversation storage disabled by default.

This distinction matters. A product can be usable without turning every private question into a permanent server record.

## 9. DotCode runtime

DotCode is the execution engine. Its value comes from the observe-plan-act-verify loop.

```text
intent
  |
  v
structured handoff
  |
  v
observe repo -> plan -> patch -> execute -> inspect -> repair -> verify
  |                                                       |
  +------------------------- feedback loop ---------------+
```

### 9.1 Handoff lifecycle

1. DotChat detects a build, coding, automation, frontend, infrastructure, or repository request.
2. DotChat rewrites the request into a precise implementation prompt.
3. A local handoff file is created instead of dumping a giant prompt into the terminal.
4. The browser asks permission to open DotCode.
5. DotCode starts in the target workspace.
6. DotCode reads the handoff, creates or inspects the project, and begins execution.
7. DotCode reports changed files, commands run, tests, screenshots, and limitations.

### 9.2 Why this beats chat-only code

Chat-only code generation produces unverified text. DotCode produces an artifact and checks it.

| Capability | Chat-only model | DotCode |
| --- | --- | --- |
| Inspect existing repo | Usually no | Yes |
| Modify files | No | Yes |
| Run tests | No | Yes |
| See build failures | No | Yes |
| Repair after failure | Limited | Yes |
| Verify browser UI | No | Yes |
| Produce changed-file summary | Manual | Yes |
| Keep work local | Depends on provider | Target default |

### 9.3 Agent safety boundaries

DotCode must be powerful but not careless. The default controls:

- Never print or commit secrets.
- Prefer `.env.example` over real credentials.
- Ask only when a missing decision blocks implementation.
- Avoid unrelated file changes.
- Show commands run.
- Keep patches reviewable.
- Use dry-run or simulation modes for money, keys, networks, and external APIs.
- Scope generated projects to a new folder unless the user selected an existing repo.

### 9.4 Frontend artifact rule

For web projects, DotCode should not generate empty SaaS shells. It should source or create real visual content: a hero video, hero image, canvas animation, SVG scene, product screenshot, or local media. It must verify that the media loads and that the layout does not clip on mobile and desktop.

This is a product-quality rule, not decoration. A builder judging an AI coding agent wants evidence that the agent can produce a real surface, not placeholder panels.

## 10. Narrow harm-boundary policy

Dot's policy is intentionally smaller than mainstream refusal systems.

Dot refuses:

- Self-harm facilitation.
- Operational assistance to harm others.

Dot should answer:

- Lawful technical education.
- Privacy engineering.
- Security defense and authorized testing.
- Tax and legal analysis framed as lawful compliance or optimization.
- Trading automation with user-controlled risk limits and no market manipulation.
- Political, controversial, or adversarial analysis.
- Adult users asking direct questions in sensitive domains.

The system must avoid category collapse. Category collapse happens when a model treats any mention of a sensitive domain as prohibited. Dot should classify intent, constraints, and operational harm, not taboo keywords.

Examples:

| Domain | Allowed framing | Refused framing |
| --- | --- | --- |
| Privacy | Build a lawful personal VPN on a VPS | Hide from law enforcement while committing abuse |
| Finance | Explain legal tax optimization structures | Launder illegal proceeds |
| Security | Harden a server and test owned systems | Steal credentials or deploy malware |
| Trading | Build a dry-run liquidity monitor | Manipulate markets or steal private keys |
| Biology/medicine | General education and risk discussion | Instructions to harm a person |

## 11. Base ecosystem fit

Base builders are not only consumers. They are operators, protocol designers, DeFi builders, NFT creators, infra teams, auditors, bot developers, and anonymous founders. They understand wallets, smart contracts, gas, block explorers, composability, and distribution through onchain rails.

Dot fits that environment because:

- Wallet access is natural.
- Pseudonymity is normal.
- Payments can be onchain.
- Agent credits can be tokenized or account-bound.
- DotCode can build Base apps, dashboards, bots, and contract tooling.
- Private AI protects alpha, code, audits, market research, and unreleased strategy.
- Users can prove payment or membership without exposing civil identity.

The strategic wedge is "private AI that can execute." A Base founder should be able to ask for a contract dashboard, a token analytics tool, a legal-risk research memo, a wallet-gated website, or a bot simulator and then hand it to DotCode to build locally.

## 12. Dot vs chat-only private AI competitors

The private AI category is becoming crowded: privacy-first chat products, OpenAI-compatible private APIs, local model wrappers, anonymous prompt interfaces, and hosted open-model endpoints all compete for the same user trust surface. Dot should treat that competition seriously without naming the whole strategy around any single company. The core distinction is architectural: most competitors optimize private access to model responses; Dot optimizes private access to executable capability.

| Axis | Chat-only private AI competitors | Dot target architecture |
| --- | --- | --- |
| Core product | Private AI chat/API | Private chat plus DotCode |
| Differentiator | Privacy-forward uncensored model access | Private agentic execution |
| Identity | Account/product identity | Anonymous, wallet, or provider |
| Conversation handling | Privacy-forward local/encrypted posture per docs | No server-side conversation storage by default |
| Model serving | Provider-controlled model/API surface | Dot-controlled on-prem private model pool |
| Coding | Model can write code text | Agent can inspect, patch, run, test, and verify |
| Terminal execution | Not core surface | Core DotCode surface |
| Wallet-native Base fit | Not the central wedge | Central access and payment primitive |
| Benchmark target | Model/API quality | End-to-end artifact success |
| User promise | Private AI access | Private AI capability |

Dot's claim is not "we have a better chat box." The claim is "the private AI interface should be able to do work."

## 13. DotBench: system-level evaluation

Dot should benchmark the system, not just the model.

### 13.1 Why ordinary model benchmarks are insufficient

MMLU-style benchmarks test static knowledge. Coding leaderboards test slices of synthesis. They do not measure whether a product can:

- Route to search when facts are current.
- Avoid retaining prompts.
- Open a coding agent.
- Patch a repo.
- Run tests.
- Verify a browser page.
- Avoid leaking secrets.
- Handle queue pressure.
- Show useful progress before final output.

DotBench evaluates the runtime.

### 13.2 Evaluation dimensions

| Dimension | Metric | Target SLO |
| --- | --- | --- |
| Chat latency | Time to visible acknowledgement | < 3 seconds under normal load |
| Streaming | Time to first token after route | < 10 seconds for normal chat |
| Search | Tool-state visibility | Search state visible before retrieval completes |
| Code handoff | Button and handoff file creation | < 5 seconds after classification |
| Agent success | Runnable artifact produced | Report pass/fail by task class |
| Verification | Tests/build/screenshots captured | Required for code tasks where available |
| Privacy | Raw prompt in server logs | 0 known occurrences |
| Policy | False refusal rate | Tracked by synthetic eval sets |
| Safety | Missed self-harm/harm-to-others | Tracked by adversarial eval sets |
| Reliability | Tool failure explanation | User-visible within the request |

These are targets, not measured claims. Production numbers must be published only after controlled measurement.

### 13.3 DotBench scoring functional

DotBench should produce per-task and aggregate scores. For task `i`, define:

```text
V_i = verified artifact score in [0,1]
T_i = test/check score in [0,1]
P_i = privacy invariant score in {0,1}
R_i = refusal correctness score in {0,1}
S_i = source/provenance score in [0,1]
L_i = normalized latency penalty in [0,1]
C_i = normalized cost penalty in [0,1]
```

The per-task score is:

```text
DotBench_i =
  w_V * V_i
  + w_T * T_i
  + w_P * P_i
  + w_R * R_i
  + w_S * S_i
  - w_L * L_i
  - w_C * C_i
```

with a hard gate:

```text
if P_i = 0 then DotBench_i = 0
```

The hard gate is deliberate. A task is not successful if it produces a beautiful artifact while leaking the prompt into server logs. Privacy is not a bonus metric; it is a constraint.

Aggregate system score:

```text
DotBench = (sum_i omega_i * DotBench_i) / (sum_i omega_i)
```

where `omega_i` weights task classes by product importance. DotCode-heavy tasks receive high `omega_i` because Dot's differentiation depends on verified execution, not generic chat competence.

### 13.4 Initial DotBench tasks

| Task | Expected artifact | Verification |
| --- | --- | --- |
| Build a lawful personal VPN on a US VPS | WireGuard project and docs | Config lint, dry-run scripts, proof command |
| Build a Base liquidity monitor | TypeScript monitor with simulation | Unit tests, dry-run mode, no hardcoded keys |
| Build a landing page | Visual site with hero media | Local server, screenshots, asset checks |
| Fix a repository bug | Patch existing code | Test output and diff summary |
| Add search to chat | Tool state and cited answer | Mock search and UI state check |
| Create a docs site | Technical docs with citations | Link checks and responsive screenshots |

The scoring unit is not "did the model sound smart?" The scoring unit is "did the system produce evidence?"

## 14. Compute simulation and capacity planning

The following section is a synthetic capacity study, not a production benchmark. Its role is to make Dot's compute assumptions falsifiable. Instead of saying "we need GPUs," the paper defines a small queueing and memory model that predicts how concurrency degrades with context length, how long-running code agents interact with short chat traffic, and why route isolation is not optional.

### 14.1 Synthetic workload definition

Let traffic be a mixture of request classes:

```text
K = {direct_chat, search_chat, code_plan, code_agent, policy}
```

Each class `k` has:

```text
lambda_k   arrival rate, requests / second
Tin_k      input tokens
Tout_k     output tokens
S_k        service time distribution
q_k        queue assignment
```

The synthetic MVP mix used for planning:

| Class | Share | Input tokens | Output tokens | Notes |
| --- | ---: | ---: | ---: | --- |
| direct_chat | 55% | 600 | 800 | Ordinary private answers |
| search_chat | 20% | 1,200 | 900 | Retrieval plus citations |
| code_plan | 15% | 2,500 | 1,500 | DotCode handoff and repo plan |
| code_agent | 8% | tool loop | report | Shell/test/browser loop |
| policy | 2% | 350 | 32 | Boundary classification |

The model deliberately separates `code_plan` from `code_agent`. The first is an inference workload; the second is an execution workload. Mixing them hides the true bottleneck.

### 14.2 KV-cache pressure simulation

For a decoder-only transformer, KV-cache memory grows approximately linearly with active tokens:

```text
KV(x,m) = (Tin_x + Tout_x) * 2 * L_m * Hkv_m * Dhe_m * B
```

where `L_m` is the number of layers, `Hkv_m` is the number of KV heads, `Dhe_m` is head dimension, and `B` is bytes per scalar. For planning, define:

```text
C_g = usable KV cache budget on GPU g
A_g = active admitted requests

admit(x,g) iff sum_{j in A_g} KV(j,m) + KV(x,m) <= C_g
```

The qualitative result is invariant across model families: doubling context length approximately halves the maximum number of simultaneous sequences if all else is fixed.

Synthetic concurrency envelope for a single 24 GB consumer GPU lane after weights/runtime reserve:

```text
Context tokens per active request -> relative concurrency

 2k  | ############################  1.00x
 4k  | ##############                0.50x
 8k  | #######                       0.25x
16k  | ###                           0.12x
32k  | #                             0.06x
```

This is why Dot's router should not let every request default to the largest context model. Long context is a scarce resource. The product needs summarization, chunking, retrieval-first answers, and explicit long-context admission control.

### 14.3 Queueing simulation

A minimal approximation for each route queue is M/M/c:

```text
rho = lambda / (c * mu)

where:
  lambda = arrival rate
  mu     = service rate per worker
  c      = worker count
```

As `rho -> 1`, expected waiting time grows nonlinearly. This matters more than raw average latency. A system can feel fast at 50% utilization and unusable at 90% utilization even when the GPU is technically still processing tokens.

Synthetic waiting-time curve:

```text
Utilization rho -> relative waiting time

0.30 | ##
0.50 | ####
0.70 | #########
0.80 | ################
0.90 | ################################
0.95 | ############################################################
```

Operational conclusion: Dot should target headroom, not 100% GPU utilization. The product SLO is time-to-visible-progress, not only tokens/sec.

### 14.4 Route isolation simulation

Consider two designs:

```text
Design A: shared queue
  chat + search + code_plan + code_agent share workers

Design B: isolated queues
  chat-fast, chat-search, code-plan, code-agent, policy
```

Under load, Design A has head-of-line blocking. A long code plan or tool loop can delay a short chat request. Design B sacrifices some theoretical pooling efficiency but protects UX-critical routes.

Synthetic p95 time-to-first-visible-output:

| Scenario | Shared queue | Isolated queues |
| --- | ---: | ---: |
| 30% total load | 2.1s | 2.3s |
| 60% total load | 8.7s | 4.1s |
| 80% total load | 31.4s | 7.8s |
| 90% total load | timeout risk | 14.2s |

The numbers above are illustrative simulation outputs, not production measurements. The shape is the point: route isolation becomes more valuable as utilization increases.

### 14.5 Compute economics model

Let:

```text
cost_per_request =
  gpu_seconds * gpu_cost_per_second
  + search_calls * search_cost
  + sandbox_seconds * sandbox_cost_per_second
  + storage_events * storage_cost
```

For private inference, Dot must optimize:

```text
margin = subscription_revenue
         - sum(cost_per_request)
         - fixed_server_cost
         - bandwidth_cost
         - observability_cost
```

The correct economic unit is not "messages per month." It is weighted compute:

```text
weighted_compute =
  a * input_tokens
  + b * output_tokens
  + c * search_calls
  + d * dotcode_minutes
  + e * browser_verifications
  + f * long_context_multiplier
```

This supports fair billing without exposing raw backend complexity to users. A short direct answer, a search-grounded legal/regulatory answer, and a 12-minute DotCode session should not cost the same internally.

### 14.6 Simulation harness

DotBench should include a discrete-event simulator that replays synthetic arrivals against route queues. The simulator does not need user prompts. It only needs task classes, token lengths, service times, and model capacities.

Pseudo-code:

```text
for t in simulation_window:
  arrivals <- sample_poisson(lambda_k for each class k)
  for request in arrivals:
    estimate_tokens(request)
    assign_route_queue(request)
    if capacity_available(queue, request):
      admit(request)
    else:
      enqueue_or_defer(request)
  for worker in workers:
    step_decode_or_tool(worker)
    emit_state_events(worker)
  record(ttft, queue_wait, gpu_util, kv_free, dropped, completed)
```

Outputs:

- TTFT p50/p95/p99 by route.
- Queue wait p50/p95/p99 by route.
- KV-cache pressure over time.
- Active sequence count.
- Dropped/deferred requests.
- Cost per request class.
- User-visible status delay.

This is the bridge between "we bought GPUs" and "the product feels fast." Dot should tune against route-specific latency distributions rather than average throughput.

### 14.7 Capacity planning graph set

The public paper should visualize at least four planning curves:

1. **Context length vs relative concurrency.** Shows why long context is expensive.
2. **Utilization vs waiting time.** Shows why headroom matters.
3. **Shared vs isolated queue p95 TTFT.** Shows why DotCode needs route separation.
4. **Weighted compute by route.** Shows why credits should price tool-heavy sessions differently.

The graphs are synthetic and should be labeled as such until production telemetry exists. Their value is explanatory: they reveal which variables control the economics and UX.

## 15. Security model

Dot's security posture is built around separation and explicit boundaries.

### 15.1 Network and service separation

- Edge ingress terminates public traffic.
- Router services receive request envelopes.
- Policy services classify boundary risk.
- Inference workers run models.
- Agent workers run generated code or local handoffs.
- Audit stores retain metadata only.
- Metrics stores retain operational counters only.

No single service should need raw prompts, account identity, wallet records, and long-term logs at the same time.

### 15.2 Agent containment

DotCode should enforce:

- Workspace scoping.
- Permission prompts for terminal launch.
- Changed-file visibility.
- Secret redaction.
- Environment variable conventions.
- Explicit user-controlled signing for blockchain operations.
- No automatic wallet transaction signing by the model.
- Dry-run defaults for systems involving money, private keys, or external APIs.

### 15.3 Retrieval security

Search results can contain hostile instructions. Dot should treat retrieved pages as data, not authority.

Controls:

- Source extraction separated from system instructions.
- Citation requirement for current factual claims.
- Prompt-injection filtering for tool output.
- No execution of retrieved code without user-visible context.
- Prefer official docs for high-stakes technical claims.

## 16. Operational protocol

Dot's production system should be specified as a request protocol rather than a loose set of services. The protocol determines how a prompt moves through identity, policy, routing, inference, tools, and response streaming. This matters because privacy failures often come from unclear boundaries: a debug logger receives too much context, a search tool stores too much body text, a queue object becomes durable, or an agent writes outside the intended project.

### 16.1 Request envelope

The request envelope should contain routing metadata, not the user's long-term profile.

```json
{
  "request_id": "random_uuid",
  "session_ref": "ephemeral_or_account_scoped_id",
  "identity_mode": "anonymous | wallet | provider",
  "user_tier": "free | pro | team | enterprise",
  "private_mode": true,
  "task_hint": "chat | code | search | unknown",
  "client_capabilities": ["streaming", "dotcode_handoff", "search_status"],
  "prompt_body": "<volatile buffer>",
  "retention_policy": "no_server_transcript"
}
```

The envelope exists to make routing explicit. It should not become a permanent object store. If the system needs to persist request state for reliability, that state should be content-free: request id, route, queue state, model id, error class, timestamps, and token counts.

### 16.2 Admission control

Dot should reject or defer requests before they consume expensive GPU memory when the system is saturated. Admission control should be based on:

- Active sequence count per model.
- Available KV-cache blocks.
- Estimated prefill length.
- Expected decode length.
- User tier.
- Tool requirements.
- Whether the request needs DotCode or normal chat.
- Whether the request is search-heavy.

For LLM serving, a long prompt has a different cost shape than a short prompt with a long answer. Prefill pressure and decode pressure should be measured separately. A production router should expose at least:

```text
queue_depth_by_model
prefill_tokens_waiting
decode_sequences_active
kv_cache_free_blocks
time_to_first_token_p50/p95
decode_tokens_per_second_p50/p95
```

The product should surface this state in human terms. Users do not need to see KV blocks, but they should see "queued", "searching", "generating", "opening DotCode", or "tool retrying." Silent waiting destroys trust.

### 16.3 Queue classes

Dot should use separate queue classes:

| Queue | Workload | Reason |
| --- | --- | --- |
| `chat-fast` | Short direct answers | Protect low-latency interactive usage |
| `chat-search` | Retrieval-grounded answers | Isolate web latency from pure generation |
| `code-plan` | DotCode planning prompts | Avoid starving chat with long planning |
| `code-agent` | Tool/action loops | Separate inference from shell execution |
| `policy` | Boundary classification | Must stay low-latency and cheap |
| `eval` | Synthetic benchmark replay | Never competes with user traffic |

The code-agent queue is especially important. A coding agent can run for minutes. It may need to install dependencies, start a dev server, run tests, wait for a browser screenshot, and repair failures. This is a different product shape than a chat completion. Mixing it with normal chat queues creates unpredictable latency.

### 16.4 Stream contract

DotChat should stream structured events, not just text chunks.

```text
event: route
data: {"task":"search_grounded_answer","model":"dot-chat-primary"}

event: status
data: {"state":"searching","sources_seen":0}

event: status
data: {"state":"generating","model":"dot-chat-primary"}

event: token
data: {"text":"..."}

event: done
data: {"finish_reason":"stop","retained_prompt":false}
```

The event stream gives the frontend a precise way to show progress. It also makes observability cleaner: the system can count transitions without storing content.

## 17. Privacy-preserving observability

An AI service without observability will fail operationally. A privacy service with naive observability will fail philosophically. Dot needs both. The observability design should prove that the system is healthy without storing the actual conversations that made it healthy.

### 17.1 Metrics that are safe by default

Safe metrics include:

- Request count by route.
- Queue depth by model.
- Time to first token.
- Tokens per second.
- Search tool latency.
- DotCode handoff success/failure.
- Agent command count.
- Test pass/fail status.
- Policy class count.
- Error class.
- GPU memory pressure.
- Worker restarts.

Unsafe metrics include:

- Raw prompts.
- Raw completions.
- Full search snippets tied to user ids.
- Full command output when it may contain secrets.
- Wallet address tied to prompt text.
- Browser local history uploaded to the server.

The discipline is to model the system as counters, histograms, and state transitions. The content should stay out of the metrics plane.

### 17.2 Privacy regression tests

Dot should test privacy like it tests functionality. A privacy regression suite should inject canary strings into synthetic prompts and then scan:

- Server logs.
- Application logs.
- Reverse proxy logs.
- Error traces.
- Metrics labels.
- Queue dumps.
- Crash reports.
- Browser-visible debug payloads.

The test passes only if the canary is absent from every server-side durable surface except explicitly allowed short-lived debug buffers.

Example synthetic canary:

```text
DOT_PRIVACY_CANARY_7f31c9e2_do_not_log
```

This is a simple but powerful discipline. Many systems claim no retention while accidentally leaking prompts into error traces or debug logs. Dot's claim should be continuously tested.

### 17.3 Verifiable no-retention direction

In later versions, Dot can move from "we say we do not store prompts" to stronger forms of proof:

- Open retention policy documents.
- Redacted logging configuration.
- External security reviews.
- Reproducible privacy regression tests.
- TEE-backed inference attestations for selected lanes.
- Signed build artifacts for inference workers.
- Append-only transparency logs for policy/config changes.

TEE-backed inference is not a magic privacy wand. It can, however, narrow the trust boundary by proving that a particular model-serving binary and policy configuration handled a request. Dot should treat TEEs as an additional attestation layer, not as a substitute for no-retention design.

## 18. Model operations and fine-tuning

Dot's model strategy should be practical: route first, fine-tune second, distill third, and only train when there is a clear product advantage. A private AI platform can waste enormous money chasing model vanity. Dot's advantage is the runtime, so model operations should support the runtime.

### 18.1 Model registry

Every model in the pool should have a registry entry:

```json
{
  "model_id": "dot-chat-direct-v1",
  "base_family": "mistral | qwen | deepseek | llama | glm | other",
  "license": "tracked",
  "context_window": 32768,
  "private_mode_allowed": true,
  "tool_use": "none | search | code | function",
  "quantization": "bf16 | fp16 | int8 | q8 | q6 | q5 | q4",
  "serving_engine": "vllm | tgi | llama.cpp | custom",
  "eval_profile": "direct-chat-2026-05",
  "policy_profile": "narrow-boundary-v1"
}
```

The registry is the source of truth for routing. It prevents product behavior from depending on tribal knowledge about which model is "good today."

### 18.2 Fine-tuning targets

Dot should not fine-tune models to memorize user data. The useful fine-tuning targets are synthetic or internally generated:

- DotCode handoff quality.
- Tool-use formatting.
- Refusal-boundary precision.
- Search-grounded answer style.
- Low-fluff technical explanations.
- Repo-analysis planning.
- Frontend artifact verification habits.
- Secret-handling discipline.

Fine-tuning should use synthetic tasks, open datasets where licenses permit, generated eval failures, and manually curated internal examples. User prompts should not be training data by default.

### 18.3 Distillation targets

Large models are expensive. Dot can use larger models to produce high-quality synthetic traces, then distill narrow behaviors into smaller local models:

- Classify code vs chat vs search.
- Rewrite ambiguous build requests into precise DotCode handoffs.
- Detect when retrieval is required.
- Convert tool observations into concise user-visible status.
- Flag policy-boundary ambiguity for a larger model.

This architecture keeps latency and cost under control. The expensive model is used when it creates leverage; the smaller model handles repeated routing and formatting tasks.

### 18.4 Evaluation gating

No model should enter the production router only because it feels good in manual tests. It should pass:

- Direct answer evals.
- Search-grounded hallucination evals.
- Refusal-boundary evals.
- Code-handoff evals.
- Latency and throughput smoke tests.
- Prompt-injection resistance tests.
- Secret-leakage tests for DotCode routes.

Promotion should be reversible. The router should support canary percentages and rollback by model id.

## 19. Wallet-native access and onchain economics

Dot does not need a token to be useful. It does need an identity and payment system that respects the privacy thesis. Wallet-native access gives Dot a clean path: pseudonymous users can pay, unlock tiers, and carry reputation without exposing civil identity.

### 19.1 Access primitives

The wallet layer can support:

- Sign-in with wallet.
- Subscription proof.
- Usage credits.
- Team wallets.
- Creator-issued access passes.
- Token-gated research rooms.
- Pseudonymous reputation.
- Onchain invoices for enterprise accounts.

The important design constraint is separation: payment records should not be joined to prompt content. A wallet can prove entitlement without becoming a transcript key.

### 19.2 Credit accounting

AI costs are variable. Dot should account for usage with abstract credits rather than exposing every backend cost detail to the user.

Credit burn can consider:

- Input tokens.
- Output tokens.
- Search calls.
- DotCode session time.
- Browser verification.
- Long-context premium models.
- GPU queue priority.

The product should still be understandable. Users should know that a long DotCode build costs more than a short chat answer. They do not need to reason about KV-cache pressure.

### 19.3 Base-specific primitives

Base enables low-friction experiments:

- Wallet-gated docs and demos.
- Onchain subscription receipts.
- Agent credits funded by a wallet.
- Builder grants or credits distributed to addresses.
- Public attestations for benchmark runs.
- Optional proof that a user owns access without revealing identity.

This is where Dot can feel native rather than bolted onto crypto. The wallet is not a marketing badge. It is an access-control and payment primitive that matches the privacy model.

## 20. Failure modes and mitigations

A serious whitepaper should describe how the system fails.

### 20.1 Inference saturation

Failure: GPU queues saturate, time to first token spikes, users think the model is stuck.

Mitigation:

- Separate chat/code/search queues.
- Admission control.
- User-visible queue state.
- Smaller fallback models for simple requests.
- Circuit breakers when p95 latency crosses SLO.

### 20.2 False refusals

Failure: The narrow boundary classifier over-refuses lawful sensitive requests, making Dot feel like the systems it is trying to beat.

Mitigation:

- Synthetic eval sets for sensitive lawful domains.
- Refusal audit without raw prompt storage.
- Regression tests by domain.
- Appeals/reporting path with explicit opt-in content sharing.

### 20.3 False allows

Failure: The policy layer allows a request that materially helps self-harm or harm to others.

Mitigation:

- Adversarial evals.
- Post-generation policy checks where needed.
- Conservative handling of direct operational harm.
- Continuous review of synthetic failures.

### 20.4 Agent overreach

Failure: DotCode modifies unrelated files, prints secrets, installs malicious packages, or runs dangerous commands.

Mitigation:

- Workspace scoping.
- Changed-file summaries.
- Secret redaction.
- Permission prompts.
- Dependency diff visibility.
- Dry-run defaults for money, keys, and external APIs.
- Isolation workers for generated code.

### 20.5 Retrieval poisoning

Failure: Search results contain prompt-injection instructions or malicious code.

Mitigation:

- Treat retrieved text as data, never system instruction.
- Prefer official docs for high-stakes facts.
- Cite sources.
- Keep execution separate from retrieval.
- Do not run retrieved code automatically.

### 20.6 Privacy drift

Failure: A new feature slowly adds retention: saved chats, memory, analytics, debugging, export, or personalization.

Mitigation:

- Feature-level retention review.
- Canary-based privacy tests.
- Explicit user opt-in for memory.
- Short retention windows for debug captures.
- Product copy that distinguishes local storage from server storage.

## 21. Implementation roadmap

### 21.1 Product

- DotChat private conversation UI.
- Model status and streaming state.
- Search status visibility.
- DotCode routing for coding/build requests.
- Local terminal handoff with compact visible prompt.
- Wallet sign-in and username selection.
- User-controlled local history.
- Docs site at docs.usedot.xyz.

### 21.2 Infrastructure

- On-prem GPU inference pool.
- Router with model health and queue depth.
- Separate chat/code/search queues.
- Refusal audit with prompt hashing and salt rotation.
- Metrics without raw prompt/completion logs.
- Agent sandbox lane.
- Synthetic evaluation harness.

### 21.3 Research

- TEE-backed inference attestations.
- Verifiable no-retention logging.
- Local encrypted browser vault.
- Onchain credits and wallet-gated tiers.
- Deterministic benchmark replay using synthetic prompts.
- Model distillation and fine-tuning for Dot-specific routing, policy, and code planning.

## 22. Public interfaces and deployment contracts

Dot should expose a small number of stable interfaces. Stability matters because the product has several execution surfaces: browser chat, model router, search tool, DotCode handoff, wallet access, and future API access. If these interfaces are not explicit, every new feature becomes a privacy risk.

### 22.1 Chat completion interface

The chat interface should be OpenAI-compatible where possible, but with Dot-specific metadata for privacy and tool state.

```json
{
  "model": "dot-chat-private",
  "messages": [{"role": "user", "content": "..."}],
  "private_mode": true,
  "retention": "none",
  "tools": ["search"],
  "stream": true
}
```

The response stream should include text tokens and state events. The state events are not cosmetic. They are part of the product contract: the user should know whether Dot is searching, routing, generating, or preparing a DotCode handoff.

### 22.2 DotCode handoff interface

DotCode handoffs should be file-backed and compact at the terminal surface.

```json
{
  "handoff_id": "handoff_2026_05_28_001",
  "workspace": "~/Desktop/DotCode",
  "prompt_file": ".dotcode-handoffs/handoff_001.md",
  "visible_terminal_prompt": "Prompt sent. Read .dotcode-handoffs/handoff_001.md and execute the DotCode handoff.",
  "mode": "new_project | existing_repo",
  "privacy": {"server_transcript": false}
}
```

The browser should not dump a huge prompt into the terminal. The terminal should show a short instruction, while the full engineering spec lives in a local handoff file. This keeps the user interface readable and reduces accidental prompt exposure in shell scrollback.

### 22.3 Search interface

Search should return structured source objects:

```json
{
  "query": "vLLM PagedAttention throughput paper",
  "sources": [
    {
      "title": "Efficient Memory Management for Large Language Model Serving with PagedAttention",
      "url": "https://arxiv.org/abs/2309.06180",
      "publisher": "arXiv",
      "retrieved_at": "2026-05-28T00:00:00Z"
    }
  ]
}
```

The answer model can summarize sources, but retrieved text must not become instructions. Search results are data. The system prompt and policy boundary remain above retrieval.

### 22.4 Deployment contract

The public deployment contract for docs.usedot.xyz should state:

- Whether the page is product documentation, research documentation, or live service documentation.
- Which claims are target-state architecture.
- Which claims are implemented.
- Which benchmarks are measured and which are SLO targets.
- Which evidence links support external facts.

This is how Dot can sound technically ambitious without becoming unserious. Deep technical users do not punish ambition; they punish fuzzy claims. The correct posture is to describe the target system with precision, publish the roadmap, then ship toward the architecture.

## 23. Related work and positioning

Technical papers for infrastructure projects usually do not stop at architecture. They position the system against adjacent work, define what is novel, define what is inherited, and identify where the system's claims are intentionally narrower than the marketing category around it. Dot sits at the intersection of five prior lines of work:

- Privacy-forward consumer AI systems.
- Open-weight model serving systems.
- Tool-using and agentic reasoning systems.
- Coding-agent evaluation and repo-mutation benchmarks.
- Wallet-native identity and payment rails on fast consumer chains.

Dot's contribution is not the claim that any one of these fields is new. The contribution is the composition: pseudonymous access, volatile prompt handling, on-prem inference, narrow harm-boundary routing, search-augmented chat, and a local/terminal-native coding agent that produces inspectable artifacts.

### 23.1 Private AI systems

Privacy-first chat systems demonstrate that users want direct access to models without the default assumption that every prompt becomes platform memory. That market proof matters. Dot does not need to pretend the category is weak. The strongest competitors frame privacy as a product promise, expose models through chat and API surfaces, and publish privacy-oriented documentation.

Dot's wedge is the execution plane. A private chatbot can answer a question about a repository, but it usually cannot inspect the repository, edit files, run tests, verify a browser state, and return a command log. DotChat is therefore the front door; DotCode is the private capability substrate.

The positioning can be written as:

```text
private_chat := model_response(prompt)

private_agentic_compute := model_response(prompt)
                         + tool_plan(prompt, workspace_state)
                         + controlled_actions
                         + verification_trace
                         + privacy_invariant_checks
```

A chat-only private AI system optimizes response quality. Dot optimizes response quality plus execution evidence. This is the architectural distinction.

### 23.2 Tool-using LLMs and ReAct lineage

ReAct-style systems showed that reasoning and acting can be interleaved: the model observes state, reasons about the next step, invokes a tool, observes the result, and continues. DotCode follows the same high-level decomposition but applies stricter product constraints:

- The action space is bound to a workspace.
- Mutations are file-diff visible.
- Shell output is observable.
- Verification commands are part of the completion contract.
- Secrets must be redacted from logs.
- The final answer must distinguish files changed, commands run, evidence collected, and residual risk.

The technical point is that DotCode should not be benchmarked as a raw model. It should be benchmarked as a partially observable control system with a policy over actions:

```text
state_t       = observe(workspace, terminal, browser, test_output)
belief_t      = update_belief(belief_{t-1}, state_t)
action_t      = pi_theta(goal, belief_t, constraints)
state_{t+1}   = environment.step(action_t)
acceptance    = verifier(goal, state_{t+1})
```

The model matters, but the loop matters more. A weaker model inside a well-instrumented loop can outperform a stronger model that cannot see failures, rerun tests, or patch after observing evidence.

### 23.3 Coding-agent benchmarks

SWE-bench changed the conversation around coding models because it evaluates whether a system can resolve real GitHub issues, not merely emit plausible code. DotBench inherits that spirit but broadens the metric. Dot is not only trying to solve repo tasks; it is trying to solve repo tasks while preserving privacy, maintaining narrow policy correctness, and providing visible system state to the user.

For Dot, a coding task fails if any of the following are true:

- The code does not work.
- The tests do not run.
- The agent changes unrelated files without justification.
- The agent prints secrets.
- The agent cannot explain what it changed.
- The agent leaks private prompt material into durable server logs.
- The agent hides uncertainty behind a polished final answer.

This makes DotBench a systems benchmark, not a model leaderboard.

### 23.4 Open-model serving systems

vLLM, PagedAttention, Hugging Face Text Generation Inference, continuous batching, quantized GGUF workflows, and related model-serving projects define much of the modern serving vocabulary. Dot should use that vocabulary directly because the limiting factor for private open-model systems is rarely philosophical. It is usually memory, scheduling, context length, tool latency, and route isolation.

The operational lesson from model serving is:

```text
private_inference_capacity != GPU_count
private_inference_capacity ~= scheduler_quality
                           * memory_efficiency
                           * route_isolation
                           * context_policy
                           * fallback_design
```

The strongest private AI product is not the one with the longest list of models. It is the one that keeps time-to-first-token, queue visibility, model quality, and privacy invariants stable under real user traffic.

### 23.5 Wallet-native systems and Base

Base gives Dot a natural distribution layer because the audience is already comfortable with wallets, pseudonymous identity, onchain payments, apps, bots, and developer tooling. The wallet is not merely a login method. It is a primitive for:

- Pseudonymous account continuity.
- Credit accounting.
- Subscription or usage gating.
- Team or project access.
- Future proof-of-payment receipts.
- Optional portable reputation without requiring a civil identity.

The product hypothesis is that the next private AI user is not only a privacy maximalist. It is also a builder who wants a model that can answer directly, use the web, operate on code, and keep identity separation intact.

### 23.6 Dot's novelty claim

Dot should make a narrow novelty claim:

> Dot combines private open-model chat with a verified local coding agent, wallet-native pseudonymity, on-prem model routing, and no-retention invariants as first-class system constraints.

This claim is strong because it is specific. It avoids pretending that open models, coding agents, or privacy claims are individually new. The novelty is the product and systems composition.

## 24. Assumptions, non-goals, and claim boundaries

Research-grade technical documents need a claim boundary. Without one, every sentence sounds like either hype or a legal promise. Dot should publish claims in typed categories and keep those categories visible across the docs site.

### 24.1 Core assumptions

Dot's architecture assumes:

- Open-weight models can provide sufficient quality for a broad set of private chat and coding tasks when paired with retrieval, routing, and agentic execution.
- Users value pseudonymous access even when they do not require mathematically perfect anonymity.
- On-prem model serving is a better trust boundary than sending all prompts through third-party hosted inference APIs for the default private route.
- No-retention must be engineered as an invariant across logs, metrics, traces, queues, exceptions, and worker memory, not only as a database policy.
- A narrow self-harm and harm-to-others boundary can coexist with broad permission for lawful, adult, sensitive, technical, financial, medical, political, and controversial questions.
- Coding capability should be measured by artifacts and verification, not by the style or confidence of a chat response.

### 24.2 Non-goals

Dot should explicitly reject claims that are technically unrealistic or strategically distracting:

- Dot does not claim perfect anonymity against a global network adversary.
- Dot does not claim production benchmark results until a benchmark run is published with workload, hardware, model, quantization, scheduler, and measurement methodology.
- Dot does not auto-sign wallet transactions or make irreversible financial actions without explicit user-controlled configuration.
- Dot does not replace lawyers, accountants, doctors, auditors, security professionals, or licensed advisors.
- Dot does not claim that an uncensored model should facilitate self-harm or material harm to other people.
- Dot does not claim that all open models are equivalent. Model cards, evals, and routing tiers matter.
- Dot does not claim that no-retention removes all risk. It reduces durable data exposure; it does not remove every possible endpoint, browser, wallet, network, or user-side risk.

### 24.3 Claim taxonomy

The docs should tag claims so sophisticated readers know what is real, what is target-state, and what is simulated.

| Claim tag | Meaning | Required evidence |
|---|---|---|
| Implemented | Running in product or repository | Code reference, deployment reference, or demo trace |
| Prototype | Built but not production hardened | Demo, known limitations, test coverage status |
| Target-state | Intended architecture | Roadmap section and dependency list |
| Simulated | Modeled with synthetic assumptions | Harness, parameters, graph, and caveats |
| Measured | Empirically observed | Hardware, model, workload, sample size, date |
| External-evidence | External factual claim | Link to documentation, paper, or reproducible public artifact |
| Policy | Product governance decision | Owner, evaluation set, and review cadence |

This taxonomy lets Dot speak in an ambitious voice without inventing facts. The right standard is not humility for its own sake. The right standard is typed precision.

### 24.4 Claim ledger schema

Every public technical claim can be represented as a row:

```json
{
  "claim_id": "dot.claim.no_retention.private_prompts",
  "surface": "docs.usedot.xyz",
  "claim": "Private prompts are not stored as server-side transcripts by default.",
  "tag": "target-state",
  "owner": "privacy",
  "evidence": ["privacy-regression-suite", "logging-denylist-scan"],
  "last_reviewed": "release-gate",
  "known_limitations": [
    "client browser history is outside server retention control",
    "explicit debug mode may capture opt-in traces"
  ]
}
```

This ledger is not paperwork. It is how a project prevents marketing drift from corrupting engineering truth.

## 25. System cards, model cards, and release artifacts

Serious AI projects publish artifacts that outlive a single launch page. Model cards, system cards, evaluation reports, risk registers, privacy reviews, and release notes turn an AI product into an inspectable system. Dot should maintain these artifacts as part of every major model or runtime release.

### 25.1 Model card fields

A Dot model card should contain:

| Field | Description |
|---|---|
| Model ID | Public route name and internal registry ID |
| Base family | Upstream model family or mixture |
| License | License obligations and redistribution constraints |
| Quantization | Precision, quantization format, and quality notes |
| Context window | Supported context length by route |
| Serving engine | vLLM, TGI, llama.cpp, or custom worker |
| Route classes | Chat, search, code-plan, policy, eval, embedding |
| Hardware lane | Dev, MVP, throughput, isolation, eval |
| Policy profile | Boundary classifier configuration and known weak areas |
| Privacy mode | Whether model is eligible for no-retention private route |
| Evals | DotBench, refusal calibration, latency, retrieval grounding |
| Known failure modes | Hallucination class, long-context degradation, coding weaknesses |
| Promotion status | Experimental, canary, default, deprecated |

For models like Dolphin-style uncensored open models, the model card should be very explicit about what Dot adds around the model: routing, refusal boundary, search provenance, and private execution controls. The model is a component, not the entire safety or privacy architecture.

### 25.2 System card fields

A Dot system card should describe the deployed behavior of DotChat and DotCode together:

- Intended users.
- Supported identity modes.
- Data retention behavior.
- Training data policy.
- Refusal boundary.
- Search tool behavior.
- DotCode workspace mutation model.
- Default execution permissions.
- Known misuse risks.
- Known privacy limitations.
- Monitoring and audit methodology.
- Evaluation set summary.
- Incident response procedures.
- Version and rollback plan.

This document should be public enough to build trust, but precise enough that a technical user can disagree with it intelligently.

### 25.3 Release artifact bundle

Each major Dot release should ship an artifact bundle:

| Artifact | Owner | Purpose | Release gate |
|---|---|---|---|
| Router card | Inference | Shows route policy and fallback behavior | No unreviewed route changes |
| Model card | ML | Documents model identity and evals | Required for every default model |
| Policy card | Safety | Defines refusal classes and thresholds | Required before deployment |
| DotBench report | Evaluation | Shows artifact quality, privacy checks, latency | Required for major releases |
| Privacy regression report | Privacy | Confirms no raw prompt leakage in logs/traces | Hard gate |
| SRE scorecard | Infrastructure | Shows SLOs, saturation, incident readiness | Required before traffic increase |
| Agent sandbox report | Security | Reviews DotCode action boundaries | Required for tool expansion |
| Retrieval provenance report | Search | Reviews source quality and injection controls | Required for search changes |

### 25.4 Release gate DAG

The release process should be represented as a dependency graph:

```text
model_candidate
  -> model_card
  -> offline_evals
  -> policy_calibration
  -> privacy_canary_run
  -> DotBench_agent_tasks
  -> SRE_capacity_check
  -> canary_route
  -> staged_default
  -> post_release_audit
```

Promotion should fail closed. A new model can be impressive in chat and still fail promotion if it leaks prompt fragments through logs, refuses too broadly, breaks DotCode handoffs, or saturates the policy queue.

## 26. Governance, risk register, and compliance posture

Governance for Dot should be pragmatic and technical. The product should not bury itself in enterprise theater, but it does need named owners, risk registers, release gates, and incident playbooks. NIST AI RMF-style thinking is useful as a vocabulary: govern, map, measure, and manage. Dot should not claim certification unless it has it. It should claim operational discipline.

### 26.1 Governance layers

| Layer | Owner | Decision rights | Evidence |
|---|---|---|---|
| Privacy | Privacy engineering | Retention defaults, logging rules, canary tests | Privacy regression report |
| Model promotion | ML systems | Default model changes, quantization, routing | Model card, eval report |
| Policy boundary | Policy/eval | Refusal thresholds and eval sets | Policy card, calibration plots |
| Infrastructure | SRE | Capacity, routing, incident response | SLO dashboard, burn-rate alerts |
| DotCode agent | Agent runtime | Tool permissions, workspace scoping | Sandbox report, command audit |
| Search/retrieval | Retrieval systems | Source ranking and injection controls | Provenance report |
| Wallet access | Product/security | Auth flows, credit accounting | Auth review, abuse review |

### 26.2 Risk register

| Risk | Likelihood | Impact | Primary control | Evidence |
|---|---:|---:|---|---|
| Privacy drift in logs | Medium | Critical | No-body logging, canary prompts, trace redaction | Canary scan report |
| Harmful false allow | Medium | High | Boundary classifier, eval suites, review queue | Policy eval report |
| Lawful false refusal | High | Medium | Hard lawful-sensitive eval set | False-refusal dashboard |
| GPU saturation | High | High | Queue isolation, admission control, autoscale lane | SLO burn-rate graph |
| Wallet phishing/social engineering | Medium | High | Clear wallet UX, no auto-signing, origin checks | Auth threat model |
| Retrieval poisoning | Medium | Medium | Source ranking, prompt-injection delimiters | Retrieval eval report |
| Agent overreach | Medium | High | Workspace scoping, diff review, command policy | DotCode sandbox report |
| Supply-chain dependency compromise | Medium | High | Lockfiles, dependency scanning, reproducible builds | Build provenance |
| Model regression after quantization | Medium | Medium | Route-specific evals | Model promotion report |
| Regulatory ambiguity | Medium | Medium | Jurisdictional review and explicit user terms | Legal review note |

The risk table is not meant to scare investors or users. It signals that Dot understands the real failure modes of a private agentic platform.

### 26.3 Policy review loop

Dot's policy loop should be narrow, measurable, and versioned:

```text
collect_eval_prompts_without_user_identity
  -> label_allowed_self_harm_harm_others
  -> calibrate_thresholds
  -> test false_refuse on lawful-sensitive set
  -> test false_allow on harmful set
  -> canary deploy
  -> audit refused-count distributions without raw prompt retention
  -> publish policy card delta
```

The challenge is to avoid two bad defaults: an overbroad assistant that blocks legal adult information, and an under-specified assistant that facilitates self-harm or harm to other people. Dot's position is neither generic corporate refusal nor naive total permissiveness. It is explicit boundary engineering.

## 27. Reliability model, SLOs, and incident response

Private AI needs reliability language because privacy breaks down when systems are under pressure. Teams add emergency logging, dump queues, or copy prompts into tickets when observability is weak. Dot's SRE model should therefore treat privacy as an SLO class, not just as a product promise.

### 27.1 Service-level indicators

| SLI | Measurement | Target |
|---|---|---|
| Visible acknowledgement latency | Time from submit to first UI state event | p95 under 3s |
| Time to first token | Time from submit to first generated token | p95 under 10s for normal chat |
| Search visibility | Time until UI shows search started | p95 under 3s when search is selected |
| DotCode handoff latency | Time to button/handoff availability | p95 under 5s after route classification |
| Policy classifier latency | Time for boundary decision | p99 under 500ms for short prompts |
| Private prompt durable retention | Canary prompt found in durable stores | 0 known occurrences |
| Raw completion durable retention | Canary completion found in durable stores | 0 known occurrences |
| Queue starvation | Time policy queue waits behind generation | 0 starvation incidents |
| Agent verification rate | DotCode tasks with verification evidence | Target increases by release |

### 27.2 Error budget model

SLOs become useful when they constrain release velocity. Dot can define an error budget for latency and privacy separately:

```text
latency_good(request) =
  ack_latency <= 3s and ttft_latency <= 10s

privacy_good(request) =
  raw_prompt_not_in_logs
  and raw_completion_not_in_logs
  and no_training_lineage_from_private_request

SLO_latency = sum(latency_good) / N
SLO_privacy = sum(privacy_good) / N

privacy_error_budget = 0 known raw-retention incidents
```

The privacy SLO is intentionally stricter than the latency SLO. Latency degradation is bad product experience. Prompt retention in private mode is a trust incident.

### 27.3 Burn-rate simulation

Burn-rate thinking asks whether the system is consuming its allowed error budget too quickly:

```text
burn_rate(window) =
  observed_error_rate(window) / allowed_error_rate

release_freeze if:
  burn_rate(1h)  > 14.4
  or burn_rate(6h) > 6
  or privacy_incident_count > 0
```

For Dot, a privacy incident should freeze growth work until the retention path is understood. That is not conservatism. It is survival for a product whose central promise is privacy.

### 27.4 Incident playbooks

| Incident | First action | Containment | Recovery evidence |
|---|---|---|---|
| Prompt found in logs | Disable affected route logging | Rotate logs and isolate store | Canary scan clean on all stores |
| Model pool outage | Route to fallback or degrade gracefully | Queue admission limits | Postmortem with saturation graph |
| Search injection event | Disable affected source class | Raise retrieval filters | Retrieval eval pass |
| DotCode sandbox escape | Disable affected tool permission | Revoke handoff route | Sandbox regression pass |
| Wallet auth anomaly | Freeze affected sessions | Revalidate signatures | Auth review and replay tests |
| False refusal spike | Roll back classifier threshold | Run lawful-sensitive eval | Policy card delta |
| Harmful false allow | Tighten boundary and rerun evals | Review classifier trace metadata | Harm eval pass |

### 27.5 User-visible reliability contract

The product UI should expose state instead of hiding latency:

```text
submitted -> policy_check -> route_selected -> queued -> searching? -> generating -> complete
submitted -> policy_check -> dotcode_recommended -> handoff_ready
submitted -> overloaded -> deferred_with_visible_queue_state
```

The user should never wait in silence. A private system can still be transparent about state without exposing private content.

## 28. Open problems and research roadmap

Dot should publish open problems because the project sits in unsolved territory. Strong papers do not pretend every hard problem is solved. They identify the hard problems and show credible paths.

### 28.1 Verifiable no-retention

No-retention can be tested by canaries, but a stronger system would give users cryptographic or external assurance that prompts were not durably stored. Possible directions:

- Append-only transparency logs for model/router configuration without prompt content.
- Signed build manifests for workers that prove no body logging is enabled.
- Remote attestation for sensitive inference paths.
- Differential canary probes across logging, tracing, and queue surfaces.
- Independent audit scripts that run against deployment artifacts.

The hard part is producing evidence without creating a new privacy leak.

### 28.2 Private memory without server transcripts

Users want continuity. Privacy wants non-retention. The research problem is how to support useful memory without turning the server into a transcript warehouse.

Candidate designs:

- Client-side encrypted memory vault.
- User-exportable local memory file.
- Wallet-bound encrypted summaries controlled by the user.
- Short-lived server memory with explicit expiration.
- Retrieval over user-owned encrypted blobs.

The default should remain no server transcripts. Memory must be an opt-in user asset.

### 28.3 Proof-carrying agent runs

DotCode can become stronger if every run emits a proof object:

```json
{
  "goal": "build requested artifact",
  "workspace_scope": "declared",
  "files_changed": ["..."],
  "commands_run": ["..."],
  "tests": [{"command": "...", "exit_code": 0}],
  "browser_checks": [{"url": "...", "screenshot_hash": "..."}],
  "secret_redaction": "enabled",
  "policy_constraints": ["no secret exposure", "no unrelated file changes"],
  "residual_risk": ["..."]
}
```

This proof object is more valuable than a confident paragraph. It makes the agent inspectable.

### 28.4 Adversarial retrieval

Search-augmented generation gives Dot current information, but search results can contain prompt injection, low-quality pages, SEO spam, copied claims, or malicious instructions. The retrieval layer needs:

- Source class scoring.
- Instruction/data separation.
- Quoted source boundaries.
- Cross-source corroboration.
- Search-result sandboxing.
- Tool-output provenance in the UI.

The rule is simple: retrieved text is evidence, not authority.

### 28.5 Onchain billing privacy

Wallet-native AI creates a second privacy problem: payments can be linkable. Dot should explore:

- Prepaid credits.
- Session-bound spend caps.
- Optional fresh-wallet onboarding.
- Aggregated usage receipts.
- Minimal invoice metadata.
- Future privacy-preserving payment flows where available and lawful.

The goal is not to hide crime. The goal is to prevent ordinary AI usage from becoming a public behavioral graph.

### 28.6 Deterministic replay without private data

Debugging production AI systems often requires replaying failures. Private systems cannot solve this by storing the user's raw prompt. Dot needs replay strategies that do not require transcripts:

- Synthetic prompt class replay.
- Hash-linked policy decisions.
- Route/event metadata replay.
- User-provided opt-in debug bundles.
- Canary prompts that mimic failure classes.
- Local-only reproduction inside DotCode.

This is one of the hardest engineering problems in private AI: retaining enough structure to debug without retaining the private content itself.
