Fundamentals
QA overview
The private QA stack is meant to exercise OmeniaClaw in a more realistic, channel-shaped way than a single unit test can.
Current pieces:
- extensions/qa-channel: synthetic message channel with DM, channel, thread, reaction, edit, and delete surfaces.
- extensions/qa-lab: debugger UI and QA bus for observing the transcript, injecting inbound messages, and exporting a Markdown report.
- extensions/qa-matrix, future runner plugins: live-transport adapters that drive a real channel inside a child QA gateway.
- qa/: repo-backed seed assets for the kickoff task and baseline QA scenarios.
- Mantis: before and after live verification for bugs that need real transports, browser screenshots, VM state, and PR evidence.
Command surface
Every QA flow runs under pnpm OmeniaClaw qa <subcommand>. Many have pnpm qa:*
script aliases; both forms are supported.
| Command | Purpose |
|---|---|
| qa run | Bundled QA self-check without --qa-profile; taxonomy-backed maturity profile runner with --qa-profile smoke-ci or --qa-profile release. |
| qa suite | Run repo-backed scenarios against the QA gateway lane. Aliases: pnpm OmeniaClaw qa suite --runner multipass for a disposable Linux VM. |
| qa coverage | Print the YAML scenario-coverage inventory (--json for machine output). |
| qa parity-report | Compare two qa-suite-summary.json files and write the agentic parity report, or use --runtime-axis --token-efficiency to write Codex-vs-OmeniaClaw runtime parity and token-efficiency reports from one runtime-pair summary. |
| qa character-eval | Run the character QA scenario across multiple live models with a judged report. See Reporting. |
| qa manual | Run a one-off prompt against the selected provider/model lane. |
| qa ui | Start the QA debugger UI and local QA bus (alias: pnpm qa:lab:ui). |
| qa docker-build-image | Build the prebaked QA Docker image. |
| qa docker-scaffold | Write a docker-compose scaffold for the QA dashboard + gateway lane. |
| qa up | Build the QA site, start the Docker-backed stack, print the URL (alias: pnpm qa:lab:up; :fast variant adds --use-prebuilt-image --bind-ui-dist --skip-ui-build). |
| qa aimock | Start only the AIMock provider server. |
| qa mock-openai | Start only the scenario-aware mock-openai provider server. |
| qa credentials doctor / add / list / remove | Manage the shared Convex credential pool. |
| qa matrix | Live transport lane against a disposable Tuwunel homeserver. See Matrix QA. |
| qa telegram | Live transport lane against a real private Telegram group. |
| qa discord | Live transport lane against a real private Discord guild channel. |
| qa slack | Live transport lane against a real private Slack channel. |
| qa whatsapp | Live transport lane against real WhatsApp Web accounts. |
| qa mantis | Before and after verification runner for live transport bugs, with Discord status-reactions evidence, Crabbox desktop/browser smoke, and Slack-in-VNC smoke. See Mantis and Mantis Slack Desktop Runbook. |
Profile-backed qa run reads membership from taxonomy.yaml, then dispatches
the resolved scenarios through qa suite. --surface and
--category filter the selected profile instead of defining separate lanes.
The resulting qa-evidence.json includes a profile scorecard summary with
selected-category counts and missing coverage IDs; the individual evidence
entries remain the source of truth for the tests, coverage roles, and results.
Taxonomy feature coverage IDs are exact proof targets, not aliases. Primary
scenario coverage fulfills matching IDs; secondary coverage stays advisory.
Coverage IDs use dotted namespace.behavior form with lowercase
alphanumeric/dash segments; profile, surface, and category IDs may still use
the existing dashed or dotted taxonomy IDs.
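The coverage rules above can be sketched in a few lines. This is an illustrative sketch only, not the runner's actual code: the `ScenarioCoverage` shape and both function names are hypothetical, and the pattern assumes that segments may not start or end with a dash and that an ID has at least two dotted segments.

```typescript
// Hypothetical shape for illustration; the real qa-evidence.json schema may differ.
type CoverageRole = "primary" | "secondary";

interface ScenarioCoverage {
  scenarioId: string;
  coverageId: string;
  role: CoverageRole;
}

// ASSUMPTION: two or more dot-separated segments, each made of lowercase
// alphanumeric runs joined by single dashes (no leading or trailing dash).
const COVERAGE_ID = /^[a-z0-9]+(?:-[a-z0-9]+)*(?:\.[a-z0-9]+(?:-[a-z0-9]+)*)+$/;

function isCoverageId(id: string): boolean {
  return COVERAGE_ID.test(id);
}

// Only primary coverage with an exactly matching ID fulfills a required ID.
// Secondary coverage is skipped here because it stays advisory.
function missingCoverageIds(
  required: string[],
  entries: ScenarioCoverage[],
): string[] {
  const fulfilled = new Set(
    entries.filter((e) => e.role === "primary").map((e) => e.coverageId),
  );
  return required.filter((id) => !fulfilled.has(id));
}
```

Exact matching is the point of the sketch: an ID that differs by one segment is treated as missing rather than as an alias.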
Slim evidence omits per-entry execution and sets evidenceMode: "slim";
smoke-ci defaults to slim, and --evidence-mode full restores full entries:
pnpm OmeniaClaw qa run \
  --qa-profile smoke-ci \
  --category agent-runtime-and-provider-execution.agent-turn-execution \
  --provider-mode mock-openai \
  --output-dir .artifacts/qa-e2e/smoke-ci-profile-dispatch
Use smoke-ci for deterministic no-live-service proof and release for the
Stable/LTS proof lane. When a command also needs an OmeniaClaw root profile, put
the root profile before the QA command:
pnpm OmeniaClaw --profile work qa run --qa-profile smoke-ci
Operator flow
The current QA operator flow is a two-pane QA site:
- Left: Gateway dashboard (Control UI) with the agent.
- Right: QA Lab, showing the Slack-ish transcript and scenario plan.
Run it with:
pnpm qa:lab:up
That builds the QA site, starts the Docker-backed gateway lane, and exposes the QA Lab page where an operator or automation loop can give the agent a QA mission, observe real channel behavior, and record what worked, failed, or stayed blocked.
For faster QA Lab UI iteration without rebuilding the Docker image each time, start the stack with a bind-mounted QA Lab bundle:
pnpm OmeniaClaw qa docker-build-image
pnpm qa:lab:build
pnpm qa:lab:up:fast
pnpm qa:lab:watch
qa:lab:up:fast keeps the Docker services on a prebuilt image and bind-mounts
extensions/qa-lab/web/dist into the qa-lab container. qa:lab:watch
rebuilds that bundle on change, and the browser auto-reloads when the QA Lab
asset hash changes.
For a local OpenTelemetry signal smoke, run:
pnpm qa:otel:smoke
That script starts a local OTLP/HTTP receiver, runs the otel-trace-smoke QA
scenario with the diagnostics-otel plugin enabled, then asserts traces,
metrics, and logs are exported. It decodes the exported protobuf trace spans
and checks the release-critical shape:
OmeniaClaw.run, OmeniaClaw.harness.run, a latest GenAI semantic-convention
model-call span, OmeniaClaw.context.assembled, and OmeniaClaw.message.delivery
must be present. The smoke forces
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental, so the model-call
span must use the {gen_ai.operation.name} {gen_ai.request.model} name;
model calls must not export StreamAbandoned on successful turns; raw diagnostic IDs and
OmeniaClaw.content.* attributes must stay out of the trace. The raw OTLP
payloads must not contain the prompt sentinel, response sentinel, or QA session
key. It writes otel-smoke-summary.json next to the QA suite artifacts.
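Two of the span checks described above are easy to picture in code. The sketch below is not the smoke script itself: `SpanLike` and `checkModelCallSpan` are hypothetical names, and a real check would run over decoded OTLP protobuf spans rather than a plain object.

```typescript
// Hypothetical, simplified span shape; decoded OTLP spans carry more fields.
interface SpanLike {
  name: string;
  attributes: Record<string, string>;
}

// Returns a list of problems; an empty list means the span passed both checks.
function checkModelCallSpan(span: SpanLike): string[] {
  const problems: string[] = [];
  const operation = span.attributes["gen_ai.operation.name"];
  const model = span.attributes["gen_ai.request.model"];

  if (!operation || !model) {
    problems.push("missing gen_ai.operation.name or gen_ai.request.model");
  } else if (span.name !== `${operation} ${model}`) {
    // The name must follow "{gen_ai.operation.name} {gen_ai.request.model}".
    problems.push(`unexpected span name: ${span.name}`);
  }

  // OmeniaClaw.content.* attributes must stay out of the trace.
  for (const key of Object.keys(span.attributes)) {
    if (key.startsWith("OmeniaClaw.content.")) {
      problems.push(`content attribute leaked: ${key}`);
    }
  }
  return problems;
}
```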
For a collector-backed OpenTelemetry smoke, run:
pnpm qa:otel:collector-smoke
That lane puts a real OpenTelemetry Collector Docker container in front of the same local receiver. Use it when changing endpoint wiring, collector compatibility, or OTLP export behavior that the in-process receiver could mask.
For the protected Prometheus scrape smoke, run:
pnpm qa:prometheus:smoke
That alias runs the docker-prometheus-smoke QA scenario with
diagnostics-prometheus enabled, verifies unauthenticated scrapes are rejected,
then checks the authenticated scrape includes release-critical metric families
without prompt content, response content, raw diagnostic identifiers, auth
tokens, or local paths.
To run both observability smokes back to back, use:
pnpm qa:observability:smoke
For the collector-backed OpenTelemetry lane plus the protected Prometheus scrape smoke, use:
pnpm qa:observability:collector-smoke
Observability QA stays source-checkout only. The npm tarball intentionally omits
QA Lab, so package Docker release lanes do not run qa commands. Use
pnpm qa:otel:smoke, pnpm qa:prometheus:smoke, or
pnpm qa:observability:smoke from a built source checkout when changing
diagnostics instrumentation.
For a transport-real Matrix smoke lane, run:
pnpm OmeniaClaw qa matrix --profile fast --fail-fast
The full CLI reference, profile/scenario catalog, env vars, and artifact layout for this lane live in Matrix QA. At a glance: it provisions a disposable Tuwunel homeserver in Docker, registers temporary driver/SUT/observer users, runs the real Matrix plugin inside a child QA gateway scoped to that transport (no qa-channel), then writes a Markdown report, JSON summary, observed-events artifact, and combined output log under .artifacts/qa-e2e/matrix-<timestamp>/.
The scenarios cover transport behavior that unit tests cannot prove end to end: mention gating, allow-bot policies, allowlists, top-level and threaded replies, DM routing, reaction handling, inbound edit suppression, restart replay dedupe, homeserver interruption recovery, approval metadata delivery, media handling, and Matrix E2EE bootstrap/recovery/verification flows. The E2EE CLI profile also drives OmeniaClaw matrix encryption setup and verification commands through the same disposable homeserver before checking gateway replies.
Discord also has Mantis-only opt-in scenarios for bug reproduction. Use
--scenario discord-status-reactions-tool-only for the explicit status reaction
timeline, or --scenario discord-thread-reply-filepath-attachment to create a
real Discord thread and verify that message.thread-reply preserves a
filePath attachment. These scenarios stay out of the default live Discord lane
because they are before/after repro probes rather than broad smoke coverage.
The thread-attachment Mantis workflow can also add a logged-in Discord Web
witness video when MANTIS_DISCORD_VIEWER_CHROME_PROFILE_DIR or
MANTIS_DISCORD_VIEWER_CHROME_PROFILE_TGZ_B64 is configured in the QA
environment. That viewer profile is only for visual capture; the pass/fail
decision still comes from the Discord REST oracle.
CI uses the same command surface in .github/workflows/qa-live-transports-convex.yml. Scheduled and default manual runs execute the fast Matrix profile with live frontier credentials, --fast, and OmeniaClaw_QA_MATRIX_NO_REPLY_WINDOW_MS=3000. Manual matrix_profile=all fans out into the five profile shards so the exhaustive catalog can run in parallel while keeping one artifact directory per shard.
For transport-real Telegram, Discord, Slack, and WhatsApp smoke lanes:
pnpm OmeniaClaw qa telegram
pnpm OmeniaClaw qa discord
pnpm OmeniaClaw qa slack
pnpm OmeniaClaw qa whatsapp
They target a pre-existing real channel with two bots or accounts (driver + SUT). Required env vars, scenario lists, output artifacts, and the Convex credential pool are documented in the Telegram, Discord, Slack, and WhatsApp QA reference below.
For a full Slack desktop VM run with VNC rescue, run:
pnpm OmeniaClaw qa mantis slack-desktop-smoke \
  --gateway-setup \
  --scenario slack-canary \
  --keep-lease
That command leases a Crabbox desktop/browser machine, runs the Slack live lane
inside the VM, opens Slack Web in the VNC browser, captures the desktop, and
copies slack-qa/, slack-desktop-smoke.png, and (when video capture is
available) slack-desktop-smoke.mp4 back to the Mantis artifact directory. Crabbox
desktop/browser leases provide the capture tools and browser/native-build helper
packages up front, so the scenario should only install fallbacks on older
leases. Mantis reports total and per-phase timings in
mantis-slack-desktop-smoke-report.md so slow runs show whether time went into
lease warmup, credential acquisition, remote setup, or artifact copy. Reuse
--lease-id <cbx_...> after logging in to Slack Web manually through VNC;
reused leases also keep Crabbox's pnpm store cache warm. The default
--hydrate-mode source verifies from a source checkout and runs install/build
inside the VM. Use --hydrate-mode prehydrated only when the reused remote
workspace already has node_modules and a built dist/; that mode skips the
expensive install/build step and fails closed when the workspace is not ready.
With --gateway-setup, Mantis leaves a persistent OmeniaClaw Slack gateway
running inside the VM on port 38973; without it, the command runs the normal
bot-to-bot Slack QA lane and exits after artifact capture.
To prove native Slack approval UI with desktop evidence, run the Mantis approval checkpoint mode:
pnpm OmeniaClaw qa mantis slack-desktop-smoke \
  --approval-checkpoints \
  --credential-source convex \
  --credential-role maintainer
This mode is mutually exclusive with --gateway-setup. It runs the Slack
approval scenarios, rejects non-approval scenario ids, waits at each pending and
resolved approval state, renders the observed Slack API message into
approval-checkpoints/<scenario>-pending.png and
approval-checkpoints/<scenario>-resolved.png, then fails if any checkpoint,
message evidence, acknowledgement, or rendered screenshot is missing or empty.
Cold CI leases may still show Slack sign-in in slack-desktop-smoke.png; the
approval checkpoint images are the visual proof for this lane.
The operator checklist, GitHub workflow dispatch command, evidence-comment contract, hydrate-mode decision table, timing interpretation, and failure handling steps live in Mantis Slack Desktop Runbook.
For an agent/CV style desktop task, run:
pnpm OmeniaClaw qa mantis visual-task \
  --browser-url https://example.net \
  --expect-text "Example Domain" \
  --vision-model openai/gpt-5.5
visual-task leases or reuses a Crabbox desktop/browser machine, starts
crabbox record --while, drives the visible browser through a nested
visual-driver, captures visual-task.png, runs OmeniaClaw infer image describe
against the screenshot when --vision-mode image-describe is selected, and
writes visual-task.mp4, mantis-visual-task-summary.json,
mantis-visual-task-driver-result.json, and mantis-visual-task-report.md.
When --expect-text is set, the vision prompt asks for a structured JSON
verdict and only passes when the model reports positive visible evidence; a
negative response that merely quotes the target text fails the assertion.
Use --vision-mode metadata for a no-model smoke that proves the desktop,
browser, screenshot, and video plumbing without calling an image-understanding
provider. Recording is a required artifact for visual-task; if Crabbox records
no non-empty visual-task.mp4, the task fails even when the visual driver
passed. On failure, Mantis keeps the lease for VNC unless the task had already
passed and --keep-lease was not set.
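To see why a negative answer that quotes the target text fails the --expect-text assertion, here is a minimal sketch of a verdict check. The verdict shape (`visible`, `evidence`) and both function names are assumptions made for this example; the document does not show the real prompt contract.

```typescript
// ASSUMPTION: the vision model is asked for JSON like
// {"visible": true, "evidence": "..."}; the real schema is not documented here.
interface VisionVerdict {
  visible: boolean;
  evidence: string;
}

function parseVerdict(raw: string): VisionVerdict | null {
  try {
    const value = JSON.parse(raw) as Partial<VisionVerdict>;
    if (typeof value.visible === "boolean" && typeof value.evidence === "string") {
      return { visible: value.visible, evidence: value.evidence };
    }
  } catch {
    // Not JSON (or not an object): treated as "no verdict".
  }
  return null;
}

// Passes only on an explicit positive verdict with non-empty evidence.
// The expected text is deliberately not substring-matched against the raw
// output, so "I cannot see 'Example Domain'" does not count as a pass.
function passesExpectText(raw: string): boolean {
  const verdict = parseVerdict(raw);
  return verdict !== null && verdict.visible && verdict.evidence.trim().length > 0;
}
```

The design choice worth noting is that the decision rests on the structured field, not on whether the target string appears somewhere in the reply.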
Before using pooled live credentials, run:
pnpm OmeniaClaw qa credentials doctor
The doctor checks Convex broker env, validates endpoint settings, and verifies admin/list reachability when the maintainer secret is present. It reports only set/missing status for secrets.
Live transport coverage
Live transport lanes share one contract instead of each inventing their own scenario list shape. qa-channel is the broad synthetic product-behavior suite and is not part of the live transport coverage matrix.
Live transport runners should import the shared scenario ids, baseline
coverage helpers, and scenario-selection helper from
OmeniaClaw/plugin-sdk/qa-live-transport-scenarios.
| Lane | Canary | Mention gating | Bot-to-bot | Allowlist block | Top-level reply | Restart resume | Thread follow-up | Thread isolation | Reaction observation | Help command | Native command registration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Matrix | x | x | x | x | x | x | x | x | x | ||
| Telegram | x | x | x | x | |||||||
| Discord | x | x | x | x | |||||||
| Slack | x | x | x | x | x | x | x | x | |||
| x | x | x | x | x | x | x |
This keeps qa-channel as the broad product-behavior suite while Matrix,
Telegram, and other live transports share one explicit transport-contract checklist.
For a disposable Linux VM lane without bringing Docker into the QA path, run:
pnpm OmeniaClaw qa suite --runner multipass --scenario channel-chat-baseline
This boots a fresh Multipass guest, installs dependencies, builds OmeniaClaw
inside the guest, runs qa suite, then copies the normal QA report and
summary back into .artifacts/qa-e2e/... on the host.
It reuses the same scenario-selection behavior as qa suite on the host.
Host and Multipass suite runs execute multiple selected scenarios in parallel
with isolated gateway workers by default. qa-channel defaults to concurrency
4, capped by the selected scenario count. Use --concurrency <count> to tune
the worker count, or --concurrency 1 for serial execution.
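As a rough illustration of the worker-count rule, the sketch below resolves a concurrency value from the selected scenario count. `resolveConcurrency` is a hypothetical helper, and applying the scenario-count cap to an explicit --concurrency value is an assumption; the text only states the cap for the default.

```typescript
const DEFAULT_CONCURRENCY = 4;

// ASSUMPTION: an explicit --concurrency value is capped by the scenario
// count in the same way as the default of 4.
function resolveConcurrency(selectedScenarioCount: number, requested?: number): number {
  const base = requested ?? DEFAULT_CONCURRENCY;
  if (!Number.isInteger(base) || base < 1) {
    throw new Error(`--concurrency must be a positive integer, got ${base}`);
  }
  // Never start more workers than there are scenarios, and never fewer than one.
  return Math.max(1, Math.min(base, selectedScenarioCount));
}
```

With ten selected scenarios this yields 4 workers by default, 2 workers when only two scenarios are selected, and 1 worker for `--concurrency 1`.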
Use --pack personal-agent to run the personal assistant benchmark pack. The
pack selector is additive with repeated --scenario flags: explicit scenarios
run first, then pack scenarios run in pack order with duplicates removed.
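The additive ordering can be sketched as a small merge: explicit scenarios first, pack scenarios after, duplicates dropped on first sight. `selectScenarios` is a hypothetical name, and the pack contents used in the usage note are invented for illustration.

```typescript
// Merge explicit --scenario ids with pack ids, keeping first occurrences only.
function selectScenarios(explicit: string[], pack: string[]): string[] {
  const seen = new Set<string>();
  const ordered: string[] = [];
  for (const id of [...explicit, ...pack]) {
    if (!seen.has(id)) {
      seen.add(id);
      ordered.push(id);
    }
  }
  return ordered;
}
```

For example, `selectScenarios(["b"], ["a", "b", "c"])` returns `["b", "a", "c"]`: the explicit scenario keeps its leading position and the pack's copy of it is skipped.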
Use --pack observability when a custom QA runner already supplies the
OpenTelemetry collector setup and wants the OpenTelemetry and Prometheus
diagnostics smoke scenarios selected together.
The command exits non-zero when any scenario fails. Use --allow-failures when
you want artifacts without a failing exit code.
Live runs forward the supported QA auth inputs that are practical for the
guest: env-based provider keys, the QA live provider config path, and
CODEX_HOME when present. Keep --output-dir under the repo root so the guest
can write back through the mounted workspace.
Telegram, Discord, Slack, and WhatsApp QA reference
Matrix has a dedicated page because of its scenario count and Docker-backed homeserver provisioning. Telegram, Discord, Slack, and WhatsApp run against pre-existing real transports, so their reference lives here.
Shared CLI flags
These lanes register through extensions/qa-lab/src/live-transports/shared/live-transport-cli.ts and accept the same flags:
| Flag | Default | Description |
|---|---|---|
| --scenario <id> | - | Run only this scenario. Repeatable. |
| --output-dir <path> | <repo>/.artifacts/qa-e2e/<transport>-<timestamp> | Where reports, summaries, evidence, transport-specific artifacts, and the output log are written. Relative paths resolve against --repo-root. |
| --repo-root <path> | process.cwd() | Repository root when invoking from a neutral cwd. |
| --sut-account <id> | sut | Temporary account id inside the QA gateway config. |
| --provider-mode <mode> | live-frontier | mock-openai or live-frontier (legacy live-openai still works). |
| --model <ref> / --alt-model <ref> | provider default | Primary/alternate model refs. |
| --fast | off | Provider fast mode where supported. |
| --credential-source <env\|convex> | env | See Convex credential pool. |
| --credential-role <maintainer\|ci> | ci in CI, maintainer otherwise | Role used when --credential-source convex. |
Each lane exits non-zero on any failed scenario. --allow-failures writes artifacts without setting a failing exit code.
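A few of the defaults and the exit-code rule read more easily as code. The helpers below are hypothetical and are not taken from live-transport-cli.ts; the sketch assumes POSIX-style paths, takes the timestamp as a parameter because its format is not documented here, and uses exit code 1 as a stand-in for "non-zero".

```typescript
type CredentialRole = "maintainer" | "ci";

// Default: <repo>/.artifacts/qa-e2e/<transport>-<timestamp>.
// Relative --output-dir values resolve against the repo root.
function resolveOutputDir(
  repoRoot: string,
  transport: string,
  timestamp: string,
  outputDir?: string,
): string {
  if (outputDir === undefined) {
    return `${repoRoot}/.artifacts/qa-e2e/${transport}-${timestamp}`;
  }
  // ASSUMPTION: POSIX paths, so a leading "/" means absolute.
  return outputDir.startsWith("/") ? outputDir : `${repoRoot}/${outputDir}`;
}

// ci in CI, maintainer otherwise, unless a role is passed explicitly.
function resolveCredentialRole(isCi: boolean, explicit?: CredentialRole): CredentialRole {
  return explicit ?? (isCi ? "ci" : "maintainer");
}

// ASSUMPTION: 1 stands in for "non-zero"; the real exit code is not stated.
function laneExitCode(failedScenarios: number, allowFailures: boolean): number {
  return failedScenarios > 0 && !allowFailures ? 1 : 0;
}
```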
Telegram QA
pnpm OmeniaClaw qa telegram
Targets one real private Telegram group with two distinct bots (driver + SUT). The SUT bot must have a Telegram username; bot-to-bot observation works best when both bots have Bot-to-Bot Communication Mode enabled in @BotFather.
Required env when --credential-source env:
- OmeniaClaw_QA_TELEGRAM_GROUP_ID - numeric chat id (string).
- OmeniaClaw_QA_TELEGRAM_DRIVER_BOT_TOKEN
- OmeniaClaw_QA_TELEGRAM_SUT_BOT_TOKEN
Scenarios (extensions/qa-lab/src/live-transports/telegram/telegram-live.runtime.ts):
- telegram-canary
- telegram-mention-gating
- telegram-mentioned-message-reply
- telegram-help-command
- telegram-commands-command
- telegram-tools-compact-command
- telegram-whoami-command
- telegram-status-command
- telegram-repeated-command-authorization
- telegram-other-bot-command-gating
- telegram-context-command
- telegram-current-session-status-tool
- telegram-reply-chain-exact-marker
- telegram-stream-final-single-message
- telegram-long-final-reuses-preview
- telegram-long-final-three-chunks
The implicit default set always covers canary, mention gating, native command replies, command addressing, and bot-to-bot group replies. mock-openai defaults also include deterministic reply-chain and final-message streaming checks. telegram-current-session-status-tool remains opt-in because it is only stable when threaded directly after canary, not after arbitrary native command replies. Use pnpm OmeniaClaw qa telegram --list-scenarios --provider-mode mock-openai to print the current default/optional split with regression refs.
Output artifacts:
- telegram-qa-report.md
- qa-evidence.json - evidence entries for the live transport checks, including profile, coverage, provider, channel, artifacts, result, and RTT fields.
Package Telegram runs use the same Telegram credential contract. Repeated RTT
measurement is part of the normal package Telegram live lane; the RTT
distribution is folded into qa-evidence.json under result.timing for the
selected RTT check.
OmeniaClaw_QA_CREDENTIAL_SOURCE=convex \
  pnpm test:docker:npm-telegram-live
When OmeniaClaw_QA_CREDENTIAL_SOURCE=convex is set, the package live wrapper
leases a kind: "telegram" credential, exports the leased group/driver/SUT bot
env into the installed-package run, heartbeats the lease, and releases it on
shutdown. The package wrapper defaults to 20 RTT checks of
telegram-mentioned-message-reply, a 30s RTT timeout, and Convex role
maintainer outside CI when Convex is selected. Override
OmeniaClaw_NPM_TELEGRAM_RTT_SAMPLES, OmeniaClaw_NPM_TELEGRAM_RTT_TIMEOUT_MS,
or OmeniaClaw_NPM_TELEGRAM_RTT_MAX_FAILURES to tune RTT measurement without
creating a separate RTT command or Telegram-specific summary format.
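The override behavior can be pictured as a small env reader. This is a sketch under stated assumptions, not the wrapper's code: `resolveRttSettings` and `readInt` are hypothetical helpers, the 30s timeout is expressed as 30000 ms, no default is invented for the max-failures setting because the text does not give one, and rejecting malformed values with an error is an assumption.

```typescript
interface RttSettings {
  samples: number;
  timeoutMs: number;
  maxFailures?: number; // Left undefined when unset: the default is not documented here.
}

type Env = Record<string, string | undefined>;

// ASSUMPTION: malformed values throw rather than silently falling back.
function readInt(env: Env, name: string): number | undefined {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return undefined;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

function resolveRttSettings(env: Env): RttSettings {
  return {
    samples: readInt(env, "OmeniaClaw_NPM_TELEGRAM_RTT_SAMPLES") ?? 20,
    timeoutMs: readInt(env, "OmeniaClaw_NPM_TELEGRAM_RTT_TIMEOUT_MS") ?? 30000,
    maxFailures: readInt(env, "OmeniaClaw_NPM_TELEGRAM_RTT_MAX_FAILURES"),
  };
}
```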
Discord QA
pnpm OmeniaClaw qa discord
Targets one real private Discord guild channel with two bots: a driver bot controlled by the harness and a SUT bot started by the child OmeniaClaw gateway through the bundled Discord plugin. Verifies channel mention handling, that the SUT bot has registered the native /help command with Discord, and opt-in Mantis evidence scenarios.
Required env when --credential-source env:
- OmeniaClaw_QA_DISCORD_GUILD_ID
- OmeniaClaw_QA_DISCORD_CHANNEL_ID
- OmeniaClaw_QA_DISCORD_DRIVER_BOT_TOKEN
- OmeniaClaw_QA_DISCORD_SUT_BOT_TOKEN
- OmeniaClaw_QA_DISCORD_SUT_APPLICATION_ID - must match the SUT bot user id returned by Discord (the lane fails fast otherwise).
Optional:
- OmeniaClaw_QA_DISCORD_CAPTURE_CONTENT=1 keeps message bodies in observed-message artifacts.
- OmeniaClaw_QA_DISCORD_VOICE_CHANNEL_ID selects the voice/stage channel for discord-voice-autojoin; without it, the scenario picks the first visible voice/stage channel for the SUT bot.
Scenarios (extensions/qa-lab/src/live-transports/discord/discord-live.runtime.ts:36):
- discord-canary
- discord-mention-gating
- discord-native-help-command-registration
- discord-voice-autojoin - opt-in voice scenario. Runs by itself, enables channels.discord.voice.autoJoin, and verifies the SUT bot's current Discord voice state is the target voice/stage channel. Convex Discord credentials may include optional voiceChannelId; otherwise the runner discovers the first visible voice/stage channel in the guild.
- discord-status-reactions-tool-only - opt-in Mantis scenario. Runs by itself because it switches the SUT to always-on, tool-only guild replies with messages.statusReactions.enabled=true, then captures a REST reaction timeline plus HTML/PNG visual artifacts. Mantis before/after reports also preserve scenario-provided MP4 artifacts as baseline.mp4 and candidate.mp4.
Run the Discord voice auto-join scenario explicitly:
pnpm OmeniaClaw qa discord \
  --scenario discord-voice-autojoin \
  --provider-mode mock-openai
Run the Mantis status-reaction scenario explicitly:
pnpm OmeniaClaw qa discord \
  --scenario discord-status-reactions-tool-only \
  --provider-mode live-frontier \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5 \
  --fast
Output artifacts:
- discord-qa-report.md
- qa-evidence.json - evidence entries for the live transport checks.
- discord-qa-observed-messages.json - bodies redacted unless OmeniaClaw_QA_DISCORD_CAPTURE_CONTENT=1.
- discord-qa-reaction-timelines.json and discord-status-reactions-tool-only-timeline.png when the status-reaction scenario runs.
Slack QA
pnpm OmeniaClaw qa slack
Targets one real private Slack channel with two distinct bots: a driver bot controlled by the harness and a SUT bot started by the child OmeniaClaw gateway through the bundled Slack plugin.
Required env when --credential-source env:
- OmeniaClaw_QA_SLACK_CHANNEL_ID
- OmeniaClaw_QA_SLACK_DRIVER_BOT_TOKEN
- OmeniaClaw_QA_SLACK_SUT_BOT_TOKEN
- OmeniaClaw_QA_SLACK_SUT_APP_TOKEN
Optional:
- OmeniaClaw_QA_SLACK_CAPTURE_CONTENT=1 keeps message bodies in observed-message artifacts.
- OmeniaClaw_QA_SLACK_APPROVAL_CHECKPOINT_DIR enables visual approval checkpoints for Mantis. The runner writes <scenario>.pending.json and <scenario>.resolved.json, then waits for matching .ack.json files.
- OmeniaClaw_QA_SLACK_APPROVAL_CHECKPOINT_TIMEOUT_MS overrides the checkpoint acknowledgement timeout. The default is 120000.
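The checkpoint handshake above can be pictured with a few path helpers. This is only a sketch: all three functions are hypothetical, the acknowledgement file name is an assumption (the text says only "matching .ack.json files"), and falling back to the default on an invalid timeout value is also an assumption.

```typescript
const DEFAULT_ACK_TIMEOUT_MS = 120000;

type CheckpointState = "pending" | "resolved";

// <scenario>.pending.json and <scenario>.resolved.json come from the text above.
function checkpointFile(dir: string, scenario: string, state: CheckpointState): string {
  return `${dir}/${scenario}.${state}.json`;
}

// ASSUMPTION: the acknowledgement sits next to its checkpoint as
// <scenario>.<state>.ack.json. The real naming is not documented here.
function ackFile(dir: string, scenario: string, state: CheckpointState): string {
  return `${dir}/${scenario}.${state}.ack.json`;
}

// ASSUMPTION: unset, empty, or invalid values fall back to the default.
function resolveAckTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["OmeniaClaw_QA_SLACK_APPROVAL_CHECKPOINT_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_ACK_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_ACK_TIMEOUT_MS;
}
```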
Scenarios (extensions/qa-lab/src/live-transports/slack/slack-live.runtime.ts):
- slack-canary
- slack-mention-gating
- slack-allowlist-block
- slack-top-level-reply-shape
- slack-restart-resume
- slack-thread-follow-up
- slack-thread-isolation
- slack-approval-exec-native - opt-in native Slack exec approval scenario. Requests an exec approval through the gateway, verifies the Slack message has native approval buttons, resolves it, and verifies the resolved Slack update.
- slack-approval-plugin-native - opt-in native Slack plugin approval scenario. Enables exec and plugin approval forwarding together so plugin events are not suppressed by exec approval routing, then verifies the same pending/resolved native Slack UI path.
Output artifacts:
- slack-qa-report.md
- qa-evidence.json - evidence entries for the live transport checks.
- slack-qa-observed-messages.json - bodies redacted unless OmeniaClaw_QA_SLACK_CAPTURE_CONTENT=1.
- approval-checkpoints/ - only when Mantis sets OmeniaClaw_QA_SLACK_APPROVAL_CHECKPOINT_DIR; contains checkpoint JSON, acknowledgement JSON, and pending/resolved screenshots.
Setting up the Slack workspace
The lane needs two distinct Slack apps in one workspace, plus a channel both bots are members of:
- channelId - the Cxxxxxxxxxx id of a channel both bots have been invited to. Use a dedicated channel; the lane posts on every run.
- driverBotToken - bot token (xoxb-...) of the Driver app.
- sutBotToken - bot token (xoxb-...) of the SUT app, which must be a separate Slack app from the driver so its bot user id is distinct.
- sutAppToken - app-level token (xapp-...) of the SUT app with connections:write, used by Socket Mode so the SUT app can receive events.
Prefer a Slack workspace dedicated to QA over reusing a production workspace.
The SUT manifest below intentionally narrows the bundled Slack plugin's production install (extensions/slack/src/setup-shared.ts:10) to the permissions and events covered by the live Slack QA suite. For the production-channel setup as users see it, see Slack channel quick setup; the QA Driver/SUT pair is intentionally separate because the lane needs two distinct bot user ids in one workspace.
1. Create the Driver app
Go to api.slack.com/apps → Create New App → From a manifest → pick the QA workspace, paste the following manifest, then Install to Workspace:
{
  "display_information": {
    "name": "OmeniaClaw QA Driver",
    "description": "Test driver bot for OmeniaClaw QA Slack live lane"
  },
  "features": {
    "bot_user": {
      "display_name": "OmeniaClaw QA Driver",
      "always_online": true
    }
  },
  "oauth_config": {
    "scopes": {
      "bot": ["chat:write", "channels:history", "groups:history", "users:read"]
    }
  },
  "settings": {
    "socket_mode_enabled": false
  }
}
Copy the Bot User OAuth Token (xoxb-...) - that becomes driverBotToken. The driver only needs to post messages and identify itself; no events, no Socket Mode.
2. Create the SUT app
Repeat Create New App → From a manifest in the same workspace. This QA app intentionally uses a narrower version of the bundled Slack plugin's production manifest (extensions/slack/src/setup-shared.ts:10): reaction scopes and events are omitted because the live Slack QA suite does not cover reaction handling yet.
{
  "display_information": {
    "name": "OmeniaClaw QA SUT",
    "description": "OmeniaClaw QA SUT connector for OmeniaClaw"
  },
  "features": {
    "bot_user": {
      "display_name": "OmeniaClaw QA SUT",
      "always_online": true
    },
    "app_home": {
      "home_tab_enabled": true,
      "messages_tab_enabled": true,
      "messages_tab_read_only_enabled": false
    }
  },
  "oauth_config": {
    "scopes": {
      "bot": [
        "app_mentions:read", "assistant:write", "channels:history", "channels:read",
        "chat:write", "commands", "emoji:read", "files:read", "files:write",
        "groups:history", "groups:read", "im:history", "im:read", "im:write",
        "mpim:history", "mpim:read", "mpim:write", "pins:read", "pins:write",
        "usergroups:read", "users:read"
      ]
    }
  },
  "settings": {
    "socket_mode_enabled": true,
    "event_subscriptions": {
      "bot_events": [
        "app_home_opened", "app_mention", "channel_rename",
        "member_joined_channel", "member_left_channel",
        "message.channels", "message.groups", "message.im", "message.mpim",
        "pin_added", "pin_removed"
      ]
    }
  }
}
After Slack creates the app, do two things on its settings page:
- Install to Workspace → copy the Bot User OAuth Token → that becomes sutBotToken.
- Basic Information → App-Level Tokens → Generate Token and Scopes → add scope connections:write → save → copy the xapp-... value → that becomes sutAppToken.
Verify the two bots have distinct user ids by calling auth.test on each token. The runtime distinguishes driver and SUT by user id; reusing one app for both will fail mention-gating immediately.
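A minimal sketch of that distinctness check follows. It assumes you have already called Slack's auth.test with each bot token and hold the two JSON responses; `AuthTestResponse` lists only the fields used here, and `assertDistinctBots` is a hypothetical helper, not part of the QA runtime.

```typescript
// Subset of the auth.test response used by this sketch.
interface AuthTestResponse {
  ok: boolean;
  user_id?: string;
  error?: string;
}

function requireUserId(label: string, response: AuthTestResponse): string {
  if (!response.ok || !response.user_id) {
    throw new Error(`${label} auth.test failed: ${response.error ?? "no user_id"}`);
  }
  return response.user_id;
}

// Throws when either token is invalid or both tokens map to the same bot user.
function assertDistinctBots(driver: AuthTestResponse, sut: AuthTestResponse): void {
  const driverId = requireUserId("driver", driver);
  const sutId = requireUserId("sut", sut);
  if (driverId === sutId) {
    throw new Error(`driver and SUT share user id ${driverId}; use two separate Slack apps`);
  }
}
```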
3. Create the channel
In the QA workspace, create a channel (e.g. #OmeniaClaw-qa) and invite both bots from inside the channel:
/invite @OmeniaClaw QA Driver
/invite @OmeniaClaw QA SUT
Copy the Cxxxxxxxxxx id from channel info → About → Channel ID - that becomes channelId. A public channel works; if you use a private channel, both apps already have groups:history so the harness's history reads will still succeed.
4. Register the credentials
Two options. Use env vars for single-machine debugging (set the four OmeniaClaw_QA_SLACK_* variables and pass --credential-source env), or seed the shared Convex pool so CI and other maintainers can lease them.
For the Convex pool, write the four fields to a JSON file:
{
  "channelId": "Cxxxxxxxxxx",
  "driverBotToken": "xoxb-...",
  "sutBotToken": "xoxb-...",
  "sutAppToken": "xapp-..."
}
With OmeniaClaw_QA_CONVEX_SITE_URL and OmeniaClaw_QA_CONVEX_SECRET_MAINTAINER exported in your shell, register and verify:
pnpm OmeniaClaw qa credentials add \
  --kind slack \
  --payload-file slack-creds.json \
  --note "QA Slack pool seed"
pnpm OmeniaClaw qa credentials list --kind slack --status all --json
Expect count: 1, status: "active", no lease field.
5. Verify end to end
Run the lane locally to confirm both bots can talk to each other through the broker:
pnpm OmeniaClaw qa slack \
  --credential-source convex \
  --credential-role maintainer \
  --output-dir .artifacts/qa-e2e/slack-local
A green run completes in well under 30 seconds and slack-qa-report.md shows both slack-canary and slack-mention-gating at status pass. If the lane hangs for ~90 seconds and exits with Convex credential pool exhausted for kind "slack", either the pool is empty or every row is leased - qa credentials list --kind slack --status all --json will tell you which.
WhatsApp QA
pnpm OmeniaClaw qa whatsapp
Targets two dedicated WhatsApp Web accounts: a driver account controlled by the harness and a SUT account started by the child OmeniaClaw gateway through the bundled WhatsApp plugin.
Required env when --credential-source env:
- OmeniaClaw_QA_WHATSAPP_DRIVER_PHONE_E164
- OmeniaClaw_QA_WHATSAPP_SUT_PHONE_E164
- OmeniaClaw_QA_WHATSAPP_DRIVER_AUTH_ARCHIVE_BASE64
- OmeniaClaw_QA_WHATSAPP_SUT_AUTH_ARCHIVE_BASE64
Optional:
- OmeniaClaw_QA_WHATSAPP_GROUP_JID enables group scenarios such as whatsapp-mention-gating and whatsapp-group-allowlist-block.
- OmeniaClaw_QA_WHATSAPP_CAPTURE_CONTENT=1 keeps message bodies in observed-message artifacts.
Scenario catalog (extensions/qa-lab/src/live-transports/whatsapp/whatsapp-live.runtime.ts):
- Baseline and group gating: whatsapp-canary, whatsapp-pairing-block, whatsapp-mention-gating, whatsapp-top-level-reply-shape, whatsapp-restart-resume, whatsapp-group-allowlist-block.
- Native commands: whatsapp-help-command, whatsapp-status-command, whatsapp-commands-command, whatsapp-tools-compact-command, whatsapp-whoami-command, whatsapp-context-command, whatsapp-native-new-command.
- Reply and final-output behavior: whatsapp-tool-only-usage-footer, whatsapp-reply-to-message, whatsapp-reply-context-isolation, whatsapp-reply-delivery-shape, whatsapp-stream-final-message-accounting.
- Inbound media and structured messages: whatsapp-inbound-image-caption, whatsapp-audio-preflight, whatsapp-inbound-structured-messages, whatsapp-group-audio-gating. These send real WhatsApp image, audio, document, location, contact, and sticker events through the driver.
- Outbound Gateway and message action coverage: whatsapp-outbound-media-matrix, whatsapp-outbound-document-preserves-filename, whatsapp-outbound-poll, whatsapp-message-actions.
- Access-control coverage: whatsapp-access-control-dm-open, whatsapp-access-control-dm-disabled, whatsapp-access-control-group-open, whatsapp-access-control-group-disabled, whatsapp-group-allowlist-block.
- Native approvals: whatsapp-approval-exec-deny-native, whatsapp-approval-exec-native, whatsapp-approval-exec-reaction-native, whatsapp-approval-plugin-native.
- Status reactions: whatsapp-status-reactions.
The catalog currently contains 35 scenarios. The live-frontier default lane is
kept small at 8 scenarios for fast smoke coverage. The mock-openai default
lane runs 29 deterministic scenarios through the real WhatsApp transport while
mocking only model output. Approval scenarios and a few heavier/blocking checks
remain explicit by scenario id.
The WhatsApp QA driver observes structured live events (text, media,
location, reaction, and poll) and can actively send media, polls,
contacts, locations, and stickers. QA Lab imports that driver through the
@OmeniaClaw/whatsapp/api.js package surface instead of reaching into private
WhatsApp runtime files. Message content is redacted by default. Outbound
poll and upload-file coverage run through deterministic gateway poll and
message.action calls instead of model-prompt-only tool invocation.
Output artifacts:
- whatsapp-qa-report.md
- qa-evidence.json - evidence entries for the live transport checks.
- whatsapp-qa-observed-messages.json - bodies redacted unless OmeniaClaw_QA_WHATSAPP_CAPTURE_CONTENT=1.
Convex credential pool
Telegram, Discord, Slack, and WhatsApp lanes can lease credentials from a shared Convex pool instead of reading the env vars above. Pass --credential-source convex (or set OmeniaClaw_QA_CREDENTIAL_SOURCE=convex); QA Lab acquires an exclusive lease, heartbeats it for the duration of the run, and releases it on shutdown. Pool kinds are "telegram", "discord", "slack", and "whatsapp".
Payload shapes the broker validates on admin/add:
- Telegram (kind: "telegram"): { groupId: string, driverToken: string, sutToken: string } - groupId must be a numeric chat-id string.
- Telegram real user (kind: "telegram-user"): { groupId: string, sutToken: string, testerUserId: string, testerUsername: string, telegramApiId: string, telegramApiHash: string, tdlibDatabaseEncryptionKey: string, tdlibArchiveBase64: string, tdlibArchiveSha256: string, desktopTdataArchiveBase64: string, desktopTdataArchiveSha256: string } - Mantis Telegram Desktop proof only. Generic QA Lab lanes must not acquire this kind.
- Discord (kind: "discord"): { guildId: string, channelId: string, driverBotToken: string, sutBotToken: string, sutApplicationId: string }.
- WhatsApp (kind: "whatsapp"): { driverPhoneE164: string, sutPhoneE164: string, driverAuthArchiveBase64: string, sutAuthArchiveBase64: string, groupJid?: string } - phone numbers must be distinct E.164 strings.
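The two stated constraints (numeric Telegram groupId, distinct E.164 WhatsApp numbers) can be sketched as plain validators. The payload field names are copied from the shapes above; the function names, the regular expressions, and the error strings are assumptions for illustration and are not the broker's actual validation code.

```typescript
// Field names follow the payload shapes listed above. Everything else is a
// hypothetical sketch, not the broker's real admin/add validation.
type TelegramPayload = { groupId: string; driverToken: string; sutToken: string };

type WhatsAppPayload = {
  driverPhoneE164: string;
  sutPhoneE164: string;
  driverAuthArchiveBase64: string;
  sutAuthArchiveBase64: string;
  groupJid?: string;
};

// Assumption: a numeric chat id is an optional leading "-" followed by digits
// (Telegram group chat ids are negative numbers).
const NUMERIC_CHAT_ID = /^-?\d+$/;
// Assumption: E.164 is "+" followed by up to 15 digits with a non-zero first digit.
const E164 = /^\+[1-9]\d{1,14}$/;

function validateTelegram(payload: TelegramPayload): string[] {
  const errors: string[] = [];
  if (!NUMERIC_CHAT_ID.test(payload.groupId)) {
    errors.push("groupId must be a numeric chat-id string");
  }
  return errors;
}

function validateWhatsApp(payload: WhatsAppPayload): string[] {
  const errors: string[] = [];
  if (!E164.test(payload.driverPhoneE164)) errors.push("driverPhoneE164 must be E.164");
  if (!E164.test(payload.sutPhoneE164)) errors.push("sutPhoneE164 must be E.164");
  if (payload.driverPhoneE164 === payload.sutPhoneE164) {
    errors.push("driver and SUT phone numbers must be distinct");
  }
  return errors;
}
```

Returning a list of errors rather than throwing on the first one makes it easier to fix a rejected payload in a single pass.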
The Mantis Telegram Desktop proof workflow holds one exclusive Convex
telegram-user lease for both the TDLib CLI driver and Telegram Desktop
witness, then releases it after publishing proof.
When a PR needs a deterministic visual diff, Mantis can use the same mock model
reply on main and on the PR head while the Telegram formatter or delivery
layer changes. Capture defaults are tuned for PR comments: standard Crabbox
class, 24fps desktop recording, 24fps motion GIF, and 1920px preview width.
Before/after comments should publish a clean bundle that contains only the
intended GIFs.
Slack lanes can also use the pool. Slack payload shape checks currently live in the Slack QA runner rather than the broker; use { channelId: string, driverBotToken: string, sutBotToken: string, sutAppToken: string }, with a Slack channel id like Cxxxxxxxxxx. See Setting up the Slack workspace for app and scope provisioning.
Operational env vars and the Convex broker endpoint contract live in Testing → Shared Telegram credentials via Convex (the section name predates the multi-channel pool; the lease semantics are shared across kinds).
Repo-backed seeds
Seed assets live in qa/:
- qa/scenarios/index.yaml
- qa/scenarios/<theme>/*.yaml
These are intentionally in git so the QA plan is visible to both humans and the agent.
qa-lab should stay a generic YAML scenario runner. Each scenario YAML file is
the source of truth for one test run and should define:
- top-level title and scenario metadata
- optional category, capability, lane, and risk metadata in scenario
- docs and code refs in scenario
- optional plugin requirements in scenario
- optional gateway config patch in scenario
- executable top-level flow for flow scenarios, or scenario.execution.kind / scenario.execution.path for Vitest and Playwright scenarios
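To make the required fields concrete, here is one way the parsed YAML could be modeled once loaded. Only title, scenario, flow, scenario.execution.kind, scenario.execution.path, docsRefs, and codeRefs are named in this document; the type names, the array shape of flow, and the `scenarioMode` helper are assumptions for this sketch.

```typescript
// Hypothetical model of a parsed scenario file. Field names that do not
// appear in the text (and the exact value shapes) are assumptions.
type ScenarioExecution = {
  kind: "vitest" | "playwright" | "script";
  path: string;
  args?: string[];
};

type ScenarioFile = {
  title: string;
  scenario: {
    docsRefs?: string[];
    codeRefs?: string[];
    execution?: ScenarioExecution;
  };
  // Assumption: flow is a list of steps; the step shape is left opaque here.
  flow?: unknown[];
};

// Decides how a scenario would be run: a top-level flow wins, otherwise the
// declared execution kind is used. A file with neither is rejected.
function scenarioMode(file: ScenarioFile): "flow" | ScenarioExecution["kind"] {
  if (file.flow !== undefined && file.flow.length > 0) return "flow";
  if (file.scenario.execution !== undefined) return file.scenario.execution.kind;
  throw new Error(`scenario "${file.title}" defines neither flow nor scenario.execution`);
}
```

Keeping this decision in one function is consistent with the goal of a generic runner: the runner dispatches on what the file declares rather than on which channel the scenario targets.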
The reusable runtime surface that backs flow is allowed to stay generic
and cross-cutting. For example, YAML scenarios can combine transport-side
helpers with browser-side helpers that drive the embedded Control UI through the
Gateway browser.request seam without adding a special-case runner.
Scenario files should be grouped by product capability rather than source tree
folder. Keep scenario IDs stable when files move; use docsRefs and codeRefs
for implementation traceability.
The baseline list should stay broad enough to cover:
- DM and channel chat
- thread behavior
- message action lifecycle
- cron callbacks
- memory recall
- model switching
- subagent handoff
- repo-reading and docs-reading
- one small build task such as Lobster Invaders
Provider mock lanes
qa suite has two local provider mock lanes:
- mock-openai is the scenario-aware OmeniaClaw mock. It remains the default deterministic mock lane for repo-backed QA and parity gates.
- aimock starts an AIMock-backed provider server for experimental protocol, fixture, record/replay, and chaos coverage. It is additive and does not replace the mock-openai scenario dispatcher.
Provider-lane implementation lives under extensions/qa-lab/src/providers/.
Each provider owns its defaults, local server startup, gateway model config,
auth-profile staging needs, and live/mock capability flags. Shared suite and
gateway code should route through the provider registry instead of branching on
provider names.
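The "route through the registry" rule can be sketched as follows. The real provider interface under extensions/qa-lab/src/providers/ is not shown in this document, so the `QaProvider` fields, the function names, and the placeholder model value are all assumptions; only the two provider ids come from the text.

```typescript
// Hypothetical provider shape; the real interface may differ.
type QaProvider = {
  id: string;
  mode: "mock" | "live";
  defaultModel: string;
  needsLocalServer: boolean;
};

const providerRegistry = new Map<string, QaProvider>();

function registerProvider(provider: QaProvider): void {
  if (providerRegistry.has(provider.id)) {
    throw new Error(`provider "${provider.id}" is already registered`);
  }
  providerRegistry.set(provider.id, provider);
}

function resolveProvider(id: string): QaProvider {
  const provider = providerRegistry.get(id);
  if (provider === undefined) throw new Error(`unknown provider "${id}"`);
  return provider;
}

// Shared code asks the provider for a capability flag instead of writing
// `if (id === "mock-openai")` style branches.
function shouldStartLocalServer(id: string): boolean {
  return resolveProvider(id).needsLocalServer;
}

// The ids are from the text; the model value is a placeholder.
registerProvider({ id: "mock-openai", mode: "mock", defaultModel: "placeholder-model", needsLocalServer: true });
registerProvider({ id: "aimock", mode: "mock", defaultModel: "placeholder-model", needsLocalServer: true });
```

With this layout, adding a provider means adding one registration; suite and gateway code do not need to change.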
Transport adapters
qa-lab owns a generic transport seam for YAML QA scenarios. qa-channel is the first adapter on that seam, but the design target is wider: future real or synthetic channels should plug into the same suite runner instead of adding a transport-specific QA runner.
At the architecture level, the split is:
- qa-lab owns generic scenario execution, worker concurrency, artifact writing, and reporting.
- The transport adapter owns gateway config, readiness, inbound and outbound observation, transport actions, and normalized transport state.
- YAML scenario files under qa/scenarios/ define the test run; qa-lab provides the reusable runtime surface that executes them.
Adding a channel
Adding a channel to the YAML QA system requires exactly two things:
- A transport adapter for the channel.
- A scenario pack that exercises the channel contract.
Do not add a new top-level QA command root when the shared qa-lab host can own the flow.
qa-lab owns the shared host mechanics:
- the OmeniaClaw qa command root
- suite startup and teardown
- worker concurrency
- artifact writing
- report generation
- scenario execution
- compatibility aliases for older qa-channel scenarios
Runner plugins own the transport contract:
- how OmeniaClaw qa <runner> is mounted beneath the shared qa root
- how the gateway is configured for that transport
- how readiness is checked
- how inbound events are injected
- how outbound messages are observed
- how transcripts and normalized transport state are exposed
- how transport-backed actions are executed
- how transport-specific reset or cleanup is handled
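The ownership split above can be pictured as an interface that the host calls and each runner implements. This is a minimal sketch under stated assumptions: the interface name, its method names, and the echoing in-memory transport are all hypothetical and do not reflect the actual qa-lab seam.

```typescript
// Hypothetical transport contract; names are illustrative only.
type TransportMessage = {
  direction: "inbound" | "outbound";
  conversation: string;
  text: string;
};

interface QaTransportAdapter {
  isReady(): boolean;
  injectInbound(conversation: string, text: string): void;
  outboundMessages(): TransportMessage[];
  snapshot(): { messages: TransportMessage[] };
  reset(): void;
}

// In-memory stand-in that echoes every inbound message, for illustration.
class EchoTransport implements QaTransportAdapter {
  private messages: TransportMessage[] = [];

  isReady(): boolean {
    return true;
  }

  injectInbound(conversation: string, text: string): void {
    this.messages.push({ direction: "inbound", conversation, text });
    this.messages.push({ direction: "outbound", conversation, text: `echo: ${text}` });
  }

  outboundMessages(): TransportMessage[] {
    return this.messages.filter((message) => message.direction === "outbound");
  }

  snapshot(): { messages: TransportMessage[] } {
    return { messages: this.messages.slice() };
  }

  reset(): void {
    this.messages = [];
  }
}

// Host-side step: it only uses the interface, never a concrete transport.
function runEchoCheck(adapter: QaTransportAdapter): boolean {
  if (!adapter.isReady()) return false;
  adapter.reset();
  adapter.injectInbound("dm:alice", "ping");
  return adapter.outboundMessages().some((message) => message.text.indexOf("ping") !== -1);
}
```

Because `runEchoCheck` depends only on the interface, the same host logic would work for any runner that satisfies the contract.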
The minimum adoption bar for a new channel:
- Keep qa-lab as the owner of the shared qa root.
- Implement the transport runner on the shared qa-lab host seam.
- Keep transport-specific mechanics inside the runner plugin or channel harness.
- Mount the runner as OmeniaClaw qa <runner> instead of registering a competing root command. Runner plugins should declare qaRunners in OmeniaClaw.plugin.json and export a matching qaRunnerCliRegistrations array from runtime-api.ts. Keep runtime-api.ts light; lazy CLI and runner execution should stay behind separate entrypoints.
- Author or adapt YAML scenarios under the themed qa/scenarios/ directories.
- Use the generic scenario helpers for new scenarios.
- Keep existing compatibility aliases working unless the repo is doing an intentional migration.
The decision rule is strict:
- If behavior can be expressed once in qa-lab, put it in qa-lab.
- If behavior depends on one channel transport, keep it in that runner plugin or plugin harness.
- If a scenario needs a new capability that more than one channel can use, add a generic helper instead of a channel-specific branch in suite.ts.
- If a behavior is only meaningful for one transport, keep the scenario transport-specific and make that explicit in the scenario contract.
Scenario helper names
Preferred generic helpers for new scenarios:
- waitForTransportReady
- waitForChannelReady
- injectInboundMessage
- injectOutboundMessage
- waitForTransportOutboundMessage
- waitForChannelOutboundMessage
- waitForNoTransportOutbound
- getTransportSnapshot
- readTransportMessage
- readTransportTranscript
- formatTransportTranscript
- resetTransport
Compatibility aliases remain available for existing scenarios - waitForQaChannelReady, waitForOutboundMessage, waitForNoOutbound, formatConversationTranscript, resetBus - but new scenario authoring should use the generic names. The aliases exist to avoid a flag-day migration, not as the model going forward.
Reporting
qa-lab exports a Markdown protocol report from the observed bus timeline.
The report should answer:
- What worked
- What failed
- What stayed blocked
- What follow-up scenarios are worth adding
For the inventory of available scenarios - useful when sizing follow-up work or wiring a new transport - run pnpm OmeniaClaw qa coverage (add --json for machine-readable output).
When choosing focused proof for a touched behavior or file path, run pnpm OmeniaClaw qa coverage --match <query>.
The match report searches scenario metadata, docs refs, code refs, coverage IDs, plugins, and provider requirements, then prints matching qa suite --scenario ... targets.
Every qa suite run writes top-level qa-evidence.json,
qa-suite-summary.json, and qa-suite-report.md artifacts for the selected
scenario set. Scenarios that declare execution.kind: vitest or
execution.kind: playwright run the matching test path and also write
per-scenario logs. Scenarios that declare execution.kind: script run the
evidence producer at execution.path through node --import tsx (with
${outputDir} and ${scenarioId} expanded in execution.args); the producer
writes its own qa-evidence.json, whose entries are imported into the suite
output and whose artifact paths are resolved relative to that producer
qa-evidence.json. When qa suite is reached through
qa run --qa-profile, the same qa-evidence.json also includes the profile
scorecard summary for the selected taxonomy categories.
Treat the coverage match report as a discovery aid, not a gate replacement; the selected scenario still needs the right provider mode, live transport, Multipass, Testbox, or release lane for the behavior under test.
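For execution.kind: script scenarios, the ${outputDir} and ${scenarioId} expansion in execution.args could look roughly like the helper below. The placeholder names come from the text; the function name, the context type, and the plain string substitution are assumptions, and the real suite may expand arguments differently.

```typescript
// Hypothetical helper: substitutes the two documented placeholders in each
// argument. Plain double-quoted strings keep "${...}" literal in TypeScript.
type ExpansionContext = { outputDir: string; scenarioId: string };

function expandExecutionArgs(args: string[], context: ExpansionContext): string[] {
  return args.map((arg) =>
    arg
      .split("${outputDir}").join(context.outputDir)
      .split("${scenarioId}").join(context.scenarioId),
  );
}
```

Arguments without placeholders pass through unchanged, so a producer that ignores both values still receives its arguments as written.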
For character and style checks, run the same scenario across multiple live model refs and write a judged Markdown report:
pnpm OmeniaClaw qa character-eval \
  --model openai/gpt-5.5,thinking=medium,fast \
  --model openai/gpt-5.2,thinking=xhigh \
  --model openai/gpt-5,thinking=xhigh \
  --model anthropic/claude-opus-4-8,thinking=high \
  --model anthropic/claude-sonnet-4-6,thinking=high \
  --model zai/glm-5.1,thinking=high \
  --model moonshot/kimi-k2.5,thinking=high \
  --model google/gemini-3.1-pro-preview,thinking=high \
  --judge-model openai/gpt-5.5,thinking=xhigh,fast \
  --judge-model anthropic/claude-opus-4-8,thinking=high \
  --blind-judge-models \
  --concurrency 16 \
  --judge-concurrency 16
The command runs local QA gateway child processes, not Docker. Character eval
scenarios should set the persona through SOUL.md, then run ordinary user turns
such as chat, workspace help, and small file tasks. The candidate model should
not be told that it is being evaluated. The command preserves each full
transcript, records basic run stats, then asks the judge models in fast mode with
xhigh reasoning where supported to rank the runs by naturalness, vibe, and humor.
Use --blind-judge-models when comparing providers: the judge prompt still gets
every transcript and run status, but candidate refs are replaced with neutral
labels such as candidate-01; the report maps rankings back to real refs after
parsing.
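The blinding step can be illustrated with a small label map. The candidate-NN label format follows the text; the function names, the `BlindMapping` type, and the decision to label in input order (with no shuffling) are assumptions for this sketch rather than the command's actual behavior.

```typescript
// Hypothetical sketch of blind judging: refs are swapped for neutral labels
// before judging and swapped back after the ranking is parsed.
type BlindMapping = { labels: string[]; toRef: Map<string, string> };

function blindCandidates(refs: string[]): BlindMapping {
  const toRef = new Map<string, string>();
  const labels = refs.map((ref, index) => {
    const position = index + 1;
    // Two-digit zero padding, matching labels such as candidate-01.
    const label = `candidate-${position < 10 ? "0" : ""}${position}`;
    toRef.set(label, ref);
    return label;
  });
  return { labels, toRef };
}

// Maps a judge ranking expressed in neutral labels back to real model refs.
function unblindRanking(ranking: string[], mapping: BlindMapping): string[] {
  return ranking.map((label) => {
    const ref = mapping.toRef.get(label);
    if (ref === undefined) throw new Error(`judge returned unknown label "${label}"`);
    return ref;
  });
}
```

Rejecting unknown labels guards against a judge inventing a candidate that was never in the prompt.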
Candidate runs default to high thinking, with medium for GPT-5.5 and xhigh
for older OpenAI eval refs that support it. Override a specific candidate inline with
--model provider/model,thinking=<level>. --thinking <level> still sets a
global fallback, and the older --model-thinking <provider/model=level> form is
kept for compatibility.
OpenAI candidate refs default to fast mode so priority processing is used where
the provider supports it. Add ,fast, ,no-fast, or ,fast=false inline when a
single candidate or judge needs an override. Pass --fast only when you want to
force fast mode on for every candidate model. Candidate and judge durations are
recorded in the report for benchmark analysis, but judge prompts explicitly say
not to rank by speed.
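The inline override syntax (provider/model followed by ,thinking=<level>, ,fast, ,no-fast, or ,fast=false) can be parsed along these lines. This is a sketch under assumptions: the `ModelSpec` type and `parseModelSpec` function are hypothetical, and unset fields are left undefined so that the defaults described above would still apply.

```typescript
// Hypothetical parser for the inline --model / --judge-model form. Only the
// option spellings listed in the text are accepted.
type ModelSpec = { ref: string; thinking?: string; fast?: boolean };

function parseModelSpec(input: string): ModelSpec {
  const [ref, ...options] = input.split(",").map((part) => part.trim());
  if (ref === undefined || ref.indexOf("/") === -1) {
    throw new Error(`expected provider/model, got "${input}"`);
  }
  const spec: ModelSpec = { ref };
  for (const option of options) {
    if (option === "fast") {
      spec.fast = true;
    } else if (option === "no-fast" || option === "fast=false") {
      spec.fast = false;
    } else if (option.startsWith("thinking=")) {
      spec.thinking = option.slice("thinking=".length);
    } else {
      throw new Error(`unknown option "${option}" in "${input}"`);
    }
  }
  return spec;
}
```

Leaving thinking and fast undefined when they are not given keeps the parser separate from the per-provider default logic.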
Candidate and judge model runs both default to concurrency 16. Lower
--concurrency or --judge-concurrency when provider limits or local gateway
pressure make a run too noisy.
When no candidate --model is passed, the character eval defaults to
openai/gpt-5.5, openai/gpt-5.2, openai/gpt-5, anthropic/claude-opus-4-8,
anthropic/claude-sonnet-4-6, zai/glm-5.1, moonshot/kimi-k2.5, and
google/gemini-3.1-pro-preview.
When no --judge-model is passed, the judges default to
openai/gpt-5.5,thinking=xhigh,fast and
anthropic/claude-opus-4-8,thinking=high.