OpenAI Codex provider cooldown: all models failed (no available auth profile)
Summary
In some sessions, OpenClaw fails before replying when every configured model in the fallback chain depends on openai-codex, and openai-codex profiles are unavailable (cooldown/rate-limit state).
Observed codex-only variants include:
- pure cooldown/unavailable profile errors
- first-model timeout + second-model cooldown/unavailable profile errors
This behaves like a provider/profile availability failure, not a content failure.
Environment
- OpenClaw version: not captured in the user report
- OS: not captured in the user report
- Channel: reported in direct chat workflow
- Models in the failed chain:
  - openai-codex/gpt-5.3-codex
  - openai-codex/gpt-5.3-codex-spark
Reproduction
- Configure primary and fallback models under the same provider (openai-codex).
- Trigger provider rate limiting or profile cooldown state (high request burst can cause this).
- Send a normal user prompt.
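The failing setup can be sketched as a chain in which every candidate resolves to one provider (the config shape below is hypothetical; OpenClaw's real schema was not captured in the report):

```python
# Hypothetical fallback-chain config; field names are illustrative,
# not OpenClaw's actual schema.
failing_chain = {
    "primary": "openai-codex/gpt-5.3-codex",
    "fallbacks": ["openai-codex/gpt-5.3-codex-spark"],
}

# Every candidate resolves to the same provider, so a provider-wide
# cooldown exhausts the whole chain at once.
candidates = [failing_chain["primary"], *failing_chain["fallbacks"]]
providers = {model.split("/")[0] for model in candidates}
assert providers == {"openai-codex"}  # no cross-provider diversity
```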
Expected vs actual
- Expected:
- A model in the chain responds, or fallback moves to another available provider.
- Actual:
- Agent fails before reply with codex provider/profile unavailability errors.
Exact reported errors (redacted variants):
Variant A
⚠️ Agent failed before reply: All models failed (2):
openai-codex/gpt-5.3-codex: No available auth profile for openai-codex (all in cooldown or unavailable). (rate_limit) |
openai-codex/gpt-5.3-codex-spark: Provider openai-codex is in cooldown (all profiles unavailable) (rate_limit).
Logs: openclaw logs --follow
Variant B
⚠️ Agent failed before reply: All models failed (2):
openai-codex/gpt-5.3-codex: LLM request timed out. (unknown) |
openai-codex/gpt-5.3-codex-spark: No available auth profile for openai-codex (all in cooldown or unavailable). (rate_limit).
Logs: openclaw logs --follow
Findings
- Core pattern: codex-only fallback can fully fail when profile availability collapses (cooldown/rate-limit).
- A first-model timeout can appear before/alongside cooldown errors on downstream candidates.
- Fallback is ineffective if all candidates share the same provider and that provider is unhealthy.
- This is likely a resilience/config gap (fallback diversity), not a prompt-specific content issue.
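The core pattern above can be illustrated with a minimal fallback loop (function and provider names are illustrative, not OpenClaw internals):

```python
# Minimal sketch of provider-level fallback exhaustion.
# All names here are illustrative, not OpenClaw internals.

def try_chain(chain, providers_in_cooldown):
    """Return the first reply, or raise once every candidate has failed."""
    errors = []
    for model in chain:
        provider = model.split("/")[0]
        if provider in providers_in_cooldown:
            errors.append(
                f"{model}: No available auth profile for {provider} "
                "(all in cooldown or unavailable). (rate_limit)"
            )
            continue
        return f"reply from {model}"
    raise RuntimeError(f"All models failed ({len(errors)}): " + " | ".join(errors))

chain = ["openai-codex/gpt-5.3-codex", "openai-codex/gpt-5.3-codex-spark"]

# Codex-only chain + codex cooldown -> total failure (Variant A shape).
try:
    try_chain(chain, providers_in_cooldown={"openai-codex"})
except RuntimeError as err:
    print(err)

# One non-codex candidate (hypothetical) restores a reply path.
print(try_chain(chain + ["other-provider/some-model"], {"openai-codex"}))
```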
Mitigation / Workaround
- Add cross-provider fallback (for example, one non-openai-codex model).
- Reduce bursty traffic and add spacing between heavy runs.
- Add/verify additional auth profiles for the provider if available.
- Retry after cooldown window.
- Capture supporting logs during failure:
  openclaw logs --follow
  openclaw status --all
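The retry-after-cooldown mitigation can be sketched as a backoff wrapper (the attempt counts and delays are assumptions, not OpenClaw defaults; `send` stands in for any request callable):

```python
import time

def retry_after_cooldown(send, attempts=3, base_delay=1.0):
    """Retry a request with exponential backoff between attempts.

    `send` is any callable that raises on rate-limit/cooldown errors;
    attempt counts and delays are illustrative, not OpenClaw defaults.
    """
    for attempt in range(attempts):
        try:
            return send()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # cooldown never cleared within the retry budget
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Usage: a request that fails once (cooldown), then succeeds.
calls = {"n": 0}
def flaky_send():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("rate_limit")
    return "ok"

print(retry_after_cooldown(flaky_send, base_delay=0.01))  # prints "ok"
```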
Risk / Impact
- Reliability: user gets no reply.
- Operations: repeated outages if traffic spikes continue.
- UX: appears as random total failure even with fallback configured.
- A related local Lighthouse note covers the broader multi-provider overlap.
- No upstream issue link has been added yet for this exact codex-only symptom in this note.
Next actions
References