# Token Management Strategy

This document consolidates the API token management strategies for the three core external services: Keepa, XAI, and Amazon SP-API. These strategies are critical for maintaining system stability and controlling costs.

---

## 1. Keepa API: "Controlled Deficit" Optimization

**Strategy:** Aggressive consumption above a threshold, quick recovery below it.
**Implementation:** `keepa_deals/token_manager.py` (Distributed Token Bucket via Redis)

### The "Why"
Keepa's API allows the token balance to go negative (deficit spending) as long as the starting balance is positive. A strict "no deficit" policy would force every request to wait until its full cost had accumulated, which causes long delays for expensive calls. The "Controlled Deficit" strategy maximizes throughput by leveraging this allowance. To handle multiple concurrent workers (Backfiller + Upserter) without race conditions, the system uses a **Shared Redis State**.

### Dynamic Rate Adaptation
The `TokenManager` dynamically updates its `REFILL_RATE_PER_MINUTE` from the `refillRate` field in Keepa API responses.
*   **Default:** Starts with a conservative guess (e.g., 5/min).
*   **Adaptation:** Upon the first API sync, it learns the true rate (e.g., 20/min for upgraded plans).
*   **Benefit:** Users who upgrade their Keepa plan immediately see faster processing without code changes.
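
A minimal sketch of the adaptation step, assuming the parsed Keepa response is a dict that carries `refillRate` in tokens per minute. `RefillRateTracker` and `sync_from_response` are hypothetical names used for illustration, not the actual `TokenManager` API.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Conservative guess used until the first Keepa response arrives.
DEFAULT_REFILL_RATE_PER_MINUTE = 5.0


@dataclass
class RefillRateTracker:
    """Hypothetical holder for the learned Keepa refill rate (tokens/min)."""

    refill_rate_per_minute: float = DEFAULT_REFILL_RATE_PER_MINUTE

    def sync_from_response(self, response: Mapping[str, Any]) -> float:
        """Adopt Keepa's reported `refillRate` when it is a positive number."""
        rate = response.get("refillRate")
        if isinstance(rate, (int, float)) and not isinstance(rate, bool) and rate > 0:
            self.refill_rate_per_minute = float(rate)
        # A missing or malformed field keeps the previously known rate.
        return self.refill_rate_per_minute


tracker = RefillRateTracker()
tracker.sync_from_response({"tokensLeft": 120, "refillRate": 20})  # now plans with 20/min
```

Ignoring malformed values keeps a single bad response from resetting an upgraded plan back to the conservative default.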

### The Algorithm (Distributed & Float-Safe)
Since Keepa tokens are floating-point numbers (e.g., 54.5), and multiple workers compete for them, the system uses an **Optimistic Locking** strategy with Redis `incrbyfloat`.

**Key Thresholds:**
*   **`MIN_TOKEN_THRESHOLD`:** Reduced to **1**. This aggressive setting ensures that as long as we have a positive balance (even 1.0), we can initiate a request.
*   **`MAX_DEFICIT` (Safety Limit):** Set to **-180**. If a projected request (Current Tokens - Cost) would push the balance below this limit, it is strictly blocked. This prevents hitting Keepa's hard lockout limit of -200, which would ban the API key for 24 hours.
*   **`BURST_THRESHOLD` (Recharge Target):**
    *   **High Refill Rates (>= 10/min):** Target **280** tokens.
    *   **Low Refill Rates (< 10/min):** Target **40** tokens. Waiting for 280 tokens at 5/min takes about an hour (280 / 5 = 56 minutes), which looks like a system stall. Lowering the target to 40 (8 minutes from empty) allows frequent, smaller bursts of activity ("Empty the Bucket" strategy).
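
The thresholds reduce to a few constants and small helpers. The sketch below uses hypothetical helper names; it also reproduces the stall arithmetic, since from an empty bucket at 5 tokens/min reaching 280 takes 56 minutes while reaching 40 takes 8.

```python
MIN_TOKEN_THRESHOLD = 1.0
MAX_DEFICIT = -180.0       # stay clear of Keepa's -200 lockout
HIGH_RATE_CUTOFF = 10.0    # tokens/min separating the two recharge targets


def burst_threshold(refill_rate_per_minute: float) -> float:
    """Recharge target: 280 tokens on fast plans, 40 on slow ones."""
    return 280.0 if refill_rate_per_minute >= HIGH_RATE_CUTOFF else 40.0


def violates_deficit(balance: float, cost: float) -> bool:
    """True when spending `cost` would push the balance below MAX_DEFICIT."""
    return balance - cost < MAX_DEFICIT


def minutes_to_reach(target: float, balance: float, refill_rate_per_minute: float) -> float:
    """Minutes of pure refill needed to climb from `balance` to `target` (rate > 0)."""
    if balance >= target:
        return 0.0
    return (target - balance) / refill_rate_per_minute
```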

**Logic Flow:**
1.  **Atomic Reservation:** The worker unconditionally decrements the shared Redis counter (`incrbyfloat -cost`).
2.  **Threshold Check:** `incrbyfloat` atomically returns the new balance; adding `cost` back gives the starting balance this worker saw.
3.  **Aggressive Phase:** If the starting balance was positive (>= 1) AND the result is above `MAX_DEFICIT`, the operation proceeds.
4.  **Recharge Mode (Low Rate Protection):** If the Keepa refill rate is < 10/min AND the balance drops below `MIN_TOKEN_THRESHOLD`, the system enters **Recharge Mode**.
    *   **Action:** All requests are blocked.
    *   **Exit Condition:** Wait until tokens reach the dynamic `BURST_THRESHOLD` (40 or 280). This prevents "flapping" (oscillating between 0 and 5 tokens).
5.  **Recovery Phase (Revert):** If the reservation fails the check and Recharge Mode is not triggered, the worker **reverts** the reservation (`incrbyfloat +cost`) and waits.
6.  **TokenRechargeError (Lock Release):** If the calculated wait time exceeds **60 seconds**, the TokenManager raises a `TokenRechargeError` instead of sleeping. The calling task (Smart Ingestor) catches this exception and **immediately releases the Redis lock**. This allows the Celery worker to process other tasks (like Janitor or Gating Checks) while waiting for tokens to recharge, preventing worker starvation.
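
The flow above can be sketched against any client exposing redis-py's `incrbyfloat(name, amount)`, which returns the updated value atomically. This is an illustration under assumptions, not the contents of `token_manager.py`: the key name `keepa:tokens`, the class `SharedTokenBucket`, the `(granted, wait_seconds)` return shape, and keeping the Recharge Mode flag per-process are all hypothetical. The sketch also returns the reserved tokens on every rejection, including Recharge Mode blocks.

```python
from typing import Tuple

TOKENS_KEY = "keepa:tokens"        # hypothetical Redis key holding the shared balance
MIN_TOKEN_THRESHOLD = 1.0
MAX_DEFICIT = -180.0
LOW_RATE_CUTOFF = 10.0             # tokens/min below which Recharge Mode applies
MAX_INLINE_WAIT_SECONDS = 60.0


class TokenRechargeError(Exception):
    """Raised instead of sleeping when the projected wait exceeds 60 seconds."""

    def __init__(self, wait_seconds: float) -> None:
        super().__init__(f"tokens recharging, retry in ~{wait_seconds:.0f}s")
        self.wait_seconds = wait_seconds


class SharedTokenBucket:
    """Sketch of the reserve/check/revert loop; `client` is e.g. a redis.Redis."""

    def __init__(self, client, refill_rate_per_minute: float, burst_target: float) -> None:
        if refill_rate_per_minute <= 0:
            raise ValueError("refill rate must be positive")
        self.client = client
        self.refill_rate = refill_rate_per_minute
        self.burst_target = burst_target
        # Per-process for brevity; a multi-worker setup would keep this flag in Redis too.
        self.recharging = False

    def reserve(self, cost: float) -> Tuple[bool, float]:
        """Return (granted, seconds_to_wait_before_retrying)."""
        # Step 1: atomic reservation; INCRBYFLOAT returns the post-decrement balance.
        new_balance = float(self.client.incrbyfloat(TOKENS_KEY, -cost))
        # Step 2: the balance as it was just before this worker's decrement.
        start_balance = new_balance + cost

        if self.recharging and start_balance >= self.burst_target:
            self.recharging = False  # refilled to the burst target: leave Recharge Mode

        # Step 3: aggressive phase, positive start and not pushed below MAX_DEFICIT.
        if (not self.recharging and start_balance >= MIN_TOKEN_THRESHOLD
                and new_balance >= MAX_DEFICIT):
            return True, 0.0

        # Step 5: revert, handing the reserved tokens back before waiting.
        self.client.incrbyfloat(TOKENS_KEY, cost)

        # Step 4: slow plans that have run dry enter Recharge Mode.
        if self.refill_rate < LOW_RATE_CUTOFF and start_balance < MIN_TOKEN_THRESHOLD:
            self.recharging = True

        if self.recharging:
            target = self.burst_target
        else:
            target = max(MIN_TOKEN_THRESHOLD, cost + MAX_DEFICIT)
        wait_seconds = (target - start_balance) / self.refill_rate * 60.0

        # Step 6: long waits are surfaced so the caller can release its lock.
        if wait_seconds > MAX_INLINE_WAIT_SECONDS:
            raise TokenRechargeError(wait_seconds)
        return False, wait_seconds


class InMemoryCounter:
    """Test double exposing the redis-py style incrbyfloat(name, amount)."""

    def __init__(self, start: float) -> None:
        self.values = {TOKENS_KEY: float(start)}

    def incrbyfloat(self, name: str, amount: float = 1.0) -> float:
        self.values[name] = self.values.get(name, 0.0) + amount
        return self.values[name]
```

With a real `redis.Redis` as `client`, a worker would sleep for `wait_seconds` after a rejection and retry, while a `TokenRechargeError` propagates to the Smart Ingestor so it can release its lock.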

### Resilience & Crash Recovery (Zombie Locks)
To prevent "Zombie Locks" (stale locks persisting after a crash or deployment), the system employs a "Brain Wipe" strategy during shutdown:
*   **Script:** `Diagnostics/kill_redis_safely.py` (invoked by `kill_everything_force.sh`).
*   **Action:** Connects to Redis, executes `FLUSHALL` (deletes every key in every database, including the token balance and all locks), then `SAVE` (synchronously writes the now-empty dataset to disk).
*   **Result:** This ensures that when the system restarts, the token state and locks are completely reset, preventing "Task already running" errors.
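
A sketch of the wipe, written against redis-py, whose `Redis` client exposes `flushall()` and `save()`. The function name is hypothetical, and the actual contents of `Diagnostics/kill_redis_safely.py` may differ.

```python
from typing import Any


def brain_wipe(client: Any) -> None:
    """Reset all shared state: token balance, locks, and every other key."""
    client.flushall()  # FLUSHALL: delete all keys in all databases
    client.save()      # SAVE: blocking snapshot so the empty state is what a restart reloads
```

In a shutdown script this would be called as `brain_wipe(redis.Redis(host="localhost", port=6379))`, with the connection settings assumed. Because `FLUSHALL` removes every key on the instance, anything else stored in the same Redis (for example, a Celery broker queue, if one shares it) is cleared as well.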

### Task-Specific Buffers
*   **Smart Ingestor:**
    *   **Decoupled Batching:**
        *   **Peek (Discovery):** Uses a **Dynamic Batch Size** based on the refill rate. The system reserves **2 tokens per ASIN** (reduced from 5) for this lightweight check.
            *   **High Rate (>= 20/min):** Batch Size **50**. Optimized for speed.
            *   **Low Rate (10 to < 20/min):** Batch Size **20**. Prevents deficit lockout.
            *   **Critically Low (< 10/min):** Batch Size **1**. Optimized to fit within the "Burst Threshold" (40 tokens) while ensuring completion (Cost: ~22 tokens per ASIN, i.e., 2 for the Peek plus 20 for the Commit).
        *   **Commit (Analysis):** Always uses batch size **5**. Full product data is expensive (20 tokens/ASIN), so one batch costs about 100 tokens. Small batches prevent "Deficit Shock" (instantly hitting -200) and allow granular control.
*   **API Wrapper (`keepa_api.py`):**
    *   **Rate Limit Protection:** Functions like `fetch_deals_for_deals` accept an optional `token_manager` argument.
    *   **Behavior:** If provided, the wrapper calls `request_permission_for_call` *before* the API request. This enforces a blocking wait if tokens are low, preventing `429 Too Many Requests` errors during high-frequency ingestion loops.
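
The batch-sizing rules and the optional-`token_manager` pattern can be sketched as below. The per-ASIN costs come from the text; the function names are hypothetical, and `request_permission_for_call(cost)` taking the estimated cost as its only argument is an assumption about the real signature.

```python
from typing import Any, Callable, Optional

PEEK_COST_PER_ASIN = 2      # lightweight discovery check
COMMIT_COST_PER_ASIN = 20   # full product data
COMMIT_BATCH_SIZE = 5


def peek_batch_size(refill_rate_per_minute: float) -> int:
    """Pick the Peek batch size from the current refill rate."""
    if refill_rate_per_minute >= 20:
        return 50
    if refill_rate_per_minute >= 10:
        return 20
    return 1  # critically low rate


def batch_cost(batch_size: int, cost_per_asin: int) -> int:
    """Tokens reserved up front for one batch."""
    return batch_size * cost_per_asin


def call_with_permission(
    fetch: Callable[[], Any], cost: float, token_manager: Optional[Any] = None
) -> Any:
    """Ask the (optional) token manager first, then perform the API call."""
    if token_manager is not None:
        # Assumed signature; blocks until enough tokens are available.
        token_manager.request_permission_for_call(cost)
    return fetch()
```

Under these rules a full Peek batch costs 100 tokens at >= 20/min (50 × 2) and 40 tokens at 10 to 19/min (20 × 2), and a Commit batch costs 100 tokens (5 × 20).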

---

## 2. XAI API: Quotas & Model Selection

**Strategy:** Strict daily cap with aggressive local caching, utilizing the `grok-4-fast-reasoning` model for high-speed analysis.
**Implementation:** `keepa_deals/xai_token_manager.py`, `keepa_deals/xai_cache.py`, and `keepa_deals/ava_advisor.py`

### Model Selection
*   **Primary Model:** `grok-4-fast-reasoning`
*   **Use Cases:** Seasonality Classification, "List at" Price Reasonableness Check, Strategy Extraction, and "Advice from Ava".
*   **Why:** Provides the best balance of reasoning capability and speed for real-time and batch processing.

### Cost Control Mechanism
1.  **Daily Quota:**
    -   A JSON state file (`xai_token_state.json`) tracks `calls_today` and `last_reset_date`.
    -   Before any automated API call (e.g., price check), the manager checks if `calls_today < daily_limit` (default: 1000).
    -   If the limit is reached, the request is denied, and the system falls back to a default "Safe" assumption (e.g., assuming a price is reasonable to avoid rejecting valid deals).
2.  **Caching (`XaiCache`):**
    -   Results are cached in a local dictionary persisted to a JSON file.
    -   **Cache Key:** Composite key of `Title | Category | Season | Price`.
    -   **Hit:** If the key exists, the cached boolean result is returned immediately (0 cost).
    -   **Miss:** If not in cache and quota allows, the API is called, and the result is saved.
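
A compact sketch of the quota-then-cache decision, assuming the state file holds exactly `calls_today` and `last_reset_date` (an ISO date string) and the cache maps a `|`-joined key to a boolean. `DailyQuota`, `cache_key`, `is_price_reasonable`, and the `ask_model` callback are hypothetical stand-ins for `XaiTokenManager` and `XaiCache`.

```python
import json
from datetime import date
from pathlib import Path
from typing import Callable, Dict, Optional


class DailyQuota:
    """JSON-backed daily call counter mirroring xai_token_state.json."""

    def __init__(self, state_path: Path, daily_limit: int = 1000) -> None:
        self.state_path = state_path
        self.daily_limit = daily_limit

    def _load(self) -> dict:
        if self.state_path.exists():
            return json.loads(self.state_path.read_text())
        return {"calls_today": 0, "last_reset_date": date.today().isoformat()}

    def try_consume(self, today: Optional[date] = None) -> bool:
        """Count one call if the daily limit allows it; reset on a new day."""
        today_str = (today or date.today()).isoformat()
        state = self._load()
        if state.get("last_reset_date") != today_str:
            state = {"calls_today": 0, "last_reset_date": today_str}
        if state["calls_today"] >= self.daily_limit:
            return False
        state["calls_today"] += 1
        self.state_path.write_text(json.dumps(state))
        return True


def cache_key(title: str, category: str, season: str, price: float) -> str:
    """Composite key of Title | Category | Season | Price."""
    return f"{title}|{category}|{season}|{price}"


def is_price_reasonable(
    key: str, cache: Dict[str, bool], quota: DailyQuota, ask_model: Callable[[], bool]
) -> bool:
    if key in cache:               # hit: zero cost
        return cache[key]
    if not quota.try_consume():    # quota exhausted: "safe" default, not cached
        return True
    result = ask_model()           # miss: paid API call
    cache[key] = result
    return result
```

The safe default is deliberately left out of the cache in this sketch, so the same item gets a real check once quota is available again.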

### Exception: Admin Features
*   **Features:** Guided Learning (`/learn`) and "Advice from Ava" (`/api/ava-advice`).
*   **Policy:** These features operate on-demand (user-triggered) and currently bypass the strict daily quota limits managed by `XaiTokenManager`, though they still consume the underlying API credit.

---

## 3. Amazon SP-API: LWA-Only Authentication

**Strategy:** Persistent, offline access using Refresh Tokens without AWS IAM complexity.
**Implementation:** `keepa_deals/amazon_sp_api.py`, `keepa_deals/sp_api_tasks.py`

### The "Why"
Since October 2023, Amazon SP-API no longer requires AWS Signature Version 4 (SigV4) signing with IAM credentials; Private Applications rely solely on the **Login with Amazon (LWA)** Access Token. Removing the SigV4 requirement simplifies the architecture and eliminates `403 Forbidden` errors caused by IAM misconfiguration.

### Workflow
1.  **Initial Auth:**
    -   User authorizes the app in Seller Central.
    -   A **Refresh Token** is generated (manually or via OAuth).
2.  **Storage:**
    -   The `refresh_token`, `client_id`, and `client_secret` are stored securely (Env vars or DB `user_credentials` table).
3.  **Task Execution (Restriction Check):**
    -   **Token Refresh:** The system exchanges the Refresh Token for a short-lived `access_token` (valid for 1h).
    -   **API Call:** The `access_token` is passed in the `x-amz-access-token` HTTP header.
    -   **No Signing:** No AWS `AccessKey`/`SecretKey` is used or required.
    -   **Restriction Logic:** Calls `getListingsRestrictions` with the specific `conditionType` (e.g., `used_like_new`) to ensure accurate gating status.
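
The token exchange and the restriction call can be sketched with the standard library alone. The LWA endpoint (`https://api.amazon.com/auth/o2/token`), the `refresh_token` grant fields, the `/listings/2021-08-01/restrictions` path, and the `x-amz-access-token` header follow Amazon's public SP-API documentation; the function names and the choice of the North America production host are assumptions, and the real `amazon_sp_api.py` may use a different HTTP client.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Any, Dict

LWA_TOKEN_URL = "https://api.amazon.com/auth/o2/token"
SP_API_NA = "https://sellingpartnerapi-na.amazon.com"  # North America production host


def lwa_refresh_body(refresh_token: str, client_id: str, client_secret: str) -> bytes:
    """Form-encoded body exchanging a refresh token for an access token."""
    return urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode()


def fetch_access_token(refresh_token: str, client_id: str, client_secret: str) -> Dict[str, Any]:
    """POST to LWA; the response carries access_token and expires_in (seconds)."""
    req = urllib.request.Request(
        LWA_TOKEN_URL,
        data=lwa_refresh_body(refresh_token, client_id, client_secret),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    payload["expires_at"] = time.time() + float(payload["expires_in"])
    return payload


def restrictions_request(
    access_token: str, asin: str, seller_id: str, marketplace_id: str,
    condition_type: str = "used_like_new", base_url: str = SP_API_NA,
) -> urllib.request.Request:
    """Build a getListingsRestrictions call authenticated by the LWA token only."""
    query = urllib.parse.urlencode({
        "asin": asin,
        "sellerId": seller_id,
        "marketplaceIds": marketplace_id,
        "conditionType": condition_type,
    })
    return urllib.request.Request(
        f"{base_url}/listings/2021-08-01/restrictions?{query}",
        headers={"x-amz-access-token": access_token},  # no SigV4 signature
        method="GET",
    )
```

A caller would reuse the access token until `expires_at`, then refresh it; `urllib.request.urlopen(restrictions_request(...))` sends the request.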

### Environment Handling
*   **Sandbox vs. Production:** The system automatically detects if the token is valid for Sandbox or Production by probing the endpoints.
*   **Fallback:** If a Production call fails with 403, it logs the error but does not crash the worker. Items are marked with an error state (-1) in the `user_restrictions` table.
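
A sketch of both behaviors, with the HTTP layer abstracted into callables so the logic stays testable. The `probe` callback, the candidate order, and the integer returned by `check` are hypothetical; the two hosts are Amazon's North America production and sandbox endpoints, and the `-1` marker comes from the text above.

```python
import logging
import urllib.error
from typing import Callable, Iterable, Optional

PRODUCTION_NA = "https://sellingpartnerapi-na.amazon.com"
SANDBOX_NA = "https://sandbox.sellingpartnerapi-na.amazon.com"
RESTRICTION_ERROR = -1  # error state written to user_restrictions

logger = logging.getLogger(__name__)


def detect_environment(
    probe: Callable[[str], int],
    candidates: Iterable[str] = (PRODUCTION_NA, SANDBOX_NA),
) -> Optional[str]:
    """Return the first base URL whose probe answers HTTP 200, else None."""
    for base_url in candidates:
        if probe(base_url) == 200:
            return base_url
    return None


def restriction_status(check: Callable[[], int]) -> int:
    """Run one restriction check; a 403 is logged and mapped to RESTRICTION_ERROR."""
    try:
        return check()
    except urllib.error.HTTPError as exc:
        if exc.code != 403:
            raise  # other HTTP errors are left to the caller
        logger.error("SP-API returned 403 Forbidden: %s", exc.reason)
        return RESTRICTION_ERROR
```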
