ValoSwiss
ValoSwiss.Agenti
Swiss Smart Software · 65 specialists on-demand

rl trading

Infra/AI/Meta

Deep Reinforcement Learning trading agents on the AI4Finance-Foundation/FinRL framework + FinRL-X + FinRL_Crypto + microsoft/qlib (Qlib-RL). A2C/DDPG/PPO/TD3/SAC algorithms on custom gym/gymnasium envs (state: prices + factors + positions; action: continuous weight or discrete BUY/HOLD/SELL; reward: PnL - turnover penalty - drawdown…


Example prompts
  • "Create a standalone application that performs my main function."
  • "Show me the module's complete replication protocol."
  • "What are the main anti-recurrence patterns in my domain?"
  • "Audit the critical code under my responsibility."
# valoswiss-rl-trading (📈 QUANT/MARKETS)

**Macro-categoria**: 📈 QUANT/MARKETS
**Scope**: Deep Reinforcement Learning trading agents (FinRL stack) — A2C/DDPG/PPO/TD3/SAC
**Phase**: PROTOTYPE-PHASE (paper-trading default; live deploy requires a 30-day baseline + MIFID/FINMA approval)
**Sidecar Port**: 8899 (FastAPI FinRL training + inference)
**Training schedule**: nightly cron `0 2 * * *` on HA-master (GPU if available, CPU fallback for PPO)
**Owner downstream**: ADVISOR (own RL signals) · SUPERVISOR/ADMIN (cross-tenant + training config + policy explainability)
**Last aligned**: 2026-05-03 V20

---

## §0 · Pre-flight check (agent entry ritual)

Before any intervention on rl-trading, verify in this order:

1. **Branch + clean working tree**
   ```bash
   cd ~/git/valoswiss && git status --short && git log -3 --oneline
   ```
2. **Sidecar Python rl-trading health**
   ```bash
   curl -s http://127.0.0.1:8899/healthz | jq .
   ```
   Must return `{"status":"ok","version":"...","gymAvailable":true,"sb3Available":true,"finrlAvailable":true|false,"torchDevice":"cpu|cuda|mps","activePolicies":N}`. If 502 → PM2 down: `pm2 list | grep rl-trading-py`.
3. **NestJS proxy health**
   ```bash
   curl -s http://127.0.0.1:4010/api/rl/health -H "Cookie: valo_token=<dev-token>"
   ```
   Must return `{ sidecar:{status:'ok'}, circuitBreaker:{state:'closed', failures:0}, training:{lastRunAt: ..., lastRunStatus: 'SUCCESS'|'FAILED'|null} }`.
4. **Prisma schema sync**
   ```bash
   cd apps/api && npx prisma migrate status
   ```
   Verify that the 5 models `RlAgent` / `RlEnvironment` / `RlTrainingRun` / `RlEpisode` / `RlPolicy` and the 3 enums `RlAlgorithm` / `RlTrainingStatus` / `RlActionSpace` have been applied (idempotent V15).
5. **Tenant configs**: `tenants/ws.json` and `tenants/az.json` must have `"rlTrading": false` during the prototype phase (default false; SUPERVISOR/ADMIN enables it on demand). Placed immediately after `executionEngine`.
6. **Persona pack**: `apps/api/src/common/persona-packs/persona-packs.constants.ts` must include `'rlTrading'` in `defaultModules` ONLY for `ADVISOR` + `RELATIONSHIP_MANAGER` + `SUPERVISOR` + `ADMIN`. **NOT** in client-facing packs (CLIENT/PROSPECT/RETAIL_CLIENT/AFFLUENT_CLIENT/UHNW_CLIENT/FAMILY_OFFICE_PRINCIPAL → MIFID II).
7. **Module registry**: `apps/web/src/lib/module-registry.ts` must expose an `rlTrading` entry with `sidebarSection: 'OPERARE'`, `requiredRole: 'ADVISOR'`, `personaHint: 'predictive'`, icon `🤖`, tag `PROTOTYPE-PAPER-MODE`.
8. **Training environment validation**:
   ```bash
   curl -s http://127.0.0.1:8899/env/validate -X POST -d '{"envType":"single-stock","symbol":"AAPL"}'
   ```
   Must return `{"valid":true,"observationSpace":[...],"actionSpace":[...]}`. If `valid:false` → env config problem.
9. **R-Audit gate**: before any commit touching CRITICAL files (see §3), run `npx tsx scripts/r-audit.ts <file> --validate-business-logic`. Phase MAJOR (weight 8) during the prototype; phase CRITICAL after it.

If any of the 9 checks fails → **stop and record the deviation**. The 3-Point Registration V16 is a non-negotiable invariant (see `feedback_new_module_registration.md`).
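
The JSON contracts in checks 2 and 3 can be validated programmatically. A minimal sketch in pure Python; the field names follow the payloads quoted above, but the helper itself is illustrative, not part of the sidecar:

```python
def sidecar_healthy(payload: dict) -> bool:
    """Return True iff a /healthz payload satisfies the pre-flight contract."""
    if payload.get("status") != "ok":
        return False
    # gym + stable-baselines3 are hard requirements; FinRL may be absent
    if not (payload.get("gymAvailable") is True and payload.get("sb3Available") is True):
        return False
    if payload.get("torchDevice") not in {"cpu", "cuda", "mps"}:
        return False
    return isinstance(payload.get("activePolicies"), int)
```

Feed it the parsed response of `curl -s http://127.0.0.1:8899/healthz` and stop the pre-flight on `False`.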

---

## §1 · Areas of competence

### 1.1 Supported RL algorithms (stable-baselines3 + FinRL)

| Algorithm | Action space | Use case | Tier |
|---|---|---|---|
| **A2C** (Advantage Actor-Critic) | discrete + continuous | Baseline single-stock RL, on-policy | Default test |
| **DDPG** (Deep Deterministic PG) | continuous | Multi-asset portfolio weight | Mid-cap |
| **PPO** (Proximal Policy Opt) | discrete + continuous | Robust multi-asset default | **DEFAULT prod** |
| **TD3** (Twin Delayed DDPG) | continuous | More stable DDPG variant | Adv config |
| **SAC** (Soft Actor-Critic) | continuous | Max entropy exploration | UHNW exploratory |

**PPO is the production default** — robust on both continuous and discrete action spaces, with gradient clipping, generalized advantage estimation (GAE), and the maturity of the stable-baselines3 suite.
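
For reference, the GAE recursion that stable-baselines3's PPO applies internally can be sketched as follows (an illustrative re-implementation, not the library's code; episode-boundary masking is omitted for brevity):

```python
import numpy as np

def gae(rewards: np.ndarray, values: np.ndarray, last_value: float,
        gamma: float = 0.99, lam: float = 0.95) -> np.ndarray:
    """Generalized advantage estimation, computed backward over one rollout."""
    values = np.append(values, last_value)   # bootstrap value for the final state
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # one-step TD error
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```

With `lam=0` this reduces to one-step TD errors; with `lam=1` to full Monte-Carlo advantages.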

### 1.2 Environment custom (gym/gymnasium 0.29+ API)

#### 1.2.1 SingleStockEnv (FinRL-style)

```
state = [price_history(N), technical_indicators(K), position, cash, equity]
       shape = (N + K + 3,)
action = scalar ∈ [-1, +1]   # continuous: -1 sell-all, 0 hold, +1 buy-max
       OR ∈ {0, 1, 2}        # discrete: 0=SELL, 1=HOLD, 2=BUY
reward = ΔPnL_t - λ_turnover × |Δposition| - λ_drawdown × max(0, drawdown - threshold)
done = end-of-episode (T steps reached) OR equity < ruin_threshold
```

#### 1.2.2 PortfolioEnv (multi-asset, FinRL PortfolioOptimizationEnv)

```
state = [prices(M, N) flatten, factors(M, K) flatten, weights(M), cash, equity]
       shape = (M * N + M * K + M + 2,)
       M = num assets, N = lookback window, K = num factors per asset
action = vector ∈ Δ_M (simplex: sum=1, each ∈ [0,1])  # continuous portfolio weight
reward = ΔPnL_t - λ_turnover × ||Δw||_1 - λ_drawdown × max(0, dd - threshold)
                - λ_concentration × herfindahl(w)
done = T steps OR equity < ruin_threshold
```
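
Policy networks emit unconstrained vectors, so the env must map each action onto Δ_M. A minimal clip-and-renormalize sketch (the exact mapping used by the sidecar's PortfolioEnv is an assumption; FinRL-style envs often use softmax instead):

```python
import numpy as np

def to_simplex(raw: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project a raw action vector onto the simplex: weights >= 0, sum = 1."""
    w = np.clip(raw, 0.0, 1.0)
    total = w.sum()
    if total < eps:
        # Degenerate all-negative action: fall back to equal weights
        return np.full(raw.shape, 1.0 / len(raw))
    return w / total
```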

#### 1.2.3 CryptoEnv (FinRL_Crypto-style)

A PortfolioEnv variant for 24/7 markets: no closing auction, continuous price feed via ccxt, maker/taker fee model, funding rates for perp futures.
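
The maker/taker and funding components reduce to two pure functions; a sketch with illustrative defaults (the bps values and the sign convention for funding are assumptions, not exchange-specific figures):

```python
def crypto_trade_cost(notional: float, is_maker: bool,
                      maker_bps: float = 2.0, taker_bps: float = 5.0) -> float:
    """Per-trade fee under a maker/taker model, in account currency."""
    bps = maker_bps if is_maker else taker_bps
    return abs(notional) * bps / 1e4

def funding_payment(position_notional: float, funding_rate: float) -> float:
    """Periodic perp-futures funding: longs pay shorts when the rate is positive."""
    return -position_notional * funding_rate
```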

### 1.3 Reward shaping

Additive components:
- **PnL component**: `r_pnl = (equity_t - equity_{t-1}) / equity_{t-1}`
- **Turnover penalty**: `r_turn = -λ_T × ||Δw||_1` (`λ_T = 0.001` default; penalizes excessive trading)
- **Drawdown penalty**: `r_dd = -λ_D × max(0, dd_t - dd_threshold)` (`λ_D = 0.01`, threshold 5%)
- **Concentration penalty** (PortfolioEnv): `r_conc = -λ_C × herfindahl(w)` (`λ_C = 0.01`)
- **Sharpe shaping** (optional): incremental differential Sharpe ratio

Reward scaling: clip to `[-1, +1]` for PPO training stability.
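
Combined, the components above yield one scalar per step. A minimal sketch with the default λ values (the optional Sharpe shaping term is omitted):

```python
import numpy as np

def shaped_reward(equity_prev: float, equity_now: float, dw_l1: float,
                  drawdown: float, herfindahl: float = 0.0,
                  lam_t: float = 0.001, lam_d: float = 0.01,
                  lam_c: float = 0.01, dd_threshold: float = 0.05) -> float:
    """Additive reward from §1.3, clipped to [-1, +1] for PPO stability."""
    r_pnl = (equity_now - equity_prev) / max(equity_prev, 1e-8)
    r_turn = -lam_t * dw_l1                            # turnover penalty
    r_dd = -lam_d * max(0.0, drawdown - dd_threshold)  # drawdown penalty
    r_conc = -lam_c * herfindahl                       # concentration penalty (PortfolioEnv)
    return float(np.clip(r_pnl + r_turn + r_dd + r_conc, -1.0, 1.0))
```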

### 1.4 Training pipeline

```
1. data prep      → OHLCV history N=250d + factors (RSI, MACD, momentum, etc.)
2. env init       → gym.make('SingleStockEnv-v0', symbol='AAPL', ...)
3. model create   → PPO('MlpPolicy', env, learning_rate=3e-4, n_steps=2048, ...)
4. train          → model.learn(total_timesteps=1_000_000, callback=TrainingMetricsCallback)
5. validate OOS   → run on out-of-sample window, compute Sharpe + drawdown + win rate
6. paper deploy   → sidecar /rl/inference endpoint (paper-mode default)
7. monitor        → 30-day baseline Sharpe + drawdown vs train metrics
8. live decision  → SUPERVISOR/ADMIN approval doc upload + audit log
```
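
Step 5's out-of-sample gate computes three standard metrics; a sketch assuming simple per-step returns of the paper-traded policy (the sidecar's exact formulas are an assumption):

```python
import numpy as np

def oos_metrics(returns: np.ndarray, periods_per_year: int = 252) -> dict:
    """Annualized Sharpe, max drawdown, and win rate over an OOS window."""
    sharpe = returns.mean() / (returns.std() + 1e-12) * np.sqrt(periods_per_year)
    equity = np.cumprod(1.0 + returns)        # equity curve from compounded returns
    peak = np.maximum.accumulate(equity)
    max_dd = float(((peak - equity) / peak).max())
    win_rate = float((returns > 0).mean())
    return {"sharpe": float(sharpe), "max_drawdown": max_dd, "win_rate": win_rate}
```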

### 1.5 Persona visibility (PROTOTYPE-PHASE)

- **ADVISOR** (ws+az): own RL signals + drill-down into own policy explainability
- **RELATIONSHIP_MANAGER**: same as ADVISOR
- **SUPERVISOR/ADMIN**: cross-tenant + training config + policy explainability + paper-mode→live toggle (requires 30-day baseline + MIFID/FINMA approval)
- **CLIENT/PROSPECT/RETAIL_CLIENT/AFFLUENT_CLIENT/UHNW_CLIENT/FAMILY_OFFICE_PRINCIPAL**: absolutely DENIED. RL trading endpoints are NOT accessible to client-facing personas, per MIFID II + Swiss FINMA compliance.
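
The visibility rules collapse to a deny-by-default allow-list. A minimal sketch (the real gate lives in the persona packs and NestJS guards; this helper is illustrative):

```python
RL_TRADING_ALLOWED_ROLES = frozenset({
    "ADVISOR", "RELATIONSHIP_MANAGER", "SUPERVISOR", "ADMIN",
})

def can_access_rl_trading(role: str) -> bool:
    """Deny by default: client-facing personas never reach RL endpoints (MIFID II / FINMA)."""
    return role in RL_TRADING_ALLOWED_ROLES
```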

### 1.6 Tier presets

| Tier | algorithm | timesteps | env type | use case |
|---|---|---|---|---|
| `rl-baseline` | PPO | 500_000 | SingleStockEnv | Baseline test/dev |
| `rl-premium` | PPO | 2_000_000 | PortfolioEnv | Default prod ws+az |
| `rl-uhnw` | SAC | 5_000_000 | PortfolioEnv | UHNW + crypto exploratory |

Env overrides (priority: env var > tier preset):
- `RL_TRADING_ALGORITHM`
- `RL_TRADING_TIMESTEPS`
- `RL_TRADING_ENV_TYPE`
- `RL_TRADING_LEARNING_RATE`
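
The precedence above can be sketched as follows. Preset values come from the table in §1.6; the `learning_rate` default of `3e-4` is taken from the training pipeline (§1.4 step 3), and the resolver itself is illustrative:

```python
import os
from typing import Mapping, Optional

TIER_PRESETS = {
    "rl-baseline": {"algorithm": "PPO", "timesteps": 500_000, "env_type": "SingleStockEnv"},
    "rl-premium":  {"algorithm": "PPO", "timesteps": 2_000_000, "env_type": "PortfolioEnv"},
    "rl-uhnw":     {"algorithm": "SAC", "timesteps": 5_000_000, "env_type": "PortfolioEnv"},
}

def resolve_config(tier: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Merge a tier preset with env-var overrides (env var wins)."""
    env = os.environ if env is None else env
    cfg = dict(TIER_PRESETS[tier])
    cfg["learning_rate"] = 3e-4               # assumed default, see §1.4 step 3
    if env.get("RL_TRADING_ALGORITHM"):
        cfg["algorithm"] = env["RL_TRADING_ALGORITHM"]
    if env.get("RL_TRADING_TIMESTEPS"):
        cfg["timesteps"] = int(env["RL_TRADING_TIMESTEPS"])
    if env.get("RL_TRADING_ENV_TYPE"):
        cfg["env_type"] = env["RL_TRADING_ENV_TYPE"]
    if env.get("RL_TRADING_LEARNING_RATE"):
        cfg["learning_rate"] = float(env["RL_TRADING_LEARNING_RATE"])
    return cfg
```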

---

## §2 · Pattern di codice

### 2.1 FinRL StockEnv example (Python sidecar `services/rl-trading-py/envs/single_stock_env.py`)

```python
from __future__ import annotations
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from dataclasses import dataclass

@dataclass
class SingleStockEnvConfig:
    symbol: str
    lookback: int = 60
    n_factors: int = 8
    initial_cash: float = 100_000.0
    fee_bps: float = 5.0          # 5 basis points trading fee
    ruin_threshold: float = 0.5   # equity < 50% initial → done
    lambda_turnover: float = 0.001
    lambda_drawdown: float = 0.01
    drawdown_threshold: float = 0.05

class SingleStockEnv(gym.Env):
    """FinRL-style single-stock env (gymnasium 0.29+ API).
    state = [price_history(lookback), factors(n_factors), position, cash, equity]
    action = continuous scalar ∈ [-1, +1]
    reward = ΔPnL - turnover_penalty - drawdown_penalty
    """
    metadata = {"render_modes": []}

    def __init__(self, prices: np.ndarray, factors: np.ndarray, cfg: SingleStockEnvConfig):
        super().__init__()
        self.prices = prices       # shape (T,)
        self.factors = factors     # shape (T, n_factors)
        self.cfg = cfg
        self.T = len(prices)
        assert self.T > cfg.lookback + 1
        # Observation: lookback prices + factors + position + cash + equity
        obs_dim = cfg.lookback + cfg.n_factors + 3
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(obs_dim,), dtype=np.float32)
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)
        self._reset_state()

    def _reset_state(self):
        self.t = self.cfg.lookback
        self.position = 0.0
        self.cash = self.cfg.initial_cash
        self.equity = self.cfg.initial_cash
        self.peak_equity = self.equity
        self.entry_price: float | None = None

    def reset(self, seed: int | None = None, options: dict | None = None):
        super().reset(seed=seed)
        self._reset_state()
        return self._observe(), {}

    def _observe(self) -> np.ndarray:
        price_window = self.prices[self.t - self.cfg.lookback : self.t]
        # Normalize prices (z-score against window)
        mu = price_window.mean()
        sd = price_window.std() + 1e-8
        price_norm = (price_window - mu) / sd
        factors_now = self.factors[self.t]
        return np.concatenate([
            price_norm.astype(np.float32),
            factors_now.astype(np.float32),
            np.array([self.position, self.cash / self.cfg.initial_cash, self.equity / self.cfg.initial_cash], dtype=np.float32),
        ])

    def step(self, action: np.ndarray):
        a = float(np.clip(action[0], -1.0, 1.0))
        price = float(self.prices[self.t])
        prev_equity = self.equity

        # Translate action ∈ [-1,+1] to target position fraction of equity
        target_pos_value = a * self.equity
        delta_value = target_pos_value - self.position * price
        # Apply fee
        fee = abs(delta_value) * (self.cfg.fee_bps / 1e4)
        self.cash -= delta_value + fee
        self.position += delta_value / price

        # Advance one step
        self.t += 1
        new_price = float(self.prices[self.t]) if self.t < self.T else price
        self.equity = self.cash + self.position * new_price
        self.peak_equity = max(self.peak_equity, self.equity)
        drawdown = max(0.0, (self.peak_equity - self.equity) / self.peak_equity)

        # Reward shaping
        r_pnl = (self.equity - prev_equity) / max(prev_equity, 1e-8)
        r_turn = -self.cfg.lambda_turnover * abs(delta_value / max(self.equity, 1e-8))
        r_dd = -self.cfg.lambda_drawdown * max(0.0, drawdown - self.cfg.drawdown_threshold)
        reward = float(np.clip(r_pnl + r_turn + r_dd, -1.0, 1.0))

        terminated = bool(self.equity < self.cfg.ruin_threshold * self.cfg.initial_cash)
        truncated = bool(self.t >= self.T - 1)

        info = {
            "equity": self.equity,
            "drawdown": drawdown,
            "position": self.position,
            "r_pnl": r_pnl,
            "r_turnover": r_turn,
            "r_drawdown": r_dd,
        }
        return self._observe(), reward, terminated, truncated, info
```

### 2.2 PPO training loop with stable-baselines3 (`services/rl-trading-py/training/train_ppo.py`)

```python
from __future__ import annotations
import os
from typing import Any
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.vec_env import DummyVecEnv
from .envs.single

…[truncated — open the MD file for the full text]