Five leading large language models — GPT-5, Claude Sonnet 4.6, Gemini 2.5, Grok-4, and DeepSeek V3 — each received $10,000 to trade live crypto markets through GT Protocol's AI Hedge Fund experiment in 2026. After 90 days of real positioning, the results show AI excels at execution discipline but struggles with raw market timing. The real value lives in process, not prediction.
What we mean by 'AI predicting crypto'
Large language models don't see candlestick charts the way a trader does. They reason over text descriptions of market state — price action, recent volume, news headlines, sentiment indicators — and produce structured outputs: open this position, close that one, adjust this leverage. Prediction in this sense is not “BTC will hit $X next week.” It is sequential decision-making under uncertainty.
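To make that concrete, here is a minimal sketch of what one such structured output might look like. The field names and values are illustrative only, not the experiment's actual schema:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class TradePlan:
    """Hypothetical schema: the model fills this in, a harness executes it."""
    action: Literal["open", "close", "adjust", "hold"]
    symbol: Literal["BTCUSDT", "ETHUSDT", "SOLUSDT"]
    side: Literal["long", "short"]
    size_usdt: float  # notional position size
    leverage: int     # a statement of conviction, not just exposure
    reasoning: str    # free-text justification, kept for the audit trail

plan = TradePlan(
    action="open", symbol="BTCUSDT", side="long",
    size_usdt=800.0, leverage=3,
    reasoning="Range support held on rising volume; exit if support breaks.",
)
```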
This matters because most public discussion conflates two things. One is forecasting price targets, which remains a coin-flip problem on any short horizon. The other is consistent positioning logic — when to enter, when to hold, when to cut. AI is meaningfully better at the second than the first.
Our AI Hedge Fund experiment tests this distinction directly. Each model trades the same instruments with the same starting capital. Differences in outcome reflect process quality, not luck on any single trade.
The setup: five LLMs, $10K each, real money
We gave five models identical conditions. Each received $10,000 USDT on Binance Futures, the same market data feed, and the same prompt structure: review current positions, market state, and your own past decisions, then output a trade plan with reasoning.
The models were GPT-5 (OpenAI), Claude Sonnet 4.6 (Anthropic), Gemini 2.5 (Google), Grok-4 (xAI), and DeepSeek V3. No human override. Each ran a decision tick every 30 minutes. Positions, P&L, and reasoning logs are public on the dashboard.
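As a sketch, the harness around each model might look like the loop below. The function and object names are placeholders for illustration, not GT Protocol's actual code:

```python
import time

TICK_SECONDS = 30 * 60  # one decision tick every 30 minutes

def run_agent(model, exchange, log):
    """Hypothetical harness loop: identical structure for all five models."""
    while True:
        state = {
            "positions": exchange.open_positions(),  # current position state
            "market": exchange.market_snapshot(),    # price, volume, headlines, sentiment
            "history": log.recent_decisions(n=10),   # the model's own past reasoning
        }
        plan = model.decide(state)   # returns a structured TradePlan
        exchange.execute(plan)       # no human override between decision and fill
        log.record(state, plan)      # feeds the public reasoning dashboard
        time.sleep(TICK_SECONDS)
```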
We chose USDT-margined futures because they let us measure conviction. A model that opens 5x leverage on BTC is making a stronger statement than one that buys spot ETH. We restricted instruments to liquid majors — BTC, ETH, SOL — so liquidity and slippage wouldn't confound the results.
The setup is deliberately boring. No proprietary signals, no fine-tuning, no fancy data. Just off-the-shelf reasoning models running against the live market on equal footing.
What the models did right
Three patterns showed up across all five models from week one.
They cut losses faster than retail traders. The median time to close a losing position was hours, not days. No model averaged down on a 5% loss. None held a 10% drawdown hoping for reversal. This single behavior accounts for most of the downside protection in the results.
They sized positions conservatively. The average position was a small fraction of available margin, and maximum leverage stayed well below what the API allowed even on conviction trades. No model blew up.
They produced auditable reasoning. We could trace why each trade was opened, what the model expected, and what conditions would change its mind. Compare this to a human trader's “felt right” — the audit trail alone makes AI trading easier to debug and improve over time.
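None of these behaviors requires model magic; all three can be codified as hard guardrails around whatever the model outputs. A minimal sketch, with illustrative thresholds rather than anything the models actually used:

```python
MAX_LOSS_PCT = 0.05        # illustrative threshold: cut any position down 5%
MAX_MARGIN_FRACTION = 0.2  # illustrative cap: never commit over 20% of margin

def apply_guardrails(position, account, plan, log):
    """Codified versions of the three observed behaviors."""
    # 1. Cut losers fast: close instead of averaging down or hoping for reversal.
    if position is not None and position.unrealized_pnl_pct <= -MAX_LOSS_PCT:
        plan.action = "close"
        plan.reasoning += " | guardrail: loss limit hit, closing"
    # 2. Size conservatively: cap notional at a fraction of available margin.
    plan.size_usdt = min(plan.size_usdt, account.available_margin * MAX_MARGIN_FRACTION)
    # 3. Keep the audit trail: every decision stays traceable after the fact.
    log.record(plan)
    return plan
```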
What the models did wrong
Two failure modes were consistent across all five.
They missed major regime shifts. When BTC broke from a range into a strong trend, models stayed in mean-reversion mode too long. Their default bias was to fade extremes, and a trending market is the worst environment for that bias. The lag was measured in days, not minutes.
They over-corrected on news. A single Federal Reserve mention or exchange outage triggered position closes that, in hindsight, were premature. Models read text inputs literally; they don't have intuition for whether a headline is already priced in.
The spread between best and worst single-model performance was material. The same market produced very different outcomes depending on which model traded it. Variance across models matters more than the average. You can track live numbers on the AI Hedge Fund dashboard.
Can a multi-model committee beat any single AI?
We tested this with a five-model committee voting on each trade. A position opens only if at least three models agree on direction and size class. Over the same 90 days, the committee outperformed most individual models but not the best one.
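The voting rule itself fits in a few lines. The 3-of-5 quorum on direction and size class comes from the experiment; the attribute names below are illustrative:

```python
from collections import Counter

def committee_decision(plans, quorum=3):
    """Open only if at least `quorum` models agree on direction and size class.

    `plans` holds one TradePlan-like object per model; `size_class` is an
    assumed bucketing (small / medium / large), not exact notional.
    """
    votes = Counter((p.side, p.size_class) for p in plans if p.action == "open")
    if not votes:
        return None  # nobody wants to trade: stay flat this tick
    (side, size_class), count = votes.most_common(1)[0]
    if count >= quorum:
        return side, size_class  # trade the agreed direction and size class
    return None  # no quorum: the dissenters veto the trade
```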
The interesting finding wasn't absolute return. It was drawdown. Maximum committee drawdown was roughly half the best single model's drawdown. Voting filtered out the worst trades from each model without dampening the best ones much.
This matches academic ensemble research. Combining models reduces variance more than it reduces bias. For risk-sensitive capital, that's the right tradeoff.
Committees aren't free. Latency triples because you wait on the slowest of five model calls, and API cost is 5x. But on a 30-minute decision tick, execution speed barely matters; for serious capital, the risk reduction does.
What this means for your trading
If you're running bots, AI is a useful layer for two specific jobs: deciding when to cut a losing position, and sizing a conviction trade conservatively. It is not yet a layer for picking entries from a chart.
The discipline piece is where most retail traders lose money. AI does it better than the vast majority of human traders, including experienced ones, simply because it doesn't get tired and doesn't get attached to positions.
For entries and overall strategy, classical bot logic still wins — DCA, grid, trend-following with clear rules. Combining them is what the GT App does: rule-based strategies for entries and exits, AI for risk management and dynamic position sizing.
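Conceptually, the composition looks like the sketch below. The interfaces are hypothetical stand-ins, not GT App internals:

```python
def hybrid_tick(market, account, strategy, ai_risk, exchange):
    """Rule-based logic proposes entries; the AI layer manages risk on them."""
    # Entries: deterministic, backtestable rules (DCA, grid, trend-following).
    entry = strategy.propose_entry(market)
    if entry is not None:
        # Sizing: the AI layer can shrink an entry, never enlarge it.
        entry.size_usdt = min(entry.size_usdt, ai_risk.max_size(market, account))
        exchange.execute(entry)
    # Exits: the AI layer decides when an open position should be cut.
    for position in exchange.open_positions():
        if ai_risk.should_cut(position, market):
            exchange.close(position)
```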
If you want to test this yourself, GT App offers paper trading with the same AI risk-management layer used in our public experiments. Open GT App.
Frequently asked questions
Did any model lose money?
Yes, the worst-performing models finished the 90-day period in negative territory. Losses came from late regime detection rather than blow-ups. Even the worst model never lost more than a single-digit percentage of its capital, because position sizing stayed conservative throughout.
Can I just use ChatGPT or Claude to give me trading advice?
Not safely. The models tested ran inside a closed loop with explicit position state, market data feeds, and pre-defined output schemas. Free-form chat without that structure produces hallucinated trades and unbounded risk.
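The difference is structural. A closed loop validates every model output against a schema before anything reaches the exchange. A sketch, reusing the illustrative TradePlan shape from earlier:

```python
import json

def parse_trade_plan(raw_output: str):
    """Reject any model output that doesn't fit the expected schema."""
    try:
        plan = TradePlan(**json.loads(raw_output))  # unexpected keys raise TypeError
    except (json.JSONDecodeError, TypeError):
        return None  # malformed output: no trade this tick
    if plan.action == "open" and (plan.leverage > 5 or plan.size_usdt <= 0):
        return None  # out-of-bounds values are dropped, not "interpreted"
    return plan
```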
Is this real money?
Yes. Each model trades $10,000 USDT on live Binance Futures. The dashboard with positions and P&L is public.
Why no Llama, Mistral, or Qwen?
Cost-quality trade-off as of early 2026. The five chosen represent the top reasoning models with stable APIs at the experiment start. The mix will rotate as new models release and prove stable in production conditions.
Does the AI Hedge Fund accept investors?
Not currently. The fund is a transparent research experiment, not a product. The same AI risk-management layer is available inside GT App for individual traders to use with their own capital.
What is the simplest takeaway?
Use AI for discipline (when to cut, how much to risk). Use rule-based bot logic for entries. Combining the two beats either approach alone, which is why GT App runs them in parallel by default.