Output Stability
Summary of findings on Output Stability from 1 source and 4 claims.
Highlighted claims
- Model consistency varied by topic rather than being uniform across domains. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
- GPT-5.2 showed marked instability in female urology. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
- Some models achieved fully reproducible results in specific domains, shown by zero standard deviation. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
- Results in anatomy were unstable for GPT-5.2 and Grok 4.1. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
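
The claims above describe stability as the spread of a model's scores across repeated runs within each topic, with zero standard deviation meaning every run gave the same result. A minimal sketch of that calculation follows. The function name, domain labels, and scores are hypothetical placeholders, not data from the cited study, and the use of the sample standard deviation is an assumption, since the summary does not say which estimator was used.

```python
from statistics import mean, stdev


def stability_by_domain(scores_by_domain):
    """Map each domain to (mean score, sample standard deviation) across runs.

    Hypothetical helper: each value is a list of one model's scores from
    repeated runs on the same domain (at least two runs per domain).
    """
    return {
        domain: (mean(scores), stdev(scores))
        for domain, scores in scores_by_domain.items()
    }


# Placeholder scores for illustration only; NOT taken from the study.
runs = {
    "domain_a": [80.0, 80.0, 80.0],  # identical on every run
    "domain_b": [60.0, 75.0, 90.0],  # varies from run to run
}

summary = stability_by_domain(runs)

# A domain counts as fully reproducible here when its standard deviation is zero.
reproducible = [domain for domain, (_, sd) in summary.items() if sd == 0]

for domain, (avg, sd) in summary.items():
    print(f"{domain}: mean={avg:.1f}, sd={sd:.1f}")
print("fully reproducible:", reproducible)
```

Comparing the standard deviation per domain, rather than one pooled figure per model, is what allows a finding such as consistency varying by topic to show up at all.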