Output Stability

Cross-source consensus on Output Stability from 1 source and 4 claims.

1 source · 4 claims

Evidence quality

Model consistency varied by topic rather than being uniform across domains. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
GPT-5.2 showed marked instability in female urology. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
Some models achieved fully reproducible results in specific domains, as indicated by a standard deviation of zero. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
Anatomy was unstable for GPT-5.2 and Grok 4.1. — Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination
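
The claims above describe stability in terms of standard deviation, with a value of zero taken to mean fully reproducible results. As a rough illustration of that measure, the sketch below computes a per-domain standard deviation over repeated runs. The domain names, the scores, and the helper names `stability` and `is_reproducible` are hypothetical and are not taken from the study. The choice of population rather than sample standard deviation is also an assumption; either one is zero when every run gives the same score.

```python
from statistics import pstdev

# Hypothetical percent-correct scores over three repeated runs.
# These values are illustrative only and do not come from the study.
runs_by_domain = {
    "domain_a": [80.0, 80.0, 80.0],  # identical on every run
    "domain_b": [60.0, 75.0, 90.0],  # varies between runs
}


def stability(scores):
    """Population standard deviation of scores from repeated runs."""
    return pstdev(scores)


def is_reproducible(scores):
    """True when every run produced the same score (zero spread)."""
    return stability(scores) == 0.0


# Per-domain spread, rounded for display.
report = {domain: round(stability(s), 2) for domain, s in runs_by_domain.items()}

for domain, spread in report.items():
    label = "reproducible" if spread == 0.0 else "unstable"
    print(f"{domain}: SD = {spread} ({label})")
```

Under these made-up inputs, a domain with identical scores on every run has a standard deviation of zero, while any run-to-run variation gives a positive value. That is the sense in which zero standard deviation signals full reproducibility.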