Subspecialty Performance
Cross-source consensus on Subspecialty Performance, drawn from 1 source and 4 claims.
Highlighted claims
All four claims come from a single source: "Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination".

- Aggregate model accuracy hid important subspecialty-level weaknesses and instability.
- A high total score did not guarantee reliability across all urology exam domains.
- Paediatric urology and testicular cancer were the weakest collective areas.
- The strongest collective domains were research methodology, andrology, urinary tract infections, and stone disease.