Subspecialty Performance
Cross-source consensus on Subspecialty Performance, drawn from 1 source and 4 claims.
Highlighted claims
All four claims come from a single source: "Performance of large language models (GPT-5.2, Gemini 3 Pro, Claude Sonnet 4.6 and Grok 4.1) on the Fellowship of The Royal College of Surgeons Urology Part A examination".

- Aggregate model accuracy hid important subspecialty-level weaknesses and instability.
- A high total score did not guarantee reliability across all urology exam domains.
- Paediatric urology and testicular cancer were the weakest collective areas.
- The strongest collective domains were research methodology, andrology, urinary tract infections, and stone disease.