Development and Validation of a Deep Learning Algorithm for Gleason Grading of Prostate Cancer from Biopsy Specimens

JAMA Network Open – Oncology – July 23, 2020

Department of Pathology and Laboratory Medicine, University of Tennessee Health Science Center, Memphis; Department of Pathology, Laboratory Medicine and Pathology, University Health Network and University of Toronto, Toronto, Ontario, Canada; Department of Pathology, El Camino Hospital, Mountain View, California; Tufts Medical Center, Boston, Massachusetts; Pathology and Laboratory Medicine Service, North Florida/South Georgia Veterans Health System, Gainesville, Florida; Department of Pathology, Yale School of Medicine, New Haven, Connecticut.

Kunal Nagpal, MS; Davis Foote, BS; Fraser Tan, PhD; Yun Liu, PhD; Po-Hsuan Cameron Chen, PhD.

Importance

For prostate cancer, Gleason grading of the biopsy specimen plays a pivotal role in determining case management. However, Gleason grading is associated with substantial interobserver variability, resulting in a need for decision support tools to improve the reproducibility of Gleason grading in routine clinical practice.

Objectives

To evaluate the ability of a deep learning system (DLS) to grade diagnostic prostate biopsy specimens.

Design, Setting and Participants

The DLS was evaluated using 752 deidentified digitized images of formalin-fixed paraffin-embedded prostate needle core biopsy specimens obtained from 3 institutions in the United States, including 1 institution not used for DLS development. To obtain the Gleason grade group (GG), each specimen was first reviewed by 2 expert urologic subspecialists from a multi-institutional panel of 6 individuals (years of experience: mean, 25 years; range, 18-34 years). A third subspecialist reviewed discordant cases to arrive at a majority opinion. To reduce diagnostic uncertainty, all subspecialists had access to an immunohistochemical-stained section and 3 histologic sections for every biopsied specimen. Their review was conducted from December 2018 to June 2019.
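The adjudication rule described above (two primary reviews, with a third review for discordant cases) can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual procedure: the function name `majority_grade` and the label strings are hypothetical, and returning `None` when all three reviewers disagree is an assumption, since the abstract does not say how such cases were handled.

```python
from collections import Counter
from typing import Optional


def majority_grade(first: str, second: str, third: Optional[str] = None) -> Optional[str]:
    """Return a reference grade from two primary reviews and an optional third review.

    Hypothetical helper for illustration only.
    """
    # Concordant primary reviews need no third reviewer.
    if first == second:
        return first
    # Discordant primary reviews cannot be resolved without a third opinion.
    if third is None:
        return None
    # With three opinions, accept any grade chosen by at least two reviewers.
    grade, count = Counter([first, second, third]).most_common(1)[0]
    # Assumption: a three-way split has no majority, so no grade is returned.
    return grade if count >= 2 else None


if __name__ == "__main__":
    print(majority_grade("GG2", "GG2"))         # concordant pair
    print(majority_grade("GG1", "GG2", "GG2"))  # third reviewer breaks the tie
    print(majority_grade("GG1", "GG2", "GG3"))  # no majority
```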

Main Outcomes and Measures

The frequency of the exact agreement of the DLS with the majority opinion of the subspecialists in categorizing each tumor-containing specimen as 1 of 5 categories: nontumor, GG1, GG2, GG3, or GG4-5. For comparison, the rate of agreement of 19 general pathologists’ opinions with the subspecialists’ majority opinions was also evaluated.
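To make the outcome measure concrete, the following sketch computes an exact agreement rate over the 5 categories named above. The function name `exact_agreement_rate` and the example labels are hypothetical and chosen only for illustration; they are not data from the study.

```python
from typing import Sequence

# The 5 categories named in the outcome definition.
CATEGORIES = ("nontumor", "GG1", "GG2", "GG3", "GG4-5")


def exact_agreement_rate(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of specimens where the predicted category equals the reference category."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one specimen is required")
    for label in list(predicted) + list(reference):
        if label not in CATEGORIES:
            raise ValueError(f"unknown category: {label!r}")
    # Exact agreement: the categories must match, with no partial credit
    # for being one grade group away.
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


if __name__ == "__main__":
    # Hypothetical labels, not study data.
    grader = ["GG1", "GG2", "GG3", "GG4-5"]
    majority = ["GG1", "GG3", "GG3", "GG4-5"]
    print(exact_agreement_rate(grader, majority))
```

Because the measure is exact agreement, a grader who is off by one grade group scores the same as one who is off by three; a weighted statistic would be needed to capture that difference.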

Results

For grading tumor-containing biopsy specimens in the validation set (n = 498), the rate of agreement with subspecialists was significantly higher for the DLS (71.7%; 95% CI, 67.9%-75.3%) than for general pathologists (58.0%; 95% CI, 54.5%-61.4%) (P < .001). In subanalyses of biopsy specimens from an external validation set (n = 322), the Gleason grading performance of the DLS remained similar. For distinguishing nontumor from tumor-containing biopsy specimens (n = 752), the rate of agreement with subspecialists was 94.3% (95% CI, 92.4%-95.9%) for the DLS and similar at 94.7% (95% CI, 92.8%-96.3%) for general pathologists (P = .58).

Conclusions

In this study, the DLS showed higher proficiency than general pathologists at Gleason grading prostate needle core biopsy specimens and generalized to an independent institution. Future research is necessary to evaluate the potential utility of using the DLS as a decision support tool in clinical workflows and to improve the quality of prostate cancer grading for therapy decisions.

Relevance to Healthcare Field

One use of artificial intelligence in healthcare is to provide clinical decision support to physicians in the hope of improving overall care for patients. Prostate cancer is the most common cancer diagnosis and the third leading cause of cancer death in American men. Treatment depends on the pathologic evaluation of the prostate biopsy. This article evaluates a deep learning system (DLS) that uses images to assign the Gleason grade group of biopsy specimens without suffering from interobserver variability. The study demonstrated that the DLS results were as accurate as the reports of subspecialists (pathologists with urologic training) and more accurate than the reports of general pathologists. Artificial intelligence has also proved to be time-efficient, because patients will not have to wait for a consultation with another subspecialist if the biopsy is inconclusive. Therefore, artificial intelligence (the DLS in this case) can play a prominent role in medicine: it not only provides extra insight for more accurate and faster grading and diagnosis but also reduces the chance of harming the quality of life of men with low-risk prostate disease.

Michael Abramoff

Article of the Month – July 2021