KEYWORDS: Calibration, Tumors, Image segmentation, Algorithm development, Machine learning, Detection and tracking algorithms, Cancer detection, Breast cancer
Machine learning (ML) based whole slide imaging biomarkers have great potential to improve the efficiency and consistency of biomarker quantification, thereby facilitating the development of prognosis models for personalized medicine. Assessment methods in this area, however, are still underdeveloped. Using the public TiGER (Tumor InfiltratinG lymphocytes in breast cancER) challenge data, we developed a deep neural network-based algorithm for automated tumor-infiltrating lymphocytes (TILs) scoring from whole slide images (WSIs) of biopsies and surgical resections of human epidermal growth factor receptor-2 positive (HER2+) and triple-negative breast cancer (TNBC) patients. The purpose of this study is to assess our model’s performance on a new independent dataset. Seven pathologists independently assessed 320 pre-selected regions of interest (ROIs) across 32 WSIs for TILs scoring. Our results show that there is substantial variability among pathologists in scoring TILs density. We also observed a systematic discrepancy between the ML-based TILs scoring and the pathologists’ manual scoring, which led us to develop a calibration between the two. The calibration reduced the discrepancy, increasing the intraclass correlation coefficient (ICC) from 0.35 (95% CI [-0.062, 0.625]) for uncalibrated scores to 0.67 (95% CI [0.6, 0.736]) after calibration.
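The calibration step described above can be illustrated with a minimal sketch. The abstract does not specify the calibration's functional form, so a simple least-squares linear map from ML scores to mean pathologist scores is assumed here, evaluated on synthetic data with a deliberately biased ML score:

```python
import numpy as np

def calibrate_scores(ml_scores, pathologist_means):
    """Fit a least-squares linear map from ML TILs scores to mean
    pathologist scores. A stand-in for the paper's calibration, whose
    exact form is not given in the abstract."""
    slope, intercept = np.polyfit(ml_scores, pathologist_means, deg=1)
    return slope * ml_scores + intercept

# Synthetic illustration: ML scores with a systematic scale/offset bias.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 100, size=50)           # "true" TILs density per ROI
pathologist = truth + rng.normal(0, 5, 50)     # noisy manual scores
ml = 0.6 * truth + 20 + rng.normal(0, 5, 50)   # systematically biased ML scores

before = np.mean((ml - pathologist) ** 2)
after = np.mean((calibrate_scores(ml, pathologist) - pathologist) ** 2)
```

Because the bias is affine, the fitted line absorbs it and `after` falls well below `before`; the remaining discrepancy is the irreducible noise in both score sources.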
Studies have shown that the increased presence of tumor-infiltrating lymphocytes (TILs) is associated with better long-term clinical outcomes and survival, which makes TILs a potentially useful quantitative biomarker. In clinics, pathologists’ visual assessment of TILs in biopsies and surgical resections results in a quantitative score (TILs-score). The Tumor InfiltratinG lymphocytes in breast cancER (TiGER) challenge is the first public challenge on automated TILs-scoring algorithms using whole slide images of hematoxylin and eosin-stained (H&E) slides of human epidermal growth factor receptor-2 positive (HER2+) and triple-negative breast cancer (TNBC) patients. We participated in the TiGER challenge and developed algorithms for tumor-stroma segmentation, TILs cell detection, and TILs-scoring. The whole slide images in this challenge come from three sources, each with apparent color variations. We hypothesized that color-normalization may improve the cross-source generalizability of our deep learning models. Here, we expand our initial work by implementing a color-normalization technique and investigating its effect on the performance of our segmentation model. We compare the segmentation performance before and after color-normalization by cross-validating the models on the three datasets. Our results show a substantial increase in the performance of the segmentation model after color-normalization when trained and tested on different sources. This might potentially improve the model’s generalizability and robustness when applied to the external sequestered test set from the TiGER challenge.
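The abstract does not name the color-normalization technique used. As a hedged illustration of the general idea, the sketch below matches per-channel mean and standard deviation of an image to a reference tile; this is a simplified RGB-space stand-in for Reinhard-style stain normalization (which operates in Lab space) and is not necessarily the authors' method:

```python
import numpy as np

def normalize_to_reference(image, reference):
    """Shift and scale each RGB channel of `image` so its mean and
    standard deviation match `reference`. A simplified stand-in for
    stain/color normalization of H&E tiles."""
    img = image.astype(np.float64)
    ref = reference.astype(np.float64)
    out = np.empty_like(img)
    for c in range(3):
        mu_i, sd_i = img[..., c].mean(), img[..., c].std()
        mu_r, sd_r = ref[..., c].mean(), ref[..., c].std()
        out[..., c] = (img[..., c] - mu_i) / (sd_i + 1e-8) * sd_r + mu_r
    return np.clip(out, 0, 255).astype(np.uint8)
```

Applied to tiles from the three sources before training, a transform of this kind reduces the cross-source color shift that the segmentation model would otherwise have to learn away.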
Purpose: We describe registration accuracy studies of a custom hardware-software system called eeDAP that registers fields of view (FOVs) of a glass slide on a microscope to the digital presentations of regions of interest (ROIs) of whole slide images (WSI) of the same glass slide. In this manuscript, we describe the results of adding new hardware and use them to size a larger pathologist data-collection study.
Methods: We create a registration accuracy task by identifying a visually distinct target. This target is the center of a WSI ROI and is expected to appear in the center of the microscope FOV. We examine the registration accuracy of 60 ROIs from six slides, alternating registration methods and slide order within each study. We measure the distance between the target and the FOV center (the registration error) using an eyepiece reticle ruler as the stage moves from target to target. We summarize each error as a success (≤ 5.0 µm) or failure (> 5.0 µm). We completed a multi-reader multi-case (MRMC) analysis of the registration successes and failures to estimate the variance components due to the readers and the cases.
Results: When using eeDAP in focus, registration accuracy was within 5 µm in more than 97% of the FOVs.
Conclusions: The eeDAP registration methods were robust to new hardware, and the MRMC analysis has provided variance components for sizing future registration accuracy studies to account for the variability from readers and cases.
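The success/failure scoring that feeds the MRMC analysis above is simple to express. The sketch below applies the 5.0 µm tolerance to a hypothetical readers-by-ROIs matrix of registration errors (illustrative values only, not data from the study):

```python
import numpy as np

THRESHOLD_UM = 5.0  # errors at or below this tolerance count as successes

def success_matrix(errors_um):
    """Convert a readers-by-ROIs matrix of registration errors (µm)
    into the binary success/failure scores used in an MRMC analysis."""
    return np.asarray(errors_um) <= THRESHOLD_UM

# Hypothetical errors for 2 readers x 5 ROIs (illustrative values only).
errors = np.array([[1.2, 3.4, 6.1, 0.8, 4.9],
                   [2.0, 5.0, 2.5, 7.2, 1.1]])
s = success_matrix(errors)
per_reader_rate = s.mean(axis=1)  # fraction of in-tolerance FOVs per reader
overall_rate = s.mean()
```

An MRMC variance-components model would then decompose the variability of these binary scores into reader and case contributions, which is what allows future registration studies to be sized.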
Purpose: Validation of artificial intelligence (AI) algorithms in digital pathology with a reference standard is necessary before widespread clinical use, but few examples focus on creating a reference standard based on pathologist annotations. This work assesses the results of a pilot study that collects density estimates of stromal tumor-infiltrating lymphocytes (sTILs) in breast cancer biopsy specimens. This work will inform the creation of a validation dataset for the evaluation of AI algorithms fit for a regulatory purpose.
Approach: Collaborators and crowdsourced pathologists contributed glass slides, digital images, and annotations. Here, “annotations” refer to any marks, segmentations, measurements, or labels a pathologist adds to a report, image, region of interest (ROI), or biological feature. Pathologists estimated sTILs density in 640 ROIs from hematoxylin and eosin-stained slides of 64 patients via two modalities: an optical light microscope and two digital image viewing platforms.
Results: The pilot study generated 7373 sTILs density estimates from 29 pathologists. Analysis of annotations found that the variability of density estimates per ROI increases with the mean; the root mean square differences were 4.46, 14.25, and 26.25 as the mean density ranged from 0% to 10%, 11% to 40%, and 41% to 100%, respectively. The pilot study informs three areas of improvement for future work: technical workflows, annotation platforms, and agreement analysis methods. Upgrades to the workflows and platforms will improve operability and increase annotation speed and consistency.
Conclusions: Exploratory data analysis demonstrates the need to develop new statistical approaches for agreement. The pilot study dataset and analysis methods are publicly available to allow community feedback. The development and results of the validation dataset will be publicly available to serve as an instructive tool that can be replicated by developers and researchers.
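The mean-dependent variability reported in the Results can be reproduced in outline with a short sketch. The exact agreement statistic and binning procedure are not fully specified in the abstract, so the code below makes one plausible choice: per-ROI RMS deviation from the ROI mean, averaged within the three mean-density bins:

```python
import numpy as np

def binned_variability(estimates_per_roi, bins=((0, 10), (11, 40), (41, 100))):
    """For each ROI (a list of pathologist density estimates, in %),
    compute the mean and the RMS deviation from that mean, then average
    the RMS deviations within each mean-density bin. An illustrative
    reconstruction; the paper's statistic may differ in detail."""
    means = np.array([np.mean(e) for e in estimates_per_roi])
    rms = np.array([np.sqrt(np.mean((np.asarray(e) - m) ** 2))
                    for e, m in zip(estimates_per_roi, means)])
    out = []
    for lo, hi in bins:
        mask = (means >= lo) & (means <= hi)
        out.append(float(rms[mask].mean()) if mask.any() else float("nan"))
    return out
```

Run on data with the pattern the pilot study observed (tight agreement at low density, wide spread at high density), the per-bin variability grows with the bin's mean, mirroring the 4.46 / 14.25 / 26.25 progression.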