Machine learning models to predict the progression from early to late stages of papillary renal cell carcinoma

Abstract

Papillary renal cell carcinoma (PRCC) is a heterogeneous disease with variations in disease progression and clinical outcomes. The advent of next-generation sequencing (NGS) techniques has generated patient data that can be analysed to develop predictive models. In this study, we adopted a machine learning approach to identify biomarkers and build classifiers that discriminate between early and late stages of PRCC from gene expression profiles. We developed a machine learning pipeline incorporating different feature selection algorithms and classification models to analyse an RNA sequencing (RNA-Seq) dataset. To obtain a reliable feature set, we extracted features from different partitions of the training dataset and aggregated them into feature sets for classification. We evaluated the performance of the different algorithms using 10-fold cross-validation and an independent test dataset. 10-fold cross-validation was also performed on a microarray dataset of PRCC. Random forest based feature selection (varSelRF) yielded the smallest number of features (104) and the best performance, with an area under the precision-recall curve (PR-AUC) of 0.804, a Matthews correlation coefficient (MCC) of 0.711 and an accuracy of 88% with a shrunken centroid classifier on the test dataset. We identified 80 genes that the different feature selection algorithms consistently found to be altered between stages. The extracted features are related to the cellular components centromere, kinetochore and spindle, and to the biological process of the mitotic cell cycle. These observations reveal potential mechanisms for an increase in chromosome instability in the late stage of PRCC. Our study demonstrates that gene expression profiles can be used to classify stages of PRCC.
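
As an illustration only, two steps mentioned in the abstract can be sketched in plain Python: aggregating features selected on different training partitions, and computing the MCC from confusion-matrix counts. The abstract does not state how the partition-wise features were aggregated, so the "selected in at least k partitions" rule below is an assumption, and the gene names, function names and counts are hypothetical placeholders rather than data from the study.

```python
from collections import Counter
from math import sqrt


def aggregate_features(selections, min_partitions):
    """Keep features selected in at least `min_partitions` partitions.

    Assumed aggregation rule; the study's actual rule is not given here.
    """
    # Count each feature once per partition, even if listed twice.
    counts = Counter(gene for selected in selections for gene in set(selected))
    return sorted(gene for gene, n in counts.items() if n >= min_partitions)


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical gene lists selected on three partitions of a training set.
selections = [
    ["GENE_A", "GENE_B", "GENE_C"],
    ["GENE_A", "GENE_C", "GENE_D"],
    ["GENE_A", "GENE_B", "GENE_E"],
]
stable_features = aggregate_features(selections, min_partitions=2)
print(stable_features)

# Hypothetical confusion matrix for a late-stage vs early-stage classifier.
score = mcc_from_counts(tp=40, tn=45, fp=5, fn=10)
print(round(score, 3))
```

MCC uses all four cells of the confusion matrix, so it tends to be more informative than accuracy when the early-stage and late-stage classes are imbalanced.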

Publication
Computers in Biology and Medicine