
Variable selection in balance regression with applications to microbiome compositional data


Jing Ma, Paizhe Xie, Kristyn Pantoja, David E. Jones
[stat.ME]

Compositional data, where only relative abundances are available, are common in microbiome and other high-throughput sequencing studies. Log ratios between groups of variables serve as key biomarkers in these settings. However, selecting predictive log ratios is a combinatorial challenge, and existing greedy search-based methods are computationally expensive, limiting their applicability to high-dimensional data. We introduce the supervised log ratio (SLR) method, a novel and efficient approach for selecting predictive log ratios in high-dimensional settings. SLR first screens active variables using univariate regression on log ratio transformed data and then applies principal balance analysis to define balance biomarkers. Our approach leverages both the relationship between the response and predictors and the correlations among the predictors to improve accuracy in variable selection and prediction. Through simulations and two case studies – one on inflammatory bowel disease (IBD) and another on colorectal cancer (CRC) – we demonstrate that SLR outperforms existing methods, particularly in high-dimensional settings. SLR is implemented in an R package, publicly available on GitHub.
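The two-stage idea described in the abstract (screen variables, then form a balance) can be illustrated with a minimal sketch. This is not the authors' R implementation, and every function name, the `threshold` parameter, and the grouping rule are assumptions made for illustration. Screening ranks centered log-ratio (CLR) coordinates by absolute Pearson correlation with the response, which gives the same ordering as a univariate regression t-statistic. The grouping step is a crude sign-based split that stands in for principal balance analysis, which is not implemented here.

```python
import numpy as np


def clr(X):
    """Centered log-ratio transform; rows are samples, entries strictly positive."""
    log_x = np.log(X)
    return log_x - log_x.mean(axis=1, keepdims=True)


def screen(X, y, threshold):
    """Return indices of variables whose CLR coordinate has |Pearson r| with y
    above `threshold` (a hypothetical tuning parameter), plus all r values."""
    z = clr(X)
    zc = z - z.mean(axis=0)
    yc = y - y.mean()
    r = zc.T @ yc / np.sqrt((zc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.flatnonzero(np.abs(r) > threshold), r


def balance(X, num, den):
    """Balance between two groups of parts: the normalized log ratio of the
    geometric mean of `num` to the geometric mean of `den`."""
    log_x = np.log(X)
    k, m = len(num), len(den)
    scale = np.sqrt(k * m / (k + m))
    return scale * (log_x[:, num].mean(axis=1) - log_x[:, den].mean(axis=1))


def slr_sketch(X, y, threshold):
    """Screen variables, then split the survivors by the sign of their
    correlation with y. ASSUMPTION: the sign split is a simple stand-in for
    principal balance analysis, not the method from the paper."""
    active, r = screen(X, y, threshold)
    num = active[r[active] > 0]
    den = active[r[active] < 0]
    if len(num) == 0 or len(den) == 0:
        raise ValueError("screening left an empty numerator or denominator group")
    return num, den, balance(X, num, den)
```

Given a strictly positive composition matrix `X` (zeros would need to be replaced before taking logs) and a continuous response `y`, `slr_sketch(X, y, threshold)` returns the numerator indices, the denominator indices, and the per-sample balance values, which could then be used as a single predictor in a downstream regression.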
