Variable Selection Via Penalized Regression and the Genetic Algorithm Using Information Complexity, with Applications for High-dimensional -omics Data

Title Variable Selection Via Penalized Regression and the Genetic Algorithm Using Information Complexity, with Applications for High-dimensional -omics Data PDF eBook
Author Tyler J. Massaro
Publisher
Pages 360
Release 2016
Genre Algorithms
ISBN

This dissertation is a collection of examples, algorithms, and techniques for researchers interested in selecting influential variables from statistical regression models. Chapters 1, 2, and 3 provide background information that will be used throughout the remaining chapters, on topics including but not limited to information complexity, model selection, covariance estimation, stepwise variable selection, penalized regression, and especially the genetic algorithm (GA) approach to variable subsetting. In chapter 4, we fully develop the framework for performing GA subset selection in logistic regression models. We present advantages of this approach over stepwise and elastic net regularized regression in selecting variables from a classical set of ICU data. We further compare these results to an entirely new procedure for variable selection developed explicitly for this dissertation, called the post hoc adjustment of measured effects (PHAME). In chapter 5, we reproduce many of the same results from chapter 4 for the first time in a multinomial logistic regression setting. The utility and convenience of the PHAME procedure are demonstrated on a set of cancer genomic data. Chapter 6 marks a departure from supervised learning problems as we shift our focus to unsupervised problems involving mixture distributions of count data from epidemiologic fields. We start off by reintroducing Minimum Hellinger Distance estimation alongside model selection techniques as a worthy alternative to the EM algorithm for generating mixtures of Poisson distributions. We also create for the first time a GA that derives mixtures of negative binomial distributions. The work from chapter 6 is incorporated into chapters 7 and 8, where we conclude the dissertation with a novel analysis of mixtures of count data regression models.
We provide algorithms based on single and multi-target genetic algorithms which solve the mixture of penalized count data regression models problem, and we demonstrate the usefulness of this technique on HIV count data that were used in a previous study published by Gray, Massaro et al. (2015) as well as on time-to-event data taken from the cancer genomic data sets from earlier.
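The GA approach to variable subsetting described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes an ordinary linear model scored by AIC rather than logistic regression scored by information complexity, and the function names (`aic_linear`, `ga_select`), the population size, and the mutation rate are all hypothetical choices made for the example.

```python
import numpy as np

def aic_linear(X, y, mask):
    """AIC of a least squares fit on the columns flagged in `mask` (assumed criterion)."""
    n = len(y)
    # An intercept is always included; the mask picks the candidate predictors.
    design = np.column_stack([np.ones(n), X[:, mask]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k

def ga_select(X, y, pop_size=40, generations=60, mutation_rate=0.05, seed=0):
    """Search over binary inclusion masks with a simple elitist genetic algorithm."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pop = rng.random((pop_size, p)) < 0.5  # each row is one candidate subset
    for _ in range(generations):
        scores = np.array([aic_linear(X, y, ind) for ind in pop])
        elite = pop[np.argsort(scores)][: pop_size // 2]  # lower AIC is better
        n_children = pop_size - len(elite)
        parents_a = elite[rng.integers(0, len(elite), size=n_children)]
        parents_b = elite[rng.integers(0, len(elite), size=n_children)]
        # Uniform crossover: each bit is copied from one of the two parents.
        take_a = rng.random((n_children, p)) < 0.5
        children = np.where(take_a, parents_a, parents_b)
        # Mutation: flip each bit with a small probability.
        children = children ^ (rng.random((n_children, p)) < mutation_rate)
        pop = np.vstack([elite, children])
    scores = np.array([aic_linear(X, y, ind) for ind in pop])
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])
```

Swapping `aic_linear` for another criterion only requires a function with the same signature, which is why the scoring step is kept separate from the search loop.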

Variable Selection for High-dimensional Data with Error Control

Title Variable Selection for High-dimensional Data with Error Control PDF eBook
Author Han Fu (Ph. D. in biostatistics)
Publisher
Pages 0
Release 2022
Genre Statistics
ISBN

Many high-throughput genomic applications involve a large set of covariates, and it is crucial to discover which variables are truly associated with the response. It is often desirable for researchers to select variables that are indeed true and reproducible in follow-up studies. Effectively controlling the false discovery rate (FDR) increases the reproducibility of the discoveries and has been a major challenge in variable selection research, especially for high-dimensional data. Existing error control approaches include augmentation approaches which utilize artificial variables as benchmarks for decision making, such as model-X knockoffs. We introduce another augmentation-based selection framework extended from a Bayesian screening approach called reference distribution variable selection. Ordinal responses, which were not previously considered in this area, were used to compare different variable selection approaches. We constructed various importance measures that fit into the selection frameworks, using either L1 penalized regression or machine learning techniques, and compared these measures in terms of the FDR and power using simulated data. Moreover, we applied these selection methods to high-throughput methylation data for identifying features associated with the progression from normal liver tissue to hepatocellular carcinoma to further compare and contrast their performances. Having established the effectiveness of FDR control for model-X knockoffs, we turned our attention to another important data type - survival data with long-term survivors. Medical breakthroughs in recent years have led to cures for many diseases, resulting in increased observations of long-term survivors. The mixture cure model (MCM) is a type of survival model that is often used when a cured fraction exists. Unfortunately, few variable selection methods currently exist for MCMs when there are more predictors than samples.
To fill the gap, we developed penalized MCMs for high-dimensional datasets which allow for identification of prognostic factors associated with both cure status and/or survival. Both parametric models and semi-parametric proportional hazards models were considered for modeling the survival component. For penalized parametric MCMs, we demonstrated how the estimation proceeded using two different iterative algorithms, the generalized monotone incremental forward stagewise (GMIFS) and Expectation-Maximization (E-M). For semi-parametric MCMs where multiple types of penalty functions were considered, the coordinate descent algorithm was combined with E-M for optimization. The model-X knockoffs method was combined with these algorithms to allow for FDR control in variable selection. Through extensive simulation studies, our penalized MCMs have been shown to outperform alternative methods on multiple metrics and achieve high statistical power with FDR being controlled. In two acute myeloid leukemia (AML) applications with gene expression data, our proposed approaches identified important genes associated with potential cure or time-to-relapse, which may help inform treatment decisions for AML patients.
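The selection step of the knockoff filter mentioned above can be illustrated with a short sketch. It assumes that a vector `W` of feature statistics has already been computed, where a large positive value means the original variable looks more important than its knockoff copy; how `W` is obtained (for example from a penalized MCM) is outside this example. The function names are hypothetical, and `offset=1` corresponds to the more conservative "knockoff+" variant.

```python
import numpy as np

def knockoff_threshold(W, q=0.1, offset=1):
    """Smallest t whose estimated false discovery proportion is at most q."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes of the statistics.
    for t in np.sort(np.abs(W[W != 0])):
        # Negative statistics beyond -t estimate the false positives beyond t.
        fdp_hat = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return float(t)
    return float("inf")  # no threshold meets the target, so nothing is selected

def knockoff_select(W, q=0.1, offset=1):
    """Indices of the variables whose statistic reaches the threshold."""
    W = np.asarray(W, dtype=float)
    return np.flatnonzero(W >= knockoff_threshold(W, q, offset))
```

When too few statistics are large and positive, the threshold is infinite and the selected set is empty, which is the expected behaviour of this rule rather than a failure.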

Penalized Variable Selection for Gene-environment Interactions

Title Penalized Variable Selection for Gene-environment Interactions PDF eBook
Author Yinhao Du
Publisher
Pages 0
Release 2021
Genre
ISBN

Gene-environment (GxE) interaction is critical for understanding the genetic basis of complex disease beyond genetic and environment main effects. In addition to existing tools for interaction studies, penalized variable selection emerges as a promising alternative for dissecting GxE interactions. Despite this success, variable selection is limited in the following aspects. First, multidimensional measurements have not been fully taken into account in interaction studies. Published variable selection methods cannot accommodate structured sparsity in the framework of integrating multi-omics data for disease outcomes. Second, in the big data context, no variable selection method has been developed so far to conduct tailored interaction analysis. Third, the solution to case-control association GxE studies with high-dimensional genomic variants in the big data context has not been made available so far. In this dissertation, we tackle these challenges arising from GxE interaction studies in the modern era through the following projects. In the first project, we have developed a novel variable selection method to integrate multi-omics measurements in GxE interaction studies. Extensive studies have already revealed that analyzing omics data across multiple platforms is not only biologically sensible but also results in improved identification and prediction performance. Our integrative model can efficiently pinpoint important regulators of gene expression through sparse dimensionality reduction and link the disease outcomes to multiple effects in the integrative GxE studies by accommodating a sparse bi-level structure. Simulation studies show the integrative model leads to better identification of GxE interactions and regulators than the alternative methods. In two GxE lung cancer studies with high-dimensional multi-omics data, the integrative model leads to improved prediction and findings with important biological implications.
In the second project, we propose to conduct interaction studies in the big data context by adopting the divide-and-conquer strategy. In particular, the sparse group variable selection for important GxE effects has been developed within the framework of the alternating direction method of multipliers (ADMM). To accommodate large-scale data in terms of either samples or features, we have developed two novel parallel ADMM-based variable selection methods across samples and features, respectively. The corresponding parallel algorithms can be efficiently implemented on distributed computing platforms. Simulation studies demonstrate that the parallel ADMM-based penalization methods significantly improve the computational speed for analyzing large-scale data from GxE interaction studies with satisfactory identification and prediction performance. In the third project, we extend the proposed parallel ADMM-based variable selection for GxE interactions to the case-control association study of type 2 diabetes. Within the parallel computation framework, we have developed a penalized logistic regression model accommodating the bi-level selection tailored for the case-control GxE interaction study. The advantage of the proposed parallel penalization method has been fully illustrated in the distributed learning scenario. Simulation studies show the proposed method dramatically reduces the computational time while maintaining a competitive performance compared to the non-parallel counterparts. In the case study of type 2 diabetes with environmental factors and high-dimensional SNP measurements, the proposed parallel penalization method leads to the identification of biologically important interaction effects.
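To make the ADMM framework concrete, here is a minimal single-machine sketch for the plain lasso. It is a deliberately simplified stand-in: the dissertation's methods use sparse group and bi-level penalties and split the work across samples or features, none of which is reproduced here. The function names and the defaults for `rho` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def admm_lasso(A, b, lam, rho=1.0, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||z||_1 subject to x = z."""
    p = A.shape[1]
    x, z, u = np.zeros(p), np.zeros(p), np.zeros(p)
    # The same matrix is solved against in every x-update, so factor it once.
    chol = np.linalg.cholesky(A.T @ A + rho * np.eye(p))
    Atb = A.T @ b
    for _ in range(n_iter):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(chol.T, np.linalg.solve(chol, rhs))  # ridge-like step
        z = soft_threshold(x + u, lam / rho)                     # sparsity step
        u = u + x - z                                            # scaled dual update
    return z  # z carries the exact zeros produced by the thresholding
```

The split into a smooth `x` step and a separable `z` step is what makes the method amenable to parallelization: in a distributed variant, each worker would hold its own block of samples or features for the `x` step.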

Variable Selection Via Penalized Likelihood

Title Variable Selection Via Penalized Likelihood PDF eBook
Author
Publisher
Pages 121
Release 2014
Genre
ISBN

Variable selection via penalized likelihood plays an important role in high-dimensional statistical modeling and has attracted great attention in the recent literature. This thesis is devoted to the study of the variable selection problem. It consists of three major parts, all of which fall within the penalized least squares regression framework. In the first part of this thesis, we propose a family of nonconvex penalties named the K-Smallest Items (KSI) penalty for variable selection, which is able to improve the performance of variable selection and reduce estimation bias in the estimates of the important coefficients. We fully investigate the theoretical properties of the KSI method and show that it possesses the weak oracle property and the oracle property in the high-dimensional setting where the number of coefficients is allowed to be much larger than the sample size. To demonstrate its numerical performance, we applied the KSI method to several simulation examples as well as the well-known Boston housing dataset. We also extend the idea of the KSI method to handle the group variable selection problem. In the second part of this thesis, we propose another nonconvex penalty named the Self-adaptive penalty (SAP) for variable selection. It is distinguished from other existing methods in the sense that the penalization on each individual coefficient takes into account directly the influence of other estimated coefficients. We also thoroughly study the theoretical properties of the SAP method and show that it possesses the weak oracle property under desirable conditions. The proposed method is applied to the glioblastoma cancer data obtained from The Cancer Genome Atlas. In many scientific and engineering applications, covariates are naturally grouped. When the group structures are available among covariates, people are usually interested in identifying both important groups and important variables within the selected groups.
In statistics, this is a group variable selection problem. In the third part of this thesis, we propose a novel Log-Exp-Sum (LES) penalty for group variable selection. The LES penalty is strictly convex. It can identify important groups as well as select important variables within the group. We develop an efficient group-level coordinate descent algorithm to fit the model. We also derive non-asymptotic error bounds and asymptotic group selection consistency for our method in the high-dimensional setting. Numerical results demonstrate the good performance of our method in both variable selection and prediction. We applied the proposed method to an American Cancer Society breast cancer survivor dataset. The findings are clinically meaningful and may help design intervention programs to improve the quality of life for breast cancer survivors.

Genetic Algorithms and Genetic Programming

Title Genetic Algorithms and Genetic Programming PDF eBook
Author Michael Affenzeller
Publisher CRC Press
Pages 395
Release 2009-04-09
Genre Computers
ISBN 1420011324

Genetic Algorithms and Genetic Programming: Modern Concepts and Practical Applications discusses algorithmic developments in the context of genetic algorithms (GAs) and genetic programming (GP). It applies the algorithms to significant combinatorial optimization problems and describes structure identification using HeuristicLab as a platform for al

Genetic Algorithms as Tool for Statistical Analysis of High-Dimensional Data Structures

Title Genetic Algorithms as Tool for Statistical Analysis of High-Dimensional Data Structures PDF eBook
Author Rüdiger Krause
Publisher
Pages 0
Release 2004
Genre
ISBN 9783832506612

In regression the objective is to determine an appropriate function which reflects reality as accurately as possible but also eliminates irregularities from data noise and is therefore easy to interpret. A popular and flexible approach for estimating the true underlying function is the additive model. One possible approach for fitting additive models is the expansion in B-splines, which allows direct calculation of the estimators. If the number of B-splines is too large, the estimated functions become wiggly and tend to be very close to the observed data. To avoid this problem of overfitting we use a penalization approach characterized by smoothing parameters. In this thesis we propose the use of genetic algorithms for smoothing parameter optimization. Genetic algorithms are rarely applied in the field of statistics and refer to the principle that better adapted individuals win against their competitors under equal conditions. Apart from smoothing parameter optimization, the user often faces datasets containing large numbers of relevant and irrelevant explanatory variables. Appropriate variable selection approaches make it possible to reduce the number of variables to subsets of relevant variables. We propose to consider the problems of variable selection and choice of smoothing parameters simultaneously by using genetic algorithms. Our approach is based on an appropriate combination of the genetic algorithms for smoothing parameter optimization and variable selection.

Penalized Regression Methods for Interaction and Mixed-effects Models with Applications to Genomic and Brain Imaging Data

Title Penalized Regression Methods for Interaction and Mixed-effects Models with Applications to Genomic and Brain Imaging Data PDF eBook
Author Sahir Bhatnagar
Publisher
Pages
Release 2019
Genre
ISBN

"In high-dimensional (HD) data, where the number of covariates (p) greatly exceeds the number of observations (n), estimation can benefit from the bet-on-sparsity principle, i.e., only a small number of predictors are relevant in the response. This assumption can lead to more interpretable models, improved predictive accuracy, and algorithms that are computationally efficient. In genomic and brain imaging studies, where the sample sizes are particularly small due to high data collection costs, we must often assume a sparse model because there isn't enough information to estimate p parameters. For these reasons, penalized regression methods such as the lasso and group-lasso have generated substantial interest since they can set model coefficients exactly to zero. In the penalized regression framework, many approaches have been developed for main effects. However, there is a need for developing interaction and mixed-effects models. Indeed, accurate capture of interactions may hold the potential to better understand biological phenomena and improve prediction accuracy since they may reflect important modulation of a biological system by an external factor. Furthermore, penalized mixed-effects models that account for correlations due to groupings of observations can improve sensitivity and specificity. This thesis is composed primarily of three manuscripts. In the first manuscript, we propose a method called sail for detecting non-linear interactions that automatically enforces the strong heredity property using both the l1 and l2 penalty functions. We describe a blockwise coordinate descent procedure for solving the objective function and provide performance metrics on both simulated and real data. The second manuscript develops a general penalized mixed effects model framework to account for correlations in genetic data due to relatedness called ggmix.
Our method can accommodate several sparsity-inducing penalties such as the lasso, elastic net and group lasso and also readily handles prior annotation information in the form of weights. Our algorithm has theoretical guarantees of convergence and we again assess its performance in both simulated and real data. The third manuscript describes a novel strategy called eclust for dimension reduction that leverages the effects of an exposure variable with broad impact on HD measures. With eclust, we found improved prediction and variable selection performance compared to methods that do not consider the exposure in the clustering step, or to methods that use the original data as features. We further illustrate this modeling framework through the analysis of three data sets from very different fields, each with HD data, a binary exposure, and a phenotype of interest. We provide efficient implementations of all our algorithms in freely available and open source software."