Penalized Methods and Algorithms for High-dimensional Regression in the Presence of Heterogeneity

Author: Congrui Yi
Pages: 98
Release: 2016
Genre: Algorithms

In fields such as statistics, economics, and biology, heterogeneity is an important topic concerning the validity of data inference and the discovery of hidden patterns. This thesis focuses on penalized methods for regression analysis in the presence of heterogeneity in a potentially high-dimensional setting. Two possible strategies for dealing with heterogeneity are robust regression methods, which provide heterogeneity-resistant coefficient estimation, and direct detection of heterogeneity while simultaneously estimating coefficients accurately. We consider the first strategy for two robust regression methods, Huber loss regression and quantile regression with Lasso or Elastic-Net penalties, which have been studied theoretically but lack efficient algorithms. We propose a new algorithm, Semismooth Newton Coordinate Descent (SNCD), to solve them. The algorithm is a novel combination of the semismooth Newton algorithm and coordinate descent that applies to penalized optimization problems with both a nonsmooth loss and a nonsmooth penalty. We prove its convergence properties and show its computational efficiency through numerical studies. We also propose a nonconvex penalized regression method, Heterogeneity Discovery Regression (HDR), as a realization of the second idea. We establish theoretical results that guarantee statistical precision for any local optimum of the objective function with high probability. We also compare the numerical performance of HDR with competitors including Huber loss regression, quantile regression, and least squares through simulation studies and a real data example. In these experiments, the HDR methods detect heterogeneity accurately and largely outperform the competitors in terms of coefficient estimation and variable selection.
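
For a concrete feel for this class of problem, below is a minimal sketch of coordinate descent for the lasso-penalized Huber loss. It is not the SNCD algorithm from the thesis: it only exploits the fact that the Huber loss has curvature at most one, so each coordinate step can minimize a quadratic majorizer followed by soft-thresholding. All function names and default values here are illustrative assumptions.

```python
import numpy as np

def huber_grad(r, delta):
    """Derivative of the Huber loss: identity inside [-delta, delta], clipped outside."""
    return np.clip(r, -delta, delta)

def soft_threshold(z, t):
    """Soft-thresholding operator, the proximal map of the l1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_lasso_cd(X, y, lam, delta=1.345, n_iter=200):
    """Lasso-penalized Huber regression via plain coordinate descent.

    Because the Huber loss has second derivative at most 1, each coordinate
    step minimizes a quadratic majorizer of the loss, so the penalized
    objective is non-increasing.  Columns of X are assumed standardized.
    """
    n, p = X.shape
    beta = np.zeros(p)
    r = y - X @ beta               # residuals
    h = (X ** 2).mean(axis=0)      # per-coordinate curvature bound
    for _ in range(n_iter):
        for j in range(p):
            g = -(X[:, j] @ huber_grad(r, delta)) / n      # partial gradient
            b_new = soft_threshold(beta[j] - g / h[j], lam / h[j])
            r += X[:, j] * (beta[j] - b_new)               # keep residuals in sync
            beta[j] = b_new
    return beta
```

The thesis's SNCD instead applies a semismooth Newton step to handle the nonsmooth loss directly; the majorization device above trades that sophistication for a few lines of transparent code.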

Robust Penalized Regression for Complex High-dimensional Data

Author: Bin Luo
Pages: 169
Release: 2020
Genre: Dimensional analysis

"Robust high-dimensional data analysis has become an important and challenging task in complex Big Data analysis due to the high-dimensionality and data contamination. One of the most popular procedures is the robust penalized regression. In this dissertation, we address three typical robust ultra-high dimensional regression problems via penalized regression approaches. The first problem is related to the linear model with the existence of outliers, dealing with the outlier detection, variable selection and parameter estimation simultaneously. The second problem is related to robust high-dimensional mean regression with irregular settings such as the data contamination, data asymmetry and heteroscedasticity. The third problem is related to robust bi-level variable selection for the linear regression model with grouping structures in covariates. In Chapter 1, we introduce the background and challenges by overviews of penalized least squares methods and robust regression techniques. In Chapter 2, we propose a novel approach in a penalized weighted least squares framework to perform simultaneous variable selection and outlier detection. We provide a unified link between the proposed framework and a robust M-estimation in general settings. We also establish the non-asymptotic oracle inequalities for the joint estimation of both the regression coefficients and weight vectors. In Chapter 3, we establish a framework of robust estimators in high-dimensional regression models using Penalized Robust Approximated quadratic M estimation (PRAM). This framework allows general settings such as random errors lack of symmetry and homogeneity, or covariates are not sub-Gaussian. Theoretically, we show that, in the ultra-high dimension setting, the PRAM estimator has local estimation consistency at the minimax rate enjoyed by the LS-Lasso and owns the local oracle property, under certain mild conditions. In Chapter 4, we extend the study in Chapter 3 to robust high-dimensional data analysis with structured sparsity. In particular, we propose a framework of high-dimensional M-estimators for bi-level variable selection. This framework encourages bi-level sparsity through a computationally efficient two-stage procedure. It produces strong robust parameter estimators if some nonconvex redescending loss functions are applied. In theory, we provide sufficient conditions under which our proposed two-stage penalized M-estimator possesses simultaneous local estimation consistency and the bi-level variable selection consistency, if a certain nonconvex penalty function is used at the group level. The performances of the proposed estimators are demonstrated in both simulation studies and real examples. In Chapter 5, we provide some discussions and future work."--Abstract from author supplied metadata

An Algorithmic Framework for High Dimensional Regression with Dependent Variables

Author: Hoyt Koepke
Pages: 156
Release: 2013
Genre: Mathematical optimization

We present an exploration of the rich theoretical connections between several classes of regularized models, network flows, and recent results in submodular function theory. This work unifies key aspects of these problems under a common theory, leading to novel methods for working with several important models of interest in statistics, machine learning and computer vision. Most notably, we describe the full regularization path of a class of penalized regression problems with dependent variables that includes variants of the fused LASSO and total variation constrained models. We begin by reviewing the concepts of network flows and submodular function optimization theory foundational to our results. We then examine the connections between network flows and the minimum-norm algorithm from submodular optimization, extending and improving several current results. This theory leads to a new representation of the structure of a large class of pairwise regularized models important in machine learning, statistics and computer vision. Finally, by applying an arbitrarily accurate approximation, our approach allows us to efficiently optimize total variation penalized models on continuous functions. Ultimately, our new algorithms scale up easily to high-dimensional problems with millions of variables.
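
The thesis computes exact solutions (and full regularization paths) through network flows and submodular minimization, which are beyond a short excerpt. As a simple runnable baseline for the same objective, here is a generic ADMM sketch of the 1-D total-variation denoiser (the fused-lasso signal approximator); the step size, iteration count, and toy data are illustrative assumptions.

```python
import numpy as np

def tv_denoise_admm(y, lam, rho=1.0, n_iter=500):
    """1-D total-variation denoising:
        minimize 0.5*||y - b||^2 + lam * sum_i |b[i+1] - b[i]|
    solved with a generic ADMM splitting z = D b.  This is only an
    iterative baseline, not the exact network-flow machinery of the thesis.
    """
    n = len(y)
    D = np.diff(np.eye(n), axis=0)       # (n-1) x n first-difference matrix
    A_inv = np.linalg.inv(np.eye(n) + rho * D.T @ D)   # fixed b-update system
    z = np.zeros(n - 1)
    u = np.zeros(n - 1)
    for _ in range(n_iter):
        b = A_inv @ (y + rho * D.T @ (z - u))          # quadratic b-step
        Db = D @ b
        z = np.sign(Db + u) * np.maximum(np.abs(Db + u) - lam / rho, 0)  # shrink
        u += Db - z                                    # dual update
    return b

# Toy usage: recover a piecewise-constant signal from noisy samples.
rng = np.random.default_rng(0)
signal = np.repeat([0.0, 2.0, -1.0], 50)
fit = tv_denoise_admm(signal + 0.3 * rng.standard_normal(signal.size), lam=5.0)
```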

Penalized Regression Methods for Interaction and Mixed-effects Models with Applications to Genomic and Brain Imaging Data

Author: Sahir Bhatnagar
Release: 2019

"In high-dimensional (HD) data, where the number of covariates (??) greatly exceeds the number of observations (??), estimation can benefit from the bet-on-sparsity principle, i.e., only a small number of predictors are relevant in the response. This assumption can lead to more interpretable models, improved predictive accuracy, and algorithms that are computationally efficient. In genomic and brain imaging studies, where the sample sizes are particularly small due to high data collection costs, we must often assume a sparse model because there isn't enough information to estimate ?? parameters. For these reasons, penalized regression methods such as the lasso and group-lasso have generated substantial interest since they can set model coefficients exactly to zero. In the penalized regression framework, many approaches have been developed for main effects. However, there is a need for developing interaction and mixed-effects models. Indeed, accurate capture of interactions may hold the potential to better understand biological phenomena and improve prediction accuracy since they may reflect important modulation of a biological system by an external factor. Furthermore, penalized mixed-effects models that account for correlations due to groupings of observations can improve sensitivity and specificity. This thesis is composed primarily of three manuscripts. In the first manuscript, we propose a method called sail for detecting non-linear interactions that automatically enforces the strong heredity property using both the l1 and l2 penalty functions. We describe a blockwise coordinate descent procedure for solving the objective function and provide performance metrics on both simulated and real data. The second manuscript develops a general penalized mixed effects model framework to account for correlations in genetic data due to relatedness called ggmix. Our method can accommodate several sparsity-inducing penalties such as the lasso, elastic net and group lasso and also readily handles prior annotation information in the form of weights. Our algorithm has theoretical guarantees of convergence and we again assess its performance in both simulated and real data. The third manuscript describes a novel strategy called eclust for dimension reduction that leverages the effects of an exposure variable with broad impact on HD measures. With eclust, we found improved prediction and variable selection performance compared to methods that do not consider the exposure in the clustering step, or to methods that use the original data as features. We further illustrate this modeling framework through the analysis of three data sets from very different fields, each with HD data, a binary exposure, and a phenotype of interest. We provide efficient implementations of all our algorithms in freely available and open source software." --

High-dimensional Regression Models with Structured Coefficients

Author: Yuan Li
Release: 2018

Regression models are very common in statistical inference, especially linear regression models with Gaussian noise. But in many modern scientific applications with large-scale datasets, the number of samples is small relative to the number of model parameters, the so-called high-dimensional setting. Directly applying classical linear regression models to high-dimensional data is ill-posed, so it is necessary to impose additional assumptions on the regression coefficients to make high-dimensional statistical analysis possible. Regularization methods with sparsity assumptions have received substantial attention over the past two decades, but some questions regarding high-dimensional statistical analysis remain open. First, most of the literature provides statistical analysis for high-dimensional linear models with Gaussian noise, and it is unclear whether similar results still hold outside the Gaussian setting. To answer this question in the Poisson setting, we study minimax rates and provide an implementable convex algorithm for high-dimensional Poisson inverse problems under a weak sparsity assumption and physical constraints. Second, much of the theory and methodology for high-dimensional linear regression models is based on the assumption that the independent variables are independent of each other or only weakly correlated. This assumption may fail when some features are highly correlated with each other, and it is natural to ask whether high-dimensional statistical inference is still possible with highly correlated designs. We therefore provide a graph-based regularization method for high-dimensional regression models with highly correlated designs, along with theoretical guarantees.
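
The abstract does not spell out the graph-based regularizer, so the following is only one common realization: a Laplacian quadratic penalty that pulls the coefficients of connected (highly correlated) features toward each other, solvable in closed form. The function names and the choice of penalty are assumptions for illustration, not the thesis's method.

```python
import numpy as np

def graph_laplacian(W):
    """Laplacian L = D - W of a symmetric nonnegative adjacency matrix W."""
    return np.diag(W.sum(axis=1)) - W

def laplacian_ridge(X, y, L, lam=1.0):
    """Graph-regularized regression:
        minimize ||y - X b||^2 + lam * b^T L b.
    The penalty b^T L b equals the sum over edges (i, j) of
    w_ij * (b_i - b_j)^2, so connected features are encouraged to
    share similar coefficients.  Solved via the normal equations.
    """
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```

Unlike the lasso, this quadratic penalty does not set coefficients exactly to zero; methods like the one in the thesis typically combine a graph term with a sparsity-inducing penalty.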

Penalized Methods for High-dimensional Least Absolute Deviations Regression

Author: Xiaoli Gao
Pages: 236
Release: 2008
Genre: Least absolute deviations (Statistics)

Statistical Foundations of Data Science

Author: Jianqing Fan
Publisher: CRC Press
Pages: 942
Release: 2020-09-21
Genre: Mathematics
ISBN: 0429527616

Statistical Foundations of Data Science gives a thorough introduction to commonly used statistical models and contemporary statistical machine learning techniques and algorithms, along with their mathematical insights and statistical theories. It aims to serve as a graduate-level textbook and a research monograph on high-dimensional statistics, sparsity and covariance learning, machine learning, and statistical inference. It includes ample exercises that involve both theoretical studies and empirical applications. The book begins with an introduction to the stylized features of big data and their impacts on statistical analysis. It then introduces multiple linear regression and expands the techniques of model building via nonparametric regression and kernel tricks. It provides a comprehensive account of sparsity exploration and model selection for multiple regression, generalized linear models, quantile regression, robust regression, and hazards regression, among others. High-dimensional inference is also thoroughly addressed, as is feature screening. The book further provides a comprehensive account of high-dimensional covariance estimation, learning latent factors and hidden structures, and their applications to statistical estimation, inference, prediction, and machine learning problems. Finally, it gives a thorough introduction to statistical machine learning theory and methods for classification, clustering, and prediction, including CART, random forests, boosting, support vector machines, clustering algorithms, sparse PCA, and deep learning.