
Dec 18 Monday

Schedule

7.00: Breakfast

8.45: Opening

9:00: John Duchi Stanford University Privacy - Definitions, Procedures, Open Problems

I will provide a broad overview of differential privacy, which provides guarantees that a data analysis protects the privacy of data contributors. The main focus will be on the private computation and release of various statistics, both classical (low-dimensional) and high-dimensional. In addition to giving a high-level program for the development of optimal private estimators, I will likely discuss a few open questions as well.
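As a reference point for the definitions discussed in this talk (the standard textbook formulation, not taken from the abstract): a randomized algorithm \(M\) is \(\varepsilon\)-differentially private if, for every pair of datasets \(D, D'\) differing in a single individual's record and every measurable set of outputs \(S\),

\( \mathbb{P}\big(M(D) \in S\big) \le e^{\varepsilon}\, \mathbb{P}\big(M(D') \in S\big). \)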

Back to day schedule


10.00: Coffee break

Session: Robust statistics

Chair: Pierre Alquier, ESSEC Business School

10:20: Stanislav Minsker University of Southern California (USC) Improved performance guarantees for robust mean estimation

The median-of-means (MOM) estimator and its variants have become a go-to method for problems involving heavy-tailed and outlier-contaminated data. Examples include robust versions of mean and covariance estimators, linear regression, and k-means clustering, among others. Achieving the best possible performance for the MOM estimator in the simplest univariate case has positive implications for many of these problems. In the first part of the talk, we will demonstrate how to obtain efficient versions of the MOM estimator that satisfy deviation inequalities with sharp constants, requiring only minimal assumptions on the underlying distribution. Moreover, we will discuss the interplay between this question and the theory of U-statistics.
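As a baseline illustration of the construction discussed in this part of the talk (a minimal sketch of the standard univariate MOM estimator, not the speaker's sharpened version), the sample is split into blocks, each block is averaged, and the median of the block means is returned:

```python
import numpy as np

def median_of_means(x, n_blocks, seed=0):
    """Univariate median-of-means: split the sample into blocks,
    average each block, and return the median of the block means."""
    rng = np.random.default_rng(seed)
    x = rng.permutation(np.asarray(x, dtype=float))   # random block assignment
    blocks = np.array_split(x, n_blocks)              # (nearly) equal-sized blocks
    return np.median([b.mean() for b in blocks])

# Example: heavy-tailed sample with two gross outliers
rng = np.random.default_rng(1)
sample = np.concatenate([rng.standard_t(df=2, size=1000), [1e4, -1e4]])
print(median_of_means(sample, n_blocks=20))           # close to 0, unlike the raw mean
```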

The second part of the talk will be devoted to the multivariate version of the MOM estimator based on the geometric median. We will demonstrate that, for large classes of heavy-tailed distributions, the geometric MOM estimator attains sub-exponential deviation guarantees, improving the known bounds in many cases. A new analysis of this estimator reveals interesting connections with small ball probabilities and with questions about the negative moments of norms. This part is based on joint work with Nate Strawn.
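For the multivariate version, a plausible illustrative sketch (again, not the analysis from the talk) combines block means through the geometric median, computed here by a simple Weiszfeld iteration:

```python
import numpy as np

def geometric_median(points, n_iter=200, tol=1e-8):
    """Geometric median via Weiszfeld's iteration (simple variant with
    no special handling when an iterate coincides with a data point)."""
    y = points.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(points - y, axis=1), tol)  # avoid /0
        w = 1.0 / d
        y_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            break
        y = y_new
    return y

def geometric_mom(X, n_blocks):
    """Geometric median-of-means for multivariate mean estimation:
    take the geometric median of the block means."""
    blocks = np.array_split(X, n_blocks)
    block_means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(block_means)
```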

Back to day schedule

11:00: Arshak Minasyan CentraleSupélec - Paris-Saclay University Statistically optimal robust mean and covariance estimation for anisotropic Gaussians

Assume that \(X_{1}, \ldots, X_{N} \) is an \(\varepsilon\)-contaminated sample of \(N\) independent Gaussian vectors in \(\mathbb{R}^d\) with mean \(\mu\) and covariance \(\Sigma\). In the strong \(\varepsilon\)-contamination model, we assume that an adversary has replaced an \(\varepsilon\) fraction of the vectors in the original Gaussian sample with arbitrary other vectors. We show that there is an estimator \(\widehat \mu\) of the mean satisfying, with probability at least \(1 - \delta\), a bound of the form

\(|\widehat{\mu} - \mu|_2 \le c\left(\sqrt{\frac{\mathrm{Tr}(\Sigma)}{N}} + \sqrt{\frac{|\Sigma|\log(1/\delta)}{N}} + \varepsilon\sqrt{|\Sigma|}\right), \)

where \(c > 0\) is an absolute constant, and \(|\Sigma|\) denotes the operator norm of \(\Sigma\). In the same contaminated Gaussian setup, we construct an estimator \(\widehat \Sigma\) of the covariance matrix \(\Sigma\) that satisfies, with probability at least \(1 - \delta\),

\( \left|\widehat{\Sigma} - \Sigma\right| \le c\left(\sqrt{\frac{|\Sigma|\,\mathrm{Tr}(\Sigma)}{N}} + |\Sigma|\sqrt{\frac{\log(1/\delta)}{N}} + \varepsilon|\Sigma|\right). \)

Both results are optimal up to multiplicative constant factors. Despite the significant recent interest in robust statistics, achieving both of these dimension-free bounds in the canonical Gaussian case had remained open. In fact, several previously known results were either dimension-dependent and required \(\Sigma\) to be close to the identity, or had a sub-optimal dependence on the contamination level \(\varepsilon\).

As a part of the analysis, we derive sharp concentration inequalities for central order statistics of Gaussian, folded normal, and chi-squared distributions.

This is joint work with N. Zhivotovskiy.
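As a toy illustration of the strong \(\varepsilon\)-contamination model considered in this talk (a simulation sketch only; the optimal estimators from the talk are not reproduced here), one can contaminate an anisotropic Gaussian sample and compare the empirical mean with a simple robust baseline such as the coordinatewise median:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, eps = 2000, 50, 0.05
Sigma_sqrt = np.diag(np.linspace(0.1, 2.0, d))   # anisotropic covariance (square root)
X = rng.standard_normal((N, d)) @ Sigma_sqrt     # Gaussian sample, true mean = 0

X[: int(eps * N)] = 100.0                        # adversary replaces an eps-fraction

print(np.linalg.norm(X.mean(axis=0)))            # blown up by the contamination
print(np.linalg.norm(np.median(X, axis=0)))      # much closer to the true mean
```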

Back to day schedule

11:40: Chao Gao University of Chicago Robust and minimax estimation in a two-groups model

The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a series of highly original articles, Efron showed in several examples that the ensemble of null-distributed test statistics grossly deviated from the theoretical null distribution, and he persuasively illustrated the danger of assuming the theoretical null’s veracity for downstream inference. Though intimidating in other contexts, the large-scale setting is to the statistician’s benefit here: there is now the potential to estimate, rather than assume, the null distribution.

In a model of n z-scores with at most k nonnulls, we adopt Efron’s suggestion and consider estimation of location and scale parameters for a Gaussian null distribution. Placing no assumptions on the nonnull effects, we consider rate-optimal estimation in the entire regime k < n/2, that is, precisely the regime in which the null parameters are identifiable. The minimax upper bound is obtained by considering estimators based on the empirical characteristic function and the classical kernel mode estimator. Faster rates than those in Huber’s contamination model are achievable by exploiting the Gaussian character of the data. As a consequence, it is shown that consistent estimation is indeed possible in the practically relevant regime k ≍ n. In a certain regime, the minimax lower bound involves constructing two marginal distributions whose characteristic functions match on a wide interval containing zero. The construction notably differs from those in the literature by sharply capturing a second-order scaling of n/2 − k in the minimax rate.
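One ingredient mentioned above, the classical kernel mode estimator for the null location, can be sketched as follows (an illustrative sketch assuming a Gaussian kernel density estimate with default bandwidth and a grid search; not the tuned procedure from the paper):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kernel_mode(z, grid_size=2000):
    """Estimate the mode of the z-score distribution: fit a Gaussian KDE
    and return the grid point with the highest estimated density."""
    kde = gaussian_kde(z)
    grid = np.linspace(z.min(), z.max(), grid_size)
    return grid[np.argmax(kde(grid))]

# Example: mostly null z-scores centered at 0.3 plus a few nonnull effects
rng = np.random.default_rng(0)
z = np.concatenate([0.3 + rng.standard_normal(5000), rng.uniform(3, 6, 300)])
print(kernel_mode(z))   # close to the null location 0.3
```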

Back to day schedule


12.30: Lunch


Session: Fairness

Chair: Philippe Rigollet, MIT

16:00: Christophe Giraud Paris-Saclay University Improving fairness in online learning

Online learning and recommendation systems are ubiquitous in decision making, including in sensitive settings such as job hiring, judicial decisions, and college admissions. Blindly applying standard ML techniques can lead to unfair and discriminatory decisions. We will discuss some approaches for improving fairness in online learning and recommendation systems.

https://arxiv.org/abs/2305.15807

https://arxiv.org/abs/2106.12242

https://arxiv.org/abs/2203.09784

Back to day schedule

16:40: Jean-Michel Loubes Université Toulouse Paul Sabatier Counterfactual fairness using optimal transport

A supervised machine learning algorithm determines a model from a learning sample that will be used to predict new observations. To this end, it aggregates individual characteristics of the observations in the learning sample. But this information aggregation may be biased. This situation has raised concerns around the so-called fairness of machine learning algorithms, especially towards disadvantaged groups.

We provide a better understanding of both global and individual biases using definitions grounded in optimal transport theory. Besides identifying these biases, we will also consider mitigation methods that handle local bias while maintaining some accuracy in the output.
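As a toy illustration of quantifying a group-level (global) bias with an optimal-transport distance (a sketch using the one-dimensional Wasserstein distance between score distributions; the group names and score model are hypothetical, and this is not the speaker's methodology, only the underlying quantity):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
scores_group_a = rng.normal(0.6, 0.1, 5000)   # model scores received by one group
scores_group_b = rng.normal(0.5, 0.1, 5000)   # model scores received by another group

# Wasserstein-1 distance between the two score distributions:
# zero would mean both groups receive identically distributed scores.
print(wasserstein_distance(scores_group_a, scores_group_b))
```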

Back to day schedule

17.20: Coffee break

Session: High dimensional covariance estimation

Chair: Cristina Butucea, ENSAE

17:40: Zhao Ren University of Pittsburgh Heteroskedastic Sparse PCA in High Dimensions

Principal component analysis (PCA) is one of the most commonly used techniques for dimension reduction and feature extraction. Though high-dimensional sparse PCA has been well studied, little is known when the noise is heteroskedastic, which turns out to be ubiquitous in many scenarios, such as biological sequencing data and information network data. We propose an iterative algorithm for sparse PCA in the presence of heteroskedastic noise, which alternately updates the estimates of the sparse eigenvectors using orthogonal iteration with adaptive thresholding in one step, and imputes the diagonal values of the sample covariance matrix to reduce the estimation bias due to heteroskedasticity in the other step. Our procedure is computationally fast and provably optimal under the generalized spiked covariance model, assuming the leading eigenvectors are sparse. A comprehensive simulation study demonstrates its robustness and effectiveness in various settings.
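A rough sketch of the kind of alternating scheme described above (an illustrative implementation under simplifying assumptions such as hard thresholding at a fixed level and a known number of components; not the authors' exact algorithm or tuning):

```python
import numpy as np

def hetero_sparse_pca(S, r, thresh, n_iter=50):
    """Alternate between (i) extracting sparse leading eigenvectors of the
    current matrix via an eigendecomposition plus hard thresholding, and
    (ii) re-imputing the diagonal of the sample covariance (which is biased
    by heteroskedastic noise) with the diagonal of the current low-rank fit."""
    M = S.copy()
    np.fill_diagonal(M, 0.0)                     # crude initial diagonal removal
    for _ in range(n_iter):
        vals, vecs = np.linalg.eigh(M)
        U = vecs[:, -r:].copy()                  # top-r eigenvectors
        U[np.abs(U) < thresh] = 0.0              # hard thresholding -> sparsity
        norms = np.linalg.norm(U, axis=0)
        U[:, norms > 0] /= norms[norms > 0]      # renormalize nonzero columns
        low_rank = U @ np.diag(vals[-r:]) @ U.T
        M = S.copy()
        np.fill_diagonal(M, np.diag(low_rank))   # impute the diagonal entries
    return U

# Example: sparse spike plus heteroskedastic diagonal noise
rng = np.random.default_rng(0)
p = 100
u = np.zeros(p); u[:5] = 1 / np.sqrt(5)          # sparse leading eigenvector
Sigma = 5 * np.outer(u, u) + np.diag(rng.uniform(0.5, 3.0, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=400)
U_hat = hetero_sparse_pca(np.cov(X, rowvar=False), r=1, thresh=0.05)
```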

Back to day schedule

18:20: Angelika Rohde Freiburg University Bootstrapping high-dimensional sample covariance matrices

Bootstrapping is the classical approach for distributional approximation of estimators and test statistics when an asymptotic distribution contains unknown quantities or provides a poor approximation. For the analysis of massive data, however, the bootstrap is computationally intractable in its basic sampling-with-replacement version. Moreover, it is not even valid in some important high-dimensional applications. Combining subsampling of observations with suitable selection of their coordinates, we introduce a new \((m, mp/n)\) out of \((n, p)\)-sampling with replacement bootstrap for eigenvalue statistics of high-dimensional sample covariance matrices based on \(n\) independent \(p\)-dimensional random vectors. In the high-dimensional scenario \(p/n \rightarrow c \in [0,\infty)\), this fully nonparametric bootstrap is shown to consistently reproduce the underlying spectral measure if \(m/n \rightarrow 0\). If \(m^2/n \rightarrow 0\), it correctly approximates the distribution of linear spectral statistics. The crucial component is a suitably defined representative subpopulation condition, which is shown to be verified in a large variety of situations. The proofs incorporate several delicate technical results which may be of independent interest.
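One plausible reading of the resampling scheme (a sketch under assumed conventions, e.g. drawing both observations and coordinates with replacement and rounding \(mp/n\) to an integer; the actual scheme and the representative subpopulation condition are as defined in the paper):

```python
import numpy as np

def mn_bootstrap_spectral_stat(X, m, stat, n_boot=500, seed=0):
    """(m, mp/n) out of (n, p) bootstrap for a statistic `stat` of the
    eigenvalues of the sample covariance matrix: resample m of the n
    observations and roughly m*p/n of the p coordinates, recompute."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = max(1, round(m * p / n))             # number of resampled coordinates
    draws = []
    for _ in range(n_boot):
        rows = rng.integers(0, n, size=m)    # observations, with replacement
        cols = rng.integers(0, p, size=q)    # coordinates, with replacement
        Sb = np.cov(X[np.ix_(rows, cols)], rowvar=False)
        draws.append(stat(np.linalg.eigvalsh(Sb)))
    return np.array(draws)

# Example: bootstrap distribution of a linear spectral statistic (mean eigenvalue)
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 200))
boot_draws = mn_bootstrap_spectral_stat(X, m=100, stat=np.mean)
```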

Back to day schedule

19.30: Dinner