A Kernel-based Test of Independence for Cluster-correlated Data

Contents

Overview

We propose a kernel-based independence test, HSICcl, to evaluate the generalized association between two multivariate (and potentially high-dimensional) variables based on cluster-correlated data. This test is built upon the Hilbert-Schmidt Independence Criterion (HSIC) proposed by Gretton et al. (2005), while making proper accommodation for clustered data. It makes no assumption on the distributions of the variables or the nature of the dependence. HSICcl is implemented as the HSIC_cl() function in R language. We introduce the usage of the HSIC_cl() function and demonstrate how to replicate the simulation results in our paper.

Requirements

The CompQuadForm R package is required for the HSIC_cl() function. R packages used in the simulation section include MASS, reshape2 and ggplot2. All the packages are available on CRAN.

To install the CompQuadForm package (and other R packages), we can run the following command in R environment:

install.packages("CompQuadForm")

To source the HSIC_cl() function, we can run the following command:

source("R/HSIC_for_clustered_data.R")

Demo and instructions for use

Consider a set of exposure variables \(X\) and a set of outcome variables \(Y\), with a joint probability distribution \(P_{XY}\). Suppose a sample \(\{(X_j, Y_j) \}_{j=1}^n\) is drawn from \(P_{XY}\) with clustered correlation among the observations. Specifically, we assume that \((X_1, Y_1), \cdots, (X_n, Y_n)\) are identically distributed according to \(P_{XY}\), and the \(m\) clusters of fixed size \(d\): \[\begin{equation} \Big \{ \Big[ (X_{di-d+1}, Y_{di-d+1}), \cdots, (X_{di}, Y_{di}) \Big] \Big\}_{i=1}^m \end{equation}\] are independent from each other while having identical within-cluster correlation structure. We aim to evaluate the dependence between \(X\) and \(Y\) based on the sample \(\{(X_j, Y_j) \}_{j=1}^n\).

To demonstrate the usage of HSICcl, we simulate multivariate normal observations \(\{(X_j, Y_j) \}_{j=1}^n\) with clustered correlation, where \(X \in \mathbb{R}^{20}\) and \(Y \in \mathbb{R}^{20}\), according to Section 4.1.1 of our paper. Here we let the exposure \(X\) be independent from the outcome \(Y\).

library(MASS)
source("R/useful_functions.R")

## Baseline parameters
m <- 100 # number of clusters
d <- 3  # cluster size
n_expo <- 20  # number of exposure/outcome variables
rho_c <- 0.5  # parameter for AR(1) within-cluster correlation structure

## within-cluster covariance matrix, with AR(1) structure
cluster_cov_mt <- ar1_cor(d, rho_c)

## covariance matrix for exposure variables
expo_cov_mt <- matrix(0.5, n_expo, n_expo)  
diag(expo_cov_mt) <- 1

## covariance matrix for outcome variables
outcome_cov_mt <- matrix(0.5, n_expo, n_expo)  
diag(outcome_cov_mt) <- 1

## simulate MVN exposures
set.seed(41)
expo_list <- replicate(m, sim_mvn_in_cluster(5, n_expo, expo_cov_mt, cluster_cov_mt, d),
                         simplify = F)
expo_matrix <- do.call(rbind, expo_list)
  
## simulate MVN outcomes
outcome_list <- replicate(m, sim_mvn_in_cluster(0, n_expo, outcome_cov_mt, cluster_cov_mt, d),
                            simplify = F)
outcome_matrix <- do.call(rbind, outcome_list)

We then construct kernel matrices based on the exposure and outcome matrices and center them. Here we use linear kernels for both exposures and outcomes.

# construct linear kernel matrices
expo.K <- expo_matrix %*% t(expo_matrix)  
outcome.K <- outcome_matrix %*% t(outcome_matrix)

# centering matrix 
I.m <- diag(1,m*d)
I.1 <- rep(1,m*d)
H <- I.m - I.1 %*% t(I.1)/(m*d)
  
# center the kernel matrices
outcome.K.c <- H %*% outcome.K %*% H
expo.K.c <- H %*% expo.K %*% H

Now we can use the HSIC_cl() function to perform the HSICcl test and obtain a p-value. Description of the input arguments for HSIC_cl() can be found in HSIC_for_clustered_data.R.

p_value <- HSIC_cl(m, expo.K.c, outcome.K.c, kernel_type="linear", n_expo,
           n_expo)$pval
p_value
## [1] 0.1270706

Note that, if we use the original HSIC test, as developed by Gretton et al. (2007) and Zhang et al. (2012) and implemented by Broadaway et al. (2016), without any adjustment for within-cluster correlation, we might obtain a false positive result.

p_value_orig <- HSIC_orig(expo.K.c, outcome.K.c)$pval
p_value_orig
## [1] 0.02808093

Reproduction of simulation results

We provide example code: R/simulation_code.R to reproduce simulation results in Section 4.1.2 of our paper.

For illustrative purposes, we only consider \(m=500\), \(\rho_c=0.5\) with 50 simulated data sets under type I error simulation and Power Scenario 1. Results for the other settings can be produced similarly by altering the input arguments.

The following commands shall be run in the Bash environment (from command line). On a 12-core computer with 2.40 GHz CPUs and 256 GB memory, it takes approximately 5-6 minutes to run each command using the linear kernel, and 84-87 minutes to run each command using the Gaussian kernel.

Simulation to evaluate type I error of HSICcl and HSICorig (Table 1)

Simulation to evaluate power of HSICcl and competing methods (Figure 2)

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.1.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.2.csv'  

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.3.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.4.csv' 


Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.1.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.2.csv'  

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.3.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.4.csv' 

Report the empirical type I error rate and power

Example output files of the above Bash commands are provided in the “Simulation_results” folder. We next return to R and analyze these output files.

The empirical type I error rate at significance level \(\alpha\) is reported as following:

Example empirical type I error rate
\(\alpha\) \(\rho_c\) Kernel HSICorig HSICcl
0.10 0.5 Linear 0.18 0.08
0.10 0.5 Gaussian 0.74 0.06
0.05 0.5 Linear 0.14 0.04
0.05 0.5 Gaussian 0.56 0.04

The empirical power at a significance level of 0.05 (under Power Scenario 1) is reported as following:

License

All content is licensed under the GPL-3.0 license.

References

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.

Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola, et al. A kernel statistical test of independence. In NIPS, volume 20, pages 585–592. Citeseer, 2007.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11, page 804–813, Arlington, Virginia, USA, 2011. AUAI Press.

K Alaine Broadaway, David J Cutler, Richard Duncan, Jacob L Moore, Erin B Ware, Min A Jhun, Lawrence F Bielak, Wei Zhao, Jennifer A Smith, Patricia A Peyser, et al. A statistical approach for testing cross-phenotype effects of rare variants. The American Journal of Human Genetics, 98(3):525–540, 2016.