We propose a kernel-based independence test, HSICcl, to evaluate the generalized association between two multivariate (and potentially high-dimensional) variables based on cluster-correlated data. This test is built upon the Hilbert-Schmidt Independence Criterion (HSIC) proposed by Gretton et al. (2005), while making proper accommodation for clustered data. It makes no assumption on the distributions of the variables or the nature of the dependence. HSICcl is implemented as the HSIC_cl()
function in R language. We introduce the usage of the HSIC_cl()
function and demonstrate how to replicate the simulation results in our paper.
The CompQuadForm R package is required for the HSIC_cl()
function. R packages used in the simulation section include MASS, reshape2 and ggplot2. All the packages are available on CRAN.
To install the CompQuadForm package (and other R packages), we can run the following command in R environment:
To source the HSIC_cl()
function, we can run the following command:
Consider a set of exposure variables \(X\) and a set of outcome variables \(Y\), with a joint probability distribution \(P_{XY}\). Suppose a sample \(\{(X_j, Y_j) \}_{j=1}^n\) is drawn from \(P_{XY}\) with clustered correlation among the observations. Specifically, we assume that \((X_1, Y_1), \cdots, (X_n, Y_n)\) are identically distributed according to \(P_{XY}\), and the \(m\) clusters of fixed size \(d\): \[\begin{equation} \Big \{ \Big[ (X_{di-d+1}, Y_{di-d+1}), \cdots, (X_{di}, Y_{di}) \Big] \Big\}_{i=1}^m \end{equation}\] are independent from each other while having identical within-cluster correlation structure. We aim to evaluate the dependence between \(X\) and \(Y\) based on the sample \(\{(X_j, Y_j) \}_{j=1}^n\).
To demonstrate the usage of HSICcl, we simulate multivariate normal observations \(\{(X_j, Y_j) \}_{j=1}^n\) with clustered correlation, where \(X \in \mathbb{R}^{20}\) and \(Y \in \mathbb{R}^{20}\), according to Section 4.1.1 of our paper. Here we let the exposure \(X\) be independent from the outcome \(Y\).
library(MASS)
source("R/useful_functions.R")
## Baseline parameters
m <- 100 # number of clusters
d <- 3 # cluster size
n_expo <- 20 # number of exposure/outcome variables
rho_c <- 0.5 # parameter for AR(1) within-cluster correlation structure
## within-cluster covariance matrix, with AR(1) structure
cluster_cov_mt <- ar1_cor(d, rho_c)
## covariance matrix for exposure variables
expo_cov_mt <- matrix(0.5, n_expo, n_expo)
diag(expo_cov_mt) <- 1
## covariance matrix for outcome variables
outcome_cov_mt <- matrix(0.5, n_expo, n_expo)
diag(outcome_cov_mt) <- 1
## simulate MVN exposures
set.seed(41)
expo_list <- replicate(m, sim_mvn_in_cluster(5, n_expo, expo_cov_mt, cluster_cov_mt, d),
simplify = F)
expo_matrix <- do.call(rbind, expo_list)
## simulate MVN outcomes
outcome_list <- replicate(m, sim_mvn_in_cluster(0, n_expo, outcome_cov_mt, cluster_cov_mt, d),
simplify = F)
outcome_matrix <- do.call(rbind, outcome_list)
We then construct kernel matrices based on the exposure and outcome matrices and center them. Here we use linear kernels for both exposures and outcomes.
# construct linear kernel matrices
expo.K <- expo_matrix %*% t(expo_matrix)
outcome.K <- outcome_matrix %*% t(outcome_matrix)
# centering matrix
I.m <- diag(1,m*d)
I.1 <- rep(1,m*d)
H <- I.m - I.1 %*% t(I.1)/(m*d)
# center the kernel matrices
outcome.K.c <- H %*% outcome.K %*% H
expo.K.c <- H %*% expo.K %*% H
Now we can use the HSIC_cl()
function to perform the HSICcl test and obtain a p-value. Description of the input arguments for HSIC_cl()
can be found in HSIC_for_clustered_data.R.
## [1] 0.1270706
Note that, if we use the original HSIC test, as developed by Gretton et al. (2007) and Zhang et al. (2012) and implemented by Broadaway et al. (2016), without any adjustment for within-cluster correlation, we might obtain a false positive result.
## [1] 0.02808093
We provide example code: R/simulation_code.R to reproduce simulation results in Section 4.1.2 of our paper.
For illustrative purposes, we only consider \(m=500\), \(\rho_c=0.5\) with 50 simulated data sets under type I error simulation and Power Scenario 1. Results for the other settings can be produced similarly by altering the input arguments.
The following commands shall be run in the Bash environment (from command line). On a 12-core computer with 2.40 GHz CPUs and 256 GB memory, it takes approximately 5-6 minutes to run each command using the linear kernel, and 84-87 minutes to run each command using the Gaussian kernel.
# power: 0 = type I error analysis, 1 = Power Scenario 1, 2 = Power Scenario 2.
# m: number of clusters
# d: cluster size
# n_expo: number of exposures/outcomes
# rho_c: parameter for the AR(1) within-cluster correlation structure
# gamma: proportion of outcomes associated with the exposure
# kernel: type of kernel (linear or gaussian)
# seed_num: seed number for simulation
# n_sim: number of data sets to simulate
# output_file: name of the output file
Rscript R/simulation_code.R --power=0 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv'
Rscript R/simulation_code.R --power=0 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.1.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.2.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.3.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.4.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.1.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.2.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.3.csv'
Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.4.csv'
Example output files of the above Bash commands are provided in the “Simulation_results” folder. We next return to R and analyze these output files.
The empirical type I error rate at significance level \(\alpha\) is reported as following:
library(knitr)
type_I_error_result <- data.frame('alpha'=rep(c(0.1, 0.05),each=2),'rho_c'=0.5,
'kernel'=rep(c('Linear','Gaussian'),2), 'HSIC_orig'=NA,
'HSIC_cl'=NA)
type_I_error_result[1,4:5] <- result_eval(power_ind=0, "Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv", sig_level=0.1)
type_I_error_result[3,4:5] <- result_eval(power_ind=0, "Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv", sig_level=0.05)
type_I_error_result[2,4:5] <- result_eval(power_ind=0, "Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv", sig_level=0.1)
type_I_error_result[4,4:5] <- result_eval(power_ind=0, "Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv", sig_level=0.05)
kable(type_I_error_result, caption = 'Example empirical type I error rate', digits = 2,
col.names = c('$\\alpha$', '$\\rho_c$', 'Kernel', 'HSIC~orig~',
'HSIC~cl~'))
\(\alpha\) | \(\rho_c\) | Kernel | HSICorig | HSICcl |
---|---|---|---|---|
0.10 | 0.5 | Linear | 0.18 | 0.08 |
0.10 | 0.5 | Gaussian | 0.74 | 0.06 |
0.05 | 0.5 | Linear | 0.14 | 0.04 |
0.05 | 0.5 | Gaussian | 0.56 | 0.04 |
The empirical power at a significance level of 0.05 (under Power Scenario 1) is reported as following:
# linear kernel
power_result_linear <- data.frame('Method'=c('HSIC_cl','HSIC_cat','HSIC_mean'),
'effsize1'=NA, 'effsize2'=NA, 'effsize3'=NA, 'effsize4'=NA)
power_result_linear[,2] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.1.csv", sig_level=0.05)
power_result_linear[,3] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.2.csv", sig_level=0.05)
power_result_linear[,4] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.3.csv", sig_level=0.05)
power_result_linear[,5] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.4.csv", sig_level=0.05)
power_plot(power_result_linear, kernel_type = 'linear kernel', rho_c=0.5,
effect_size_input = c("effsize1", "effsize2", "effsize3","effsize4"),
effect_size_output = c("10%", "20%", "30%", "40%"))
# Gaussian kernel
power_result_gauss <- data.frame('Method'=c('HSIC_cl','HSIC_cat','HSIC_mean'),
'effsize1'=NA, 'effsize2'=NA, 'effsize3'=NA, 'effsize4'=NA)
power_result_gauss[,2] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.1.csv", sig_level=0.05)
power_result_gauss[,3] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.2.csv", sig_level=0.05)
power_result_gauss[,4] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.3.csv", sig_level=0.05)
power_result_gauss[,5] <- result_eval(power_ind=1, "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.4.csv", sig_level=0.05)
power_plot(power_result_gauss, kernel_type = 'Gaussian kernel', rho_c=0.5,
effect_size_input = c("effsize1", "effsize2", "effsize3", "effsize4"),
effect_size_output = c("10%", "20%", "30%", "40%"))
All content is licensed under the GPL-3.0 license.
Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.
Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola, et al. A kernel statistical test of independence. In NIPS, volume 20, pages 585–592. Citeseer, 2007.
Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11, page 804–813, Arlington, Virginia, USA, 2011. AUAI Press.
K Alaine Broadaway, David J Cutler, Richard Duncan, Jacob L Moore, Erin B Ware, Min A Jhun, Lawrence F Bielak, Wei Zhao, Jennifer A Smith, Patricia A Peyser, et al. A statistical approach for testing cross-phenotype effects of rare variants. The American Journal of Human Genetics, 98(3):525–540, 2016.