A Kernel-based Test of Independence for Cluster-correlated Data

Requirements

The CompQuadForm R package is required for the HSIC_cl() function. R packages used in the simulation section include MASS, reshape2 and ggplot2. All the packages are available on CRAN.

To install the CompQuadForm package (and other R packages), we can run the following command in R environment:

install.packages("CompQuadForm")

To source the HSIC_cl() function, we can run the following command:

source("R/HSIC_for_clustered_data.R")

Demo and instructions for use

Consider a set of exposure variables \(X\) and a set of outcome variables \(Y\), with a joint probability distribution \(P_{XY}\). Suppose a sample \(\{(X_j, Y_j) \}_{j=1}^n\) is drawn from \(P_{XY}\) with clustered correlation among the observations. Specifically, we assume that \((X_1, Y_1), \cdots, (X_n, Y_n)\) are identically distributed according to \(P_{XY}\), and the \(m\) clusters of fixed size \(d\): \[\begin{equation} \Big \{ \Big[ (X_{di-d+1}, Y_{di-d+1}), \cdots, (X_{di}, Y_{di}) \Big] \Big\}_{i=1}^m \end{equation}\] are independent from each other while having identical within-cluster correlation structure. We aim to evaluate the dependence between \(X\) and \(Y\) based on the sample \(\{(X_j, Y_j) \}_{j=1}^n\).

To demonstrate the usage of HSIC_cl, we simulate multivariate normal observations \(\{(X_j, Y_j) \}_{j=1}^n\) with clustered correlation, where \(X \in \mathbb{R}^{20}\) and \(Y \in \mathbb{R}^{20}\), according to Section 4.1.1 of our paper. Here we let the exposure \(X\) be independent from the outcome \(Y\).

library(MASS)
source("R/useful_functions.R")

## Baseline parameters
m <- 100 # number of clusters
d <- 3  # cluster size
n_expo <- 20  # number of exposure/outcome variables
rho_c <- 0.5  # parameter for AR(1) within-cluster correlation structure

## within-cluster covariance matrix, with AR(1) structure
cluster_cov_mt <- ar1_cor(d, rho_c)

## covariance matrix for exposure variables
expo_cov_mt <- matrix(0.5, n_expo, n_expo)  
diag(expo_cov_mt) <- 1

## covariance matrix for outcome variables
outcome_cov_mt <- matrix(0.5, n_expo, n_expo)  
diag(outcome_cov_mt) <- 1

## simulate MVN exposures
set.seed(41)
expo_list <- replicate(m, sim_mvn_in_cluster(5, n_expo, expo_cov_mt, cluster_cov_mt, d),
                         simplify = F)
expo_matrix <- do.call(rbind, expo_list)
  
## simulate MVN outcomes
outcome_list <- replicate(m, sim_mvn_in_cluster(0, n_expo, outcome_cov_mt, cluster_cov_mt, d),
                            simplify = F)
outcome_matrix <- do.call(rbind, outcome_list)

We then construct kernel matrices based on the exposure and outcome matrices and center them. Here we use linear kernels for both exposures and outcomes.

# construct linear kernel matrices
expo.K <- expo_matrix %*% t(expo_matrix)  
outcome.K <- outcome_matrix %*% t(outcome_matrix)

# centering matrix 
I.m <- diag(1,m*d)
I.1 <- rep(1,m*d)
H <- I.m - I.1 %*% t(I.1)/(m*d)
  
# center the kernel matrices
outcome.K.c <- H %*% outcome.K %*% H
expo.K.c <- H %*% expo.K %*% H

Now we can use the HSIC_cl() function to perform the HSIC_cl test and obtain a p-value. Description of the input arguments for HSIC_cl() can be found in HSIC_for_clustered_data.R.

p_value <- HSIC_cl(m, expo.K.c, outcome.K.c, kernel_type="linear", n_expo,
           n_expo)$pval
p_value

## [1] 0.1270706

Note that, if we use the original HSIC test, as developed by Gretton et al. (2007) and Zhang et al. (2012) and implemented by Broadaway et al. (2016), without any adjustment for within-cluster correlation, we might obtain a false positive result.

p_value_orig <- HSIC_orig(expo.K.c, outcome.K.c)$pval
p_value_orig

## [1] 0.02808093

Reproduction of simulation results

We provide example code: R/simulation_code.R to reproduce simulation results in Section 4.1.2 of our paper.

For illustrative purposes, we only consider \(m=500\), \(\rho_c=0.5\) with 50 simulated data sets under type I error simulation and Power Scenario 1. Results for the other settings can be produced similarly by altering the input arguments.

The following commands shall be run in the Bash environment (from command line). On a 12-core computer with 2.40 GHz CPUs and 256 GB memory, it takes approximately 5-6 minutes to run each command using the linear kernel, and 84-87 minutes to run each command using the Gaussian kernel.

Simulation to evaluate type I error of HSIC_cl and HSIC_orig (Table 1)

# power: 0 = type I error analysis, 1 = Power Scenario 1, 2 = Power Scenario 2.
# m: number of clusters
# d: cluster size
# n_expo: number of exposures/outcomes
# rho_c: parameter for the AR(1) within-cluster correlation structure
# gamma: proportion of outcomes associated with the exposure
# kernel: type of kernel (linear or gaussian)
# seed_num: seed number for simulation
# n_sim: number of data sets to simulate
# output_file: name of the output file

Rscript R/simulation_code.R --power=0 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv' 

Rscript R/simulation_code.R --power=0 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv'

Simulation to evaluate power of HSIC_cl and competing methods (Figure 2)

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.1.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.2.csv'  

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.3.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='linear' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.4.csv' 


Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.1 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.1.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.2 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.2.csv'  

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.3 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.3.csv' 

Rscript R/simulation_code.R --power=1 --m=500 --d=3 --n_expo=20 --rho_c=0.5 --gamma=0.4 --kernel='gaussian' --seed_num=123 --n_sim=50 --output_file='Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.4.csv'

Report the empirical type I error rate and power

Example output files of the above Bash commands are provided in the “Simulation_results” folder. We next return to R and analyze these output files.

The empirical type I error rate at significance level \(\alpha\) is reported as following:

library(knitr)

type_I_error_result <- data.frame('alpha'=rep(c(0.1, 0.05),each=2),'rho_c'=0.5,
                                  'kernel'=rep(c('Linear','Gaussian'),2), 'HSIC_orig'=NA,
                                  'HSIC_cl'=NA)
type_I_error_result[1,4:5] <- result_eval(power_ind=0,   "Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv", sig_level=0.1)
type_I_error_result[3,4:5] <- result_eval(power_ind=0,   "Simulation_results/example_type_I_error_result_500cl_20expo_20out_linearK_rho0.5.csv", sig_level=0.05)
type_I_error_result[2,4:5] <- result_eval(power_ind=0,   "Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv", sig_level=0.1)
type_I_error_result[4,4:5] <- result_eval(power_ind=0,   "Simulation_results/example_type_I_error_result_500cl_20expo_20out_gaussK_rho0.5.csv", sig_level=0.05)

kable(type_I_error_result, caption = 'Example empirical type I error rate', digits = 2,
      col.names = c('$\\alpha$', '$\\rho_c$', 'Kernel', 'HSIC~orig~',
                    'HSIC~cl~'))

Example empirical type I error rate
\(\alpha\)	\(\rho_c\)	Kernel	HSIC_orig	HSIC_cl
0.10	0.5	Linear	0.18	0.08
0.10	0.5	Gaussian	0.74	0.06
0.05	0.5	Linear	0.14	0.04
0.05	0.5	Gaussian	0.56	0.04

The empirical power at a significance level of 0.05 (under Power Scenario 1) is reported as following:

# linear kernel
power_result_linear <- data.frame('Method'=c('HSIC_cl','HSIC_cat','HSIC_mean'),
                                  'effsize1'=NA, 'effsize2'=NA, 'effsize3'=NA, 'effsize4'=NA)
power_result_linear[,2] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.1.csv", sig_level=0.05)
power_result_linear[,3] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.2.csv", sig_level=0.05)
power_result_linear[,4] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.3.csv", sig_level=0.05)
power_result_linear[,5] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_linearK_rho0.5_gamma0.4.csv", sig_level=0.05)

power_plot(power_result_linear, kernel_type = 'linear kernel', rho_c=0.5,
                       effect_size_input = c("effsize1", "effsize2", "effsize3","effsize4"), 
                       effect_size_output = c("10%", "20%", "30%", "40%"))

# Gaussian kernel
power_result_gauss <- data.frame('Method'=c('HSIC_cl','HSIC_cat','HSIC_mean'),
                                  'effsize1'=NA, 'effsize2'=NA, 'effsize3'=NA, 'effsize4'=NA)
power_result_gauss[,2] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.1.csv", sig_level=0.05)
power_result_gauss[,3] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.2.csv", sig_level=0.05)
power_result_gauss[,4] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.3.csv", sig_level=0.05)
power_result_gauss[,5] <- result_eval(power_ind=1,   "Simulation_results/example_power_result_500cl_20expo_20out_gaussK_rho0.5_gamma0.4.csv", sig_level=0.05)

power_plot(power_result_gauss, kernel_type = 'Gaussian kernel', rho_c=0.5,
                       effect_size_input = c("effsize1", "effsize2", "effsize3", "effsize4"), 
                       effect_size_output = c("10%", "20%", "30%", "40%"))

License

All content is licensed under the GPL-3.0 license.

References

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. In International conference on algorithmic learning theory, pages 63–77. Springer, 2005.

Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, Alexander J Smola, et al. A kernel statistical test of independence. In NIPS, volume 20, pages 585–592. Citeseer, 2007.

Kun Zhang, Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Kernel-based conditional independence test and application in causal discovery. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI’11, page 804–813, Arlington, Virginia, USA, 2011. AUAI Press.

K Alaine Broadaway, David J Cutler, Richard Duncan, Jacob L Moore, Erin B Ware, Min A Jhun, Lawrence F Bielak, Wei Zhao, Jennifer A Smith, Patricia A Peyser, et al. A statistical approach for testing cross-phenotype effects of rare variants. The American Journal of Human Genetics, 98(3):525–540, 2016.

A Kernel-based Test of Independence for Cluster-correlated Data

Contents

Overview

Requirements

Demo and instructions for use

Reproduction of simulation results

Simulation to evaluate type I error of HSIC_cl and HSIC_orig (Table 1)

Simulation to evaluate power of HSIC_cl and competing methods (Figure 2)

Report the empirical type I error rate and power

License

References

A Kernel-based Test of Independence for Cluster-correlated Data

Contents

Overview

Requirements

Demo and instructions for use

Reproduction of simulation results

Simulation to evaluate type I error of HSICcl and HSICorig (Table 1)

Simulation to evaluate power of HSICcl and competing methods (Figure 2)

Report the empirical type I error rate and power

License

References

Simulation to evaluate type I error of HSIC_cl and HSIC_orig (Table 1)

Simulation to evaluate power of HSIC_cl and competing methods (Figure 2)