SNPLinkage is modular enough to use directly dataframes of correlation matrices and chromosomic positions specified by the user, e.g. for visualizing RNASeq data. The user can compute a correlation matrix or any kind of pair-wise similarity matrix independently and then use SNPLinkage to build and arrange easily customizable ggplot2 objects.
The user can specify the correlations he wants to visualize as a
dataframe to the ggplot_ld
function. The column names must
follow the following pattern: SNP_A
and SNP_B
for the two variables in relation, and R2
for the
correlation value.
library(snplinkage)
#> Loading required package: GWASTools
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> Loading required package: generics
#>
#> Attaching package: 'generics'
#> The following objects are masked from 'package:base':
#>
#> as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
#> setequal, union
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, is.unsorted, lapply,
#> mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
#> rank, rbind, rownames, sapply, saveRDS, table, tapply, unique,
#> unsplit, which.max, which.min
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
# example rnaseq data matrix, 20 variables of 20 patients
m_rna = matrix(runif(20 ^ 2), nrow = 20)
# pair-wise correlation matrix
m_ld = cor(m_rna) ^ 2
# keep only upper triangle and reshape to data frame
m_ld[lower.tri(m_ld, diag = TRUE)] = NA
df_ld = reshape2::melt(m_ld) |> na.omit()
# rename for SNPLinkage
names(df_ld) = c('SNP_A', 'SNP_B', 'R2')
# visualize with ggplot_ld
gg_ld = ggplot_ld(df_ld)
gg_ld
Similarly, the user can specify a dataframe to the
ggplot_snp_pos
function. The dataframe is assumed to be in
the same order as the correlation dataframe, and the column name
position
is required.
# let's imagine the 20 variables came from 3 physically close regions
positions = c(runif(7, 31e6, 31.5e6), runif(6, 32e6, 32.5e6),
runif(7, 33e6, 33.5e6)) |> sort()
# build the dataframe
df_snp_pos = data.frame(position = positions)
# minimal call
gg_snp_pos = ggplot_snp_pos(df_snp_pos)
Optionally, one can specify the labels_colname
parameter
to give the name of a column that will have the labels to display.
We then arrange the plots with the gtable_ld_grobs
function. One needs to specify in the labels_colname
parameter if the chromosomic positions plot was built with labels or
not. The title
parameter is also required.
l_ggs = list(snp_pos = gg_snp_pos, ld = gg_ld)
gt_ld = gtable_ld_grobs(l_ggs, labels_colname = TRUE,
title = 'RNASeq correlations')
grid::grid.draw(gt_ld)
Finally we add the variables’ associations to our outcome of
interest. The ggplot_associations
uses as input a dataframe
and accepts a parameter pvalue_colname
to specify which
column holds the association values, by default ‘pvalues’. It also
requires a labels_colname
parameter to specify the column
holding the labels, and a column named chromosome
. The
linked_area
parameter will affect how the associations are
plotted and it is recommended to be used in combination with the
diamonds
parameter of ggplot_ld
(i.e. TRUE for
small number of variables, approximately less than 40).
Additionally, the n_labels
parameter controls the number
of highest association labels displayed (be default 10, the behavior can
be disabled by setting labels_colname
to NULL), and the
nudge
parameter will affect how the labels are displayed
(passed to geom_label_repel
function of ‘ggrepel’
package).
# let's imagine the middle region, HLA-B, is more associated with the outcome
pvalues = c(runif(7, 1e-3, 1e-2), runif(6, 1e-8, 1e-6), runif(7, 1e-3, 1e-2))
log10_pvals = -log10(pvalues)
# we can reuse the df_snp_pos object
df_snp_pos$pvalues = log10_pvals
# add the chromosome column
df_snp_pos$chromosome = 6
gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
linked_area = TRUE, nudge = c(0, 0.5),
n_labels = 12)
We then arrange the plots with the gtable_ld_grobs
function as previously. We need to call the ggplot_snp_pos
function with the upper_subset
parameter set to TRUE for it
to connect to the upper graph.
gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
upper_subset = TRUE)
# let's also say the middle region HLA-B is particularly correlated
df_ld$R2[df_ld$SNP_A %in% 8:13 & df_ld$SNP_B %in% 8:13] = runif(15, 0.7, 0.9)
gg_ld = ggplot_ld(df_ld)
l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
We can extract a title and remove the horizontal axis text as follows.
library(ggplot2)
gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
paste('-', nrow(df_snp_pos), 'SNPs')
gg_assocs <- gg_assocs + labs(title = title, x = NULL)
l_ggs$pval = gg_assocs
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
grid::grid.draw(gt_ld)
Let’s say we want to change the color of the associations area. We first need to identify which layer it corresponds to:
gg_assocs$layers
#> [[1]]
#> geom_area: na.rm = FALSE, orientation = NA, outline.type = upper
#> stat_align: na.rm = FALSE, orientation = NA
#> position_stack
#>
#> [[2]]
#> mapping: xend = ~x
#> geom_segment: arrow = NULL, arrow.fill = NULL, lineend = butt, linejoin = round, na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity
#>
#> [[3]]
#> geom_label_repel: parse = FALSE, box.padding = 0.25, label.padding = 0.1, point.padding = 1e-06, label.r = 0.15, label.size = 0.1, min.segment.length = 0.5, arrow = NULL, na.rm = TRUE, force = 1, force_pull = 1, max.time = 0.5, max.iter = 10000, max.overlaps = 10, nudge_x = 0, nudge_y = 0.5, xlim = c(NA, NA), ylim = c(NA, NA), direction = both, seed = NA, verbose = FALSE
#> stat_identity: na.rm = TRUE
#> position_nudge_repel
#>
#> [[4]]
#> geom_point: na.rm = FALSE
#> stat_identity: na.rm = FALSE
#> position_identity
Then we can change the color with:
And rebuild our object.
To change the lines and labels colors, parameters in the functions are available. You can either specify a single value or a vector of same length as your number of features.
gg_pos_biplot = ggplot_snp_pos(df_snp_pos, labels_colname = 'label',
upper_subset = TRUE, colors = '#101d6b')
gg_assocs = ggplot_associations(df_snp_pos, labels_colname = 'label',
linked_area = TRUE, nudge = c(0, 0.5),
n_labels = 12, colors = '#101d6b')
# extract title
gg_assocs <- gg_assocs + theme(axis.text.x = element_blank())
title <- gg_assocs$labels$x %>% gsub(' (Mbp)', '', ., fixed = TRUE) %>%
paste('-', nrow(df_snp_pos), 'SNPs')
gg_assocs <- gg_assocs + labs(title = title, x = NULL)
# replace area color
gg_assocs$layers[[1]]$aes_params$fill = "#0147ab"
# rebuild
l_ggs = list(pos = gg_pos_biplot, ld = gg_ld, pval = gg_assocs)
gt_ld = gtable_ld_associations_combine(l_ggs, diamonds = TRUE)
grid::grid.draw(gt_ld)
In this dataset from the ‘gap’ package (Zhao, Kurt Hornik, and Ripley 2015), 206 SNPs from chromosome 5 (5q31) were measured from 129 Crohn’s disease patients and their 2 parents, totalling 387 samples.
data('crohn')
m_hla = crohn[, -(1:6)]
m_ld = cor(m_hla) ^ 2
# keep only upper triangle and reshape to data frame
m_ld[lower.tri(m_ld, diag = TRUE)] = NA
df_ld = reshape2::melt(m_ld) |> na.omit()
# rename for SNPLinkage
names(df_ld) = c('SNP_A', 'SNP_B', 'R2')
# visualize with ggplot_ld
gg_ld = ggplot_ld(df_ld)
Compute p-values
mlog10_pvals = chisq_pvalues(m_hla, crohn[, 'crohn'])
df_pos = data.frame(probe_id = colnames(m_hla), pvalues = mlog10_pvals,
chromosome = 5)
# if we don't have positions we can use byindex = TRUE
gg_assocs = ggplot_associations(df_pos, byindex = TRUE, nudge = c(0, 0.5))
Arrange with ‘cowplot’
Focus on most associated
df_top_assocs = subset(df_pos, pvalues > quantile(pvalues, 0.9))
gg_assocs = ggplot_associations(df_top_assocs, linked_area = TRUE,
nudge = c(0, 0.5))
df_ld = subset(df_ld, SNP_A %in% df_top_assocs$probe_id &
SNP_B %in% df_top_assocs$probe_id)
gg_ld = ggplot_ld(df_ld)
cowplot::plot_grid(gg_assocs, gg_ld, nrow = 2)
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.5.1 snplinkage_1.2.0 GWASTools_1.53.0
#> [4] Biobase_2.67.0 BiocGenerics_0.53.3 generics_0.1.3
#> [7] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] Rdpack_2.6.2 DBI_1.2.3 gdsfmt_1.43.1
#> [4] httr2_1.1.0 sandwich_3.1-1 biomaRt_2.63.0
#> [7] rlang_1.1.5 magrittr_2.0.3 compiler_4.4.2
#> [10] RSQLite_2.3.9 mgcv_1.9-1 reshape2_1.4.4
#> [13] png_0.1-8 vctrs_0.6.5 quantreg_5.99.1
#> [16] stringr_1.5.1 pkgconfig_2.0.3 shape_1.4.6.1
#> [19] crayon_1.5.3 SNPRelate_1.41.0 fastmap_1.2.0
#> [22] backports_1.5.0 dbplyr_2.5.0 XVector_0.47.2
#> [25] labeling_0.4.3 UCSC.utils_1.3.1 nloptr_2.1.1
#> [28] MatrixModels_0.5-3 purrr_1.0.2 bit_4.5.0.1
#> [31] xfun_0.50 glmnet_4.1-8 jomo_2.7-6
#> [34] logistf_1.26.0 cachem_1.1.0 GenomeInfoDb_1.43.2
#> [37] jsonlite_1.8.9 progress_1.2.3 blob_1.2.4
#> [40] pan_1.9 parallel_4.4.2 broom_1.0.7
#> [43] prettyunits_1.2.0 R6_2.5.1 bslib_0.8.0
#> [46] stringi_1.8.4 boot_1.3-31 DNAcopy_1.81.0
#> [49] rpart_4.1.24 lmtest_0.9-40 jquerylib_0.1.4
#> [52] Rcpp_1.0.14 iterators_1.0.14 knitr_1.49
#> [55] zoo_1.8-12 IRanges_2.41.2 Matrix_1.7-1
#> [58] splines_4.4.2 nnet_7.3-20 tidyselect_1.2.1
#> [61] yaml_2.3.10 codetools_0.2-20 curl_6.1.0
#> [64] plyr_1.8.9 lattice_0.22-6 tibble_3.2.1
#> [67] withr_3.0.2 quantsmooth_1.73.0 KEGGREST_1.47.0
#> [70] evaluate_1.0.3 survival_3.8-3 BiocFileCache_2.15.1
#> [73] xml2_1.3.6 Biostrings_2.75.3 filelock_1.0.3
#> [76] pillar_1.10.1 mice_3.17.0 foreach_1.5.2
#> [79] stats4_4.4.2 reformulas_0.4.0 S4Vectors_0.45.2
#> [82] hms_1.1.3 munsell_0.5.1 scales_1.3.0
#> [85] minqa_1.2.8 GWASExactHW_1.2 glue_1.8.0
#> [88] maketools_1.3.1 tools_4.4.2 sys_3.4.3
#> [91] data.table_1.16.4 lme4_1.1-36 SparseM_1.84-2
#> [94] buildtools_1.0.0 cowplot_1.1.3 grid_4.4.2
#> [97] tidyr_1.3.1 rbibutils_2.3 AnnotationDbi_1.69.0
#> [100] colorspace_2.1-1 nlme_3.1-166 formula.tools_1.7.1
#> [103] GenomeInfoDbData_1.2.13 cli_3.6.3 rappdirs_0.3.3
#> [106] dplyr_1.1.4 gtable_0.3.6 sass_0.4.9
#> [109] digest_0.6.37 operator.tools_1.6.3 ggrepel_0.9.6
#> [112] farver_2.1.2 memoise_2.0.1 htmltools_0.5.8.1
#> [115] lifecycle_1.0.4 httr_1.4.7 mitml_0.4-5
#> [118] bit64_4.6.0-1 MASS_7.3-64