Title: | OPTICS K-Xi Density-Based Clustering |
---|---|
Description: | Density-based clustering methods are well adapted to the clustering of high-dimensional data and enable the discovery of core groups of various shapes despite large amounts of noise. This package provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework to compare k-Xi models using distance-based metrics to investigate datasets with unknown number of clusters. The vignette first introduces density-based algorithms with simulated datasets, then presents and evaluates the k-Xi cluster extraction method. Finally, the models comparison framework is described and experimented on 2 genetic datasets to identify groups and their discriminating features. The k-Xi algorithm is a novel OPTICS cluster extraction method that specifies directly the number of clusters and does not require fine-tuning of the steepness parameter as the OPTICS Xi method. Combined with a framework that compares models with varying parameters, the OPTICS k-Xi method can identify groups in noisy datasets with unknown number of clusters. Results on summarized genetic data of 1,200 patients are in Charlon T. (2019) <doi:10.13097/archive-ouverte/unige:161795>. |
Authors: | Thomas Charlon [aut, cre] |
Maintainer: | Thomas Charlon <[email protected]> |
License: | GPL-3 |
Version: | 1.2.0 |
Built: | 2025-01-26 01:21:43 UTC |
Source: | https://gitlab.com/thomaschln/opticskxi |
Pipe an object forward into a function or call expression and update the 'lhs' object with the resulting value. Magrittr imported function, see details and examples in the magrittr package.
lhs |
An object which serves both as the initial value and as target. |
rhs |
a function call using the magrittr semantics. |
None, used to update the value of lhs.
Pipe an object forward into a function or call expression. Magrittr imported function, see details and examples in the magrittr package.
lhs |
A value or the magrittr placeholder. |
rhs |
A function call using the magrittr semantics. |
Result of rhs applied to lhs, see details in magrittr package.
Expose the names in 'lhs' to the 'rhs' expression. Magrittr imported function, see details and examples in the magrittr package.
lhs |
A list, environment, or a data.frame. |
rhs |
An expression where the names in lhs is available. |
Result of rhs applied to one or several names of lhs.
Include NAs and add totals to table.
contingency_table(...)
contingency_table(...)
... |
Passed to table |
Table object
Inputs will be L2 normalized, then matrix multiplied. If second input is missing, first input will be recycled, which enables to efficiently compute cosine similarities between the rows of a rectangular matrix.
cosine_simi(x, y)
cosine_simi(x, y)
x |
Numeric vector or matrix |
y |
Numeric vector or matrix. If missing, copied from parameter x. |
Symmetric numeric similarity matrix
The data set consist of 103 common (>5% minor allele frequency) SNPs genotyped in 129 trios from an European-derived population. These SNPs are in a 500-kb region on human chromosome 5q31 implicated as containing a genetic risk factor for Crohn disease.
Imported from the gap R package.
An example use of the data is with the following paper, Kelly M. Burkett, Celia M. T. Greenwood, BradMcNeney, Jinko Graham. Gene genealogies for genetic association mapping, with application to Crohn's disease. Fron Genet 2013, 4(260) doi: 10.3389/fgene.2013.00260
data(crohn)
data(crohn)
A data frame containing 387 rows and 212 columns
MJ Daly, JD Rioux, SF Schaffner, TJ Hudson, ES Lander (2001) High-resolution haplotype structure in the human genome Nature Genetics 29:229-232
Dispatch of amap::Dist, cosine_simi, and norm_inprod methods.
dist_matrix(data, method = "euclidean", n_cores = 1)
dist_matrix(data, method = "euclidean", n_cores = 1)
data |
Rectangular numeric matrix [Observations, Features] |
method |
Methods accepted by amap::Dist or cosine and norm_inprod |
n_cores |
Number of cores |
Dissimarility symmetric matrix
Use models' rankings over several metrics to select best model. Several approaches can be taken to sum the models' rankings, and instead of summing the ranks of all models over all metrics, we prefer to rank only the top models for each metrics, and set 0 to all other. This behavior is controlled by the n_top parameter. In a second step, we sum the ranks and return only the top models, and this is controlled by the n_models parameter. The output is a list of the rankings matrix, for quality control purposes, and the selected models' parameters data frame, which is used by the ensemble_models function.
ensemble_metrics( n_top = 0, df_params, metrics = NULL, metrics_exclude = NULL, n_models = 10 )
ensemble_metrics( n_top = 0, df_params, metrics = NULL, metrics_exclude = NULL, n_models = 10 )
n_top |
Threshold of number of models to rank |
df_params |
Output of opticskxi_pipeline |
metrics |
Names of metrics to use. Any of those computed by opticskxi_pipeline, e.g. 'sindex', 'ch', 'dunn', 'dunn2', 'widestgap', 'entropy' etc. NULL for all (8). |
metrics_exclude |
Names of metrics to exclude. Typically used with metrics = NULL. E.g. 'entropy'. |
n_models |
Number of best models to return |
List of metrics' rankings matrix and best models' parameters data frame.
Typically we will call ensemble_metrics with varying numbers of ranks to consider and this function will sum up the ranks from those calls.
ensemble_metrics_bootstrap(l_ensemble_metrics, n_models = 4)
ensemble_metrics_bootstrap(l_ensemble_metrics, n_models = 4)
l_ensemble_metrics |
Output of function ensemble_metrics |
n_models |
Number of best models to return |
List of parameters of best models
Call ensemble_metrics with varying numbers of rank thresholds to consider and sum up the ranks from those calls.
ensemble_models( df_kxi, n_models = 4, metrics = NULL, metrics_exclude = NULL, model_subsample = c(0.1, 0.2, 0.5), n_models_subsample = 10 )
ensemble_models( df_kxi, n_models = 4, metrics = NULL, metrics_exclude = NULL, model_subsample = c(0.1, 0.2, 0.5), n_models_subsample = 10 )
df_kxi |
Output of opticskxi_pipeline function. Dataframe with models' parameters and OPTICS k-Xi results |
n_models |
Number of best models to return |
metrics |
Names of metrics to use. Any of those computed by opticskxi_pipeline, e.g. 'sindex', 'ch', 'dunn', 'dunn2', 'widestgap', 'entropy' etc. NULL for all (8). |
metrics_exclude |
Names of metrics to exclude. Typically used with metrics = NULL. E.g. 'entropy'. |
model_subsample |
Ratios of best models to consider. |
n_models_subsample |
Number of best models when subsampling. |
Input object df_kxi subsetted to best models according to ensemble metrics.
data('m_psychwords') m_psychwords = m_psychwords[1:200, 1:20] df_params = expand.grid(n_xi = 4:5, pts = c(5, 10), dist = 'cosine', dim_red = 'ICA', n_dimred_comp = 5) df_kxi = psych_kxi_pipeline(m_data, df_params, n_min_clusters = 2) df_kxi = ensemble_models(df_kxi, n_models = 4, metrics = c('avg.silwidth', 'dunn'), model_subsample = c(0.4, 0.6), n_models_subsample = 4)
data('m_psychwords') m_psychwords = m_psychwords[1:200, 1:20] df_params = expand.grid(n_xi = 4:5, pts = c(5, 10), dist = 'cosine', dim_red = 'ICA', n_dimred_comp = 5) df_kxi = psych_kxi_pipeline(m_data, df_params, n_min_clusters = 2) df_kxi = ensemble_models(df_kxi, n_models = 4, metrics = c('avg.silwidth', 'dunn'), model_subsample = c(0.4, 0.6), n_models_subsample = 4)
Fortify a dimension reduction object
fortify_dimred( m_dimred, m_vars = NULL, v_variance = NULL, sup_vars = NULL, var_digits = 1 )
fortify_dimred( m_dimred, m_vars = NULL, v_variance = NULL, sup_vars = NULL, var_digits = 1 )
m_dimred |
Projection matrix |
m_vars |
Rotation matrix (optional) |
v_variance |
Explained variance (optional) |
sup_vars |
Optional supplementary variables |
var_digits |
Explained variance percent digits |
Data frame
pca <- prcomp(iris[-5]) df_pca <- fortify_dimred(pca$x)
pca <- prcomp(iris[-5]) df_pca <- fortify_dimred(pca$x)
Get and fortify ICA
fortify_ica(m_data, ..., sup_vars = NULL)
fortify_ica(m_data, ..., sup_vars = NULL)
m_data |
Input matrix |
... |
Passed to fastICA::fastICA |
sup_vars |
Optional supplementary variables |
Fortified dimension reduction
df_ica <- fortify_ica(iris[-5], n.comp = 2)
df_ica <- fortify_ica(iris[-5], n.comp = 2)
Get and fortify PCA
fortify_pca(m_data, ..., sup_vars = NULL)
fortify_pca(m_data, ..., sup_vars = NULL)
m_data |
Input matrix |
... |
Passed to stats::prcomp |
sup_vars |
Optional supplementary variables |
Fortified dimension reduction
df_pca <- fortify_pca(iris[-5]) df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])
df_pca <- fortify_pca(iris[-5]) df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])
Select k-Xi clustering model based on a metric and a rank
get_best_kxi(df_kxi, metric = "avg.silwidth", rank = 1)
get_best_kxi(df_kxi, metric = "avg.silwidth", rank = 1)
df_kxi |
Data frame returned by opticsxi_pipeline |
metric |
Metric to choose best model |
rank |
Rank(s) of model to choose, ordered by decreasing metric |
df_kxi row with specified metric and rank, simplified to a list if only one rank selected
Plot multiple axes of a data frame or a fortified dimension reduction.
ggpairs( df_data, group = NULL, axes = 1:2, variables = FALSE, n_vars = 0, ellipses = FALSE, ..., title = NULL, colors = if (!is.null(group)) nice_palette(df_data[[group]]) )
ggpairs( df_data, group = NULL, axes = 1:2, variables = FALSE, n_vars = 0, ellipses = FALSE, ..., title = NULL, colors = if (!is.null(group)) nice_palette(df_data[[group]]) )
df_data |
Data frame |
group |
Column name of the grouping of observations |
axes |
Axes to plot. If more than 2, plots all pair combinations |
variables |
Logical, plot variable contributions of the dimension reduction to the selected axes, only for 2 axes |
n_vars |
Maximum number of variable contributions to plot. By default 0, for all variables. |
ellipses |
Logical, plot ellipses of groups |
... |
Passed to ggplot2 stat_ellipse if ellipses are requested |
title |
String to add as title, default NULL |
colors |
Vector of colors for each group |
ggmatrix
df_pca <- fortify_pca(iris[-5]) ggpairs(df_pca) df_pca <- fortify_pca(iris[-5], sup_vars = iris[5]) ggpairs(df_pca, group = 'Species', ellipses = TRUE, variables = TRUE)
df_pca <- fortify_pca(iris[-5]) ggpairs(df_pca) df_pca <- fortify_pca(iris[-5], sup_vars = iris[5]) ggpairs(df_pca, group = 'Species', ellipses = TRUE, variables = TRUE)
Plot metrics of a kxi_pipeline output
ggplot_kxi_metrics(df_kxi, metric = c("avg.silwidth", "bw.ratio"), n = 8)
ggplot_kxi_metrics(df_kxi, metric = c("avg.silwidth", "bw.ratio"), n = 8)
df_kxi |
Data frame returned by opticskxi_pipeline |
metric |
Vector of metrics to display from the df_kxi object |
n |
Number of best models for the first metric to display |
ggplot
Plot OPTICS reachability plot.
ggplot_optics( optics_obj, groups = NULL, colors = if (!is.null(groups)) nice_palette(groups), segment_size = 300/nrow(df_optics) )
ggplot_optics( optics_obj, groups = NULL, colors = if (!is.null(groups)) nice_palette(groups), segment_size = 300/nrow(df_optics) )
optics_obj |
dbscan::optics object |
groups |
Optional vector defining groups of OPTICS observations |
colors |
If groups specified, vector of colors for each group |
segment_size |
Size for geom_segment |
ggplot
data('multishapes') optics_obj <- dbscan::optics(multishapes[1:2]) ggplot_optics(optics_obj) ggplot_optics(optics_obj, groups = opticskxi(optics_obj, n_xi = 5, pts = 30))
data('multishapes') optics_obj <- dbscan::optics(multishapes[1:2]) ggplot_optics(optics_obj) ggplot_optics(optics_obj, groups = opticskxi(optics_obj, n_xi = 5, pts = 30))
Plot OPTICS distance profiles of k-Xi clustering models
gtable_kxi_profiles(df_kxi, metric = "avg.silwidth", rank = 1:4, ...)
gtable_kxi_profiles(df_kxi, metric = "avg.silwidth", rank = 1:4, ...)
df_kxi |
Data frame returned by opticskxi_pipeline |
metric |
Metric to choose best clustering model |
rank |
Ranks of models to plot, ordered by decreasing model metric |
... |
Passed to ggplot_kxi_profile |
This data set contains HLA markers DRB, DQA, DQB and phenotypes of 271 Schizophrenia patients (y=1) and controls (y=0). Genotypes for 3 HLA loci have prefixes name (e.g., "DQB") and a suffix for each of two alleles (".a1" and ".a2").
Imported from the gap package.
data(hla)
data(hla)
A data frame containing 271 rows and 8 columns
Dr Padraig Wright of Pfizer
Data containing Glove embeddings of psychological related words, useful for demonstrating the use of ensemble metrics.
data("m_psych_embeds")
data("m_psych_embeds")
A matrix with 831 words in rows and 100 embedding dimensions in columns.
The dataset contains groups of related words among other irrelevant words.
Data containing clusters of any shapes. Useful for comparing density-based clustering (DBSCAN) and standard partitioning methods such as k-means clustering. Imported from the factoextra package.
data("multishapes")
data("multishapes")
A data frame with 1100 observations on the following 3 variables.
x
a numeric vector containing the x coordinates of observations
y
a numeric vector containing the y coordinates of observations
shape
a numeric vector corresponding to the cluster number of each observations.
The dataset contains 5 clusters and some outliers/noises.
data('multishapes') plot(multishapes[, 1], multishapes[, 2], col = multishapes[, 3], pch = 19, cex = 0.8)
data('multishapes') plot(multishapes[, 1], multishapes[, 2], col = multishapes[, 3], pch = 19, cex = 0.8)
Color palette
nice_palette(groups, rainbow = FALSE)
nice_palette(groups, rainbow = FALSE)
groups |
Vector, each unique value will get a color |
rainbow |
If TRUE, rainbow-like colors, else differentiate successive values |
Vector of colors
Normalized inner product with transposed input matrix
norm_inprod(m)
norm_inprod(m)
m |
Numeric matrix |
Numeric matrix
For each largest distance differences on the OPTICS profile, consecutive observations left and right on the OPTICS profile (i.e. lower and higher OPTICS id) will be assigned to 2 different clusters if their distance is below the distance of the edge point. If above, observations are NA. The pts parameter defines a minimum number of observations to form a valley (i.e. cluster). If the number of observations in one valley is smaller than pts, observations are set to NA.
opticskxi( optics_obj, n_xi, pts = optics_obj$minPts, max_loop = 50, verbose = FALSE )
opticskxi( optics_obj, n_xi, pts = optics_obj$minPts, max_loop = 50, verbose = FALSE )
optics_obj |
Data frame returned by optics |
n_xi |
Number of clusters to define |
pts |
Minimum number of points per clusters |
max_loop |
Maximum iterations to find n_xi clusters |
verbose |
Print the ids of the largest difference considered and cluster information if they define one |
Vector of clusters
opticskxi_pipeline, ggplot_optics
data('multishapes') optics_shapes <- dbscan::optics(multishapes[1:2]) kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30) ggplot_optics(optics_shapes, groups = kxi_shapes) ggpairs(cbind(multishapes[1:2], kXi = kxi_shapes), group = 'kXi')
data('multishapes') optics_shapes <- dbscan::optics(multishapes[1:2]) kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30) ggplot_optics(optics_shapes, groups = kxi_shapes) ggpairs(cbind(multishapes[1:2], kXi = kxi_shapes), group = 'kXi')
Computes OPTICS k-Xi models based on a parameter grid, binds results in a data frame, and computes distance based metrics for each model.
opticskxi_pipeline( m_data, df_params = expand.grid(n_xi = 1:10, pts = c(20, 30, 40), dist = c("euclidean", "abscorrelation"), dim_red = c("identity", "PCA", "ICA"), n_dimred_comp = c(5, 10, 20)), metrics_dist = c("euclidean", "cosine"), max_size_ratio = 1, n_min_clusters = 0, n_cores = 1 )
opticskxi_pipeline( m_data, df_params = expand.grid(n_xi = 1:10, pts = c(20, 30, 40), dist = c("euclidean", "abscorrelation"), dim_red = c("identity", "PCA", "ICA"), n_dimred_comp = c(5, 10, 20)), metrics_dist = c("euclidean", "cosine"), max_size_ratio = 1, n_min_clusters = 0, n_cores = 1 )
m_data |
Data matrix |
df_params |
Parameter grid for the OPTICS k-Xi function call and optional dimension reduction. Required columns: n_xi, pts, dist. Optonal columns: dim_red, n_dim_red. |
metrics_dist |
Distance used for metrics, either euclidean or cosine. |
max_size_ratio |
Maximum size ratio of clusters. E.g. for 0.8, if a cluster is larger than 80% of points it will be removed. |
n_min_clusters |
Minimum number of clusters. Ignored if 0. |
n_cores |
Number of cores |
Input parameter data frame with with results binded in columns optics, clusters and metrics.
get_best_kxi, ggplot_kxi_metrics, gtable_kxi_profiles
data('hla') m_hla <- hla[-c(1:2)] %>% scale df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30), dist = c('manhattan', 'euclidean')) df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla) ggplot_kxi_metrics(df_kxi_hla, n = 8) gtable_kxi_profiles(df_kxi_hla) %>% plot best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2) clusters_hla <- best_kxi_hla$clusters fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>% ggpairs('Clusters', ellipses = TRUE, variables = TRUE)
data('hla') m_hla <- hla[-c(1:2)] %>% scale df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30), dist = c('manhattan', 'euclidean')) df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla) ggplot_kxi_metrics(df_kxi_hla, n = 8) gtable_kxi_profiles(df_kxi_hla) %>% plot best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2) clusters_hla <- best_kxi_hla$clusters fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>% ggpairs('Clusters', ellipses = TRUE, variables = TRUE)
Print knitr::kable latex table with legend at bottom.
print_vignette_table(table_obj, label)
print_vignette_table(table_obj, label)
table_obj |
Table object |
label |
Latex label |
caption |
Table caption |
None, side-effect prints a Latex table
Bind contingency table and Pearson Chi-squared residuals.
residuals_table(...)
residuals_table(...)
... |
Passed to contingency_table and chisq.test |
Matrix
Get mean of standard deviations of matrix columns
stddev_mean(m)
stddev_mean(m)
m |
Numeric matrix |
Mean of standard deviations of matrix columns