Package 'opticskxi' reference manual

Title:	OPTICS K-Xi Density-Based Clustering
Description:	Density-based clustering methods are well adapted to the clustering of high-dimensional data and enable the discovery of core groups of various shapes despite large amounts of noise. This package provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework to compare k-Xi models using distance-based metrics to investigate datasets with unknown number of clusters. The vignette first introduces density-based algorithms with simulated datasets, then presents and evaluates the k-Xi cluster extraction method. Finally, the models comparison framework is described and experimented on 2 genetic datasets to identify groups and their discriminating features. The k-Xi algorithm is a novel OPTICS cluster extraction method that specifies directly the number of clusters and does not require fine-tuning of the steepness parameter as the OPTICS Xi method. Combined with a framework that compares models with varying parameters, the OPTICS k-Xi method can identify groups in noisy datasets with unknown number of clusters. Results on summarized genetic data of 1,200 patients are in Charlon T. (2019) <doi:10.13097/archive-ouverte/unige:161795>. A short video tutorial can be found at <https://www.youtube.com/watch?v=P2XAjqI5Lc4/>.
Authors:	Thomas Charlon [aut, cre]
Maintainer:	Thomas Charlon <[email protected]>
License:	GPL-3
Version:	1.2.1
Built:	2025-03-22 10:25:14 UTC
Source:	https://gitlab.com/thomaschln/opticskxi

Pipe

Description

Pipe an object forward into a function or call expression. Magrittr imported function, see details and examples in the magrittr package.

Arguments

`lhs`	A value or the magrittr placeholder.
`rhs`	A function call using the magrittr semantics.

Value

Result of rhs applied to lhs, see details in magrittr package.

Exposition pipe

Description

Expose the names in 'lhs' to the 'rhs' expression. Magrittr imported function, see details and examples in the magrittr package.

Arguments

`lhs`	A list, environment, or a data.frame.
`rhs`	An expression where the names in lhs is available.

Value

Result of rhs applied to one or several names of lhs.

Cosine similarity between vectors and/or matrices.

Description

Inputs will be L2 normalized, then matrix multiplied (y is transposed). If second input is missing, first input will be recycled, which enables to efficiently compute cosine similarities between the rows of a rectangular matrix.

Usage

cosine_simi(x, y)
cosine_simi(x, y)

Arguments

`x`	Numeric vector or matrix
`y`	Numeric vector or matrix. If missing, copied from parameter x.

Value

Symmetric numeric similarity matrix

Crohn's disease data

Description

The data set consist of 103 common (>5% minor allele frequency) SNPs genotyped in 129 trios from an European-derived population. These SNPs are in a 500-kb region on human chromosome 5q31 implicated as containing a genetic risk factor for Crohn disease.

Imported from the gap R package.

An example use of the data is with the following paper, Kelly M. Burkett, Celia M. T. Greenwood, BradMcNeney, Jinko Graham. Gene genealogies for genetic association mapping, with application to Crohn's disease. Fron Genet 2013, 4(260) doi: 10.3389/fgene.2013.00260

Usage

data(crohn)data(crohn)

Format

A data frame containing 387 rows and 212 columns

Source

MJ Daly, JD Rioux, SF Schaffner, TJ Hudson, ES Lander (2001) High-resolution haplotype structure in the human genome Nature Genetics 29:229-232

dist_matrix

Description

Dispatch of amap::Dist, cosine_dist, and norm_inprod methods.

Usage

dist_matrix(data, method = "euclidean", n_cores = 1)
dist_matrix(data, method = "euclidean", n_cores = 1)

Arguments

`data`	Rectangular numeric matrix [Observations, Features]
`method`	Methods accepted by amap::Dist or cosine and norm_inprod
`n_cores`	Number of cores

Value

Distance symmetric matrix

Compute ensemble metrics

Description

Use models' rankings over several metrics to select best model. Several approaches can be taken to sum the models' rankings, and instead of summing the ranks of all models over all metrics, we prefer to rank only the top models for each metrics, and set 0 to all other. This behavior is controlled by the n_top parameter. In a second step, we sum the ranks and return only the top models, and this is controlled by the n_models parameter. The output is a list of the rankings matrix, for quality control purposes, and the selected models' parameters data frame, which is used by the ensemble_models function.

Usage

ensemble_metrics(
  n_top = 0,
  df_params,
  metrics = NULL,
  metrics_exclude = NULL,
  n_models = 10
)
ensemble_metrics(
  n_top = 0,
  df_params,
  metrics = NULL,
  metrics_exclude = NULL,
  n_models = 10
)

Arguments

`n_top`	Threshold of number of models to rank
`df_params`	Output of opticskxi_pipeline
`metrics`	Names of metrics to use. Any of those computed by opticskxi_pipeline, e.g. 'sindex', 'ch', 'dunn', 'dunn2', 'widestgap', 'entropy' etc. NULL for all (8).
`metrics_exclude`	Names of metrics to exclude. Typically used with metrics = NULL. E.g. 'entropy'.
`n_models`	Number of best models to return

Value

List of metrics' rankings matrix and best models' parameters data frame.

Select models based on ensemble metrics

Description

Typically we will call ensemble_metrics with varying numbers of ranks to consider and this function will sum up the ranks from those calls.

Usage

ensemble_metrics_bootstrap(l_ensemble_metrics, n_models = 4)
ensemble_metrics_bootstrap(l_ensemble_metrics, n_models = 4)

Arguments

`l_ensemble_metrics`	Output of function ensemble_metrics
`n_models`	Number of best models to return

Value

List of parameters of best models

Select best models based on ensemble metrics

Description

Call ensemble_metrics with varying numbers of rank thresholds to consider and sum up the ranks from those calls.

Usage

ensemble_models(
  df_kxi,
  n_models = 4,
  metrics = NULL,
  metrics_exclude = NULL,
  model_subsample = c(0.1, 0.2, 0.5),
  n_models_subsample = 10
)
ensemble_models(
  df_kxi,
  n_models = 4,
  metrics = NULL,
  metrics_exclude = NULL,
  model_subsample = c(0.1, 0.2, 0.5),
  n_models_subsample = 10
)

Arguments

`df_kxi`	Output of opticskxi_pipeline function. Dataframe with models' parameters and OPTICS k-Xi results
`n_models`	Number of best models to return
`metrics`	Names of metrics to use. Any of those computed by opticskxi_pipeline, e.g. 'sindex', 'ch', 'dunn', 'dunn2', 'widestgap', 'entropy' etc. NULL for all (8).
`metrics_exclude`	Names of metrics to exclude. Typically used with metrics = NULL. E.g. 'entropy'.
`model_subsample`	Ratios of best models to consider.
`n_models_subsample`	Number of best models when subsampling.

Value

Input object df_kxi subsetted to best models according to ensemble metrics.

Examples

data('m_psych_embeds')
m_psych_embeds = m_psych_embeds[1:200, 1:20]

df_params = expand.grid(n_xi = 4:5, pts = c(5, 10), dist = 'cosine',
                        dim_red = 'ICA', n_dimred_comp = 5)

df_kxi = opticskxi_pipeline(m_psych_embeds, df_params,
                            metrics_dist = 'cosine',
                            n_min_clusters = 2, n_cores = 1,
                            metrics = c('avg.silwidth', 'dunn'))

df_kxi = ensemble_models(df_kxi, n_models = 4,
                         model_subsample = c(0.4, 0.6),
                         n_models_subsample = 4)

data('m_psych_embeds')
m_psych_embeds = m_psych_embeds[1:200, 1:20]

df_params = expand.grid(n_xi = 4:5, pts = c(5, 10), dist = 'cosine',
                        dim_red = 'ICA', n_dimred_comp = 5)

df_kxi = opticskxi_pipeline(m_psych_embeds, df_params,
                            metrics_dist = 'cosine',
                            n_min_clusters = 2, n_cores = 1,
                            metrics = c('avg.silwidth', 'dunn'))

df_kxi = ensemble_models(df_kxi, n_models = 4,
                         model_subsample = c(0.4, 0.6),
                         n_models_subsample = 4)

Fortify a dimension reduction object

Description

Fortify a dimension reduction object

Usage

fortify_dimred(
  m_dimred,
  m_vars = NULL,
  v_variance = NULL,
  sup_vars = NULL,
  var_digits = 1
)
fortify_dimred(
  m_dimred,
  m_vars = NULL,
  v_variance = NULL,
  sup_vars = NULL,
  var_digits = 1
)

Arguments

`m_dimred`	Projection matrix
`m_vars`	Rotation matrix (optional)
`v_variance`	Explained variance (optional)
`sup_vars`	Optional supplementary variables
`var_digits`	Explained variance percent digits

Value

Data frame

Examples

pca <- prcomp(iris[-5])
df_pca <- fortify_dimred(pca$x)
pca <- prcomp(iris[-5])
df_pca <- fortify_dimred(pca$x)

Get and fortify ICA

Description

Get and fortify ICA

Usage

fortify_ica(m_data, ..., sup_vars = NULL)
fortify_ica(m_data, ..., sup_vars = NULL)

Arguments

`m_data`	Input matrix
`...`	Passed to fastICA::fastICA
`sup_vars`	Optional supplementary variables

Value

Fortified dimension reduction

Examples

df_ica <- fortify_ica(iris[-5], n.comp = 2)
df_ica <- fortify_ica(iris[-5], n.comp = 2)

Get and fortify PCA

Description

Get and fortify PCA

Usage

fortify_pca(m_data, ..., sup_vars = NULL)
fortify_pca(m_data, ..., sup_vars = NULL)

Arguments

`m_data`	Input matrix
`...`	Passed to stats::prcomp
`sup_vars`	Optional supplementary variables

Value

Fortified dimension reduction

Examples

df_pca <- fortify_pca(iris[-5])
df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])
df_pca <- fortify_pca(iris[-5])
df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])

Get best k-Xi model

Description

Select k-Xi clustering model based on a metric and a rank

Usage

get_best_kxi(df_kxi, metric = "avg.silwidth", rank = 1)
get_best_kxi(df_kxi, metric = "avg.silwidth", rank = 1)

Arguments

`df_kxi`	Data frame returned by opticsxi_pipeline
`metric`	Metric to choose best model
`rank`	Rank(s) of model to choose, ordered by decreasing metric

Value

df_kxi row with specified metric and rank, simplified to a list if only one rank selected

Plot multiple axes of a data frame or a fortified dimension reduction.

Description

Plot multiple axes of a data frame or a fortified dimension reduction.

Usage

ggpairs(
  df_data,
  group = NULL,
  axes = 1:2,
  variables = FALSE,
  n_vars = 0,
  ellipses = FALSE,
  ...,
  title = NULL,
  colors = if (!is.null(group)) nice_palette(df_data[[group]])
)
ggpairs(
  df_data,
  group = NULL,
  axes = 1:2,
  variables = FALSE,
  n_vars = 0,
  ellipses = FALSE,
  ...,
  title = NULL,
  colors = if (!is.null(group)) nice_palette(df_data[[group]])
)

Arguments

`df_data`	Data frame
`group`	Column name of the grouping of observations
`axes`	Axes to plot. If more than 2, plots all pair combinations
`variables`	Logical, plot variable contributions of the dimension reduction to the selected axes, only for 2 axes
`n_vars`	Maximum number of variable contributions to plot. By default 0, for all variables.
`ellipses`	Logical, plot ellipses of groups
`...`	Passed to ggplot2 stat_ellipse if ellipses are requested
`title`	String to add as title, default NULL
`colors`	Vector of colors for each group

Value

ggmatrix

Examples

df_pca <- fortify_pca(iris[-5])
ggpairs(df_pca)
df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])
ggpairs(df_pca, group = 'Species', ellipses = TRUE, variables = TRUE)
df_pca <- fortify_pca(iris[-5])
ggpairs(df_pca)
df_pca <- fortify_pca(iris[-5], sup_vars = iris[5])
ggpairs(df_pca, group = 'Species', ellipses = TRUE, variables = TRUE)

Ggplot OPTICS k-Xi metrics

Description

Plot metrics of a kxi_pipeline output

Usage

ggplot_kxi_metrics(df_kxi, metric = c("avg.silwidth", "bw.ratio"), n = 8)
ggplot_kxi_metrics(df_kxi, metric = c("avg.silwidth", "bw.ratio"), n = 8)

Arguments

`df_kxi`	Data frame returned by opticskxi_pipeline
`metric`	Vector of metrics to display from the df_kxi object
`n`	Number of best models for the first metric to display

Value

ggplot

Ggplot optics

Description

Plot OPTICS reachability plot.

Usage

ggplot_optics(
  optics_obj,
  groups = NULL,
  colors = if (!is.null(groups)) nice_palette(groups),
  segment_size = 300/nrow(df_optics)
)
ggplot_optics(
  optics_obj,
  groups = NULL,
  colors = if (!is.null(groups)) nice_palette(groups),
  segment_size = 300/nrow(df_optics)
)

Arguments

`optics_obj`	dbscan::optics object
`groups`	Optional vector defining groups of OPTICS observations
`colors`	If groups specified, vector of colors for each group
`segment_size`	Size for geom_segment

Value

ggplot

Examples

data('multishapes')
optics_obj <- dbscan::optics(multishapes[1:2])
ggplot_optics(optics_obj)
ggplot_optics(optics_obj,
  groups = opticskxi(optics_obj, n_xi = 5, pts = 30))
data('multishapes')
optics_obj <- dbscan::optics(multishapes[1:2])
ggplot_optics(optics_obj)
ggplot_optics(optics_obj,
  groups = opticskxi(optics_obj, n_xi = 5, pts = 30))

Gtable OPTICS k-Xi distance profiles

Description

Plot OPTICS distance profiles of k-Xi clustering models

Usage

gtable_kxi_profiles(df_kxi, metric = "avg.silwidth", rank = 1:4, ...)
gtable_kxi_profiles(df_kxi, metric = "avg.silwidth", rank = 1:4, ...)

Arguments

`df_kxi`	Data frame returned by opticskxi_pipeline
`metric`	Metric to choose best clustering model
`rank`	Ranks of models to plot, ordered by decreasing model metric
`...`	Passed to ggplot_kxi_profile

The HLA data

Description

This data set contains HLA markers DRB, DQA, DQB and phenotypes of 271 Schizophrenia patients (y=1) and controls (y=0). Genotypes for 3 HLA loci have prefixes name (e.g., "DQB") and a suffix for each of two alleles (".a1" and ".a2").

Imported from the gap package.

Usage

data(hla)data(hla)

Format

A data frame containing 271 rows and 8 columns

Source

Dr Padraig Wright of Pfizer

A dataset containing the embeddings matrix of psychological related words

Description

Data containing Glove embeddings of psychological related words, useful for demonstrating the use of ensemble metrics.

Usage

data("m_psych_embeds")
data("m_psych_embeds")

Format

A matrix with 831 words in rows and 100 embedding dimensions in columns.

Details

The dataset contains groups of related words among other irrelevant words.

A dataset containing clusters of multiple shapes

Description

Data containing clusters of any shapes. Useful for comparing density-based clustering (DBSCAN) and standard partitioning methods such as k-means clustering. Imported from the factoextra package.

Usage

data("multishapes")
data("multishapes")

Format

A data frame with 1100 observations on the following 3 variables.

x: a numeric vector containing the x coordinates of observations
y: a numeric vector containing the y coordinates of observations
shape: a numeric vector corresponding to the cluster number of each observations.

Details

The dataset contains 5 clusters and some outliers/noises.

Examples

data('multishapes')
plot(multishapes[, 1], multishapes[, 2],
    col = multishapes[, 3], pch = 19, cex = 0.8)

data('multishapes')
plot(multishapes[, 1], multishapes[, 2],
    col = multishapes[, 3], pch = 19, cex = 0.8)

Nice palette

Description

Color palette

Usage

nice_palette(groups, rainbow = FALSE)
nice_palette(groups, rainbow = FALSE)

Arguments

`groups`	Vector, each unique value will get a color
`rainbow`	If TRUE, rainbow-like colors, else differentiate successive values

Value

Vector of colors

norm_inprod

Description

Normalized inner product with transposed input matrix

Usage

norm_inprod(m)
norm_inprod(m)

Arguments

`m`	Numeric matrix

Value

Numeric matrix

Matrix normalization

Description

normalize matrix rows using given norm. Copied from text2vec

Usage

normalize(m, norm = c("l1", "l2", "none"))
normalize(m, norm = c("l1", "l2", "none"))

Arguments

`m`	`matrix` (sparse or dense).
`norm`	`character` the method used to normalize term vectors

Value

normalized matrix

OPTICS k-Xi clustering algorithm

Description

For each largest distance differences on the OPTICS profile, consecutive observations left and right on the OPTICS profile (i.e. lower and higher OPTICS id) will be assigned to 2 different clusters if their distance is below the distance of the edge point. If above, observations are NA. The pts parameter defines a minimum number of observations to form a valley (i.e. cluster). If the number of observations in one valley is smaller than pts, observations are set to NA.

Usage

opticskxi(
  optics_obj,
  n_xi,
  pts = optics_obj$minPts,
  max_loop = 50,
  verbose = FALSE
)
opticskxi(
  optics_obj,
  n_xi,
  pts = optics_obj$minPts,
  max_loop = 50,
  verbose = FALSE
)

Arguments

`optics_obj`	Data frame returned by optics
`n_xi`	Number of clusters to define
`pts`	Minimum number of points per clusters
`max_loop`	Maximum iterations to find n_xi clusters
`verbose`	Print the ids of the largest difference considered and cluster information if they define one

Value

Vector of clusters

Examples

data('multishapes')
optics_shapes <- dbscan::optics(multishapes[1:2])
kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30)
ggplot_optics(optics_shapes, groups = kxi_shapes)
ggpairs(cbind(multishapes[1:2], kXi = kxi_shapes), group = 'kXi')
data('multishapes')
optics_shapes <- dbscan::optics(multishapes[1:2])
kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30)
ggplot_optics(optics_shapes, groups = kxi_shapes)
ggpairs(cbind(multishapes[1:2], kXi = kxi_shapes), group = 'kXi')

OPTICS k-Xi models comparison pipeline

Description

Computes OPTICS k-Xi models based on a parameter grid, binds results in a data frame, and computes distance based metrics for each model.

Usage

opticskxi_pipeline(
  m_data,
  df_params = expand.grid(n_xi = 1:10, pts = c(20, 30, 40), dist = c("euclidean",
    "abscorrelation"), dim_red = c("identity", "PCA", "ICA"), n_dimred_comp = c(5, 10,
    20)),
  metrics_dist = c("euclidean", "cosine"),
  max_size_ratio = 1,
  n_min_clusters = 0,
  n_cores = 1,
  ...
)
opticskxi_pipeline(
  m_data,
  df_params = expand.grid(n_xi = 1:10, pts = c(20, 30, 40), dist = c("euclidean",
    "abscorrelation"), dim_red = c("identity", "PCA", "ICA"), n_dimred_comp = c(5, 10,
    20)),
  metrics_dist = c("euclidean", "cosine"),
  max_size_ratio = 1,
  n_min_clusters = 0,
  n_cores = 1,
  ...
)

Arguments

`m_data`	Data matrix
`df_params`	Parameter grid for the OPTICS k-Xi function call and optional dimension reduction. Required columns: n_xi, pts, dist. Optonal columns: dim_red, n_dim_red.
`metrics_dist`	Distance used for metrics, either euclidean or cosine.
`max_size_ratio`	Maximum size ratio of clusters. E.g. for 0.8, if a cluster is larger than 80% of points it will be removed.
`n_min_clusters`	Minimum number of clusters. Ignored if 0.
`n_cores`	Number of cores
`...`	Passed to get_kxi_metrics

Value

Input parameter data frame with with results binded in columns optics, clusters and metrics.

Examples


data('hla')
m_hla <- hla[-c(1:2)] %>% scale

df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30),
  dist = c('manhattan', 'euclidean'))

df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla)

ggplot_kxi_metrics(df_kxi_hla, n = 8)
gtable_kxi_profiles(df_kxi_hla) %>% plot

best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2)
clusters_hla <- best_kxi_hla$clusters

fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>%
  ggpairs('Clusters', ellipses = TRUE, variables = TRUE)

data('hla')
m_hla <- hla[-c(1:2)] %>% scale

df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30),
  dist = c('manhattan', 'euclidean'))

df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla)

ggplot_kxi_metrics(df_kxi_hla, n = 8)
gtable_kxi_profiles(df_kxi_hla) %>% plot

best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2)
clusters_hla <- best_kxi_hla$clusters

fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>%
  ggpairs('Clusters', ellipses = TRUE, variables = TRUE)

Print vignette table

Description

Print knitr::kable latex table with legend at bottom.

Usage

print_vignette_table(table_obj, label)
print_vignette_table(table_obj, label)

Arguments

`table_obj`	Table object
`label`	Latex label

Value

None, side-effect prints a Latex table

Residuals table

Description

Bind contingency table and Pearson Chi-squared residuals.

Usage

residuals_table(...)
residuals_table(...)

Arguments

...

Passed to contingency_table and chisq.test

Value

Matrix

stddev_mean

Description

Get mean of standard deviations of matrix columns

Usage

stddev_mean(m)
stddev_mean(m)

Arguments

`m`	Numeric matrix

Value

Mean of standard deviations of matrix columns

`lhs`	An object which serves both as the initial value and as target.
`rhs`	a function call using the magrittr semantics.

Package 'opticskxi'

Help Index

Assignment pipe

Description

Arguments

Value

Pipe

Description

Arguments

Value

Exposition pipe

Description

Arguments

Value

Contingency table

Description

Usage

Arguments

Value

Cosine similarity between vectors and/or matrices.

Description

Usage

Arguments

Value

Crohn's disease data

Description

Usage

Format

Source

dist_matrix

Description

Usage

Arguments

Value

Compute ensemble metrics

Description

Usage

Arguments

Value

Select models based on ensemble metrics

Description

Usage

Arguments

Value

Select best models based on ensemble metrics

Description

Usage

Arguments

Value

Examples

Fortify a dimension reduction object

Description

Usage

Arguments

Value

See Also

Examples

Get and fortify ICA

Description

Usage

Arguments

Value

See Also

Examples

Get and fortify PCA

Description

Usage

Arguments

Value

See Also

Examples

Get best k-Xi model

Description

Usage

Arguments

Value

See Also

Plot multiple axes of a data frame or a fortified dimension reduction.

Description

Usage