tokenizers - Fast, Consistent Tokenization of Natural Language Text
Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
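As a quick illustration of that consistent interface, here is a minimal sketch using the package's tokenize_*() and counting helpers: each tokenizer takes a character vector and returns a list with one element per input document.

```r
library(tokenizers)

doc <- "The quick brown fox jumps over the lazy dog. It was not amused."

# Each tokenizer accepts a character vector and returns a list of
# character vectors, one element per input document.
tokenize_words(doc)              # word tokens, lowercased by default
tokenize_sentences(doc)          # sentence tokens
tokenize_ngrams(doc, n = 2)      # shingled bigrams
tokenize_characters(doc)         # character tokens

# Helpers for counting and chunking.
count_words(doc)                 # number of words per document
chunk_text(doc, chunk_size = 10) # split into chunks of about 10 words
```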
Last updated 8 months ago
nlp, peer-reviewed, text-mining, tokenizer
13.29 score 184 stars 79 packages 1.1k scripts 36k downloads

opticskxi - OPTICS K-Xi Density-Based Clustering
Provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework for comparing k-Xi models using distance-based metrics to investigate datasets with an unknown number of clusters.
Last updated 5 months ago
4.30 score 2 stars 1 script 150 downloads