A clustering method for high dimensional dataset


>>back to top


Current version: 1.0, Oct 2014
Update on Mar 2015: An R script of SNN.R is added as an alternative to SNN.m
Update on Jun 2016: Make changes in to accommodate Python 3. is now tested on Python 2.6, 2.7 and 3.5.
Update on Feb 2017: Make change in (Line 85) to accommodate the change of dict.keys() function in Python 3 (which returns a view instead of a list in Python 2).

The SNN-Cliq is consisted of two steps. First, a shared nearest neighbor (SNN) graph is constructed using the input dataset. Second, dense subgraphs (clusters) are found in the SNN graph by recursively merging quasi-cliques. The two steps are implemented in the respective program:

1) SNN.m (or SNN.R)

Using SNN.m

This Matlab program takes in a data matrix and produce a file of the SNN graph.

  • Usage: SNN(data, edge_file, k, distance);
  • Arguments:
    data A matrix representing the target dataset. Rows correspond to objects/observations and columns correspond to attributes/variables.
    edge_file The name of the output file. The output contains all edges (with weight>0) in the SNN graph. It is used as the input file for the second step.
  • Options:
    k The size of the nearest neighbor list. Default is 3.
    distance The method for calculating primary similarity matrix. The default is 'euclidean'. Other values include 'correlation', 'cosine', and etc. Refer to the argument 'Distance' of the Matlab function knnsearch() for more options.

  • Example of usage
    In Matlab, import the target dataset into a matrix, e.g. data=importdata('path/to/data_file.txt'). Process the data if necessary, for example, log-transformation or feature selection.
  • Example of the original data file
    In the context of clustering cells according to gene expression levels, the rows should correspond to cells and columns should correspond to gene levels.

Using SNN.R

This R program serves the same purpose as SNN.m.

  • Usage: SNN(data, edge_file, k, distance);
  • Arguments:
    The arguments are the same as the Matlab function SNN().
    distance: default is 'euclidean'. Other options include "maximum", "manhattan", "canberra" or "binary". Please refer to the dist() function in R.
  • Example of usage:
    Open R and import the data, e.g. data <- read.table(infile, header=TRUE, sep="\t", row.names=1). Process the data as needed, e.g. data<-log2(data+1). Then call the function by source("path/to/SNN.R"); SNN(data, 'edge_file.txt', k=3, distance="euclidean").


This Python program takes in the output file from SNN.m and produces a file of the clustering result.

  • Usage: ./ -i <edge_file> -o <out_file> [options]
  • Arguments:
    -i/ --input <edge_file> The name of the input file. It is the same as the output file from SNN.m. The edge_file can also be created by other methods. Each line represents an edge of the graph in the format of 'node1 node2 weight_of_edge' (whitespace delimited).
    -o/--output <out_file> The name of the output file. The output is cluster labels for each node in the graph.
  • Options:
    -h/--help Prints the help message and exits.
    -r/--r-quasi-cliq Parameter for quasi-clique finding. r should be a float in the range of (0,1]. The default is 0.7. r specifies the density threshold of quasi-cliques. Changing r can fine-tune the granularity of cluster output: increasing r results in granulated clusters; reducing r leads to bigger clusters.
    -m/--merging Parameter for cluster merging. m should be a float in the range of (0,1]. The default is 0.5. m specifies the threshold on the overlapping rate for merging. Change m has the same effect as changing r.
    -n/--number Number of objects in the data.
    -h/--help Print help message.

  • Example of usage
  • Example of input <edge_file>

    A node in the edge_file is represented by the index of the corresponding data point in the original data file. The edge_file should contain edges (weight>0. An edge with 0 weight will be ignored.) for only undirected graphs, and each edge only need to appear once. For example, the edge between node 1 and 2 is either represented by '1 2 0.1' or '2 1 0.1' . If your edge_file contains more than one lines for an edge and the weights are unequal, e.g. '1 2 0.1' and '2 1 0.2', the program will not exit with error, instead it will use the one that is read in later. Note that the edge_file may not include all the data points in the original dataset, since the SNN graph may have some nodes isolated (no edges connecting to other nodes), e.g. node 6 is not in the edge_file.
  • Example of output <out_file>

    The cluster labels in out_file denote the cluster identities of the corresponding data points. Thus, the result means: node 1 2 3 are clustered into cluster 1, while node 4 5 6 7 are all singletons (not included in any clusters). The program assigns cluster identities to each node up to the maximum indexed node in the edge_file (node 7). Since node 6 is not in the graph, it is also assigned to singletons. If the number of labels in your out_file is less than the number of data points, it is probably because the last several data points are isolated nodes in the SNN graph and are not in the edge_file. You can either manually annotate these points as singletons, or require the program to do so by giving the total number of data points using the -n option when running the program.
>>back to top


step 1: SNN.m or SNN.R
step 2: (change the file extension from .txt to .py after download to use)


Please e-mail Chen Xu if you have any concerns or suggestions: xuchen2004 at gmail dot com

The Su Lab - Bioinformatics and Genomics -->>

College of Computing and Informatics -->>

>>back to top