The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at http://knime.com.

Hierarchical Clustering

Hierarchically clusters the input data.
Note: This node works only on small data sets. It keeps the entire data in memory and has cubic complexity.
There are two methods to do hierarchical clustering: top-down (divisive), which starts with one all-encompassing cluster and repeatedly splits it, and bottom-up (agglomerative), which starts with every data point in its own cluster and repeatedly merges the two closest clusters.

This node works agglomeratively, i.e. bottom-up.
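The bottom-up procedure can be sketched in a few lines of plain Python (an illustration only, not the node's Java implementation; the point set and the choice of single linkage are assumptions for the demo). Note the nested loops over all cluster pairs at every merge step, which is where the cubic complexity mentioned above comes from:

```python
def agglomerate(points, k, dist):
    # start bottom-up: every point is its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest single-linkage distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge the closest pair and drop the absorbed cluster
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# toy data: two obvious groups in the plane
euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
pts = [(1.0, 1.0), (1.2, 0.9), (8.0, 8.1), (8.2, 7.9)]
print(agglomerate(pts, 2, euclid))
# [[(1.0, 1.0), (1.2, 0.9)], [(8.0, 8.1), (8.2, 7.9)]]
```

Stopping once `k` clusters remain corresponds to choosing one level of the hierarchy for the output, as the dialog option below does.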

In order to determine the distance between clusters, a measure has to be defined. Basically, three methods exist to compare two clusters: single linkage (the distance between the two closest points of the clusters), complete linkage (the distance between the two farthest points), and average linkage (the mean distance over all pairs of points).
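The three linkage rules can be written down directly from all pairwise point distances; a minimal sketch (not the node's implementation, 1-D points assumed for the demo):

```python
def single_linkage(a, b, dist):
    # distance between the two closest points of the clusters
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a, b, dist):
    # distance between the two farthest points of the clusters
    return max(dist(p, q) for p in a for q in b)

def average_linkage(a, b, dist):
    # mean distance over all pairs of points
    pairs = [dist(p, q) for p in a for q in b]
    return sum(pairs) / len(pairs)

d = lambda p, q: abs(p - q)                 # 1-D distance for the demo
print(single_linkage([0, 1], [4, 6], d))    # 3
print(complete_linkage([0, 1], [4, 6], d))  # 6
print(average_linkage([0, 1], [4, 6], d))   # 4.5
```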

In order to measure the distance between two points, a distance measure is necessary. You can choose between the Manhattan distance and the Euclidean distance, which correspond to the L1 and the L2 norm, respectively.
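As a quick sketch of the two point distances (illustration only, not the node's code):

```python
def manhattan(p, q):
    # L1 norm: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # L2 norm: square root of the sum of squared differences
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

print(manhattan((0, 0), (3, 4)))  # 7
print(euclidean((0, 0), (3, 4)))  # 5.0
```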

The output is the same data as the input with one additional column containing the name of the cluster each data point is assigned to. Since a hierarchical clustering algorithm produces a series of cluster results, the number of clusters for the output has to be defined in the dialog.

Dialog Options

Number output cluster
Which level of the hierarchy to use for the output column.
Distance function
Which distance measure to use for the distance between points.
Linkage type
Which method to use to measure the distance between two clusters (as described above).
Distance cache
Caching the distances between the data points drastically improves performance, especially for high-dimensional data sets. However, the cache requires a lot of memory, so you can switch it off for large data sets.
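The idea behind the cache can be sketched as precomputing all n·(n-1)/2 pairwise distances once, so later merge steps do cheap lookups instead of recomputing distances (a sketch of the concept, not the node's actual cache; helper names are assumptions):

```python
from itertools import combinations

def build_distance_cache(points, dist):
    # precompute every pairwise distance once: n*(n-1)/2 entries,
    # trading O(n^2) memory for far fewer distance evaluations
    return {(i, j): dist(points[i], points[j])
            for i, j in combinations(range(len(points)), 2)}

def cached_dist(cache, i, j):
    # look a pair up in canonical (smaller, larger) index order
    return cache[(i, j)] if i < j else cache[(j, i)]

pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
cache = build_distance_cache(pts, euclid)
print(cached_dist(cache, 1, 0))  # 5.0
```

The quadratic number of entries is exactly why the cache pays off in speed but becomes prohibitive in memory for large data sets.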

Ports

Input Ports
0 The data that should be clustered using hierarchical clustering. Only numeric columns are considered; nominal columns are ignored.
Output Ports
0 The input data with an additional column containing the name of the cluster each data point is assigned to.

Views

Dendrogram/Distance View
This node is contained in KNIME Base Nodes provided by KNIME GmbH, Konstanz, Germany.