The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at
http://knime.com.
Decision Tree Learner
This node induces a classification decision tree in main memory.
The target attribute must be nominal. The other attributes used for
decision making can be either nominal or numerical. Numeric splits
are always binary (two outcomes), dividing the domain into two partitions at a
given split point. Nominal splits can be either binary (two outcomes) or
they can have as many outcomes as nominal values. In the
case of a binary split the nominal values are divided into two subsets.
The algorithm provides two quality measures for split calculation:
the Gini index and the gain ratio. Furthermore, a post-pruning
method based on the minimum description length (MDL) principle is
available to reduce the tree size and increase prediction accuracy.
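As an illustration of the two quality measures, the following sketch scores a candidate binary split with both the Gini index and the gain ratio. This is a minimal, hypothetical example; the function names and the exact scoring details are not taken from the node's implementation:

```python
from collections import Counter
from math import log2

def gini(labels):
    # Gini index: 1 - sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_scores(left, right):
    """Score a binary split into the label lists `left` and `right`."""
    n = len(left) + len(right)
    w_l, w_r = len(left) / n, len(right) / n
    # Weighted Gini of the split (lower is better)
    gini_split = w_l * gini(left) + w_r * gini(right)
    # Gain ratio: information gain divided by the split's own entropy
    parent = left + right
    gain = entropy(parent) - (w_l * entropy(left) + w_r * entropy(right))
    split_info = -(w_l * log2(w_l) + w_r * log2(w_r))
    gain_ratio = gain / split_info if split_info > 0 else 0.0
    return gini_split, gain_ratio
```

For a perfectly separating split such as `split_scores(["a", "a"], ["b", "b"])`, the weighted Gini is 0.0 and the gain ratio is 1.0.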
The algorithm can run in multiple threads and thus exploit multiple
processors or cores.
Most of the techniques used in this decision tree implementation
can be found in "C4.5: Programs for Machine Learning" by J.R. Quinlan
and in "SPRINT: A Scalable Parallel Classifier for Data Mining" by
J. Shafer, R. Agrawal, and M. Mehta (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.104.152&rep=rep1&type=pdf).
If the optional PMML inport is connected and contains
preprocessing operations in the TransformationDictionary, those are
added to the learned model.
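The distinction between the averaged split point and the C4.5-style split point (configurable via the "Average split point" option below) can be sketched as follows. The function name is hypothetical and serves only to illustrate the two conventions:

```python
def candidate_split_points(values, average=True):
    """Candidate split points for a numeric attribute.

    average=True:  midpoint of the two values separating the partitions
    average=False: largest value of the lower partition (C4.5 style)
    Assumes at least two distinct values.
    """
    distinct = sorted(set(values))
    if average:
        return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    return distinct[:-1]
```

For the values [1, 3, 5] this yields the split points [2.0, 4.0] with averaging and [1, 3] without.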
Dialog Options
- Class column
- To select the target attribute. Only nominal attributes are allowed.
- Quality measure
-
To select the quality measure according to which the
split is calculated. Available are the "Gini Index" and the "Gain Ratio".
- Pruning method
-
Pruning reduces tree size and avoids overfitting, which increases
the generalization performance and thus the prediction quality
(for predictions, use the "Decision Tree Predictor" node).
Available is "Minimal Description Length" (MDL) pruning;
alternatively, pruning can be switched off.
- Min number records per node
-
To select the minimum number of records required in each node. If
the number of records is smaller than or equal to this number, the tree
is not grown any further. This corresponds to a stopping criterion (pre-pruning).
- Number records to store for view
-
To select the number of records stored in the tree for the view.
The records are necessary to enable highlighting.
- Average split point
-
If checked, the split value for numeric attributes is determined as the mean
of the two attribute values that separate the two partitions.
If unchecked, the split value is set to the largest
value of the lower partition (as in C4.5).
- Number threads
-
This node can exploit multiple threads and thus multiple processors
or cores, which can improve performance. The default value is
the number of processors or cores available to KNIME. If
set to 1, the algorithm runs sequentially.
- Skip nominal columns without domain information
-
If checked, nominal columns containing no domain value information are
skipped. This is generally the case for nominal columns that have
too many different values.
- Binary nominal splits
-
If checked, nominal attributes are split in a binary fashion. Binary
splits are more difficult to calculate but can also result in more
accurate trees. The nominal values are divided into two subsets (one
for each child). If unchecked, one child is created for each
nominal value.
- Max #nominal
-
The subsets for binary nominal splits are expensive to compute:
to find the best subsets for n nominal values, 2^n combinations
must be evaluated. With many different nominal values
this can become prohibitively expensive. Therefore, a maximum number of
nominal values can be defined up to which all possible subsets are calculated.
Above this threshold, a heuristic is applied that first moves the
best nominal value into the second partition, then the second best value,
and so on, until no further improvement can be achieved.
- Filter invalid attribute values in child nodes
-
Binary splits on nominal values may lead to tests for attribute
values that have been filtered out by a parent tree node.
This is because the learning algorithm consistently
uses the table's domain information, instead of the data in a tree
node, to define the split sets. These duplicate checks are harmless
(the tree is the same and will classify unknown data in exactly the
same way), but they are confusing when the tree is inspected
in the tree viewer. Enabling this option post-processes the tree
and filters out such invalid checks.
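To make the exhaustive search and the greedy heuristic for binary nominal splits (the "Max #nominal" option above) more concrete, here is a simplified sketch. It scores candidate subsets with the Gini index; all names are invented, and the node's actual implementation differs in detail:

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    # Gini index: 1 - sum of squared class probabilities
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, subset):
    """Weighted Gini of splitting (value, label) rows by membership in subset."""
    left = [lab for val, lab in rows if val in subset]
    right = [lab for val, lab in rows if val not in subset]
    if not left or not right:
        return float("inf")  # not a real binary split
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_binary_split(rows, max_exhaustive=4):
    """Find a nominal-value subset for a binary split (assumes >= 2 values)."""
    values = sorted({val for val, _ in rows})
    if len(values) <= max_exhaustive:
        # Exhaustive search: evaluate every non-trivial subset
        candidates = [frozenset(c)
                      for r in range(1, len(values))
                      for c in combinations(values, r)]
        return min(candidates, key=lambda s: split_gini(rows, s))
    # Greedy heuristic: move the best value into the second partition,
    # then the second best, and so on, until no improvement
    subset, score = frozenset(), float("inf")
    while True:
        moves = [subset | {v} for v in values if v not in subset]
        best = min(moves, key=lambda s: split_gini(rows, s))
        if split_gini(rows, best) >= score:
            return subset
        subset, score = best, split_gini(rows, best)
```

With at most `max_exhaustive` distinct values, all subsets are tried; above that threshold the greedy loop grows the second partition one value at a time, mirroring the "best value, then second best, and so on" heuristic described above.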
Ports
Input Ports
0: The pre-classified data that should be used to induce the decision tree.
At least one attribute must be nominal.
1: Optional PMML port object containing preprocessing operations.
Output Ports
0: The induced decision tree. The model can be used to classify
data with unknown target (class) attribute. To do so, connect the
model out-port to the "Decision Tree Predictor" node.
Views
- Decision Tree View
-
Visualizes the learned decision tree. The tree can be expanded and
collapsed with the plus/minus signs.
- Decision Tree View (simple)
-
Visualizes the learned decision tree. The tree can be expanded and
collapsed with the plus/minus signs. The square brackets show the
splitting criterion: the attribute on which the parent node was
split and the numeric value or set of nominal values that
led to this child. The class value in single quotes states the
majority class in this node. The value in round brackets, (x of y),
states that x records of the y total records in this node belong
to the majority class. The bar with the black border, partly filled
with yellow, represents the number of records that belong to this
node relative to its parent node.
The colored pie chart renders the distribution of the color attribute
associated with the input data table. NOTE: the colors do not necessarily
reflect the class attribute. If the color distribution and the
target attribute should correspond to each other, ensure that the
"Color Manager" node assigns colors based on the same attribute that
is selected as the target attribute in this node.
This node is contained in KNIME Base Nodes
provided by KNIME GmbH, Konstanz, Germany.