The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at
http://knime.com.
Equal Size Sampling
Removes rows from the input data set such that the values in a
categorical column are equally distributed. This can be useful, for instance,
if a learning algorithm is sensitive to unequal class distributions and you
want to downsize the data set so that all class values occur equally
often.
The node will remove random rows belonging to the majority classes. The
rows returned by this node will contain all records from the minority
class(es) and a random sample from each of the majority classes,
whereby each sample contains as many objects as the minority class
contains.
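The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the node's actual implementation: the function name `equal_size_sample`, the `class_of` callback, and the choice to keep the original row order are all assumptions made for this example.

```python
import random
from collections import defaultdict


def equal_size_sample(rows, class_of, seed=None):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    rng = random.Random(seed)

    # Group row indices by their class value.
    by_class = defaultdict(list)
    for idx, row in enumerate(rows):
        by_class[class_of(row)].append(idx)
    if not by_class:
        return []

    # The minority class determines how many rows each class may keep.
    target = min(len(indices) for indices in by_class.values())

    # Draw a random sample of that size from each class; the minority
    # class is therefore kept in full.
    keep = set()
    for indices in by_class.values():
        keep.update(rng.sample(indices, target))

    # Assumption: emit the surviving rows in their original order.
    return [row for idx, row in enumerate(rows) if idx in keep]


rows = [("a", "yes"), ("b", "yes"), ("c", "yes"),
        ("d", "no"), ("e", "no"), ("f", "yes")]
sample = equal_size_sample(rows, class_of=lambda r: r[1], seed=42)
```

Here the class "no" has two rows and "yes" has four, so the result contains both "no" rows and two randomly chosen "yes" rows.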
Dialog Options
- Nominal Column
Select the class column here. The node will run over the data set once
to count the occurrences in this selected column and then do the filtering
in a second pass. Note that missing values in this column are treated as
a separate category (which may itself be the minority class).
- Use exact sampling
If selected, the final output will be determined up-front. Each class will have
the same number of instances in the output table. This sampling is
slightly more memory-intensive, as each class needs to be represented
by a bit set that marks the rows belonging to it. In most cases
it is safe to select this option unless you have very large data with many
different class labels.
- Use approximate sampling
If selected, the final output will be determined on the fly. The number
of occurrences of each class may differ slightly, as the final number cannot
be determined beforehand.
- Enable static seed
If selected, the removal of rows is driven by a static seed
(the result is reproducible).
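To make the difference between the options more concrete, the following Python sketch shows one plausible way approximate sampling could work: count the classes in a first pass, then keep each row with a probability proportional to how rare its class is. This is an assumption made for illustration only and is not taken from the node's source; the function name `approximate_equal_size_sample` is hypothetical. The sketch also shows missing values (here `None`) counted as their own category and a fixed seed making the result repeatable.

```python
import random
from collections import Counter


def approximate_equal_size_sample(labels, seed=None):
    """Return indices of kept rows; class sizes are only roughly equal (hypothetical helper)."""
    rng = random.Random(seed)

    # First pass: count occurrences per class. None (a missing value)
    # is counted as a category of its own.
    counts = Counter(labels)
    if not counts:
        return []
    target = min(counts.values())

    # Second pass: decide per row, on the fly. A row of the minority
    # class has keep-probability 1.0 and is always retained; rows of
    # larger classes are kept with probability target / class size.
    kept = []
    for idx, label in enumerate(labels):
        if rng.random() < target / counts[label]:
            kept.append(idx)
    return kept


labels = ["x"] * 100 + ["y"] * 10 + [None] * 5
kept = approximate_equal_size_sample(labels, seed=7)
```

In this example the five rows with a missing class value form the minority class and are all kept, while the number of retained "x" and "y" rows is close to five but not guaranteed to be exactly five. Calling the function again with the same seed returns the same indices.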
Ports
Output Ports
0 | The input data with fewer rows.
This node is contained in KNIME Base Nodes
provided by KNIME GmbH, Konstanz, Germany.