The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at
http://knime.com.
Term vector
This node creates a term vector for each term to
represent the terms in the document space. The values of
the feature vectors can be specified as boolean values or as
values of a specified column i.e. an tf*idf column. The dimension
of the vectors will be equal to the number of distinct documents
in the BoW.
Dialog Options
- Document column
-
The column containing the documents to use.
- Ignore tags
-
If checked tags are ignored when comparing terms.
- Bitvector
-
If checked a bitvector will be created indicating whether a
certain document contains a term or not.
- Vector value
-
If Bitvector setting is not checked it is possible to specify the
column to use as feature vector values. The column can i.e. contain
tf*idf values which are than used as values of the feature vector.
Be aware that you have to compute these values before using this
node. To do so i.e. the frequency calculation nodes can be used.
- As collection cell
-
If checked all vector entries will be stored in a collection cell
consisting of double cells. The cells are ordered, the ordering
is specified in the data table spec. If not checked all double
cells will be stored in corresponding columns. The advantage
of the column representation is that most of the regular algorithms
in KNIME can be applied. The disadvantage is (which is on the
other hand the advantage of the collection representation) that
processing of subsequent nodes will be slowed down, due to the many
columns that will be created (dependent on the input data of
course).
Ports
Input Ports
0 |
The input table
containing the bag of words. |
Output Ports
0 |
An output table
containing the terms with the related term vectors. |
This node is contained in KNIME Textprocessing Plug-in
provided by KNIME GmbH, Konstanz, Germany.