The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at
http://knime.com.
Chi-square keyword extractor
This node analyses documents and extracts relevant keywords
using cooccurrence statistics as described in
"Keyword extraction from a single document using word co-occurrence
statistical information" by Y. Matsuo and M. Ishizuka.
First, the most frequent terms (see node settings) are extracted and
then clustered together using the pointwise mutual information and
a normalized version of the L1 norm as measures of distance between
their cooccurrence probability distributions.
A term is considered a member of a cluster if it is similar to
all the terms inside it according to at least one of the similarity
measures. If more than one cluster meets this condition, the
one with the highest average score is used. If no cluster
is similar enough, a new one is created.
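The membership rule above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: `cluster_terms`, the `Measure` pair and `toy_similarity` are hypothetical names. The sketch also assumes that a single measure must accept every member of a cluster, and that the "average score" is the mean of that measure over the cluster's members.

```python
from typing import Callable, List, Sequence, Tuple

# A (similarity function, threshold) pair. Hypothetical type, for illustration only.
Measure = Tuple[Callable[[str, str], float], float]


def cluster_terms(terms: Sequence[str], measures: Sequence[Measure]) -> List[List[str]]:
    """Greedily assign each term to the best matching cluster, or open a new one."""
    clusters: List[List[str]] = []
    for term in terms:
        best_cluster, best_score = None, float("-inf")
        for cluster in clusters:
            for similarity, threshold in measures:
                scores = [similarity(term, member) for member in cluster]
                # Assumption: one measure must accept every member of the cluster.
                if all(score >= threshold for score in scores):
                    average = sum(scores) / len(scores)
                    if average > best_score:
                        best_cluster, best_score = cluster, average
        if best_cluster is None:
            clusters.append([term])  # no similar cluster: create a new one
        else:
            best_cluster.append(term)
    return clusters


def toy_similarity(a: str, b: str) -> float:
    """Toy stand-in for a real measure: terms sharing a first letter are similar."""
    return 1.0 if a[0] == b[0] else 0.0


clusters = cluster_terms(["apple", "avocado", "banana", "apricot"], [(toy_similarity, 0.5)])
```

With the toy measure, the three terms starting with "a" end up in one cluster and "banana" in its own. In the node itself the measures would be the pointwise mutual information and the normalized L1 norm, each with its own threshold.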
Once this is done, the terms are ranked
in decreasing order of the deviation between their expected cluster
cooccurrence and the actual observed cooccurrence value. The terms
with the highest deviation are returned as keywords.
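As a rough sketch of the ranking step, the deviation can be computed as a standard chi-square statistic over the clusters. The exact formula used by the node is not given here, so this is an assumption: the function names, the input layout (per-term cooccurrence counts per cluster, plus an expected probability per cluster) and the use of the summed counts as the term's total are all hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple


def chi_square_deviation(observed: Dict[str, float], expected_prob: Dict[str, float]) -> float:
    """Sum of (observed - expected)^2 / expected over all clusters."""
    total = sum(observed.values())  # assumption: total cooccurrence count of the term
    score = 0.0
    for cluster, probability in expected_prob.items():
        expected = total * probability
        if expected > 0:
            score += (observed.get(cluster, 0.0) - expected) ** 2 / expected
    return score


def rank_keywords(
    cooccurrence: Dict[str, Dict[str, float]],
    expected_prob: Dict[str, float],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k terms with the highest deviation, in decreasing order."""
    scores = {
        term: chi_square_deviation(observed, expected_prob)
        for term, observed in cooccurrence.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


expected_prob = {"c1": 0.5, "c2": 0.5}
cooccurrence = {
    "uniform": {"c1": 5, "c2": 5},  # matches the expectation, deviation 0
    "biased": {"c1": 9, "c2": 1},   # strongly prefers cluster c1
}
ranking = rank_keywords(cooccurrence, expected_prob, k=2)
```

In this toy input the term "biased" ranks first because its cooccurrence with the clusters departs most from what the expected probabilities predict.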
Setting the console's output level to DEBUG will make this node
display the set of frequent terms, the distance between them during
the clustering phase and the final clusters.
Dialog Options
- Document column
  The name of the column which contains the documents to analyse.
- Number of keywords to extract
  The number of keywords to extract per document.
- Percentage of unique terms in the document to use for the chi-square measures
  The percentage of the set of unique terms in the document to use to
  build the term clusters.
  The article this node is based on provides 30% as a rule of thumb.
- Ignore tags
  If this option is checked, the node will only compare terms based
  on their word content. In other words, tags and any other meta
  information will be ignored.
  This will not affect the output documents, only the way they are
  analysed.
- Pointwise mutual information threshold
  Terms whose pointwise mutual information score is greater than or
  equal to this value are considered similar and are clustered
  together.
  This similarity measure typically ranges from 0 to infinity but has
  been normalized to the range 0 to 1 using arctan(value)/(pi/2). It measures
  the discrepancy between the observed cooccurrence probability and the
  probability expected if both terms were completely independent.
- Normalized L1 norm threshold
  Terms whose normalized L1 norm score is greater than or
  equal to this value are considered similar and are clustered
  together.
  This similarity measure ranges from 0 to 1 inclusive. It compares
  the cooccurrence probability distributions of the two terms, that is
  P(t|first term) versus P(t|second term) for every term t in the
  document.
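The two similarity measures described in the options above might look like the following sketch. Only the arctan(value)/(pi/2) normalization comes from the description. The ratio form of the pointwise mutual information (without a logarithm) and the normalization of the L1 distance as 1 - distance/2 are assumptions made so that the values fall in the stated ranges. The function names are hypothetical.

```python
import math
from typing import Dict


def normalized_pmi(p_joint: float, p_a: float, p_b: float) -> float:
    """Map the ratio P(a, b) / (P(a) * P(b)) from [0, inf) into [0, 1)."""
    if p_a <= 0 or p_b <= 0:
        return 0.0
    ratio = p_joint / (p_a * p_b)  # assumption: ratio form, no logarithm
    return math.atan(ratio) / (math.pi / 2)


def normalized_l1_similarity(dist_a: Dict[str, float], dist_b: Dict[str, float]) -> float:
    """Compare P(t | a) and P(t | b) over every term t.

    Assumed normalization: the L1 distance between two probability
    distributions lies in [0, 2], so 1 - distance / 2 lies in [0, 1].
    """
    terms = set(dist_a) | set(dist_b)
    distance = sum(abs(dist_a.get(t, 0.0) - dist_b.get(t, 0.0)) for t in terms)
    return 1.0 - distance / 2.0


independent = normalized_pmi(p_joint=0.25, p_a=0.5, p_b=0.5)
identical = normalized_l1_similarity({"x": 0.5, "y": 0.5}, {"x": 0.5, "y": 0.5})
disjoint = normalized_l1_similarity({"x": 1.0}, {"y": 1.0})
```

Under these assumptions, two independent terms have a ratio of 1 and therefore a normalized score of 0.5, which can help when choosing a threshold. Identical distributions score 1 on the L1 measure and disjoint ones score 0.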
Ports
Input Ports
0 | The input table which contains the documents to analyse.
Output Ports
0 | The output table which contains (keyword term, deviation value, associated document) tuples.
This node is contained in the KNIME Textprocessing Plug-in
provided by KNIME GmbH, Konstanz, Germany.