The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at http://knime.com.
This node analyses documents and extracts relevant keywords
using the graph-based approach described in
"KeyGraph: Automatic Indexing by Co-occurrence Graph based on
Building Construction Metaphor" by Yukio Ohsawa.
First, a predetermined number of terms is selected based on their
frequency (high frequency set, HF) and added as the initial nodes of
the graph.
The association strength between each pair of these terms is then
calculated using the following scoring method: assoc(term1, term2) =
min(frequency of term1 in the sentence, frequency of term2 in the
sentence), summed over every sentence in the document.
The top |HF|-1 associations are inserted into the graph as edges.
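
The steps so far can be sketched in a few lines of Python. This is only an illustration, not the node's actual implementation: the function names and the sample sentences are hypothetical, each sentence is assumed to be an already tokenised list of terms, "occurrence frequency" is assumed to mean the count within a single sentence, and ties are broken alphabetically purely to keep the result deterministic.

```python
from collections import Counter
from itertools import combinations


def high_frequency_terms(sentences, k):
    """Return the k most frequent terms over all sentences (the HF set)."""
    totals = Counter(term for sentence in sentences for term in sentence)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def assoc(sentences, term1, term2):
    """Sum, over every sentence, the smaller of the two in-sentence counts."""
    total = 0
    for sentence in sentences:
        counts = Counter(sentence)
        # Assumption: frequencies are counted per sentence, not per document.
        total += min(counts[term1], counts[term2])
    return total


def top_edges(sentences, hf):
    """Keep the |HF| - 1 strongest associations between HF terms as edges."""
    scored = [(assoc(sentences, a, b), a, b)
              for a, b in combinations(sorted(hf), 2)]
    scored.sort(key=lambda edge: (-edge[0], edge[1], edge[2]))
    return [(a, b) for _, a, b in scored[:len(hf) - 1]]


# Hypothetical toy document: three tokenised sentences.
sentences = [
    ["graph", "term", "graph", "edge"],
    ["term", "edge", "cluster"],
    ["graph", "term", "term"],
]
hf = high_frequency_terms(sentences, 3)
edges = top_edges(sentences, hf)
print(hf)
print(edges)
```

With three HF terms there are three candidate pairs, and only the two strongest associations are kept as edges.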
If an edge between two terms is the only path that connects them, it
is pruned.
The graph's connected subgraphs are then extracted and considered as
"concept" clusters.
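
A minimal Python sketch of the pruning and cluster extraction follows. It is not the node's code: the helper names and the sample graph are hypothetical, and it assumes every edge is tested against the original, unpruned graph, so that all single-path edges are removed at once.

```python
def build_adjacency(nodes, edges):
    """Build an undirected adjacency map from a node list and an edge list."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def reachable(adjacency, start):
    """Return every node reachable from start (iterative depth-first search)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def prune_single_path_edges(nodes, edges):
    """Drop each edge that is the only path between its two endpoints."""
    kept = []
    for a, b in edges:
        others = [edge for edge in edges if edge != (a, b)]
        # Keep the edge only if its endpoints stay connected without it.
        if b in reachable(build_adjacency(nodes, others), a):
            kept.append((a, b))
    return kept


def clusters(nodes, edges):
    """Return the connected subgraphs as sorted lists of terms."""
    adjacency = build_adjacency(nodes, edges)
    remaining = set(nodes)
    found = []
    for node in nodes:
        if node in remaining:
            component = reachable(adjacency, node)
            remaining -= component
            found.append(sorted(component))
    return found


# Hypothetical graph: a triangle a-b-c with a tail c-d-e attached.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
kept = prune_single_path_edges(nodes, edges)
print(kept)
print(clusters(nodes, kept))
```

In this example the triangle survives because each of its edges has an alternative path, while the two tail edges are pruned and leave d and e as single-term clusters.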
A new batch of terms is added based on their key score, which is the
conditional probability that a term will be used if the author has
all the concepts (clusters) in mind: P(UNION(t|g)), where t is the
term and the union is taken over every cluster g in the set of clusters.
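
One way to read this union, sketched below, is to treat the per-cluster probabilities as independent, so that the probability of the union is one minus the product of the complements. The independence assumption, the function name, and the sample values are illustrative only; this description does not say how the node estimates the probability for an individual cluster, so those values are taken as given inputs here.

```python
import math


def key_score(cluster_probabilities):
    """Probability that a term is used for at least one cluster.

    Assumption: the per-cluster probabilities are independent, so the
    union equals 1 minus the product of (1 - p) over all clusters.
    """
    return 1.0 - math.prod(1.0 - p for p in cluster_probabilities)


# Hypothetical per-cluster probabilities for a single term.
print(key_score([0.5, 0.5]))
```

A term that is moderately tied to several clusters can therefore outrank a term that is strongly tied to only one.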
Each of these new terms is then linked to every cluster using the
strongest scoring edge amongst the possible ones.
Finally, all the terms in the graph are rated based on this formula:
score(t) = sum, over every edge connecting t to another term w, of the
sum, over every sentence, of min(freq(t), freq(w)).
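
The rating formula can be illustrated with a short Python sketch. It is not the node's implementation: the function name and sample data are hypothetical, sentences are assumed to be tokenised lists of terms, and freq is assumed to be the count of a term within one sentence.

```python
from collections import Counter


def term_scores(sentences, edges):
    """Score each term by summing, over its edges, the per-sentence minima."""
    sentence_counts = [Counter(sentence) for sentence in sentences]
    scores = Counter()
    for t, w in edges:
        # Assumption: freq(x) is the count of x inside a single sentence.
        strength = sum(min(counts[t], counts[w]) for counts in sentence_counts)
        # Each edge contributes to the score of both of its endpoints.
        scores[t] += strength
        scores[w] += strength
    return dict(scores)


# Hypothetical toy document and final graph edges.
sentences = [
    ["graph", "term", "graph", "edge"],
    ["term", "edge", "cluster"],
    ["graph", "term", "term"],
]
edges = [("edge", "term"), ("graph", "term")]
scores = term_scores(sentences, edges)
print(scores)
```

Here "term" scores highest because it sits on both edges, which matches the intent of rating terms by how strongly they are tied to the rest of the graph.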
Setting the console's output level to DEBUG will make this node
display the contents of the clusters after the pruning phase.
Input port 0 | The input table which contains the documents to analyse. |
Output port 0 | The output table which contains (keyword term, score, associated document) tuples. |