The following node is available in the Open Source KNIME predictive analytics and data mining platform version 2.7.1. Discover over 1000 other nodes, as well as enterprise functionality at http://knime.com.
This node splits the string content of a selected column into logical groups using a regular expression. A group is delimited by a pair of parentheses, and the pattern inside the parentheses is itself a regular expression. The content of each group is appended as an individual column.
A short introduction to groups and capturing is given in the Java API documentation. Some examples are given below:
Patent identifiers such as "US5443036-X21" consist of a country code of at most two letters ("US"), a patent number ("5443036") and possibly an application code ("X21"), which is separated from the number by a dash or a space character. They can be split by the expression ([A-Za-z]{1,2})([0-9]*)[ \-]*(.*$). Each of the parenthesized terms corresponds to one of the aforementioned properties.
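The grouping can be tried outside KNIME with a minimal sketch. It uses Python's re module rather than the Java regular expression engine the node is built on; for this particular pattern the two behave the same. The function name split_patent is hypothetical, and requiring the whole string to match is an assumption about the node's behaviour.

```python
import re

# Pattern copied from the example above. The three parenthesized groups are
# the country code, the patent number and the application code.
PATENT_PATTERN = re.compile(r"([A-Za-z]{1,2})([0-9]*)[ \-]*(.*$)")


def split_patent(identifier):
    """Return the three groups as a tuple, or None if there is no match.

    fullmatch() requires the entire string to match the pattern; that the
    node works this way is an assumption of this sketch.
    """
    match = PATENT_PATTERN.fullmatch(identifier)
    return match.groups() if match else None


# Each element of the tuple would become one appended column.
print(split_patent("US5443036-X21"))
```

Because the separator is matched by [ \-]* outside of any group, the dash or space never shows up in one of the resulting columns.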
The node is particularly useful for parsing the file URL of a file reader node (the URL is exposed as a flow variable and then exported to a table using a Variable to Table node). The format of such URLs is similar to "file:c:\some\directory\foo.csv". The pattern [A-Za-z]*:(.*[/\\])(([^\.]*)\.(.*$)) generates four groups (counted by the number of opening parentheses). The first group, "(.*[/\\])", identifies the directory. It consumes all characters up to and including the final slash or backslash; in the example this is "c:\some\directory\". The second group represents the file name and encloses the third and fourth groups. The third group, "([^\.]*)", consumes all characters after the directory that are not a dot '.' (which is "foo" in the above example). The pattern then expects a single dot, which is not part of the third or fourth group, and finally the fourth group "(.*$)", which reads until the end of the string and yields the file suffix ("csv"). The groups for the above example are therefore "c:\some\directory\" (group 1), "foo.csv" (group 2), "foo" (group 3) and "csv" (group 4).
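The four groups of the file URL pattern can be checked with a short sketch. As an illustration it uses Python's re module instead of the Java engine used by the node; this pattern behaves the same in both. The function name split_file_url is hypothetical, and whole-string matching is an assumption.

```python
import re

# Pattern copied from the example above. Groups are numbered by their
# opening parenthesis: 1 = directory, 2 = file name, 3 = base name, 4 = suffix.
URL_PATTERN = re.compile(r"[A-Za-z]*:(.*[/\\])(([^\.]*)\.(.*$))")


def split_file_url(url):
    """Return the four groups as a tuple, or None if there is no match."""
    match = URL_PATTERN.fullmatch(url)
    return match.groups() if match else None


groups = split_file_url(r"file:c:\some\directory\foo.csv")
for number, value in enumerate(groups, start=1):
    print(f"group {number}: {value}")
```

Note that the third group stops at the first dot of the file name, so for a name such as "a.tar.gz" the third group is "a" and the fourth is "tar.gz".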
Input port 0 | Input table with the string column to be split. |
Output port 0 | Input table amended by additional columns representing the pattern groups. |