commit | 2e4905c5f9ea5d73fcba22036bc76ba587a0ea20 | |
---|---|---|
author | Richard Eckart de Castilho <rec@apache.org> | Wed Mar 03 10:02:32 2021 +0100 |
committer | Richard Eckart de Castilho <rec@apache.org> | Wed Mar 03 10:02:32 2021 +0100 |
tree | 2d57728f7d0bdddf54ce44a6931a8cc468ccc51f | |
parent | b0f607765a4318cdf1609f37ac34d5520687386c | |
parent | bd451705dd0afc6a31c9545697e4dfb4a7dc9250 | |

Merge branch 'main' into bugfix/UIMA-6323-SeedLexer-not-generated-when-building-in-Eclipse

* main: (77 commits)
  [UIMA-6307] Centralize Jenkins pipelines
  [UIMA-6301] Rename "master" branches to "main"
  HD-6268: revised and improved documentation
  UIMA-6281: fix method for null arguments
  UIMA-6281: Ruta: use uimaFIT instead of CAS.select().coveredBy() internally
  UIMA-6281: Ruta: use uimaFIT instead of CAS.select().coveredBy() internally
  no jira - deactivate test for now
  UIMA-6271: Ruta: option to validate internal indexing in RutaEngine
  [UIMA-6231] Reducing memory pressure generated by UIMA Ruta
  [NO JIRA] Copying v3 code to a branch under the v2 spot so it gets included in the GitHub mirror.
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release ruta-3.0.1
  UIMA-6194: merge v2 changes
  no jira - manual rollback
  [maven-release-plugin] prepare for next development iteration
  [maven-release-plugin] prepare release ruta-3.0.1
  merged v2: UIMA-6195, UIMA-6194, UIMA-6193, UIMA-6192, UIMA-6191, UIMA-6183, UIMA-6171
  no jira - update updatesite version
  no jira - preparations fro next release - updated versions - updated jira version - updated release notes
  UIMA-6183: fix NPE and visibility problems of literal string matches in v3, behavior adapted in order to avoid unexpected matching within tokens
  ...

% Conflicts:
%	Jenkinsfile
%	ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation/AnnotationLabelExpressionTest.java

Apache UIMA Ruta™ is a rule-based script language supported by Eclipse-based tooling. The language is designed to enable rapid development of text processing applications within Apache UIMA™. A special focus lies on the intuitive and flexible domain specific language for defining patterns of annotations. Writing rules for information extraction or other text processing applications is a tedious process. The Eclipse-based tooling for UIMA Ruta, called the Apache UIMA Ruta Workbench, was created to support the user and to facilitate every step when writing UIMA Ruta rules. Both the Ruta rule language and the UIMA Ruta Workbench integrate smoothly with Apache UIMA.
The UIMA Ruta language is an imperative rule language extended with scripting elements. A rule defines a pattern of annotations with additional conditions. If this pattern applies, then the actions of the rule are performed on the matched annotations. A rule is composed of a sequence of rule elements, and a rule element usually consists of four parts: a matching condition, an optional quantifier, a list of conditions, and a list of actions. The matching condition is typically an annotation type, so that the rule element matches on the covered text of an annotation of that type. The quantifier specifies whether the rule element must match successfully and how often it may match. The list of conditions specifies additional constraints that the matched text or annotations need to fulfill. The list of actions defines the consequences of the rule and often creates new annotations or modifies existing annotations.
The following example rule consists of three rule elements. The first one (`ANY...`) matches on every token whose covered text occurs in a word list named `MonthsList`. The second rule element (`PERIOD?`) is optional and does not need to be fulfilled, which is indicated by the quantifier `?`. The last rule element (`NUM...`) matches on numbers that fulfill the regular expression `REGEXP(".{2,4}")` and are therefore between two and four characters long. If this rule successfully matches on a text passage, then its three actions are executed: an annotation of the type `Month` is created for the first rule element, an annotation of the type `Year` is created for the last rule element, and an annotation of the type `Date` is created for the span of all three rule elements. If the word list contains the correct entries, then this rule matches on strings like `Dec. 2004`, `July 85` or `11.2008` and creates the corresponding annotations.
(ANY{INLIST(MonthsList) -> Month} PERIOD? @NUM{REGEXP(".{2,4}") -> Year}){-> Date};
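To make the matching behaviour described above more concrete, here is a minimal plain-Python sketch that imitates this one rule. It does not use UIMA or the Ruta engine. The tokenizer, the `RuleElement` and `Annotation` classes, the `apply_rule` function and the contents of `MONTHS_LIST` are all hypothetical names introduced for illustration. The sketch also simplifies deliberately: it matches greedily from left to right without backtracking and ignores the `@` marker in front of `NUM`.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-in for the MonthsList word list (assumed entries).
MONTHS_LIST = {"Dec", "July", "11"}


@dataclass(frozen=True)
class Token:
    kind: str  # "NUM", "PERIOD" or "WORD"
    text: str
    begin: int
    end: int


@dataclass(frozen=True)
class Annotation:
    type: str
    begin: int
    end: int


@dataclass(frozen=True)
class RuleElement:
    matches: Callable[[Token], bool]  # matching condition plus extra conditions
    optional: bool = False            # models the "?" quantifier
    action: Optional[str] = None      # annotation type to create, if any


# Very rough tokenizer: digit runs, single periods, letter runs.
TOKEN_RE = re.compile(r"(?P<NUM>\d+)|(?P<PERIOD>\.)|(?P<WORD>[^\W\d_]+)")


def tokenize(text: str) -> List[Token]:
    return [Token(m.lastgroup, m.group(), m.start(), m.end())
            for m in TOKEN_RE.finditer(text)]


# The three rule elements of the example rule, in order.
DATE_RULE = [
    RuleElement(lambda t: t.text in MONTHS_LIST, action="Month"),
    RuleElement(lambda t: t.kind == "PERIOD", optional=True),
    RuleElement(lambda t: t.kind == "NUM"
                and re.fullmatch(r".{2,4}", t.text) is not None,
                action="Year"),
]


def apply_rule(text: str, rule=DATE_RULE, span_type: str = "Date") -> List[Annotation]:
    tokens = tokenize(text)
    found: List[Annotation] = []
    for start in range(len(tokens)):
        pos, created = start, []
        for element in rule:
            if pos < len(tokens) and element.matches(tokens[pos]):
                if element.action:
                    tok = tokens[pos]
                    created.append(Annotation(element.action, tok.begin, tok.end))
                pos += 1
            elif not element.optional:
                break  # a mandatory element failed, so the rule does not apply here
        else:
            if pos > start:  # at least one token was consumed
                found.extend(created)
                # Action on the whole rule: annotate the span of all elements.
                found.append(Annotation(span_type, tokens[start].begin,
                                        tokens[pos - 1].end))
    return found
```

With these assumptions, `apply_rule("Dec. 2004")` yields a `Month` over offsets 0 to 3, a `Year` over 5 to 9 and a `Date` over 0 to 9, while `apply_rule("July 12345")` yields nothing because the number is longer than four characters.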
Here is a short overview of additional features of the rule language:
The UIMA Ruta Workbench was created to facilitate all steps in creating Analysis Engines based on the UIMA Ruta language. Here is a short overview of included features:
Editing support: The full-featured editor for the UIMA Ruta language provides syntax and semantic highlighting, syntax checking, context-sensitive auto-completion, template-based completion, open declaration and more.
Rule Explanation: Each step in the matching process can be explained: This includes how often a rule was applied, which condition was not fulfilled, or by which rule a specific annotation was created. Additionally, profile information about the runtime performance can be accessed.
Automatic Validation: UIMA Ruta scripts can be automatically validated against a set of annotated documents (F1 score, test-driven development) and even against unlabeled documents (constraint-driven evaluation).
Rule learning: The supervised learning algorithms of the included TextRuler framework are able to induce rules and, therefore, enable semi-automatic development of rule-based components.
Query: Rules can be used as query statements in order to investigate annotated documents.
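As an illustration of the F1 score mentioned under Automatic Validation, the following sketch compares annotations produced by a script with gold annotations from an annotated document. It is not the Workbench's implementation. The `Span` representation and the exact-match criterion (same begin, end and type) are assumptions made for this example.

```python
from typing import Set, Tuple

# Assumed representation of an annotation: (begin, end, type).
Span = Tuple[int, int, str]


def f1_score(gold: Set[Span], predicted: Set[Span]) -> float:
    """Harmonic mean of precision and recall over exactly matching spans."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 3, "Month"), (5, 9, "Year"), (0, 9, "Date")}
predicted = {(0, 3, "Month"), (5, 9, "Year"), (0, 4, "Date")}
score = f1_score(gold, predicted)  # two of three spans agree on each side
```

Here precision and recall are both 2/3, so the score is 2/3 as well. A test-driven workflow would rerun such a comparison after each rule change.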
The UIMA Ruta Workbench can be installed via Eclipse update sites:
If you use UIMA Ruta to support academic research, then please consider citing the following paper as appropriate:
@article{NLE:10051335,
  author = {Kluegl, Peter and Toepfer, Martin and Beck, Philip-Daniel and Fette, Georg and Puppe, Frank},
  title = {UIMA Ruta: Rapid development of rule-based information extraction applications},
  journal = {Natural Language Engineering},
  volume = {22},
  issue = {01},
  month = {1},
  year = {2016},
  issn = {1469-8110},
  pages = {1--40},
  numpages = {40},
  doi = {10.1017/S1351324914000114},
  URL = {https://journals.cambridge.org/article_S1351324914000114},
}