9.6. Expression Language

9.6.1. Introduction

The DISQOVER Data Ingestion Engine is equipped with a powerful built-in expression language that can be used for two purposes:

  1. Narrowing the scope of a component to a specific set of resources, by specifying the condition that must be fulfilled for a resource to be in scope of the component. The expression that defines this scope for a component is called the filter.
  2. Defining the operation that is to be performed on the predicate values of a resource in a Transform Literals component. For example, an expression could be used to concatenate the first and last name of a person into a single literal, with a whitespace in between.

9.6.2. Statements

A Data Ingestion Engine expression consists of one or more statements. If more than one statement is present, they must be separated by semicolons (;). Each statement is one of the following two types:

Assignment statement

This type of statement assigns a new value to an expression variable. The essential structure of an assignment is:

set [variable] = [expression]

You cannot overwrite an existing predicate in a statement, only append new values to an existing predicate or create a new one.

Return value statement
This type of statement returns a value. For example, a statement returning a boolean value can be used as scope filter for a component. The last statement of an expression determines the return value of the entire expression.

Comments can be added to a statement by starting text with a # character.

In some cases it is useful to combine a set of statements into a new, single statement. The compound keyword can be used to achieve this:

compound{
  [statement1];
  [statement2];
}

The return value of the last statement in this compound set determines the return value of the compound statement.

9.6.3. Variables

Three types of variables can be used in an expression:

Input variables
When an expression is evaluated in the context of a resource, the predicates that exist at the time of the evaluation can be used as variables in the expression. They are referred to by the predicate name with a preceding $ sign. For example, the existing predicate “name” is referred to by $name. An input variable can appear in a return value statement or in the right-hand side of an assignment statement, but never in the left-hand side of an assignment statement.
Output variables
In case an expression is used to create new predicates in a Transform Literals component, such a new predicate is referred to by the predicate name with a preceding @ sign. For example, the new predicate “full_name” is referred to by @full_name. An output variable can only appear in the left-hand side of an assignment statement.
Temporary variables
In addition, helper variables can be used that only exist during the evaluation of the expression. Such variables have to start with an underscore character (_). A temporary variable must be created by writing it in the left-hand side of an assignment. Once created, it can appear in the right-hand side of an assignment, or in a return value statement.

Note

The data type of a variable that is linked to a predicate (input or output) is always a list. This reflects the fact that a predicate is intrinsically multi-valued.

In case a predicate is known to be single-valued, that value can be accessed using the double $$ notation. For example, $$name returns the value of a single-valued predicate “name”. In case this predicate actually holds more than one value, this construction will cause an error.
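
The difference between the two notations can be modelled in a few lines of Python. This is only an illustrative sketch of the behaviour described in this note, not the engine's implementation. The helper names multi and single are hypothetical, and the handling of an absent predicate is an assumption that the text above does not confirm.

```python
# Illustrative Python model of $name versus $$name (not DISQOVER code).
# A resource is modelled as a dict mapping predicate names to lists of values.

def multi(resource, predicate):
    """Model of $predicate: always a list, possibly empty."""
    return list(resource.get(predicate, []))


def single(resource, predicate):
    """Model of $$predicate: the one value of a single-valued predicate."""
    values = multi(resource, predicate)
    if len(values) > 1:
        # The documentation states that more than one value causes an error.
        raise ValueError(f"predicate {predicate!r} holds more than one value")
    # Assumption: an absent predicate yields null (None); the source does not say.
    return values[0] if values else None


person = {"name": ["Ada"], "alias": ["A.", "Countess"]}
print(multi(person, "name"))   # ['Ada']
print(single(person, "name"))  # Ada
```

Calling single(person, "alias") in this model raises an error, mirroring the failure described for a predicate that holds more than one value.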

9.6.4. Data Types

The following data types are supported in the expression language:

Boolean
A binary value that can be either True or False.
Numerical
A numerical value that can have either integer or floating-point content. For example: 3.14159.
String
A piece of text. String literals can be entered in an expression in two ways:
  • Standard notation: surrounded by double quotes, such as "text".
  • Raw block quoted strings, written as r"""text""". This notation has the advantage that the string literal may contain special characters such as backslashes or quotes without the need to escape them.
List
An ordered collection of entities. Each entity in itself can be of any of the data types supported by the expression language. List literals can be written as a comma-separated enumeration of the elements, surrounded by square brackets []. For example: [1, 2, 3].
Map
A map is a lookup dictionary that associates a set of entities with unique string values. Each entity in itself can be of any of the data types supported by the expression language. Map literals can be written as a comma-separated set of key-value pairs (separated by colons), surrounded by curly brackets {}. For example: {"a": 1, "b":2, "c":3}.

A special value null is used to represent the absence of any data.

Note

The data types recognised by the expression language only exist during the execution of the expression. Predicate values are always stored as strings during the data ingestion process.

This means, for example, that in order to process predicate values as floats, a type conversion must happen using the Float conversion function. When float values are to be stored as predicate values, they must be converted to strings using the Str conversion function. An illustration of this process can be found in the scores_int example in Other Examples.
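
The round trip from stored string to number and back can be sketched in Python. This is a model only: Python's float, round and str stand in for the Float, Round and Str functions, and the real functions may format numbers differently.

```python
# Illustrative Python model of the string round trip (not DISQOVER code).
stored = ["1.2", "1.9", "3.4"]              # predicate values are stored as strings

as_floats = [float(v) for v in stored]      # plays the role of Float(...)
rounded = [round(v, 0) for v in as_floats]  # plays the role of Round(..., 0)
stored_again = [str(v) for v in rounded]    # plays the role of Str(...)

print(stored_again)  # ['1.0', '2.0', '3.0']
```

Skipping the first conversion would leave the values as text, so numerical operations such as rounding could not be applied to them.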

9.6.5. Syntax

Operators

The Data Ingestion Engine expression language supports the following operators (in order of increasing precedence):
  • Boolean binary operators: or, and. These can be applied on Boolean data types.
  • Comparison binary operators: <=, >=, ==, !=, <, >. These can be applied on Boolean, Numerical and String data types (both sides of the operator must have the same type).
  • Arithmetic binary operators: +, -, *, /, ^. These can be applied on Numerical data types.

Round brackets () can be used to control the order of execution of operators.

A ternary operator, ifthen, can be used to perform conditional evaluation. The expression:

ifthen([condition], [value1], [value2])

returns [value1] if [condition] is True, and [value2] if it is False.

Functions

Functions are called by writing the function name, followed by a comma-separated list of arguments surrounded by round brackets. For example: StrFind("abc", "b"). Some functions have optional arguments, which can be specified by adding the argument's name and value, separated by =. For example: StrFind("abc", "b", case_sensitive=False).

See Expression Functions for a list of built-in functions.

9.6.6. List iterations

Since predicates are always lists in the DISQOVER Data Ingestion Engine, the expression language contains a number of tools to assist in the processing of lists. See also List Iteration.

Map

The following syntax can be used to iterate over a list, apply a transformation to each element of the list, and return a new list with the transformed elements:

Map([sourceList], [iteratorVariable], [expression], filter=[expression], ignore_null=[True/False])
with:
  • [sourceList]: The list to transform.
  • [iteratorVariable]: the name of a temporary variable that holds the value of each source element during the iteration over the source list elements. This variable name must start with an underscore (_).
  • [expression]: expression to evaluate for each source list element. The return value of that expression is used as destination element in the result list.
  • filter: expression to evaluate for each source list element. Only the elements which return True for this expression will be transformed by the main [expression] itself. This parameter is optional and will default to None if not specified.
  • ignore_null: Can be set to either True or False. Declares whether or not null values, coming out of the [expression], are added to the result list. This parameter is optional and will default to True if not specified.

The return value of the Map function is the list with the transformed elements.

Example

The expression:

Map([1,2,3], _el, 2*_el)

Returns [2,4,6] (each element multiplied by two).
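
The role of the two optional parameters can be modelled in Python. This is a hedged sketch of the behaviour described above, not the engine's code. The function name map_list and the parameter name filter_expr are hypothetical, and the sketch assumes that elements rejected by the filter are left out of the result list.

```python
# Illustrative Python model of Map (not DISQOVER code).

def map_list(source, expression, filter_expr=None, ignore_null=True):
    """Transform each element of source, honouring filter and ignore_null."""
    result = []
    for element in source:
        # Assumption: elements failing the filter are dropped from the result.
        if filter_expr is not None and not filter_expr(element):
            continue
        value = expression(element)
        # With ignore_null=True, null (None) results are not added.
        if value is None and ignore_null:
            continue
        result.append(value)
    return result


print(map_list([1, 2, 3], lambda el: 2 * el))                                  # [2, 4, 6]
print(map_list([1, 2, 3, 4], lambda el: 2 * el, filter_expr=lambda el: el > 2))  # [6, 8]
print(map_list(["a", None], lambda el: el, ignore_null=False))                 # ['a', None]
```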

Find

The following syntax can be used to return the first element in a list that matches a specific criterion:

Find([sourceList], [iteratorVariable], [expression], required=[True/False], unique=[True/False])
with:
  • [sourceList]: The list to find the element in.
  • [iteratorVariable]: the name of a temporary variable that holds the value of each source element during the iteration over the source list elements. This variable name must start with an underscore (_).
  • [expression]: expression to evaluate for each source list element.
  • required: Can be set to either True or False. If set to False, Find will return a null value when there are no matches. If set to True, it will instead return an error message, causing the transformation and/or component to fail. This parameter is optional and will default to False if not specified.
  • unique: Can be set to either True or False. If set to True, Find will return an error message when there is more than one match, causing the transformation and/or component to fail. This parameter is optional and will default to False if not specified.

The return value of the Find function is the first element for which [expression] evaluates to True. In case no element matches, null is returned (unless required is set to True).

Example

The expression:

Find(["123", "ab", "bc", "cd", "ae"], _el, StrStarts(_el, "a"))

Returns "ab".

Whereas both:

Find(["123", "ab", "bc", "cd", "ae"], _el, StrStarts(_el, "f"), required = True)

and:

Find(["123", "ab", "bc", "cd", "ae"], _el, StrStarts(_el, "a"), unique = True)

Return an error.
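
The three outcomes above can be reproduced with a small Python model. It is a sketch of the documented behaviour under the stated defaults, not the engine's implementation. The function name find and the use of LookupError are choices made for this illustration.

```python
# Illustrative Python model of Find (not DISQOVER code).

def find(source, expression, required=False, unique=False):
    """Return the first element for which expression is true."""
    matches = [element for element in source if expression(element)]
    if not matches:
        if required:
            raise LookupError("no element matches")
        return None  # null when nothing matches and required is False
    if unique and len(matches) > 1:
        raise LookupError("more than one element matches")
    return matches[0]


words = ["123", "ab", "bc", "cd", "ae"]
print(find(words, lambda el: el.startswith("a")))  # ab
print(find(words, lambda el: el.startswith("f")))  # None
```

With required=True the second call fails because there is no match, and with unique=True the first call fails because both "ab" and "ae" match.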

Reduce

The following syntax can be used to iterate over a list, accumulate an aggregated operation over each element, and return the result of the aggregation:

Reduce([sourceList], [aggregatorInitialValue], [aggregatorVariable],
       [iteratorVariable], [expression], filter=[expression])
with:
  • [sourceList]: The list to apply the aggregating operation on.
  • [aggregatorInitialValue]: The start value of the aggregated value.
  • [aggregatorVariable]: the name of a temporary variable that accumulates the aggregated value during the iteration over the source list elements. This variable name must start with an underscore (_).
  • [iteratorVariable]: the name of a temporary variable that holds the value of each source element during the iteration over the source list elements. This variable name must start with an underscore (_).
  • [expression]: expression to evaluate for each source list element. The return value of that expression is used as the new aggregated value and assigned to [aggregatorVariable].
  • filter: expression to evaluate for each source list element. Only the elements which return True for this expression will be processed by the main [expression] itself. This parameter is optional and will default to None if not specified.

The return value of the Reduce function is the aggregated value after iteration over each element in the list.

Example

The expression:

Reduce([1,2,3], 0, _aggr, _el, _aggr + _el)

Returns 6 (the sum of all values in the list).
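
The accumulation can be followed step by step in a Python model. This sketch only mirrors the description above. The names reduce_list and filter_expr are hypothetical, and it is assumed that elements rejected by the filter leave the aggregated value unchanged.

```python
# Illustrative Python model of Reduce (not DISQOVER code).

def reduce_list(source, initial, expression, filter_expr=None):
    """Fold source into one value, starting from initial."""
    aggregate = initial
    for element in source:
        # Assumption: filtered-out elements do not touch the aggregate.
        if filter_expr is not None and not filter_expr(element):
            continue
        # The result of the expression becomes the new aggregated value.
        aggregate = expression(aggregate, element)
    return aggregate


print(reduce_list([1, 2, 3], 0, lambda aggr, el: aggr + el))                               # 6
print(reduce_list([1, 2, 3], 0, lambda aggr, el: aggr + el, filter_expr=lambda el: el > 1))  # 5
```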

9.6.7. Custom functions

A custom function can be defined using a special statement of the form:

function [functionName]([variableList]) [expression]
with:
  • [functionName] the name of the custom function.
  • [variableList] a comma-separated list of variable names for arguments of the function. These variable names must start with an underscore (_).
  • [expression] the expression to evaluate and return as the result value of the custom function. This expression can use the argument variables of the custom function as well as variables that are globally defined.

In case more than one statement is used to define the action of the custom function, the compound{} keyword can be used to group the statements into a single, new expression. The return value of the last statement in this compound set determines the return value of the custom function.

Once defined, a custom function can be used in the same way a built-in function is used.

Example

A custom function that returns the first letter of a string:

function GetFirstLetter(_str) StrSubstring(_str, 0, 0)

Example

Grouping several expressions in a custom function using a compound statement

function tst(_lst) compound{
   set _temp = Map(_lst, _el, Int(_el));
   Map(_temp, _el, _el+1);
};

Note that, in this example, a new temporary variable was created in the definition of this custom function.
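
For readers more familiar with general-purpose languages, the compound example corresponds roughly to the Python function below. This is a model for comparison only, and it assumes that Int behaves like Python's int when applied to numeric strings.

```python
# Illustrative Python counterpart of the custom function tst (not DISQOVER code).

def tst(lst):
    temp = [int(el) for el in lst]  # set _temp = Map(_lst, _el, Int(_el));
    return [el + 1 for el in temp]  # last statement: its value is the result


print(tst(["1", "2", "3"]))  # [2, 3, 4]
```

As in the compound statement, the value of the last line determines what the function returns, and temp only exists while the function runs.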

9.6.8. Unit Tests

In the user interface edit box where an expression can be entered, a second tab allows the user to enter a unit test for that expression. Such a unit test challenges the expression with a pre-defined input, and verifies the content of the output. Several unit tests can be defined, separated by semicolons (;).

A single unit test consists of a comma-separated list of assignments, where both input and output variables can be defined (with preceding $ and @ signs). In case the return value of an expression is to be verified, the syntax result= can be used to specify that return value.

Figure 9.96 Screen shot of the unit test environment in the expression edit box.

Example

The following expression takes an existing predicate “inp” and creates a new predicate “out” by incrementing each value:

set @out = Map($inp, _el, _el+1)

This unit test verifies the correct operation of this expression:

$inp=[1,2,3],@out=[2,3,4]

Example

The following expression determines whether or not an input predicate “data” contains at least one value (such an expression could be used as a filter narrowing down the scope of a component):

ListNotEmpty($data)

This unit test verifies the correct operation of this expression:

$data=[],result=False;
$data=[1],result=True;

Partial tests

In case a Transform Literals component is used to create new, transformed predicates, a single expression may contain a set of separate statements that each define a new output predicate. In such a situation, unit tests can be defined for each output predicate separately, by starting the test line with Partial:. In such a case, the unit test evaluation only considers the output predicates that are part of that specific test, and ignores any output predicates that are defined in the expression but not specified in the test line.

Example

The following expression creates two output predicates, “out1” and “out2”:

set @out1 = $in1 + 1;
set @out2 = $in2 + 2;

For each output predicate, a separate, partial unit test can be defined:

Partial: $in1=1,@out1=2;
Partial: $in2=1,@out2=3;
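
The difference between a partial and a regular test can be pictured with a small Python model of the comparison step. This is a sketch, not the engine's logic. The function name check is hypothetical, and the sketch assumes that a regular test compares the complete set of output predicates.

```python
# Illustrative Python model of partial test evaluation (not DISQOVER code).

def check(actual_outputs, expected_outputs, partial=False):
    """Compare the outputs of an expression with the outputs named in a test."""
    if partial:
        # Only the output predicates listed in the test line are compared.
        return all(
            actual_outputs.get(name) == value
            for name, value in expected_outputs.items()
        )
    # Assumption: a regular test requires all output predicates to match.
    return actual_outputs == expected_outputs


# Outputs of the two-statement expression for in1 = 1 and in2 = 1.
actual = {"out1": 2, "out2": 3}
print(check(actual, {"out1": 2}, partial=True))  # True
print(check(actual, {"out1": 2}))                # False
```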

Testing for errors

Unit tests can also be used to verify that the expression returns an error under certain conditions. This can be achieved using the error= construction.

Example

The following expression uses the special built-in function Error to generate an error in case the single-valued input predicate “inp” is negative, and otherwise creates a new, single-valued predicate “out” with an incremented value:

Error("the_error", condition = ($inp < 0));
set @out = [$inp + 1];

Two tests verify both aspects of this behaviour:

$inp=[1],@out=[2];
$inp=[-1],error="the_error"

9.6.9. Other Examples

Example

The following expression, used in a Transform Literals component, sets a single, fixed literal value for a new predicate disease:type:

set @disease:type = ["Lowest level term"];

Example

The following expression, used in a Transform Literals component, takes an existing single-valued predicate disease_id and constructs a new predicate disease_url by prefixing it with a URL:

set @disease_url = ["http://database.org/disease/"
                    + $$disease_id];

Notice the usage of $$ to fetch the single value content of the predicate.

Example

The following expression, used in a Transform Literals component, takes an existing multi-valued predicate type_raw and constructs a new predicate type by capitalising words in all predicate values:

set @type = Map($type_raw, _el, StrTitle(_el));

Notice the usage of Map to iterate over all values in the source predicate.

Example

The following expression, used in a Transform Literals component, takes an existing predicate scores_float containing multi-valued floating-point scores, and creates a new predicate scores_int, with the scores rounded to integer values:

set @scores_int = Map($scores_float, _el, Str(Round(Float(_el), 0)))

Notice that the source predicate holds float values stored as strings, and that the result must be stored as strings again. This is achieved by using the Float and Str conversion functions.

A possible unit test for this expression:

$scores_float=["1.2","1.9","3.4"],@scores_int=["1.0","2.0","3.0"]

Example

The following expression, used in a Transform Literals component, takes an existing predicate cas, and creates a new predicate cas_number, containing only those values that neither start with "ec" (case-insensitive) nor are equal to "0":

set @cas_number = Map(
      $cas, _el, _el,
      filter=not StrStarts(StrLower(_el), "ec") and _el != "0"
   )

A possible set of unit tests for this expression:

$cas = ["EC 1.134", "123-05-34"], @cas_number = ["123-05-34"];
$cas = ["0", "123-05-34"], @cas_number = ["123-05-34"];
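
The filter condition of this last example can be checked against both unit tests with a Python model. The sketch assumes that StrLower and StrStarts behave like Python's str.lower and str.startswith, and that not binds more tightly than and, as it does in Python. The function name cas_number is chosen for this illustration.

```python
# Illustrative Python model of the cas_number filter (not DISQOVER code).

def cas_number(cas):
    """Keep values that neither start with "ec" (any case) nor equal "0"."""
    return [
        el for el in cas
        if not el.lower().startswith("ec") and el != "0"
    ]


print(cas_number(["EC 1.134", "123-05-34"]))  # ['123-05-34']
print(cas_number(["0", "123-05-34"]))         # ['123-05-34']
```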