9.4. Components

9.4.1. Introduction

Components can be configured via their options.

Some options occur in many component types. We’ll discuss these common options first, before diving into the individual component types.

Filters

Most components offer the possibility to restrict the resources that are processed in the involved classes by using filters.

A filter is a boolean expression which is evaluated for every (active) resource. If the expression returns True then the resource is taken into account for the component execution, otherwise it is not taken into account. By default (empty filter), the return value is True.
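The selection logic can be sketched in Python as follows. This is only an illustration: real filters are written in the expression language, and the dictionary-based resources and the helper name `apply_filter` are assumptions made for this sketch.

```python
from typing import Callable, Optional

# Assumption: a resource is modelled as a dict mapping predicate names to lists of strings.
Resource = dict[str, list[str]]


def apply_filter(resources: list[Resource],
                 filter_expr: Optional[Callable[[Resource], bool]] = None) -> list[Resource]:
    """Return the resources that are "filtered in" (hypothetical helper)."""
    if filter_expr is None:
        # An empty filter evaluates to True for every resource.
        return list(resources)
    # The expression is evaluated once per resource and only sees that resource.
    return [r for r in resources if filter_expr(r)]


resources = [
    {"name": ["John Snow"], "house": ["Stark"]},
    {"name": ["Petyr Baelish"], "house": []},
]
with_house = apply_filter(resources, lambda r: len(r["house"]) > 0)
everything = apply_filter(resources)
```

Because the expression receives a single resource at a time, it cannot compare values across resources.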

The expression is written using the expression language, see Expression Functions. For each resource the filter expression has access to the values of all predicates.

Note: it is not possible to filter based on predicate values in multiple resources.

Known issue: In general, when specifying a predicate in an option or expression, the type extension can be left out. For example, one can normally write name instead of name.lit. However, if a predicate is mentioned both in an option and in a filter, then the extension has to be added explicitly in the filter.

After execution, the number of resources that were “filtered in” per class can be consulted in the section Counters of the component view.

Quality Control

Several components offer a mechanism for verifying the quality of the produced data.

Quality is expressed via numbers called quality measures. A quality measure can be either a quantity, e.g. the number of imported resources, or a ratio between two numbers, e.g. the number of resources that produced a failure divided by the total number of resources, in other words the fraction of failures.

Some quality measures are expected to be high numbers (higher is better), like the number of imported resources, others are expected to be low numbers (lower is better), like the fraction of failures.

A component can generate a warning or an error if a quality measure crosses a user-specified threshold. What crossing means depends on the kind of quality measure:

  • For a higher is better quality measure the warning or error is generated if the number is smaller than the threshold.
  • For a lower is better quality measure the warning or error is generated if the number is greater than the threshold.

Each component can offer different quality measures which are relevant for that component. For each quality measure, the user will be presented with two options to set the thresholds:

  • Error Level: if the quality measure crosses this threshold (in the direction described above), an error is produced.
  • Warning Level: if the quality measure crosses this threshold (in the direction described above), a warning is produced.

For the difference between warnings and errors, see Execution errors and warnings.

This kind of quality control is sometimes called in-component QC. It is not to be confused with the component Verify Data which offers an alternative, less component-specific way to verify quality.

Example

The Transform Literals component applies some transformation on each resource of a class. For some resources the transformation may fail. It is therefore natural for this component to offer a quality measure Fraction of Failed Transformations. This is an example of a ratio quality measure of the type lower is better.

Suppose the user specified Error Level = 0.05 and Warning Level = 0.02. Then this component will generate a warning if more than 2% of the resources failed, and an error if more than 5% of the resources failed.
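The threshold logic can be sketched in Python as shown here. The function name `check_quality` and its parameters are hypothetical, and the sketch assumes that an error takes precedence when both thresholds are crossed.

```python
def check_quality(value: float, warning_level: float, error_level: float,
                  higher_is_better: bool = False) -> str:
    """Return "error", "warning" or "ok" for one quality measure (illustrative only)."""

    def crosses(threshold: float) -> bool:
        # Higher is better: a problem when the value is smaller than the threshold.
        # Lower is better: a problem when the value is greater than the threshold.
        return value < threshold if higher_is_better else value > threshold

    # Assumption: the error level is checked first, so an error wins over a warning.
    if crosses(error_level):
        return "error"
    if crosses(warning_level):
        return "warning"
    return "ok"


# Fraction of Failed Transformations with Warning Level = 0.02 and Error Level = 0.05.
status_3_percent = check_quality(0.03, warning_level=0.02, error_level=0.05)
status_6_percent = check_quality(0.06, warning_level=0.02, error_level=0.05)
```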

Warning suppression

The user can suppress the reporting of specific warnings in specific components via component options.

The option-section Warnings contains an option for (almost) each type of warning that a component can generate.

Each option specifies a minimal count.

The warning will only be reported if its number of occurrences is greater than or equal to the minimal count. The default value is 1, so by default every warning will be reported.
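As a minimal sketch, the reporting rule amounts to a single comparison. The function name `should_report` is hypothetical.

```python
def should_report(occurrences: int, minimal_count: int = 1) -> bool:
    """A warning is reported only if it occurred at least minimal_count times."""
    return occurrences >= minimal_count


reported_by_default = should_report(1)               # default minimal count is 1
suppressed = not should_report(5, minimal_count=6)   # 5 occurrences stay below 6
```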

In the example above (Transform Literals component) the warnings can be suppressed by setting Minimal count for warning “Error while processing a resource.” to a value above 8776.

Note: Quality Control warnings are controlled by a separate mechanism of thresholds, see Quality Control.

9.4.2. Add Label

Uses the content of a predicate as a new label. The preferred label will be visible in DISQOVER. All other labels will be used as synonyms.

Description

This component adds zero or more labels to each resource in a class (Target Class) which is included in the filter.

Labels are stored in predicate disq:label.lit (or disq:label for short), which has a special meaning in DISQOVER.

The labels are copied from a literal predicate specified in option Literal Predicate.

Preferred Label

Similar to the concept of Preferred URI, each resource needs a unique Preferred Label in DISQOVER.

Each resource can have zero or more labels (stored in disq:label), but one of them is defined to be the Preferred Label.

The behavior depends on the option New Preferred Label.

If option New Preferred Label is True, then we want this component to define the Preferred Label for each resource. If a Preferred Label has already been defined by an earlier component, it will be overridden. In order to ensure that there is never more than one Preferred Label, the following rules apply for each resource:

Number of created labels | No Preferred Label yet | Already has Preferred Label
0 | warning | warning
1 | OK | OK, override
> 1 | warning; labels added but Preferred Label not set | warning; labels added but Preferred Label not overridden

If option New Preferred Label is False, then existing preferred labels are not changed.

Note that the mechanism and rules are subtly different compared to Preferred URI. Compare with Add URI.

Note that labels cannot be defined via component Transform Literals, because that component cannot guarantee the uniqueness of Preferred Labels.

Example

Option | Value
Literal Predicate | name
New Preferred Label | True

URIs have been abbreviated:

Preferred Labels are notated in boldface.

Target Class before applying the component:

disq:uri.uri | disq:uri.huri | name.lit | disq:label.lit
[G:john_snow] | [HURI(G:john_snow)] | [“John Snow”] | []
[G:sansa_stark] | [HURI(G:sansa_stark)] | [“Sansa Stark”] | [“Sansa”]
[G:petyr_baelish] | [HURI(G:petyr_baelish)] | [“Petyr Baelish”, “Littlefinger”] | []
[G:sandor_clegane] | [HURI(G:sandor_clegane)] | [“Sandor Clegane”, “The Hound”] | [“Sandor”]

Target Class after applying the component:

disq:uri.uri | disq:uri.huri | name.lit | disq:label.lit
[G:john_snow] | [HURI(G:john_snow)] | [“John Snow”] | [“John Snow”]
[G:sansa_stark] | [HURI(G:sansa_stark)] | [“Sansa Stark”] | [“Sansa”, “Sansa Stark”]
[G:petyr_baelish] | [HURI(G:petyr_baelish)] | [“Petyr Baelish”, “Littlefinger”] | []
[G:sandor_clegane] | [HURI(G:sandor_clegane)] | [“Sandor Clegane”, “The Hound”] | [“Sandor”]

Observe:

  • John Snow didn’t have a label yet, so he gets a new label, which is a copy of his name.
  • Sansa already had a label, to which a new label is added. Because the option New Preferred Label is True, this new label becomes the preferred label.
  • Petyr and Sandor have two names, so they both have two label candidates. Because it is not clear which one should be the Preferred Label, a warning is issued and neither of the labels is added! If option New Preferred Label had been False, then Sandor would get two extra labels (not preferred), but Petyr wouldn’t.

Options

  • Class : The name of the class on which the action will be performed.
  • Literal Predicate : The predicate containing the value(s) that will be set as instance label. If the Preferred Label option is turned on, this predicate must be single-valued.
  • New Preferred Label [Optional] : If turned on, the value of the selected predicate will be set as the preferred label and overwrite the existing label (if a preferred label has been set in an earlier component). In that case, the literal predicate must be single-valued. The default value is True.
  • New Preferred Label selection strategy [Optional] : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
  • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Minimal count for warning “The predicate ‘disq:label’ should not be used as an input predicate for the add label component.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘disq:label’ should not be used as an input predicate for the add label component.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.3. Add URI

Uses the content of a Literal Predicate as a new URI with an optional prefix. This URI can subsequently be used for creating relationships.

Description

This component adds zero or more “subject” URIs (disq:uri) to each resource in a class (Target Class) which is included in the filter.

The URIs are constructed from the values of a literal predicate specified in option Literal Predicate.

The following conversions are applied to each literal value:

  • If To Lowercase is True, the literal is converted to lowercase, otherwise it is left untouched (not converted to uppercase).
  • Then the literal is encoded as specified by Encoding (see below), unless Prefix is empty.
  • Finally, the value of Prefix is added in front.

Per resource, a URI is created for each of the values of the literal predicate, and these URIs are added to the values of disq:uri for that resource. If that predicate doesn’t exist yet, it is created. If, for a certain resource, the literal predicate has n values, then n URIs will be created.

Advanced: the URIs are actually added to disq:uri.uri, and their hashed values to disq:uri.huri.

Note

This is not the only component that can add URIs.

It is not possible to define URIs with component Transform Literals.

About encoding

To be a valid URI, reserved characters have to be URL-encoded, see percent-encoding.

The way literals are encoded depends on the option Encoding:

  • Standard URL-encoding: ' ' (space character) is encoded as "%20", '@' is encoded as "%40", and so on.
  • ONTOFORCE encoding: Space (' '), comma (','), period ('.'), semicolon (';') and slash ('/') are encoded as an underscore character ('_'), all other special characters via standard URL-encoding.
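The conversion steps and the two encodings can be approximated in Python as follows. This is a sketch, not the product's implementation: the function name `build_uri` and the encoding keys "standard" and "ontoforce" are hypothetical, and `urllib.parse.quote` is used as a stand-in for standard URL-encoding.

```python
from urllib.parse import quote

# Characters that the ONTOFORCE encoding replaces by an underscore.
_UNDERSCORE_CHARS = " ,.;/"


def build_uri(literal: str, prefix: str = "", encoding: str = "standard",
              to_lowercase: bool = False) -> str:
    """Build one URI from one literal value (illustrative only)."""
    # Step 1: optional conversion to lowercase.
    value = literal.lower() if to_lowercase else literal
    # Step 2: encode the literal, unless the prefix is empty.
    if prefix:
        if encoding == "ontoforce":
            # Assumption taken from the option description: surrounding
            # whitespace is stripped before the replacement.
            value = value.strip()
            value = "".join("_" if ch in _UNDERSCORE_CHARS else ch for ch in value)
        # Percent-encode all remaining special characters.
        value = quote(value, safe="")
    # Step 3: put the prefix in front.
    return prefix + value


uri = build_uri("John Snow", prefix="http://got/", encoding="ontoforce", to_lowercase=True)
```

With these settings, "John Snow" becomes http://got/john_snow, which matches the simple example in this section.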

A simple example

Option | Value
To Lowercase | True
Prefix | "http://got/"
Encoding | ONTOFORCE encoding
New Preferred URI | False

Target Class before applying the component:

name.lit
[“John Snow”]
[“Petyr Baelish”]

Target Class after applying the component:

name.lit | disq:uri.uri | disq:uri.huri
[“John Snow”] | [http://got/john_snow] | [HURI(http://got/john_snow)]
[“Petyr Baelish”] | [http://got/petyr_baelish] | [HURI(http://got/petyr_baelish)]

Preferred URI

Every resource can have zero or more (subject) URIs; like all predicates, disq:uri is multivalued.

However, every resource needs to have a unique Preferred URI in DISQOVER. The preferred URI is one of the values of disq:uri, and there are some mechanisms to specify which one.

If option New Preferred URI is True, then we want this component to define the Preferred URI for each resource. If a Preferred URI has already been defined by an earlier component, it will be overridden. In order to ensure that there is never more than one Preferred URI, the following rules apply for each resource:

Number of created URIs | No Preferred URI yet | Already has Preferred URI
0 | warning | warning
1 | OK | OK, override
> 1 | warning; URIs not added! | warning; URIs not added!

If option New Preferred URI is False, we want this component to define the Preferred URI only for resources which have no Preferred URI yet, so without overriding any previously set Preferred URI. For each resource the following rules apply:

Number of created URIs | No Preferred URI yet | Already has Preferred URI
0 | warning | warning
1 | OK | OK, don’t override
> 1 | warning; URIs not added! | OK, don’t override
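The two rule tables can be condensed into one decision function. The sketch below follows the tables literally; the names `Outcome` and `decide` are hypothetical, and the selection strategy option is deliberately left out.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    add_uris: bool        # are the created URIs added to disq:uri?
    set_preferred: bool   # does a created URI become the Preferred URI?
    warning: bool         # is a warning issued for this resource?


def decide(created: int, has_preferred: bool, new_preferred_uri: bool) -> Outcome:
    """Apply the Preferred URI rules to one resource (illustrative only)."""
    if created == 0:
        # Nothing to add; both tables list a warning in this row.
        return Outcome(False, False, True)
    if new_preferred_uri or not has_preferred:
        # A Preferred URI has to be defined, so exactly one candidate is needed.
        if created == 1:
            return Outcome(True, True, False)
        return Outcome(False, False, True)
    # New Preferred URI is False and a Preferred URI already exists:
    # the URIs are added and the existing Preferred URI is kept.
    return Outcome(True, False, False)


two_names_no_uri_yet = decide(created=2, has_preferred=False, new_preferred_uri=False)
two_names_has_uri = decide(created=2, has_preferred=True, new_preferred_uri=False)
```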

A more complicated example

In the following example some resources already have a URI before this component is applied.

Option | Value
To Lowercase | True
Prefix | "http://got/"
Encoding | ONTOFORCE encoding
New Preferred URI | False

URIs have been abbreviated:

Preferred URIs are notated in boldface.

Target Class before applying the component:

name.lit | disq:uri.uri | disq:uri.huri
[“John Snow”] | [P:John%20Snow] | [HURI(P:John%20Snow)]
[“Sansa Stark”] | [] | []
[“Petyr Baelish”, “Littlefinger”] | [] | []
[“Sandor Clegane”, “The Hound”] | [P:Sandor%20Clegane] | [HURI(P:Sandor%20Clegane)]
[] | [P:anonymous] | [HURI(P:anonymous)]
[] | [] | []

Target Class after applying the component:

name.lit | disq:uri.uri | disq:uri.huri
[“John Snow”] | [P:John%20Snow, G:john_snow] | [HURI(P:John%20Snow), HURI(G:john_snow)]
[“Sansa Stark”] | [G:sansa_stark] | [HURI(G:sansa_stark)]
[“Petyr Baelish”, “Littlefinger”] | [] | []
[“Sandor Clegane”, “The Hound”] | [P:Sandor%20Clegane, G:sandor_clegane, G:the_hound] | [HURI(P:Sandor%20Clegane), HURI(G:sandor_clegane), HURI(G:the_hound)]
[] | [P:anonymous] | [HURI(P:anonymous)]
[] | [] | []

Observe:

  • John Snow gets a second URI, but the first one is still the Preferred URI.
  • Sansa gets her first URI, so it becomes the Preferred URI.
  • Petyr didn’t have a URI yet, and has two names, so there are two URI candidates. Because it is not clear which one should be the Preferred URI, a warning is issued and neither of the URIs is added!
  • Sandor also has two candidate URIs, but since he already has a preferred URI and the option New Preferred URI is False, the URIs are just added, the preferred URI is not changed, and no warning is issued.
  • The next-to-last example (anonymous) poses no problem. Since there is no name, no URI is added.
  • The last example will issue a warning because no preferred URI can be set.

Options

  • Class : The name of the class on which the action will be performed.
  • Literal Predicate : The Literal Predicate containing the value(s) that will be combined with a prefix to form the URI(s).
  • Force as new preferred URI [Optional] : If turned on, the created URI will be set as the preferred URI. If turned off, the URI will only be used as the preferred URI if the resource didn’t have one yet. The default value is False.
  • New Preferred URI selection strategy [Optional] : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.

URI encoding

  • Prefix [Optional] : The prefix to be used for the URI.
  • Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: Standard URL encoding (e.g. ' ' is converted to '%20') and ONTOFORCE encoding. The latter strips surrounding whitespace, replaces space, comma, period, semicolon and slash characters with underscores, and applies standard URL encoding to all other special characters.
  • To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is False.

Advanced

  • Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The new Preferred URI overwrites the preferred URI assigned by federation synchronization (this may corrupt federation)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The new Preferred URI overwrites the preferred URI assigned by federation synchronization (this may corrupt federation)”. The default value is 1.
  • Minimal count for warning “The URI could not be added because the literal predicate is empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
  • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.4. Aggregate and Transform (resources)

Applies an expression to aggregate data from resources in one class, and then uses the aggregated data through an expression in another (or the same) class.

Description

This component operates in two phases. In the first phase it accumulates or aggregates information from one or more predicates of a class (Phase 1 Class). In the second phase it uses the aggregated information to transform predicates in the same or another class (Phase 2 Class).

Different kinds of accumulation are possible. Some examples:

  • calculate the maximum of all values of a numerical predicate,
  • calculate the sum of all values of a numerical predicate,
  • calculate the average (mean) of all values of a numerical predicate,
  • count the number of values of a predicate,
  • count the number of distinct values of a predicate,
  • count the frequency of values of a predicate (a histogram).

The user can specify the exact aggregation behavior via an expression which is applied to every resource of Phase 1 Class. Examples are given below.

In the second phase the aggregated information is used in a transformation, similar to Transform Literals. For example:

  • use the aggregated maximum value to transform a numerical predicate to a percentage relative to that maximum
  • use the aggregated average value to produce a predicate which indicates which values are below average
  • use the aggregated value frequency to produce a predicate which indicates which values are unique

Similar to other components, filters can be applied in both phases.

The $STORE object

Aggregated information is stored in a special variable $STORE, which is always a dictionary object (also called ‘dict’ or ‘map’), a collection of key-value pairs. The keys are always strings, the values can be any type (string, number, list, other dictionary, …)

As with any dictionary, the value corresponding to a key can be retrieved/changed using expression functions:

  • DictGet(d, k) returns the value of key k in dictionary d
  • DictSet(d, k, v) sets the value of key k in dictionary d equal to v

A literal dict can be specified like this: {"name": "John", "age": 33}.

Furthermore, it is possible to loop over a dictionary (with Map or Reduce) using the function DictKeys.

Phase 1: Initial Expression

At the start of phase 1, a $STORE object is automatically created. Its value is an empty dictionary ({}).

It is possible to change this initial value using option Phase 1 Initial Expression, typically via the function DictSet. For example, if you want to calculate the sum of some numerical predicate, you can introduce a key-value pair sum = 0 like this:

DictSet($STORE, "sum", 0)

Note that you cannot set $STORE via a construct like set $STORE = {...}.

It is possible to initialize multiple variables, e.g.:

DictSet($STORE, "count", 0);
DictSet($STORE, "sum", 0)

Phase 1: Resource Expression

This expression is applied to every resource of Phase 1 Class. Any change in $STORE is carried over to the next resource.

For example, if you want to count resources:

set _count = DictGet($STORE, "count");

DictSet($STORE, "count", _count + 1)

Or, in one line:

DictSet($STORE, "count", DictGet($STORE, "count") + 1)

This whole construction (including initialization) is equivalent to the following traditional (pseudo-)code:

count = 0
for resource in resources:
  count = count + 1

Note that, in this case, initializing the count to zero in the Initial Expression is not strictly necessary, because DictGet takes an optional parameter specifying the default value (if the key is missing):

set _count = DictGet($STORE, "count", default=0);
...

To calculate the sum of a numerical predicate, say price, one has to convert the predicate values to numbers (remember that predicate values are always stored as lists of strings). If you know the predicate is single-valued, you can use the $$-notation:

set _total_price = DictGet($STORE, "total_price", default=0);

DictSet($STORE, "total_price", _total_price + Float($$price))

If the predicate is multi-valued you can use Reduce:

set _total_price = DictGet($STORE, "total_price", default=0);

set _resource_price = Reduce($price, 0, _tot, _el, _tot + Float(_el));

DictSet($STORE, "total_price", _resource_price + _total_price)

or Map (in this case you cannot use the auxiliary variable _total_price):

Map($price, _el, DictSet($STORE,
                         "total_price",
                         DictGet($STORE, "total_price", default=0)
                          + Float(_el)))
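For readers more familiar with general-purpose languages, the multi-valued sum above corresponds roughly to the following Python sketch. It only mirrors the logic of the expressions: the function name `phase1_resource` and the plain dict standing in for $STORE are assumptions.

```python
from functools import reduce


def phase1_resource(store: dict, price: list[str]) -> None:
    """Add the prices of one resource to the running total kept in store."""
    total_price = store.get("total_price", 0)  # like DictGet with default=0
    # Like Reduce: convert every string value to a number and add it up.
    resource_price = reduce(lambda tot, el: tot + float(el), price, 0)
    store["total_price"] = resource_price + total_price  # like DictSet


store: dict = {}
# Three resources; the last one has no price at all.
for price in (["20", "40"], ["100"], []):
    phase1_resource(store, price)
```

The store is passed from one resource to the next, which is what makes the running total possible.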

Note that it is forbidden to “write” predicates during the first phase.

Validating the aggregated data

There are two ways to validate the aggregated data.

In the first place expressions involving $STORE can be validated by providing a Unit Test. Its value before evaluation can be specified in the normal way, but a special syntax is required to specify the value after evaluation:

$price=["20", "40"],
$STORE={},
after $STORE={"total_price": 60};

$price=["20", "40"],
$STORE={"total_price": 100},
after $STORE={"total_price": 160};

In the second place, when the component is executed, the value of $STORE is reported in the component feedback.

Phase 2: Initial Expression

The second phase is essentially equivalent to the component Transform Literals, with the addition that the aggregated data in $STORE can also be used.

In some cases it is necessary to do a form of post-processing on $STORE after the first phase, before application to the individual resources.

Suppose, for example that you want to calculate the average value of a numerical predicate. This can be done by aggregating the total value (“sum”) and the number of values (“count”) in the first phase. The average can then be calculated in Phase 2 Initial Expression:

DictSet($STORE,
        "average",
        DictGet($STORE, "sum") / DictGet($STORE, "count"))

or, to avoid division by zero when the key count is missing from $STORE:

DictSet($STORE,
        "average",
        DictGet($STORE, "sum") / DictGet($STORE, "count", default=1))
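The same two-phase average can be sketched in Python. This variant is an assumption-laden illustration, not the component's behavior: the function names are hypothetical, and storing None for an empty aggregation is an arbitrary choice made here to also cover the case where count exists but equals zero.

```python
def phase1_resource(store: dict, values: list[str]) -> None:
    """Phase 1: accumulate the sum and the number of values."""
    store["sum"] = store.get("sum", 0) + sum(float(v) for v in values)
    store["count"] = store.get("count", 0) + len(values)


def phase2_initial(store: dict) -> None:
    """Phase 2 initial step: derive the average once, before visiting resources."""
    count = store.get("count", 0)
    # Guard explicitly against an empty aggregation instead of dividing by zero.
    store["average"] = store.get("sum", 0) / count if count else None


store: dict = {}
for values in (["1", "2"], ["3"]):
    phase1_resource(store, values)
phase2_initial(store)
```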

Phase 2: Resource Expression

After the (optional) Initial Expression, a second sweep is executed on Phase 2 Class, applying Phase 2 Resource Expression to every resource.

In principle Phase 2 Class can be different from Phase 1 Class, but very often it is the same class. If that is the case, you can leave the option empty.

This expression can read and write predicates, and can use $STORE.

For example, if you have a single-valued predicate cost and aggregated the total cost in phase 1, you can produce a derived predicate cost_percent like this:

set _total_cost = DictGet($STORE, "total_cost");

set @cost_percent = [Str(Float($$cost) / _total_cost * 100)]

or, if the predicate is multi-valued:

set _total_cost = DictGet($STORE, "total_cost");

set @cost_percent = Map($cost, _el, Str(Float(_el) / _total_cost * 100))
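As a rough Python equivalent of the multi-valued variant, assuming resources are plain dicts of string lists and using the hypothetical function name `phase2_resource`:

```python
def phase2_resource(store: dict, resource: dict) -> None:
    """Write a derived predicate cost_percent; store is only read, never changed."""
    total_cost = store["total_cost"]
    resource["cost_percent"] = [
        str(float(el) / total_cost * 100) for el in resource["cost"]
    ]


store = {"total_cost": 200.0}
resource = {"cost": ["50", "100"]}
phase2_resource(store, resource)
```

How Python's `str` formats the numbers need not match the output of Str in the expression language.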

Notes:

  • It is forbidden to change $STORE in Phase 2 Resource Expression.
  • Like Transform Literals, this component cannot write subject URIs (disq:uri) or subject Labels (disq:label). However, it can produce (auxiliary) predicates which can then be used in subsequent Add URI or Add Label components.

Another example

Suppose you have resources with predicates ID and version, and that multiple resources can have the same ID, but in that case they have different versions:

ID.lit | version.lit
[“id1”] | [“1”]
[“id2”] | [“1”]
[“id1”] | [“2”]
[“id3”] | [“1”]
[“id1”] | [“3”]
[“id2”] | [“2”]

For every ID you only want to keep the resource with the highest version. This can be achieved by removing resources (see Remove Resources), but you first need to produce a predicate which indicates which resources are to be removed.

For this purpose you can use aggregation. Instead of using a fixed key in $STORE you can use the IDs. The values are the maximum versions (per ID).

Phase 1 Initial Expression can be empty. Phase 1 Resource Expression can be:

set _this_version = Float($$version);
set _current_max_version = DictGet($STORE, $$ID, default=0);
DictSet($STORE, $$ID, Max(_this_version, _current_max_version))

This is an example of a Unit Test for this expression:

$STORE={}, $ID=["foo"], $version=["1"], after $STORE={"foo": 1};
$STORE={"foo": 1}, $ID=["foo"], $version=["3"], after $STORE={"foo": 3};
$STORE={"foo": 3}, $ID=["foo"], $version=["2"], after $STORE={"foo": 3};
$STORE={"foo": 3}, $ID=["bar"], $version=["2"], after $STORE={"foo": 3, "bar": 2};

For the example data above the value of $STORE after the first phase would be:

{'id1': 3,
 'id2': 2,
 'id3': 1}

In the second phase, you can produce a predicate max_version:

set @max_version = [Str(DictGet($STORE, $$ID))]

The situation after execution is then:

ID.lit | version.lit | max_version.lit
[“id1”] | [“1”] | [“3”]
[“id2”] | [“1”] | [“2”]
[“id1”] | [“2”] | [“3”]
[“id3”] | [“1”] | [“1”]
[“id1”] | [“3”] | [“3”]
[“id2”] | [“2”] | [“2”]

Resources can now be removed in a following Remove Resources component with the filter:

$$version != $$max_version
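The whole flow of this example can be replayed in plain Python to check the reasoning. This is a sketch under simplifying assumptions: resources are dicts of string lists, a plain dict stands in for $STORE, and versions are parsed with `int` rather than Float so that the string form matches the table above.

```python
resources = [
    {"ID": ["id1"], "version": ["1"]},
    {"ID": ["id2"], "version": ["1"]},
    {"ID": ["id1"], "version": ["2"]},
    {"ID": ["id3"], "version": ["1"]},
    {"ID": ["id1"], "version": ["3"]},
    {"ID": ["id2"], "version": ["2"]},
]

# Phase 1: keep the highest version seen so far for every ID.
store: dict[str, int] = {}
for r in resources:
    key, version = r["ID"][0], int(r["version"][0])
    store[key] = max(version, store.get(key, 0))

# Phase 2: write the aggregated maximum back to every resource.
for r in resources:
    r["max_version"] = [str(store[r["ID"][0]])]

# Counterpart of the removal filter: only resources whose version equals
# the maximum for their ID are kept.
kept = [r for r in resources if r["version"] == r["max_version"]]
```

After this, `kept` holds one resource per ID, namely the one with the highest version.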

Options

Phase 1 (Aggregation)

  • Class : Class containing the predicates to be visited in the first phase.
  • Initial Expression [Optional] : The expression executed once at the start of the first phase to create key-value pairs in the $STORE dictionary.
  • Resource Expression : The transformation expression executed for each resource in the first phase. The values are accessed via the $STORE dictionary (see Dict Manipulation functions). This expression cannot create predicates.
  • Phase1 Filter [Optional] : A boolean expression returning True for resources to which the action should be applied in the first phase.

Phase 2 (Transformation)

  • Class [Optional] : Class containing the predicates to be visited in the second phase (can be the same as the first class).
  • Initial Expression [Optional] : The expression executed once at the start of the second phase. It creates key-value pairs in the $STORE dictionary.
  • Resource Expression : The transformation expression executed for each resource in the second phase. The values are accessed via the $STORE dictionary (see Dict Manipulation functions). This expression can create predicates.
  • Phase2 Filter [Optional] : A boolean expression returning True for resources to which the action should be applied in the second phase.

Advanced

  • Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during Phase 1 (Aggregation).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during Phase 1 (Aggregation).”. The default value is 1.
  • Minimal count for warning “An error occurred during Phase 2 (Transformation).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during Phase 2 (Transformation).”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.5. Configure Canonical Type

Configure a Canonical Type together with its facets and properties when loading in DISQOVER.

Description

This component defines the configuration settings for a DISQOVER Canonical Type.

Each Canonical Type has a unique URI and a unique label, see the options below.

Resource Selection

Each instance in DISQOVER corresponds to a resource in the Data Ingestion Engine, with a well defined Preferred URI and a Preferred Label.

Instances are categorized in groups called Canonical Types. For example the Canonical Type ‘Movie’ could encompass all movie-instances.

Very often an instance belongs to one Canonical Type, but an instance can also belong to multiple Canonical Types. E.g. an actor instance could belong to Canonical Type ‘Actor’ and Canonical Type ‘Person’. In other words: Canonical Types are not necessarily non-overlapping.

The Publish in DISQOVER component uses the information in the Configure Canonical Type components to convert resources in the Data Ingestion Engine to instances in DISQOVER.

Two options are available to specify which instances belong to a Canonical Type:

  • Classes is a list of Class names. All valid resources in each of these classes are included.
  • Types allows more fine-grained selection. It is a list of Resource Types. It looks at the resources of all classes produced by the pipeline. All valid resources which have (at least) this Resource Type (rdf:type) are included.

Some examples:

Option Value
Label Movies
URI http://movies
Classes [DisneyMovies]
Types []

All enabled resources in class DisneyMovies which have a Preferred URI and a Preferred Label are converted to instances of the Canonical Type Movies. This is a simple one-to-one correspondence.

Option Value
Label Movies
URI http://movies
Classes [DisneyMovies, PixarMovies]
Types []

Now all valid resources from class DisneyMovies and from class PixarMovies are included.

Option Value
Label Movies
URI http://movies
Classes []
Types http://disney.org/movies/

All valid resources in any class which contain Resource Type http://disney.org/movies/ are included.
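
The selection rules in these examples can be summarized in a short Python sketch. It is only an illustration under stated assumptions: resources are plain dicts, the field names (`enabled`, `preferred_uri`, `preferred_label`, `rdf_type`) are hypothetical, and combining Classes and Types as a union is an assumption rather than documented behavior.

```python
def select_resources(classes, class_names, resource_types):
    """Return the resources that would belong to a Canonical Type.

    classes: dict mapping class name -> list of resource dicts.
    A resource counts as valid when it is enabled and has both a
    Preferred URI and a Preferred Label.
    """
    def is_valid(res):
        return (
            res.get("enabled", False)
            and bool(res.get("preferred_uri"))
            and bool(res.get("preferred_label"))
        )

    selected = []
    for name, resources in classes.items():
        for res in resources:
            if not is_valid(res):
                continue
            in_class = name in class_names
            has_type = any(t in resource_types for t in res.get("rdf_type", []))
            if in_class or has_type:
                selected.append(res)
    return selected


classes = {
    "DisneyMovies": [
        {"enabled": True, "preferred_uri": "D:bambi",
         "preferred_label": "Bambi", "rdf_type": []},
        {"enabled": True, "preferred_uri": "D:draft",
         "preferred_label": None, "rdf_type": []},
    ],
    "PixarMovies": [
        {"enabled": True, "preferred_uri": "P:up", "preferred_label": "Up",
         "rdf_type": ["http://disney.org/movies/"]},
    ],
}

by_class = select_resources(classes, ["DisneyMovies"], [])
by_type = select_resources(classes, [], ["http://disney.org/movies/"])
```

In this sketch the resource without a Preferred Label is skipped, and selecting by type picks up a resource regardless of the class it lives in.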

Resource Types for predicates can be specified in a number of ways:

Note that a resource may have multiple Resource Types (multiple values for rdf:type).

Also note that multiple Canonical Types may include the same resources. The instances produced by these resources belong to each of these Canonical Types.

Properties and Facets

Predicates in the Data Ingestion Engine can be exposed in DISQOVER by configuring them as Properties or as Facets. A Facet is a kind of property which can be used to partition the instances in groups. It is typically used for filtering.

For example, the predicate title in class Movies can be exposed as a Property labeled Title. The value of this Property for a particular instance is equal to the value of this predicate for the corresponding resource. (technical note: if the value is a list and it contains duplicates, then the duplicates are removed)

As most movies have a distinct title, this is not a very good candidate for a Facet. The predicate genre is a better candidate to expose as a Facet. Filtering on genre in DISQOVER would allow the user to show all movies of a particular genre.

Note that a property or facet can be configured to correspond to multiple predicates. The value of the property/facet for a particular instance is the union of the values of the contributing predicates.
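
As a rough illustration of this union, the following Python sketch collects the values of several predicates and drops duplicates. The dict-based resource and the predicate name `alt_title` are hypothetical, and keeping the first-seen order is an assumption.

```python
def property_values(resource, predicates):
    """Union of the values of the contributing predicates, duplicates removed."""
    values = []
    for predicate in predicates:
        for value in resource.get(predicate, []):
            if value not in values:
                values.append(value)
    return values


movie = {"title": ["Bambi", "Bambi (1942)"], "alt_title": ["Bambi"]}
titles = property_values(movie, ["title", "alt_title"])
```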

A Canonical Type can have any number of Properties and any number of Facets. They are configured in this component. At a minimum, for each Property or Facet, the label, DISQOVER URI and the contributing predicates should be specified. More options are available to specify the description, the datatype etc. For more information, please refer to the list below.

Very often a predicate (or a set of predicates) is used both as a Property and as a Facet. The Facet can be defined separately in Facets (note that the Facet-URI and the Property-URI should be different), but it is often easier to define the Property in Properties/Facets and use the special options within the Property definition to expand it to a Facet. At a minimum, you should provide a Facet-URI.

See also component Configure Sub-instance Type, which allows you to configure a Property as a sub-table of a Canonical Type.

Within Properties/Facets, the option Renderer defines how the property is rendered within the instance list or instance popout (see also section 4.2.3):

  • Html: allows HTML markup in the property, but disables any executable scripts.
  • Unsanitized Html: allows HTML markup in the property. This option can be used to insert executable scripts into DISQOVER, and should therefore only be used when the source of the HTML is trustworthy beyond any doubt.
  • Date: shows date predicates (of the form “yyyy-mm-dd”) in a date format as specified within the browser settings of the user.
  • String: used to display text.
  • Image: if a property contains an external link to an image, this renderer shows the image.
  • External link: makes hyperlink properties clickable.
  • Paragraph, Sub table (deprecated) and Sub key value (deprecated): these options were used in previous versions of DISQOVER but are now deprecated. They have no effect when chosen.

Template Properties

The option Template lets you create a template property that does not have any predicates of its own but is based on one or more other properties, for example to add a prefix to the value of a property, or to combine multiple properties.

To reference a property, use @<property_uri>@ in the template. If the referenced property has multiple values, you will see multiple template values. You can reference multiple properties; if they are all multivalued, template values are created for all combinations of the property values.

Another option for working with multivalued properties is to add a delimiter in the template, which looks like @<property_uri>@@separator@. You can combine as many properties (with or without a delimiter) as you want in a template.
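
The expansion rules can be sketched in Python as follows. This is an approximation, not the DISQOVER implementation: it assumes the placeholder is the bare property URI between two @ signs, that the delimiter form joins all values into one string, and that a referenced property without values yields no template values. The property URIs `p:first`, `p:last` and `p:tags` are hypothetical.

```python
import itertools
import re

# "@uri@" optionally followed directly by "@separator@".
PLACEHOLDER = re.compile(r"@([^@]+)@(?:@([^@]*)@)?")


def expand_template(template, properties):
    """Expand a template into a list of template values.

    properties: dict mapping property URI -> list of string values.
    """
    parts = []  # alternating literal text and candidate values, in order
    pos = 0
    for match in PLACEHOLDER.finditer(template):
        parts.append([template[pos:match.start()]])
        values = properties.get(match.group(1), [])
        if match.group(2) is not None:
            # Delimiter form: all values are joined into a single string.
            parts.append([match.group(2).join(values)])
        else:
            # Plain form: every value produces its own template value.
            parts.append(list(values))
        pos = match.end()
    parts.append([template[pos:]])
    # All combinations of the candidate values, in template order.
    return ["".join(combo) for combo in itertools.product(*parts)]


props = {
    "p:first": ["Mickey", "Minnie"],
    "p:last": ["Mouse"],
    "p:tags": ["a", "b"],
}
names = expand_template("Name: @p:first@ @p:last@", props)
joined = expand_template("Tags: @p:tags@@, @", props)
combos = expand_template("@p:first@-@p:tags@", props)
```

Note that in this sketch two placeholders written directly after each other would be read as one placeholder with a delimiter, so some literal text between references is needed to keep them apart.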

Trees

You can configure facets to have tree data by using path predicates (xxx.path, generated by the Expand hierarchical paths component). Note that you should not specify the option Parent Facet in this case. A path predicate can also be used in a property for publishing to Remote Data Subscription; however, the property will not be used when publishing to DISQOVER. You will receive a warning in this case, which you can suppress by turning the option Publish to DISQOVER off.

Publishing the Configuration

The configuration defined in all Configuration Components (Canonical Types, Properties, Facets, …) can be transferred to DISQOVER in two ways:

  • “automatically” after successful execution of Publish in DISQOVER.
  • “manually” via the menu command “Generate configuration” in the Data Ingestion Engine frontend.

Manual publishing is typically used for cosmetic changes in the configuration, e.g. if a Property description was changed. If anything more structural was changed (such as adding a property, changing selected predicates etc.), you should execute Publish in DISQOVER.

Icon

The icon displayed in the canonical type tile in DISQOVER can be selected from a number of in-house icons or from the font-awesome v5.9.0 icons. To use one of our in-house icons, prefix one of the icon names listed below with “icon_type-”, for example “icon_type-citation”. To use a font-awesome icon, prefix the font-awesome icon name with font-awesome, for example “font-awesome fa-house”. Only the light style of font-awesome icons is currently supported.

These are the available in-house icons:

Figure 9.65 antibody
Figure 9.66 assay
Figure 9.67 author
Figure 9.68 biospecimen
Figure 9.69 cellline
Figure 9.70 citation
Figure 9.71 compound
Figure 9.72 databank
Figure 9.73 disease
Figure 9.74 enzyme
Figure 9.75 extra01
Figure 9.76 extra05
Figure 9.77 extra12
Figure 9.78 gene
Figure 9.79 homology
Figure 9.80 institution
Figure 9.81 journal
Figure 9.82 medicine
Figure 9.83 modelorganism
Figure 9.84 mouse
Figure 9.85 organism
Figure 9.86 pathway
Figure 9.87 plasmide
Figure 9.88 population
Figure 9.89 protein
Figure 9.90 risk
Figure 9.91 specialist
Figure 9.92 specialist2
Figure 9.93 trial
Figure 9.94 unknown
Figure 9.95 variant

Example

Custom icon:

ICON: icon_type-antibody

Mouse icon:

ICON: custom mouse

Font-awesome icon:

ICON: font-awesome fa-area-chart

Federation

In a federated setting, this component might add to or hide features of a remote type, if the local URI matches up with the remote type.

A facet can also be defined together with a property. The facet will use the same predicates, label and description etc. as the property. The DISQOVER URI for the facet must be defined explicitly.

Options

  • Label : The display name of the canonical type.
  • Description [Optional] : The description of the canonical type.
  • Icon : The name of the icon of the canonical type. You can use icons from fontawesome.com. For example, to use the ‘handshake’ icon, fill in ‘font-awesome fa-handshake’.
  • Classes [Optional] : List of classes contributing to this canonical type. The default value is [].
  • Properties/Facets [Optional] : All properties of this canonical type. A list of sub-options with the following structure:
    • Label : The display name of the property.
    • URI : The URI of the property.
    • Description : The description of the property.
    • Subinstance Type : The subinstance type of the values if applicable.
    • Predicates : List of predicates mapping to this property.
    • Renderer : The way the property should be visualized. The possible values are: html, unsanitized html, date, string, image, paragraph (deprecated), sub table (deprecated), sub key value (deprecated), external link. The value can also be undefined.
    • Template : A template which generates values from other property values. For example, to add a prefix to a property, the value should be: “prefix”@<property_uri>@
    • Order By : The way the property values should be ordered within a single instance. The possible values are: Numeric order, Label order (Case sensitive), Label order (Case insensitive), Date order. The value can also be undefined.
    • Disable Sorting : Specify true if there is no need to make this property sortable. The default value is False.
    • Data Type : The data type of the property. Specify this if you want a property sortable by an integer or float property. The possible values are: int, float, lat-lon, location_tree. The value can also be undefined.
    • Not Text Searchable : Specify true if there is no need to make this property text-searchable. The default value is False.
    • Export to file : Specify true to include this property when exporting data to file (Turtle format). The default value is True.
    • Publish for Remote Data Subscription : Specify true to include this property when publishing for remote data subscription. The default value is True.
    • Publish to DISQOVER : Specify true to include this property when publishing to DISQOVER. The default value is True.
    • Visible for Groups : The user groups which are allowed to view the property. Leave unspecified if accessible for all.
    • Mixed Security Values : Specify true here if this property has individual property values which could be hidden. The default value is False.
    • Custom predicate used in Published Data Set : Specify a custom predicate name to be used in the published data set
    • Also create a facet using these options. : Use as facet. The default value is False.
    • Facet URI : The URI of the facet.
    • Facet Parent Predicate : The predicate defining the hierarchy between the facet values.
    • Facet Not Annotated Label : A custom label for the “not annotated” item.
    • Facet Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
    • Facet Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
    • Facet Precision : The number of decimals to show for a floating point number.
    • Facet Single Valued : Specifies if the values are single valued. The default value is False.
    • Facet View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.
  • Resource Types [Optional] : List of resource types contributing to this canonical type. The default value is [].

Advanced

  • URI [Optional] : The URI of the canonical type.
  • Visible for Groups [Optional] : The identifiers of the user groups that will be allowed to see this canonical type. If this option is left empty, the canonical type will be accessible for everyone.
  • Default Hidden [Optional] : Specify true if the canonical type should not be visible on the dashboard. The default value is False.
  • Synonym as Property [Optional] : If turned on, a property named Synonym is automatically created, which contains all labels of a resource. The default value is True.
  • Generate Semantic Hit [Optional] : If true, an exact match of the label will generate a semantic hit for the concept. The default value is True.
  • In-house Canonical Type [Optional] : If turned on, the in-house icon will be shown for this canonical type. The default value is True.
  • Allow Mixing with Federated Data [Optional] : If turned on, local data will be mixed with federated data in this canonical type. The default value is True.
  • Disable Canonical Type [Optional] : If turned on, this canonical type will be completely disabled. The default value is False.
  • Label Renderer [Optional] : The way the label should be visualized. The possible values are: html, string.
  • Facets [Optional] : All facets of this canonical type. A list of sub-options with the following structure:
    • Label : The display name of the facet.
    • URI : The URI of the facet.
    • Description : The description of the facet.
    • Predicates : List of predicates mapping to this facet.
    • Parent Predicate : The predicate defining the hierarchy between the facet values.
    • Not Annotated Label : A custom label for the “not annotated” item.
    • Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
    • Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
    • Precision : The number of decimals to show for a floating point number.
    • Single Valued : Specifies if the values are single valued. The default value is False.
    • View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.
    • Export to file : Specify true to include this facet when exporting data to file (Turtle format). The default value is False.
    • Visible for Groups : The user groups which are allowed to view the facet. Leave unspecified if accessible for all.
    • Mixed Security Values : Specify true here if this facet has individual facet values which could be hidden. The default value is False.

Remote Data Subscription

  • Publish for Remote Data Subscription [Optional] : Publish as a data set for remote data subscription. The default value is False.
  • Data Set Name [Optional] : Name of the data set to be used for remote data subscription (by default this is the last part of the URI of the canonical type).
  • Remote Data Groups [Optional] : The identifiers of the user groups that will be allowed to see this Remote Data Set. If this option is left empty, the data set will be accessible for everyone.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “A property was configured to contain HTML code that will be rendered on the page. Only do this if the source of the property is trustworthy, because malicious code could be executed during rendering.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “A property was configured to contain HTML code that will be rendered on the page. Only do this if the source of the property is trustworthy, because malicious code could be executed during rendering.”. The default value is 1.

9.4.6. Configure Sub-instance Type

Configure a Sub-instance Type which can be referenced by other Canonical Types when loading in DISQOVER.

Description

This component defines the configuration settings for a DISQOVER sub-instance type. In a federated setting, this might add to or hide features from a remote sub-instance type, if the local URI matches up with the remote type. It does this by mapping types and predicates to a DISQOVER sub-instance type URI and a number of its property URIs.

Sub-instance types are used when you want to show a subtable as a property of a canonical type, and don’t want to define the items in the subtable as full-blown canonical types. E.g., you might want to show a subtable for a Disney character with cultural references:

Magazine | Article | Author
Mouse Monthly | Minnie as a role model | A. Mauser
Shoes, shoes, shoes | Pitter patter | H. Heels

If you don’t want a canonical type ‘Cultural Reference’, you can define it as a sub-instance type. Sub-instance types should have their own class with an instance URI. Here is the standard sequence for modeling this relationship for the given example. We’ll assume we have an import file for the parent class (disney_characters), and another one for cultural references, with the following structure:

Character | Magazine | Article | Author
D:minnie_mouse | Mouse Monthly | Minnie as a role model | A. Mauser

  • Import the disney_characters and add URI and label
  • Import the references into the class cultural_references
  • Add a URI to cultural_references, e.g. by combining character, magazine and article
  • Create a relationship by identifier from cultural_references to disney_characters, e.g. cultural_references:mentions
  • Create a sub-instance type for cultural_references, e.g. CulturalReference with URI D:cultural_reference.
  • Create or update the DisneyCharacter canonical type to have a property mentioned_in, that uses cultural_references:mentions.rev to populate the values, and specifies the sub_type to be D:cultural_reference.

Properties

A sub-instance type can have any number of properties. At a minimum the DISQOVER URI and the contributing predicates should be specified. For a list of additional optional arguments, please refer to the options list below.

Options

  • Label [Optional] : The display name of the subinstance type.
  • Classes [Optional] : List of classes contributing to this subinstance type. The default value is [].
  • Resource Types [Optional] : List of resource types corresponding to this subinstance type. The default value is [].
  • Properties : All properties of this subinstance type. A list of sub-options with the following structure:
    • Label : The display name of the property.
    • URI : The URI of the property.
    • Description : The description of the property.
    • Predicates : List of predicates mapping to this property.
    • Renderer : The way the property should be visualized. The possible values are: html, unsanitized html, date, string, image, paragraph (deprecated), sub table (deprecated), sub key value (deprecated), external link. The value can also be undefined.
    • Template : A template which generates values from other property values. For example, to add a prefix to a property, the value should be: “prefix”@<property_uri>@
    • Order By : The way the property values should be ordered within a single instance. The possible values are: Numeric order, Label order (Case sensitive), Label order (Case insensitive), Date order. The value can also be undefined.
    • Disable Sorting : Specify true if there is no need to make this property sortable. The default value is False.
    • Data Type : The data type of the property. Specify this if you want a property sortable by an integer or float property. The possible values are: int, float, lat-lon, location_tree. The value can also be undefined.
    • Not Text Searchable : Specify true if there is no need to make this property text-searchable. The default value is False.
    • Export to file : Specify true to include this property when exporting data to file (Turtle format). The default value is True.
    • Publish for Remote Data Subscription : Specify true to include this property when publishing for remote data subscription. The default value is True.
    • Publish to DISQOVER : Specify true to include this property when publishing to DISQOVER. The default value is True.
    • Visible for Groups : The user groups which are allowed to view the property. Leave unspecified if accessible for all.
    • Mixed Security Values : Specify true here if this property has individual property values which could be hidden. The default value is False.
    • Custom predicate used in Published Data Set : Specify a custom predicate name to be used in the published data set
    • Also create a facet using these options. : Use as facet. The default value is False.
    • Facet URI : The URI of the facet.
    • Facet Parent Predicate : The predicate defining the hierarchy between the facet values.
    • Facet Not Annotated Label : A custom label for the “not annotated” item.
    • Facet Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
    • Facet Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
    • Facet Precision : The number of decimals to show for a floating point number.
    • Facet Single Valued : Specifies if the values are single valued. The default value is False.
    • Facet View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.

Advanced

  • URI [Optional] : The URI of the subinstance type.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.8. Configure User Views (DEPRECATED)

User views specify which different reduced views on the data will be available for the user when publishing in DISQOVER.

Options

  • User Views Triples [Optional] : Triples defining the user views.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.9. Create Compact Class

Copies all enabled resources (i.e. resources that have not been removed) from the source class to the destination class. This can be used as an optimization after removing resources.

Description

This component copies all active resources from a class (Source Class) to another class (Target Class).

Components like Remove Resources, Merge Classes, and Merge within Class deactivate resources in a class. Deactivated resources slow down further processing of the class, because they must be “skipped” each time. Copying the active resources to a new class can substantially improve the performance of subsequent components.

Advanced

  • This component doesn’t copy auxiliary columns.
  • Preferredness of URIs and labels is taken over.
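
Conceptually, the copy behaves like the following Python sketch. It is a simplification for illustration only: resources are plain dicts with a hypothetical `active` flag, and auxiliary predicates are passed in as a set of names.

```python
def create_compact_class(source_class, auxiliary_predicates=frozenset()):
    """Copy only active resources, leaving auxiliary predicates behind."""
    target_class = []
    for resource in source_class:
        if not resource["active"]:
            continue  # deactivated resources are not transferred
        copied = {
            name: value
            for name, value in resource.items()
            if name not in auxiliary_predicates
        }
        target_class.append(copied)
    return target_class


source = [
    {"active": True, "name": "Mickey", "tmp_score": 3},
    {"active": False, "name": "Removed", "tmp_score": 1},
    {"active": True, "name": "Pluto", "tmp_score": 2},
]
compact = create_compact_class(source, auxiliary_predicates={"tmp_score"})
```

The source class is left untouched; later components that iterate over the compact class no longer have to skip deactivated entries.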

Options

  • Source Class : The class containing disabled resources which will not be transferred to the Target Class.
  • Target Class : The new class which will only contain enabled resources.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.10. Create Relationship (by identifier)

Create a relationship between the target class and the matching class by matching the content of an identifier (literal predicate) in the target class to a URI in the matching class.

Description

This component creates a relationship (or link) between two classes, by matching literals in the Target Class to existing URIs in Matching Class.

The created relationship is bidirectional, i.e. it is stored in two predicates:

  • a “forward” predicate in Target Class.
  • a “reverse” predicate in Matching Class.

The name of these predicates is specified in option Relationship Predicate.

More in detail, the component works as follows:

  • For each resource T in Target Class, all values of the literal predicate specified in option Matching Identifier are considered.
  • For each literal value, a corresponding URI is created in the same way as in component Add URI, taking into account options Prefix, To lowercase, Encoding.
  • Each URI is searched in predicate disq:uri (subject URI) of the Matching class:
    • If a resource M is found whose subject URI is equal to the URI (a match), then the URI is added to predicate RRR.uri, and its hashed value to predicate RRR.fwd, where we used RRR to denote the predicate specified in option Relationship Predicate. The reverse link is also created, by adding the (first) subject URI of resource T to predicate RRR.rev in resource M.
    • If the URI is not found (no match) then it is added to predicate RRR.err (this can be used, e.g. for debugging).

Prerequisite: both classes must have a predicate disq:uri (.huri to be precise).

Filters can be defined on both classes.

Example

Option Value
Target Class DisneyCharacters
Matching Class Animals
Matching Identifier animal_name
Relationship Predicate animal
Prefix "http://animals.org/"
To lowercase True
Encoding ONTOFORCE encoding

URIs have been abbreviated:

  • ‘D:’ stands for http://disney.org/
  • ‘A:’ stands for http://animals.org/

Target Class DisneyCharacters before applying the component:

disq:uri.uri disq:uri.huri animal_name.lit
[D:mickey_mouse] [HURI(D:mickey_mouse)] [“Mouse”, “House Mouse”]
[D:pluto] [HURI(D:pluto)] [“Dog”]
[D:goofy] [HURI(D:goofy)] [“Dog”, “Human”]
[D:donald_duck] [HURI(D:donald_duck)] [“Duck”]

Matching Class Animals before applying the component:

disq:uri.uri disq:uri.huri
[A:dog] [HURI(A:dog)]
[A:house_mouse] [HURI(A:house_mouse)]
[A:mouse] [HURI(A:mouse)]

Target Class DisneyCharacters after applying the component:

disq:uri.uri disq:uri.huri animal_name.lit animal.uri animal.err animal.fwd
[D:mickey_mouse] [HURI(D:mickey_mouse)] [“Mouse”, “House Mouse”] [A:mouse, A:house_mouse] [] [HURI(A:mouse), HURI(A:house_mouse)]
[D:pluto] [HURI(D:pluto)] [“Dog”] [A:dog] [] [HURI(A:dog)]
[D:goofy] [HURI(D:goofy)] [“Dog”, “Human”] [A:dog] [A:human] [HURI(A:dog)]
[D:donald_duck] [HURI(D:donald_duck)] [“Duck”] [] [A:duck] []

Matching Class Animals after applying the component:

disq:uri.uri disq:uri.huri animal.rev
[A:dog] [HURI(A:dog)] [HURI(D:goofy), HURI(D:pluto)]
[A:house_mouse] [HURI(A:house_mouse)] [HURI(D:mickey_mouse)]
[A:mouse] [HURI(A:mouse)] [HURI(D:mickey_mouse)]

Observe:

  • The literal value “House Mouse” is converted to URI http://animals.org/house_mouse, according to the options Prefix, To Lowercase and Encoding.
  • Mickey Mouse has two values for the literal identifier, both matching a subject URI in Animals, so both these URIs are written to animal.uri and their hashed values to animal.fwd. Conversely, Mickey’s hashed subject URI (HURI(http://disney.org/mickey_mouse)) is added in animal.rev for both animals.
  • Goofy also has two values for the literal identifier, but one of them (human) has no counterpart in Animals, so that URI ends up in animal.err and only ‘dog’ is added in animal.fwd.
  • There are two dogs, Pluto and Goofy, so animal.rev gets two HURIs. Note that their order is undetermined!
  • Donald does not get linked to an animal, because http://animals.org/duck is not a subject URI in Animals.
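
The matching procedure and the example above can be reproduced with a small Python sketch. It is not the component's code: `make_uri` is a simplified stand-in for the ONTOFORCE encoding (treating a space as one of the replaced characters is an assumption based on the “House Mouse” example), `huri` is a placeholder for the unspecified hash, and each resource is modelled as a dict with a single `disq:uri` value.

```python
from urllib.parse import quote


def make_uri(literal, prefix="", to_lowercase=False):
    """Build a URI from a literal (simplified stand-in for ONTOFORCE encoding)."""
    text = literal.strip()
    if to_lowercase:
        text = text.lower()
    for ch in ";,. /":
        text = text.replace(ch, "_")
    return prefix + quote(text, safe="")


def huri(uri):
    """Placeholder for the hashed URI; the real hash is not specified here."""
    return f"HURI({uri})"


def create_relationship(target, matching, identifier, rel,
                        prefix="", to_lowercase=False):
    """Fill rel.uri, rel.fwd and rel.err in target, and rel.rev in matching."""
    by_uri = {m["disq:uri"]: m for m in matching}
    for m in matching:
        m[rel + ".rev"] = []
    for t in target:
        t[rel + ".uri"], t[rel + ".fwd"], t[rel + ".err"] = [], [], []
        for literal in t[identifier]:
            uri = make_uri(literal, prefix, to_lowercase)
            match = by_uri.get(uri)
            if match is None:
                t[rel + ".err"].append(uri)      # no match: keep for debugging
            else:
                t[rel + ".uri"].append(uri)      # forward link
                t[rel + ".fwd"].append(huri(uri))
                match[rel + ".rev"].append(huri(t["disq:uri"]))  # reverse link


D, A = "http://disney.org/", "http://animals.org/"
characters = [
    {"disq:uri": D + "mickey_mouse", "animal_name": ["Mouse", "House Mouse"]},
    {"disq:uri": D + "pluto", "animal_name": ["Dog"]},
    {"disq:uri": D + "goofy", "animal_name": ["Dog", "Human"]},
    {"disq:uri": D + "donald_duck", "animal_name": ["Duck"]},
]
animals = [{"disq:uri": A + name} for name in ("dog", "house_mouse", "mouse")]

create_relationship(characters, animals, "animal_name", "animal",
                    prefix=A, to_lowercase=True)
```

The sketch appends reverse links in the order the target resources are visited; as noted above, the actual order in animal.rev is undetermined, so nothing should rely on it.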

Options

Target Class

  • Target Class : The class containing the literal predicate used for matching. This class will receive the forward predicate of the link.
  • Matching Predicate : The literal predicate used to match URIs in the Matching Class.
  • Relationship Predicate : The new predicate which will contain the links.
  • Prefix [Optional] : The prefix to be used for the URI.
  • Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: No encoding, Standard URL encoding (e.g. ' ' is converted to '%20'), ONTOFORCE encoding. The ONTOFORCE encoding strips surrounding whitespace, replaces ;,. / characters with underscores and applies standard URL encoding to all other characters.
  • To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is False.
  • Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.

Matching Class

  • Matching Class : The class containing the URIs used for matching. This class will receive the reverse predicate of the link.
  • Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Quality Control

  • Fraction of Unmatched Identifiers [Optional] : The fraction of unmatched URIs. (lower is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.

9.4.11. Create Relationship (by label)

Creates a relationship between 2 classes by matching labels, or another literal predicate.

Description

This component creates relationships (or links) between Target Class and Matching Class, by looking at matching literals.

The created relationship is bidirectional, i.e. it is stored in two predicates:

  • a “forward” predicate in Target Class.
  • a “reverse” predicate in Matching Class.

The name of these predicates is specified in option Relationship Predicate.

In more detail:

  • All values of the literal predicate Matching Predicate in Target Class and all values of the literal predicate Matching Predicate in Matching Class are examined.
  • If a resource T in Target Class has one or more values in common with a resource M in Matching Class, then a relationship is created:
    • The hashed value of the (first) subject URI of M is added to predicate RRR.fwd in the Target Class.
    • The hashed value of the (first) subject URI of T is added to predicate RRR.rev in the Matching Class, where we used RRR to denote the predicate specified in option Relationship Predicate.

By default both predicates Matching Predicate are equal to disq:label.lit (or disq:label for short), because comparing by label is a common operation.

The way literals are compared to each other can be tailored via two options:

  • Case Sensitive determines whether uppercase/lowercase differences matter. For example, if False, “dog” is considered to be equal to “Dog”.
  • Remove Dashes and Spaces determines whether differences due to dashes ('-') or spaces (' ') matter. For example, if True “my-dog” is considered to be equal to “my dog” and to “mydog”.
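The effect of these two options can be illustrated with a small Python sketch. The function names are hypothetical and the normalisation shown (lowercasing, then deleting '-' and ' ') is an assumption about how the comparison could work, not a description of the engine's code. Hashing of the URIs is left out.

```python
def normalize(literal, case_sensitive=False, remove_dashes_and_spaces=False):
    # Bring a literal into the form that is actually compared.
    if not case_sensitive:
        literal = literal.lower()
    if remove_dashes_and_spaces:
        literal = literal.replace("-", "").replace(" ", "")
    return literal

def literals_match(a, b, **options):
    return normalize(a, **options) == normalize(b, **options)

def link_by_label(target, matching, **options):
    """Both arguments map a subject URI to its list of literal values."""
    index = {}
    for m_uri, values in matching.items():
        for value in values:
            index.setdefault(normalize(value, **options), set()).add(m_uri)
    forward = {}
    for t_uri, values in target.items():
        hits = set()
        for value in values:
            hits |= index.get(normalize(value, **options), set())
        forward[t_uri] = sorted(hits)
    return forward

print(literals_match("dog", "Dog"))                                      # True
print(literals_match("dog", "Dog", case_sensitive=True))                 # False
print(literals_match("my-dog", "my dog", remove_dashes_and_spaces=True)) # True
```

Indexing the normalised values of the Matching Class first keeps the lookup for each Target Class value cheap, which is one plausible way to avoid comparing every pair of resources.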

Example

Option Value
Target Class DisneyCharacters
Matching Predicate DEFAULT (disq:label)
Relationship Predicate animal
Matching Class Animals
Matching Predicate name
Case Sensitive False
Remove Dashes and Spaces True

URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Target Class DisneyCharacters before applying the component:

disq:uri.uri disq:label.lit
[D:mickey_mouse] [“Mouse”, “House Mouse”]
[D:pluto] [“Dog”]
[D:goofy] [“Dog”, “Human”]
[D:donald_duck] [“Duck”]

Matching Class Animals before applying the component:

disq:uri.uri name.lit
[A:123] [“dog”]
[A:482] [“house-mouse”]
[A:392] [“mouse”]

Target Class DisneyCharacters after applying the component:

disq:uri.uri disq:label.lit animal.fwd
[D:mickey_mouse] [“Mouse”, “House Mouse”] [HURI(A:392), HURI(A:482)]
[D:pluto] [“Dog”] [HURI(A:123)]
[D:goofy] [“Dog”, “Human”] [HURI(A:123)]
[D:donald_duck] [“Duck”] []

Matching Class Animals after applying the component:

disq:uri.uri name.lit animal.rev
[A:123] [“dog”] [HURI(D:goofy), HURI(D:pluto)]
[A:482] [“house-mouse”] [HURI(D:mickey_mouse)]
[A:392] [“mouse”] [HURI(D:mickey_mouse)]

Options

Target Class

  • Target Class : The class containing the literal predicate used for label matching. This class will receive the forward predicate of the link.
  • Matching Predicate [Optional] : The predicate of the Target Class used for matching. The default value is disq:label.lit.
  • Relationship Predicate : The new predicate which will contain the links.
  • Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.

Matching Class

  • Matching Class : The class to be matched against. This class will receive the reverse predicate of the link.
  • Matching Predicate : The literal predicate of the Matching Class used for matching.
  • Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.

Text Matching

  • Case Sensitive [Optional] : Case sensitive matching of literals. The default value is False.
  • Remove Dashes and Spaces [Optional] : Remove dashes and spaces when matching literals. The default value is False.

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Quality Control

  • Fraction matched [Optional] : The fraction of resources in Matching Class that have been matched successfully. (higher is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.12. Define Datasource

Define the meta-data of a data source.

Description

With this component, the meta-data of a data source is set. Each import component in the pipeline must be assigned to a data source. This data source is then used to indicate the provenance of the imported data. For more details see Data Overview . A single data source can be assigned to multiple import components.

All pipelines in the Data Ingestion Engine start with one or more Define Datasource components. When creating a pipeline, the first thing to do is to import the data, but every import component must be preceded by a Define Datasource component (except the ‘Import Remote Data Set’ component).

‘Outdated’ pipeline run

When executing the pipeline, the Data Ingestion Engine checks if a component is outdated before it executes the component. A Define Datasource component is outdated if, as with all other components, the user has changed an option value, e.g. the Label of the data source.

For the Define Datasource component there is a second way the component can become outdated: by using the Info File. The Info File is a JSON file which contains the modification date of the data source:

{
  "date_modified": "2000-01-01"
}

The location of this file is set via the Info File Path option. The file must be stored where the Data Ingestion Engine can access it, that is, in the source data directory, probably close to the actual source files of the data source.

When the Define Datasource component is executed, the date mentioned in the ‘info’ file is stored by the Data Ingestion Engine. During a subsequent pipeline run, the Data Ingestion Engine compares the modification date in that file with the date stored during the previous run. If the date in the file is more recent than the stored date, the Define Datasource component is flagged as ‘outdated’. The component and its successor components are then executed again, and the new modification date is stored.
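The comparison described above can be pictured with the following Python sketch. It is a simplified illustration with hypothetical names: it assumes the date is written in ISO format (as in the example file) and treats a component without a stored date as outdated, which is an assumption and not documented behaviour.

```python
import json
from datetime import date

def check_info_file(info_file_text, stored_date):
    """Return (outdated, date_to_store) for the content of an Info File."""
    info = json.loads(info_file_text)
    current = date.fromisoformat(info["date_modified"])
    # Assumption: without a stored date (first run) the component must run.
    if stored_date is None or current > stored_date:
        return True, current
    return False, stored_date

previous = date(2000, 1, 1)
print(check_info_file('{"date_modified": "2000-01-01"}', previous))
print(check_info_file('{"date_modified": "2000-02-15"}', previous))
```

The first call reports that nothing changed and keeps the stored date. The second call flags the component as outdated and returns the new date to store.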

Note (1) that this principle is used not only if you select ‘Outdated’ pipeline run, but also in ‘Differential’ and ‘Incremental’ mode.

Note (2) You can also set the modification date using the Modification Date option in the component. This way, you don’t need to create an info file. The effect is the same, since the component becomes outdated whenever you change the date. Info files are, however, very convenient when the download of the data source files is automated. The automated download should then overwrite the info file with the most recent modification date each time the downloaded files are renewed.

If you want more control over when a data source needs to be ‘outdated’, for example if multiple source file updates happen during one day, you can use another parameter in the ‘info’ file. You can specify a so-called version tag, like this:

{
 "date_modified": "2000-01-01",
 "version_tag": "1.14.5"
}

The version tag can be adapted at any time, and can be formatted in any way you like. In fact, the tag does not need to be a string, but can also be an integer or decimal value. Being able to trigger multiple runs based on ‘outdated’ Define Datasource components can be very important when you are using incremental data ingestion.

Options

  • Label : The name of the datasource.
  • Short Label [Optional] : The short label of the datasource.
  • Homepage [Optional] : The URL of the homepage.
  • Description [Optional] : The description of the datasource.
  • Modification Date [Optional] : The last modification date of the datasource.
  • Info File Path [Optional] : It is possible to define properties of a datasource (such as the modification date) using a separate JSON file. This option specifies the relative path of that JSON file. If this option is filled in, the properties will be retrieved from the file and not from within this component. The file must contain the “date_modified” key.

Advanced

  • URI [Optional] : A URI that uniquely identifies the datasource.
  • URI Scheme [Optional] : The URI scheme of the datasource.
  • Example URI Scheme [Optional] : A URI scheme example for the datasource.
  • Visible for Groups [Optional] : The identifiers of the user groups that will have access to this data source. If this option is left empty, the data source will be accessible for everyone.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “There was a problem reading the Info file.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “There was a problem reading the Info file.”. The default value is 1.
  • Minimal count for warning “The file path should be relative to the DISQOVER source_data folder. Absolute paths are deprecated.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The file path should be relative to the DISQOVER source_data folder. Absolute paths are deprecated.”. The default value is 1.

9.4.13. Expand Hierarchical Paths

Expands a forward link to a hierarchical class to a new predicate containing the full paths.

Description

This component expands a forward link to a hierarchical class to a new predicate containing the full paths.

Options

Target Class

  • Target Class : The class containing the relationship predicate to the hierarchical class.
  • Target Relationship Predicate : Predicate containing the relationship to the hierarchical class.
  • Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.
  • Target Path Predicate [Optional] : Predicate to store the generated path.

Hierarchical Class

  • Hierarchical Class : The class containing the parent child relationships.
  • Parent Relationship Predicate : Predicate containing the parent relationship.
  • Hierarchical Class Filter [Optional] : Boolean expression returning true for resources which should be included.

Example

We have a list of Disney characters and information about where they live. We also have a class which contains hierarchical location data:

         United States
           /    \
          /      \
     Calisota    Washagon
     /     \         \
    /       \         \
Duckburg  Mouseton    Zenith

URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Preferred URIs are notated in boldface.

Target Class DisneyCharacters before applying the component:

disq:uri.uri location.fwd
[D:donald_duck] [HURI(L:duckburg)]
[D:mickey_mouse] [HURI(L:mouseton)]
[D:daisy_duck] [HURI(L:duckburg)]

Hierarchical Class DisneyLocations before applying the component:

disq:uri.uri disq:label.lit location.rev parent.fwd parent.rev
[L:duckburg] [“Duckburg”] [HURI(D:donald_duck), HURI(D:daisy_duck)] [HURI(L:calisota)] []
[L:mouseton] [“Mouseton”] [HURI(D:mickey_mouse)] [HURI(L:calisota)] []
[L:zenith] [“Zenith”] [] [HURI(L:washagon)] []
[L:calisota] [“Calisota”] [] [HURI(L:us)] [HURI(L:duckburg), HURI(L:mouseton)]
[L:washagon] [“Washagon”] [] [HURI(L:us)] [HURI(L:zenith)]
[L:us] [“United States”] [] [] [HURI(L:calisota), HURI(L:washagon)]

A predicate location_path.path (value of Target Path Predicate) is added to the Target Class. There are no changes in the Hierarchical class. The Target Path predicate contains the complete hierarchical path of the Parent Relationship Predicate:

disq:uri.uri location_path.path location.fwd
[D:donald_duck] [(HURI(L:duckburg), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] [HURI(L:duckburg)]
[D:mickey_mouse] [(HURI(L:mouseton), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] [HURI(L:mouseton)]
[D:daisy_duck] [(HURI(L:duckburg), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] [HURI(L:duckburg)]
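The construction of the path values in this example can be sketched in Python as follows. This is a simplified illustration with hypothetical names: it assumes every location has at most one parent, works on abbreviated URIs instead of hashed URIs, and raises an exception on a loop where the component would report the warning “Hierarchy contains loops.”.

```python
def expand_path(start, parent_of):
    """Return the (child, parent) steps from start up to the root."""
    steps = []
    seen = {start}
    node = start
    while node in parent_of:
        parent = parent_of[node]
        if parent in seen:
            # Simplification: the component reports a warning instead.
            raise ValueError("Hierarchy contains loops.")
        steps.append((node, parent))
        seen.add(parent)
        node = parent
    return steps

# parent.fwd of the hierarchical class, reduced to a plain dictionary
parent_of = {
    "L:duckburg": "L:calisota",
    "L:mouseton": "L:calisota",
    "L:zenith": "L:washagon",
    "L:calisota": "L:us",
    "L:washagon": "L:us",
}

# location.fwd of the target class
location = {
    "D:donald_duck": "L:duckburg",
    "D:mickey_mouse": "L:mouseton",
    "D:daisy_duck": "L:duckburg",
}

location_path = {who: expand_path(where, parent_of) for who, where in location.items()}
print(location_path["D:mickey_mouse"])
```

Each path is a list of (child, parent) pairs, mirroring the pairs shown in location_path.path above.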

Options

Target Class

  • Target Class : The class containing the relationship predicate to the hierarchical class.
  • Target relationship predicate : Predicate containing the relationship to the hierarchical class.
  • Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.
  • Target path Predicate : Predicate to store the generated path.

Hierarchical Class

  • Hierarchical Class : The class containing the parent child relationships.
  • Parent relationship predicate : Predicate containing the parent relationship.
  • Hierarchical Class Filter [Optional] : A boolean expression returning True for resources in the Hierarchical Class to which the action should be applied.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “Hierarchy contains loops.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Hierarchy contains loops.”. The default value is 1.

9.4.14. Extract Class

Move or copy resources to a new class (all predicates or a selection of predicates).

Description

This component copies or moves resources from a class (Source Class) to a new class (Target Class).

The resources to be copied are specified via the Source Class Filter.

By default all predicates are copied, except auxiliary predicates. Predicates to include allows you to copy only a specific set of predicates. Predicates to exclude allows you to copy all predicates except a specific set. These options cannot be filled in at the same time.

By default copied resources are removed from the Source Class (or more accurately: these resources are disabled). This behavior can be changed with the option Remove Copied Resources, but be aware that this might introduce duplicate URIs.

If all predicates are copied and the source resources are removed, this amounts to moving the resources to the new class, or, in other words, splitting the class (the reverse of merging). This is the default behavior and has the same effect as the component Create Compact Class.
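The interplay of the filter, the include and exclude lists and the removal of copied resources can be sketched in Python. This is an illustration under simplifying assumptions: a class is modelled as a dictionary from subject URI to predicates, the predicate names are made up, auxiliary predicates are ignored, and resources are deleted from the source where the engine would merely disable them.

```python
def extract_class(source, resource_filter, include=None, exclude=None,
                  remove_copied=True):
    """Copy the resources accepted by resource_filter into a new class."""
    if include and exclude:
        raise ValueError("include and exclude cannot both be filled in")
    target = {}
    for uri in list(source):
        predicates = source[uri]
        if not resource_filter(predicates):
            continue
        if include:
            copied = {p: v for p, v in predicates.items() if p in include}
        elif exclude:
            copied = {p: v for p, v in predicates.items() if p not in exclude}
        else:
            copied = dict(predicates)
        target[uri] = copied
        if remove_copied:
            # Simplification: the engine disables the resource instead.
            del source[uri]
    return target

characters = {
    "D:pluto": {"name": ["Pluto"], "animal": ["Dog"]},
    "D:goofy": {"name": ["Goofy"], "animal": ["Dog"]},
    "D:donald_duck": {"name": ["Donald"], "animal": ["Duck"]},
}
dogs = extract_class(characters, lambda p: "Dog" in p["animal"], exclude=["animal"])
print(sorted(dogs), sorted(characters))
```

After the call, the two dogs live in the new class without their animal predicate and only Donald remains in the source class. Passing remove_copied=False would leave all three resources in place, which is how duplicate URIs can arise.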

Advanced

If subject label (disq:label.lit) or preferred label are included in the predicates to be copied, they will both be copied. Likewise, if subject URI (disq:uri.uri), hashed URI, Preferred URI or hashed Preferred URI are included, they will all be copied.

Auxiliary columns are not copied.

Options

Source Class

  • Source Class : Class containing resources to be extracted.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.
  • Predicates to include [Optional] : List of predicates to be copied. Empty means all predicates.
  • Predicates to exclude [Optional] : List of predicates to be excluded from extraction.
  • Remove Copied Resources [Optional] : Remove the original resources after extracting. The default value is True.

Target Class

  • Target Class : New class to which the resources will be extracted.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The source class contains links that can become broken in the extracted class.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The source class contains links that can become broken in the extracted class.”. The default value is 1.

9.4.15. Extract Class (distinct)

Creates a new class containing distinct values derived from a predicate in a given class.

Description

This component extracts distinct values from a literal predicate in Source Class to a new Target Class.

A new Target Class is created (it’s an error if there is already a class with that name), with the following predicates:

  • disq:uri.uri (subject URI)
  • disq:uri.huri (its hashed value)
  • an output predicate defined by Values Predicate (by default disq:label.lit)
  • rdf:type.lit (if Resource Type is filled in)
  • RRR.rev (back link) where RRR is the value of Relationship Predicate.

In the Source Class a forward link is created in

  • RRR.fwd.

The component reads values from Aimed Predicate in Source Class, converts them, one by one, via the expression given in Value Expression (if not empty), and transforms the results to URIs, similar to Add URI:

  • convert to lowercase if To Lowercase is True.
  • URL-encode according to Encoding (unless Prefix is empty).
  • add Prefix in front.

For each unique URI created (extracted) in this way:

  • a resource is created in Target Class.
  • the extracted URI is written to disq:uri.uri, and its hashed value to disq:uri.huri.
  • all literal values which yield this URI are added to Values Predicate.
  • the hashed URIs of all resources contributing to the extracted URI are added to RRR.rev.

Conversely,

  • the hashed URI of each extracted URI is added to RRR.fwd in Source Class for each resource contributing to that extracted URI.

The option Value Expression can be used to transform (or extract from) literals prior to determining unique values. The expression can only depend on 1 string variable called $value and must produce a single string value. A typical use case is importing complete JSON- or XML-blobs and extracting a unique identifier. For example, XmlGetTextFirst($value, "./code") extracts the value of subnode <code> of an XML-node.

If the option Resource Type is filled in, then an extra predicate rdf:type is created (with this value for each resource). See Configure Canonical Type for more information about Resource Types.

For more details about encoding, see Add URI.

Note that the order of extracted resources is undetermined.
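The extraction steps described above can be sketched in Python. This is an illustration only: the names are hypothetical, the Value Expression and Resource Type are left out, the ONTOFORCE character set is our reading of the option description, the prefix is assumed to end in a slash, and plain URIs are stored where the engine stores hashed URIs.

```python
from urllib.parse import quote

def to_uri(value, prefix, to_lowercase=True):
    # Lowercase, encode (assumed ONTOFORCE rules), then add the prefix.
    if to_lowercase:
        value = value.lower()
    cleaned = "".join("_" if ch in ";,. /" else ch for ch in value.strip())
    return prefix + quote(cleaned, safe="")

def extract_distinct(source, prefix):
    """source maps a subject URI to the values of the Aimed Predicate."""
    target = {}   # extracted URI -> {"values": [...], "rev": [...]}
    forward = {}  # source subject -> extracted URIs (the .fwd predicate)
    for subject, literals in source.items():
        forward[subject] = []
        for literal in literals:
            uri = to_uri(literal, prefix)
            entry = target.setdefault(uri, {"values": [], "rev": []})
            if literal not in entry["values"]:
                entry["values"].append(literal)
            if subject not in entry["rev"]:
                entry["rev"].append(subject)
            if uri not in forward[subject]:
                forward[subject].append(uri)
    return target, forward

source = {
    "http://disney.org/pluto": ["dog"],
    "http://disney.org/goofy": ["Dog", "Human"],
    "http://disney.org/donald_duck": [],
}
target, forward = extract_distinct(source, "http://animals.org/")
print(sorted(target))
```

Both “dog” and “Dog” yield the same URI, so a single resource is created that keeps both literal values and links back to both source resources. A resource without values, like the last one, produces no link at all.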

Example

Option Value
Source Class DisneyCharacters
Aimed Predicate animal.lit
Value Expression empty
Relationship Predicate animal
Prefix http://animals.org
Encoding ONTOFORCE encoding
To Lowercase True
Target Class Animals
Values Predicate empty (disq:label.lit by default)
Resource Type http://ontology/animal

URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Source Class DisneyCharacters before applying the component:

disq:uri.uri animal.lit
[D:mickey_mouse] [“Mouse”, “House Mouse”]
[D:pluto] [“dog”]
[D:goofy] [“Dog”, “Human”]
[D:minnie_mouse] [“Mouse”]
[D:donald_duck] []

Source Class DisneyCharacters after applying the component:

disq:uri.uri animal.lit animal.fwd
[D:mickey_mouse] [“Mouse”, “House Mouse”] [HURI(A:mouse), HURI(A:house_mouse)]
[D:pluto] [“dog”] [HURI(A:dog)]
[D:goofy] [“Dog”, “Human”] [HURI(A:dog), HURI(A:human)]
[D:minnie_mouse] [“Mouse”] [HURI(A:mouse)]
[D:donald_duck] [] []

Target Class Animals after applying the component:

disq:uri.uri disq:label.lit rdf:type.lit animal.rev
[A:house_mouse] [“House Mouse”] [“http://ontology/animal”] [HURI(D:mickey_mouse)]
[A:dog] [“Dog”, “dog”] [“http://ontology/animal”] [HURI(D:goofy), HURI(D:pluto)]
[A:mouse] [“Mouse”] [“http://ontology/animal”] [HURI(D:mickey_mouse), HURI(D:minnie_mouse)]
[A:human] [“Human”] [“http://ontology/animal”] [HURI(D:goofy)]

Note:

  • For Relationship Predicate we chose the same name (animal) as the Aimed Predicate; this is not required.
  • “House Mouse” converts to “house_mouse”.
  • Pluto and Goofy both link to A:dog because “dog” and “Dog” both convert to “dog”.

Options

Source Class

  • Class : The class containing the values to be extracted.
  • Aimed Predicate : Literal predicate from which values will be extracted. Each distinct value produces a resource in the Target Class with a URI derived from the value.
  • Value Expression [Optional] : An expression that creates the identifier by transforming the Aimed Predicate. In this expression, the Aimed Predicate is represented by the $value variable. Both the input and output are Strings.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.
  • Relationship Predicate : The predicate containing the created relationship.
  • Prefix [Optional] : The prefix to be used for the URI.
  • Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: Standard URL encoding (e.g. ' ' is converted to '%20'), ONTOFORCE encoding. The ONTOFORCE encoding strips surrounding whitespace, replaces ;,. / characters with underscores and applies standard URL encoding to all other characters.
  • To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is True.

Target Class

  • Class : The new class which will contain the distinct resources.
  • Values Predicate [Optional] : This is the predicate of the Target Class that will contain the distinct values from the Aimed Predicate. The default value is disq:label.lit.
  • Preferred Label selection strategy [Optional] : Determines which value to pick as preferred label when the values predicate is ‘disq:label’ and it has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
  • Resource Type [Optional] : The Resource type for all extracted resources.

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred while processing the Value Expression.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred while processing the Value Expression.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. The check looks for suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.16. Extract Hierarchical Class

Extracts and creates node resources from a “Tree Path” predicate.

Description

This component produces a hierarchical class (or tree class) (Extracted Class) based on parent-child-information in a literal predicate in Target Class.

Terminology

A hierarchical system can be thought of as a collection of nodes, where each node can have zero or more child nodes, and zero or more parent nodes. A tree is a very common hierarchical system in which every node (except the root node) has exactly one parent.

Consider the following example:

       A            G
      / \          /
     /   \        /
    B     C      H
  /   \    \
 /     \    \
D       E    F

This hierarchical system has 8 nodes. Parents are shown above their children, so A is a parent of B and C, B is a parent of D and E etc. There are 2 trees, with root nodes A and G. Nodes without children, like D and H, are called leaf nodes.

For every node we can define its path as the list of parent-nodes we have to follow up until we reach the root of the tree. So the path of D would be [D, B, A], and the path of C would be [C, A].

Interestingly, the hierarchy can be constructed based on the paths of all leaf nodes (in this case [D, B, A], [E, B, A], [F, C, A], [H, G]), and that is exactly what this component does.
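That the leaf paths are sufficient can be checked with a few lines of Python. The sketch below is an illustration with hypothetical names, not the component's code: it walks over each path and records, for every pair of neighbouring nodes, who is the parent and who is the child.

```python
def build_hierarchy(leaf_paths):
    """Rebuild parent and child sets from paths that run from leaf to root."""
    parents = {}
    children = {}
    for path in leaf_paths:
        for node in path:
            parents.setdefault(node, set())
            children.setdefault(node, set())
        # Neighbouring items in a path are (child, parent) pairs.
        for child, parent in zip(path, path[1:]):
            parents[child].add(parent)
            children[parent].add(child)
    return parents, children

paths = [["D", "B", "A"], ["E", "B", "A"], ["F", "C", "A"], ["H", "G"]]
parents, children = build_hierarchy(paths)

roots = sorted(node for node, found in parents.items() if not found)
leaves = sorted(node for node, found in children.items() if not found)
print(roots, leaves)
```

All 8 nodes are recovered, with A and G as roots and D, E, F and H as leaves. Storing the parents as a set also accommodates nodes with more than one parent.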

Implementation

In the Data Ingestion Engine every node will be represented by a resource in the hierarchical class, which has a parent-child relationship with itself. This relationship is conventionally stored in predicates parent.fwd and parent.rev (although the name parent has no special meaning). For each resource, the hashed URI of its parent is stored in parent.fwd, and the hashed URIs of its children in parent.rev.

Following the example above, in resource B parent.fwd would have a single value, namely the hashed URI of A, and parent.rev would have two values, namely the hashed URIs of D and E.

This component creates the hierarchy based on path information that is stored in Literal Predicate, for each leaf node.

Path information is essentially made up of labels and URIs of all nodes in the path (up to the root). In the example above, the path information for node F would be

  • label-of-F, URI-of-F
  • label-of-C, URI-of-C
  • label-of-A, URI-of-A

This is encoded in a single string, separating each item by "||". So for node F that would be:

"label-of-F||URI-of-F||label-of-C||URI-of-C||label-of-A||URI-of-A"

Path information is normally produced by a Transform Literals component. Two special functions are dedicated to this task:

  • CreateTreePath
  • CreatePersonPath

See Tree Utility Functions.
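To make the "||"-separated encoding described above concrete, the following Python sketch splits such a string back into (label, URI) pairs. The function name is hypothetical and the sketch assumes that labels and URIs never contain "||" themselves, since no escaping rule is described here.

```python
def parse_tree_path(encoded):
    """Split a path string into (label, URI) pairs, ordered from leaf to root."""
    items = encoded.split("||")
    if len(items) % 2 != 0:
        raise ValueError("expected alternating label and URI items")
    return list(zip(items[0::2], items[1::2]))

path = parse_tree_path(
    "label-of-F||URI-of-F||label-of-C||URI-of-C||label-of-A||URI-of-A"
)
for label, uri in path:
    print(label, uri)
```

The first pair describes the leaf node and the last pair the root, so neighbouring pairs give the child and parent of each relationship.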

Some special cases may arise if the path information for different nodes is “incompatible”:

  • Different parents:

    "A||||URI_A||P1||URI_P1"
    "A||||URI_A||P2||URI_P2"
    

    In this case node A is mentioned twice, but with different parents. This is not an error, but the resulting hierarchy will not be a pure tree.

  • Different labels:

    "A||||URI_A||P1||URI_P"
    "B||||URI_B||P2||URI_P"
    

    In this case nodes A and B have the same parent P, but give P different labels. In the current implementation only one of the labels will be retained, the other one will be discarded (the choice is arbitrary).
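A minimal sketch of how such incompatibilities could be detected is given below. It is loosely modelled on the two cases above and works on already decoded (label, URI) pairs; the function name find_conflicts and the placeholder URIs are assumptions, and the component's own checks may differ.

```python
from collections import defaultdict

def find_conflicts(paths):
    """Report URIs with more than one parent or more than one label.

    Each path is a list of (label, uri) pairs, leaf first.
    """
    labels = defaultdict(set)   # uri -> labels seen for it
    parents = defaultdict(set)  # uri -> parent URIs seen for it
    for path in paths:
        for label, uri in path:
            labels[uri].add(label)
        for (_, child), (_, parent) in zip(path, path[1:]):
            parents[child].add(parent)
    multi_parent = {u: sorted(ps) for u, ps in parents.items() if len(ps) > 1}
    multi_label = {u: sorted(ls) for u, ls in labels.items() if len(ls) > 1}
    return multi_parent, multi_label

different_parents = [
    [("A", "URI_A"), ("P1", "URI_P1")],
    [("A", "URI_A"), ("P2", "URI_P2")],
]
different_labels = [
    [("A", "URI_A"), ("P1", "URI_P")],
    [("B", "URI_B"), ("P2", "URI_P")],
]
print(find_conflicts(different_parents))
# ({'URI_A': ['URI_P1', 'URI_P2']}, {})
print(find_conflicts(different_labels))
# ({}, {'URI_P': ['P1', 'P2']})
```

The first result corresponds to a hierarchy that is not a pure tree, the second to a node for which only one label can be retained.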

Options

The option Add to existing class determines whether this component should create a new class for the extracted resources, or add to an existing class (normally also produced by another Extract Hierarchical Class component). In the latter case the hierarchy in that class will be expanded using the path information in the Target Class. Note that extracting different classes to a single hierarchy class is preferable over extracting to multiple hierarchies and merging them using Merge Classes.

If the option Resource Type is filled in, then an extra predicate rdf:type is created (with this value for each resource). See Configure Canonical Type for more information about Resource Types.

Example

In this example, a name-tree is extracted from name-information. Special nodes are created for abbreviated names, such as “Duck, D” and “Duck”. “Duck, Donald” and “Duck, Daisy” have the same initial, so both have the same parent “Duck, D”.

            Duck
           /    \
          /      \
     Duck, D      Duck, H
     /     \         \
    /       \         \
Duck,       Duck,     Duck,
Donald      Daisy     Huey

Note: this is not a family tree!

In preparation of this component, the name path information (probably created using the function CreatePersonPath in a Transform Literals component) has been stored in name_path.lit.

Option             Value
Target Class       DisneyCharacters
Literal Predicate  name_path
Link Predicate     name
Extracted Class    DisneyNames

URIs have been abbreviated (using the prefixes D: and N:) and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Target Class DisneyCharacters before applying the component:

disq:uri.uri     name_path.lit
[D:donald_duck]  [“Duck, Donald||N:duck_donald||Duck, D||N:duck_d||Duck||N:duck”]
[D:daisy_duck]   [“Duck, Daisy||N:duck_daisy||Duck, D||N:duck_d||Duck||N:duck”]
[D:huey_duck]    [“Duck, Huey||N:duck_huey||Duck, H||N:duck_h||Duck||N:duck”]

Target Class DisneyCharacters after applying the component (the values of name_path are left out):

disq:uri.uri     name_path.lit  name.fwd
[D:donald_duck]  […]            [HURI(N:duck_donald)]
[D:daisy_duck]   […]            [HURI(N:duck_daisy)]
[D:huey_duck]    […]            [HURI(N:duck_huey)]

Extracted Class DisneyNames after applying the component:

disq:uri.uri     disq:label.lit    name.rev               parent.fwd        parent.rev
[N:duck]         [“Duck”]          []                     []                [HURI(N:duck_d), HURI(N:duck_h)]
[N:duck_d]       [“Duck, D”]       []                     [HURI(N:duck)]    [HURI(N:duck_donald), HURI(N:duck_daisy)]
[N:duck_donald]  [“Duck, Donald”]  [HURI(D:donald_duck)]  [HURI(N:duck_d)]  []
[N:duck_daisy]   [“Duck, Daisy”]   [HURI(D:daisy_duck)]   [HURI(N:duck_d)]  []
[N:duck_h]       [“Duck, H”]       []                     [HURI(N:duck)]    [HURI(N:duck_huey)]
[N:duck_huey]    [“Duck, Huey”]    [HURI(D:huey_duck)]    [HURI(N:duck_h)]  []

Options

Target Class

  • Target Class : The class containing the “Tree Path” predicate to be extracted.
  • Literal Predicate : Literal predicate that contains the “Tree Path”.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.
  • Link Predicate : Predicate to store the forward link in.

Extracted Class

  • Extracted Class : The class to which the resources will be extracted. If this class does not already exist, it will be created.
  • Add to existing class [Optional] : Needs to be turned on if the tree is added to an existing class. The default value is True.
  • Resource Type [Optional] : The Resource type for all extracted resources.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The value for the Tree Path is invalid.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The value for the Tree Path is invalid.”. The default value is 1.
  • Minimal count for warning “The extracted class is not a pure tree because some resources have multiple parents” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The extracted class is not a pure tree because some resources have multiple parents”. The default value is 1.
  • Minimal count for warning “Some resources in the extracted class have multiple labels.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Some resources in the extracted class have multiple labels.”. The default value is 1.
  • Minimal count for warning “Some labels and/or parents in the extracted tree have changed during the incremental run. Any tree facets derived from it will not show the changes correctly.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Some labels and/or parents in the extracted tree have changed during the incremental run. Any tree facets derived from it will not show the changes correctly.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.17. Import CSV

Imports CSV files.

Description

This component imports CSV files. It can be used to import all types of delimiter-separated files by specifying the Delimiter, and it can handle different types of quoting by setting the Quote Character option. The column headers are used to transfer the data in the columns to resource predicates.

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.
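The containment rule can be illustrated with the sketch below. It is an assumption-laden example, not the product's own check: the function name resolve_inside_source_data is hypothetical, the default directory is taken from the text above, and symbolic links and wildcard expansion are deliberately ignored.

```python
import posixpath

SOURCE_DATA_DIR = "/disqover/data/source_data"

def resolve_inside_source_data(file_path, base=SOURCE_DATA_DIR):
    """Return the normalised absolute path, or raise if it leaves base."""
    # Relative paths are interpreted relative to the Source Data directory;
    # joining with an absolute path simply yields that absolute path.
    joined = posixpath.join(base, file_path)
    # Collapse "." and ".." segments before comparing.
    normalised = posixpath.normpath(joined)
    if posixpath.commonpath([base, normalised]) != base:
        raise ValueError(f"{file_path!r} resolves outside {base}")
    return normalised

print(resolve_inside_source_data("movies/disney_movies.csv"))
# /disqover/data/source_data/movies/disney_movies.csv

try:
    resolve_inside_source_data("../other/secret.csv")
except ValueError as err:
    print("rejected:", err)
```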

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following file is imported:

movies/disney_movies.csv

title,release_date,runtime
Snow White,"December 21, 1937",83
Pinocchio,"February 7, 1940",88
Dumbo,"October 23, 1941",64
Bambi,"August 13, 1942",70
Cinderella,"February 15, 1950",74

The file is comma-separated and the quoting used in the values is the double quote, so the default values for Delimiter and Quote Character can be used.

Option       Value
Class Name   DisneyMovies
Data Source  http://disney.org/movies/
Files        movies/disney_movies.csv
Columns      See below

In the file there are three columns, and we want to import each of these columns to a predicate. The header line of the CSV file is used to import these columns to different predicates. For the Columns option use the following:

File Column   Predicate
title         movie:title
release_date  movie:release_date
runtime       movie:runtime

Class DisneyMovies after the CSV import component:

movie:title.lit   movie:release_date.lit   movie:runtime.lit
[“Snow White”]    [“December 21, 1937”]    [“83”]
[“Pinocchio”]     [“February 7, 1940”]     [“88”]
[“Dumbo”]         [“October 23, 1941”]     [“64”]
[“Bambi”]         [“August 13, 1942”]      [“70”]
[“Cinderella”]    [“February 15, 1950”]    [“74”]

Each movie resource has three predicates, each with a single value.
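To make the column-to-predicate mapping concrete, here is a small sketch using Python's standard csv module. It only mimics the outcome shown above: the function import_csv and the dictionary-based resource layout are assumptions, and treating an empty cell as "no value" is likewise an assumption of this sketch.

```python
import csv
import io

CSV_TEXT = '''title,release_date,runtime
Snow White,"December 21, 1937",83
Pinocchio,"February 7, 1940",88
'''

# File Column -> Predicate, as in the Columns option of the example.
COLUMNS = {
    "title": "movie:title",
    "release_date": "movie:release_date",
    "runtime": "movie:runtime",
}

def import_csv(text, columns, delimiter=",", quotechar='"'):
    """Turn each data row into a resource with one literal list per predicate."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter,
                            quotechar=quotechar)
    resources = []
    for row in reader:
        resource = {}
        for column, predicate in columns.items():
            value = row.get(column)
            # All values stay strings; empty cells yield an empty list.
            resource[predicate + ".lit"] = [value] if value else []
        resources.append(resource)
    return resources

for resource in import_csv(CSV_TEXT, COLUMNS):
    print(resource)
# {'movie:title.lit': ['Snow White'], 'movie:release_date.lit': ['December 21, 1937'], 'movie:runtime.lit': ['83']}
# {'movie:title.lit': ['Pinocchio'], 'movie:release_date.lit': ['February 7, 1940'], 'movie:runtime.lit': ['88']}
```

Note that the quoted release dates contain a comma, which is why the quote character matters for this file.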

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. Paths are relative to the Source Data directory of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If the input file does not contain headers, the Field Names option should be used to define headers. A list of sub-options with the following structure:
    • File Column : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. URIs are flagged as suspicious if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Delimiter [Optional] : Delimiter used in the CSV file. The default value is ,.
  • Quote Character [Optional] : Override the default quote character. The default value is ".
  • Escape Character [Optional] : A character that removes any meaning from the character following it.
  • Field Names [Optional] : If the input file does not contain headers, this option defines names for the columns which can then be used in the Columns option.
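The three selection strategies listed under Columns can be pictured with the sketch below. This is only one possible reading: the function pick_preferred is hypothetical, the fallback to the alphabetically first value is interpreted here as a tie-breaker, and "alphabetically" is approximated by Python's default string ordering.

```python
from collections import Counter

def pick_preferred(values, strategy="Alphabetically first"):
    """Pick one value out of a multi-valued literal predicate."""
    if not values:
        return None
    if strategy == "Alphabetically first":
        return min(values)
    if strategy == "Most common":
        counts = Counter(values)
        # Highest count wins; ties go to the alphabetically first value.
        return min(counts, key=lambda v: (-counts[v], v))
    if strategy == "Shortest":
        # Shortest value wins; ties go to the alphabetically first value.
        return min(values, key=lambda v: (len(v), v))
    raise ValueError(f"unknown strategy: {strategy}")

labels = ["Snow White", "Bambi", "Dumbo", "Snow White"]
print(pick_preferred(labels))                 # Bambi
print(pick_preferred(labels, "Most common"))  # Snow White
print(pick_preferred(labels, "Shortest"))     # Bambi
```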

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
  • Ignore Quotes [Optional] : Ignore all quoting. Can be used when the file does not contain consistent quoting. The default value is False.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.18. Import Excel

Imports from Excel spreadsheet files.

Description

This component imports Excel files. It is used to import data from Excel files by specifying the columns that need to be imported. If there is a header row, you can use the header labels (see also the ‘Has Header Row’ option), otherwise you can use the column names (‘A’, ‘B’, …).

All data is imported into a single class, specified in the option Class.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. Paths are relative to the Source Data directory of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If there is a header row (see option Has Header Row) use those headers, otherwise use the column name (‘A’, ‘B’, …). A list of sub-options with the following structure:
    • File Column : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. URIs are flagged as suspicious if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Has Header Row [Optional] : The first row contains column headers. The default value is False.

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
  • Sheet or Cell Range [Optional] : Specifies which data will be imported. It is either a sheet name (in which case the whole sheet is imported), a Defined Name denoting a named range, or a range of the form ‘A1:E4’ or ‘sheet-title!A1:E4’. If option Has Header Row is True, make sure the cell range includes the header row. The default is the complete first sheet.
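The two range forms and the letter-based column names can be illustrated as follows. This is a simplified sketch with hypothetical helper names (column_index, parse_cell_range); it does not handle sheet titles that contain ‘!’, and it makes no attempt to tell sheet names from Defined Names.

```python
import re

# Optional "sheet-title!" prefix followed by a range such as A1:E4.
RANGE_RE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"
    r"(?P<c1>[A-Z]+)(?P<r1>[0-9]+):(?P<c2>[A-Z]+)(?P<r2>[0-9]+)$"
)

def column_index(letters):
    """Convert a column name such as 'A' or 'AB' to a 1-based index."""
    index = 0
    for ch in letters:
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def parse_cell_range(spec):
    """Return (sheet, first_col, first_row, last_col, last_row).

    Returns None when spec is not a cell range, i.e. when it would have
    to be interpreted as a sheet name or a Defined Name.
    """
    match = RANGE_RE.match(spec)
    if match is None:
        return None
    return (
        match["sheet"],
        column_index(match["c1"]), int(match["r1"]),
        column_index(match["c2"]), int(match["r2"]),
    )

print(parse_cell_range("A1:E4"))              # (None, 1, 1, 5, 4)
print(parse_cell_range("sheet-title!A1:E4"))  # ('sheet-title', 1, 1, 5, 4)
print(parse_cell_range("Movies"))             # None
```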

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.19. Import Identifier Block

Imports column-based files, concatenating all subsequent lines which share the same value in the ‘ID Column’ into one resource.

Description

This component is an advanced variant of the Import CSV component. For the main usage, refer to the Import CSV component. It works in a similar way, i.e. it can import any column-based (delimiter-separated) file, but it concatenates subsequent lines which share the value of a column: the ID Column. The rows in the imported file must be ordered by the ID Column for the concatenation to work.

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following file is imported:

movies/disney_characters.csv

character,movie
Snow White,Snow White
Bashful,Snow White
Doc,Snow White
Dopey,Snow White
Grumpy,Snow White
Happy,Snow White
Sleepy,Snow White
Sneezy,Snow White
Geppetto,Pinocchio
Pinocchio,Pinocchio
Dumbo,Dumbo
Jumbo,Dumbo

The header line of the column-based file indicates there are two columns: character and movie. The file contains 12 different movie characters from three different movies. A comma is used to separate the data so the default value for Delimiter can be used. To import this file to the DisneyMovies class, select the movie column as ID Column:

Option       Value
Class Name   DisneyMovies
Data Source  http://disney.org/movies/
Files        movies/disney_characters.csv
ID Column    movie
Columns      See below

For the Columns option use the following:

File Column  Predicate
movie        movie:title
character    movie:character

Class DisneyMovies after execution:

movie:title.lit   movie:character.lit
[“Snow White”]    [“Snow White”, “Bashful”, “Doc”, “Dopey”, “Grumpy”, “Happy”, “Sleepy”, “Sneezy”]
[“Pinocchio”]     [“Geppetto”, “Pinocchio”]
[“Dumbo”]         [“Dumbo”, “Jumbo”]

The resources have been grouped per movie title and each movie has a list of characters in the movie:character.lit predicate.
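The grouping behaviour, and the reason the rows must be ordered by the ID Column, can be sketched with itertools.groupby, which likewise only merges consecutive rows with the same key. The function import_identifier_blocks is hypothetical, and dropping repeated values within a block (so that the title appears once, as in the table above) is an assumption of this sketch.

```python
import csv
import io
from itertools import groupby

CSV_TEXT = """character,movie
Snow White,Snow White
Bashful,Snow White
Geppetto,Pinocchio
Pinocchio,Pinocchio
Dumbo,Dumbo
Jumbo,Dumbo
"""

COLUMNS = {"movie": "movie:title", "character": "movie:character"}

def import_identifier_blocks(text, id_column, columns):
    """Merge consecutive rows sharing the ID column into one resource."""
    reader = csv.DictReader(io.StringIO(text))
    resources = []
    # groupby starts a new group whenever the key changes, so rows with
    # the same ID that are not adjacent end up in separate resources.
    for _, rows in groupby(reader, key=lambda row: row[id_column]):
        resource = {predicate + ".lit": [] for predicate in columns.values()}
        for row in rows:
            for column, predicate in columns.items():
                values = resource[predicate + ".lit"]
                if row[column] not in values:
                    values.append(row[column])
        resources.append(resource)
    return resources

for resource in import_identifier_blocks(CSV_TEXT, "movie", COLUMNS):
    print(resource)
# {'movie:title.lit': ['Snow White'], 'movie:character.lit': ['Snow White', 'Bashful']}
# {'movie:title.lit': ['Pinocchio'], 'movie:character.lit': ['Geppetto', 'Pinocchio']}
# {'movie:title.lit': ['Dumbo'], 'movie:character.lit': ['Dumbo', 'Jumbo']}
```

If the Dumbo rows were split by rows of another movie, this sketch would produce two separate Dumbo resources.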

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. Paths are relative to the Source Data directory of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If the input file does not contain headers, the Field Names option should be used to define headers. A list of sub-options with the following structure:
    • File Column : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. URIs are flagged as suspicious if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • ID Column : The column that will be used as identifier. All consecutive lines with the same value for “ID column” will be aggregated into a single instance.
  • Delimiter [Optional] : Delimiter used in the CSV file. The default value is ,.
  • Quote Character [Optional] : Override the default quote character. The default value is ".
  • Escape Character [Optional] : A character that removes any meaning from the character following it.
  • Field Names [Optional] : If the input file does not contain headers, this option defines names for the columns which can then be used in the Columns option.
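
The aggregation performed by the ID Column option can be pictured with a small Python sketch. It is only an illustration, not the component's implementation: the function name aggregate_by_id and the sample data are hypothetical, and collapsing repeated values into one entry is an assumption. The delimiter and quotechar arguments play the role of the Delimiter and Quote Character options.

```python
import csv
import io
from itertools import groupby


def aggregate_by_id(text, id_column, delimiter=",", quotechar='"'):
    """Merge consecutive rows that share the same ID into one instance.

    Returns a list of dicts mapping each column name to a list of values.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    instances = []
    # groupby only merges adjacent rows, which mirrors "consecutive lines".
    for _, rows in groupby(reader, key=lambda row: row[id_column]):
        instance = {}
        for row in rows:
            for column, value in row.items():
                values = instance.setdefault(column, [])
                # Assumption: a repeated value is stored only once.
                if value not in values:
                    values.append(value)
        instances.append(instance)
    return instances


# Hypothetical input: ID 1 appears in two separate runs of lines.
SAMPLE = "id,title,actor\n1,Dumbo,A\n1,Dumbo,B\n2,Bambi,C\n1,Dumbo,D\n"
result = aggregate_by_id(SAMPLE, "id")
```

In this sketch the last line of the sample is not merged with the first two, because the lines are not consecutive. Sorting the file on the ID column beforehand avoids such split instances.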

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that each represent an empty or missing value. These values are imported as no value.
  • Ignore Quotes [Optional] : Ignore all quoting. Can be used when the file does not contain consistent quoting. The default value is False.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.20. Import JSON

Imports JSON files.

Description

This component imports resources from JSON files. The JSON file can simply contain a list of resources but it can also have a more complicated structure. In that case the list of resources to be imported from the file can be determined by setting the Instance Path.

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.
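
The rule that every file-path must resolve to a location inside the Source Data directory can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the actual check performed by DISQOVER: the function name resolve_in_source_dir is hypothetical, and the demo uses a temporary directory instead of the real Source Data directory.

```python
import glob
import os
import tempfile
from pathlib import Path


def resolve_in_source_dir(pattern, source_dir):
    """Expand a wildcard pattern and keep only files inside source_dir."""
    root = Path(source_dir).resolve()
    matches = []
    # Relative patterns are interpreted relative to the source directory.
    for name in sorted(glob.glob(os.path.join(str(root), pattern))):
        resolved = Path(name).resolve()
        # Reject anything that escapes the root, e.g. via ".." or a symlink.
        if root in resolved.parents:
            matches.append(resolved)
    return matches


with tempfile.TemporaryDirectory() as base:
    source = Path(base) / "source_data"
    (source / "movies").mkdir(parents=True)
    (source / "movies" / "disney_movies.json").write_text("{}")
    (Path(base) / "secret.json").write_text("{}")

    inside = resolve_in_source_dir("movies/*.json", source)
    outside = resolve_in_source_dir("../secret.json", source)
```

Here inside contains the path of disney_movies.json, while outside is empty because the pattern points to a file above the source directory.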

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

For each predicate you configure in the importer component, you can specify whether the predicate is to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, which you can use for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following file is imported:

movies/disney_movies.json

{
    "movies":
    [
      {
        "title": "Snow White",
        "release_date": "December 21, 1937",
        "runtime": 83
      },
      {
        "title": "Pinocchio",
        "release_date": "February 7, 1940",
        "runtime": 88
      },
      {
        "title": "Dumbo",
        "release_date": "October 23, 1941",
        "runtime": 64
      },
      {
        "title": "Bambi",
        "release_date": "August 13, 1942",
        "runtime": 70
      },
      {
        "title": "Cinderella",
        "release_date": "February 15, 1950",
        "runtime": 74
      }
    ]
}

The file contains a list of entries, each representing a movie, and each entry has three fields.

Option Value
Class Name DisneyMovies
Data Source http://disney.org/movies/
Files movies/disney_movies.json
Instance Path movies
Columns See below

Each instance entry in the JSON list has three fields. To import each of these fields to a predicate use the following for the Columns option:

File Column Predicate
title movie:title
release_date movie:release_date
runtime movie:runtime

Class DisneyMovies after the JSON import component:

movie:title.lit movie:release_date.lit movie:runtime.lit
[“Snow White”] [“December 21, 1937”] [“83”]
[“Pinocchio”] [“February 7, 1940”] [“88”]
[“Dumbo”] [“October 23, 1941”] [“64”]
[“Bambi”] [“August 13, 1942”] [“70”]
[“Cinderella”] [“February 15, 1950”] [“74”]

Each movie resource has three predicates, each with a single value.
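
The mapping from JSON fields to predicates in this example can be mimicked with a short Python sketch. It is illustrative only and makes several assumptions: the function name import_json is hypothetical, a dotted string is used for nested instance paths, and values are converted to strings to match the literal values in the table above. Only two of the five movies are included to keep the sketch short.

```python
import json

DOCUMENT = """
{"movies": [
  {"title": "Snow White", "release_date": "December 21, 1937", "runtime": 83},
  {"title": "Pinocchio", "release_date": "February 7, 1940", "runtime": 88}
]}
"""

# File Column -> Predicate, as in the Columns option of the example.
COLUMNS = {
    "title": "movie:title",
    "release_date": "movie:release_date",
    "runtime": "movie:runtime",
}


def import_json(text, instance_path, columns):
    """Return one dict per instance, mapping predicate names to value lists."""
    node = json.loads(text)
    # Walk down to the list of instances (dotted paths are an assumption).
    for key in filter(None, instance_path.split(".")):
        node = node[key]
    resources = []
    for entry in node:
        resource = {}
        for field, predicate in columns.items():
            if entry.get(field) is not None:
                # Each predicate holds a list of literal (string) values.
                resource[predicate + ".lit"] = [str(entry[field])]
        resources.append(resource)
    return resources


resources = import_json(DOCUMENT, "movies", COLUMNS)
```

The first element of resources then holds the three predicates of the Snow White row, with the runtime stored as the string "83".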

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Columns : The properties to import from the input file(s). This requires the JSON path or key and the name of the predicate in which the information will be stored. The JSON keys are relative to the instance path. A list of sub-options with the following structure:
    • File Column : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Resource Path [Optional] : The JSON path to the resources.
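
The three selection strategies for a preferred label or preferred URI (Alphabetically first, Most common, Shortest) can be sketched in Python as below. This is a hedged illustration: the function name pick_preferred is hypothetical, and reading the fallback to the alphabetically first value as a tie-breaker is an assumption about the component's behaviour.

```python
from collections import Counter


def pick_preferred(values, strategy="Alphabetically first"):
    """Pick one value out of a multi-valued literal predicate."""
    if not values:
        return None
    if strategy == "Most common":
        counts = Counter(values)
        # Highest count wins; ties are broken alphabetically (assumption).
        return min(counts, key=lambda value: (-counts[value], value))
    if strategy == "Shortest":
        # Shortest string wins; ties are broken alphabetically (assumption).
        return min(values, key=lambda value: (len(value), value))
    # Default strategy, also used for unknown strategy names.
    return min(values)


most_common = pick_preferred(["Dumbo", "Bambi", "Dumbo"], "Most common")
shortest = pick_preferred(["Snow White", "Dumbo", "Bambi"], "Shortest")
first = pick_preferred(["Dumbo", "Bambi"])
```

With these inputs, most_common is "Dumbo", while shortest and first are both "Bambi".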

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that each represent an empty or missing value. These values are imported as no value.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.21. Import RDF (DEPRECATED)

Imports RDF files.

Description

This component imports RDF files (e.g. turtle or ntriples). In contrast to the CSV, JSON and XML importers, the resources in the RDF files already have a URI and an RDF type, so these will be set during import. The resources to be imported are determined by specifying the Selected RDF Type: only resources with the following triple will be imported:

<subject> a <selected_rdf_type> .

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.

All imported resources are assigned a Resource Type (rdf:type) during import. If the option Resource Type is filled in, that value is used. If the option Resource Type is not filled in, the RDF types from the files are used. Note: since Selected RDF Type allows importing multiple RDF types, different imported resources can have different Resource Types. See Configure Canonical Type for more information about Resource Types.

For each predicate you configure in the importer component, you can specify whether the predicate is to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, which you can use for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following turtle file is imported:

/data/source_data/movies/disney_movies.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/ontology/> .
@prefix disney: <http://disney.org/movies/> .
@prefix prop: <http://disney.org/properties#> .

disney:snow_white a dbpedia:film ;
    rdfs:label "Snow White" ;
    prop:release_date "1937-12-21" ;
    prop:runtime "83" .

disney:pinocchio a dbpedia:film ;
    rdfs:label "Pinocchio" ;
    prop:release_date "1940-02-07" ;
    prop:runtime "88" .

disney:dumbo a dbpedia:film ;
    rdfs:label "Dumbo" ;
    prop:release_date "1941-10-23" ;
    prop:runtime "64" .

disney:bambi a dbpedia:film ;
    rdfs:label "Bambi" ;
    prop:release_date "1942-08-13" ;
    prop:runtime "70" .

disney:cinderella a dbpedia:film ;
    rdfs:label "Cinderella" ;
    prop:release_date "1950-02-15" ;
    prop:runtime "74" .

This file contains five http://dbpedia.org/ontology/film instances we want to import. The data contains type specifiers; these can be removed during import by enabling the Remove Type Specifiers option. The Class Type is set to http://ns.ontoforce.com#movie to use a new ontology.

Option Value
Class Name DisneyMovies
Data Source http://disney.org/movies/
Files /data/source_data/movies/disney_movies.ttl
File Type turtle
Selected RDF Type http://dbpedia.org/ontology/film
Remove Type Specifiers True
Properties See below

Each http://dbpedia.org/ontology/film subject has three properties we want to import. To import each of these properties to a predicate use the following for the Properties option:

File Property Predicate
<http://www.w3.org/2000/01/rdf-schema#label> movie:title
<http://disney.org/properties#release_date> movie:release_date
<http://disney.org/properties#runtime> movie:runtime

Note that the angle brackets (< >) around the URIs are optional.

Class DisneyMovies after the RDF import component:

disq:uri.uri movie:title.lit movie:release_date.lit movie:runtime.lit
[D:snow_white] [“Snow White”] [“1937-12-21”] [“83”]
[D:pinocchio] [“Pinocchio”] [“1940-02-07”] [“88”]
[D:dumbo] [“Dumbo”] [“1941-10-23”] [“64”]
[D:bambi] [“Bambi”] [“1942-08-13”] [“70”]
[D:cinderella] [“Cinderella”] [“1950-02-15”] [“74”]

URIs have been abbreviated:

  • ‘D:’ stands for http://disney.org/movies/

In contrast to the CSV, JSON and XML import components, the resources are assigned a URI during import. A (new) RDF type has been set for each of the imported resources.

Other cases

If the RDF file to import contains multiple RDF types, e.g. http://dbpedia.org/ontology/film and http://dbpedia.org/ontology/movie, we can import both types at once using the following options.

Option Value
Selected RDF Type [http://dbpedia.org/ontology/film, http://dbpedia.org/ontology/movie]

The default predicate used to select resources from an RDF file is http://www.w3.org/1999/02/22-rdf-syntax-ns#type, commonly written as a. If we want to select resources by another predicate, we can set it via the RDF Type Predicate option.

If we have a file with movies produced by Walt Disney:

/data/source_data/movies/movie_collection.ttl

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix person: <http://www.w3.org/ns/person#> .
@prefix disney: <http://disney.org/movies/> .
@prefix prop: <http://disney.org/properties#> .

disney:snow_white prop:produced_by person:disney_w ;
    rdfs:label "Snow White" ;
    prop:release_date "1937-12-21" ;
    prop:runtime "83" .

disney:pinocchio prop:produced_by person:disney_w ;
    rdfs:label "Pinocchio" ;
    prop:release_date "1940-02-07" ;
    prop:runtime "88" .

We can import the movies as follows:

Option Value
RDF Type predicate <http://disney.org/properties#produced_by>
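
The selection logic described in this section can be summarised in a small Python sketch that works on plain (subject, predicate, object) tuples. It is an illustration under stated assumptions, not the importer's code: the function name select_subjects is hypothetical, and using the producer URI as the selected type in the second call is an assumption, since the option table above only lists the RDF Type predicate.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PRODUCED_BY = "http://disney.org/properties#produced_by"


def select_subjects(triples, selected_types, type_predicate=RDF_TYPE):
    """Return the subjects whose type predicate points to a selected type."""
    selected = set(selected_types)
    subjects = []
    for subject, predicate, obj in triples:
        if predicate == type_predicate and obj in selected and subject not in subjects:
            subjects.append(subject)
    return subjects


TRIPLES = [
    ("http://disney.org/movies/snow_white", RDF_TYPE, "http://dbpedia.org/ontology/film"),
    ("http://disney.org/movies/snow_white", "http://www.w3.org/2000/01/rdf-schema#label", "Snow White"),
    ("http://disney.org/movies/pinocchio", PRODUCED_BY, "http://www.w3.org/ns/person#disney_w"),
]

# Default behaviour: select by rdf:type.
by_type = select_subjects(TRIPLES, ["http://dbpedia.org/ontology/film"])

# Alternative: select by another predicate (assumed selected value).
by_producer = select_subjects(
    TRIPLES,
    ["http://www.w3.org/ns/person#disney_w"],
    type_predicate=PRODUCED_BY,
)
```

In this sketch by_type contains only the Snow White URI and by_producer contains only the Pinocchio URI.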

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Properties : The properties to import from the input file(s). This requires the existing RDF predicate name and the name of the predicate in which the information will be stored. A list of sub-options with the following structure:
    • File Property : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • Predicate Type : The type of object predicate. The possible values are: Literal, URI.
    • Target Classes : Target Classes for this predicate (only applicable if it is an object URI). The default value is [].
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Selected RDF Type [Optional] : The type or types of the instances in the source RDF data. The default value is [].
  • File Type : The RDF file format of the imported file(s). The possible values are: ntriples, rdfxml, rdfxml-xmp, rdfxml-abbrev, rss-1.0, atom, dot, json-triples, json, html, nquads, turtle.
  • RDF Type Predicate [Optional] : The RDF predicate used to select type. The default value is http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
  • Remove Type Specifiers [Optional] : Remove the type and language specifiers of the objects during import. The default value is False.

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that each represent an empty or missing value. These values are imported as no value.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.22. Import RDF (multiple classes)

Imports RDF files into multiple classes and creates internal relationships.

Description

This component can import RDF-data, store it in multiple classes, and automatically create relationships present in the RDF-data.

Execution of the component triggers one or more single-class Import RDF (DEPRECATED) sub-components, followed by zero or more Create Relationship (by identifier) components.

For the most part, the options of this component correspond to the options of component Import RDF (DEPRECATED). However, you can add more than one target class. Some options, like Selected RDF Type or Properties, can differ from class to class, while other options, like Files and Data Source, apply to all classes.

A typical use case is to import resources corresponding to different RDF types into different classes.

The component can also create relationships between resources. If the RDF-data contains a triple

s1 p1 o1 .

then the relationship s1 -> o1 (forward and backward link) can be created if s1 and o1 both have a type:

s1 a t1 .
o1 a t2 .

and if both types (t1 and t2) are imported (in the same or in different classes). Such a relationship could be called internal. In order for this relationship to be created, Predicate Type (in this case of predicate p1) has to be set to ‘URI’, and one or more Target Classes have to be filled in (in this case the only target class would be the class into which type t2 is imported).

External relationships (i.e. relationships between resources not imported by this component) still need to be created using component Create Relationship (by identifier).

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.

All imported resources are assigned an RDF type during import. When the Class Type is not set, the resources are assigned their original RDF types. Since Selected RDF Type allows importing multiple RDF types, not all imported resources will have the same RDF type. When the Class Type is set, all resources are assigned the new RDF type. This option can be specified for each class.

You can specify for each predicate you configure in the importer whether you want the predicate to be used as a (preferred) label. Note that you can have multiple labels for a resource, but only one preferred label.

For more advanced use cases you can use the designated Add Label component.

At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following turtle file is imported:

movies/pixar_movies.ttl

@prefix pred: <http://pixar/predicates/> .
@prefix type: <http://pixar/types/> .
@prefix movie: <http://pixar/movies/> .
@prefix char: <http://pixar/movies/character/> .

movie:toystory a type:movie ;
    pred:title "Toy Story" ;
    pred:year "1995" .
movie:walle a type:movie ;
    pred:title "Wall E" ;
    pred:year "2008" .
movie:findingnemo a type:movie ;
    pred:title "Finding Nemo" ;
    pred:year "2003" .

char:nemo a type:character ;
    pred:name "Nemo" ;
    pred:debut movie:findingnemo .
char:buzzlightyear a type:character ;
    pred:name "Buzz Lightyear" ;
    pred:debut movie:toystory .
char:mrincredible a type:character ;
    pred:name "Mr. Incredible" ;
    pred:debut movie:incredibles .

It contains information about Pixar movies and about characters appearing in those movies. Each character has a predicate pred:debut that links it to the movie it first appeared in.

Option Value
Data Source http://movies.org
Files movies/pixar_movies.ttl
File Type turtle
Class Options See below

In order to import the movies to class movies and characters to class characters, add two Class Options:

Option Value
Class movies
Selected RDF Type http://pixar/types/movie
Properties See below

with properties

File Property Predicate Predicate Type
<http://pixar/predicates/title> mov:title Literal
<http://pixar/predicates/year> mov:year Literal

and

Option Value
Class characters
Selected RDF Type http://pixar/types/character
Properties See below

with properties

File Property Predicate Predicate Type Target Classes
<http://pixar/predicates/name> char:name Literal  
<http://pixar/predicates/debut> char:debut URI movies

These settings will trigger two single-class Import RDF (DEPRECATED) sub-components (one creating class movies and one creating class characters), followed by a Create Relationship (by identifier) component, establishing the debut-link between the two classes.

Class movies after executing the component:

disq:uri.uri mov:title.lit mov:year.lit char:debut.rev
[M:toystory] [“Toy Story”] [“1995”] [HURI(C:buzzlightyear)]
[M:walle] [“Wall E”] [“2008”] []
[M:findingnemo] [“Finding Nemo”] [“2003”] [HURI(C:nemo)]

Class characters after executing the component:

disq:uri.uri char:name.lit char:debut.fwd char:debut.err char:debut.uri
[C:nemo] [“Nemo”] [HURI(M:findingnemo)] [] [M:findingnemo]
[C:buzzlightyear] [“Buzz Lightyear”] [HURI(M:toystory)] [] [M:toystory]
[C:mrincredible] [“Mr. Incredible”] [] [M:incredibles] [M:incredibles]

URIs have been abbreviated:

  • ‘M:’ stands for http://pixar/movies/
  • ‘C:’ stands for http://pixar/movies/character/

Next to the predicates shown in the tables above, each class will also have a predicate disq:uri.huri with hashed URIs, and a predicate rdf:type.lit with value http://pixar/types/movie for each resource in movies, and with value http://pixar/types/character for each resource in characters.
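
The way the debut links end up in the .fwd, .rev and .err predicates of the two tables above can be reproduced with a short Python sketch. It is a simplified illustration, not the Create Relationship (by identifier) implementation: the function name link_debuts is hypothetical, and abbreviated URIs are used directly instead of hashed URIs.

```python
def link_debuts(characters, movie_uris):
    """Resolve character -> movie links against the imported movies.

    characters maps a character URI to the URI of its debut movie.
    Returns three dicts: forward links, unresolved targets, reverse links.
    """
    fwd, err = {}, {}
    rev = {uri: [] for uri in movie_uris}
    for char, target in characters.items():
        if target in rev:
            # Target was imported: create forward and backward link.
            fwd[char] = [target]
            err[char] = []
            rev[target].append(char)
        else:
            # Target is unknown: keep it as an error value only.
            fwd[char] = []
            err[char] = [target]
    return fwd, err, rev


MOVIES = ["M:toystory", "M:walle", "M:findingnemo"]
CHARACTERS = {
    "C:nemo": "M:findingnemo",
    "C:buzzlightyear": "M:toystory",
    "C:mrincredible": "M:incredibles",
}

fwd, err, rev = link_debuts(CHARACTERS, MOVIES)
```

As in the tables, Mr. Incredible gets no forward link because M:incredibles is not among the imported movies, and Wall E has no reverse link because no character debuts in it.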

Options

  • Data Source : URI for the data source.
  • Files : Full paths to files to import. May contain wildcards (e.g. ‘*’).
  • File Type : The RDF file format of the imported file(s). The possible values are: ntriples, rdfxml, rdfxml-xmp, rdfxml-abbrev, rss-1.0, atom, dot, json-triples, json, html, nquads, turtle.
  • Class Options [Optional] : The options for each generated class. A list of sub-options with the following structure:
    • Class : The class containing the stored data.
    • RDF Type Predicate : The RDF predicate used to select type. The default value is http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
    • Selected RDF Type : The type or types of the instances in the source RDF data. The default value is [].
    • Properties : The properties to import from the input file(s). A list of sub-options with the following structure:
      • File Property : Field in the file.
      • Predicate : The destination predicate for the imported field.
      • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
      • Predicate Type : The type of object predicate. The possible values are: Literal, URI.
      • Target Classes : Target Classes for this predicate (only applicable if it is an object URI). The default value is [].
      • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
      • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
      • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
      • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
      • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
    • Resource Type : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.

Advanced

  • Remove Type Specifiers [Optional] : Remove the type and language specifiers of the objects during import. The default value is False.
  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All imported instances will get an extra data source, which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.23. Import Remote Data Set

Import a Remote Data Set from a Remote Data Subscription

Description

This component imports a remote data set from a remote data subscription (see Subscriber for how to subscribe to a remote data publisher). The desired data set is selected using the identifier of the remote data publisher and the name of the remote data set. All data is imported into a single class, specified in the option Class.

Options

  • Remote Data Publisher Identifier [Optional] : The identifier of the Remote Data Publisher to import from.
  • Remote Data Set [Optional] : The name of the Remote Data Set to import.
  • Class [Optional] : The class containing the data (by default equal to Remote Data Set).
  • Predicates [Optional] : The predicates to import from the Remote Data Set. If empty, all predicates will be imported. A list of sub-options with the following structure:
    • Data Set Predicate : Name of the predicate in the Remote Data Set.
    • Predicate : Name of the resulting imported predicate. If empty, it will be equal to Data Set Predicate.
    • Auxiliary : Make the imported predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
  • Predicate Prefix [Optional] : String that is used as a prefix in the name of all imported predicates that are not explicitly renamed and that have no predicate prefix yet. E.g. if Predicate Prefix is ‘abc’ then predicate ‘p’ will be named ‘abc:p’ unless it is explicitly renamed, but predicate ‘xyz:q’ will not be changed.
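
The naming rule for Predicate Prefix can be illustrated with a short Python sketch. This is not the component's implementation: the function name imported_predicate_name and the renames mapping are hypothetical, and the sketch assumes that "has a predicate prefix" simply means that the name contains a colon.

```python
def imported_predicate_name(name, prefix="", renames=None):
    """Return the name under which a data set predicate would be imported.

    Hypothetical helper: `renames` maps a Data Set Predicate to the explicit
    Predicate name chosen by the user (empty or missing means no rename).
    """
    renames = renames or {}
    # An explicit rename takes precedence over the prefix.
    if renames.get(name):
        return renames[name]
    # Assumption: a name that contains ':' already has a predicate prefix.
    if not prefix or ":" in name:
        return name
    return f"{prefix}:{name}"


for predicate in ("p", "xyz:q"):
    print(predicate, "->", imported_predicate_name(predicate, prefix="abc"))
```

With Predicate Prefix set to 'abc', the sketch reproduces the case described above: 'p' becomes 'abc:p', while 'xyz:q' is left unchanged.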

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.24. Import Separator Block

Imports files chunked by a separator line.

Description

This component imports files which contain data separated by a Separation Line. The data between two consecutive separation lines (a data block) is added to a single resource in the specified class. All data is imported to a single predicate determined by the Chunk Predicate option. Each line in the data block is added as a single value to the Chunk Predicate.

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.
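
The path restriction described above can be pictured with the following Python sketch. It is only an illustration of the idea, not the check performed by the Data Ingestion Engine: the function name resolve_inside_source_dir is hypothetical, the directory is the default mentioned above, and wildcard expansion and the warning for absolute paths are left out.

```python
from pathlib import Path

# Default Source Data directory mentioned in the text (configurable in practice).
SOURCE_DATA_DIR = Path("/disqover/data/source_data")


def resolve_inside_source_dir(file_path, base=SOURCE_DATA_DIR):
    """Resolve file_path against base and reject anything outside base.

    Hypothetical helper: relative paths are taken relative to base,
    absolute paths are only accepted if they point inside base.
    """
    base = Path(base).resolve()
    # Joining with an absolute path replaces base, so both cases are covered.
    candidate = (base / file_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"{file_path!r} resolves outside {base}")
    return candidate


print(resolve_inside_source_dir("movies/disney_movies.sdf"))
```

A path such as ../secrets.txt or /etc/passwd raises a ValueError in this sketch, because it resolves to a location outside the Source Data directory.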

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which opens a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following file is imported:

movies/disney_movies.sdf

title: "Snow White"
release_date: "December 21, 1937"
runtime: 83
-----
title: "Pinocchio"
release_date: "February 7, 1940"
runtime: 88
-----
title: "Dumbo"
release_date: "October 23, 1941"
runtime: 64
-----
title: "Bambi"
release_date: "August 13, 1942"
runtime: 70
-----
title: "Cinderella"
release_date: "February 15, 1950"
runtime: 74

This file contains five chunks of data, each containing information about one movie. The chunks are separated by -----. To import all data blocks to the batch_data predicate in the class DisneyMovies, use the following options:

Option Value
Class Name DisneyMovies
Data Source http://disney.org/movies/
Files movies/disney_movies.sdf
Chunk Predicate batch_data
Separation Line -----

Class DisneyMovies after the Separator Block import component:

batch_data.lit
[‘title: “Snow White”’, ‘release_date: “December 21, 1937”’, ‘runtime: 83’]
[‘title: “Pinocchio”’, ‘release_date: “February 7, 1940”’, ‘runtime: 88’]
[‘title: “Dumbo”’, ‘release_date: “October 23, 1941”’, ‘runtime: 64’]
[‘title: “Bambi”’, ‘release_date: “August 13, 1942”’, ‘runtime: 70’]
[‘title: “Cinderella”’, ‘release_date: “February 15, 1950”’, ‘runtime: 74’]

All data is imported to a single predicate: batch_data.lit. Note that all values are literals; therefore this component is usually followed by a Transform Literals component.
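
The chunking behaviour of the example can be approximated in a few lines of Python. This is a sketch under assumptions, not the component's code: the function name split_blocks is hypothetical, and the handling of blank lines and empty blocks (both dropped here) is a choice made for the sketch, since the text above does not specify it.

```python
def split_blocks(text, separation_line="-----", trim=True):
    """Split text into data blocks; each block is a list of line values."""
    blocks, current = [], []
    for line in text.splitlines():
        if trim:
            # Mirrors the option "Trim leading/trailing whitespace".
            line = line.strip()
        if line == separation_line:
            # A separation line closes the current block.
            if current:
                blocks.append(current)
            current = []
        elif line:
            # Assumption: blank lines are skipped.
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


sample = 'title: "Snow White"\nruntime: 83\n-----\ntitle: "Pinocchio"\nruntime: 88\n'
for block in split_blocks(sample):
    # Each block corresponds to the values of the Chunk Predicate of one resource.
    print({"batch_data.lit": block})
```

Each returned block plays the role of one resource, with every line of the block stored as a separate value of the Chunk Predicate.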

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
  • Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Separation Line : The line format that separates chunks.
  • Chunk Predicate [Optional] : Predicate to write chunks to. The default value is main.
  • Trim leading/trailing whitespace [Optional] : Remove leading and trailing whitespace from each line. The default value is True.

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All imported instances will get an extra data source, which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.

9.4.25. Import XML

Imports XML files.

Description

This component imports resources from XML files.

The list of resources to be imported from the file is determined by setting the Instance X Path.

All data is imported into a single class, specified in the option Class Name.

Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.

The option Files is used to specify which files to import. This is a list of file-paths. It is possible to use wildcards (like ‘*’). As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator). The use of absolute paths is discouraged and will cause a warning. Relative paths are relative to the Source Data directory.

All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.

You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.

At the top of the component view, next to the button Save Changes, there is a button which opens a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.

Example

As an example, the following file is imported:

movies/disney_movies.xml

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <row>
    <title>Snow White</title>
    <release_date>December 21, 1937</release_date>
    <runtime>83</runtime>
  </row>
  <row>
    <title>Pinocchio</title>
    <release_date>February 7, 1940</release_date>
    <runtime>88</runtime>
  </row>
  <row>
    <title>Dumbo</title>
    <release_date>October 23, 1941</release_date>
    <runtime>64</runtime>
  </row>
  <row>
    <title>Bambi</title>
    <release_date>August 13, 1942</release_date>
    <runtime>70</runtime>
  </row>
  <row>
    <title>Cinderella</title>
    <release_date>February 15, 1950</release_date>
    <runtime>74</runtime>
  </row>
</root>

The file contains a list of entries, each representing a movie, and each movie entry has three properties to import.

Option Value
Class Name DisneyMovies
Data Source http://disney.org/movies/
Files movies/disney_movies.xml
Instance X Path ./root/row
X Paths See below

Each instance entry in the XML file has three nodes with text. To import each of these nodes to a predicate, use the following for the X Paths option:

File X Path XPath Type Predicate
./title Text movie:title
./release_date Text movie:release_date
./runtime Text movie:runtime

Class DisneyMovies after the XML import component:

movie:title.lit movie:release_date.lit movie:runtime.lit
[“Snow White”] [“December 21, 1937”] [“83”]
[“Pinocchio”] [“February 7, 1940”] [“88”]
[“Dumbo”] [“October 23, 1941”] [“64”]
[“Bambi”] [“August 13, 1942”] [“70”]
[“Cinderella”] [“February 15, 1950”] [“74”]

Each movie resource has three predicates which all have a single value.
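
To make the roles of the instance path and the per-property XPaths concrete, the following Python sketch mimics the import with the standard library module xml.etree.ElementTree. It is an approximation, not the component's implementation: the function name import_xml is hypothetical, only text nodes are handled, and the sample is shortened to two movies. Note that ElementTree evaluates paths relative to the root element, so the instance path is written ./row here instead of ./root/row.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<root>
  <row>
    <title>Snow White</title>
    <release_date>December 21, 1937</release_date>
    <runtime>83</runtime>
  </row>
  <row>
    <title>Pinocchio</title>
    <release_date>February 7, 1940</release_date>
    <runtime>88</runtime>
  </row>
</root>"""

# File X Path -> Predicate, as in the X Paths table above.
X_PATHS = {
    "./title": "movie:title",
    "./release_date": "movie:release_date",
    "./runtime": "movie:runtime",
}


def import_xml(xml_text, instance_path, x_paths):
    """Return one dict per instance, mapping predicate names to value lists."""
    root = ET.fromstring(xml_text)
    resources = []
    for instance in root.findall(instance_path):
        resource = {}
        for file_x_path, predicate in x_paths.items():
            # The per-property paths are evaluated relative to the instance node.
            values = [node.text for node in instance.findall(file_x_path) if node.text]
            resource[predicate + ".lit"] = values
        resources.append(resource)
    return resources


movies = import_xml(XML_TEXT, "./row", X_PATHS)
for movie in movies:
    print(movie)
```

As in the table above, all values come out as text, including the runtime.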

Note that if your XML contains explicit prefixes:

<?xml version="1.0" encoding="UTF-8"?>
<dis:root xmlns:dis="http://disney.org/movies">
  <dis:row>
    <dis:title>Snow White</dis:title>
  </dis:row>
</dis:root>

Then these prefixes should be included in the imported XPaths, as well as being defined in the Prefix Dictionary:

File X Path:
./dis:row/dis:title

Prefix Dictionary:
dis: 'http://disney.org/movies'

If your XML lacks explicit prefixes, but does declare a default namespace at the beginning of the file:

<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://disney.org/movies">
  <row>
    <title>Snow White</title>
  </row>
</root>

Then a prefix of your choice should be defined for that namespace in the Prefix Dictionary and manually added to your XPaths:

File X Path:
./anything:row/anything:title

Prefix Dictionary:
anything: 'http://disney.org/movies'
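
The reason a prefix is needed even when the file itself shows none can be demonstrated with Python's standard xml.etree.ElementTree module. This sketch is independent of the component: it assumes that the root element declares http://disney.org/movies as its default namespace (xmlns="..."), and the dictionary PREFIXES plays the role of the Prefix Dictionary.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace is declared as a default namespace on the root.
XML_TEXT = """<root xmlns="http://disney.org/movies">
  <row>
    <title>Snow White</title>
  </row>
</root>"""

# Equivalent of the Prefix Dictionary: the prefix name is arbitrary.
PREFIXES = {"anything": "http://disney.org/movies"}

root = ET.fromstring(XML_TEXT)

# Every element lives in the default namespace, so unprefixed paths match nothing.
unprefixed = root.findall("./row/title")

# With a prefix bound to the namespace URI, the same nodes are found.
prefixed = [node.text for node in root.findall("./anything:row/anything:title", PREFIXES)]

print(len(unprefixed), prefixed)
```

The prefix name itself carries no meaning; only the namespace URI it is bound to has to match the one declared in the file.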

Options

  • Class : The name of the new class which will contain the imported data.
  • Data Source : The URI of the data source.
  • Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
  • X Paths : The list of properties to import from the input file(s). This requires the XML path and the name of the predicate in which the information will be stored. The XML paths are relative to the instance path. A list of sub-options with the following structure:
    • File X Path : Field in the file.
    • Predicate : The destination predicate for the imported field.
    • Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
    • XPath Type : Type of XPath. The possible values are: Text, Node, Attribute.
    • Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
    • Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
    • Prefix : The prefix to be used for the URI.
    • New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
    • Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
    • Use as label : Use the predicate as a label for the resources in the class. The default value is False.
    • Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
    • New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
    • Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
    • Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
  • Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
  • Resource XPath : The XPath to the resources.
  • Prefix Dictionary [Optional] : A list of prefixes used in the XML file.

Advanced

  • Filename Predicate [Optional] : Predicate in which the filename will be stored.
  • Data Source per Instance [Optional] : All imported instances will get an extra data source, which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
  • Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
  • Allow Huge Tree [Optional] : Allow importing files with very deep trees. Make sure the files are from trusted sources: setting this option disables protection against certain malicious XML content. The default value is False.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
  • Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
  • Minimal count for warning “One or more files do not contain any resources as defined by the Resource XPath.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “One or more files do not contain any resources as defined by the Resource XPath.”. The default value is 1.
  • Minimal count for warning “The XPath expression could not be evaluated.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The XPath expression could not be evaluated.”. The default value is 1.

9.4.26. Infer by Relationship (DEPRECATED)

Infers a new predicate using an existing relationship.

See Infer by Relationship (multiple predicates)

Options

Target Class

  • Target Class : The class containing the relationship.
  • Relationship Predicate (existing) : Link predicate (either fwd or rev) linking Target Class to Relationship Class.
  • Resulting Predicate : The resulting predicate. Its type must match that of Aimed Predicate (existing).
  • Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.

Relationship Class

  • Relationship Class : Relationship class.
  • Aimed Predicate (existing) : Predicate to infer, either a literal or a link (fwd or rev) to Aimed Relationship Class (optional).
  • Relationship Class Filter [Optional] : Boolean expression returning true for resources which should be included.

Aimed Relationship Class (optional)

  • Aimed Relationship Class (optional) [Optional] : Aimed relationship class (only relevant if Aimed Predicate (existing) is a link).

Quality Control

  • Fraction of resources with at least one match [Optional] : The fraction of filtered resources in Target Class which matched at least one resource in Relationship Class. (higher is better)

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.27. Infer by Relationship (multiple predicates)

Infers new predicates using an existing relationship.

Description

This component copies information from resources in one class (Relationship Class) to resources in another class (Target Class) if there is a relationship between the resources.

From the statements
Mickey Mouse is a mouse
and
all mice are mortal
we can infer that
Mickey Mouse is mortal.

In the Data Ingestion Engine, “is a mouse” would correspond to a relationship from class DisneyCharacters to class Animals, while “is mortal” would be a literal predicate (“Yes” or “No”) in class Animals. The conclusion (inference) that Mickey Mouse is mortal is implicitly present in this data. This component can be used to make the inference “explicit”, i.e. to transfer the literal predicate “is mortal” to class DisneyCharacters.

The component works like this:

  • It looks at the relationship from Target Class to Relationship Class, specified in Relationship Predicate.
  • If a relationship exists between resource T in the Target Class and resource M in the Relationship Class, then all values of Aimed Predicate of M are added to the (new) predicate Resulting Predicate of T.

Note that we wrote “relation from … to …”, whereas relationships are bidirectional in general: forward and reverse hashed URIs are stored in different predicates in both involved classes. This component uses one of these relationship predicates, either forward or reverse. In case of ambiguity, the user can specify the direction in Relationship Predicate by adding extension .fwd or .rev.

Note that it is also possible to infer information within a single class, if a class has a relationship with itself, e.g. a parent-child relationship.

Two cases can be distinguished:

  1. infer a literal
  2. infer a relationship

The second case is a bit more complicated. Let’s look at inferring literals first.

1. Inferring a literal

This is the simplest case. Aimed Predicate should be a literal predicate (.lit), and Aimed Relationship Class should be left empty.

In the tables below, URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Target Class DisneyCharacters before applying the component:

disq:uri.uri animal.fwd
[D:mickey_mouse] [HURI(A:mouse)]
[D:hades] [HURI(A:god)]

Relationship Class Animals before applying the component:

disq:uri.uri is_mortal.lit
[A:mouse] [“Yes”]
[A:god] [“No”]

Target Class DisneyCharacters after applying the component:

disq:uri.uri animal.fwd mortal.lit
[D:mickey_mouse] [HURI(A:mouse)] [“Yes”]
[D:hades] [HURI(A:god)] [“No”]

Relationship Class Animals is unchanged.
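
The literal case can be summarised in a small Python sketch. It is a simplified model, not the component's implementation: classes are represented as dictionaries keyed by abbreviated URI, links hold those URIs directly instead of hashed URIs (HURI), filters are ignored, and the function name infer_literal is hypothetical.

```python
def infer_literal(target, relationship, relationship_predicate,
                  aimed_predicate, resulting_predicate):
    """Copy the values of aimed_predicate from linked resources to the target."""
    for resource in target.values():
        # Follow every link from the target resource T to a resource M.
        for linked_uri in resource.get(relationship_predicate, []):
            linked = relationship.get(linked_uri)
            if linked is None:
                continue
            # All values of the Aimed Predicate of M are added to T.
            values = linked.get(aimed_predicate, [])
            if values:
                resource.setdefault(resulting_predicate, []).extend(values)
    return target


disney_characters = {
    "D:mickey_mouse": {"animal.fwd": ["A:mouse"]},
    "D:hades": {"animal.fwd": ["A:god"]},
}
animals = {
    "A:mouse": {"is_mortal.lit": ["Yes"]},
    "A:god": {"is_mortal.lit": ["No"]},
}

infer_literal(disney_characters, animals, "animal.fwd", "is_mortal.lit", "mortal.lit")
for uri, resource in disney_characters.items():
    print(uri, resource)
```

Only the Target Class dictionary is modified, which matches the statement that the Relationship Class is unchanged.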

2. Inferring a relationship

This case is a bit more complicated as it involves 3 classes, and because we want to preserve bidirectionality of the relationships.

Option Aimed Predicate is now a relationship predicate, either forward (.fwd) or reverse (.rev), which defines a relationship from Relationship Class to Aimed Relationship Class.

Option Aimed Relationship Class should be filled in.

The mechanism is the same as above, except that:

  • instead of copying values from a literal predicate, we now copy (HURI) values from the relationship predicate Aimed Predicate to Resulting Predicate in Target Class, thus creating a relationship from Target Class to Aimed Relationship Class
  • back links, from Aimed Relationship Class to Target Class, are stored in a predicate Resulting Predicate, but with opposite extension: if the copied Aimed Predicate is forward, then the back link is reverse, and vice versa.

In the tables below, URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Target Class DisneyCharacters before applying the component:

disq:uri.uri animal.fwd
[D:mickey_mouse] [HURI(A:mouse)]
[D:hades] [HURI(A:god)]

Relationship Class Animals before applying the component:

disq:uri.uri genus.fwd
[A:mouse] [HURI(G:Mus)]
[A:god] [HURI(G:Deus)]

Aimed Relationship Class Genus before applying the component:

disq:uri.uri
[G:Homo]
[G:Deus]
[G:Mus]

Target Class DisneyCharacters after applying the component:

disq:uri.uri animal.fwd gen.fwd
[D:mickey_mouse] [HURI(A:mouse)] [HURI(G:Mus)]
[D:hades] [HURI(A:god)] [HURI(G:Deus)]

Relationship Class Animals is unchanged.

Aimed Relationship Class Genus after applying the component:

disq:uri.uri gen.rev
[G:Homo] []
[G:Deus] [HURI(D:hades)]
[G:Mus] [HURI(D:mickey_mouse)]

Notes:

  • For the Resulting Predicate we used the name gen, to avoid confusion with predicate genus, but in principle we could have used genus.
  • Because the Aimed Predicate is a forward link (genus.fwd), the Resulting Predicate is also forward (.fwd is added automatically to gen), and the back link is reverse (gen.rev)
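
The relationship case, including the back links, can be sketched in Python as follows. As before, this is a simplified model and not the component's implementation: classes are dictionaries keyed by abbreviated URI, links store those URIs directly instead of hashed URIs, filters are ignored, and the function name infer_relationship is hypothetical.

```python
OPPOSITE = {"fwd": "rev", "rev": "fwd"}


def infer_relationship(target, relationship, aimed, relationship_predicate,
                       aimed_predicate, resulting_name):
    """Copy links from aimed_predicate to the target and store the back links."""
    # The Resulting Predicate gets the same extension as the Aimed Predicate,
    # the back link gets the opposite one.
    extension = aimed_predicate.rsplit(".", 1)[1]
    forward_key = f"{resulting_name}.{extension}"
    back_key = f"{resulting_name}.{OPPOSITE[extension]}"
    for target_uri, resource in target.items():
        for linked_uri in resource.get(relationship_predicate, []):
            linked = relationship.get(linked_uri, {})
            for aimed_uri in linked.get(aimed_predicate, []):
                # Link from Target Class to Aimed Relationship Class.
                resource.setdefault(forward_key, []).append(aimed_uri)
                # Back link from Aimed Relationship Class to Target Class.
                if aimed_uri in aimed:
                    aimed[aimed_uri].setdefault(back_key, []).append(target_uri)


disney_characters = {
    "D:mickey_mouse": {"animal.fwd": ["A:mouse"]},
    "D:hades": {"animal.fwd": ["A:god"]},
}
animals = {
    "A:mouse": {"genus.fwd": ["G:Mus"]},
    "A:god": {"genus.fwd": ["G:Deus"]},
}
genus = {"G:Homo": {}, "G:Deus": {}, "G:Mus": {}}

infer_relationship(disney_characters, animals, genus, "animal.fwd", "genus.fwd", "gen")
print(disney_characters)
print(genus)
```

In this model G:Homo simply receives no gen.rev entry, which corresponds to the empty value shown in the table above.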

Options

Target Class

  • Target Class : The class containing the relationship.
  • Relationship Predicate (existing) : Link predicate (either fwd or rev) linking Target Class to Relationship Class.
  • Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.

Relationship Class

  • Relationship Class : The class which contains the (literal or link) predicate to be inferred to the target class. It should also be linked to the target class.
  • Relationship Class Filter [Optional] : Boolean expression returning true for resources which should be included.

Predicates

  • Predicates [Optional] : The predicates to infer. A list of sub-options with the following structure:
    • Aimed Predicate (existing) : Predicate to infer, either a literal or a link (fwd or rev) to Aimed Relationship Class.
    • Resulting Predicate : The resulting predicate. Its type must match that of Aimed Predicate.
    • Aimed Relationship Class (optional) : Aimed relationship class (only relevant if Aimed Predicate is a link).

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Quality Control

  • Fraction of resources with at least one match [Optional] : The fraction of filtered resources in Target Class which matched at least one resource in Relationship Class. (higher is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.28. Map Classes (by label)

Add a URI to the target class in case a literal predicate matches another literal predicate in the matching class.

Description

This component copies the Preferred URI (disq:uri.puri) from resources in class Matching Class to resources in class Target Class if they contain matching literals.

This component is normally only used in preparation of a Merge Classes component, which can then take over the remaining predicates.

More in detail:

  • Values from predicate Matching Literal in Matching Class are compared to values from predicate Matching Predicate in Target Class.
  • If, for some resource M in Matching Class and some resource T in Target Class, one or more literal values of M are equal to one or more literal values from T, then the Preferred URI (disq:uri.puri) of resource M is added to the subject URIs (disq:uri.uri) of resource T.

By default Matching Literal and Matching Predicate are equal to disq:label.lit (or disq:label for short), because comparing by label is a common operation.

The way literals are compared to each other can be tailored via two options:

  • Case Sensitive determines whether uppercase/lowercase differences matter. For example, if False, “dog” is considered to be equal to “Dog”.
  • Remove Dashes and Spaces determines whether differences due to dashes ('-') or spaces (' ') matter. For example, if True “my-dog” is considered to be equal to “my dog” and to “mydog”.
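The effect of these two options can be sketched as a small normalization step applied to both literals before they are compared. The helper names `normalize_literal` and `literals_match` are hypothetical and only illustrate the behavior described above; they are not part of the Data Ingestion Engine.

```python
def normalize_literal(value: str, case_sensitive: bool = False,
                      remove_dashes_and_spaces: bool = False) -> str:
    """Return the comparison key for a literal (hypothetical helper)."""
    key = value
    if remove_dashes_and_spaces:
        # "my-dog", "my dog" and "mydog" all become "mydog"
        key = key.replace("-", "").replace(" ", "")
    if not case_sensitive:
        # "Dog" and "dog" become the same key
        key = key.lower()
    return key


def literals_match(a: str, b: str, **options) -> bool:
    """Two literals match when their normalized keys are identical."""
    return normalize_literal(a, **options) == normalize_literal(b, **options)
```

With both options at their default value (False), only the case difference is ignored: “dog” matches “Dog”, but “my-dog” does not match “my dog”.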

In former versions of the Data Ingestion Engine only the one-to-one case was supported. Now the many-to-many case is supported as well: if a resource in Matching Class matches with multiple resources in Target Class, then, by default, its Preferred URI is added to each of those target resources. Conversely, if multiple resources in Matching Class match with a resource in Target Class, then, by default, all their Preferred URIs are added to the target resource. This default behavior can be changed back to the old behavior by setting option Mapping Strategy equal to (Deprecated) Single to Pick One. If there are no many-to-many matches, this option has no effect.
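The default many-to-many behavior can be sketched as follows. This is a simplified illustration, not the actual implementation: resources are modeled as plain dictionaries with hypothetical keys `puri`, `literals` and `uris`, and literals are compared by exact string equality for brevity.

```python
from collections import defaultdict


def map_by_literal(matching_resources, target_resources):
    """Add the Preferred URI of every matching resource to every target
    resource that shares at least one literal with it."""
    # Index: literal -> set of Preferred URIs in the Matching Class
    index = defaultdict(set)
    for m in matching_resources:
        for literal in m["literals"]:
            index[literal].add(m["puri"])

    for t in target_resources:
        found = set()
        for literal in t["literals"]:
            found |= index.get(literal, set())
        # sorted() only makes the sketch deterministic; the real value
        # order in a predicate is arbitrary
        for uri in sorted(found):
            if uri not in t["uris"]:
                t["uris"].append(uri)
    return target_resources
```

If two matching resources both carry the literal “Nemo”, a target resource with that literal receives both Preferred URIs, and a matching resource whose literal occurs in several target resources contributes its Preferred URI to each of them.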

Example

Option Value
Target Class MyHeroes
Matching Predicate name
Matching Class DisneyCharacters
Matching Literal DEFAULT (disq:label)
Case Sensitive False
Remove Dashes and Spaces True

URIs have been abbreviated.

Target Class MyHeroes before applying the component:

name.lit
[“mickey-mouse”]
[“john-snow”]

Matching Class DisneyCharacters before applying the component:

disq:uri.puri disq:uri.phuri disq:label.lit
[D:mickey_mouse] [HURI(D:mickey_mouse)] [“Mickey Mouse”]
[D:pluto] [HURI(D:pluto)] [“Pluto”]

Target Class MyHeroes after applying the component:

name.lit disq:uri.uri disq:uri.huri
[“mickey-mouse”] [D:mickey_mouse] [HURI(D:mickey_mouse)]
[“john-snow”] []  

The Matching Class is unchanged.

Options

Target Class

  • Target Class : The class to be matched. If the predicate from the Target Class matches the matching predicate of the Matching Class, the URI of that resource in the Matching Class will be added to the Target Class.
  • Matching Predicate [Optional] : The predicate in the Target Class to be used for matching. The default value is the label (disq:label.lit).
  • Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.

Matching Class

  • Matching Class : The class to be screened during the mapping.
  • Matching Predicate [Optional] : The predicate in the Matching Class to be used. The default value is the label (disq:label.lit).
  • Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.

Matching

  • Mapping Strategy [Optional] : The mapping strategy which will be used in case of multiple hits. The possible values are: Multiple to Multiple, (Deprecated) Single to Pick One.
  • Case Sensitive [Optional] : Match literals in a case sensitive way. The default value is False.
  • Remove Dashes and Spaces [Optional] : Remove dashes and spaces when matching literals. The default value is False.

Advanced

  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Quality Control

  • Fraction of matched destination resources [Optional] : The fraction of resources in Target Class that matched successfully. (higher is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “Matching literal in the Matching Class appears in multiple resources with different Preferred URI (further warnings about the same matching literal are suppressed).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Matching literal in the Matching Class appears in multiple resources with different Preferred URI (further warnings about the same matching literal are suppressed).”. The default value is 1.
  • Minimal count for warning “Resource from Target Class matches literals associated with multiple Preferred URIs in Matching Class. (this literal will be excluded from matching)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Resource from Target Class matches literals associated with multiple Preferred URIs in Matching Class. (this literal will be excluded from matching)”. The default value is 1.
  • Minimal count for warning “Multiple resources from Target Class match literals associated with the same URI in Matching Class. (only one of these resources will receive new URIs)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Multiple resources from Target Class match literals associated with the same URI in Matching Class. (only one of these resources will receive new URIs)”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.29. Merge Classes

Moves resources from source class to target class.

Description

This component transfers resources from Source Class to Target Class.

If a source resource has a URI that is already present in a resource of Target Class (a match), then the predicate values of the source resource are added to the target resource predicates. Otherwise the source resource is copied to a new target resource. Predicates in Source Class that don’t exist in Target Class are added.

Transferred resources are deactivated. If no Source filter was used, this component will leave Source Class completely “empty”.

Note that, as always in the Data Ingestion Engine, the order of values in a predicate is arbitrary, so merged values are not simply added after the original values.

If a resource is matched, it is possible that its Preferred URI in Source Class is different from its Preferred URI in Target Class, and likewise for Preferred Label. By default, the preferredness of Target Class takes precedence. This can be changed via option Source Class takes precedence.

For a discussion about preferredness, see Add URI.

If multiple source resources match with one target resource, then the predicate values of all these source resources are added to the target resource. This is a many-to-one merge. An alternative for merging many-to-one, but within a single class, is component Merge within Class.

Conversely, if a source resource matches with multiple target resources, then the predicate values of the source resource are added to one of the target resources (an arbitrary choice). This is a one-to-many merge, which is best avoided.

For a discussion about duplicate URIs, see below.

Because this component matches resources based on their URIs, the involved classes are often prepared to yield similar URIs. This can be achieved with components such as Add URI or Map Classes (by label).

Example

Option Value
Source Class PixarCharacters
Target Class DisneyCharacters

URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Source Class PixarCharacters before applying the component:

disq:uri.uri disq:label.lit movie.lit
[P:remy] [“Remy”] [“Ratatouille”]
[P:dory, D:dory] [“Dory”] [“Finding Nemo”]
[P:wall_e, D:wall_e] [“Wall-E”] [“Wall-E”]
[P:nemo, D:nemo] [] [“Finding Nemo”]
[P:nemo, D:nemo] [“nemo”] [“Finding Nemo”]

Target Class DisneyCharacters before applying the component:

disq:uri.uri disq:label.lit year.lit
[D:mickey_mouse] [“Mickey Mouse”] [“1928”]
[D:wall_e] [“Wall E”] [“2008”]
[D:nemo] [] [“2003”]

Target Class DisneyCharacters after applying the component:

disq:uri.uri disq:label.lit year.lit movie.lit
[D:mickey_mouse] [“Mickey Mouse”] [“1928”] []
[D:wall_e, P:wall_e] [“Wall E”, “Wall-E”] [“2008”] [“Wall-E”]
[D:nemo, P:nemo] [“nemo”] [“2003”] [“Finding Nemo”]
[P:remy] [“Remy”] [] [“Ratatouille”]
[P:dory, D:dory] [“Dory”] [] [“Finding Nemo”]

Observe:

  • A new predicate movie.lit is created in Target Class DisneyCharacters.
  • Mickey Mouse didn’t have a counterpart in PixarCharacters, so its existing predicates are untouched. The new predicate movie.lit is left empty.
  • Remy only has a URI starting with P:, so it cannot match a URI in DisneyCharacters. As a result a new resource is created in DisneyCharacters and all predicates are copied. Predicate year.lit only exists in DisneyCharacters, so it is left empty.
  • Dory does have a URI starting with D: (a potential match), but it doesn’t match with any URI, so everything is copied to a new resource, like Remy. Note that the preferred URI is carried over.
  • Wall-E in PixarCharacters has a URI which matches with a resource in DisneyCharacters, so its values for disq:uri and disq:label.lit are added there, and its value for movie.lit is copied. Note that the Preferred URI and Label are not overridden!
  • Both Nemos in PixarCharacters are merged into the same existing resource. This shows that this component can help in dealing with “Duplicate URI” problems. Note that the original label was empty, so the copied label is taken to be the Preferred Label.

Source Class PixarCharacters after applying the component (all resources are deactivated):

active disq:uri.uri disq:label.lit movie.lit
NO [P:remy] [“Remy”] [“Ratatouille”]
NO [P:dory, D:dory] [“Dory”] [“Finding Nemo”]
NO [P:wall_e, D:wall_e] [“Wall-E”] [“Wall-E”]
NO [P:nemo, D:nemo] [] [“Finding Nemo”]
NO [P:nemo, D:nemo] [“nemo”] [“Finding Nemo”]
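The merge logic of this example can be approximated with the following sketch. It makes several simplifying assumptions: a resource is a plain dictionary from predicate name to a list of values, the key `active` is reserved for the activation flag, and the handling of Preferred URI and Preferred Label is left out entirely. The function name `merge_classes` is hypothetical.

```python
URI = "disq:uri.uri"


def merge_classes(source, target):
    """Move all active source resources into target, matching on shared URIs."""
    for src in source:
        if not src.get("active", True):
            continue
        src_uris = set(src[URI])
        # A match is a target resource that shares at least one URI
        match = next(
            (t for t in target if src_uris & set(t.get(URI, []))), None
        )
        if match is None:
            # No match: the source resource becomes a new target resource
            match = {"active": True}
            target.append(match)
        for predicate, values in src.items():
            if predicate == "active":
                continue
            # Predicates that don't exist yet in the target are added
            merged = match.setdefault(predicate, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        # Transferred resources are deactivated
        src["active"] = False
    return target
```

In this sketch a source resource that matches several target resources is merged into the first one found, which stands in for the arbitrary choice made by the component.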

Dealing with Duplicate URIs

If a resource in the Source Class has multiple URIs which match with different resources in the Target Class, then the merge operation can introduce duplicate URIs.

For example: if Target Class contains a resource with URI P:young_nemo and another resource with URI P:old_nemo (so it considers Young Nemo and Old Nemo to be different resources), but Source Class contains a single resource with URIs [P:nemo, P:young_nemo, P:old_nemo] (so it considers Young and Old Nemo to be equivalent), then after merging both target resources will get all three URIs.

This situation (introduced duplicate URIs) is detected during execution of this component, and can be remedied, see further.
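As an illustration of what “duplicate URIs” means here, the following sketch lists every URI that occurs in more than one resource of a class. The function name `find_duplicate_uris` and the representation of a class as a list of URI lists are assumptions made for this example only.

```python
from collections import defaultdict


def find_duplicate_uris(resources):
    """Return {uri: [resource indexes]} for URIs found in more than one resource.

    `resources` is a list in which each entry is the list of subject URIs
    of one resource.
    """
    owners = defaultdict(list)
    for index, uris in enumerate(resources):
        for uri in set(uris):
            owners[uri].append(index)
    return {uri: idx for uri, idx in owners.items() if len(idx) > 1}
```

For the Nemo example, the Target Class has no duplicates before merging (`[["P:young_nemo"], ["P:old_nemo"]]`), but once both resources carry all three URIs, each of the three URIs is reported as a duplicate.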

Note that the situation is more complicated if the Source Class or the Target Class already have duplicate URIs (different resources with the same URI) before merging:

  • Source Class resources with duplicate URIs which match with a Target Class resource will all be merged with that resource.
  • Source Class resources with duplicate URIs which don’t match with any Target Class resource are moved one by one to the Target Class, without merging.
  • If the Target Class has duplicate URIs, then these are, in principle, not merged by this component.

Duplicate URIs create problems: the corresponding resources will not be published by Publish in DISQOVER (unless Publish malformed instances is switched on). This is typically remedied by adding an extra Merge within Class (operating on the Target Class) after merging Source Class to Target Class. To achieve this, you can either explicitly add a Merge within Class component in the pipeline after this component, or use option Add Merge within Class in this component to automatically do this extra merge step.

The option Add Merge within Class takes the following values:

  • Merge if needed: if duplicate URIs are introduced, automatically execute an extra Merge within Class after execution of this component.
  • Warning: if duplicate URIs are introduced, issue a warning.
  • Suppress warning: if duplicate URIs are introduced, don’t issue a warning.
  • Always Merge: automatically execute an extra Merge within Class after execution of this component, regardless of whether duplicate URIs were introduced.

Take into account that this extra merge step takes extra time to execute, and that it can introduce arbitrary value order in predicates (see Merge within Class).

Note that if neither the Source Class nor the Target Class contain duplicate URIs before merging, then option value Merge if needed guarantees that the Target Class will not have duplicate URIs after execution. Option value Always Merge has the same effect, but may do unnecessary work. If the Source Class or the Target Class already contain duplicate URIs, only the option value Always Merge guarantees that the Target Class will not have duplicate URIs after execution.
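The four option values can be summarized as a small decision table. The sketch below encodes it in two hypothetical functions; it only restates the behavior described above and says nothing about how the component is actually implemented.

```python
def needs_extra_merge(option: str, duplicates_introduced: bool) -> bool:
    """Should an extra Merge within Class run after Merge Classes?"""
    if option == "Always Merge":
        return True
    if option == "Merge if needed":
        return duplicates_introduced
    if option in ("Warning", "Suppress warning"):
        return False
    raise ValueError(f"unknown option value: {option!r}")


def issues_warning(option: str, duplicates_introduced: bool) -> bool:
    """Only the value 'Warning' reports introduced duplicate URIs."""
    return option == "Warning" and duplicates_introduced
```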

Performance considerations

This component copies resources from the Source Class to the Target Class, so performance is best when the Source Class is smaller (i.e. contains fewer resources) than the Target Class. If you choose the smaller class as Source Class for this reason but still want its Preferred URI and Preferred Label to be retained when present, use the option Source Class Takes Precedence.

Options

  • Preferred URI selection strategy [Optional] : Which value to pick as preferred URI when merging resources with a different preferred URI. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
  • Preferred label selection strategy [Optional] : Which value to pick as preferred label when merging resources with a different preferred label. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
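A possible reading of the three selection strategies is sketched below. The function `pick_preferred` is hypothetical, and the tie-breaking rule (the alphabetically first value among equally good candidates) is an assumption based on the documented fallback.

```python
from collections import Counter


def pick_preferred(candidates, strategy="Alphabetically first"):
    """Pick one preferred value from the preferred values of merged resources."""
    if not candidates:
        return None
    if strategy == "Alphabetically first":
        return min(candidates)
    if strategy == "Most common":
        counts = Counter(candidates)
        # Highest count first; ties fall back to the alphabetically first value
        return min(counts, key=lambda value: (-counts[value], value))
    if strategy == "Shortest":
        # Shortest value first; ties fall back to the alphabetically first value
        return min(candidates, key=lambda value: (len(value), value))
    raise ValueError(f"unknown strategy: {strategy!r}")
```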

Source Class

  • Class : The class from which the resources will be moved.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.

Target Class

  • Class : The class to which the resources will be moved.
  • Source Class Takes Precedence [Optional] : Preferred URI and preferred label from Source Class take precedence. The default value is False.
  • Add Merge within Class [Optional] : Desired behavior for automatically applying Merge within Class on Target Class after this component is finished. The possible values are: Merge if needed, Warning, Suppress warning, Always Merge.

Advanced

  • Keep Auxiliary Predicates [Optional] : Include auxiliary predicates. The default value is False.

Quality Control

  • Fraction of Merged Resources [Optional] : The fraction of filtered source resources which were merged with a destination resource. (higher is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential broader visibility”. The default value is 1.
  • Minimal count for warning “Sub-optimal class choice” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Sub-optimal class choice”. The default value is 1.
  • Minimal count for warning “Preferred URI from federated public endpoint overwritten” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Preferred URI from federated public endpoint overwritten”. The default value is 1.
  • Minimal count for warning “Merge within class: Nothing was merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Nothing was merged.”. The default value is 1.
  • Minimal count for warning “Merge within class: Resources with different preferred URI were merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Resources with different preferred URI were merged.”. The default value is 1.
  • Minimal count for warning “Merge within class: Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Potential broader visibility”. The default value is 1.

9.4.30. Merge within Class

Merges resources in a class that have at least one identical URI.

Description

This component merges resources in a class (Target Class) which have subject URIs in common.

Resources are considered to be equivalent if they have subject URIs (disq:uri) in common. This is even true “indirectly”, e.g. if

  • resource 1 has subject URIs A and B
  • resource 2 has subject URIs B and C
  • resource 3 has subject URIs C and D

then all three are considered equivalent (even though 1 and 3 don’t have URIs in common).

The component scans through all records and constructs “equivalence groups”.

  • If a resource doesn’t have equivalent resources, then it is left alone.
  • If a resource does have equivalent resources, all resources in its group are removed (or rather: deactivated), a new resource is created, and all predicates are “merged”. To be more precise: for each predicate in the class, the values of each resource in the group are concatenated (in arbitrary order).
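Building equivalence groups from shared URIs, including the “indirect” case, is a classic connected-components problem. The sketch below uses a union-find structure; it is one common way to solve this and is not necessarily how the component does it. The function name `equivalence_groups` and the input format (one list of subject URIs per resource) are assumptions.

```python
def equivalence_groups(resources):
    """Group resource indexes that are linked, directly or indirectly,
    by a shared subject URI."""
    parent = list(range(len(resources)))

    def find(i):
        # Follow parent links to the group representative (with path halving)
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # URI -> index of the first resource that carries it
    for index, uris in enumerate(resources):
        for uri in uris:
            if uri in first_owner:
                a, b = find(index), find(first_owner[uri])
                if a != b:
                    parent[max(a, b)] = min(a, b)
            else:
                first_owner[uri] = index

    groups = {}
    for index in range(len(resources)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())
```

Groups of size one correspond to resources that are left alone; every larger group would be deactivated and replaced by one merged resource.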

Important note: if resources are merged, a Preferred URI is chosen arbitrarily from the Preferred URIs of those resources. Likewise for the Preferred Label, if present.

During execution of the pipeline it is not uncommon to have resources with identical URIs, within one class or over different classes. At the end of the pipeline, however, when data is published to DISQOVER, this should no longer occur. This component can be used to tackle this “Duplicate (H)URI” problem.

However, keep in mind that this component doesn’t offer a way to select Preferred URIs or Labels. Two alternatives to consider:

  • Component Merge Classes offers better control for Preferred URIs and Labels.
  • Sometimes it is possible to avoid the creation of duplicate URIs in the first place.

Example

URIs have been abbreviated and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.

Preferred URIs are notated in boldface.

Target Class before applying the component:

disq:uri.uri animal_name.lit
[D:mickey_mouse] [“Mickey Mouse”]
[D:pluto] [“Pluto”]
[D:mickey, D:mickey_mouse] [“Mickey”]

Target Class after applying the component:

active disq:uri.uri animal_name.lit
NO [D:mickey_mouse] [“Mickey Mouse”]
YES [D:pluto] [“Pluto”]
NO [D:mickey, D:mickey_mouse] [“Mickey”]
YES [D:mickey_mouse, D:mickey] [“Mickey Mouse”, “Mickey”]

Observe:

  • The first and third resource are equivalent because they share a URI (D:mickey_mouse). They are deactivated and a new resource is created with merged predicate values. In the merged resource D:mickey_mouse is indicated as the Preferred URI, but it could just as well have been D:mickey; there is no way to tell beforehand.
  • The second resource is left untouched because it has a unique URI.

Options

  • Class : The class containing the resources to be merged.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.
  • Preferred URI selection strategy [Optional] : Which value to pick as preferred URI when merging resources with a different preferred URI. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
  • Preferred label selection strategy [Optional] : Which value to pick as preferred label when merging resources with a different preferred label. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.

Quality Control

  • Fraction of resources that were merged [Optional] : The fraction of resources that were merged. (higher is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “Nothing was merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Nothing was merged.”. The default value is 1.
  • Minimal count for warning “Resources with different preferred URI were merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Resources with different preferred URI were merged.”. The default value is 1.
  • Minimal count for warning “Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential broader visibility”. The default value is 1.

9.4.31. No Operation

Does nothing (can be used to improve the organization of the pipeline).

Description

This component does nothing. It only exists to facilitate organization of the pipeline.

For example, it can be used as the start or end of a set of components that are closely related, or as a placeholder for a component that needs to be implemented later.

Options

  • Class [Optional] : The class to which this component logically belongs.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.32. Publish in DISQOVER

Publish the integrated data in DISQOVER.

Description

The main purpose of this component is to publish all configured data of the pipeline in DISQOVER. After a successful execution the data will be visible in the DISQOVER front-end.

This component transforms resources in the Data Ingestion Engine to instances in DISQOVER. An instance in DISQOVER is an instance of a canonical type, such as a Gene, a Publication or an Organism.

The configuration components dictate which resources should be published and how predicates map to instance properties and/or facets. For more information see

The Configuration components should be the last components in the pipeline, linking to a final Publish in DISQOVER component.

Note

After doing some checks on all the classes, this component erases all data in DISQOVER, before publishing the new data (except when differential indexing is enabled, see further).

Considering this, the Data Ingestion Engine handles execution of this component with some care. When executing a pipeline (fully or partially) Publish in DISQOVER will not execute if any component produced an error (warnings are allowed).

This behavior can be overridden in debugging mode.

Malformed instances

In principle, in order to be published in DISQOVER a resource needs to have at least two things:

  • a Preferred URI (disq:uri.puri).
  • a Preferred Label (disq:pref_label.plabel).

All other predicates are optional and will not prevent the resource from being published as an instance.

It is also important that, in principle, each URI should be unique over all classes before publishing. Duplicates within a class can be merged via Merge within Class, duplicates over multiple classes can be merged via Merge Classes.

In practice, due to errors in the pipeline or in the data, it can happen that resources get no Preferred URI or no Preferred Label, or that different resources get the same subject URI. The component ‘Publish in DISQOVER’ will issue warnings for these cases (see below).

However, it can be hard to find out which resources are causing the problems. To make it easier to find those resources, the option Publish malformed instances is available as a debugging tool. If turned on, the following problematic instances will be published in DISQOVER, but with a special indication in their label:

  • If a resource has no Preferred Label the instance label will be its Preferred URI followed by the indication ‘[MISSING LABEL]’.
  • If a URI occurs as subject URI in multiple resources (duplicate URIs), then the corresponding instances will get the indication ‘[DUPLICATE URI]’ in their label.
  • In this last case, if that URI is the Preferred URI of at least one instance, those instances will get the indication ‘[DUPLICATE PREFERRED URI]’.

Note that resources without Preferred URI will not be published, even if Publish malformed instances is turned on.

This option should be turned off when publishing for RDS.

Export to RDF

The component Publish in DISQOVER can also export instance data to an RDF-file in turtle format (see https://www.w3.org/TR/turtle/).

In order to do so, switch on Export data to file and execute the component.

The export will produce multiple files of the form ‘ccc_nnn’ where ‘ccc’ is the name of the Canonical Type and ‘nnn’ is some unique number. By default the exported files are written in directory /disqover/data/exports. You can export to a sub-directory of this directory by filling in the option Export path.

Instance data (if present) is exported in triples with Preferred URI used as subject URI, according to the following scheme:

Instance data Predicate
Other URIs owl:sameAs = http://www.w3.org/2002/07/owl#sameAs
Preferred label skos:prefLabel = http://www.w3.org/2004/02/skos/core#prefLabel
Other labels rdfs:label = http://www.w3.org/2000/01/rdf-schema#label
Resource Type rdf:type = http://www.w3.org/1999/02/22-rdf-syntax-ns#type
Properties (*) Corresponding property URI
Facets (*) Corresponding facet URI

(*) by default properties are included in the export, but facets are not. This behavior can be overridden per individual property/facet via the option Export to file in the configuration components.
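To give an impression of what triples following this scheme look like in Turtle, the sketch below builds them for a single instance. It is illustrative only: the function `instance_to_turtle`, the example.org URIs in the usage note, and the choice to write resource types as IRIs are assumptions, and the exact layout of the files produced by the component may differ. The `rdfs` prefix is bound to the standard W3C RDF Schema namespace.

```python
import json

PREFIXES = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}


def instance_to_turtle(preferred_uri, other_uris=(), preferred_label=None,
                       other_labels=(), resource_types=()):
    """Serialize one instance, with the Preferred URI as subject of every triple."""
    lines = [f"@prefix {prefix}: <{ns}> ." for prefix, ns in PREFIXES.items()]
    lines.append("")
    subject = f"<{preferred_uri}>"

    def literal(text):
        # json.dumps produces a double-quoted string whose escapes
        # (\" \\ \n ...) are also valid in Turtle string literals
        return json.dumps(text, ensure_ascii=False)

    for uri in other_uris:
        lines.append(f"{subject} owl:sameAs <{uri}> .")
    if preferred_label is not None:
        lines.append(f"{subject} skos:prefLabel {literal(preferred_label)} .")
    for label in other_labels:
        lines.append(f"{subject} rdfs:label {literal(label)} .")
    for type_uri in resource_types:  # assumption: types are written as IRIs
        lines.append(f"{subject} rdf:type <{type_uri}> .")
    return "\n".join(lines)
```

For example, `instance_to_turtle("http://example.org/d/nemo", other_uris=["http://example.org/p/nemo"], preferred_label="Nemo")` yields one owl:sameAs triple and one skos:prefLabel triple for the subject `<http://example.org/d/nemo>`.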

If you only want to export, but not actually publish to DISQOVER, you can switch off Publish data to Disqover.

Differential Indexing

By default, if this component is executed, all instances which are already present in DISQOVER are removed before publishing.

If the option Differential indexing is switched on, only the new and changed instances are published and only obsolete instances are removed. This can make the execution faster.
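Conceptually, differential indexing compares the set of instances already in DISQOVER with the set about to be published. The sketch below illustrates that idea under the assumption that instances are keyed by URI and can be compared for equality; how the component actually detects changes is not described here, and `differential_plan` is a hypothetical name.

```python
def differential_plan(published, new):
    """Return (to_publish, to_remove) as sets of URIs.

    `published` and `new` map an instance URI to its content.
    """
    # New or changed instances have to be (re)published
    to_publish = {
        uri for uri, content in new.items()
        if uri not in published or published[uri] != content
    }
    # Instances that no longer exist are obsolete
    to_remove = set(published) - set(new)
    return to_publish, to_remove
```

Unchanged instances appear in neither set, which is where the time saving comes from.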

Information

Resource belongs to multiple types
When a resource belongs to 2 or more canonical types, one example of each combination is provided.

Warnings

Class does not contain uri and type predicates

Preferred URI or Resource Type (rdf:type) predicates do not exist in a specific class.

Class contains resources without a type

Resource Type (rdf:type) predicates exist in a class, but the class is not configured in a canonical type component and there are resources which have no values for that predicate. These resources will not be published.

Class does not contain any instances with configured types

The predicate rdf:type.lit is defined in a class, which is not configured for publishing in the configuration components, but none of its values are referenced in the configuration components.

Tree contains loops

Facets can be specified to be hierarchical. In that case the underlying data should also be hierarchical. Nodes are allowed to have multiple parents, but this should never result in a circular reference. If a circular reference is found, this warning is shown. One example per class of such a loop is reported.
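Checking hierarchical data for circular references amounts to cycle detection in a directed graph. The following depth-first search is a minimal sketch of that check, assuming the hierarchy is available as a mapping from each node to its parents; the function name `find_loop` is hypothetical and the component may use a different method.

```python
def find_loop(parents):
    """Return one loop as a list of nodes (start node repeated at the end),
    or None if the hierarchy contains no loop.

    `parents` maps each node to the list of its parent nodes.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}

    def visit(node, path):
        state[node] = IN_PROGRESS
        path.append(node)
        for parent in parents.get(node, ()):
            if state.get(parent) == IN_PROGRESS:
                # The parent is already on the current path: a loop
                return path[path.index(parent):] + [parent]
            if parent not in state:
                loop = visit(parent, path)
                if loop:
                    return loop
        path.pop()
        state[node] = DONE
        return None

    for node in parents:
        if node not in state:
            loop = visit(node, [])
            if loop:
                return loop
    return None
```

A node with several parents is fine as long as no path leads back to the node itself. The recursion is kept for readability; a very deep hierarchy would call for an iterative variant.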

Facets expecting single valued predicates received multi valued predicates

Facets that have a data type (integer, float or date) and can be used for histograms should be single-valued. If the corresponding predicate for a resource has more than one value, that value is not published. Note that the value is allowed to be empty.

This problem can probably be solved via a Transform Literals component.

Facets expecting predicates of a certain data type received the wrong data type

Facets that have a data type (integer, float or date) expect a fixed data format. For example a date should be in ISO format ‘YYYY-MM-DD’. A number should only contain digits and optionally a decimal point or a sign. If a predicate value of a resource does not meet these criteria, it will not be published.
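Values can be checked against these formats before publishing, for instance in a test on the source data. The sketch below is one interpretation of the rules above; the exact patterns accepted by DISQOVER are not specified here, so the regular expressions and the function name `is_publishable` are assumptions.

```python
import re
from datetime import date

_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)
_INTEGER = re.compile(r"[+-]?\d+", re.ASCII)
_FLOAT = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)", re.ASCII)


def is_publishable(value: str, data_type: str) -> bool:
    """Check a literal against the format expected for a typed facet."""
    if data_type == "date":
        match = _DATE.fullmatch(value)
        if match is None:
            return False
        try:
            # Rejects impossible dates such as 2023-02-29
            date(*map(int, match.groups()))
        except ValueError:
            return False
        return True
    if data_type == "integer":
        return _INTEGER.fullmatch(value) is not None
    if data_type == "float":
        return _FLOAT.fullmatch(value) is not None
    raise ValueError(f"unknown data type: {data_type!r}")
```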

Preferred label and/or preferred URI changed in an incremental run

This is a warning which can occur during incremental runs: label and preferred URIs are immutable during such a run. When resources get a new label or URI during such a run anyway, this warning is issued. A full rerun of the data will adapt the labels and URIs.

Resource belongs to local-only and mixed canonical types simultaneously

When using federation, a resource belongs to a local canonical type and to a mixed canonical type. This will result in inaccurate counts in DISQOVER. The pipeline will have to be adapted to fix the problem.

Ambiguous storage field. Storage fields are derived from the postfix of the URI

When the data is published to DISQOVER, facets and properties are transferred to fields of the underlying storage. The names of the fields are derived from the configuration URI: they correspond to the postfix of the URI. However, this may result in conflicts in storage: if the prefixes of two URIs are different but the postfixes are identical, information that does not belong together may be stored in one field. The solution is to choose another URI for the property or facet. Note that this typically occurs when one resource has multiple canonical types.
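Such conflicts can be spotted in advance by grouping configuration URIs by their postfix. In the sketch below the postfix is assumed to be the part of the URI after the last “/” or “#”; the precise rule used by DISQOVER is not stated here, and the function names and example.org URIs are hypothetical.

```python
import re
from collections import defaultdict


def postfix(uri: str) -> str:
    """Assumed postfix: the part after the last '/' or '#'."""
    return re.split(r"[#/]", uri.rstrip("#/"))[-1]


def ambiguous_fields(config_uris):
    """Return {postfix: [uris]} for postfixes shared by more than one URI."""
    by_postfix = defaultdict(set)
    for uri in config_uris:
        by_postfix[postfix(uri)].add(uri)
    return {
        name: sorted(uris)
        for name, uris in by_postfix.items()
        if len(uris) > 1
    }
```

For example, `http://example.org/gene/name` and `http://example.org/drug/name` would both end up with the postfix `name` and be reported together.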

Initial error uploading a batch file to solr

An error occurred during the publishing of the data, but the software could mitigate it by retrying the failed upload. When this warning occurs, it may indicate that some parameters on the server are not configured optimally.

Unknown data source

Some component in the pipeline refers to a data source, but the pipeline doesn’t contain a Define Datasource component for that data source.

Other non-fatal errors occurred

These warnings do not belong to specific categories.

Errors

Class contains resources without preferred label / URI

Preferred URI or Preferred Label predicates exist in the class, but there are resources which have no values for one of those predicates although they are in a class configured for publishing or have an rdf:type configured for indexing. These resources will not be published. (Note: if Publish malformed instances is turned on, this is reduced to a warning.)

Class contains non-unique URIs/Classes contain non-unique preferred URIs

Some URIs are present in multiple resources, possibly in different classes. The resources with these duplicate URIs will not be published unless Publish malformed instances is turned on, in which case this is reduced to a warning.

Only a few duplicate URIs are reported per class. The action Find URI can be used to investigate further; alternatively, Publish malformed instances can be turned on.
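
The check for duplicate URIs across classes can be pictured with the following Python sketch. The data layout (a dictionary of class names to lists of resources, each resource being a list of URIs) and the function name are assumptions made for the illustration only.

```python
from collections import defaultdict


def find_duplicate_uris(classes):
    """Map each URI found in more than one resource to the places where it occurs.

    classes: dict mapping a class name to a list of resources,
             each resource being a list of URIs.
    Returns a dict: URI -> list of (class name, resource index) pairs.
    """
    seen = defaultdict(list)
    for class_name, resources in classes.items():
        for index, uris in enumerate(resources):
            for uri in set(uris):  # count a URI only once per resource
                seen[uri].append((class_name, index))
    return {uri: places for uri, places in seen.items() if len(places) > 1}
```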

Instances invisible by combination of user roles

User roles can be defined on different levels. Sometimes this can result in an instance being invisible to all users. For example a canonical type may be visible only for members of group A, while an instance in this canonical type is only visible for members of group B.
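
The example can be expressed as a set intersection. This Python sketch assumes that an instance is visible to a group only if the group is allowed on every level; that combination rule is inferred from the example above, and the function names are hypothetical.

```python
def visible_to(type_groups, instance_groups):
    """Groups that can see the instance, assuming both levels must allow the group."""
    return set(type_groups) & set(instance_groups)


def is_invisible(type_groups, instance_groups):
    """True if no group at all can see the instance."""
    return not visible_to(type_groups, instance_groups)
```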

Error uploading a batch file to solr

An error occurred during the publishing of the data. This might result in incomplete data. When this error occurs, please check the health of the Solr service.

Error calling solr API

An error occurred during the publishing of the data. This might result in incomplete data. When this error occurs, please contact ONTOFORCE support.

Options

Advanced

  • Automatically drop predicates [Optional] : Determine automatically if certain predicates are no longer necessary at some point in the pipeline. The default value is False.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.33. Remove Resources

Removes all resources from a class that match a given filter.

Description

This component “removes” all resources from a class (Target Class) that match the filter (Filter).

Actually the resources are not really removed, but deactivated (internally marked by a boolean).
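
The idea of deactivating instead of deleting can be sketched in Python as follows. The Resource class, its active attribute and the remove_resources function are hypothetical names used for illustration; they do not describe the internal data structures of the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    predicates: dict
    active: bool = True  # hypothetical stand-in for the internal boolean marker


def remove_resources(resources, matches):
    """Deactivate every active resource for which matches(resource) is True.

    Nothing is deleted from the list; the function returns the number
    of resources that were deactivated.
    """
    removed = 0
    for resource in resources:
        if resource.active and matches(resource):
            resource.active = False
            removed += 1
    return removed
```

Later steps would then skip every resource whose flag is False, which is why the class keeps its original size after a removal.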

Note that it is preferable, in many cases, to apply a filter to a component instead of removing records before running that component.

Advanced

In general the pipeline tries to make all relationships bidirectional, i.e. using a forward relationship predicate in one alignment and a reverse relationship predicate in the other alignment. This component is the only one which can break directionality because it can remove one of the predicates, while leaving the other one. Therefore it is advisable to apply this component early in the pipeline, before relationships are created.

Removing a lot of records can compromise the performance of further processing; see Create Compact Class.

Options

  • Class : The class to remove resources from.
  • Filter [Optional] : Boolean expression returning true for resources which should be removed.

Advanced

  • Update Statistics [Optional] : Update predicate statistics. This may be switched off for performance reasons. The default value is True.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.

9.4.34. Synchronize Federated Class

Synchronizes the given class with the federated data.

Description

This component tries to synchronize local resources to a Federation Endpoint, either by URI matching or by label matching. The Federation Endpoint should have been set up in the DISQOVER backend server, defining the federated server’s connection settings.

URI matching:

  • The system will try to find an instance with given local URI on the Federation Endpoint. It will either return a new preferred URI and preferred label, or nothing.
  • The local URIs are in a predicate specified in Match URI Predicate, by default disq:matchuri.lit.

Label matching:

  • The system will try to find an instance with the given label and DISQOVER Canonical Type URI on the Federation Endpoint. If several candidates are found, the Use Most Referenced Match option determines what happens (see the examples below).
  • The local labels are in a predicate specified in Match Label Predicate, by default disq:matchlabel.lit.
  • The remote Canonical Type URIs are in a predicate specified in Match Type Predicate, by default disq:matchtype.lit.

Note that you cannot do URI matching and Label matching at the same time.

Depending on the particular case, the component writes to different predicates (e.g. disq:label.lit), see the examples below. It always writes to predicate disq:partition.lit:

  • “public” if there was a match on the Federation Endpoint.
  • “local” if there was no match.

After publishing to DISQOVER, each instance containing “public” in disq:partition.lit becomes a candidate for “live” federation. This means that when an instance is shown in DISQOVER, all its properties and facets will be retrieved from the Federation Endpoint and combined with its local properties and facets.

After execution, the number of unmatched entries and the number of resources with match errors are shown in the Counters section.

Because the process of synchronizing can take some time, the results are cached locally. This means that the first execution can be slow, but subsequent executions will be fast, provided the data hasn’t changed much. The number of cache hits and cache misses can be inspected in the Counters section.
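
The effect of the cache on the counters can be illustrated with a small Python sketch. The CachedLookup class is a hypothetical stand-in; how DISQOVER actually stores and invalidates its cache is not described here.

```python
class CachedLookup:
    """Wrap a slow lookup function and count cache hits and misses."""

    def __init__(self, lookup):
        self._lookup = lookup  # e.g. a call to a remote endpoint
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._lookup(key)
        return self._cache[key]
```

On a first run over unchanged keys every call is a miss; on a second run over the same keys every call is a hit and the slow lookup is not invoked again.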

Important

Federation relies on three assumptions:

  1. Federated instances belong to the same canonical type on the customer DISQOVER installation as on www.disqover.com. This means synchronized instances cannot also belong to a local canonical type on top of a federated one.
  2. Mixed instances (customer instances that are enriched with ONTOFORCE data) have the same preferred URI on the customer DISQOVER installation as on www.disqover.com. This means the preferred URI should not be changed after the “Synchronize Federated Class”-component.
  3. The URIs of data sources are different on the customer DISQOVER installation than on disqover.com.

A URI matching example

Option Value
Target Class DisneyCharacters
Use Most Referenced Match False

URIs have been abbreviated:

The situation before applying the component:

disq:uri.uri disq:matchuri.lit
[LD:mickey_mouse] [D:mickey_mouse]
[LD:donald_dog] [D:donald_dog]

Suppose D:mickey_mouse exists on the Federation Endpoint, with label “Mickey Mouse”, and that D:donald_dog does not exist. The situation after executing the component is then:

disq:uri.uri disq:matchuri.lit disq:partition.lit disq:label.lit
[LD:mickey_mouse, D:mickey_mouse] [D:mickey_mouse] [“public”] Mickey Mouse
[LD:donald_dog] [D:donald_dog] [“local”]  

D:mickey_mouse will be the new preferred URI for the first resource.
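
The URI matching example can be mimicked in Python as follows. This is a sketch under assumptions: the Federation Endpoint is replaced by a plain dictionary mapping a URI to a (preferred URI, preferred label) pair, and the function name is hypothetical.

```python
def synchronize_by_uri(resource, endpoint, match_predicate="disq:matchuri.lit"):
    """Update one resource in place and return it.

    resource: dict mapping predicate names to lists of values.
    endpoint: dict mapping a URI to (preferred URI, preferred label);
              a hypothetical stand-in for the Federation Endpoint lookup.
    """
    values = resource.get(match_predicate, [])
    found = endpoint.get(values[0]) if values else None
    if found is None:
        resource["disq:partition.lit"] = ["local"]
        return resource
    preferred_uri, preferred_label = found
    resource["disq:uri.uri"].append(preferred_uri)
    resource["disq:label.lit"] = [preferred_label]
    resource["disq:partition.lit"] = ["public"]
    return resource
```

Applied to the two resources of the example, the first one receives the extra URI, the label and the “public” partition, while the second one only receives the “local” partition.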

A label matching example using most referenced matches

Let’s assume the Federation Endpoint has the following data:

URI Label Dataset hits
D:mickey_mouse Mickey Mouse Cartoons 3 hits, Movies 4 hits
D:ronald_duck Ronald Duck Cartoons 1 hit, Movies 2 hits
D:ronald_d_duck Ronald Duck Cartoons 0 hits, Movies 2 hits

We’ll configure the component as follows:

Option Value
Target Class DisneyCharacters
Use Most Referenced Match True

Target Class before applying the component:

disq:uri.uri disq:matchlabel disq:matchtype
[LD:mickey_mouse] [“Mickey Mouse”] [D:disney_character]
[LD:donald_dog] [“Donald Dog”] [D:disney_character]
[LD:ronald_duck] [“Ronald Duck”] [D:disney_character]

Target Class after applying the component (omitting matchlabel and matchtype):

disq:uri.uri disq:partition.lit disq:uri.err disq:label.lit
[LD:mickey_mouse, D:mickey_mouse] [“public”]   Mickey Mouse
[LD:donald_dog ] [“local”]    
[LD:ronald_duck] [“public”] [D:ronald_duck, D:ronald_d_duck]  

Results:

  • D:mickey_mouse will be the new preferred URI.
  • Donald Dog was not found and does not get values.
  • Ronald Duck has multiple matches with the same maximum number of hits in a dataset (Movies 2 hits), so both matching URIs are stored in the disq:uri.err predicate.

A label matching example not using most referenced matches

Let’s assume the Federation Endpoint has the following data:

URI Label Dataset hits
D:mickey_mouse Mickey Mouse Cartoons 3 hits, Movies 4 hits
D:ronald_duck Ronald Duck Cartoons 1 hit, Movies 2 hits
D:ronald_d_duck Ronald Duck Cartoons 0 hits, Movies 1 hit

We’ll configure the component as follows:

Option Value
Target Class DisneyCharacters
Use Most Referenced Match False

The situation before applying the component:

disq:uri.uri disq:matchlabel disq:matchtype
[LD:mickey_mouse] [“Mickey Mouse”] [D:disney_character]
[LD:donald_dog] [“Donald Dog”] [D:disney_character]
[LD:ronald_duck] [“Ronald Duck”] [D:disney_character]

The situation after applying the component (omitting partition, matchlabel and matchtype):

disq:uri.uri disq:uri.err disq:label.lit
[LD:mickey_mouse, D:mickey_mouse]   Mickey Mouse
[LD:donald_dog]    
[LD:ronald_duck] [D:ronald_duck, D:ronald_d_duck]  

Results:

  • D:mickey_mouse will be the new preferred URI.
  • Donald Dog was not found and does not get values.
  • Ronald Duck has multiple matches, so both matching URIs are stored in the disq:uri.err predicate. The number of hits in datasets is not used as a determining factor.
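
The difference between the two label matching examples can be summarised in a Python sketch. Several points are assumptions: the function name and data layout are hypothetical, matching on canonical type is left out, and the ranking by the maximum number of hits in any single dataset is only inferred from the first example; the actual ranking may differ.

```python
def match_by_label(label, candidates, use_most_referenced):
    """Return (matched URI or None, list of ambiguous URIs).

    candidates: list of (uri, label, hits) tuples, where hits is a dict
                mapping a dataset name to a number of hits.
    """
    found = [(uri, hits) for uri, cand_label, hits in candidates if cand_label == label]
    if not found:
        return None, []
    if len(found) == 1:
        return found[0][0], []
    if use_most_referenced:
        # ASSUMPTION: rank candidates by their highest hit count in a single dataset.
        best = max(max(hits.values()) for _, hits in found)
        top = [uri for uri, hits in found if max(hits.values()) == best]
        if len(top) == 1:
            return top[0], []
        return None, top  # still a tie: report the tied candidates
    # Without ranking, every candidate is reported as ambiguous.
    return None, [uri for uri, _ in found]
```

The ambiguous URIs returned as the second element correspond to the values stored in disq:uri.err in the examples.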

Matching multi-valued URIs or labels

When the local data contains multiple values in any of the predicates selected by Match URI Predicate, Match Type Predicate or Match Label Predicate, the synchronization action will:

  • Use the first found value in the predicate.
  • Add a warning about the instances which have multiple values in the selected predicate.
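
A minimal Python sketch of this behaviour, assuming predicate values are held in lists; the function name and the wording of the warning message are hypothetical.

```python
def first_value(resource_id, predicate, values, warnings):
    """Return the first value of a predicate and record a warning if there are several."""
    if len(values) > 1:
        warnings.append(f"{resource_id}: multiple values in {predicate}, using the first one")
    return values[0] if values else None
```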

Options

  • Class : The class to synchronize.
  • Match URI Predicate [Optional] : Predicate used to match URIs. The default value is disq:matchuri.lit.
  • Match Labels Predicate [Optional] : Predicate used to match labels. The default value is ['disq:matchlabel.lit'].
  • Match Type Predicate [Optional] : Predicate used to match type. The default value is disq:matchtype.lit.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.

Advanced

  • Use Most Referenced Match [Optional] : Whether to use the label match with the highest number of associated datasets. The default value is True.

Quality Control

  • Fraction of failed synchronizations [Optional] : The fraction of resources for which the synchronization failed. (lower is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the synchronization of a resource.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the synchronization of a resource.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.

9.4.35. Transform Literals

Within each resource, applies an expression to transform literal predicates into output literal predicates.

Description

This component adds literal predicate values in each resource of a class (Target Class). These values are derived from other literal predicates via expressions.

Predicates that are written (output predicates) are denoted with the prefix @; predicates that are read (input predicates) are denoted with the prefix $. If an input predicate is known to be single-valued (for each resource), the prefix $$ can be used to retrieve that single value.

For more details about the expression language, see Expression Functions.

The component operates resource by resource, so it is not possible to mix data from different resources. A similar but more powerful component which offers this functionality is Aggregate and Transform (resources).

It is not possible to define subject URIs (disq:uri) with this component; use Add URI instead. It is not possible to define subject Labels (disq:label) with this component; use Add Label instead.

Example

In this example we split a comma-separated literal into a multi-valued literal, and convert dates to ISO format.

Transformation expression:

set @country = StrSplit($$country_list, ",");

set @iso_date = Map($raw_date, _el, IsoDater(_el, "%m%d%Y"));

Target Class before applying the component:

country_list.lit raw_date.lit
[“BE,FR,UK”] [“04101992”]
[“US”] [“05101992”, “06101992”]

Target Class after applying the component:

country_list.lit raw_date.lit country.lit iso_date.lit
[“BE,FR,UK”] [“04101992”] [“BE”, “FR”, “UK”] [“1992-04-10”]
[“US”] [“05101992”, “06101992”] [“US”] [“1992-05-10”, “1992-06-10”]
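
For readers more familiar with a general-purpose language, the same per-resource transformation can be approximated in Python. This is an analogy only: it reproduces the before and after tables of the example, not the implementation of the expression language, and the function name is hypothetical.

```python
from datetime import datetime


def transform(resource):
    """Return a copy of the resource with country.lit and iso_date.lit added.

    resource: dict mapping predicate names to lists of string values.
    """
    out = dict(resource)
    # Counterpart of $$country_list: the predicate must hold exactly one value.
    (country_list,) = resource["country_list.lit"]
    out["country.lit"] = country_list.split(",")
    # Counterpart of Map(...): convert every raw date from MMDDYYYY to YYYY-MM-DD.
    out["iso_date.lit"] = [
        datetime.strptime(raw, "%m%d%Y").date().isoformat()
        for raw in resource["raw_date.lit"]
    ]
    return out
```

As in the component, each resource is handled on its own: the function only sees the predicates of the resource it is given.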

Options

  • Class : Class containing the predicates to transform.
  • Transformation : The set of expressions to be executed for each resource of the class (unless filtered out). Newly created predicates must be lists.
  • Filter [Optional] : Boolean expression returning true for resources which should be included.

Advanced

  • Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
  • Data Sources [Optional] : List of URIs of the data sources assigned to this component.

Quality Control

  • Fraction of failed transformations [Optional] : The fraction of resources for which the transformation failed. (lower is better)

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
  • Minimal count for warning “An error occurred during the transformation of a resource.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the transformation of a resource.”. The default value is 1.
  • Minimal count for warning “Expression cannot be precompiled.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Expression cannot be precompiled.”. The default value is 1.
  • Minimal count for warning “Could not apply detailed provenance, all input data sources have been combined.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Could not apply detailed provenance, all input data sources have been combined.”. The default value is 1.
  • Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
  • Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
  • Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
  • Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.

9.4.36. Verify Data

Verifies data based on a ratio between two filter counts. A warning level and an error level can be set for this fraction.

Description

This component compares the numbers of resources in Class that are selected by two filters, and generates a warning or an error if a threshold is exceeded.

It doesn’t change any data.

The component calculates a quality measure equal to condition count divided by the scope count, where

  • condition count is the number of (active) resources that pass the Scope Filter and the Condition Filter
  • scope count is the number of (active) resources that pass the Scope Filter.

The component generates a warning or an error if the quality measure exceeds thresholds specified via Warning Threshold and Error Threshold as described in Quality Control.

By default, a warning or error is generated if the quality measure is strictly greater than the threshold. That behavior can be reversed via the option Lower is better.

If Scope Filter is empty (default value is True) then the scope count is equal to the total number of (active) resources in the class. Leaving Condition Filter empty doesn’t make much sense.

Example

Option Value
Condition Filter ListEmpty($disq:label.lit)
Scope Filter empty
Warning Threshold 0.01
Error Threshold 0.1
Lower is better True

This component looks at the fraction of active resources for which disq:label.lit is empty. It generates an error if that fraction is greater than 0.1 and a warning if it’s greater than 0.01.
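
The calculation behind this example can be sketched in Python. The function and parameter names are hypothetical, the input list is assumed to contain only active resources, and the handling of an empty scope is a choice made for the sketch because it is not described above.

```python
def verify_data(resources, condition, scope=None, warning_threshold=0.01,
                error_threshold=0.1, lower_is_better=True):
    """Return (quality measure, status), with status "ok", "warning" or "error"."""
    in_scope = [r for r in resources if scope is None or scope(r)]
    if not in_scope:
        # Not described in the text: this sketch simply refuses an empty scope.
        raise ValueError("scope count is zero")
    condition_count = sum(1 for r in in_scope if condition(r))
    measure = condition_count / len(in_scope)

    def exceeds(threshold):
        # Strict comparison; the direction depends on the Lower is better option.
        return measure > threshold if lower_is_better else measure < threshold

    if exceeds(error_threshold):
        return measure, "error"
    if exceeds(warning_threshold):
        return measure, "warning"
    return measure, "ok"
```

With the thresholds of the example, a class of 200 active resources of which 5 have no label gives a measure of 5 / 200 = 0.025, which is above the warning threshold but below the error threshold.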

Options

  • Class : The class containing the data to be verified.
  • Condition Filter : A boolean expression that will be evaluated for all resources in the class. The numerator of the fraction is the number of resources that return True.
  • Scope Filter [Optional] : A boolean expression that will be evaluated for all resources in the class. The denominator of the fraction is the number of resources that return True.

Quality Control

  • Warning Threshold : The warning threshold of the fraction.
  • Error Threshold : The error threshold of the fraction.
  • Lower is better [Optional] : If true (default), values above the threshold will generate warnings or errors. If false, values below the threshold will generate warnings or errors. The default value is True.

Warnings

  • Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.