9.4. Components¶
Overview
- Add Label
- Add URI
- Aggregate and Transform (resources)
- Configure Canonical Type
- Configure Sub-instance Type
- Configure Typed Link
- Configure User Views (DEPRECATED)
- Create Compact Class
- Create Relationship (by identifier)
- Create Relationship (by label)
- Define Datasource
- Expand Hierarchical Paths
- Extract Class
- Extract Class (distinct)
- Extract Hierarchical Class
- Import CSV
- Import Excel
- Import Identifier Block
- Import JSON
- Import RDF (DEPRECATED)
- Import RDF (multiple classes)
- Import Remote Data Set
- Import Separator Block
- Import XML
- Infer by Relationship (DEPRECATED)
- Infer by Relationship (multiple predicates)
- Map Classes (by label)
- Merge Classes
- Merge within Class
- No Operation
- Publish in DISQOVER
- Remove Resources
- Synchronize Federated Class
- Transform Literals
- Verify Data
9.4.1. Introduction¶
Components can be configured via their options.
Some options occur in many component types. We’ll discuss these common options first, before diving into the individual component types.
Filters¶
Most components offer the possibility to restrict the resources that are processed in the involved classes by using filters.
A filter is a boolean expression which is evaluated for every (active) resource.
If the expression returns True, then the resource is taken into account for the component execution; otherwise it is not taken into account.
By default (empty filter), the return value is True.
The expression is written using the expression language, see Expression Functions. For each resource the filter expression has access to the values of all predicates.
Note: it is not possible to filter based on predicate values in multiple resources.
Known issue: In general, when specifying a predicate in an option or expression, the type extension can be left out. For example, one can normally write name instead of name.lit. However, if a predicate is mentioned both in an option and in a filter, then the extension has to be added explicitly in the filter.
After execution, the number of resources that were “filtered in” per class can be consulted in the section Counters of the component view.
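The filter semantics described in this section can be mimicked in a few lines of Python. This is only an illustrative sketch, not DISQOVER code: the resource model (a dict mapping predicate names to lists of strings) and the names `filtered_in` and `has_age` are assumptions made for the example.

```python
from typing import Callable, Optional

# Assumed model: a resource maps predicate names to lists of string values.
Resource = dict[str, list[str]]
Filter = Optional[Callable[[Resource], bool]]


def filtered_in(resources: list[Resource], flt: Filter = None) -> list[Resource]:
    """Return the resources that a component would take into account."""
    if flt is None:
        # Empty filter: the expression is treated as True for every resource.
        return list(resources)
    # The filter only ever sees one resource at a time.
    return [r for r in resources if flt(r)]


people = [
    {"name": ["John Snow"], "age": ["23"]},
    {"name": ["Sansa Stark"], "age": []},
]

# Hypothetical filter: keep resources that have at least one value for "age".
has_age = lambda r: len(r.get("age", [])) > 0

print(len(filtered_in(people)))           # no filter
print(len(filtered_in(people, has_age)))  # with filter
```

Because the filter function receives a single resource, the sketch also reflects the restriction that a filter cannot combine predicate values from multiple resources.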
Quality Control¶
Several components offer a mechanism for verifying the quality of the produced data.
Quality is expressed via numbers called quality measures. A quality measure can be either a quantity, e.g. the number of imported resources, or a ratio between two numbers, e.g. the number of resources that produced a failure divided by the total number of resources, in other words the fraction of failures.
Some quality measures are expected to be high numbers (higher is better), like the number of imported resources, others are expected to be low numbers (lower is better), like the fraction of failures.
A component can generate a warning or an error if a quality measure exceeds a user-specified threshold.
- For a higher is better quality measure the warning or error is generated if the number is smaller than the threshold.
- For a lower is better quality measure the warning or error is generated if the number is greater than the threshold.
Each component can offer different quality measures which are relevant for that component. For each quality measure, the user will be presented with two options to set the thresholds:
- Error Level: if the quality measure exceeds the threshold, an error is produced.
- Warning Level: if the quality measure exceeds the threshold, a warning is produced.
For the difference between warnings and errors, see Execution errors and warnings.
This kind of quality control is sometimes called in-component QC. It is not to be confused with the component Verify Data which offers an alternative, less component-specific way to verify quality.
Example
The Transform Literals component applies some transformation on each resource of a class.
For some resources the transformation may fail.
It is therefore natural for this component to offer a quality measure Fraction of Failed Transformations.
This is an example of a ratio quality measure of the type lower is better.
Suppose the user specified Error Level = 0.05 and Warning Level = 0.02. Then this component will generate a warning if more than 2% of the resources failed, and an error if more than 5% of the resources failed.
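The threshold logic can be sketched in Python as follows. The function name `qc_status` and its parameters are hypothetical; the sketch only restates the rules above (a higher-is-better measure fails when it drops below the threshold, a lower-is-better measure fails when it rises above it) and is not the actual implementation.

```python
def qc_status(value: float, warning_level: float, error_level: float,
              higher_is_better: bool = False) -> str:
    """Classify a quality measure as 'ok', 'warning' or 'error'."""

    def violates(threshold: float) -> bool:
        # higher is better: violated when the value is smaller than the threshold
        # lower is better:  violated when the value is greater than the threshold
        return value < threshold if higher_is_better else value > threshold

    # The error level is checked first because it is the more severe outcome.
    if violates(error_level):
        return "error"
    if violates(warning_level):
        return "warning"
    return "ok"


# Fraction of Failed Transformations with Error Level = 0.05, Warning Level = 0.02
for fraction in (0.01, 0.03, 0.06):
    print(fraction, qc_status(fraction, warning_level=0.02, error_level=0.05))
```

With the thresholds from the example, a failure fraction of 3% falls between the two levels and yields a warning, while 6% yields an error.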
Warning suppression¶
The user can suppress the reporting of specific warnings in specific components via component options.
The option-section Warnings contains an option for (almost) each type of warning that a component can generate.
Each option specifies a minimal count.
The warning will only be reported if its number of occurrences is greater than or equal to the minimal count. The default value is 1, so by default every warning will be reported.
In the example above (Transform Literals component) the warnings can be suppressed by setting Minimal count for warning “Error while processing a resource.” to a value above 8776.
Note: Quality Control warnings are controlled by a separate mechanism of thresholds, see Quality Control.
9.4.2. Add Label¶
Uses the content of a predicate as a new label. The preferred label will be visible in DISQOVER. All other labels will be used as synonyms.
Description¶
This component adds zero or more labels to each resource in a class (Target Class) which is included in the filter.
Labels are stored in predicate disq:label.lit (or disq:label for short), which has a special meaning in DISQOVER.
The labels are copied from a literal predicate specified in option Literal Predicate.
Preferred Label¶
Similar to the concept of Preferred URI, each resource needs a unique Preferred Label in DISQOVER.
Each resource can have zero or more labels (stored in disq:label), but one of them is defined to be the Preferred Label.
The behavior depends on the option New Preferred Label.
If option New Preferred Label is True, then we want this component to define the Preferred Label for each resource.
If a Preferred Label has already been defined by an earlier component, it will be overridden.
In order to ensure that there is never more than one Preferred Label, the following rules apply for each resource:
Number of created labels | No Preferred label yet | Already has Preferred label |
---|---|---|
0 | warning | warning |
1 | OK | OK, override |
> 1 | warning; labels added but pref. label not set | warning; labels added but pref. label not overridden |
If option New Preferred Label is False, then existing preferred labels are not changed.
Note that the mechanism and rules are subtly different compared to Preferred URI. Compare with Add URI.
Note that labels can not be defined via component Transform Literals, because that component cannot guarantee the uniqueness of Preferred Labels.
Example¶
Option | Value |
---|---|
Literal Predicate | name |
New Preferred Label | True |
URIs have been abbreviated:
- ‘G:’ stands for “http://got/”
Preferred Labels are notated in boldface.
Target Class before applying the component:
disq:uri.uri | disq:uri.huri | name.lit | disq:label.lit |
---|---|---|---|
[G:john_snow] | [HURI(G:john_snow)] | [“John Snow”] | [] |
[G:sansa_stark] | [HURI(G:sansa_stark)] | [“Sansa Stark”] | [“Sansa”] |
[G:petyr_baelish] | [HURI(G:petyr_baelish)] | [“Petyr Baelish”, “Littlefinger”] | [] |
[G:sandor_clegane] | [HURI(G:sandor_clegane)] | [“Sandor Clegane”, “The Hound”] | [“Sandor”] |
Target Class after applying the component:
disq:uri.uri | disq:uri.huri | name.lit | disq:label.lit |
---|---|---|---|
[G:john_snow] | [HURI(G:john_snow)] | [“John Snow”] | [“John Snow”] |
[G:sansa_stark] | [HURI(G:sansa_stark)] | [“Sansa Stark”] | [“Sansa”, “Sansa Stark”] |
[G:petyr_baelish] | [HURI(G:petyr_baelish)] | [“Petyr Baelish”, “Littlefinger”] | [] |
[G:sandor_clegane] | [HURI(G:sandor_clegane)] | [“Sandor Clegane”, “The Hound”] | [“Sandor”] |
Observe:
- John Snow didn’t have a label yet, so he gets a new label, which is a copy of his name.
- Sansa already had a label, to which a new label is added. Because the option New Preferred Label is True, this new label becomes the preferred label.
- Petyr and Sandor have two names, so they both have two label candidates. Because it is not clear which one should be the Preferred Label, a warning is issued and neither of the labels is added! If option New Preferred Label had been False, then Sandor would get two extra labels (not preferred), but Petyr wouldn’t.
Options¶
- Class : The name of the class on which the action will be performed.
- Literal Predicate : The predicate containing the value(s) that will be set as instance label. If the Preferred Label option is turned on, this predicate must be single-valued.
- New Preferred Label [Optional] : If turned on, the value of the selected predicate will be set as the preferred label and overwrite the existing label (if a preferred label has been set in an earlier component). In that case, the literal predicate must be single-valued. The default value is True.
- New Preferred Label selection strategy [Optional] : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Minimal count for warning “The predicate ‘disq:label’ should not be used as an input predicate for the add label component.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘disq:label’ should not be used as an input predicate for the add label component.”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.3. Add URI¶
Uses the content of a Literal Predicate as a new URI with an optional prefix. This URI can subsequently be used for creating relationships.
Description¶
This component adds zero or more “subject” URIs (disq:uri) to each resource in a class (Target Class) which is included in the filter.
The URIs are constructed from the values of a literal predicate specified in option Literal Predicate.
The following conversions are applied to each literal value:
- If To Lowercase is True, the literal is converted to lowercase, otherwise it is left untouched (not converted to uppercase).
- Then the literal is encoded as specified by Encoding (see below), unless Prefix is empty.
- Finally, the value of Prefix is added in front.
Per resource, a URI is created for each of the values of the literal predicate, and these URIs are added to the values of disq:uri for that resource. If that predicate doesn’t exist yet, it is created. If, for a certain resource, the literal predicate has n values, then n URIs will be created.
Advanced: the URIs are actually added to disq:uri.uri, and their hashed values to disq:uri.huri.
Note
This is not the only component that can add URIs.
It is not possible to define URIs with component Transform Literals.
About encoding¶
To be a valid URI, reserved characters have to be URL-encoded, see percent-encoding.
The way literals are encoded depends on the option Encoding:
- Standard URL-encoding: ' ' (space character) is encoded as "%20", '@' is encoded as "%40", and so on.
- ONTOFORCE encoding: space (' '), comma (','), period ('.'), semicolon (';') and slash ('/') are encoded as an underscore character ('_'); all other special characters are encoded via standard URL-encoding.
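The conversion steps (lowercase, encode, prefix) can be approximated in Python with the standard library function `urllib.parse.quote`. This is a sketch under assumptions, not the component's real code: the function `build_uri`, its parameter names and the encoding identifiers `"standard"` and `"ontoforce"` are invented for illustration. Stripping surrounding whitespace for the ONTOFORCE encoding follows the description of the Encoding option.

```python
from urllib.parse import quote

# Characters that the ONTOFORCE encoding maps to an underscore.
_UNDERSCORE_CHARS = " ,.;/"


def build_uri(literal: str, prefix: str = "", encoding: str = "standard",
              to_lowercase: bool = False) -> str:
    """Sketch of the three steps: lowercase, encode, prepend the prefix."""
    value = literal.lower() if to_lowercase else literal
    if prefix:
        # Per the description, the literal is only encoded when Prefix is set.
        if encoding == "ontoforce":
            value = value.strip()
            value = "".join("_" if c in _UNDERSCORE_CHARS else c for c in value)
        # safe="" percent-encodes every reserved character, including '/'.
        value = quote(value, safe="")
    return prefix + value


print(build_uri("John Snow", "http://got/", "ontoforce", to_lowercase=True))
print(build_uri("John Snow", "http://people.org/"))
```

The first call produces the `http://got/john_snow` form used in the examples of this section; the second produces the standard percent-encoded form with `%20` for the space.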
A simple example¶
Option | Value |
---|---|
To Lowercase | True |
Prefix | "http://got/" |
Encoding | ONTOFORCE encoding |
New Preferred URI | False |
Target Class before applying the component:
name.lit |
---|
[“John Snow”] |
[“Petyr Baelish”] |
Target Class after applying the component:
name.lit | disq:uri.uri | disq:uri.huri |
---|---|---|
[“John Snow”] | [http://got/john_snow] | [HURI(http://got/john_snow)] |
[“Petyr Baelish”] | [http://got/petyr_baelish] | [HURI(http://got/petyr_baelish)] |
Preferred URI¶
Every resource can have zero or more (subject) URIs; like all predicates, disq:uri is multivalued.
However, every resource needs to have a unique Preferred URI in DISQOVER. The preferred URI is one of the values of disq:uri, and there are some mechanisms to specify which one.
If option New Preferred URI is True, then we want this component to define the Preferred URI for each resource.
If a Preferred URI has already been defined by an earlier component, it will be overridden.
In order to ensure that there is never more than one Preferred URI, the following rules apply for each resource:
Number of created URIs | No Preferred URI yet | Already has Preferred URI |
---|---|---|
0 | warning | warning |
1 | OK | OK, override |
> 1 | warning; URIs not added! | warning; URIs not added! |
If option New Preferred URI is False, we want this component to define the Preferred URI for each resource which has no URIs yet, so without overriding any previously set Preferred URI.
For each resource the following rules apply:
Number of created URIs | No Preferred URI yet | Already has Preferred URI |
---|---|---|
0 | warning | warning |
1 | OK | OK, don’t override |
> 1 | warning; URIs not added! | OK, don’t override |
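The two rule tables can be transcribed into a small Python decision function. This is a literal restatement of the tables for illustration only; the function name `apply_uri_rules`, its parameters and the returned tuple are hypothetical, and the selection strategy option is not modeled.

```python
def apply_uri_rules(n_created: int, has_preferred: bool, force_new: bool):
    """Return (uris_added, preferred_changed, warning) for one resource.

    n_created     -- number of URIs created from the literal predicate
    has_preferred -- whether the resource already has a Preferred URI
    force_new     -- value of the option New Preferred URI
    """
    if n_created == 0:
        # Nothing to add: both tables list a warning.
        return (False, False, True)
    if n_created == 1:
        # A single candidate is always added. It becomes the Preferred URI
        # when forced, or when the resource had no Preferred URI yet.
        return (True, force_new or not has_preferred, False)
    # More than one candidate.
    if not force_new and has_preferred:
        # URIs are added and the existing Preferred URI is kept.
        return (True, False, False)
    # Ambiguous choice of Preferred URI: warning, URIs not added.
    return (False, False, True)


print(apply_uri_rules(1, has_preferred=True, force_new=False))
print(apply_uri_rules(2, has_preferred=False, force_new=False))
```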
A more complicated example¶
In the following example some resources already have a URI before this component is applied.
Option | Value |
---|---|
To Lowercase | True |
Prefix | "http://got/" |
Encoding | ONTOFORCE encoding |
New Preferred URI | False |
URIs have been abbreviated:
- ‘G:’ stands for “http://got/”
- ‘P:’ stands for “http://people.org/”
Preferred URIs are notated in boldface.
Target Class before applying the component:
name.lit | disq:uri.uri | disq:uri.huri |
---|---|---|
[“John Snow”] | [P:John%20Snow] | [HURI(P:John%20Snow)] |
[“Sansa Stark”] | [] | [] |
[“Petyr Baelish”, “Littlefinger”] | [] | [] |
[“Sandor Clegane”, “The Hound”] | [P:Sandor%20Clegane] | [HURI(P:Sandor%20Clegane)] |
[] | [P:anonymous] | [HURI(P:anonymous)] |
[] | [] | [] |
Target Class after applying the component:
name.lit | disq:uri.uri | disq:uri.huri |
---|---|---|
[“John Snow”] | [P:John%20Snow, G:john_snow] | [HURI(P:John%20Snow), HURI(G:john_snow)] |
[“Sansa Stark”] | [G:sansa_stark] | [HURI(G:sansa_stark)] |
[“Petyr Baelish”, “Littlefinger”] | [] | [] |
[“Sandor Clegane”, “The Hound”] | [P:Sandor%20Clegane, G:sandor_clegane, G:the_hound] | [HURI(P:Sandor%20Clegane), HURI(G:sandor_clegane), HURI(G:the_hound)] |
[] | [P:anonymous] | [HURI(P:anonymous)] |
[] | [] | [] |
Observe:
- John Snow gets a second URI, but the first one is still the Preferred URI.
- Sansa gets her first URI, so it becomes the Preferred URI.
- Petyr didn’t have a URI yet, and has two names, so there are two URI candidates. Because it is not clear which one should be the Preferred URI, a warning is issued and neither of the URIs is added!
- Sandor also has two candidate URIs, but since he already has a preferred URI and the option New Preferred URI is False, the URIs are just added, the preferred URI is not changed, and no warning is issued.
- The next-to-last example (anonymous) poses no problem. Since there is no name, no URI is added.
- The last example will issue a warning because no preferred URI can be set.
Options¶
- Class : The name of the class on which the action will be performed.
- Literal Predicate : The Literal Predicate containing the value(s) that will be combined with a prefix to form the URI(s).
- Force as new preferred URI [Optional] : If turned on, the created URI will be set as the preferred URI. If turned off, the URI will only be used as the preferred URI if the resource didn’t have one yet. The default value is False.
- New Preferred URI selection strategy [Optional] : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
URI encoding
- Prefix [Optional] : The prefix to be used for the URI.
- Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: Standard URL encoding (e.g. ' ' is converted to '%20'), ONTOFORCE encoding (this strips surrounding whitespace, replaces ;,. / characters with underscores and applies standard URL encoding to all other characters).
- To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is False.
Advanced
- Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The new Preferred URI overwrites the preferred URI assigned by federation synchronization (this may corrupt federation)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The new Preferred URI overwrites the preferred URI assigned by federation synchronization (this may corrupt federation)”. The default value is 1.
- Minimal count for warning “The URI could not be added because the literal predicate is empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
- Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.4. Aggregate and Transform (resources)¶
Applies an expression to aggregate data from resources in one class, and then uses the aggregated data through an expression in another (or the same) class.
Description¶
This component operates in two phases. In the first phase it accumulates or aggregates information from one or more predicates of a class (Phase 1 Class). In the second phase it uses the aggregated information to transform predicates in the same or another class (Phase 2 Class).
Different kinds of accumulation are possible. Some examples:
- calculate the maximum of all values of a numerical predicate,
- calculate the sum of all values of a numerical predicate,
- calculate the average (mean) of the values of a numerical predicate,
- count the number of values of a predicate,
- count the number of distinct values of a predicate,
- count the frequency of values of a predicate (a histogram).
The user can specify the exact aggregation behavior via an expression which is applied to every resource of Phase 1 Class. Examples are given below.
In the second phase the aggregated information is used in a transformation, similar to Transform Literals. For example:
- use the aggregated maximum value to transform a numerical predicate to a percentage relative to that maximum
- use the aggregated average value to produce a predicate which indicates which values are below average
- use the aggregated value frequency to produce a predicate which indicates which values are unique
Similar to other components, filters can be applied in both phases.
The $STORE object¶
Aggregated information is stored in a special variable $STORE, which is always a dictionary object (also called ‘dict’ or ‘map’), a collection of key-value pairs.
The keys are always strings, the values can be any type (string, number, list, other dictionary, …).
As with any dictionary, the value corresponding to a key can be retrieved/changed using expression functions:
- DictGet(d, k) returns the value of key k in dictionary d
- DictSet(d, k, v) sets the value of key k in dictionary d equal to v
A literal dict can be specified like this: {"name": "John", "age": 33}.
Furthermore, it is possible to loop over a dictionary (with Map or Reduce) using the function DictKeys.
Phase 1: Initial Expression¶
At the start of phase 1, a $STORE object is automatically created. Its value is an empty dictionary ({}).
It is possible to change this initial value using Phase 1 Initial Expression, typically via the function DictSet. For example, if you want to calculate the sum of some numerical predicate, you can introduce a key-value pair sum = 0 like this:
DictSet($STORE, "sum", 0)
Note that you cannot set $STORE via a construct like set $STORE = {...}.
It is possible to initialize multiple variables, e.g.:
DictSet($STORE, "count", 0);
DictSet($STORE, "sum", 0)
Phase 1: Resource Expression¶
This expression is applied to every resource of Phase 1 Class. Any change in $STORE is carried over to the next resource.
For example, if you want to count resources:
set _count = DictGet($STORE, "count");
DictSet($STORE, "count", _count + 1)
Or, in one line:
DictSet($STORE, "count", DictGet($STORE, "count") + 1)
This whole construction (including initialization) is equivalent to the following traditional (pseudo-)code:
count = 0
for resource in resources:
    count = count + 1
Note that, in this case, initializing the count to zero in the Initial Expression is strictly not necessary, because DictGet takes an optional parameter specifying the default value (if the key is missing):
set _count = DictGet($STORE, "count", default=0);
...
To calculate the sum of a numerical predicate, say price, one has to convert the predicate values to numbers (remember that predicate values are always stored as lists of strings).
If you know the predicate is single-valued, you can use the $$-notation:
set _total_price = DictGet($STORE, "total_price", default=0);
DictSet($STORE, "total_price", _total_price + Float($$price))
If the predicate is multi-valued you can use Reduce:
set _total_price = DictGet($STORE, "total_price", default=0);
set _resource_price = Reduce($price, 0, _tot, _el, _tot + Float(_el));
DictSet($STORE, "total_price", _resource_price + _total_price)
or Map (in this case you cannot use the auxiliary variable _total_price):
Map($price, _el, DictSet($STORE,
                         "total_price",
                         DictGet($STORE, "total_price", default=0)
                         + Float(_el)))
Note that it is forbidden to “write” predicates during the first phase.
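For readers more familiar with general-purpose languages, the multi-valued sum can be written as the following Python analog. It is a sketch of the aggregation pattern only: the function name `phase1` and the resource model (dicts mapping predicate names to lists of strings) are assumptions, and a plain Python dict stands in for $STORE.

```python
def phase1(resources, store=None):
    """Fold every resource into one shared store, one resource at a time."""
    store = {} if store is None else store
    for resource in resources:
        # Analog of DictGet($STORE, "total_price", default=0)
        total = store.get("total_price", 0)
        # Predicate values are lists of strings, so convert before summing.
        resource_price = sum(float(v) for v in resource.get("price", []))
        # Analog of DictSet($STORE, "total_price", ...); the change is
        # carried over to the next resource.
        store["total_price"] = total + resource_price
    return store


print(phase1([{"price": ["20", "40"]}]))
print(phase1([{"price": ["20", "40"]}], {"total_price": 100}))
```

The sketch reads predicate values but never writes them, in line with the restriction on the first phase.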
Validating the aggregated data¶
There are two ways to validate the aggregated data.
In the first place, expressions involving $STORE can be validated by providing a Unit Test. Its value before evaluation can be specified in the normal way, but a special syntax is required to specify the value after evaluation:
$price=["20", "40"],
$STORE={},
after $STORE={"total_price": 60};
$price=["20", "40"],
$STORE={"total_price": 100},
after $STORE={"total_price": 160};
In the second place, when the component is executed, the value of $STORE is reported in the component feedback.
Phase 2: Initial Expression¶
The second phase is essentially equivalent to the component Transform Literals, with the addition that the aggregated data in $STORE can also be used.
In some cases it is necessary to do a form of post-processing on $STORE after the first phase, before application to the individual resources.
Suppose, for example that you want to calculate the average value of a numerical predicate. This can be done by aggregating the total value (“sum”) and the number of values (“count”) in the first phase. The average can then be calculated in Phase 2 Initial Expression:
DictSet($STORE,
        "average",
        DictGet($STORE, "sum") / DictGet($STORE, "count"))
or, to avoid division by zero:
DictSet($STORE,
        "average",
        DictGet($STORE, "sum") / DictGet($STORE, "count", default=1))
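As a rough Python analog of this post-processing step (the function name `finalize_average` is hypothetical, and a plain dict stands in for $STORE):

```python
def finalize_average(store: dict) -> dict:
    """Post-process the store once, before any Phase 2 resource is visited."""
    # Analog of DictGet($STORE, "count", default=1): the default is used
    # only when the key is missing from the store.
    count = store.get("count", 1)
    store["average"] = store["sum"] / count
    return store


print(finalize_average({"sum": 60, "count": 3}))
print(finalize_average({"sum": 60}))
```

In this Python version the default only guards against a missing key; a stored count of 0 would still raise a ZeroDivisionError. Whether the expression language behaves the same way is not stated here, so it may be worth covering that case in a Unit Test.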
Phase 2: Resource Expression¶
After the (optional) Initial Expression, a second sweep is executed on Phase 2 Class, applying Phase 2 Resource Expression to every resource.
In principle Phase 2 Class can be different from Phase 1 Class, but very often it is the same class. If that is the case, you can leave the option empty.
This expression can read and write predicates, and can use $STORE.
For example, if you have a single-valued predicate cost and aggregated the total cost in phase 1, you can produce a derived predicate cost_percent like this:
set _total_cost = DictGet($STORE, "total_cost");
set @cost_percent = [Str(Float($$cost) / _total_cost * 100)]
or, if the predicate is multi-valued:
set _total_cost = DictGet($STORE, "total_cost");
set @cost_percent = Map($cost, _el, Str(Float(_el) / _total_cost * 100))
Notes:
- It is forbidden to change $STORE in Phase 2 Resource Expression.
- Like Transform Literals, this component cannot write subject URIs (disq:uri) or subject Labels (disq:label). However, it can produce (auxiliary) predicates which can then be used in subsequent Add URI or Add Label components.
Another example¶
Suppose you have resources with predicates ID and version, and that multiple resources can have the same ID, but in that case they have different versions:
ID.lit | version.lit |
---|---|
[“id1”] | [“1”] |
[“id2”] | [“1”] |
[“id1”] | [“2”] |
[“id3”] | [“1”] |
[“id1”] | [“3”] |
[“id2”] | [“2”] |
For every ID you only want to keep the resource with the highest version. This can be achieved by removing resources (see Remove Resources), but you first need to produce a predicate which indicates which resources are to be removed.
For this purpose you can use aggregation.
Instead of using a fixed key in $STORE you can use the IDs. The values are the maximum versions (per ID).
Phase 1 Initial Expression can be empty. Phase 1 Resource Expression can be:
set _this_version = Float($$version);
set _current_max_version = DictGet($STORE, $$ID, default=0);
DictSet($STORE, $$ID, Max(_this_version, _current_max_version))
This is an example of a Unit Test for this expression:
$STORE={}, $ID=["foo"], $version=["1"], after $STORE={"foo": 1};
$STORE={"foo": 1}, $ID=["foo"], $version=["3"], after $STORE={"foo": 3};
$STORE={"foo": 3}, $ID=["foo"], $version=["2"], after $STORE={"foo": 3};
$STORE={"foo": 3}, $ID=["bar"], $version=["2"], after $STORE={"foo": 3, "bar": 2};
For the example data above the value of $STORE after the first phase would be:
{'id1': 3,
'id2': 2,
'id3': 1}
In the second phase, you can produce a predicate max_version:
set @max_version = [Str(DictGet($STORE, $$ID))]
The situation after execution is then:
ID.lit | version.lit | max_version.lit |
---|---|---|
[“id1”] | [“1”] | [“3”] |
[“id2”] | [“1”] | [“2”] |
[“id1”] | [“2”] | [“3”] |
[“id3”] | [“1”] | [“1”] |
[“id1”] | [“3”] | [“3”] |
[“id2”] | [“2”] | [“2”] |
Resources can now be removed in a following Remove Resources component with the filter:
$$version != $$max_version
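The whole example (aggregate, transform, then remove) can be traced end to end with a short Python script. This is an illustrative analog, not DISQOVER code: resources are modeled as dicts of string lists, a plain dict stands in for $STORE, and `format(..., "g")` is an assumed way of rendering the float 3.0 as "3".

```python
resources = [
    {"ID": ["id1"], "version": ["1"]},
    {"ID": ["id2"], "version": ["1"]},
    {"ID": ["id1"], "version": ["2"]},
    {"ID": ["id3"], "version": ["1"]},
    {"ID": ["id1"], "version": ["3"]},
    {"ID": ["id2"], "version": ["2"]},
]

# Phase 1: aggregate the maximum version per ID, using the ID as the key.
store = {}
for r in resources:
    key, version = r["ID"][0], float(r["version"][0])
    store[key] = max(version, store.get(key, 0))

# Phase 2: write the aggregated maximum back onto each resource.
for r in resources:
    r["max_version"] = [format(store[r["ID"][0]], "g")]  # 3.0 -> "3"

# Follow-up removal: drop resources where version != max_version.
kept = [r for r in resources if r["version"][0] == r["max_version"][0]]

print(store)
print([(r["ID"][0], r["version"][0]) for r in kept])
```

After the removal step, exactly one resource per ID remains, namely the one with the highest version.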
Options¶
Phase 1 (Aggregation)
- Class : Class containing the predicates to be visited in the first phase.
- Initial Expression [Optional] : The expression executed once at the start of the first phase to create key-value pairs in the $STORE dictionary.
- Resource Expression : The transformation expression executed for each resource in the first phase. The values are accessed via the $STORE dictionary (see Dict Manipulation functions). This expression cannot create predicates.
- Phase1 Filter [Optional] : A boolean expression returning True for resources to which the action should be applied in the first phase.
Phase 2 (Transformation)
- Class [Optional] : Class containing the predicates to be visited in the second phase (can be the same as the first class).
- Initial Expression [Optional] : The expression executed once at the start of the second phase. It creates key-value pairs in the $STORE dictionary.
- Resource Expression : The transformation expression executed for each resource in the second phase. The values are accessed via the $STORE dictionary (see Dict Manipulation functions). This expression can create predicates.
- Phase2 Filter [Optional] : A boolean expression returning True for resources to which the action should be applied in the second phase.
Advanced
- Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during Phase 1 (Aggregation).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during Phase 1 (Aggregation).”. The default value is 1.
- Minimal count for warning “An error occurred during Phase 2 (Transformation).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during Phase 2 (Transformation).”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.5. Configure Canonical Type¶
Configure a Canonical Type together with its facets and properties when loading in DISQOVER.
Description¶
This component defines the configuration settings for a DISQOVER Canonical Type.
Each Canonical Type has a unique URI and a unique label, see the options below.
Resource Selection¶
Each instance in DISQOVER corresponds to a resource in the Data Ingestion Engine, with a well defined Preferred URI and a Preferred Label.
Instances are categorized in groups called Canonical Types. For example the Canonical Type ‘Movie’ could encompass all movie-instances.
Very often instances belong to one Canonical Type, but it is also possible that an instance belongs to multiple Canonical Types. E.g. an actor instance could belong to Canonical Type ‘Actor’ and Canonical Type ‘Person’. In other words: Canonical Types may overlap.
The Publish in DISQOVER component uses the information in the Configure Canonical Type components to convert resources in the Data Ingestion Engine to instances in DISQOVER.
Two options are available to specify which instances belong to a Canonical Type:
- Classes is a list of Class names. All valid resources in each of these classes are included.
- Types allows more fine-grained selection. It is a list of Resource Types. It looks at the resources of all classes produced by the pipeline. All valid resources which have (at least) this Resource Type (rdf:type) are included.
Some examples:
Option | Value |
---|---|
Label | Movies |
URI | http://movies |
Classes | [DisneyMovies] |
Types | [] |
All enabled resources in class DisneyMovies which have a Preferred URI and a Preferred Label are converted to instances of the Canonical Type Movies.
This is a simple one-to-one correspondence.
Option | Value |
---|---|
Label | Movies |
URI | http://movies |
Classes | [DisneyMovies, PixarMovies] |
Types | [] |
Now all valid resources from class DisneyMovies and from class PixarMovies are included.
Option | Value |
---|---|
Label | Movies |
URI | http://movies |
Classes | [] |
Types | http://disney.org/movies/ |
All valid resources in any class which contain Resource Type http://disney.org/movies/ are included.
Resource Types for resources can be specified in a number of ways:
- via the option Resource Type in
  - all Import components
  - Extract Class (distinct)
  - Extract Hierarchical Class
- in a Transform Literals component, by writing to rdf:type
Note that a resource may have multiple Resource Types (multiple values for rdf:type).
Also note that multiple Canonical Types may include the same resources. The instances produced by these resources belong to each of these Canonical Types.
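The selection rules for Classes and Types can be sketched in Python as follows. This is a simplified model, not the engine's implementation: the function `select_resources` and the dictionary layout are hypothetical, a resource is assumed to be valid when it is enabled and has both a Preferred URI and a Preferred Label, and combining both options is assumed to yield the union of the two selections.

```python
def select_resources(pipeline, classes=(), types=()):
    """Return the resources that would contribute to a Canonical Type.

    pipeline maps a class name to a list of resources (plain dicts).
    """
    selected = []
    for class_name, resources in pipeline.items():
        for res in resources:
            # Assumed validity rule: enabled, with a Preferred URI and Label.
            if not (res.get("enabled", True) and res.get("uri") and res.get("label")):
                continue
            in_class = class_name in classes
            has_type = any(t in types for t in res.get("rdf_type", []))
            if in_class or has_type:
                selected.append(res)
    return selected


# Hypothetical sample data for the three example configurations.
pipeline = {
    "DisneyMovies": [
        {"uri": "http://disney.org/movies/bambi", "label": "Bambi",
         "rdf_type": ["http://disney.org/movies/"]},
        {"uri": None, "label": "Untitled"},  # no Preferred URI, so not valid
    ],
    "PixarMovies": [
        {"uri": "http://pixar.example/up", "label": "Up", "rdf_type": []},
    ],
}

by_class = select_resources(pipeline, classes=["DisneyMovies"])
by_two_classes = select_resources(pipeline, classes=["DisneyMovies", "PixarMovies"])
by_type = select_resources(pipeline, types=["http://disney.org/movies/"])
```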
Properties and Facets¶
Predicates in the Data Ingestion Engine can be exposed in DISQOVER by configuring them as Properties or as Facets. A Facet is a kind of property which can be used to partition the instances in groups. It is typically used for filtering.
For example, the predicate title in class Movies can be exposed as a Property labeled Title.
The value of this Property for a particular instance is equal to the value of this predicate for the corresponding resource.
(Technical note: if the value is a list and it contains duplicates, then the duplicates are removed.)
As most movies have a distinct title, this is not a very good candidate for a Facet.
The predicate genre is a better candidate to expose as a Facet.
Filtering on genre in DISQOVER would allow the user to show all movies of a particular genre.
Note that a property or facet can be configured to correspond to multiple predicates. The value of the property/facet for a particular instance is the union of the values of the contributing predicates.
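As a rough Python illustration of how several predicates contribute to one property value: the helper `property_values` and the predicate name `category` are hypothetical, and preserving the order of first occurrence is an assumption rather than documented behavior.

```python
def property_values(resource, predicates):
    """Union of the values of the contributing predicates, duplicates removed."""
    values = []
    for predicate in predicates:
        for value in resource.get(predicate, []):
            if value not in values:  # keep only the first occurrence
                values.append(value)
    return values


movie = {
    "genre": ["Animation", "Family", "Animation"],
    "category": ["Family", "Musical"],
}
genres = property_values(movie, ["genre", "category"])
```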
A Canonical Type can have any number of Properties and any number of Facets. They are configured in this component. At a minimum, for each Property or Facet, the label, DISQOVER URI and the contributing predicates should be specified. More options are available to specify the description, the datatype etc. For more information, please refer to the list below.
Very often a predicate (or a set of predicates) is used both as a Property and as a Facet. The Facet can be defined separately in Facets (note that the Facet-URI and the Property-URI should be different), but it is often easier to define the Property in Properties/Facets and use the special options within the Property definition to expand it to a Facet. At a minimum, you should provide a Facet-URI.
See also component Configure Sub-instance Type, which allows you to configure a Property as a sub-table of a Canonical Type.
Within Properties/Facets, the option Renderer defines how the property is rendered within the instance list or instance popout (see also section 4.2.3):
- Html: allows HTML markup in the property, but disables any executable scripts.
- Unsanitized Html: allows HTML markup in the property. This option can be used to insert executable scripts into DISQOVER, and should therefore only be used when the source of the HTML is trustworthy beyond any doubt.
- Date: shows date predicates (of the form “yyyy-mm-dd”) in a date format as specified within the browser settings of the user.
- String: used to display text.
- Image: if a property contains an external link to an image, this renderer shows the image.
- External link: makes hyperlink properties clickable.
- Paragraph, Sub table (deprecated) and Sub key value (deprecated): these options were used in previous versions of DISQOVER but are now deprecated. They have no effect when chosen.
Template Properties¶
The option Template allows you to create a template property that does not have any predicates but is based on one or multiple other properties, for example to add a prefix to the value of a property, or to combine multiple properties.
The syntax to do so is to reference the property as @<property_uri>@ in the template. If the referenced property has multiple values, you will see multiple template values. You can reference multiple properties and if they are all multivalued, multiple template values will be created using all combinations of the property values.
Another option for working with multivalued properties is to add a delimiter in the template. You can combine as many properties (with or without a delimiter) as you want in a template.
Links¶
Relationships in the Data Ingestion Engine can be exposed to DISQOVER in different ways:
- All relationships are automatically exposed to DISQOVER as “untyped” or “anonymous” links. They are available, among others, in the All links widget.
- The component Configure Typed Link allows configuration of “typed” links.
- A Property or Facet can be configured to use a Forward Relationship Predicate (xxx.fwd) or a Reverse Relationship Predicate (xxx.rev). In DISQOVER the link values are presented by their Preferred Label.
Trees¶
You can configure facets to have tree data by using path predicates (xxx.path, generated by the Expand Hierarchical Paths component). Note that you should not specify the option Parent Facet in this case. A path predicate can also be used in a property for publishing to Remote Data Subscription; however, the property will not be used when publishing to DISQOVER. You will receive a warning in this case, which you can suppress by turning the option Publish to DISQOVER off.
Publishing the Configuration¶
The configuration defined in all Configuration Components (Canonical Types, Properties, Facets, …) can be transferred to DISQOVER in two ways:
- “automatically” after successful execution of Publish in DISQOVER.
- “manually” via the menu command “Generate configuration” in the Data Ingestion Engine frontend.
Manual publishing is typically used for cosmetic changes in the configuration, e.g. if a Property description was changed. If anything more structural was changed (such as adding a property, changing selected predicates etc.), you should execute Publish in DISQOVER.
Icon¶
The icon to be displayed in the canonical type tile in DISQOVER, which can be selected from a number of available in-house icons along with the font-awesome v5.9.0 icons. To use one of our in-house icons, prefix the icon name from the table below with . For example: . To use a font-awesome icon, prefix the font-awesome icon name with . We currently only support the light style of fontawesome icons. For example: .
These are the available in-house icons:
Federation¶
In a federated setting, this component might add to or hide features of a remote type, if the local URI matches up with the remote type.
A facet can also be defined together with a property. The facet will use the same predicates, label and description etc. as the property. The DISQOVER URI for the facet must be defined explicitly.
Options¶
- Label : The display name of the canonical type.
- Description [Optional] : The description of the canonical type.
- Icon : The name of the icon of the canonical type. You can use icons from fontawesome.com. For example, to use the ‘handshake’ icon, fill in ‘font-awesome fa-handshake’.
- Classes [Optional] : List of classes contributing to this canonical type. The default value is [].
- Properties/Facets [Optional] : All properties of this canonical type. A list of sub-options with the following structure:
  - Label : The display name of the property.
  - URI : The URI of the property.
  - Description : The description of the property.
  - Subinstance Type : The subinstance type of the values if applicable.
  - Predicates : List of predicates mapping to this property.
  - Renderer : The way the property should be visualized. The possible values are: html, unsanitized html, date, string, image, paragraph (deprecated), sub table (deprecated), sub key value (deprecated), external link. The value can also be undefined.
  - Template : A template which generates values from other property values. For example, to add a prefix to a property, the value should be: “prefix”@<property_uri>@
  - Order By : The way the property values should be ordered within a single instance. The possible values are: Numeric order, Label order (Case sensitive), Label order (Case insensitive), Date order. The value can also be undefined.
  - Disable Sorting : Specify true if there is no need to make this property sortable. The default value is False.
  - Data Type : The data type of the property. Specify this if you want a property sortable by an integer or float property. The possible values are: int, float, lat-lon, location_tree. The value can also be undefined.
  - Not Text Searchable : Specify true if there is no need to make this property text-searchable. The default value is False.
  - Export to file : Specify true to include this property when exporting data to file (Turtle format). The default value is True.
  - Publish for Remote Data Subscription : Specify true to include this property when publishing for remote data subscription. The default value is True.
  - Publish to DISQOVER : Specify true to include this property when publishing to DISQOVER. The default value is True.
  - Visible for Groups : The user groups which are allowed to view the property. Leave unspecified if accessible for all.
  - Mixed Security Values : Specify true here if this property has individual property values which could be hidden. The default value is False.
  - Custom predicate used in Published Data Set : Specify a custom predicate name to be used in the published data set.
  - Also create a facet using these options : Use as facet. The default value is False.
  - Facet URI : The URI of the facet.
  - Facet Parent Predicate : The predicate defining the hierarchy between the facet values.
  - Facet Not Annotated Label : A custom label for the “not annotated” item.
  - Facet Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
  - Facet Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
  - Facet Precision : The number of decimals to show for a floating point number.
  - Facet Single Valued : Specifies whether the values are single valued. The default value is False.
  - Facet View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.
- Resource Types [Optional] : List of resource types contributing to this canonical type. The default value is [].
Advanced
- URI [Optional] : The URI of the canonical type.
- Visible for Groups [Optional] : The identifiers of the user groups that will be allowed to see this canonical type. If this option is left empty, the canonical type will be accessible for everyone.
- Default Hidden [Optional] : Specify true if the canonical type should not be visible on the dashboard. The default value is False.
- Synonym as Property [Optional] : If turned on, a property named Synonym is automatically created, which contains all labels of a resource. The default value is True.
- Generate Semantic Hit [Optional] : If true, an exact match of the label will generate a semantic hit for the concept. The default value is True.
- In-house Canonical Type [Optional] : If turned on, the in-house icon will be shown for this canonical type. The default value is True.
- Allow Mixing with Federated Data [Optional] : If turned on, local data will be mixed with federated data in this canonical type. The default value is True.
- Disable Canonical Type [Optional] : If turned on, this canonical type will be completely disabled. The default value is False.
- Label Renderer [Optional] : The way the label should be visualized. The possible values are: html, string.
- Facets [Optional] : All facets of this canonical type. A list of sub-options with the following structure:
  - Label : The display name of the facet.
  - URI : The URI of the facet.
  - Description : The description of the facet.
  - Predicates : List of predicates mapping to this facet.
  - Parent Predicate : The predicate defining the hierarchy between the facet values.
  - Not Annotated Label : A custom label for the “not annotated” item.
  - Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
  - Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
  - Precision : The number of decimals to show for a floating point number.
  - Single Valued : Specifies whether the values are single valued. The default value is False.
  - View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.
  - Export to file : Specify true to include this facet when exporting data to file (Turtle format). The default value is False.
  - Visible for Groups : The user groups which are allowed to view the facet. Leave unspecified if accessible for all.
  - Mixed Security Values : Specify true here if this facet has individual facet values which could be hidden. The default value is False.
Remote Data Subscription
- Publish for Remote Data Subscription [Optional] : Publish as a data set for remote data subscription. The default value is False.
- Data Set Name [Optional] : Name of the data set to be used for remote data subscription (by default this is the last part of the URI of the canonical type).
- Remote Data Groups [Optional] : The identifiers of the user groups that will be allowed to see this Remote Data Set. If this option is left empty, the data set will be accessible for everyone.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “A property was configured to contain HTML code that will be rendered on the page. Only do this if the source of the property is trustworthy, because malicious code could be executed during rendering.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “A property was configured to contain HTML code that will be rendered on the page. Only do this if the source of the property is trustworthy, because malicious code could be executed during rendering.”. The default value is 1.
9.4.6. Configure Sub-instance Type¶
Configure a Sub-instance Type which can be referenced by other Canonical Types when loading in DISQOVER
Description¶
This component defines the configuration settings for a DISQOVER sub-instance type. In a federated setting, this might add to or hide features from a remote sub-instance type, if the local URI matches up with the remote type. It does this by mapping types and predicates to a DISQOVER sub-instance type URI and a number of its property URIs.
Sub-instance types are used when you want to show a subtable as a property of a canonical type, and don’t want to define the items in the subtable as full-blown canonical types. E.g., you might want to show a subtable for a Disney character with cultural references:
Magazine | Article | Author |
---|---|---|
Mouse Monthly | Minnie as a role model | A. Mauser |
Shoes, shoes, shoes | Pitter patter | H. Heels |
If you don’t want a canonical type ‘Cultural Reference’, you can define it as a sub-instance type. Sub-instance types should have their own class with an instance URI. Here is the standard sequence for modeling this relationship for the given example. We’ll assume we have an import file for the parent class (disney_characters), and another one for cultural references, with the following structure:
Character | Magazine | Article | Author |
---|---|---|---|
D:minnie_mouse | Mouse Monthly | Minnie as a role model | A. Mauser |
… | … | … | … |
- Import the disney_characters and add URI and label.
- Import the references into the class cultural_references.
- Add a URI to cultural_references, e.g. by combining character, magazine and article.
- Create a relationship by identifier from cultural_references to disney_characters, e.g. cultural_references:mentions.
- Create a sub-instance type for cultural_references, e.g. CulturalReference with URI D:cultural_reference.
- Create or update the DisneyCharacter canonical type to have a property mentioned_in, that uses cultural_references:mentions.rev to populate the values, and specifies the sub_type to be D:cultural_reference.
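The data flow behind these steps can be pictured with a small Python sketch. It is only a model of the idea, not of the components themselves: the helpers `reference_uri` and `build_reverse_links` and the URI scheme are hypothetical, and the second reference row is assumed to belong to the same character as the first.

```python
from collections import defaultdict

# Rows of the cultural_references import (second row assumed for illustration).
references = [
    {"character": "D:minnie_mouse", "magazine": "Mouse Monthly",
     "article": "Minnie as a role model", "author": "A. Mauser"},
    {"character": "D:minnie_mouse", "magazine": "Shoes, shoes, shoes",
     "article": "Pitter patter", "author": "H. Heels"},
]


def reference_uri(ref):
    """Build a URI by combining character, magazine and article (made-up scheme)."""
    parts = (ref["character"], ref["magazine"], ref["article"])
    return "D:cultural_reference/" + "/".join(
        p.lower().replace(" ", "_") for p in parts
    )


def build_reverse_links(references):
    """Collect, per character, the reference URIs that mention it.

    This plays the role of cultural_references:mentions.rev on the character.
    """
    mentioned_in = defaultdict(list)
    for ref in references:
        mentioned_in[ref["character"]].append(reference_uri(ref))
    return dict(mentioned_in)


links = build_reverse_links(references)
```

Each character ends up with the list of reference URIs that feed its mentioned_in property, which DISQOVER can then render as a subtable.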
Properties¶
A sub-instance type can have any number of properties. At a minimum the DISQOVER URI and the contributing predicates should be specified. For a list of additional optional arguments, please refer to the options list below.
Options¶
- Label [Optional] : The display name of the subinstance type.
- Classes [Optional] : List of classes contributing to this subinstance type. The default value is [].
- Resource Types [Optional] : List of resource types corresponding to this subinstance type. The default value is [].
- Properties : All properties of this subinstance type. A list of sub-options with the following structure:
  - Label : The display name of the property.
  - URI : The URI of the property.
  - Description : The description of the property.
  - Predicates : List of predicates mapping to this property.
  - Renderer : The way the property should be visualized. The possible values are: html, unsanitized html, date, string, image, paragraph (deprecated), sub table (deprecated), sub key value (deprecated), external link. The value can also be undefined.
  - Template : A template which generates values from other property values. For example, to add a prefix to a property, the value should be: “prefix”@<property_uri>@
  - Order By : The way the property values should be ordered within a single instance. The possible values are: Numeric order, Label order (Case sensitive), Label order (Case insensitive), Date order. The value can also be undefined.
  - Disable Sorting : Specify true if there is no need to make this property sortable. The default value is False.
  - Data Type : The data type of the property. Specify this if you want a property sortable by an integer or float property. The possible values are: int, float, lat-lon, location_tree. The value can also be undefined.
  - Not Text Searchable : Specify true if there is no need to make this property text-searchable. The default value is False.
  - Export to file : Specify true to include this property when exporting data to file (Turtle format). The default value is True.
  - Publish for Remote Data Subscription : Specify true to include this property when publishing for remote data subscription. The default value is True.
  - Publish to DISQOVER : Specify true to include this property when publishing to DISQOVER. The default value is True.
  - Visible for Groups : The user groups which are allowed to view the property. Leave unspecified if accessible for all.
  - Mixed Security Values : Specify true here if this property has individual property values which could be hidden. The default value is False.
  - Custom predicate used in Published Data Set : Specify a custom predicate name to be used in the published data set.
  - Also create a facet using these options : Use as facet. The default value is False.
  - Facet URI : The URI of the facet.
  - Facet Parent Predicate : The predicate defining the hierarchy between the facet values.
  - Facet Not Annotated Label : A custom label for the “not annotated” item.
  - Facet Data Type : The data type of the facet. Specify if you need it in a histogram, otherwise leave undefined. The possible values are: location_tree, lat-lon, int, float, date. The value can also be undefined.
  - Facet Additive : Specify true if it makes sense to show the sum of the values to the user (if dataType is int or float).
  - Facet Precision : The number of decimals to show for a floating point number.
  - Facet Single Valued : Specifies whether the values are single valued. The default value is False.
  - Facet View Type : The view type of the facet. The possible values are: countrymap, date, images, hierarchical, default, dataset. The value can also be undefined.
Advanced
- URI [Optional] : The URI of the subinstance type.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.7. Configure Typed Link¶
Typed Links specify which links exist between two Canonical Types and what information about them should be loaded in DISQOVER.
Description¶
This component defines the configuration settings for a DISQOVER Typed Link. A Typed Link defines a typed relationship between two Canonical Types or within a single Canonical Type. The Configure Typed Link component should be preceded by the Configure Canonical Type components defining those Canonical Types.
Creating a relationship by identifier between two classes creates an untyped link, which will be visible in an instance’s detail pane and can be used to navigate through. However, this only tells us there is a link, but not what kind of link. To be able to filter on the type of link or on properties of that relationship, we need to define a Typed Link and/or associated Relation types.
Let’s take the example of Disney characters and movies. We might then have imported two classes: disney_characters and movies, and created a relationship by identifier between them: disney_characters:appears_in. At this moment we know there is some link, but we cannot say what kind of relationship it is.
Suppose we have a data set that describes properties of a certain character’s appearance in certain movies, and that we have imported this data set into the class chars_movies. Furthermore, let’s say we already created a relationship by identifier for chars_movies:character and chars_movies:movie.
We’ll create a Typed Link Characters2Movies between http://disney.org/characters and http://disney.org/movies. Let’s say the Typed Link can be characterized by capacity (e.g. main role versus cameo) and the movie release date. We can define facets and properties for these values. (If we don’t want this functionality, we can turn Show details off.)
We can then create a Relation type AppearsIn, or HasMerchandiseFor. These are two possible Relation types within the Typed Link. When configuring AppearsIn, we’ll define the following options:
Option | Value |
---|---|
Label | Appears in |
Inverse Label | Appearance made by |
URI | http://disney.org/chars2movies/appearance |
Sources | See below |
And as a source, we’ll use disney_characters:appears_in, with the following options:
Option | Value |
---|---|
Direct Predicate | disney_characters:appears_in |
To Source Predicate | chars_movies:character |
To Destination Predicate | chars_movies:movie |
Types | http://disney.org/characters2movies |
Options¶
- Source Type : The URI of the existing source canonical type.
- Destination Type : The URI of the existing destination canonical type.
- Relation Types : All relation types for this typed link. A list of sub-options with the following structure:
  - Label : The label for the outgoing relation.
  - Inverse Label : The label for the inverse relation.
  - URI : The URI of the relation type.
  - Sources : The sources for this relation type. A list of sub-options with the following structure:
    - Direct Predicate : The predicate which links source Canonical Type instances to destination Canonical Type instances.
    - Inverse Predicate : The predicate which links destination Canonical Type instances to source Canonical Type instances. If empty, the reverse of Direct Predicate will be used.
  - Description : The description of the relation.
  - Publish for Remote Data Subscription : Specify true to include this relation type when publishing for remote data subscription. The default value is True.
  - Visible for Groups : The user groups which are allowed to view the relation type. Leave unspecified if accessible for all. The default value is [].
  - Custom direct predicate used in published data set : Specify a custom direct predicate name to be used in the published data set.
  - Custom inverse predicate used in published data set : Specify a custom inverse predicate name to be used in the published data set.
- Allow Mixing with Federated Data [Optional] : If turned on, local data will be mixed with federated data in this typed link. The default value is True.
Advanced
- URI [Optional] : The URI of the typed link.
Remote Data Subscription
- Publish for Remote Data Subscription [Optional] : Publish as a Link Set for Remote Data Subscription. The default value is False.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.8. Configure User Views (DEPRECATED)¶
User views specify which different reduced views on the data will be available for the user when publishing in DISQOVER.
Options¶
- User Views Triples [Optional] : Triples defining the user views.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.9. Create Compact Class¶
Copies all enabled resources (i.e. resources that have not been removed) from the source class to the destination class. This can be used as an optimization after removing resources.
Description¶
This component copies all active resources from a class (Source Class) to another class (Target Class).
Components like Remove Resources, Merge Classes, and Merge within Class deactivate resources in a class. Deactivated resources slow down further processing of the class, because they must be skipped each time. Copying the active resources to a new class can substantially improve the performance of subsequent components.
Advanced¶
- This component doesn’t copy auxiliary columns.
- Preferredness of URIs and labels is taken over.
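Conceptually, the component behaves like the following Python sketch. The function `create_compact_class`, the `active` flag and the list-of-dicts layout are hypothetical stand-ins for the engine's internal storage.

```python
def create_compact_class(source, auxiliary=()):
    """Copy only the active resources, leaving out auxiliary predicates."""
    target = []
    for res in source:
        if not res.get("active", True):
            continue  # deactivated resources are not copied
        target.append(
            {name: value for name, value in res.items()
             if name != "active" and name not in auxiliary}
        )
    return target


characters = [
    {"label": "Mickey", "active": True, "tmp": 1},
    {"label": "Goofy", "active": False, "tmp": 2},  # removed earlier in the pipeline
    {"label": "Pluto", "tmp": 3},
]
compact = create_compact_class(characters, auxiliary=("tmp",))
```

Later components then iterate over the compact class without having to skip deactivated entries.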
Options¶
- Source Class : The class containing disabled resources which will not be transferred to the Target Class.
- Target Class : The new class which will only contain enabled resources.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.10. Create Relationship (by identifier)¶
Create a relationship between the target class and the matching class by matching the content of an identifier (literal predicate) in the target class to a URI in the matching class.
Description¶
This component creates a relationship (or link) between two classes, by matching literals in the Target Class to existing URIs in Matching Class.
The created relationship is bidirectional, i.e. it is stored in two predicates:
- a “forward” predicate in Target Class.
- a “reverse” predicate in Matching Class.
The name of these predicates is specified in option Relationship Predicate.
More in detail, the component works as follows:
- For each resource T in Target Class, all values of the literal predicate specified in option Matching Identifier are considered.
- For each literal value, a corresponding URI is created in the same way as in component Add URI, taking into account options Prefix, To lowercase, Encoding.
- Each URI is searched in predicate disq:uri (subject URI) of the Matching class:
- If a resource M is found whose subject URI is equal to the URI (a match), then the URI is added to predicate RRR.uri, and its hashed value to predicate RRR.fwd, where we used RRR to denote the predicate specified in option Relationship Predicate. The reverse link is also created, by adding the (first) subject URI of resource T to predicate RRR.rev in resource M.
- If the URI is not found (no match) then it is added to predicate RRR.err (this can be used, e.g. for debugging).
Prerequisite: both classes must have a predicate disq:uri (.huri to be precise).
Filters can be defined on both classes.
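The matching steps above can be sketched in Python. This is only an illustration, not the DISQOVER implementation: `make_uri`, `huri` and `link_by_identifier` are hypothetical names, the order of the lowercase and encoding steps is an assumption, and the hashed URI is represented by a placeholder string because the hash function is not described here.

```python
from urllib.parse import quote


def make_uri(value, prefix="", to_lowercase=False):
    # Hypothetical stand-in for the Add URI conversion with ONTOFORCE encoding:
    # strip whitespace, optionally lowercase, replace ; , . space / by "_",
    # URL-encode the rest, then put the prefix in front.
    part = value.strip()
    if to_lowercase:
        part = part.lower()
    for ch in ";,. /":
        part = part.replace(ch, "_")
    return prefix + quote(part, safe="")


def huri(uri):
    # Placeholder for the hashed URI; the real hash is not specified here.
    return f"HURI({uri})"


def link_by_identifier(target, matching, identifier, rel, prefix, to_lowercase):
    # Index the Matching Class by subject URI.
    by_uri = {uri: m for m in matching for uri in m["disq:uri.uri"]}
    for t in target:
        uris, errs, fwd = [], [], []
        for value in t.get(identifier, []):
            uri = make_uri(value, prefix, to_lowercase)
            m = by_uri.get(uri)
            if m is None:
                errs.append(uri)  # no match: kept for debugging
            else:
                uris.append(uri)
                fwd.append(huri(uri))
                # Reverse link: hashed (first) subject URI of the target resource.
                m.setdefault(f"{rel}.rev", []).append(huri(t["disq:uri.uri"][0]))
        t[f"{rel}.uri"], t[f"{rel}.err"], t[f"{rel}.fwd"] = uris, errs, fwd
```

Resources are modelled as plain dictionaries mapping predicate names to lists of values, which mirrors the tables used in the example of this section.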
Example¶
Option | Value |
---|---|
Target Class | DisneyCharacters |
Matching Class | Animals |
Matching Identifier | animal_name |
Relationship Predicate | animal |
Prefix | "http://animals.org/" |
To lowercase | True |
Encoding | ONTOFORCE encoding |
URIs have been abbreviated:
- ‘D:’ stands for
http://disney.org/
- ‘A:’ stands for
http://animals.org/
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | disq:uri.huri | animal_name.lit |
---|---|---|
[D:mickey_mouse] | [HURI(D:mickey_mouse)] | [“Mouse”, “House Mouse”] |
[D:pluto] | [HURI(D:pluto)] | [“Dog”] |
[D:goofy] | [HURI(D:goofy)] | [“Dog”, “Human”] |
[D:donald_duck] | [HURI(D:donald_duck)] | [“Duck”] |
Matching Class Animals
before applying the component:
disq:uri.uri | disq:uri.huri |
---|---|
[A:dog] | [HURI(A:dog)] |
[A:house_mouse] | [HURI(A:house_mouse)] |
[A:mouse] | [HURI(A:mouse)] |
Target Class DisneyCharacters
after applying the component:
disq:uri.uri | disq:uri.huri | animal_name.lit | animal.uri | animal.err | animal.fwd |
---|---|---|---|---|---|
[D:mickey_mouse] | [HURI(D:mickey_mouse)] | [“Mouse”, “House Mouse”] | [A:mouse, A:house_mouse] | [] | [HURI(A:mouse), HURI(A:house_mouse)] |
[D:pluto] | [HURI(D:pluto)] | [“Dog”] | [A:dog] | [] | [HURI(A:dog)] |
[D:goofy] | [HURI(D:goofy)] | [“Dog”, “Human”] | [A:dog] | [A:human] | [HURI(A:dog)] |
[D:donald_duck] | [HURI(D:donald_duck)] | [“Duck”] | [] | [A:duck] | [] |
Matching Class Animals
after applying the component:
disq:uri.uri | disq:uri.huri | animal.rev |
---|---|---|
[A:dog] | [HURI(A:dog)] | [HURI(D:goofy), HURI(D:pluto)] |
[A:house_mouse] | [HURI(A:house_mouse)] | [HURI(D:mickey_mouse)] |
[A:mouse] | [HURI(A:mouse)] | [HURI(D:mickey_mouse)] |
Observe:
- The literal value “House Mouse” is converted to the URI http://animals.org/house_mouse, according to the options Prefix, To Lowercase and Encoding.
- Mickey Mouse has two values for the literal identifier, both matching a subject URI in Animals, so both these URIs are written to animal.uri and their hashed values to animal.fwd. Conversely, Mickey’s hashed subject URI (HURI(http://disney.org/mickey_mouse)) is added to animal.rev for both animals.
- Goofy also has two values for the literal identifier, but one of them (human) has no counterpart in Animals, so that URI ends up in animal.err and only ‘dog’ is added to animal.fwd.
- There are two dogs, Pluto and Goofy, so animal.rev gets two HURIs. Note that their order is undetermined!
- Donald does not get linked to an animal, because http://animals.org/duck is not a subject URI in Animals.
Options¶
Target Class
- Target Class : The class containing the literal predicate used for matching. This class will receive the forward predicate of the link.
- Matching Predicate : The literal predicate used to match URIs in the Matching Class.
- Relationship Predicate : The new predicate which will contain the links.
- Prefix [Optional] : The prefix to be used for the URI.
- Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: No encoding, Standard URL encoding (e.g. ' ' is converted to '%20'), ONTOFORCE encoding. ONTOFORCE encoding strips surrounding whitespace, replaces ;,. / characters with underscores and applies standard URL encoding to all other characters.
- To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is False.
- Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.
Matching Class
- Matching Class : The class containing the URIs used for matching. This class will receive the reverse predicate of the link.
- Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Quality Control
- Fraction of Unmatched Identifiers [Optional] : The fraction of unmatched URIs. (lower is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
9.4.11. Create Relationship (by label)¶
Creates a relationship between 2 classes by matching labels, or another literal predicate.
Description¶
This component creates relationships (or links) between Target Class and Matching Class, by looking at matching literals.
The created relationship is bidirectional, i.e. it is stored in two predicates:
- a “forward” predicate in Target Class.
- a “reverse” predicate in Matching Class.
The name of these predicates is specified in option Relationship Predicate.
More in detail:
- All values of the literal predicate Matching Predicate in Target Class and all values of the literal predicate Matching Predicate in Matching Class are examined.
- If a resource T in Target Class has one or more values in common with a resource M in Matching Class, then a relationship is created:
- The hashed value of the (first) subject URI of M is added to predicate RRR.fwd in the Target Class.
- The hashed value of the (first) subject URI of T is added to predicate RRR.rev in the Matching Class, where we used RRR to denote the predicate specified in option Relationship Predicate.
By default both predicates Matching Predicate are equal to disq:label.lit (or disq:label for short), because comparing by label is a common operation.
The way literals are compared to each other can be tailored via two options:
- Case Sensitive determines whether uppercase/lowercase differences matter. For example, if False, “dog” is considered to be equal to “Dog”.
- Remove Dashes and Spaces determines whether differences due to dashes ('-') or spaces (' ') matter. For example, if True, “my-dog” is considered to be equal to “my dog” and to “mydog”.
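The effect of the two text matching options can be sketched in Python as a normalization step applied to both literals before they are compared. This is a hedged illustration only: `normalize`, `huri` and `link_by_label` are hypothetical names, resources are modelled as dictionaries of predicate lists, and the hashed URI is a placeholder string.

```python
def normalize(value, case_sensitive=False, remove_dashes_and_spaces=False):
    # Literals that normalize to the same string are considered equal.
    if not case_sensitive:
        value = value.lower()
    if remove_dashes_and_spaces:
        value = value.replace("-", "").replace(" ", "")
    return value


def huri(uri):
    # Placeholder for the hashed URI; the real hash is not specified here.
    return f"HURI({uri})"


def link_by_label(target, matching, target_pred, matching_pred, rel, **options):
    # Index the Matching Class by normalized literal value.
    index = {}
    for m in matching:
        for value in m.get(matching_pred, []):
            index.setdefault(normalize(value, **options), []).append(m)
    for t in target:
        fwd = t.setdefault(f"{rel}.fwd", [])
        for value in t.get(target_pred, []):
            for m in index.get(normalize(value, **options), []):
                link = huri(m["disq:uri.uri"][0])
                if link not in fwd:  # one link per pair, even with several shared values
                    fwd.append(link)
                    m.setdefault(f"{rel}.rev", []).append(huri(t["disq:uri.uri"][0]))
```

With both options at their defaults, “Dog” and “dog” match, while “House Mouse” and “house-mouse” only match once Remove Dashes and Spaces is enabled.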
Example¶
Option | Value |
---|---|
Target Class | DisneyCharacters |
Matching Predicate | DEFAULT (disq:label ) |
Relationship Predicate | animal |
Matching Class | Animals |
Matching Predicate | name |
Case Sensitive | False |
Remove Dashes and Spaces | True |
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘A:’ stands for “http://animals.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | disq:label.lit |
---|---|
[D:mickey_mouse] | [“Mouse”, “House Mouse”] |
[D:pluto] | [“Dog”] |
[D:goofy] | [“Dog”, “Human”] |
[D:donald_duck] | [“Duck”] |
Matching Class Animals
before applying the component:
disq:uri.uri | name.lit |
---|---|
[A:123] | [“dog”] |
[A:482] | [“house-mouse”] |
[A:392] | [“mouse”] |
Target Class DisneyCharacters
after applying the component:
disq:uri.uri | disq:label.lit | animal.fwd |
---|---|---|
[D:mickey_mouse] | [“Mouse”, “House Mouse”] | [HURI(A:392), HURI(A:482)] |
[D:pluto] | [“Dog”] | [HURI(A:123)] |
[D:goofy] | [“Dog”, “Human”] | [HURI(A:123)] |
[D:donald_duck] | [“Duck”] | [] |
Matching Class Animals
after applying the component:
disq:uri.uri | name.lit | animal.rev |
---|---|---|
[A:123] | [“dog”] | [HURI(D:goofy), HURI(D:pluto)] |
[A:482] | [“house-mouse”] | [HURI(D:mickey_mouse)] |
[A:392] | [“mouse”] | [HURI(D:mickey_mouse)] |
Options¶
Target Class
- Target Class : The class containing the literal predicate used for label matching. This class will receive the forward predicate of the link.
- Matching Predicate [Optional] : The predicate of the Target Class used for matching. The default value is disq:label.lit.
- Relationship Predicate : The new predicate which will contain the links.
- Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.
Matching Class
- Matching Class : The class to be matched against. This class will receive the reverse predicate of the link.
- Matching Predicate : The literal predicate of the Matching Class used for matching.
- Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.
Text Matching
- Case Sensitive [Optional] : Case sensitive matching of literals. The default value is False.
- Remove Dashes and Spaces [Optional] : Remove dashes and spaces when matching literals. The default value is False.
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Quality Control
- Fraction matched [Optional] : The fraction of resources in Matching Class that have been matched successfully. (higher is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.12. Define Datasource¶
Define the meta-data of a data source.
Description¶
With this component, the meta-data of a data source is set. Each import component in the pipeline must be assigned to a data source. This data source is then used to indicate the provenance of the imported data. For more details see Data Overview. A single data source can be assigned to multiple import components.
All pipelines in the Data Ingestion Engine start with one or more Define Datasource components. When creating a pipeline, the first thing to do is importing the data, but each import component must be preceded by a Define Datasource component (except the ‘Import Remote Data Set’ component).
‘Outdated’ pipeline run¶
When executing the pipeline, the Data Ingestion Engine checks if a component is outdated before it executes the component. A Define Datasource component is outdated if, as with all other components, the user has changed an option value, e.g. the Label of the data source.
For the Define Datasource component there is a second way the component can become outdated: by using the Info File. The Info File is a JSON file which contains the modification date of the data source:
{
"date_modified": "2000-01-01"
}
The location of this file is set via the Info File Path option. The file must be stored where the Data Ingestion Engine has access to it (that is, in the source data directory, probably close to the actual source files of the data source).
When the Define Datasource component is executed, the date mentioned in the ‘info’ file is stored by the Data Ingestion Engine. During a subsequent pipeline run, the Data Ingestion Engine compares the modification date in that file to what the value was the previous time. If the date is more recent than the stored date, the Define Datasource component is flagged as ‘outdated’. The component and its successor components are then executed again, while the new modification date is stored.
Note (1): this principle is used not only if you select an ‘Outdated’ pipeline run, but also in ‘Differential’ and ‘Incremental’ mode.
Note (2): you can also set the modification date using the Modification Date option in the component. This way, you don’t need to create an info file. The effect is the same, since the component becomes outdated when you change the date. The use of info files is, however, very convenient when the download of the data source files is automated. The automated download should then overwrite the info file with the most recent modification date each time the downloaded files are renewed.
If you want more control over when a data source needs to be ‘outdated’, for example if multiple source file updates happen during one day, you can use another parameter in the ‘info’ file. You can specify a so-called version tag, like this:
{
"date_modified": "2000-01-01",
"version_tag": "1.14.5"
}
The version tag can be adapted at any time, and can be formatted in any way you like. In fact, the tag does not need to be a string, but can also be an integer or decimal value. Being able to trigger multiple runs based on ‘outdated’ Define Datasource components can be very important when you are using incremental data ingestion.
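An automated download job could refresh the info file with a few lines of Python. The sketch below is an assumption about how such a job might look, not part of DISQOVER: `build_info` and `write_info_file` are hypothetical helper names, and only the keys "date_modified" and "version_tag" come from the description above.

```python
import json
from datetime import date
from pathlib import Path


def build_info(modified, version_tag=None):
    # "date_modified" is mandatory; "version_tag" is only added when given.
    info = {"date_modified": modified.isoformat()}
    if version_tag is not None:
        info["version_tag"] = version_tag
    return info


def write_info_file(path, modified, version_tag=None):
    # Overwrite the info file after each successful download.
    text = json.dumps(build_info(modified, version_tag), indent=2)
    Path(path).write_text(text, encoding="utf-8")


# Example call (the path is hypothetical, relative to the source data folder):
# write_info_file("my_source/info.json", date.today(), version_tag="1.14.5")
```

Changing the version tag on every download allows several ‘outdated’ runs on the same day, even though the modification date itself stays the same.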
Options¶
- Label : The name of the datasource.
- Short Label [Optional] : The short label of the datasource.
- Homepage [Optional] : The URL of the homepage.
- Description [Optional] : The description of the datasource.
- Modification Date [Optional] : The last modification date of the datasource.
- Info File Path [Optional] : It is possible to define properties of a datasource (such as the modification date) using a separate JSON file. This option specifies the relative path of that JSON file. If this option is filled in, the properties will be retrieved from the file and not from within this component. The file must contain the “date_modified” key.
Advanced
- URI [Optional] : A URI that uniquely identifies the datasource.
- URI Scheme [Optional] : The URI scheme of the datasource.
- Example URI Scheme [Optional] : A URI scheme example for the datasource.
- Visible for Groups [Optional] : The identifiers of the user groups that will have access to this data source. If this option is left empty, the data source will be accessible for everyone.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “There was a problem reading the Info file.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “There was a problem reading the Info file.”. The default value is 1.
- Minimal count for warning “The file path should be relative to the DISQOVER source_data folder. Absolute paths are deprecated.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The file path should be relative to the DISQOVER source_data folder. Absolute paths are deprecated.”. The default value is 1.
9.4.13. Expand Hierarchical Paths¶
Expands a forward link to a hierarchical class to a new predicate containing the full paths.
Description¶
This component expands a forward link to a hierarchical class to a new predicate containing the full paths.
Options¶
Target Class
- Target Class : The class containing the relationship predicate to the hierarchical class.
- Target Relationship Predicate : Predicate containing the relationship to the hierarchical class.
- Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.
- Target Path Predicate [Optional] : Predicate to store the generated path.
Hierarchical Class
- Hierarchical Class : The class containing the parent child relationships.
- Parent Relationship Predicate : Predicate containing the parent relationship.
- Hierarchical Class Filter [Optional] : Boolean expression returning true for resources which should be included.
Example¶
We have a list of Disney characters and information about where they live. We also have a class which contains hierarchical location data:
United States
/ \
/ \
Calisota Washagon
/ \ \
/ \ \
Duckburg Mouseton Zenith
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘L:’ stands for “http://location.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Preferred URIs are notated in boldface.
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | location.fwd |
---|---|
[D:donald_duck] | [HURI(L:duckburg)] |
[D:mickey_mouse] | [HURI(L:mouseton)] |
[D:daisy_duck] | [HURI(L:duckburg)] |
Hierarchical Class DisneyLocations
before applying the component:
disq:uri.uri | disq:label.lit | location.rev | parent.fwd | parent.rev |
---|---|---|---|---|
[L:duckburg] | [“Duckburg”] | [HURI(D:donald_duck), HURI(D:daisy_duck)] | [HURI(L:calisota)] | [] |
[L:mouseton] | [“Mouseton”] | [HURI(D:mickey_mouse)] | [HURI(L:calisota)] | [] |
[L:zenith] | [“Zenith”] | [] | [HURI(L:washagon)] | [] |
[L:calisota] | [“Calisota”] | [] | [HURI(L:us)] | [HURI(L:duckburg), HURI(L:mouseton)] |
[L:washagon] | [“Washagon”] | [] | [HURI(L:us)] | [HURI(L:zenith)] |
[L:us] | [“United States”] | [] | [] | [HURI(L:calisota), HURI(L:washagon)] |
A predicate location_path.path (value of Target Path Predicate) is added to the Target Class. There are no changes in the Hierarchical class. The Target Path predicate contains the complete hierarchical path of the Parent Relationship Predicate:
disq:uri.uri | location_path.path | location.fwd |
---|---|---|
[D:donald_duck] | [(HURI(L:duckburg), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] | [HURI(L:duckburg)] |
[D:mickey_mouse] | [(HURI(L:mouseton), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] | [HURI(L:mouseton)] |
[D:daisy_duck] | [(HURI(L:duckburg), HURI(L:calisota)), (HURI(L:calisota), HURI(L:us))] | [HURI(L:duckburg)] |
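The path expansion shown in the table can be sketched in Python as a walk up the parent links that collects (child, parent) pairs. This is an illustrative sketch only: `expand_path` is a hypothetical name, plain strings stand in for hashed URIs, and the handling of loops and of nodes with several parents is an assumption.

```python
def expand_path(start, parent_of):
    # Collect (child, parent) pairs from the start node up to the root(s).
    # parent_of maps a node to the list of its parents (parent.fwd).
    pairs = []
    seen = {start}
    todo = [start]
    while todo:
        node = todo.pop()
        for parent in parent_of.get(node, []):
            pairs.append((node, parent))
            if parent in seen:
                continue  # loop in the hierarchy: do not follow it again
            seen.add(parent)
            todo.append(parent)
    return pairs


parent_of = {
    "duckburg": ["calisota"],
    "mouseton": ["calisota"],
    "zenith": ["washagon"],
    "calisota": ["us"],
    "washagon": ["us"],
}
donald_path = expand_path("duckburg", parent_of)
```

The `seen` set guarantees termination when the hierarchy contains loops, which is the situation the “Hierarchy contains loops.” warning refers to.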
Options¶
Target Class
- Target Class : The class containing the relationship predicate to the hierarchical class.
- Target relationship predicate : Predicate containing the relationship to the hierarchical class.
- Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.
- Target path Predicate : Predicate to store the generated path.
Hierarchical Class
- Hierarchical Class : The class containing the parent child relationships.
- Parent relationship predicate : Predicate containing the parent relationship.
- Hierarchical Class Filter [Optional] : A boolean expression returning True for resources in the Hierarchical Class to which the action should be applied.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is
1
. - Minimal count for warning “Hierarchy contains loops.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Hierarchy contains loops.”. The default value is
1
.
9.4.14. Extract Class¶
Move or copy resources to a new class (all predicates or a selection of predicates).
Description¶
This component copies or moves resources from a class (Source Class) to a new class (Target Class).
The resources to be copied are specified via the Source Class Filter.
By default all predicates are copied, except auxiliary predicates. Predicates to include allows you to copy only a specific set of predicates. Predicates to exclude allows you to copy all predicates except a specific set. These options cannot be filled in at the same time.
By default copied resources are removed from the Source Class (or more accurately: these resources are disabled). This behavior can be changed with the option Remove Copied Resources, but be aware that this might introduce duplicate URIs.
If all predicates are copied and the source resources are removed, this amounts to moving the resources to the new class, or, in other words, splitting the class (the reverse of merging). This is the default behavior and has the same effect as the component Create Compact Class.
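The interplay of the filter, the include/exclude lists and the removal of copied resources can be sketched in Python. This is a simplified illustration under stated assumptions: `extract_class` is a hypothetical name, each resource is modelled as a dictionary with an "active" flag and a "predicates" mapping, and special handling of auxiliary, label and URI predicates is left out.

```python
def extract_class(source, resource_filter=lambda preds: True,
                  include=None, exclude=None, remove_copied=True):
    if include and exclude:
        raise ValueError("Predicates to include and Predicates to exclude "
                         "cannot be filled in at the same time")
    target = []
    for resource in source:
        preds = resource["predicates"]
        if not resource["active"] or not resource_filter(preds):
            continue
        if include:
            copied = {p: list(v) for p, v in preds.items() if p in include}
        else:
            copied = {p: list(v) for p, v in preds.items() if p not in (exclude or ())}
        target.append({"active": True, "predicates": copied})
        if remove_copied:
            resource["active"] = False  # disabled, not physically deleted
    return target
```

With `remove_copied=False` the source resources stay active, so the same subject URIs then exist in both classes.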
Advanced¶
If subject label (disq:label.lit) or preferred label are included in the predicates to be copied, they will both be copied. Likewise, if subject URI (disq:uri.uri), hashed URI, Preferred URI or hashed Preferred URI are included, they will all be copied.
Auxiliary columns are not copied.
Options¶
Source Class
- Source Class : Class containing resources to be extracted.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
- Predicates to include [Optional] : List of predicates to be copied. Empty means all predicates.
- Predicates to exclude [Optional] : List of predicates to be excluded from extraction.
- Remove Copied Resources [Optional] : Remove the original resources after extracting. The default value is True.
Target Class
- Target Class : New class to which the resources will be extracted.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The source class contains links that can become broken in the extracted class.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The source class contains links that can become broken in the extracted class.”. The default value is 1.
9.4.15. Extract Class (distinct)¶
Creates a new class containing distinct values derived from a predicate in a given class.
Description¶
This component extracts distinct values from a literal predicate in Source Class to a new Target Class.
A new Target Class is created (it’s an error if there is already a class with that name), with the following predicates:
- disq:uri.uri (subject URI)
- disq:uri.huri (its hashed value)
- an output predicate defined by Values Predicate (by default disq:label.lit)
- rdf:type.lit (if Resource Type is filled in)
- RRR.rev (back link), where RRR is the value of Relationship Predicate.
In the Source Class a forward link is created in
- RRR.fwd.
The component reads values from Aimed Predicate in Source Class, converts them, one by one, via the expression given in Value Expression (if not empty), and transforms the results to URIs, similar to Add URI:
- convert to lowercase if To Lowercase is True.
- URL-encode according to Encoding (unless Prefix is empty).
- add Prefix in front.
For each unique URI created (extracted) in this way:
- a resource is created in Target Class.
- the extracted URI is written to disq:uri.uri, and its hashed value to disq:uri.huri.
- all literal values which yield this URI are added to Values Predicate.
- the hashed URIs of all resources contributing to the extracted URI are added to RRR.rev.
Conversely,
- the hashed URI of each extracted URI is added to RRR.fwd in Source Class for each resource contributing to that extracted URI.
The option Value Expression can be used to transform (or extract from) literals prior to determining unique values. The expression can only depend on one string variable, called $value, and must produce a single string value. A typical use case is importing complete JSON or XML blobs and extracting a unique identifier. For example, XmlGetTextFirst($value, "./code") extracts the value of subnode <code> of an XML node.
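For readers more familiar with Python than with the expression language, the effect of that expression is roughly comparable to the following sketch. It is an analogy built on the standard library, not the implementation of XmlGetTextFirst; the function name `xml_get_text_first` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET


def xml_get_text_first(value, path):
    # Parse the XML blob and return the text of the first node matching path,
    # or None when no node matches.
    root = ET.fromstring(value)
    return root.findtext(path)


blob = "<item><code>A12</code><name>Example</name></item>"
identifier = xml_get_text_first(blob, "./code")
```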
If the option Resource Type is filled in, then an extra predicate rdf:type is created (with this value for each resource). See Configure Canonical Type for more information about Resource Types.
For more details about encoding, see Add URI.
Note that the order of extracted resources is undetermined.
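The extraction steps described above can be sketched in Python. This is a minimal illustration under assumptions, not the DISQOVER implementation: `make_uri`, `huri` and `extract_distinct` are hypothetical names, ONTOFORCE encoding is approximated from its description, the hashed URI is a placeholder string, and Value Expression and filters are left out.

```python
from urllib.parse import quote


def make_uri(value, prefix, to_lowercase=True):
    # Lowercase, encode (approximation of ONTOFORCE encoding), add the prefix.
    part = value.strip()
    if to_lowercase:
        part = part.lower()
    for ch in ";,. /":
        part = part.replace(ch, "_")
    return prefix + quote(part, safe="")


def huri(uri):
    # Placeholder for the hashed URI; the real hash is not specified here.
    return f"HURI({uri})"


def extract_distinct(source, aimed, rel, prefix, to_lowercase=True,
                     values_pred="disq:label.lit", resource_type=None):
    target = {}  # extracted URI -> new resource
    for s in source:
        fwd = s.setdefault(f"{rel}.fwd", [])
        for value in s.get(aimed, []):
            uri = make_uri(value, prefix, to_lowercase)
            if uri not in target:
                target[uri] = {"disq:uri.uri": [uri], "disq:uri.huri": [huri(uri)],
                               values_pred: [], f"{rel}.rev": []}
                if resource_type:
                    target[uri]["rdf:type.lit"] = [resource_type]
            new = target[uri]
            if value not in new[values_pred]:
                new[values_pred].append(value)  # every literal yielding this URI
            back = huri(s["disq:uri.uri"][0])
            if back not in new[f"{rel}.rev"]:
                new[f"{rel}.rev"].append(back)
            if huri(uri) not in fwd:
                fwd.append(huri(uri))
    return list(target.values())
```

A dictionary keyed by the extracted URI keeps the resources distinct. In this sketch the result follows insertion order, whereas the component itself makes no promise about the order.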
Example¶
Option | Value |
---|---|
Source Class | DisneyCharacters |
Aimed Predicate | animal.lit |
Value Expression | empty |
Relationship Predicate | animal |
Prefix | http://animals.org/ |
Encoding | ONTOFORCE encoding |
To lowercase | True |
Target Class | Animals |
Values Predicate | empty (disq:label.lit by default) |
Resource Type | http://ontology/animal |
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘A:’ stands for “http://animals.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Source Class DisneyCharacters
before applying the component:
disq:uri.uri | animal.lit |
---|---|
[D:mickey_mouse] | [“Mouse”, “House Mouse”] |
[D:pluto] | [“dog”] |
[D:goofy] | [“Dog”, “Human”] |
[D:minnie_mouse] | [“Mouse”] |
[D:donald_duck] | [] |
Source Class DisneyCharacters
after applying the component:
disq:uri.uri | animal.lit | animal.fwd |
---|---|---|
[D:mickey_mouse] | [“Mouse”, “House Mouse”] | [HURI(A:mouse), HURI(A:house_mouse)] |
[D:pluto] | [“dog”] | [HURI(A:dog)] |
[D:goofy] | [“Dog”, “Human”] | [HURI(A:dog), HURI(A:human)] |
[D:minnie_mouse] | [“Mouse”] | [HURI(A:mouse)] |
[D:donald_duck] | [] | [] |
Target Class Animals
after applying the component:
disq:uri.uri | disq:label.lit | rdf:type.lit | animal.rev |
---|---|---|---|
[A:house_mouse] | [“House Mouse”] | [”http://ontology/animal”] | [HURI(D:mickey_mouse)] |
[A:dog] | [“Dog”, “dog”] | [”http://ontology/animal”] | [HURI(D:goofy), HURI(D:pluto)] |
[A:mouse] | [“Mouse”] | [”http://ontology/animal”] | [HURI(D:mickey_mouse), HURI(D:minnie_mouse)] |
[A:human] | [“Human”] | [”http://ontology/animal”] | [HURI(D:goofy)] |
Note:
- For Relationship Predicate we chose the same name (animal) as the Aimed Predicate; this is not required.
- “House Mouse” converts to “house_mouse”.
- Pluto and Goofy both link to A:dog because “dog” and “Dog” both convert to “dog”.
Options¶
Source Class
- Class : The class containing the values to be extracted.
- Aimed Predicate : Literal predicate from which values will be extracted. Each distinct value produces a resource in the Target Class with a URI derived from the value.
- Value Expression [Optional] : An expression that creates the identifier by transforming the Aimed Predicate. In this expression, the Aimed Predicate is represented by the $value variable. Both the input and output are Strings.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
- Relationship Predicate : The predicate containing the created relationship.
- Prefix [Optional] : The prefix to be used for the URI.
- Encoding [Optional] : Determines how the part of the generated URI after the prefix will be encoded. The possible values are: Standard URL encoding (e.g. ' ' is converted to '%20'), ONTOFORCE encoding. ONTOFORCE encoding strips surrounding whitespace, replaces ;,. / characters with underscores and applies standard URL encoding to all other characters.
- To Lowercase [Optional] : Convert the part of the generated URI after the prefix to lowercase. The default value is True.
Target Class
- Class : The new class which will contain the distinct resources.
- Values Predicate [Optional] : This is the predicate of the Target Class that will contain the distinct values from the Aimed Predicate. The default value is disq:label.lit.
- Preferred Label selection strategy [Optional] : Determines which value to pick as preferred label when the values predicate is ‘disq:label’ and it has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Resource Type [Optional] : The Resource type for all extracted resources.
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is
1
. - Minimal count for warning “An error occurred while processing the Value Expression.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred while processing the Value Expression.”. The default value is
1
. - Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is
1
. - Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It flags suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is
1
. - Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is
1
. - Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is
1
. - Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is
1
.
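The Prefix, Encoding and To Lowercase options listed above can be pictured with a small Python sketch. This is only an approximation under stated assumptions: the function names are hypothetical, the replaced character set is one reading of the ";,. /" list (semicolon, comma, period, space and slash), and lowercasing is applied after encoding, which is a guess about the order.

```python
from urllib.parse import quote

# Assumption: these are the characters that ONTOFORCE encoding turns into "_".
REPLACED = ";,. /"

def standard_encoding(value):
    """Standard URL encoding, e.g. ' ' becomes '%20'."""
    return quote(value, safe="")

def ontoforce_encoding(value):
    """Strip whitespace, replace listed characters by '_', URL-encode the rest."""
    stripped = value.strip()
    replaced = "".join("_" if ch in REPLACED else ch for ch in stripped)
    return quote(replaced, safe="")

def make_uri(prefix, value, encoding=ontoforce_encoding, to_lowercase=True):
    """Build a URI from a prefix and the encoded (optionally lowercased) value."""
    local = encoding(value)
    if to_lowercase:
        local = local.lower()
    return prefix + local
```

With these assumptions, `make_uri("http://names.org/", "Duck, Donald")` gives `http://names.org/duck__donald`, because both the comma and the space are replaced by underscores.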
9.4.16. Extract Hierarchical Class¶
Extracts and creates node resources from a “Tree Path” predicate.
Description¶
This component produces a hierarchical class (or tree class) (Extracted Class) based on parent-child-information in a literal predicate in Target Class.
Terminology¶
A hierarchical system can be thought of as a collection of nodes, where each node can have zero or more child nodes, and zero or more parent nodes. A tree is a very common hierarchical system in which every node (except the root node) has exactly one parent.
Consider the following example:
      A         G
     / \       /
    /   \     /
   B     C   H
  / \     \
 /   \     \
D     E     F
This hierarchical system has 8 nodes. Parents are shown above their children, so A is a parent of B and C, B is a parent of D and E etc. There are 2 trees, with root nodes A and G. Nodes without children, like D and H, are called leaf nodes.
For every node we can define its path as the list of parent-nodes we have to follow up until we reach the root of the tree. So the path of D would be [D, B, A], and the path of C would be [C, A].
Interestingly, the hierarchy can be constructed based on the paths of all leaf nodes (in this case [D, B, A], [E, B, A], [F, C, A], [H, G]), and that is exactly what this component does.
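To make this concrete, here is a minimal Python sketch that rebuilds the parent and child links of the example from the four leaf paths. The function name `build_hierarchy` is hypothetical and the code only illustrates the idea; it is not the component's implementation.

```python
from collections import defaultdict

def build_hierarchy(leaf_paths):
    """Derive parent and child links from leaf-to-root paths."""
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for path in leaf_paths:
        nodes.update(path)
        # Each consecutive pair in a path is a (child, parent) link.
        for child, parent in zip(path, path[1:]):
            parents[child].add(parent)
            children[parent].add(child)
    roots = {n for n in nodes if n not in parents}
    leaves = {n for n in nodes if n not in children}
    return dict(parents), dict(children), roots, leaves

LEAF_PATHS = [["D", "B", "A"], ["E", "B", "A"], ["F", "C", "A"], ["H", "G"]]
parents, children, roots, leaves = build_hierarchy(LEAF_PATHS)
```

Here `roots` contains A and G, and `children["B"]` contains D and E, matching the diagram.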
Implementation¶
In Data Ingestion Engine every node will be represented by a resource in the hierarchical class,
which has a parent-child relationship with itself.
This relationship is conventionally stored in predicates parent.fwd
and parent.rev (although the name parent
has no special meaning).
For each resource, the hashed URI of its parent is stored in parent.fwd,
and the hashed URIs of its children in parent.rev.
Following the example above, in resource B parent.fwd would have a single value, namely the hashed URI of A, and parent.rev would have two values, namely the hashed URIs of D and E.
This component creates the hierarchy based on path information that is stored in Literal Predicate, for each leaf node.
Path information is essentially made up of labels and URIs of all nodes in the path (up to the root). In the example above, the path information for node F would be
- label-of-F, URI-of-F
- label-of-C, URI-of-C
- label-of-A, URI-of-A
This is encoded in a single string, separating each item by "||"
.
So for node F that would be:
"label-of-F||URI-of-F||label-of-C||URI-of-C||label-of-A||URI-of-A"
Path information is normally produced by a Transform Literals component. Two special functions are dedicated to this task:
CreateTreePath
CreatePersonPath
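The string layout described above can be sketched in Python as follows. The helpers `encode_path` and `decode_path` are hypothetical names used for illustration only; they are not the CreateTreePath or CreatePersonPath functions, and they assume that labels and URIs never contain the separator themselves.

```python
SEPARATOR = "||"

def encode_path(nodes):
    """nodes: list of (label, uri) pairs, ordered from the leaf up to the root."""
    return SEPARATOR.join(item for label, uri in nodes for item in (label, uri))

def decode_path(path):
    """Split a path string back into (label, uri) pairs."""
    items = path.split(SEPARATOR)
    if len(items) % 2 != 0:
        raise ValueError("expected alternating labels and URIs")
    return list(zip(items[0::2], items[1::2]))

PATH_OF_F = encode_path([
    ("label-of-F", "URI-of-F"),
    ("label-of-C", "URI-of-C"),
    ("label-of-A", "URI-of-A"),
])
```

`PATH_OF_F` is the same string as shown for node F, and `decode_path(PATH_OF_F)` returns the three (label, URI) pairs again.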
Some special cases may arise if the path information for different nodes is “incompatible”:
Different parents:
"A||URI_A||P1||URI_P1" "A||URI_A||P2||URI_P2"
In this case node A is mentioned twice, but with different parents. This is not a problem, but the resulting hierarchy will not be a true tree.
Different labels:
"A||URI_A||P1||URI_P" "B||URI_B||P2||URI_P"
In this case nodes A and B have the same parent P, but give P different labels. In the current implementation only one of the labels will be retained, the other one will be discarded (the choice is arbitrary).
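Both special cases can be detected by collecting, per URI, the set of parents and the set of labels seen. The Python sketch below does that on already decoded paths; `find_conflicts` is a hypothetical helper, and the second example is written with distinct URIs for A and B, following the prose description.

```python
from collections import defaultdict

def find_conflicts(paths):
    """paths: list of paths, each a list of (label, uri) pairs from leaf to root."""
    labels = defaultdict(set)
    parents = defaultdict(set)
    for path in paths:
        for label, uri in path:
            labels[uri].add(label)
        for (_, child), (_, parent) in zip(path, path[1:]):
            parents[child].add(parent)
    multiple_parents = sorted(u for u, p in parents.items() if len(p) > 1)
    multiple_labels = sorted(u for u, l in labels.items() if len(l) > 1)
    return multiple_parents, multiple_labels

DIFFERENT_PARENTS = [
    [("A", "URI_A"), ("P1", "URI_P1")],
    [("A", "URI_A"), ("P2", "URI_P2")],
]
DIFFERENT_LABELS = [
    [("A", "URI_A"), ("P1", "URI_P")],
    [("B", "URI_B"), ("P2", "URI_P")],
]
```

These two situations correspond to the warnings about resources with multiple parents and resources with multiple labels listed in the component's Warnings options.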
Options¶
The option Add to existing class determines whether this component should create a new class for the extracted resources, or add to an existing class (normally also produced by another Extract Hierarchical Class component). In the latter case the hierarchy in that class will be expanded using the path information in the Target Class. Note that extracting different classes to a single hierarchy class is preferable over extracting to multiple hierarchies and merging them using Merge Classes.
If the option Resource Type is filled in, then an extra predicate rdf:type is created (with this value for each resource). See Configure Canonical Type for more information about Resource Types.
Example¶
In this example, a name-tree is extracted from name-information. Special nodes are created for abbreviated names, such as “Duck, D” and “Duck”. “Duck, Donald” and “Duck, Daisy” have the same initial, so both have the same parent “Duck, D”.
              Duck
             /    \
            /      \
     Duck, D        Duck, H
      /   \              \
     /     \              \
 Duck,     Duck,          Duck,
 Donald    Daisy          Huey
Note: this is not a family tree!
In preparation of this component, the name path information
(probably created using the function CreatePersonPath
in a Transform Literals component)
has been stored in name_path.lit.
Option | Value |
---|---|
Target Class | DisneyCharacters |
Literal Predicate | name_path |
Link Predicate | name |
Extracted Class | DisneyNames |
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘N:’ stands for “http://names.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | name_path.lit |
---|---|
[D:donald_duck] | [“Duck, Donald||N:duck_donald||Duck, D||N:duck_d||Duck||N:duck”] |
[D:daisy_duck] | [“Duck, Daisy||N:duck_daisy||Duck, D||N:duck_d||Duck||N:duck”] |
[D:huey_duck] | [“Duck, Huey||N:duck_huey||Duck, H||N:duck_h||Duck||N:duck”] |
Target Class DisneyCharacters
after applying the component (the values of name_path
are left out):
disq:uri.uri | name_path.lit | name.fwd |
---|---|---|
[D:donald_duck] | … | [HURI(N:duck_donald)] |
[D:daisy_duck] | … | [HURI(N:duck_daisy)] |
[D:huey_duck] | … | [HURI(N:duck_huey)] |
Extracted Class DisneyNames
after applying the component:
disq:uri.uri | disq:label.lit | name.rev | parent.fwd | parent.rev |
---|---|---|---|---|
[N:duck] | [“Duck”] | [] | [] | [HURI(N:duck_d), HURI(N:duck_h)] |
[N:duck_d] | [“Duck, D”] | [] | [HURI(N:duck)] | [HURI(N:duck_donald), HURI(N:duck_daisy)] |
[N:duck_donald] | [“Duck, Donald”] | [HURI(D:donald_duck)] | [HURI(N:duck_d)] | [] |
[N:duck_daisy] | [“Duck, Daisy”] | [HURI(D:daisy_duck)] | [HURI(N:duck_d)] | [] |
[N:duck_h] | [“Duck, H”] | [] | [HURI(N:duck)] | [HURI(N:duck_huey)] |
[N:duck_huey] | [“Duck, Huey”] | [HURI(D:huey_duck)] | [HURI(N:duck_h)] | [] |
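The tables above can be reproduced with a short Python sketch. It is a simplified model, not the component itself: the function name `extract_hierarchical_class` is hypothetical, plain abbreviated URIs are used where the component stores hashed URIs, and predicates are modelled as dictionary keys.

```python
SEPARATOR = "||"

DISNEY_CHARACTERS = {
    "D:donald_duck": "Duck, Donald||N:duck_donald||Duck, D||N:duck_d||Duck||N:duck",
    "D:daisy_duck": "Duck, Daisy||N:duck_daisy||Duck, D||N:duck_d||Duck||N:duck",
    "D:huey_duck": "Duck, Huey||N:duck_huey||Duck, H||N:duck_h||Duck||N:duck",
}

def extract_hierarchical_class(target):
    """Return (name.fwd per target resource, extracted resources keyed by URI)."""
    link = {}
    extracted = {}
    for target_uri, path in target.items():
        items = path.split(SEPARATOR)
        nodes = list(zip(items[0::2], items[1::2]))  # (label, uri), leaf first
        for label, uri in nodes:
            extracted.setdefault(uri, {"disq:label": label, "name.rev": [],
                                       "parent.fwd": [], "parent.rev": []})
        leaf_uri = nodes[0][1]
        link[target_uri] = [leaf_uri]
        extracted[leaf_uri]["name.rev"].append(target_uri)
        for (_, child), (_, parent) in zip(nodes, nodes[1:]):
            if parent not in extracted[child]["parent.fwd"]:
                extracted[child]["parent.fwd"].append(parent)
            if child not in extracted[parent]["parent.rev"]:
                extracted[parent]["parent.rev"].append(child)
    return link, extracted

name_fwd, disney_names = extract_hierarchical_class(DISNEY_CHARACTERS)
```

In this model `disney_names["N:duck"]["parent.rev"]` holds N:duck_d and N:duck_h, and only the three leaf names have a non-empty `name.rev`.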
Options¶
Target Class
- Target Class : The class containing the “Tree Path” predicate to be extracted.
- Literal Predicate : Literal predicate that contains the “Tree Path”.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
- Link Predicate : Predicate to store the forward link in.
Extracted Class
- Extracted Class : The class to which the resources will be extracted. If this class does not already exist, it will be created.
- Add to existing class [Optional] : Needs to be turned on if the tree is added to an existing class. The default value is True.
- Resource Type [Optional] : The Resource type for all extracted resources.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is
1
. - Minimal count for warning “The value for the Tree Path is invalid.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The value for the Tree Path is invalid.”. The default value is
1
. - Minimal count for warning “The extracted class is not a pure tree because some resources have multiple parents” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The extracted class is not a pure tree because some resources have multiple parents”. The default value is
1
. - Minimal count for warning “Some resources in the extracted class have multiple labels.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Some resources in the extracted class have multiple labels.”. The default value is
1
. - Minimal count for warning “Some labels and/or parents in the extracted tree have changed during the incremental run. Any tree facets derived from it will not show the changes correctly.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Some labels and/or parents in the extracted tree have changed during the incremental run. Any tree facets derived from it will not show the changes correctly.”. The default value is
1
. - Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is
1
. - Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is
1
. - Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is
1
. - Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is
1
.
9.4.17. Import CSV¶
Imports CSV files.
Description¶
This component imports CSV files. It can be used to import all types of delimiter-separated files by specifying the Delimiter and it can handle different types of quoting by setting the Quoting option. The column headers are used to transfer the data in the columns to resource predicates.
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/
, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
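The path rules above can be approximated in a few lines of Python. This is only a sketch of the idea (expand wildcards, then check that every match resolves inside the Source Data directory); the helper names `is_inside` and `expand_files` are hypothetical and the component's real checks may differ.

```python
import glob
import os

def is_inside(source_dir, path):
    """True if path resolves to a location inside source_dir."""
    root = os.path.realpath(source_dir)
    target = os.path.realpath(path)
    return os.path.commonpath([root, target]) == root

def expand_files(source_dir, patterns):
    """Expand wildcard patterns relative to source_dir and reject outside paths."""
    matches = []
    for pattern in patterns:
        # os.path.join anchors a relative pattern at source_dir
        # and keeps an absolute pattern unchanged.
        for path in sorted(glob.glob(os.path.join(source_dir, pattern))):
            if not is_inside(source_dir, path):
                raise ValueError(f"{path} is outside the Source Data directory")
            matches.append(path)
    return matches
```

A call such as `expand_files("/disqover/data/source_data", ["movies/*.csv"])` would return the matching CSV files, while a pattern that escapes the directory through `..` raises an error in this sketch.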
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Example¶
As an example, the following file is imported:
movies/disney_movies.csv
title,release_date,runtime
Snow White,"December 21, 1937",83
Pinocchio,"February 7, 1940",88
Dumbo,"October 23, 1941",64
Bambi,"August 13, 1942",70
Cinderella,"February 15, 1950",74
The file is comma-separated and the quoting used in the values is the double quote, so the default values for Delimiter and Quote Character can be used.
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | movies/disney_movies.csv |
Columns | See below |
In the file there are three columns, and we want to import each of these columns to a predicate. The header line of the CSV file is used to import these columns to different predicates. For the Columns option use the following:
File Column | Predicate |
---|---|
title | movie:title |
release_date | movie:release_date |
runtime | movie:runtime |
Class DisneyMovies
after the CSV import component:
movie:title.lit | movie:release_date.lit | movie:runtime.lit |
---|---|---|
[“Snow White”] | [“December 21, 1937”] | [“83”] |
[“Pinocchio”] | [“February 7, 1940”] | [“88”] |
[“Dumbo”] | [“October 23, 1941”] | [“64”] |
[“Bambi”] | [“August 13, 1942”] | [“70”] |
[“Cinderella”] | [“February 15, 1950”] | [“74”] |
Each movie resource has three predicates which all have a single value.
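For readers who prefer code, the following Python sketch mimics this example with the standard `csv` module. It is an illustration only: `import_csv` is a hypothetical function, resources are modelled as dictionaries, and every imported value is stored as a one-element list under the predicate name with the `.lit` extension.

```python
import csv
import io

CSV_TEXT = '''title,release_date,runtime
Snow White,"December 21, 1937",83
Pinocchio,"February 7, 1940",88
Dumbo,"October 23, 1941",64
Bambi,"August 13, 1942",70
Cinderella,"February 15, 1950",74
'''

# Mapping from File Column to Predicate, as in the Columns option.
COLUMNS = {
    "title": "movie:title",
    "release_date": "movie:release_date",
    "runtime": "movie:runtime",
}

def import_csv(text, columns, delimiter=",", quotechar='"'):
    """Turn each data row into one resource with literal predicates."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return [
        {predicate + ".lit": [row[column]] for column, predicate in columns.items()}
        for row in reader
    ]

disney_movies = import_csv(CSV_TEXT, COLUMNS)
```

The quoted dates keep their embedded comma, so the first resource has `"December 21, 1937"` as a single value.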
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If the input file does not contain headers, the Field Names option should be used to define headers. A list of sub-options with the following structure:
- File Column : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is
False
. - Use as URI : Use the predicate as a URI for the resources in the class. The default value is
False
. - Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is
False
. - Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are:
Alphabetically first
,Most common
,Shortest
. - Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is
1
. - Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It flags suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is
1
. - Use as label : Use the predicate as a label for the resources in the class. The default value is
False
. - Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is
False
. - New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are:
Alphabetically first
,Most common
,Shortest
. - Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is
1
. - Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is
1
.
- Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Delimiter [Optional] : Delimiter used in the CSV file. The default value is , (comma).
- Quote Character [Optional] : Override the default quote character. The default value is " (double quote).
- Escape Character [Optional] : A character that removes any meaning from the character following it.
- Field Names [Optional] : If the input file does not contain headers, this option defines names for the columns which can then be used in the Columns option.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All imported instances will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
- Ignore Quotes [Optional] : Ignore all quoting. Can be used when the file does not contain consistent quoting. The default value is False.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is
1
. - Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is
1
. - Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is
1
.
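Several of the options above have close counterparts in Python's `csv` module, which makes their effect easy to try out. The sketch below is an analogy, not the importer's code: `read_rows` and its parameters are hypothetical names that mirror Delimiter, Quote Character, Escape Character, Field Names, Empty Value Encoding and Ignore Quotes.

```python
import csv
import io

def read_rows(text, delimiter=",", quotechar='"', escapechar=None,
              field_names=None, empty_values=(), ignore_quotes=False):
    """Read delimiter-separated text into rows of {column: [value] or []}."""
    quoting = csv.QUOTE_NONE if ignore_quotes else csv.QUOTE_MINIMAL
    reader = csv.DictReader(
        io.StringIO(text),
        fieldnames=field_names,  # None means: take names from the first line
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
        quoting=quoting,
    )
    rows = []
    for row in reader:
        rows.append({
            # Missing fields and values listed as "empty" are imported as no value.
            column: [] if value is None or value in empty_values else [value]
            for column, value in row.items()
        })
    return rows
```

For a header-less, semicolon-separated file one could call `read_rows(text, delimiter=";", field_names=["title", "year", "runtime"], empty_values=["NA"])`.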
9.4.18. Import Excel¶
Imports from Excel spreadsheet files.
Description¶
This component imports data from Excel files by specifying the columns that need to be imported. If there is a header row, you can use the header labels (see also the ‘Has Header Row’ option), otherwise you can use the column names (‘A’, ‘B’, …).
All data is imported into a single class, specified in the option Class.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/
, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If there is a header row (see option Has Header Row) use those headers, otherwise use the column name (‘A’, ‘B’, …). A list of sub-options with the following structure:
- File Column : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is
False
. - Use as URI : Use the predicate as a URI for the resources in the class. The default value is
False
. - Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is
False
. - Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are:
Alphabetically first
,Most common
,Shortest
. - Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is
1
. - Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It flags suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is
1
. - Use as label : Use the predicate as a label for the resources in the class. The default value is
False
. - Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is
False
. - New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are:
Alphabetically first
,Most common
,Shortest
. - Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is
1
. - Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is
1
.
- Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Has Header Row [Optional] : The first row contains column headers. The default value is False.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All imported instances will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
- Sheet or Cell Range [Optional] : Specifies which data will be imported. It is either a sheet name (in which case the whole sheet is imported), a Defined Name denoting a named range, or a range of the form ‘A1:E4’ or ‘sheet-title!A1:E4’. If option Has Header Row is True, make sure the cell range includes the header row. The default is the complete first sheet.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is
1
. - Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is
1
. - Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is
1
.
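The two explicit range forms accepted by the Sheet or Cell Range option, ‘A1:E4’ and ‘sheet-title!A1:E4’, can be told apart from a plain sheet name or Defined Name with a small parser. The Python sketch below is an assumption about how such a value could be interpreted, not the component's own logic; `parse_cell_range` and `column_index` are hypothetical helpers, and columns and rows are numbered from 1.

```python
import re

_RANGE = re.compile(
    r"^(?:(?P<sheet>[^!]+)!)?"
    r"(?P<c1>[A-Za-z]+)(?P<r1>[0-9]+):(?P<c2>[A-Za-z]+)(?P<r2>[0-9]+)$"
)

def column_index(letters):
    """Convert a column name such as 'A' or 'AA' to a 1-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index

def parse_cell_range(spec):
    """Return (sheet, first_col, first_row, last_col, last_row), or None
    when spec is not an explicit range (a sheet name or Defined Name)."""
    match = _RANGE.match(spec)
    if match is None:
        return None
    return (
        match.group("sheet"),
        column_index(match.group("c1")), int(match.group("r1")),
        column_index(match.group("c2")), int(match.group("r2")),
    )
```

The same `column_index` helper also shows how the column names ‘A’, ‘B’, … used in the Columns option map to positions when there is no header row.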
9.4.19. Import Identifier Block¶
Imports column-based files, concatenating all consecutive lines that share the same value in the ‘ID Column’ into one resource.
Description¶
This component is an advanced Import CSV component. For the main usage of the component refer to the CSV importer component. It works in a similar way as the CSV importer, i.e. it can import any column-based (delimiter-separated) file, but it concatenates subsequent lines which share the value of a column: the ID Column. The rows in the imported file must be ordered by the ID Column for the concatenation to work.
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/
, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Example¶
As an example, the following file is imported:
movies/disney_characters.csv
character,movie
Snow White,Snow White
Bashful,Snow White
Doc,Snow White
Dopey,Snow White
Grumpy,Snow White
Happy,Snow White
Sleepy,Snow White
Sneezy,Snow White
Geppetto,Pinocchio
Pinocchio,Pinocchio
Dumbo,Dumbo
Jumbo,Dumbo
The header line of the column-based file indicates there are two columns: character
and movie
.
The file contains 12 different movie characters from three different movies.
A comma is used to separate the data so the default value for Delimiter can be used.
To import this file to the DisneyMovies class, select the movie column as ID Column:
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | movies/disney_characters.csv |
ID Column | movie |
Columns | See below |
For the Columns option use the following:
File Column | Predicate |
---|---|
movie | movie:title |
character | movie:character |
Class DisneyMovies after execution:
movie:title.lit | movie:character.lit |
---|---|
[“Snow White”] | [“Snow White”, “Bashful”, “Doc”, “Dopey”, “Grumpy”, “Happy”, “Sleepy”, “Sneezy”] |
[“Pinocchio”] | [“Geppetto”, “Pinocchio”] |
[“Dumbo”] | [“Dumbo”, “Jumbo”] |
The resources have been grouped per movie title and each movie has a list of characters in the movie:character.lit predicate.
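The grouping shown in this example can be reproduced with a short sketch. It is an illustration only: `import_csv` is a hypothetical function, consecutive rows are grouped on the ID column as the ID Column option describes, and the removal of duplicate values is an assumption inferred from the result table above.

```python
import csv
import io
from itertools import groupby

CSV_TEXT = """character,movie
Snow White,Snow White
Bashful,Snow White
Doc,Snow White
Dopey,Snow White
Grumpy,Snow White
Happy,Snow White
Sleepy,Snow White
Sneezy,Snow White
Geppetto,Pinocchio
Pinocchio,Pinocchio
Dumbo,Dumbo
Jumbo,Dumbo
"""


def import_csv(text, id_column, columns, delimiter=","):
    """Aggregate consecutive rows with the same ID value into one resource."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    resources = []
    # groupby only merges adjacent rows, matching "consecutive lines".
    for _, group in groupby(rows, key=lambda row: row[id_column]):
        resource = {predicate + ".lit": [] for predicate in columns.values()}
        for row in group:
            for file_column, predicate in columns.items():
                values = resource[predicate + ".lit"]
                value = row[file_column]
                # Assumption: empty and duplicate values are skipped.
                if value and value not in values:
                    values.append(value)
        resources.append(resource)
    return resources


movies = import_csv(
    CSV_TEXT,
    id_column="movie",
    columns={"movie": "movie:title", "character": "movie:character"},
)
```

Because only adjacent rows are merged, a file that is not sorted on the ID column would yield several resources for the same movie in this sketch.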
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Columns : The properties to import from the input file(s). This requires the name of the existing column in the input file and the name of the predicate in which the information will be stored. If the input file does not contain headers, the Field Names option should be used to define headers. A list of sub-options with the following structure:
- File Column : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
- Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
- Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
- Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
- Use as label : Use the predicate as a label for the resources in the class. The default value is False.
- Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
- New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- ID Column : The column that will be used as identifier. All consecutive lines with the same value for “ID column” will be aggregated into a single instance.
- Delimiter [Optional] : Delimiter used in the CSV file. The default value is , (comma).
- Quote Character [Optional] : Override the default quote character. The default value is " (double quote).
- Escape Character [Optional] : A character that removes any meaning from the character following it.
- Field Names [Optional] : If the input file does not contain headers, this option defines names for the columns which can then be used in the Columns option.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
- Ignore Quotes [Optional] : Ignore all quoting. Can be used when the file does not contain consistent quoting. The default value is False.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
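The three values of the New Preferred URI selection strategy and New Preferred Label selection strategy options listed above can be sketched as follows. This is an interpretation, not the product's code: `pick_preferred` is a hypothetical name, the "falls back to alphabetically first" rule is assumed to act as a tie-breaker, and plain string comparison stands in for alphabetical order.

```python
from collections import Counter


def pick_preferred(values, strategy="Alphabetically first"):
    """Pick one preferred value from a multi-valued literal predicate."""
    if not values:
        return None  # nothing to pick; the component would warn instead
    if strategy == "Most common":
        counts = Counter(values)
        top = max(counts.values())
        candidates = [value for value, n in counts.items() if n == top]
    elif strategy == "Shortest":
        shortest = min(len(value) for value in values)
        candidates = [value for value in values if len(value) == shortest]
    else:
        candidates = list(values)
    # Assumed fallback: ties are resolved by the alphabetically first value.
    return min(candidates)
```

For example, `pick_preferred(["Sleepy", "Doc", "Happy"], "Shortest")` selects `"Doc"` in this sketch.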
9.4.20. Import JSON¶
Imports JSON files.
Description¶
This component imports resources from JSON files. The JSON file can simply contain a list of resources but it can also have a more complicated structure. In that case the list of resources to be imported from the file can be determined by setting the Instance Path.
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Example¶
As an example, the following file is imported:
movies/disney_movies.json
{
"movies":
[
{
"title": "Snow White",
"release_date": "December 21, 1937",
"runtime": 83
},
{
"title": "Pinocchio",
"release_date": "February 7, 1940",
"runtime": 88
},
{
"title": "Dumbo",
"release_date": "October 23, 1941",
"runtime": 64
},
{
"title": "Bambi",
"release_date": "August 13, 1942",
"runtime": 70
},
{
"title": "Cinderella",
"release_date": "February 15, 1950",
"runtime": 74
}
]
}
The file contains a list of entries representing a movie, and each entry has three fields.
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | movies/disney_movies.json |
Instance Path | movies |
Columns | See below |
Each instance entry in the JSON list has three fields. To import each of these fields to a predicate use the following for the Columns option:
File Column | Predicate |
---|---|
title | movie:title |
release_date | movie:release_date |
runtime | movie:runtime |
Class DisneyMovies after the JSON import component:
movie:title.lit | movie:release_date.lit | movie:runtime.lit |
---|---|---|
[“Snow White”] | [“December 21, 1937”] | [“83”] |
[“Pinocchio”] | [“February 7, 1940”] | [“88”] |
[“Dumbo”] | [“October 23, 1941”] | [“64”] |
[“Bambi”] | [“August 13, 1942”] | [“70”] |
[“Cinderella”] | [“February 15, 1950”] | [“74”] |
Each movie resource has three predicates, each of which has a single value.
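The mapping in this example can be mimicked with the standard `json` module. The sketch below is an assumption-laden illustration: `import_json` is a hypothetical function, the Instance Path is treated as a dot-separated list of keys (the real path syntax may differ), and values are stored as strings because the result table shows the runtime as “83”.

```python
import json


def import_json(text, instance_path, columns):
    """Follow instance_path to a list of entries and map keys to predicates."""
    node = json.loads(text)
    # Assumption: a dot-separated key path is enough for this sketch.
    for key in filter(None, instance_path.split(".")):
        node = node[key]
    resources = []
    for entry in node:
        resource = {}
        for file_column, predicate in columns.items():
            # Keys are looked up relative to each entry under the path.
            if entry.get(file_column) is not None:
                resource[predicate + ".lit"] = [str(entry[file_column])]
        resources.append(resource)
    return resources


SAMPLE = '{"movies": [{"title": "Dumbo", "release_date": "October 23, 1941", "runtime": 64}]}'
COLUMNS = {
    "title": "movie:title",
    "release_date": "movie:release_date",
    "runtime": "movie:runtime",
}
movies = import_json(SAMPLE, "movies", COLUMNS)
```

Entries that lack a configured key simply get no value for that predicate here, which is one plausible reading of how missing fields are treated.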
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Columns : The properties to import from the input file(s). This requires the JSON path or key and the name of the predicate in which the information will be stored. The JSON keys are relative to the instance path. A list of sub-options with the following structure:
- File Column : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
- Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
- Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
- Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
- Use as label : Use the predicate as a label for the resources in the class. The default value is False.
- Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
- New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Resource Path [Optional] : The JSON path to the resources.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
9.4.21. Import RDF (DEPRECATED)¶
Imports RDF files.
Description¶
This component imports RDF files (e.g. turtle or ntriples). In contrast to the CSV, JSON and XML importers, the resources in the RDF files already have a URI and an RDF type, so these will be set during import. The resources to be imported are determined by the Selected RDF Type option: only resources with the following triple will be imported:
<subject> a <selected_rdf_type> .
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All imported resources are assigned a Resource Type (rdf:type) during import. If the option Resource Type is filled in, that value is used. If the option Resource Type is not filled in, the RDF types from the files are used. Note: since Selected RDF Type allows importing multiple RDF types, different imported resources can have different Resource Types. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, and you can use those for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Example¶
As an example, the following turtle file is imported:
/data/source_data/movies/disney_movies.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbpedia: <http://dbpedia.org/ontology/> .
@prefix disney: <http://disney.org/movies/> .
@prefix prop: <http://disney.org/properties#> .
disney:snow_white a dbpedia:film ;
rdfs:label "Snow White" ;
prop:release_date "1937-12-21" ;
prop:runtime "83" .
disney:pinocchio a dbpedia:film ;
rdfs:label "Pinocchio" ;
prop:release_date "1940-02-07" ;
prop:runtime "88" .
disney:dumbo a dbpedia:film ;
rdfs:label "Dumbo" ;
prop:release_date "1941-10-23" ;
prop:runtime "64" .
disney:bambi a dbpedia:film ;
rdfs:label "Bambi" ;
prop:release_date "1942-08-13" ;
prop:runtime "70" .
disney:cinderella a dbpedia:film ;
rdfs:label "Cinderella" ;
prop:release_date "1950-02-15" ;
prop:runtime "74" .
This file contains five http://dbpedia.org/ontology/film instances we want to import.
The data contains type specifiers; these can be removed during import by enabling the Remove Type Specifiers option.
The Class Type is set to http://ns.ontoforce.com#movie to use a new ontology.
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | /data/source_data/movies/disney_movies.ttl |
File Type | turtle |
Selected RDF Type | http://dbpedia.org/ontology/film |
Remove Type Specifiers | True |
Properties | See below |
Each http://dbpedia.org/ontology/film
subject has three properties we want to import.
To import each of these properties to a predicate use the following for the Properties option:
File Property | Predicate |
---|---|
<http://www.w3.org/2000/01/rdf-schema#label> | movie:title |
<http://disney.org/properties#release_date> | movie:release_date |
<http://disney.org/properties#runtime> | movie:runtime |
Note that the angular brackets (< >) around the URIs are optional.
Class DisneyMovies after the RDF import component:
disq:uri.uri | movie:title.lit | movie:release_date.lit | movie:runtime.lit |
---|---|---|---|
[D:snow_white] | [“Snow White”] | [“1937-12-21”] | [“83”] |
[D:pinocchio] | [“Pinocchio”] | [“1940-02-07”] | [“88”] |
[D:dumbo] | [“Dumbo”] | [“1941-10-23”] | [“64”] |
[D:bambi] | [“Bambi”] | [“1942-08-13”] | [“70”] |
[D:cinderella] | [“Cinderella”] | [“1950-02-15”] | [“74”] |
URIs have been abbreviated:
- ‘D:’ stands for http://disney.org/movies/
In contrast to the CSV, JSON and XML import components, the resources are assigned a URI during import. A (new) RDF type has been set for each of the imported resources.
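The selection rule used by this component can be sketched on already-parsed triples. This is a simplified illustration, not the importer itself: `import_rdf` is a hypothetical function, triples are plain `(subject, predicate, object)` string tuples rather than the output of a real RDF parser, and type specifiers are assumed to have been removed already.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def import_rdf(triples, selected_types, properties, type_predicate=RDF_TYPE):
    """Keep subjects with a selected type and map their properties."""
    resources = {}
    # First pass: a subject is imported only if it has a triple
    # <subject> <type_predicate> <one of the selected types>.
    for subject, predicate, obj in triples:
        if predicate == type_predicate and obj in selected_types:
            resources.setdefault(subject, {"disq:uri.uri": [subject]})
    # Second pass: copy the configured properties of the selected subjects.
    for subject, predicate, obj in triples:
        if subject in resources and predicate in properties:
            key = properties[predicate] + ".lit"
            resources[subject].setdefault(key, []).append(obj)
    return list(resources.values())


TRIPLES = [
    ("http://disney.org/movies/dumbo", RDF_TYPE, "http://dbpedia.org/ontology/film"),
    ("http://disney.org/movies/dumbo", "http://www.w3.org/2000/01/rdf-schema#label", "Dumbo"),
    ("http://disney.org/movies/dumbo", "http://disney.org/properties#runtime", "64"),
    ("http://disney.org/people/walt", RDF_TYPE, "http://dbpedia.org/ontology/person"),
]
films = import_rdf(
    TRIPLES,
    selected_types={"http://dbpedia.org/ontology/film"},
    properties={
        "http://www.w3.org/2000/01/rdf-schema#label": "movie:title",
        "http://disney.org/properties#runtime": "movie:runtime",
    },
)
```

The `type_predicate` parameter mirrors the RDF Type Predicate option: passing a different predicate changes which triple decides whether a subject is imported.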
Other cases¶
If the RDF file to import contains multiple RDF types, e.g. http://dbpedia.org/ontology/film and http://dbpedia.org/ontology/movie, we can import both types at once using the following options.
Option | Value |
---|---|
Selected RDF Type | [
|
The default predicate used to import resources from an RDF file is http://www.w3.org/1999/02/22-rdf-syntax-ns#type, commonly written as a.
If we want to import resources by another predicate, we can set it via the rdf_type_predicate option.
If we have a file with movies produced by Walt Disney:
/data/source_data/movies/movie_collection.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix person: <http://www.w3.org/ns/person#> .
@prefix disney: <http://disney.org/movies/> .
@prefix prop: <http://disney.org/properties#> .
disney:snow_white prop:produced_by person:disney_w ;
rdfs:label "Snow White" ;
prop:release_date "1937-12-21" ;
prop:runtime "83" .
disney:pinocchio prop:produced_by person:disney_w ;
rdfs:label "Pinocchio" ;
prop:release_date "1940-02-07" ;
prop:runtime "88" .
We can import the movies as follows:
Option | Value |
---|---|
RDF Type Predicate | <http://disney.org/properties#produced_by> |
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Properties : The properties to import from the input file(s). This requires the existing RDF predicate name and the name of the predicate in which the information will be stored. A list of sub-options with the following structure:
- File Property : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Predicate Type : The type of object predicate. The possible values are: Literal, URI.
- Target Classes : Target Classes for this predicate (only applicable if it is an object URI). The default value is [].
- Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
- Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
- Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
- Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
- Use as label : Use the predicate as a label for the resources in the class. The default value is False.
- Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
- New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Selected RDF Type [Optional] : The type or types of the instances in the source RDF data. The default value is [].
- File Type : The RDF file format of the imported file(s). The possible values are: ntriples, rdfxml, rdfxml-xmp, rdfxml-abbrev, rss-1.0, atom, dot, json-triples, json, html, nquads, turtle.
- RDF Type Predicate [Optional] : The RDF predicate used to select type. The default value is http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
- Remove Type Specifiers [Optional] : Remove the type and language specifiers of the objects during import. The default value is False.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
9.4.22. Import RDF (multiple classes)¶
Imports RDF files into multiple classes and creates internal relationships.
Description¶
This component can import RDF-data, store it in multiple classes, and automatically create relationships present in the RDF-data.
Execution of the component triggers one or more single-class Import RDF (DEPRECATED) sub-components, followed by zero or more Create Relationship (by identifier) components.
For the most part, the options of this component correspond to the options of component Import RDF (DEPRECATED). However, you can add more than one target class. Some options, like Selected RDF Type or Properties can be different from class to class, while other options, like Files and Data Source apply to all classes.
A typical use case is to import resources corresponding to different RDF types into different classes.
The component can also create relationships between resources. If the RDF-data contains a triple
s1 p1 o1 .
then the relationship s1 -> o1 (forward and backward link) can be created if s1 and o1 both have a type:
s1 a t1 .
o1 a t2 .
and if both types (t1 and t2) are imported (in the same or in different classes).
Such a relationship could be called internal.
In order for this relation to be created, Predicate Type (in this case of predicate p1) has to be set to ‘URI’, and one or more Target Classes have to be filled in (in this case the only target class would be the class into which type t2 is imported).
External relationships (i.e. relationships between resources not imported by this component) still need to be created using component Create Relationship (by identifier).
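The condition for an internal relationship can be sketched as follows. The function `link_internal` and its return shape are hypothetical; the three dictionaries loosely correspond to forward links, backward links, and unresolved link targets, and triples are again plain string tuples rather than parsed RDF.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def link_internal(triples, type_to_class, link_predicate, target_classes):
    """Split link triples into resolved (internal) and unresolved targets."""
    # A resource is imported only if one of its types maps to a class.
    class_of = {
        subject: type_to_class[obj]
        for subject, predicate, obj in triples
        if predicate == RDF_TYPE and obj in type_to_class
    }
    forward, backward, unresolved = {}, {}, {}
    for subject, predicate, obj in triples:
        if predicate != link_predicate or subject not in class_of:
            continue
        if class_of.get(obj) in target_classes:
            # Both ends are imported: create the link in both directions.
            forward.setdefault(subject, []).append(obj)
            backward.setdefault(obj, []).append(subject)
        else:
            # The object has no imported type, so no link can be created.
            unresolved.setdefault(subject, []).append(obj)
    return forward, backward, unresolved
```

A link whose object has no type triple ends up in `unresolved`; turning it into a relationship would require an external step such as Create Relationship (by identifier).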
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory
(by default /disqover/data/source_data/, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All imported resources are assigned an RDF type during import. When the Class Type is not set, the resources are assigned their original RDF types. Since Selected RDF Type allows importing multiple RDF types, not all imported resources will have the same RDF type. When the Class Type is set, all resources are assigned the new RDF type. This option can be specified for each class.
You can specify for each predicate you configure in the importer whether you want the predicate to be used as a (preferred) label. Note that you can have multiple labels for a resource, but only one preferred label.
For more advanced use cases you can use the designated Add Label component.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: you may need to fine-tune the options manually after scanning.
Example¶
As an example, the following turtle file is imported:
movies/pixar_movies.ttl
@prefix pred: <http://pixar/predicates/> .
@prefix type: <http://pixar/types/> .
@prefix movie: <http://pixar/movies/> .
@prefix char: <http://pixar/movies/character/> .
movie:toystory a type:movie ;
pred:title "Toy Story" ;
pred:year "1995" .
movie:walle a type:movie ;
pred:title "Wall E" ;
pred:year "2008" .
movie:findingnemo a type:movie ;
pred:title "Finding Nemo" ;
pred:year "2003" .
char:nemo a type:character ;
pred:name "Nemo" ;
pred:debut movie:findingnemo .
char:buzzlightyear a type:character ;
pred:name "Buzz Lightyear" ;
pred:debut movie:toystory .
char:mrincredible a type:character ;
pred:name "Mr. Incredible" ;
pred:debut movie:incredibles .
It contains information about Pixar movies and about characters appearing in those movies.
Each character has a predicate pred:debut
that links it to the movie it first appeared in.
Option | Value |
---|---|
Data Source | http://movies.org |
Files | movies/pixar_movies.ttl |
File Type | turtle |
Class Options | See below |
In order to import the movies to class movies and the characters to class characters, add two Class Options:
Option | Value |
---|---|
Class | movies |
Selected RDF Type | http://pixar/types/movie |
Properties | See below |
with properties
File Property | Predicate | Predicate Type |
---|---|---|
<http://pixar/predicates/title> | mov:title | Literal |
<http://pixar/predicates/year> | mov:year | Literal |
and
Option | Value |
---|---|
Class | characters |
Selected RDF Type | http://pixar/types/character |
Properties | See below |
with properties
File Property | Predicate | Predicate Type | Target Classes |
---|---|---|---|
<http://pixar/predicates/name> | char:name | Literal | |
<http://pixar/predicates/debut> | char:debut | URI | movies |
These settings will trigger two single-class Import RDF (DEPRECATED) sub-components (one creating class movies and one creating class characters), followed by a Create Relationship (by identifier) component, establishing the debut link between the two classes.
Class movies after executing the component:
disq:uri.uri | mov:title.lit | mov:year.lit | char:debut.rev |
---|---|---|---|
[M:toystory] | [“Toy Story”] | [“1995”] | [HURI(C:buzzlightyear)] |
[M:walle] | [“Wall E”] | [“2008”] | [] |
[M:findingnemo] | [“Finding Nemo”] | [“2003”] | [HURI(C:nemo)] |
Class characters after executing the component:
disq:uri.uri | char:name.lit | char:debut.fwd | char:debut.err | char:debut.uri |
---|---|---|---|---|
[C:nemo] | [“Nemo”] | [HURI(M:findingnemo)] | [] | [M:findingnemo] |
[C:buzzlightyear] | [“Buzz Lightyear”] | [HURI(M:toystory)] | [] | [M:toystory] |
[C:mrincredible] | [“Mr. Incredible”] | [] | [M:incredibles] | [M:incredibles] |
URIs have been abbreviated:
- ‘M:’ stands for http://pixar/movies/
- ‘C:’ stands for http://pixar/movies/character/ (the char: prefix declared in the Turtle file)
Next to the predicates shown in the tables above, each class will also have a predicate disq:uri.huri with hashed URIs, and a predicate rdf:type.lit with value http://pixar/types/movie for each resource in movies, and with value http://pixar/types/character for each resource in characters.
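The result tables above can be reproduced with a small Python sketch. It is only an illustration of the idea (select resources by RDF type, then split link values into resolved and unresolved targets), not the component's actual code. The function names are hypothetical, only a subset of the triples is used, and plain URIs stand in for the hashed URIs (HURI) that the real component stores.

```python
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
P = "http://pixar/predicates/"
T = "http://pixar/types/"
M = "http://pixar/movies/"
C = "http://pixar/movies/character/"

# (subject, predicate, object) triples: a subset of movies/pixar_movies.ttl
TRIPLES = [
    (M + "toystory", RDF_TYPE, T + "movie"),
    (M + "toystory", P + "title", "Toy Story"),
    (M + "findingnemo", RDF_TYPE, T + "movie"),
    (M + "findingnemo", P + "title", "Finding Nemo"),
    (C + "nemo", RDF_TYPE, T + "character"),
    (C + "nemo", P + "debut", M + "findingnemo"),
    (C + "mrincredible", RDF_TYPE, T + "character"),
    (C + "mrincredible", P + "debut", M + "incredibles"),
]


def select_by_type(triples, rdf_type):
    """Return {subject: {predicate: [objects]}} for subjects of one RDF type."""
    subjects = {s for s, p, o in triples if p == RDF_TYPE and o == rdf_type}
    resources = {s: defaultdict(list) for s in subjects}
    for s, p, o in triples:
        if s in subjects and p != RDF_TYPE:
            resources[s][p].append(o)
    return resources


def link(sources, targets, predicate):
    """Split link values into resolved targets (fwd) and unresolved ones (err)."""
    fwd, err = {}, {}
    for subject, preds in sources.items():
        values = preds.get(predicate, [])
        fwd[subject] = [v for v in values if v in targets]
        err[subject] = [v for v in values if v not in targets]
    return fwd, err


movies = select_by_type(TRIPLES, T + "movie")
characters = select_by_type(TRIPLES, T + "character")
fwd, err = link(characters, movies, P + "debut")
```

As in the characters table, the debut of Mr. Incredible ends up in `err` because http://pixar/movies/incredibles is not a resource in movies.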
Options¶
- Data Source : URI for the data source.
- Files : Full paths to files to import. May contain wildcards (e.g. ‘*’).
- File Type : The RDF file format of the imported file(s). The possible values are: ntriples, rdfxml, rdfxml-xmp, rdfxml-abbrev, rss-1.0, atom, dot, json-triples, json, html, nquads, turtle.
- Class Options [Optional] : The options for each generated class. A list of sub-options with the following structure:
- Class : The class containing the stored data.
- RDF Type Predicate : The RDF predicate used to select type. The default value is http://www.w3.org/1999/02/22-rdf-syntax-ns#type.
- Selected RDF Type : The type or types of the instances in the source RDF data. The default value is [].
- Properties : The properties to import from the input file(s). A list of sub-options with the following structure:
- File Property : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Predicate Type : The type of object predicate. The possible values are: Literal, URI.
- Target Classes : Target Classes for this predicate (only applicable if it is an object URI). The default value is [].
- Use as label : Use the predicate as a label for the resources in the class. The default value is False.
- Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
- New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Resource Type : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
Advanced
- Remove Type Specifiers [Optional] : Remove the type and language specifiers of the objects during import. The default value is False.
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
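The New Preferred Label selection strategy option listed above can be sketched in Python as follows. This is an interpretation rather than the documented algorithm: the function name is hypothetical, and the sketch assumes that “falls back to taking the alphabetically first value” means ties are broken alphabetically.

```python
from collections import Counter


def pick_preferred_label(values, strategy="Alphabetically first"):
    """Pick one preferred label from a multi-valued literal predicate.

    Assumption: ties under "Most common" and "Shortest" are resolved by
    taking the alphabetically first candidate.
    """
    if not values:
        return None
    ordered = sorted(values)  # alphabetical order doubles as the tie-breaker
    if strategy == "Most common":
        counts = Counter(values)
        return max(ordered, key=lambda v: counts[v])
    if strategy == "Shortest":
        return min(ordered, key=len)
    return ordered[0]
```

Because `max` and `min` return the first best candidate they encounter, sorting the values beforehand gives the alphabetical fallback without extra code.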
9.4.23. Import Remote Data Set¶
Import a Remote Data Set from a Remote Data Subscription
Description¶
This component imports a remote data set from a remote data subscription (see Subscriber for how to subscribe to a remote data publisher). The data set to import is selected by the identifier of the remote data publisher and the name of the remote data set. All data is imported into a single class, specified in the option Class.
Options¶
- Remote Data Publisher Identifier [Optional] : The identifier of the Remote Data Publisher to import from.
- Remote Data Set [Optional] : The name of the Remote Data Set to import.
- Class [Optional] : The class containing the data (by default equal to Remote Data Set).
- Predicates [Optional] : The predicates to import from the Remote Data Set. If empty, all predicates will be imported. A list of sub-options with the following structure:
- Data Set Predicate : Name of the predicate in the Remote Data Set.
- Predicate : Name of the resulting imported predicate. If empty, will be equal to Data Set Predicate.
- Auxiliary : Make the imported predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Predicate Prefix [Optional] : String that is used as a prefix in the name of all imported predicates that are not explicitly renamed and that have no predicate prefix yet. E.g. if Predicate Prefix is ‘abc’ then predicate ‘p’ will be named ‘abc:p’ unless it is explicitly renamed, but predicate ‘xyz:q’ will not be changed.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
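The naming rule of the Predicate Prefix option described above can be sketched in Python. The function name and the `renames` mapping (Data Set Predicate to Predicate) are hypothetical, and the sketch assumes that a predicate “has a prefix” when its name contains a colon.

```python
def imported_predicate_name(name, prefix="", renames=None):
    """Return the name a Remote Data Set predicate gets after import."""
    renames = renames or {}
    if renames.get(name):             # explicitly renamed: the new name wins
        return renames[name]
    if prefix and ":" not in name:    # no prefix yet: add Predicate Prefix
        return f"{prefix}:{name}"
    return name                       # already prefixed, or no prefix configured
```

With Predicate Prefix ‘abc’, this maps ‘p’ to ‘abc:p’ and leaves ‘xyz:q’ unchanged, matching the example in the option description.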
9.4.24. Import Separator Block¶
Imports files chunked by a separator line.
Description¶
This component imports files which contain data separated by a Separation Line. The data between two consecutive separation lines (a data block) is added to a single resource in the specified class. All data is imported to a single predicate determined by the Chunk Predicate option. Each line in the data block is added as a single value to the Chunk Predicate.
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, which you can use for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: it may be needed to fine-tune the options manually after scanning.
Example¶
As an example, the following file is imported:
movies/disney_movies.sdf
title: "Snow White"
release_date: "December 21, 1937"
runtime: 83
-----
title: "Pinocchio"
release_date: "February 7, 1940"
runtime: 88
-----
title: "Dumbo"
release_date: "October 23, 1941"
runtime: 64
-----
title: "Bambi"
release_date: "August 13, 1942"
runtime: 70
-----
title: "Cinderella"
release_date: "February 15, 1950"
runtime: 74
This file contains five chunks of data, each of which contains information about a movie.
The chunks are separated by the line -----.
To import all data blocks to the batch_data predicate in the DisneyMovies class, use the following:
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | movies/disney_movies.sdf |
Chunk Predicate | batch_data |
Separation Line | ----- |
Class DisneyMovies after the Separator Block import component:
batch_data.lit |
---|
[‘title: “Snow White”’, ‘release_date: “December 21, 1937”’, ‘runtime: 83’] |
[‘title: “Pinocchio”’, ‘release_date: “February 7, 1940”’, ‘runtime: 88’] |
[‘title: “Dumbo”’, ‘release_date: “October 23, 1941”’, ‘runtime: 64’] |
[‘title: “Bambi”’, ‘release_date: “August 13, 1942”’, ‘runtime: 70’] |
[‘title: “Cinderella”’, ‘release_date: “February 15, 1950”’, ‘runtime: 74’] |
All data is imported to a single predicate: batch_data.lit.
Note that all values are literals; therefore this component is usually followed by a Transform Literals component.
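The chunking behaviour of this example can be sketched in a few lines of Python. This is an illustration under assumptions, not the component's code: the function name is hypothetical, and the sketch assumes that empty blocks (for example a separator at the very end of the file) do not produce a resource.

```python
def split_into_chunks(lines, separation_line="-----", trim=True):
    """Group lines into data blocks delimited by `separation_line`.

    Each block becomes one resource; each line in the block becomes one
    value of the Chunk Predicate.
    """
    chunks, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if trim:
            line = line.strip()       # Trim leading/trailing whitespace
        if line == separation_line:
            if current:               # assumption: skip empty blocks
                chunks.append(current)
            current = []
        else:
            current.append(line)
    if current:
        chunks.append(current)
    return chunks


SAMPLE = '''title: "Snow White"
release_date: "December 21, 1937"
runtime: 83
-----
title: "Pinocchio"
release_date: "February 7, 1940"
runtime: 88
'''

chunks = split_into_chunks(SAMPLE.splitlines())
```

Each inner list corresponds to one row of the batch_data.lit column shown in the table above.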
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- Encoding [Optional] : Explicitly overrule the character encoding of the imported files.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Separation Line : The line format that separates chunks.
- Chunk Predicate [Optional] : Predicate to write chunks to. The default value is main.
- Trim leading/trailing whitespace [Optional] : Remove leading and trailing whitespace from each line. The default value is True.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
9.4.25. Import XML¶
Imports XML files.
Description¶
This component imports resources from XML files.
The list of resources to be imported from the file is determined by setting the Instance X Path.
All data is imported into a single class, specified in the option Class Name.
Each import component must be assigned to an existing Data Source, i.e. the component must be preceded by a Define Datasource component defining that data source. All resources imported by the component are assigned that Data Source.
The option Files is used to specify which files to import.
This is a list of file-paths. It is possible to use wildcards (like ‘*’).
As a security measure, all file-paths must resolve to locations inside the Source Data directory (by default /disqover/data/source_data/, but configurable by an administrator).
The use of absolute paths is discouraged and will cause a warning.
Relative paths are relative to the Source Data directory.
All the imported resources can be assigned a Resource Type during import by filling in the option Resource Type, but this is not required. See Configure Canonical Type for more information about Resource Types.
You can specify for each predicate you configure in the importer component whether you want the predicate to be used as a (preferred) URI or (preferred) label. You can mark multiple predicates to be used as URI and/or label, but there can be only one predicate used as preferred URI and one predicate used as preferred label. Note that this behaviour differs from the behaviour of the designated Add URI and Add Label components, which you can use for more advanced use cases.
At the top of the component view, next to the button Save Changes, there is a button which will open a file scanner. The file scanner can inspect files in order to assist the user with filling in the options, e.g. it can suggest predicates. Keep in mind that this is only a best guess: it may be needed to fine-tune the options manually after scanning.
Example¶
As an example, the following file is imported:
movies/disney_movies.xml
<?xml version="1.0" encoding="UTF-8"?>
<root>
<row>
<title>Snow White</title>
<release_date>December 21, 1937</release_date>
<runtime>83</runtime>
</row>
<row>
<title>Pinocchio</title>
<release_date>February 7, 1940</release_date>
<runtime>88</runtime>
</row>
<row>
<title>Dumbo</title>
<release_date>October 23, 1941</release_date>
<runtime>64</runtime>
</row>
<row>
<title>Bambi</title>
<release_date>August 13, 1942</release_date>
<runtime>70</runtime>
</row>
<row>
<title>Cinderella</title>
<release_date>February 15, 1950</release_date>
<runtime>74</runtime>
</row>
</root>
The file contains a list of entries, each representing a movie, and each movie entry has three properties to import.
Option | Value |
---|---|
Class Name | DisneyMovies |
Data Source | http://disney.org/movies/ |
Files | movies/disney_movies.xml |
Instance X Path | ./root/row |
X Paths | See below |
Each instance entry in the XML file has three nodes with text. To import each of these nodes to a predicate, use the following for the X Paths option:
File X Path | XPath Type | Predicate |
---|---|---|
./title | Text | movie:title |
./release_date | Text | movie:release_date |
./runtime | Text | movie:runtime |
Class DisneyMovies after the XML import component:
movie:title.lit | movie:release_date.lit | movie:runtime.lit |
---|---|---|
[“Snow White”] | [“December 21, 1937”] | [“83”] |
[“Pinocchio”] | [“February 7, 1940”] | [“88”] |
[“Dumbo”] | [“October 23, 1941”] | [“64”] |
[“Bambi”] | [“August 13, 1942”] | [“70”] |
[“Cinderella”] | [“February 15, 1950”] | [“74”] |
Each movie resource has three predicates which all have a single value.
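The relationship between the instance path and the per-property X Paths can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration: the function name `import_xml` is hypothetical, and ElementTree supports a subset of XPath and evaluates paths relative to the root element, so the instance path ./root/row from the table becomes ./row in this sketch.

```python
import xml.etree.ElementTree as ET

XML = """<root>
  <row><title>Snow White</title><release_date>December 21, 1937</release_date><runtime>83</runtime></row>
  <row><title>Dumbo</title><release_date>October 23, 1941</release_date><runtime>64</runtime></row>
</root>"""

# File X Path -> Predicate, as in the X Paths table above
X_PATHS = {
    "./title": "movie:title",
    "./release_date": "movie:release_date",
    "./runtime": "movie:runtime",
}


def import_xml(xml_text, instance_xpath, x_paths):
    """Return one dict per instance node, mapping predicates to lists of text values."""
    root = ET.fromstring(xml_text)
    resources = []
    for instance in root.findall(instance_xpath):
        resource = {}
        for xpath, predicate in x_paths.items():
            # X Paths are evaluated relative to the instance node
            resource[predicate] = [n.text for n in instance.findall(xpath) if n.text]
        resources.append(resource)
    return resources


movies = import_xml(XML, "./row", X_PATHS)
```

Every value is kept as a string inside a list, mirroring the .lit columns of the result table.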
Note that if your XML contains explicit prefixes:
<?xml version="1.0" encoding="UTF-8"?>
<dis:root xmlns:dis="http://disney.org/movies">
<dis:row>
<dis:title>Snow White</dis:title>
</dis:row>
</dis:root>
Then these prefixes should be included in the imported XPaths, as well as being defined in the Prefix Dictionary:
File X Path:
./dis:row/dis:title
Prefix Dictionary:
dis: 'http://disney.org/movies'
If your XML lacks explicit prefixes, but does contain a defined prefix dictionary at the beginning of the file:
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://disney.org/movies">
<row>
<title>Snow White</title>
</row>
</root>
Then these prefixes should be defined in the Prefix Dictionary and manually added to your XPaths:
File X Path:
./anything:row/anything:title
Prefix Dictionary:
anything: 'http://disney.org/movies'
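Both namespace cases can be tried out with Python's standard `xml.etree.ElementTree`, which accepts a prefix-to-URI mapping much like the Prefix Dictionary. The sketch below is an assumption-laden illustration, not DISQOVER code: it assumes the namespace is declared with the standard xmlns attributes, and the helper name `titles` is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = "http://disney.org/movies"

# Case 1: elements carry an explicit prefix.
PREFIXED = f"""<dis:root xmlns:dis="{NS}">
  <dis:row><dis:title>Snow White</dis:title></dis:row>
</dis:root>"""

# Case 2: no prefixes in the element names, but a default namespace is declared.
DEFAULT_NS = f"""<root xmlns="{NS}">
  <row><title>Snow White</title></row>
</root>"""


def titles(xml_text, xpath, prefixes):
    """Evaluate `xpath` from the root element using a prefix dictionary."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(xpath, prefixes)]


with_prefix = titles(PREFIXED, "./dis:row/dis:title", {"dis": NS})
with_default = titles(DEFAULT_NS, "./anything:row/anything:title", {"anything": NS})
without_prefix = titles(DEFAULT_NS, "./row/title", {})
```

The last call returns an empty list: elements in a default namespace are not matched by unprefixed paths, which is why a prefix (any name will do) has to be added to the XPaths by hand.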
Options¶
- Class : The name of the new class which will contain the imported data.
- Data Source : The URI of the data source.
- Files : The relative path(s) of the files to be imported. The path is expressed from the source_data repository of DISQOVER and may contain wildcards (e.g. ‘*’).
- X Paths : The list of properties to import from the input file(s). This requires the XML path and the name of the predicate in which the information will be stored. The XML paths are relative to the instance path. A list of sub-options with the following structure:
- File X Path : Field in the file.
- Predicate : The destination predicate for the imported field.
- Auxiliary : Make the generated predicate auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- XPath Type : Type of xpath. The possible values are: Text, Node, Attribute.
- Use as URI : Use the predicate as a URI for the resources in the class. The default value is False.
- Use as preferred URI : If turned on, the created URI will be set as the preferred URI. The default value is False.
- Prefix : The prefix to be used for the URI.
- New Preferred URI selection strategy : Determines which value to pick as preferred URI when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The URI could not be added because the literal predicate is empty.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The URI could not be added because the literal predicate is empty.”. The default value is 1.
- Minimal count for warning “The predicate ‘…’ seems to contain irregular URIs” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The predicate ‘…’ seems to contain irregular URIs”. It finds suspicious URIs if the encoding is not empty and the prefix is empty. It checks every 100 records and stops checking if 10 warnings are found. The default value is 1.
- Use as label : Use the predicate as a label for the resources in the class. The default value is False.
- Use as preferred label : If turned on, the created label will be set as the preferred label. The default value is False.
- New Preferred Label selection strategy : Determines which value to pick as preferred label when the literal predicate has multiple values. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Minimal count for warning “The label could not be added to one or more resources because the literal predicate is empty for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the literal predicate is empty for those resources.”. The default value is 1.
- Minimal count for warning “The label could not be added to one or more resources because the predicate contains an empty string for those resources.” : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The label could not be added to one or more resources because the predicate contains an empty string for those resources.”. The default value is 1.
- Resource Type [Optional] : The Resource Type which will be assigned to all imported resources. In a later Configure Canonical Type component, these Resource Types can be used to define a Canonical Type.
- Resource XPath : The XPath to the resources.
- Prefix Dictionary [Optional] : A list of prefixes used in the XML file.
Advanced
- Filename Predicate [Optional] : Predicate in which the filename will be stored.
- Data Source per Instance [Optional] : All instances imported will get an extra data source which can differ per instance and should be provided at some point in the disq:data_source predicate (not necessarily during import). The default value is False.
- Empty Value Encoding [Optional] : A list of values that represent an empty or missing value and are imported as no value.
- Allow Huge Tree [Optional] : Allow importing files with very deep trees. Make sure the files are from trusted sources: setting this option will disable protection against certain malicious XML content. The default value is False.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the import. Some data might not have been imported.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the import. Some data might not have been imported.”. The default value is 1.
- Minimal count for warning “All values of an imported predicate are empty.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “All values of an imported predicate are empty.”. The default value is 1.
- Minimal count for warning “One or more files do not contain any resources as defined by the Resource XPath.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “One or more files do not contain any resources as defined by the Resource XPath.”. The default value is 1.
- Minimal count for warning “The XPath expression could not be evaluated.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The XPath expression could not be evaluated.”. The default value is 1.
9.4.26. Infer by Relationship (DEPRECATED)¶
Infers a new predicate using an existing relationship.
See Infer by Relationship (multiple predicates)
Options¶
Target Class
- Target Class : The class containing the relationship.
- Relationship Predicate (existing) : Link predicate (either fwd or rev) linking Target Class to Relationship Class.
- Resulting Predicate : The resulting predicate. Its type must match that of Aimed Predicate (existing).
- Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.
Relationship Class
- Relationship Class : Relationship class.
- Aimed Predicate (existing) : Predicate to infer, either a literal or a link (fwd or rev) to Aimed Relationship Class (optional).
- Relationship Class Filter [Optional] : Boolean expression returning true for resources which should be included.
Aimed Relationship Class (optional)
- Aimed Relationship Class (optional) [Optional] : Aimed relationship class (only relevant if Aimed Predicate (existing) is a link).
Quality Control
- Fraction of resources with at least one match [Optional] : The fraction of filtered resources in Target Class which matched at least one resource in Relationship Class. (higher is better)
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.27. Infer by Relationship (multiple predicates)¶
Infers new predicates using an existing relationship.
Description¶
This component copies information from resources in one class (Relationship Class) to resources in another class (Target Class) if there is a relationship between the resources.
From the statements “Mickey Mouse is a mouse” and “all mice are mortal” we can infer that “Mickey Mouse is mortal”.
In Data Ingestion Engine “is a mouse” would correspond to a relationship from class DisneyCharacters to class Animals, while “is mortal” would be a literal predicate (“Yes” or “No”) in class Animals.
The conclusion (inference) that Mickey Mouse is mortal is implicitly present in this data.
This component can be used to make the inference “explicit”, i.e. to transfer the literal predicate “is mortal” to class DisneyCharacters.
The component works like this:
- It looks at the relationship from Target Class to Relationship Class, specified in Relationship Predicate.
- If a relationship exists between resource T in the Target Class and resource M in the Relationship Class, then all values of Aimed Predicate of M are added to the (new) predicate Resulting Predicate of T.
Note that we wrote “relationship from … to …”, whereas relationships are bidirectional in general: forward and reverse hashed URIs are stored in different predicates in both involved classes.
This component uses one of these relationship predicates, either forward or reverse.
In case of ambiguity, the user can specify the direction in Relationship Predicate by adding extension .fwd or .rev.
Note that it is also possible to infer information within a single class, if a class has a relationship with itself, e.g. a parent-child relationship.
Two cases can be distinguished:
- infer a literal
- infer a relationship
The second case is a bit more complicated. Let’s look at inferring literals first.
1. Inferring a literal¶
This is the simplest case.
Aimed Predicate should be a literal predicate (.lit), and Aimed Relationship Class should be left empty.
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘A:’ stands for “http://animals.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Target Class DisneyCharacters before applying the component:
disq:uri.uri | animal.fwd |
---|---|
[D:mickey_mouse] | [HURI(A:mouse)] |
[D:hades] | [HURI(A:god)] |
Relationship Class Animals before applying the component:
disq:uri.uri | is_mortal.lit |
---|---|
[A:mouse] | [“Yes”] |
[A:god] | [“No”] |
Target Class DisneyCharacters after applying the component:
disq:uri.uri | animal.fwd | mortal.lit |
---|---|---|
[D:mickey_mouse] | [HURI(A:mouse)] | [“Yes”] |
[D:hades] | [HURI(A:god)] | [“No”] |
Relationship Class Animals is unchanged.
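The literal case can be sketched in Python with plain dictionaries. This is a simplified illustration, not the component's implementation: the function name is hypothetical, classes are modelled as `{uri: {predicate: [values]}}`, and plain URIs are used where the real data holds hashed URIs (HURI).

```python
def infer_literal(target, relationship, link_predicate, aimed_predicate, resulting_predicate):
    """Copy the values of `aimed_predicate` across an existing link.

    For every resource T in `target`, follow `link_predicate` to resources M
    in `relationship` and collect M's values of `aimed_predicate` into T's
    `resulting_predicate`.
    """
    for resource in target.values():
        inferred = []
        for linked_uri in resource.get(link_predicate, []):
            linked = relationship.get(linked_uri, {})
            inferred.extend(linked.get(aimed_predicate, []))
        resource[resulting_predicate] = inferred
    return target


disney_characters = {
    "D:mickey_mouse": {"animal.fwd": ["A:mouse"]},
    "D:hades": {"animal.fwd": ["A:god"]},
}
animals = {
    "A:mouse": {"is_mortal.lit": ["Yes"]},
    "A:god": {"is_mortal.lit": ["No"]},
}

infer_literal(disney_characters, animals, "animal.fwd", "is_mortal.lit", "mortal.lit")
```

After the call, `disney_characters` holds the new mortal.lit values shown in the table above, and `animals` is left untouched.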
2. Inferring a relationship¶
This case is a bit more complicated, as it involves 3 classes and because we want to preserve the bidirectionality of the relationships.
Option Aimed Predicate is now a relationship predicate, either forward (.fwd) or reverse (.rev), which defines a relationship from Relationship Class to Aimed Relationship Class.
Option Aimed Relationship Class should be filled in.
The mechanism is the same as above, except that
- instead of copying values from a literal predicate, we now copy (HURI) values from the relationship predicate Aimed Predicate to Resulting Predicate in Target Class, thus creating a relationship from Target Class to Aimed Relationship Class
- back links, from Aimed Relationship Class to Target Class, are stored in a predicate Resulting Predicate, but with opposite extension: if the copied Aimed Predicate is forward, then the back link is reverse, and vice versa.
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘A:’ stands for “http://animals.org/”
- ‘G:’ stands for “http://genus.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | animal.fwd |
---|---|
[D:mickey_mouse] | [HURI(A:mouse)] |
[D:hades] | [HURI(A:god)] |
Relationship Class Animals
before applying the component:
disq:uri.uri | genus.fwd |
---|---|
[A:mouse] | [HURI(G:Mus)] |
[A:god] | [HURI(G:Deus)] |
Aimed Relationship Class Genus
before applying the component:
disq:uri.uri |
---|
[G:Homo] |
[G:Deus] |
[G:Mus] |
Target Class DisneyCharacters
after applying the component:
disq:uri.uri | animal.fwd | gen.fwd |
---|---|---|
[D:mickey_mouse] | [HURI(A:mouse)] | [HURI(G:Mus)] |
[D:hades] | [HURI(A:god)] | [HURI(G:Deus)] |
Relationship Class Animals
is unchanged.
Aimed Relationship Class Genus
after applying the component:
disq:uri.uri | gen.rev |
---|---|
[G:Homo] | [] |
[G:Deus] | [HURI(D:hades)] |
[G:Mus] | [HURI(D:mickey_mouse)] |
Notes:
- For the Resulting Predicate we used the name gen, to avoid confusion with predicate genus, but in principle we could have used genus.
- Because the Aimed Predicate is a forward link (genus.fwd), the Resulting Predicate is also forward (.fwd is added automatically to gen), and the back link is reverse (gen.rev).
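The two-step mechanism (copying link values and storing back links with the opposite extension) can be sketched in Python. As before, this is a simplified model and not the engine's implementation: classes are dictionaries keyed by URI, plain URIs stand in for HURIs, and `infer_relationship` is a hypothetical name.

```python
def infer_relationship(target, relationship, aimed, link_pred, aimed_pred, resulting_name):
    """Infer links from target to the aimed class and store the back links."""
    direction = aimed_pred.rsplit(".", 1)[1]            # "fwd" or "rev"
    opposite = "rev" if direction == "fwd" else "fwd"
    result_pred = f"{resulting_name}.{direction}"       # e.g. gen.fwd
    back_pred = f"{resulting_name}.{opposite}"          # e.g. gen.rev

    # Every aimed resource gets the back-link predicate, possibly empty.
    for resource in aimed.values():
        resource.setdefault(back_pred, [])

    for target_uri, resource in target.items():
        resource[result_pred] = []
        for linked_uri in resource.get(link_pred, []):
            for aimed_uri in relationship.get(linked_uri, {}).get(aimed_pred, []):
                resource[result_pred].append(aimed_uri)
                if aimed_uri in aimed:
                    aimed[aimed_uri][back_pred].append(target_uri)


disney = {
    "D:mickey_mouse": {"animal.fwd": ["A:mouse"]},
    "D:hades": {"animal.fwd": ["A:god"]},
}
animals = {
    "A:mouse": {"genus.fwd": ["G:Mus"]},
    "A:god": {"genus.fwd": ["G:Deus"]},
}
genus = {"G:Homo": {}, "G:Deus": {}, "G:Mus": {}}
infer_relationship(disney, animals, genus, "animal.fwd", "genus.fwd", "gen")
```

The sketch derives both extensions from the extension of the Aimed Predicate, which mirrors the rule that the back link always gets the opposite direction.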
Options¶
Target Class
- Target Class : The class containing the relationship.
- Relationship Predicate (existing) : Link predicate (either fwd or rev) linking Target Class to Relationship Class.
- Target Class Filter [Optional] : Boolean expression returning true for resources which should be included.
Relationship Class
- Relationship Class : The class which contains the (literal or link) predicate to be inferred to the target class. It should also be linked to the target class.
- Relationship Class Filter [Optional] : Boolean expression returning true for resources which should be included.
Predicates
- Predicates [Optional] : The predicates to infer. A list of sub-options with the following structure:
- Aimed Predicate (existing) : Predicate to infer, either a literal or a link (fwd or rev) to Aimed Relationship Class.
- Resulting Predicate : The resulting predicate. Its type must match that of Aimed Predicate.
- Aimed Relationship Class (optional) : Aimed relationship class (only relevant if Aimed Predicate is a link).
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Quality Control
- Fraction of resources with at least one match [Optional] : The fraction of filtered resources in Target Class which matched at least one resource in Relationship Class. (higher is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.28. Map Classes (by label)¶
Add a URI to the target class in case a literal predicate matches another literal predicate in the matching class.
Description¶
This component copies the Preferred URI (disq:uri.puri) from resources in class Matching Class to resources in class Target Class if they contain matching literals.
This component is normally only used in preparation of a Merge Classes component, which can take over the rest of the predicates.
More in detail:
- Values from predicate Matching Literal in Matching Class are compared to values from predicate Matching Predicate in Target Class.
- If, for some resource M in Matching Class and some resource T in Target Class, one or more literal values of M are equal to one or more literal values from T, then the Preferred URI (disq:uri.puri) of resource M is added to the subject URIs (disq:uri.uri) of resource T.
By default Matching Literal and Matching Predicate are equal to disq:label.lit (or disq:label for short), because comparing by label is a common operation.
The way literals are compared to each other can be tailored via two options:
- Case Sensitive determines whether uppercase/lowercase differences matter. For example, if False, “dog” is considered to be equal to “Dog”.
- Remove Dashes and Spaces determines whether differences due to dashes ('-') or spaces (' ') matter. For example, if True, “my-dog” is considered to be equal to “my dog” and to “mydog”.
In former versions of the Data Ingestion Engine only the one-to-one case was supported. Now the many-to-many case is supported as well: if a resource in Matching Class matches multiple resources in Target Class, then, by default, its Preferred URI is added to each of those target resources. Conversely, if multiple resources in Matching Class match a resource in Target Class, then, by default, all their Preferred URIs are added to the target resource. This default behavior can be changed back to the old behavior by setting option Mapping Strategy to (Deprecated) Single to Pick One. If there are no many-to-many matches, this option has no effect.
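The comparison rules and the default many-to-many behavior can be sketched in Python. This is a minimal illustration and not the engine's code: the names `normalize` and `map_by_label` are hypothetical, and each class is reduced to the literals (and Preferred URIs) that take part in the matching.

```python
from collections import defaultdict


def normalize(literal, case_sensitive=False, remove_dashes_and_spaces=False):
    """Bring a literal into the form used for comparison."""
    if remove_dashes_and_spaces:
        literal = literal.replace("-", "").replace(" ", "")
    if not case_sensitive:
        literal = literal.lower()
    return literal


def map_by_label(target_literals, matching, **options):
    """Return, per target resource, the Preferred URIs of all matching resources.

    target_literals: one list of literals per target resource.
    matching: list of (preferred_uri, literals) pairs for the matching class.
    """
    index = defaultdict(set)
    for preferred_uri, literals in matching:
        for literal in literals:
            index[normalize(literal, **options)].add(preferred_uri)

    result = []
    for literals in target_literals:
        uris = set()
        for literal in literals:
            uris |= index.get(normalize(literal, **options), set())
        result.append(sorted(uris))
    return result


heroes = [["mickey-mouse"], ["john-snow"]]
disney = [("D:mickey_mouse", ["Mickey Mouse"]), ("D:pluto", ["Pluto"])]
mapped = map_by_label(heroes, disney, case_sensitive=False, remove_dashes_and_spaces=True)
```

In this sketch `mapped` holds the URI of Mickey Mouse for the first hero and nothing for the second. With `case_sensitive=True` the same call finds no match at all, because “mickeymouse” and “MickeyMouse” differ in case.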
Example¶
Option | Value |
---|---|
Target Class | MyHeroes |
Matching Predicate | name |
Matching Class | DisneyCharacters |
Matching Literal | DEFAULT (disq:label) |
Case Sensitive | False |
Remove Dashes and Spaces | True |
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
Target Class MyHeroes
before applying the component:
name.lit |
---|
[“mickey-mouse”] |
[“john-snow”] |
Matching Class DisneyCharacters
before applying the component:
disq:uri.puri | disq:uri.phuri | disq:label.lit |
---|---|---|
[D:mickey_mouse] | [HURI(D:mickey_mouse)] | [“Mickey Mouse”] |
[D:pluto] | [HURI(D:pluto)] | [“Pluto”] |
Target Class MyHeroes
after applying the component:
name.lit | disq:uri.uri | disq:uri.huri |
---|---|---|
[“mickey-mouse”] | [D:mickey_mouse] | [HURI(D:mickey_mouse)] |
[“john-snow”] | [] | [] |
The Matching Class is unchanged.
Options¶
Target Class
- Target Class : The class to be matched. If the predicate from the Target Class matches the matching predicate of the Matching Class, the URI of that resource in the Matching Class will be added to the Target Class.
- Matching Predicate [Optional] : The predicate in the Target Class to be used for matching. The default value is the label (disq:label.lit).
- Target Class Filter [Optional] : A boolean expression returning True for resources in the Target Class to which the action should be applied.
Matching Class
- Matching Class : The class to be screened during the mapping.
- Matching Predicate [Optional] : The predicate in the Matching Class to be used. The default value is the label (disq:label.lit).
- Matching Class Filter [Optional] : A boolean expression returning True for resources in the Matching Class to which the action should be applied.
Matching
- Mapping Strategy [Optional] : The mapping strategy which will be used in case of multiple hits. The possible values are: Multiple to Multiple, (Deprecated) Single to Pick One.
- Case Sensitive [Optional] : Match literals in a case sensitive way. The default value is False.
- Remove Dashes and Spaces [Optional] : Remove dashes and spaces when matching literals. The default value is False.
Advanced
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Quality Control
- Fraction of matched destination resources [Optional] : The fraction of resources in Target Class that matched successfully. (higher is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “Matching literal in the Matching Class appears in multiple resources with different Preferred URI (further warnings about the same matching literal are suppressed).” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Matching literal in the Matching Class appears in multiple resources with different Preferred URI (further warnings about the same matching literal are suppressed).”. The default value is 1.
- Minimal count for warning “Resource from Target Class matches literals associated with multiple Preferred URIs in Matching Class. (this literal will be excluded from matching)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Resource from Target Class matches literals associated with multiple Preferred URIs in Matching Class. (this literal will be excluded from matching)”. The default value is 1.
- Minimal count for warning “Multiple resources from Target Class match literals associated with the same URI in Matching Class. (only one of these resources will receive new URIs)” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Multiple resources from Target Class match literals associated with the same URI in Matching Class. (only one of these resources will receive new URIs)”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.29. Merge Classes¶
Moves resources from source class to target class.
Description¶
This component transfers resources from Source Class to Target Class.
If a source resource has a URI that is already present in a resource of Target Class (a match), then the predicate values of the source resource are added to the target resource predicates. Otherwise the source resource is copied to a new target resource. Predicates in Source Class that don’t exist in Target Class are added.
Transferred resources are deactivated. If no Source filter was used, this component will leave Source Class completely “empty”.
Note that, as always in the Data Ingestion Engine, the order of values in a predicate is arbitrary, so merged values are not simply added after the original values.
If a resource is matched, it is possible that its Preferred URI in Source Class is different from its Preferred URI in Target Class, and likewise for Preferred Label. By default, the preferredness of Target Class takes precedence. This can be changed via option Source Class takes precedence.
For a discussion about preferredness, see Add URI.
If multiple source resources match one target resource, then the predicate values of all these source resources are added to the target resource. This is a many-to-one merge. An alternative for merging many-to-one, but within a single class, is component Merge within Class.
Conversely, if a source resource matches multiple target resources, then the predicate values of the source resource are added to one of the target resources (an arbitrary choice). This is a one-to-many merge, which is best avoided.
For a discussion about duplicate URIs, see below.
Because this component matches resources based on their URIs, the involved classes are often prepared to yield similar URIs. This can be achieved with components such as Add URI or match_subject_literal_reference_tag.
Example¶
Option | Value |
---|---|
Source Class | PixarCharacters |
Target Class | DisneyCharacters |
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
- ‘P:’ stands for “http://pixar.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Source Class PixarCharacters
before applying the component:
disq:uri.uri | disq:label.lit | movie.lit |
---|---|---|
[P:remy] | [“Remy”] | [“Ratatouille”] |
[P:dory, D:dory] | [“Dory”] | [“Finding Nemo”] |
[P:wall_e, D:wall_e] | [“Wall-E”] | [“Wall-E”] |
[P:nemo, D:nemo] | [] | [“Finding Nemo”] |
[P:nemo, D:nemo] | [“nemo”] | [“Finding Nemo”] |
Target Class DisneyCharacters
before applying the component:
disq:uri.uri | disq:label.lit | year.lit |
---|---|---|
[D:mickey_mouse] | [“Mickey Mouse”] | [“1928”] |
[D:wall_e] | [“Wall E”] | [“2008”] |
[D:nemo] | [] | [“2003”] |
Target Class DisneyCharacters
after applying the component:
disq:uri.uri | disq:label.lit | year.lit | movie.lit |
---|---|---|---|
[D:mickey_mouse] | [“Mickey Mouse”] | [“1928”] | [] |
[D:wall_e, P:wall_e] | [“Wall E”, “Wall-E”] | [“2008”] | [“Wall-E”] |
[D:nemo, P:nemo] | [“nemo”] | [“2003”] | [“Finding Nemo”] |
[P:remy] | [“Remy”] | [] | [“Ratatouille”] |
[P:dory, D:dory] | [“Dory”] | [] | [“Finding Nemo”] |
Observe:
- A new predicate movie.lit is created in Target Class DisneyCharacters.
- Mickey Mouse didn’t have a counterpart in PixarCharacters, so its existing predicates are untouched. The new predicate movie.lit is left empty.
- Remy only has a URI starting with P:, so it cannot match a URI in DisneyCharacters. As a result a new resource is created in DisneyCharacters and all predicates are copied. Predicate year.lit only exists in DisneyCharacters, so it is left empty.
- Dory does have a URI starting with D: (a potential match), but it doesn’t match with any URI, so everything is copied to a new resource, like Remy. Note that the preferred URI is carried over.
- Wall-E in PixarCharacters has a URI which matches with a resource in DisneyCharacters, so its values for disq:uri and disq:label.lit are added there, and its value for movie.lit is copied. Note that the Preferred URI and Label are not overridden!
- Both Nemos in PixarCharacters are merged into the same existing resource. This shows that this component can help in dealing with “Duplicate URI” problems. Note that the original label was empty, so the copied label is taken to be the Preferred Label.
Source Class PixarCharacters
after applying the component (all resources are deactivated):
active | disq:uri.uri | disq:label.lit | movie.lit |
---|---|---|---|
NO | [P:remy] | [“Remy”] | [“Ratatouille”] |
NO | [P:dory, D:dory] | [“Dory”] | [“Finding Nemo”] |
NO | [P:wall_e, D:wall_e] | [“Wall-E”] | [“Wall-E”] |
NO | [P:nemo, D:nemo] | [] | [“Finding Nemo”] |
NO | [P:nemo, D:nemo] | [“nemo”] | [“Finding Nemo”] |
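The matching rule behind this example can be sketched in Python. The sketch makes several simplifying assumptions: resources are dictionaries of predicate to value list, preferredness is ignored, a source resource that matches several target resources is merged into the first one found, and `merge_classes` is a hypothetical name. Unlike the engine, the sketch keeps values in insertion order.

```python
def merge_classes(source, target):
    """Merge active source resources into target, matching on shared URIs."""
    for src in source:
        if not src.get("active", True):
            continue
        src_uris = set(src["disq:uri.uri"])
        # A match is any target resource sharing at least one URI.
        match = next(
            (t for t in target if src_uris & set(t.get("disq:uri.uri", []))),
            None,
        )
        if match is None:
            match = {}              # no match: create a new target resource
            target.append(match)
        for pred, values in src.items():
            if pred == "active":
                continue
            merged = match.setdefault(pred, [])
            for value in values:
                if value not in merged:
                    merged.append(value)
        src["active"] = False       # transferred resources are deactivated
    return target


pixar = [
    {"disq:uri.uri": ["P:remy"], "disq:label.lit": ["Remy"], "movie.lit": ["Ratatouille"]},
    {"disq:uri.uri": ["P:wall_e", "D:wall_e"], "disq:label.lit": ["Wall-E"], "movie.lit": ["Wall-E"]},
    {"disq:uri.uri": ["P:nemo", "D:nemo"], "disq:label.lit": [], "movie.lit": ["Finding Nemo"]},
    {"disq:uri.uri": ["P:nemo", "D:nemo"], "disq:label.lit": ["nemo"], "movie.lit": ["Finding Nemo"]},
]
disney = [
    {"disq:uri.uri": ["D:mickey_mouse"], "disq:label.lit": ["Mickey Mouse"], "year.lit": ["1928"]},
    {"disq:uri.uri": ["D:wall_e"], "disq:label.lit": ["Wall E"], "year.lit": ["2008"]},
    {"disq:uri.uri": ["D:nemo"], "disq:label.lit": [], "year.lit": ["2003"]},
]
merge_classes(pixar, disney)
```

After the call, `disney` contains four resources: Remy was appended as a new resource, while Wall-E and both Nemos were folded into the existing resources that share a URI with them.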
Dealing with Duplicate URIs¶
If a resource in the Source Class has multiple URIs which match with different resources in the Target Class, then the merge operation can introduce duplicate URIs.
For example: if Target Class contains a resource with URI P:young_nemo and another resource with URI P:old_nemo (so it considers Young Nemo and Old Nemo to be different resources), but Source Class contains a single resource with URIs [P:nemo, P:young_nemo, P:old_nemo] (so it considers Young and Old Nemo to be equivalent), then after merging both target resources will get all three URIs.
This situation (introduced duplicate URIs) is detected during execution of this component, and can be remedied, see further.
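What such a detection amounts to can be illustrated with a small Python sketch. It is not the engine's implementation: `duplicate_uris` is a hypothetical helper that simply reports every URI occurring in more than one resource of a class, here applied to the state described above.

```python
from collections import Counter


def duplicate_uris(resources):
    """Return the URIs that occur in more than one resource."""
    counts = Counter(
        uri for resource in resources for uri in set(resource["disq:uri.uri"])
    )
    return sorted(uri for uri, count in counts.items() if count > 1)


target_after_merge = [
    {"disq:uri.uri": ["P:young_nemo", "P:nemo", "P:old_nemo"]},
    {"disq:uri.uri": ["P:old_nemo", "P:nemo", "P:young_nemo"]},
    {"disq:uri.uri": ["P:dory"]},
]
duplicates = duplicate_uris(target_after_merge)
```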
Note that the situation is more complicated if the Source Class or the Target Class already have duplicate URIs (different resources with the same URI) before merging:
- Source Class resources with duplicate URIs which match with a Target Class resource will all be merged with that resource.
- Source Class resources with duplicate URIs which don’t match with any Target Class resource are moved one by one to the Target Class, without merging.
- If the Target Class has duplicate URIs, then these are, in principle, not merged by this component.
Duplicate URIs create problems: the corresponding resources will not be published in Publish in DISQOVER (unless Publish malformed instances is switched on). This is typically remedied by adding an extra Merge Within Class (operating on the Target Class) after merging Source Class to Target Class. To achieve this, you can either explicitly add a Merge within Class component in the pipeline after this component, or use option Add Merge within Class in this component to automatically do this extra merge step.
The option Add Merge within Class takes the following values:
- Merge if needed: if duplicate URIs are introduced, automatically execute an extra Merge Within Class after execution of this component.
- Warning: if duplicate URIs are introduced, issue a warning.
- Suppress warning: if duplicate URIs are introduced, don’t issue a warning.
- Always Merge: automatically execute an extra Merge Within Class after execution of this component, regardless of whether duplicate URIs were introduced.
Take into account that this extra merge step takes extra time to execute, and that it can introduce arbitrary value order in predicates (see Merge within Class).
Note that if neither the Source Class nor the Target Class contain duplicate URIs before merging, then option value Merge if needed guarantees that the Target Class will not have duplicate URIs after execution. Option value Always Merge has the same effect, but may do unnecessary work. If the Source Class or the Target Class already contain duplicate URIs, only the option value Always Merge guarantees that the Target Class will not have duplicate URIs after execution.
Performance considerations¶
This component copies resources from the Source Class to the Target Class. Therefore the performance is best when the Source Class is smaller (i.e. contains fewer resources) than the Target Class. If applicable, you can use the option Source Class takes precedence to have a resource retain the Preferred URI and Preferred Label of the Source Class if they are present.
Options¶
- Preferred URI selection strategy [Optional] : Which value to pick as preferred URI when merging resources with a different preferred URI. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Preferred label selection strategy [Optional] : Which value to pick as preferred label when merging resources with a different preferred label. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
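One possible reading of these selection strategies is sketched below in Python. The function `pick_preferred` is hypothetical, and the sketch assumes that “falls back to taking the alphabetically first value” means ties are broken alphabetically. The engine's exact tie-breaking rules are not specified here.

```python
from collections import Counter


def pick_preferred(candidates, strategy="Alphabetically first"):
    """Pick one preferred value from the candidates of the merged resources."""
    if not candidates:
        return None
    if strategy == "Most common":
        counts = Counter(candidates)
        # Highest count first; ties resolved alphabetically (assumption).
        return min(candidates, key=lambda value: (-counts[value], value))
    if strategy == "Shortest":
        # Shortest value first; ties resolved alphabetically (assumption).
        return min(candidates, key=lambda value: (len(value), value))
    return min(candidates)


candidates = ["D:mickey_mouse", "D:mickey", "D:mickey_mouse"]
shortest = pick_preferred(candidates, "Shortest")
most_common = pick_preferred(candidates, "Most common")
```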
Source Class
- Class : The class from which the resources will be moved.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
Target Class
- Class : The class to which the resources will be moved.
- Source Class Takes Precedence [Optional] : Preferred URI and preferred label from Source Class take precedence. The default value is False.
- Add Merge within Class [Optional] : Desired behavior for automatically applying Merge within Class on Target Class after this component is finished. The possible values are: Merge if needed, Warning, Suppress warning, Always Merge.
Advanced
- Keep Auxiliary Predicates [Optional] : Include auxiliary predicates. The default value is False.
Quality Control
- Fraction of Merged Resources [Optional] : The fraction of filtered source resources which were merged with a destination resource. (higher is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential broader visibility”. The default value is 1.
- Minimal count for warning “Sub-optimal class choice” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Sub-optimal class choice”. The default value is 1.
- Minimal count for warning “Preferred URI from federated public endpoint overwritten” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Preferred URI from federated public endpoint overwritten”. The default value is 1.
- Minimal count for warning “Merge within class: Nothing was merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Nothing was merged.”. The default value is 1.
- Minimal count for warning “Merge within class: Resources with different preferred URI were merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Resources with different preferred URI were merged.”. The default value is 1.
- Minimal count for warning “Merge within class: Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Merge within class: Potential broader visibility”. The default value is 1.
9.4.30. Merge within Class¶
Merges resources in a class that have at least one identical URI.
Description¶
This component merges resources in a class (Target Class) which have subject URIs in common.
Resources are considered to be equivalent if they have subject URIs (disq:uri) in common. This is even true “indirectly”, e.g. if
- resource 1 has subject URIs A and B
- resource 2 has subject URIs B and C
- resource 3 has subject URIs C and D
then all three are considered equivalent (even though 1 and 3 don’t have URIs in common).
The component scans through all records and constructs “equivalence groups”.
- If a resource doesn’t have equivalent resources, then it is left alone.
- If a resource does have equivalent resources, all resources in its group are removed (or rather: deactivated), a new resource is created, and all predicates are “merged”. To be more precise: for each predicate in the class, the values of each resource in the group are concatenated (in arbitrary order).
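The construction of equivalence groups, including the “indirect” case, can be sketched with a union-find structure in Python. This is an illustrative sketch, not the engine's algorithm: `equivalence_groups` is a hypothetical name, and each resource is reduced to its list of subject URIs.

```python
def equivalence_groups(resources):
    """Group resource indices that are connected through shared URIs.

    resources: list of URI lists, one per resource.
    """
    parent = list(range(len(resources)))

    def find(i):
        # Follow parent pointers to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # URI -> index of the first resource carrying it
    for index, uris in enumerate(resources):
        for uri in uris:
            if uri in first_seen:
                parent[find(index)] = find(first_seen[uri])
            else:
                first_seen[uri] = index

    groups = {}
    for index in range(len(resources)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


groups = equivalence_groups([["A", "B"], ["B", "C"], ["C", "D"], ["E"]])
```

Here the first three resources end up in one group, even though the first and third share no URI directly, while the fourth resource forms a group of its own and would be left alone.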
Important note: if resources are merged, a Preferred URI is chosen arbitrarily from the Preferred URIs of those resources. Likewise for the Preferred Label, if present.
During execution of the pipeline it is not uncommon to have resources with identical URIs, within one class or over different classes. At the end of the pipeline, however, when data is published to DISQOVER, this should no longer occur. This component can be used to tackle this “Duplicate (H)URI” problem.
However, keep in mind that this component doesn’t offer a way to select Preferred URIs or Labels. Two alternatives to consider:
- Component Merge Classes offers better control for Preferred URIs and Labels.
- Sometimes it is possible to avoid the creation of duplicate URIs in the first place.
Example¶
URIs have been abbreviated:
- ‘D:’ stands for “http://disney.org/”
and, for simplicity, we have left out the hashed subject URI predicate disq:uri.huri.
Preferred URIs are notated in boldface.
Target Class before applying the component:
disq:uri.uri | animal_name.lit |
---|---|
[D:mickey_mouse] | [“Mickey Mouse”] |
[D:pluto] | [“Pluto”] |
[D:mickey, D:mickey_mouse] | [“Mickey”] |
Target Class after applying the component:
active | disq:uri.uri | animal_name.lit |
---|---|---|
NO | [D:mickey_mouse] | [“Mickey Mouse”] |
YES | [D:pluto] | [“Pluto”] |
NO | [D:mickey, D:mickey_mouse] | [“Mickey”] |
YES | [D:mickey_mouse, D:mickey] | [“Mickey Mouse”, “Mickey”] |
Observe:
- The first and third resource are equivalent because they share a URI (D:mickey_mouse). They are deactivated and a new resource is created with merged predicate values. In the merged resource D:mickey_mouse is indicated as the Preferred URI, but it could just as well have been D:mickey; there is no way to tell beforehand.
- The second resource is left untouched because it has a unique URI.
Options¶
- Class : The class containing the resources to be merged.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
- Preferred URI selection strategy [Optional] : Which value to pick as preferred URI when merging resources with a different preferred URI. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
- Preferred label selection strategy [Optional] : Which value to pick as preferred label when merging resources with a different preferred label. Falls back to taking the alphabetically first value. The possible values are: Alphabetically first, Most common, Shortest.
Quality Control
- Fraction of resources that were merged [Optional] : The fraction of resources that were merged. (higher is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “Nothing was merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Nothing was merged.”. The default value is 1.
- Minimal count for warning “Resources with different preferred URI were merged.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Resources with different preferred URI were merged.”. The default value is 1.
- Minimal count for warning “Potential broader visibility” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential broader visibility”. The default value is 1.
9.4.31. No Operation¶
Does nothing (can be used to improve the organization of the pipeline).
Description¶
This component does nothing. It only exists to facilitate organization of the pipeline.
For example, it can be used as the start or end of a set of components that are closely related, or as a placeholder for a component that needs to be implemented later.
Options¶
- Class [Optional] : The class to which this component logically belongs.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.32. Publish in DISQOVER¶
Publish the integrated data in DISQOVER.
Description¶
The main purpose of this component is to publish all configured data of the pipeline in DISQOVER. After a successful execution the data will be visible in the DISQOVER front-end.
This component transforms resources in the Data Ingestion Engine to instances in DISQOVER. An instance in DISQOVER is an instance of a canonical type, such as a Gene, a Publication or an Organism.
The configuration components dictate which resources should be published and how predicates map to instance properties and/or facets. For more information see
The Configuration components should be the last components in the pipeline, linking to a final Publish in DISQOVER component.
Note
After doing some checks on all the classes, this component erases all data in DISQOVER, before publishing the new data (except when differential indexing is enabled, see further).
Considering this, the Data Ingestion Engine handles execution of this component with some care. When executing a pipeline (fully or partially) Publish in DISQOVER will not execute if any component produced an error (warnings are allowed).
This behavior can be overridden in debugging mode.
Malformed instances¶
In principle, in order to be published in DISQOVER a resource needs to have at least two things:
- a Preferred URI (disq:uri.puri).
- a Preferred Label (disq:pref_label.plabel).
All other predicates are optional and will not prevent the resource from being published as an instance.
It is also important that, in principle, each URI should be unique over all classes before publishing. Duplicates within a class can be merged via Merge within Class, duplicates over multiple classes can be merged via Merge Classes.
In practice, due to errors in the pipeline or in the data, it can happen that resources get no Preferred URI or no Preferred Label, or that different resources get the same subject URI. The component ‘Publish in DISQOVER’ will issue warnings for these cases (see below).
However, it can be hard to find out which resources are causing the problems. To make it easier to find those resources, the option Publish malformed instances is available as a debugging tool. If turned on, the following problematic instances will be published in DISQOVER, but with a special indication in their label:
- If a resource has no Preferred Label the instance label will be its Preferred URI followed by the indication ‘[MISSING LABEL]’.
- If a URI occurs as subject URI in multiple resources (duplicate URIs), then the corresponding instances will get the indication ‘[DUPLICATE URI]’ in their label.
- In this last case, if that URI is the Preferred URI of at least one instance, those instances will get the indication ‘[DUPLICATE PREFERRED URI]’.
Note that resources without Preferred URI will not be published, even if Publish malformed instances is turned on.
This option should be turned off when publishing for RDS.
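The labelling rules above can be sketched in a few lines of Python. This is only an illustration, not the engine's implementation: the dictionary keys (`uris`, `puri`, `pref_label`), the function name, and the exact way the indication is appended to the label are assumptions, as is the choice to show '[DUPLICATE PREFERRED URI]' instead of (rather than in addition to) '[DUPLICATE URI]'.

```python
from collections import Counter


def malformed_labels(resources):
    """Return the display labels of the resources that would be published.

    Each resource is a dict with hypothetical keys:
      'uris'       - all subject URIs of the resource (including the preferred one)
      'puri'       - the Preferred URI, or None
      'pref_label' - the Preferred Label, or None
    """
    # Count in how many resources each URI occurs as a subject URI.
    uri_counts = Counter(u for r in resources for u in set(r["uris"]))
    # URIs that are the Preferred URI of at least one resource.
    preferred = {r["puri"] for r in resources if r["puri"]}

    labels = []
    for r in resources:
        if not r["puri"]:
            continue  # never published, even with the option turned on
        label = r["pref_label"] or f'{r["puri"]} [MISSING LABEL]'
        duplicates = [u for u in r["uris"] if uri_counts[u] > 1]
        if any(u in preferred for u in duplicates):
            label += " [DUPLICATE PREFERRED URI]"
        elif duplicates:
            label += " [DUPLICATE URI]"
        labels.append(label)
    return labels
```

Searching DISQOVER for the bracketed indications then leads directly to the problematic instances.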
Export to RDF¶
The component Publish in DISQOVER can also export instance data to an RDF-file in turtle format (see https://www.w3.org/TR/turtle/).
In order to do so, switch on Export data to file and execute the component.
The export will produce multiple files of the form ‘ccc_nnn’ where ‘ccc’ is the name of the Canonical Type and ‘nnn’ is some unique number.
By default the exported files are written in directory /disqover/data/exports.
You can export to a sub-directory of this directory by filling in the option Export path.
Instance data (if present) is exported in triples with Preferred URI used as subject URI, according to the following scheme:
Instance data | Predicate |
---|---|
Other URIs | owl:sameAs = http://www.w3.org/2002/07/owl#sameAs |
Preferred label | skos:prefLabel = http://www.w3.org/2004/02/skos/core#prefLabel |
Other labels | rdfs:label = http://www.w3.org/2000/01/rdf-schema#label |
Resource Type | rdf:type = http://www.w3.org/1999/02/22-rdf-syntax-ns#type |
Properties (*) | Corresponding property URI |
Facets (*) | Corresponding facet URI |
(*) by default properties are included in the export, but facets are not. This behavior can be overridden per individual property/facet via the option Export to file in the configuration components.
If you only want to export, but not actually publish to DISQOVER you can switch off Publish data to Disqover.
Differential Indexing¶
By default, if this component is executed, all instances which are already present in DISQOVER are removed before publishing.
If the option Differential indexing is switched on, only the new and changed instances are published and only obsolete instances are removed. This can make the execution faster.
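Conceptually, differential indexing compares the instances already published with the instances about to be published. The sketch below illustrates that comparison on two plain dictionaries; the function name and the snapshot representation (`{preferred URI: instance data}`) are assumptions made for the example and say nothing about how DISQOVER stores or compares instances internally.

```python
def diff_index(published, incoming):
    """Compare two {uri: instance_data} snapshots.

    Returns (to_publish, to_remove):
      to_publish - new or changed instances
      to_remove  - URIs of obsolete instances
    """
    to_publish = {
        uri: data
        for uri, data in incoming.items()
        if uri not in published or published[uri] != data
    }
    to_remove = set(published) - set(incoming)
    return to_publish, to_remove
```

Unchanged instances appear in neither result, which is where the speed-up comes from.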
Information¶
- Resource belongs to multiple types
- When a resource belongs to 2 or more canonical types, one example of each combination is provided.
Warnings¶
Class does not contain uri and type predicates
Preferred URI or Resource Type (rdf:type) predicates do not exist in a specific class.
Class contains resources without a type
Resource Type (rdf:type) predicates exist in a class, but the class is not configured in a canonical type component and there are resources which have no values for that predicate. These resources will not be published.
Class does not contain any instances with configured types
The predicate rdf:type.lit is defined in a class, which is not configured for publishing in the configuration components, but none of its values are referenced in the configuration components.
Tree contains loops
Facets can be specified to be hierarchical. In that case the underlying data should also be hierarchical. Nodes are allowed to have multiple parents, but this should never result in a circular reference. If a circular reference is found, this warning is shown. One example per class of such a loop is reported.
Facets expecting single valued predicates received multi valued predicates
Facets that have a data type (integer, float or date) and can be used for histograms should be single-valued. If the corresponding predicate for a resource has more than one value, that value is not published. Note that the value is allowed to be empty.
This problem can probably be solved via a Transform Literals component.
Facets expecting predicates of a certain data type received the wrong data type
Facets that have a data type (integer, float or date) expect a fixed data format. For example a date should be in ISO format ‘YYYY-MM-DD’. A number should only contain digits and optionally a decimal point or a sign. If a predicate value of a resource does not meet these criteria, it will not be published.
Preferred label and/or preferred URI changed in an incremental run
This is a warning which can occur during incremental runs: label and preferred URIs are immutable during such a run. When resources get a new label or URI during such a run anyway, this warning is issued. A full rerun of the data will adapt the labels and URIs.
Resource belongs to local-only and mixed canonical types simultaneously
When using federation, a resource belongs to a local canonical type and to a mixed canonical type. This will result in inaccurate counts in DISQOVER. The pipeline will have to be adapted to fix the problem.
Ambiguous storage field. Storage fields are derived from the postfix of the URI
When the data is published to DISQOVER, facets and properties are transferred to fields of the underlying storage. The names of the fields are derived from the configuration URI: they correspond to the _postfix_ of the URI. However, this may result in conflicts in storage: if the prefixes of two URIs differ but their postfixes are identical, unrelated information may be stored together in one field. The solution is to choose another URI for the property or facet. Note that this typically occurs when one resource has multiple canonical types.
Initial error uploading a batch file to solr
An error occurred during the publishing of the data, but the software could mitigate it by retrying the failed upload. This warning may indicate that some parameters on the server are not configured optimally.
Unknown data source
Some component in the pipeline refers to a data source, but the pipeline doesn’t contain a Define Datasource component for that data source.
Other non-fatal errors occurred
These warnings do not belong to specific categories.
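The two data type warnings above ("single valued" and "wrong data type") describe checks that can be reproduced before publishing, for example in a test script. The following Python sketch follows the rules as stated in those warnings; the function names are hypothetical, and details such as whether a trailing decimal point is accepted are assumptions rather than documented behavior.

```python
import re
from datetime import date

_INTEGER = re.compile(r"[+-]?\d+", re.ASCII)
_FLOAT = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)", re.ASCII)
_ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})", re.ASCII)


def is_valid(value, data_type):
    """Check one literal against the format expected by a typed facet."""
    if data_type == "integer":
        return _INTEGER.fullmatch(value) is not None
    if data_type == "float":
        return _FLOAT.fullmatch(value) is not None
    if data_type == "date":
        match = _ISO_DATE.fullmatch(value)
        if match is None:
            return False
        try:
            # Rejects impossible dates such as 2024-02-30.
            date(*(int(part) for part in match.groups()))
        except ValueError:
            return False
        return True
    raise ValueError(f"unknown data type: {data_type}")


def facet_value(values, data_type):
    """Return the single publishable value, or None if nothing is published."""
    if len(values) != 1:
        return None  # empty is allowed; multiple values are not published
    return values[0] if is_valid(values[0], data_type) else None
```

Values that fail such a check can usually be repaired upstream with a Transform Literals component.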
Errors¶
Class contains resources without preferred label / URI
Preferred URI or Preferred Label predicates exist in the class, but there are resources which have no values for one of those predicates although they are in class configured for publishing or have an rdf:type configured for indexing. These resources will not be published (Note: if Publish malformed instances is turned on, this is reduced to a warning).
Class contains non-unique URIs/Classes contain non-unique preferred URIs
Some URIs are present in multiple resources, possibly in different classes. The resources with these duplicate URIs will not be published, unless Publish malformed instances is turned on. In the latter case this will be reduced to a warning.
Per class a few duplicate URIs are reported. To investigate further, use the action Find URI or turn on Publish malformed instances.
Instances invisible by combination of user roles
User roles can be defined on different levels. Sometimes this can result in an instance being invisible to all users. For example a canonical type may be visible only for members of group A, while an instance in this canonical type is only visible for members of group B.
Error uploading a batch file to solr
An error occurred during the publishing of the data. This might result in incomplete data. When this error occurs, please check the sanity of the solr service.
Error calling solr API
An error occurred during the publishing of the data. This might result in incomplete data. When this error occurs, please contact ONTOFORCE support.
Options¶
Advanced
- Automatically drop predicates [Optional] : Determine automatically if certain predicates are no longer necessary at some point in the pipeline. The default value is False.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.33. Remove Resources¶
Removes all resources from a class that match a given filter.
Description¶
This component “removes” all resources from a class (Target Class) which are included by the filter (filter).
Actually the resources are not really removed, but deactivated (internally marked by a boolean).
Note that it is preferable, in many cases, to apply a filter to a component instead of removing records before running that component.
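The difference between deleting and deactivating can be illustrated with a small Python sketch. The `active` key and both function names are assumptions chosen for the example; the documentation only states that deactivation is tracked internally by a boolean.

```python
def remove_resources(resources, matches_filter):
    """Deactivate every active resource for which matches_filter is true.

    Nothing is deleted: the resource stays in the list with active=False.
    Returns the number of resources that were deactivated.
    """
    deactivated = 0
    for resource in resources:
        if resource["active"] and matches_filter(resource):
            resource["active"] = False
            deactivated += 1
    return deactivated


def active_resources(resources):
    """Later components only see resources that are still active."""
    return [r for r in resources if r["active"]]
```

Because deactivated resources still occupy space in the class, a class with many of them stays as large as before.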
Advanced¶
In general the pipeline tries to make all relationships bidirectional, i.e. using a forward relationship predicate in one alignment and a reverse relationship predicate in the other alignment. This component is the only one which can break directionality because it can remove one of the predicates, while leaving the other one. Therefore it is advisable to apply this component early in the pipeline, before relationships are created.
Removing a lot of records can compromise the performance of further processing; see Create Compact Class.
Options¶
- Class : The class to remove resources from.
- Filter [Optional] : Boolean expression returning true for resources which should be removed.
Advanced
- Update Statistics [Optional] : Update predicate statistics. This may be switched off for performance reasons. The default value is True.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
9.4.34. Synchronize Federated Class¶
Synchronizes the given class with the federated data.
Description¶
This component tries to synchronize local resources to a Federation Endpoint, either by URI matching or by label matching. The Federation Endpoint should have been set up in the DISQOVER backend server, defining the federated server’s connection settings.
URI matching:
- The system will try to find an instance with given local URI on the Federation Endpoint. It will either return a new preferred URI and preferred label, or nothing.
- The local URIs are in a predicate specified in Match URI Predicate, by default disq:matchuri.lit.
Label matching:
- The system will try to find an instance with the given label and DISQOVER Canonical Type URI on the Federation Endpoint. If several candidates are found, the option Use Most Referenced Match determines what happens (see the examples below).
- The local labels are in a predicate specified in Match Label Predicate, by default disq:matchlabel.lit.
- The remote Canonical Type URIs are in a predicate specified in Match Type Predicate, by default disq:matchtype.lit.
Note that you cannot do URI matching and Label matching at the same time.
Depending on the particular case, the component writes to different predicates (e.g. disq:label.lit), see the examples below. It always writes to predicate disq:partition.lit:
- “public” if there was a match on the Federation Endpoint.
- “local” if there was no match.
After publishing to DISQOVER, each instance containing “public” in disq:partition.lit becomes a candidate for “live” federation. This means that when an instance is shown in DISQOVER, all its properties and facets will be retrieved from the Federation Endpoint and combined with its local properties and facets.
After execution, the number of unmatched entries, as well as the number of resources with match errors are shown in the Counters section.
Because the process of synchronizing can take some time, the results are cached locally. This means that the first execution can be slow, but subsequent executions will be fast, provided the data hasn’t changed much. The number of cache hits and cache misses can be inspected in the Counters section.
Important
Federation relies on three assumptions:
- Federated instances belong to the same canonical type on the customer DISQOVER installation as on www.disqover.com. This means synchronized instances cannot also belong to a local canonical type on top of a federated one.
- Mixed instances (customer instances that are enriched with ONTOFORCE data) have the same preferred URI on the customer DISQOVER installation as on www.disqover.com. This means the preferred URI should not be changed after the “Synchronize Federated Class”-component.
- The URIs of data sources are different on the customer DISQOVER installation than on www.disqover.com.
A URI matching example¶
Option | Value |
---|---|
Target Class | DisneyCharacters |
Use Most Referenced Match | False |
URIs have been abbreviated:
- ‘D’ stands for ‘http://disney.org/’
- ‘LD’ stands for ‘http://local.disney.org/’
The situation before applying the component:
disq:uri.uri | disq:matchuri.lit |
---|---|
[LD:mickey_mouse] | [D:mickey_mouse] |
[LD:donald_dog] | [D:donald_dog] |
Suppose D:mickey_mouse exists on the Federation Endpoint, with label “Mickey Mouse”, and that D:donald_dog does not exist. The situation after executing the component is then:
disq:uri.uri | disq:matchuri | disq:partition.lit | disq:label.lit |
---|---|---|---|
[LD:mickey_mouse, D:mickey_mouse] | [D:mickey_mouse] | [“public”] | Mickey Mouse |
[LD:donald_dog] | [D:donald_dog] | [“local”] | |
D:mickey_mouse will be the new preferred URI for the first resource.
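The example can be replayed with a small Python sketch. The Federation Endpoint is modelled here as a plain dictionary, which is purely an assumption for illustration; the function name and the handling of a resource without any match URI are hypothetical as well.

```python
def sync_by_uri(resource, endpoint):
    """Synchronize one resource by URI matching.

    endpoint is a hypothetical dict:
      {remote URI: (preferred URI, preferred label)}
    """
    match_uris = resource.get("disq:matchuri", [])
    # Only the first value of the match predicate is used.
    hit = endpoint.get(match_uris[0]) if match_uris else None
    if hit is None:
        resource["disq:partition"] = ["local"]
        return resource
    preferred_uri, preferred_label = hit
    if preferred_uri not in resource["disq:uri"]:
        resource["disq:uri"].append(preferred_uri)
    resource["disq:partition"] = ["public"]
    resource["disq:label"] = [preferred_label]
    return resource
```

Applied to the two resources of the example, the first one gains `D:mickey_mouse` and the partition "public", while the second one only gets the partition "local".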
A label matching example using most referenced matches¶
Let’s assume the Federation Endpoint has the following data:
URI | Label | Dataset hits |
---|---|---|
D:mickey_mouse | Mickey Mouse | Cartoons 3 hits, Movies 4 hits |
D:ronald_duck | Ronald Duck | Cartoons 1 hit, Movies 2 hits |
D:ronald_d_duck | Ronald Duck | Cartoons 0 hits, Movies 2 hits |
We’ll configure the component as follows:
Option | Value |
---|---|
Target Class | DisneyCharacters |
Use Most Referenced Match | True |
Target Class before applying the component:
disq:uri.uri | disq:matchlabel | disq:matchtype |
---|---|---|
[LD:mickey_mouse] | [“Mickey Mouse”] | [D:disney_character] |
[LD:donald_dog] | [“Donald Dog”] | [D:disney_character] |
[LD:ronald_duck] | [“Ronald Duck”] | [D:disney_character] |
Target Class after applying the component (omitting matchlabel and matchtype):
disq:uri.uri | disq:partition.lit | disq:uri.err | disq:label.lit |
---|---|---|---|
[LD:mickey_mouse, D:mickey_mouse] | [“public”] | | Mickey Mouse |
[LD:donald_dog] | [“local”] | | |
[LD:ronald_duck] | [“public”] | [D:ronald_duck, D:ronald_d_duck] | |
Results:
- D:mickey_mouse will be the new preferred URI.
- Donald Dog was not found and does not get values.
- Ronald Duck has multiple matches with the same maximum number of hits in a dataset (Movies 2 hits), so both matching URIs are stored in the disq:uri.err predicate.
A label matching example not using most referenced matches¶
Let’s assume the Federation Endpoint has the following data:
URI | Label | Dataset hits |
---|---|---|
D:mickey_mouse | Mickey Mouse | Cartoons 3 hits, Movies 4 hits |
D:ronald_duck | Ronald Duck | Cartoons 1 hit, Movies 2 hits |
D:ronald_d_duck | Ronald Duck | Cartoons 0 hits, Movies 1 hit |
We’ll configure the component as follows:
Option | Value |
---|---|
Target Class | DisneyCharacters |
Use Most Referenced Match | False |
The situation before applying the component:
disq:uri.uri | disq:matchlabel | disq:matchtype |
---|---|---|
[LD:mickey_mouse] | [“Mickey Mouse”] | [D:disney_character] |
[LD:donald_dog] | [“Donald Dog”] | [D:disney_character] |
[LD:ronald_duck] | [“Ronald Duck”] | [D:disney_character] |
The situation after applying the component (omitting partition, matchlabel and matchtype):
disq:uri.uri | disq:uri.err | disq:label.lit |
---|---|---|
[LD:mickey_mouse, D:mickey_mouse] | | Mickey Mouse |
[LD:donald_dog] | | |
[LD:ronald_duck] | [D:ronald_duck, D:ronald_d_duck] | |
Results:
- D:mickey_mouse will be the new preferred URI.
- Donald Dog was not found and does not get values.
- Ronald Duck has multiple matches, so both matching URIs are stored in the disq:uri.err predicate. The number of hits in datasets is not used as a determining factor.
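The candidate selection in the two label matching examples can be sketched as follows. The endpoint is modelled as a list of dictionaries, and the ranking rule (the highest hit count in any single dataset) is an assumption read off from the explanation of the first example; the actual ranking used by the Federation Endpoint may differ.

```python
def match_by_label(label, type_uri, endpoint, use_most_referenced=True):
    """Return (matched_uri, error_uris) for one local resource.

    endpoint is a hypothetical list of dicts with keys
    'uri', 'label', 'type' and 'hits' ({dataset name: hit count}).
    """
    candidates = [
        e for e in endpoint if e["label"] == label and e["type"] == type_uri
    ]
    if use_most_referenced and len(candidates) > 1:
        # Assumed ranking: highest hit count in any single dataset.
        def score(entry):
            return max(entry["hits"].values(), default=0)

        best = max(score(e) for e in candidates)
        candidates = [e for e in candidates if score(e) == best]
    if len(candidates) == 1:
        return candidates[0]["uri"], []
    # No candidate gives an empty list; several candidates are all reported,
    # corresponding to the values written to disq:uri.err.
    return None, [e["uri"] for e in candidates]
```

With the option switched off, any label with more than one candidate ends up in the error list, regardless of hit counts.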
Matching multi-valued URIs or labels¶
When the local data contains multiple values in any of the predicates selected by Match URI Predicate, Match Type Predicate or Match Label Predicate, the synchronization action will:
- Use the first found value in the predicate.
- Add a warning about the instances which have multiple values in the selected predicate.
Options¶
- Class : The class to synchronize.
- Match URI Predicate [Optional] : Predicate used to match URIs. The default value is disq:matchuri.lit.
- Match Labels Predicate [Optional] : Predicate used to match labels. The default value is ['disq:matchlabel.lit'].
- Match Type Predicate [Optional] : Predicate used to match type. The default value is disq:matchtype.lit.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
Advanced
- Use Most Referenced Match [Optional] : Whether to use the label match with the highest number of associated datasets. The default value is True.
Quality Control
- Fraction of failed synchronizations [Optional] : The fraction of resources for which the synchronization failed. (lower is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the synchronization of a resource.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the synchronization of a resource.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
9.4.35. Transform Literals¶
Within each resource, applies an expression to transform literal predicates into output literal predicates.
Description¶
This component adds literal predicate values in each resource of a class (Target Class). These values are derived from other literal predicates via expressions.
Predicates that are written (output predicates) are notated with the prefix @, predicates which are read (input predicates) are notated with the prefix $.
If an input predicate is known to be single-valued (for each resource), the prefix $$ can be used to retrieve that value.
For more details about the expression language, see Expression Functions.
The component operates resource by resource, so it is not possible to mix data from different resources. A similar but more powerful component which offers this functionality is Aggregate and Transform (resources).
It is not possible to define subject URIs (disq:uri) with this component; use Add URI instead. Likewise, it is not possible to define subject labels (disq:label) with this component; use Add Label instead.
Example¶
In this example we split a comma-separated literal into a multivalued literal, and convert dates to an ISO-format.
Transformation expression:
set @country = StrSplit($$country_list, ",");
set @iso_date = Map($raw_date, _el, IsoDater(_el, "%m%d%Y"));
Target Class before applying the component:
country_list.lit | raw_date.lit |
---|---|
[“BE,FR,UK”] | [“04101992”] |
[“US”] | [“05101992”, “06101992”] |
Target Class after applying the component:
country_list.lit | raw_date.lit | country.lit | iso_date.lit |
---|---|---|---|
[“BE,FR,UK”] | [“04101992”] | [“BE”, “FR”, “UK”] | [“1992-04-10”] |
[“US”] | [“05101992”, “06101992”] | [“US”] | [“1992-05-10”, “1992-06-10”] |
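For readers more familiar with a general-purpose language, the same transformation can be written in Python. This is an analogue only, not the expression language itself: the behavior attributed to StrSplit and IsoDater is inferred from the example tables, and the function name and dictionary keys are hypothetical.

```python
from datetime import datetime


def transform(resource):
    """Python analogue of the two 'set' expressions in the example."""
    # $$country_list: the predicate is expected to hold exactly one value.
    (country_list,) = resource["country_list"]
    # set @country = StrSplit($$country_list, ",")
    resource["country"] = country_list.split(",")
    # set @iso_date = Map($raw_date, _el, IsoDater(_el, "%m%d%Y"))
    resource["iso_date"] = [
        datetime.strptime(raw, "%m%d%Y").strftime("%Y-%m-%d")
        for raw in resource["raw_date"]
    ]
    return resource
```

Note that both outputs are lists, in line with the requirement that newly created predicates must be lists.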
Options¶
- Class : Class containing the predicates to transform.
- Transformation : The set of expressions to be executed for each resource of the class (unless filtered out). Newly created predicates must be lists.
- Filter [Optional] : Boolean expression returning true for resources which should be included.
Advanced
- Make Auxiliary [Optional] : Make all generated predicates auxiliary. An auxiliary predicate is ignored when publishing data in DISQOVER or when moving data to a different class. The default value is False.
- Data Sources [Optional] : List of URIs of the data sources assigned to this component.
Quality Control
- Fraction of failed transformations [Optional] : The fraction of resources for which the transformation failed. (lower is better)
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.
- Minimal count for warning “An error occurred during the transformation of a resource.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “An error occurred during the transformation of a resource.”. The default value is 1.
- Minimal count for warning “Expression cannot be precompiled.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Expression cannot be precompiled.”. The default value is 1.
- Minimal count for warning “Could not apply detailed provenance, all input data sources have been combined.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Could not apply detailed provenance, all input data sources have been combined.”. The default value is 1.
- Minimal count for warning “The component uses a predicate with provenance of more than one data source.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The component uses a predicate with provenance of more than one data source.”. The default value is 1.
- Minimal count for warning “The output of the component has broader visibility to user groups than some input predicates.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “The output of the component has broader visibility to user groups than some input predicates.”. The default value is 1.
- Minimal count for warning “Potential Preferred Label Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred Label Protection Leak”. The default value is 1.
- Minimal count for warning “Potential Preferred URI Protection Leak” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Potential Preferred URI Protection Leak”. The default value is 1.
9.4.36. Verify Data¶
Verifies data based on a ratio between two filter counts. A warning level and an error level can be set for this fraction.
Description¶
This component compares the numbers of resources in Class selected by two filters and generates a warning or an error if their ratio exceeds a threshold.
It doesn’t change any data.
The component calculates a quality measure equal to condition count divided by the scope count, where
- condition count is the number of (active) resources that pass the Scope Filter and the Condition Filter
- scope count is the number of (active) resources that pass the Scope Filter.
The component generates a warning or an error if the quality measure exceeds thresholds specified via Warning Threshold and Error Threshold as described in Quality Control.
By default, a warning or error is generated if the quality measure is strictly greater than the threshold. That behavior can be reversed via the option Lower is better.
If Scope Filter is empty (default value is True) then the scope count is equal to the total number of (active) resources in the class.
Leaving Condition Filter empty doesn’t make much sense.
Example¶
Option | Value |
---|---|
Condition Filter | ListEmpty($disq:label.lit) |
Scope Filter | empty |
Warning Threshold | 0.01 |
Error Threshold | 0.1 |
Lower is better | True |
This component looks at the fraction of active resources for which disq:label.lit is empty. It generates an error if that fraction is greater than 0.1 (10%) and a warning if it’s greater than 0.01 (1%).
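The calculation behind this example can be sketched in Python as follows. The function name, the `active` key and the result strings are assumptions for illustration, and returning a measure of 0.0 for an empty scope is a choice made here, not documented behavior.

```python
def verify(resources, condition, scope=lambda r: True,
           warning=0.01, error=0.1, lower_is_better=True):
    """Return (quality_measure, status), status is 'ok', 'warning' or 'error'."""
    scoped = [r for r in resources if r["active"] and scope(r)]
    passed = [r for r in scoped if condition(r)]
    # Assumption: an empty scope yields a measure of 0.0.
    measure = len(passed) / len(scoped) if scoped else 0.0

    def exceeds(threshold):
        # Strict comparison, reversed when higher values are better.
        return measure > threshold if lower_is_better else measure < threshold

    if exceeds(error):
        return measure, "error"
    if exceeds(warning):
        return measure, "warning"
    return measure, "ok"
```

For a class of 100 active resources of which 5 have no label, the measure is 0.05, which is above the warning threshold but below the error threshold.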
Options¶
- Class : The class containing the data to be verified.
- Condition Filter : A boolean expression that will be evaluated for all resources in the class. The numerator of the fraction is the number of resources that return True.
- Scope Filter [Optional] : A boolean expression that will be evaluated for all resources in the class. The denominator of the fraction is the number of resources that return True.
Quality Control
- Warning Threshold : The warning threshold of the fraction.
- Error Threshold : The error threshold of the fraction.
- Lower is better [Optional] : If true (default), values above the threshold will generate warnings or errors. If false, values below the threshold will generate warnings or errors. The default value is True.
Warnings
- Minimal count for warning “Encountered empty classes.” [Optional] : This component will report a warning if the number of warning messages is greater than or equal to this number, for warning messages of the type “Encountered empty classes.”. The default value is 1.