9.2. Concepts¶
The search environment of DISQOVER is targeted to end users of various backgrounds who explore the data to gain new insights. Conversely, the data scientists in charge of the data possess extensive knowledge about the data and often have experience with different data models.
Targeted towards the latter audience, the Data Ingestion Engine adopts terminology stemming from different data models, such as RDF and Semantic Web. This section introduces the concepts and terminology used in the Data Ingestion Engine.
Figure 9.6 and Figure 9.7 show an example representation of a resource in the Data Ingestion Engine and the corresponding instance in DISQOVER.
9.2.1. Data representation¶
Data can be represented in many different ways and formats, depending on the data model used.
In a tabular model, data is represented as a table with rows and columns. Examples are CSV files and relational databases.
name, animal, first_appearance
Mickey Mouse, mouse, 1928
Pluto, dog, 1930
In an object-based model, data is represented as an object with properties stored as key-value pairs. Examples are JSON and XML.
[
{ "name": "Mickey Mouse",
"animal": "mouse",
"first_appearance": 1928
},
{ "name": "Pluto",
"animal": "dog",
"first_appearance": 1930
}
]
In a triple-based model, data is represented as a collection of triples of the form “subject predicate object”. RDF is an example.
(Mickey Mouse) (first appeared) 1928
(Mickey Mouse) (is animal) mouse
Each data model uses its own terminology. For example, the terms “record”, “object” and “resource” refer to the same concept.
The terminology used in the Data Ingestion Engine is inspired by the RDF and Semantic Web data models. It adopts among others the words “resource” and “predicate”. However, the internal representation follows the tabular model, as this improves performance and scalability.
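The relation between the three representations can be sketched in a few lines of Python. This is only an illustration of the data models, not of how the Data Ingestion Engine works internally; the function names are hypothetical, and the CSV text is the example above written without spaces after the commas.

```python
import csv
import io

# The CSV snippet from the tabular example above.
CSV_TEXT = """name,animal,first_appearance
Mickey Mouse,mouse,1928
Pluto,dog,1930
"""

def table_to_objects(csv_text):
    """Tabular model -> object-based model: one dict per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

def objects_to_triples(objects, subject_key):
    """Object-based model -> triple-based model: one triple per property."""
    triples = []
    for obj in objects:
        subject = obj[subject_key]
        for key, value in obj.items():
            if key != subject_key:
                triples.append((subject, key, value))
    return triples

objects = table_to_objects(CSV_TEXT)
triples = objects_to_triples(objects, subject_key="name")
for triple in triples:
    print(triple)
```

Note that the CSV reader returns every value as a string, which matches the way the Data Ingestion Engine stores numbers and dates.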
9.2.2. Terminology¶
In this section, the different terms and concepts used in the Data Ingestion Engine are explained using two example data sets, represented as a table.
The first data set contains the names, species and first appearance dates of Disney characters:
name | animal | first_appearance |
---|---|---|
Mickey Mouse | mouse | 1928 |
Pluto | dog | 1930 |
Goofy | dog | 1932 |
Donald Duck | duck | 1934 |
The second data set contains the English and Latin names of animals:
name | species |
---|---|
dog | Canis lupus |
mouse | Mus musculus |
duck | Anas platyrhynchos |
Resource¶
A resource is a piece of information. In the tabular representation of Disney characters and animals, each row is a resource. When publishing data from the Data Ingestion Engine into DISQOVER, resources become instances.
Predicate¶
A predicate is a property of a resource. In the tabular representation of Disney characters and animals, the columns are the predicates.
There are some important ground rules for predicates in the Data Ingestion Engine:
- Predicates are multi-valued lists of strings, which can have no value, [], a single value, [“mouse”], or more than one value, [“mouse”, “rat”].
- The order of values in a predicate is arbitrary.
- Components in the pipeline can add values to predicates, but they cannot remove or change values.
The Data Ingestion Engine uses several predicate data types: literal, path, URI and HURI. Literals can be any kind of information, while URIs and HURIs are identifiers (discussed below). As there is no separate data type for literal predicates containing numbers or dates, they are also represented as strings, e.g. [“1928”]. Paths contain lists of HURIs, representing a path from the root node to a given node in a tree. They are used to create tree facets.
You can use the extension of a predicate to see its data type:
- Literal: .lit, .err
- URI: .uri
- HURI: .huri, .fwd, .rev
- Path: .path
In the example of Disney characters and animals, all predicates are literals. Their names in the Data Ingestion Engine could be, for example, name.lit, animal.lit and first_appearance.lit.
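As a minimal sketch of these rules, the following Python snippet models a resource as a dictionary of multi-valued string lists and derives the data type from the predicate extension. The function names are hypothetical; only the extension list is taken from this section.

```python
# Extension-to-data-type mapping, copied from the list above.
EXTENSION_TYPES = {
    "lit": "literal",
    "err": "literal",
    "uri": "URI",
    "huri": "HURI",
    "fwd": "HURI",
    "rev": "HURI",
    "path": "path",
}

def predicate_data_type(predicate_name):
    """Return the data type implied by the extension of a predicate name."""
    _, separator, extension = predicate_name.rpartition(".")
    if not separator or extension not in EXTENSION_TYPES:
        raise ValueError("unknown predicate extension: " + predicate_name)
    return EXTENSION_TYPES[extension]

def add_value(resource, predicate, value):
    """Add a value to a predicate; values are stored as strings and never removed."""
    resource.setdefault(predicate, []).append(str(value))

mickey = {}
add_value(mickey, "name.lit", "Mickey Mouse")
add_value(mickey, "first_appearance.lit", 1928)  # stored as the string "1928"
print(mickey)
print(predicate_data_type("first_appearance.lit"))
```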
Class¶
A class is a set of resources with the same predicates. For example, the class containing Disney characters can be called disney_characters, and the class containing English and Latin names of species can be called animals. Within a class, there should be no duplicate resources and the order of resources is irrelevant.
When configuring the canonical types that are shown in DISQOVER, you can choose to attach one or more classes to a canonical type. In DISQOVER, resources from the class are then shown as instances belonging to that canonical type.
URI¶
Similar to DISQOVER, the Data Ingestion Engine uses Uniform Resource Identifiers or URIs to uniquely identify resources.
There is a dedicated component, Add URI, that creates URIs for resources based on predicate values. They are stored in a special predicate, called disq:uri.uri. All resources need to have at least one URI. When publishing, there should be no duplicate URIs.
Typically, a URI is created by combining a predicate with a prefix. For the class disney_characters, the prefix “http://disney.org/” is a logical choice. For animals, you can choose “http://animals.org/”. In both cases, the “name” predicate can be used to construct the URI.
For efficiency reasons, the Data Ingestion Engine also uses Hashed URIs or HURIs. A HURI is a 128-bit hash of a URI. This is stored in the predicate disq:uri.huri. For example: the HURI corresponding to URI “http://ns.ontoforce.com/datasets/countries/BE” is 111438477708605542085115604658184524903, or 53d645266d35408f530d46f3416c1467 in hexadecimal notation. HURIs are created automatically; users cannot create them themselves.
When adding the URIs and HURIs, the tabular representation of the disney_characters and animals classes looks like this:
disq:uri.uri | disq:uri.huri | name.lit | animal.lit | first_appearance.lit |
---|---|---|---|---|
http://disney.org/mickey_mouse | 257ed34a4… | Mickey Mouse | mouse | 1928 |
http://disney.org/pluto | dd77abb2b… | Pluto | dog | 1930 |
http://disney.org/goofy | 8dd9dc7cb… | Goofy | dog | 1932 |
http://disney.org/donald_duck | f5d6fa0c2… | Donald Duck | duck | 1934 |
and
disq:uri.uri | disq:uri.huri | name.lit | species.lit |
---|---|---|---|
http://animals.org/dog | 2633f7265… | dog | Canis lupus |
http://animals.org/mouse | 008e7e077… | mouse | Mus musculus |
http://animals.org/duck | ce37da7b6… | duck | Anas platyrhynchos |
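The construction of URIs and HURIs can be illustrated with the sketch below. It rests on two assumptions that are not stated in this section: the value is lower-cased and spaces are replaced by underscores (inferred from the example table), and MD5 is used purely as a stand-in for the unspecified 128-bit hash, so the resulting HURIs need not match those produced by the Data Ingestion Engine. All function names are hypothetical.

```python
import hashlib

def make_uri(prefix, value):
    """Combine a prefix with a predicate value.

    Assumption: lower-casing and underscores mirror the example table;
    the Add URI component may normalise values differently.
    """
    return prefix + value.strip().lower().replace(" ", "_")

def make_huri(uri):
    """Return a 128-bit hash of a URI as 32 hexadecimal characters.

    Assumption: MD5 is only a placeholder for the real hash function.
    """
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

def hex_to_decimal(huri_hex):
    """Convert the hexadecimal notation of a HURI to its decimal notation."""
    return int(huri_hex, 16)

def decimal_to_hex(huri_decimal):
    """Convert the decimal notation of a HURI to 32 hexadecimal characters."""
    return format(huri_decimal, "032x")

uri = make_uri("http://disney.org/", "Mickey Mouse")
print(uri)
print(make_huri(uri))
print(hex_to_decimal("53d645266d35408f530d46f3416c1467"))
```

The two conversion helpers show that the decimal and hexadecimal notations are merely two ways of writing the same 128-bit number.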
Label¶
Labels are human-readable names for resources, and are added using a dedicated component, Add Label. They are stored in the predicate disq:label.lit.
A resource can have multiple labels, but only one preferred label. This preferred label is displayed as the name of the corresponding instance after publishing into DISQOVER. All other labels become synonyms.
In disney_characters, name.lit is a logical candidate to form the label, while name.lit and species.lit can both be used as labels for the class animals.
Relationship¶
Relationships or links between resources are created by referring to the HURI of a resource.
For example, it is possible to create a link between Mickey Mouse in disney_characters and mouse in animals, signifying that Mickey Mouse is a mouse.
This link is then stored as the predicate isanimal.fwd in class disney_characters and as isanimal.rev in class animals (or vice versa). Which class receives the .fwd predicate and which one the .rev predicate depends on the configuration of the link.
The value of the predicate is the HURI of the linked resource:
disq:uri.uri | disq:uri.huri | name.lit | animal.lit | first_appearance.lit | isanimal.fwd |
---|---|---|---|---|---|
http://disney.org/mickey_mouse | 257ed34a4… | Mickey Mouse | mouse | 1928 | 008e7e0776… |
http://disney.org/pluto | dd77abb2b… | Pluto | dog | 1930 | 2633f72657… |
http://disney.org/goofy | 8dd9dc7cb… | Goofy | dog | 1932 | 2633f72657… |
http://disney.org/donald_duck | f5d6fa0c2… | Donald Duck | duck | 1934 | ce37da7b6f… |
and
disq:uri.uri | disq:uri.huri | name.lit | species.lit | isanimal.rev |
---|---|---|---|---|
http://animals.org/dog | 2633f72657… | dog | Canis lupus | dd77abb2b… 8dd9dc7cb… |
http://animals.org/mouse | 008e7e0776… | mouse | Mus musculus | 257ed34a4… |
http://animals.org/duck | ce37da7b6f… | duck | Anas platyrhynchos | f5d6fa0c2… |
Relationships can be used to infer information from one class to another. For example, by using the link between the Disney character Mickey Mouse and the animal mouse, you can infer that the Latin name for Mickey’s species is “Mus musculus”.
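The linking and inference steps described above can be sketched as follows. This is a simplified model under stated assumptions: resources are keyed by their URI instead of their HURI to keep the output readable, and the function names create_relationship and infer_by_relationship are hypothetical, not the API of the corresponding components.

```python
disney_characters = {
    "http://disney.org/mickey_mouse": {"name.lit": ["Mickey Mouse"], "animal.lit": ["mouse"]},
    "http://disney.org/pluto": {"name.lit": ["Pluto"], "animal.lit": ["dog"]},
    "http://disney.org/goofy": {"name.lit": ["Goofy"], "animal.lit": ["dog"]},
    "http://disney.org/donald_duck": {"name.lit": ["Donald Duck"], "animal.lit": ["duck"]},
}
animals = {
    "http://animals.org/dog": {"name.lit": ["dog"], "species.lit": ["Canis lupus"]},
    "http://animals.org/mouse": {"name.lit": ["mouse"], "species.lit": ["Mus musculus"]},
    "http://animals.org/duck": {"name.lit": ["duck"], "species.lit": ["Anas platyrhynchos"]},
}

def create_relationship(source, target, source_predicate, matching_predicate, link_name):
    """Link resources whose source predicate value equals a matching predicate value."""
    # Index the target class by the values of the matching predicate.
    index = {}
    for target_id, resource in target.items():
        for value in resource.get(matching_predicate, []):
            index.setdefault(value, []).append(target_id)
    # Store the forward link on the source and the reverse link on the target.
    for source_id, resource in source.items():
        for value in resource.get(source_predicate, []):
            for target_id in index.get(value, []):
                resource.setdefault(link_name + ".fwd", []).append(target_id)
                target[target_id].setdefault(link_name + ".rev", []).append(source_id)

def infer_by_relationship(source, target, link_predicate, aimed_predicate, new_predicate):
    """Copy the values of a predicate from linked resources to the linking resource."""
    for resource in source.values():
        for target_id in resource.get(link_predicate, []):
            values = target[target_id].get(aimed_predicate, [])
            resource.setdefault(new_predicate, []).extend(values)

create_relationship(disney_characters, animals, "animal.lit", "name.lit", "isanimal")
infer_by_relationship(disney_characters, animals, "isanimal.fwd", "species.lit", "species.lit")
print(disney_characters["http://disney.org/mickey_mouse"])
print(animals["http://animals.org/dog"]["isanimal.rev"])
```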
Resource Type¶
Resource Types are similar to classes, but they can be used to assign individual resources from the Data Ingestion Engine to a canonical type in DISQOVER.
For example, if you want to show two canonical types in DISQOVER, Movie characters and Animals, it suffices to assign the classes disney_characters and animals to their respective canonical types.
However, if you want to show two different canonical types, Mammals and Birds, there is no longer a direct link between the classes and the canonical types. In that case, you can assign a resource type “mammal” or “bird” to each resource that defines to which canonical type the corresponding instance belongs. In the configuration, you then assign the resource types to the canonical types.
Resource types are stored in the special predicate rdf:type.lit.
It is good practice to use a prefix for resource types, for example “http://ns.company.com/ontology/classes/”:
disq:uri.uri | disq:uri.huri | name.lit | animal.lit | first_appearance.lit | rdf:type.lit |
---|---|---|---|---|---|
http://disney.org/mickey_mouse | 257ed34a4… | Mickey Mouse | mouse | 1928 | http://ns.company.com/ontology/classes/mammal |
http://disney.org/pluto | dd77abb2b… | Pluto | dog | 1930 | http://ns.company.com/ontology/classes/mammal |
http://disney.org/goofy | 8dd9dc7cb… | Goofy | dog | 1932 | http://ns.company.com/ontology/classes/mammal |
http://disney.org/donald_duck | f5d6fa0c2… | Donald Duck | duck | 1934 | http://ns.company.com/ontology/classes/bird |
and
disq:uri.uri | disq:uri.huri | name.lit | species.lit | rdf:type.lit |
---|---|---|---|---|
http://animals.org/dog | 2633f72657… | dog | Canis lupus | http://ns.company.com/ontology/classes/mammal |
http://animals.org/mouse | 008e7e0776… | mouse | Mus musculus | http://ns.company.com/ontology/classes/mammal |
http://animals.org/duck | ce37da7b6f… | duck | Anas platyrhynchos | http://ns.company.com/ontology/classes/bird |
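The assignment of resources to canonical types through resource types can be sketched as a simple lookup. The mapping below and the function name are hypothetical; in practice this assignment is part of the canonical type configuration.

```python
PREFIX = "http://ns.company.com/ontology/classes/"

# Hypothetical configuration: resource type -> canonical type.
RESOURCE_TYPE_TO_CANONICAL_TYPE = {
    PREFIX + "mammal": "Mammals",
    PREFIX + "bird": "Birds",
}

def canonical_types_for(resource):
    """Return the canonical types a resource belongs to, based on rdf:type.lit."""
    resource_types = resource.get("rdf:type.lit", [])
    return sorted(
        {
            RESOURCE_TYPE_TO_CANONICAL_TYPE[resource_type]
            for resource_type in resource_types
            if resource_type in RESOURCE_TYPE_TO_CANONICAL_TYPE
        }
    )

pluto = {"name.lit": ["Pluto"], "rdf:type.lit": [PREFIX + "mammal"]}
duck = {"name.lit": ["duck"], "rdf:type.lit": [PREFIX + "bird"]}
print(canonical_types_for(pluto))
print(canonical_types_for(duck))
```

Because the lookup works per resource, resources from both disney_characters and animals can end up in the same canonical type.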
9.2.3. Naming conventions¶
You can make your pipeline easier to understand and ensure the uniqueness of identifiers by adhering to some naming conventions. Names cannot contain white spaces and the use of numbers and special characters is discouraged.
Canonical Types: | When using federation, the federated canonical types need to have the same URIs as those on www.disqover.com. You can find an overview of canonical types and their URIs in the data tab (see section 6.3). If the canonical type is not federated, the recommended URI format is |
---|---|
Classes: | The names of classes should reflect their content. When the pipeline contains multiple data sources, it is helpful to include the name of the data source. |
Predicates: | The names of predicates use the class name as a prefix. |
Resource URIs: | URIs need to be unique. To create unique resource URIs, you can attach the identifier to a prefix. The following recommendations apply for URI prefixes: |
Data Sources: | The data sources are used to track the provenance of information. They also need a unique identifier: |
Resource Types: | The following recommendations apply to Resource Types: |
9.2.4. Advanced concepts¶
Provenance¶
The Data Ingestion Engine tracks the provenance of each predicate value. The provenance of a value is a bit of meta-data, specifying the set of data sources that contributed to this value, either directly (by importing) or indirectly (e.g. by a transformation).
Data sources are defined in Define Datasource components. Each data source has a unique URI, as well as a name, description and so on.
In an import component you need to specify in Data Source the URI of a data source corresponding to a predecessor Define Datasource component. All values imported will get provenance equal to that data source URI.
The Import Remote Data Set is a special case: it imports not only the data but also the data source definitions in the Remote Data Set, so no separate Define Datasource component is needed. Values will automatically have the correct provenance.
Each Data Ingestion Engine component that outputs predicates will add provenance to the written values. How this provenance is determined depends on the type of component.
Transform Literals components treat predicates separately. For example, if you have the transformation
set @country = StrSplit($$country_list, ",");
set @iso_date = Map($raw_date, _el, IsoDater(_el, "%m%d%Y"));
the provenance of country.lit will be set equal to that of country_list.lit and the provenance of iso_date.lit will be set equal to that of raw_date.lit.
Provenances can also be combined. For example
set @name = [ $$first_name + $$last_name ];
Suppose first_name.lit has provenance {http://my_data_sources/source1} and last_name.lit has provenance {http://my_data_sources/source2}. Then name.lit will get provenance equal to the union of those sets: {http://my_data_sources/source1, http://my_data_sources/source2}.
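Since a provenance is a set of data source URIs, combining provenances amounts to a set union. A minimal sketch, in which the function name is hypothetical:

```python
def combine_provenance(*provenances):
    """Return the union of the provenances of all input values."""
    return frozenset().union(*provenances)

first_name_provenance = {"http://my_data_sources/source1"}
last_name_provenance = {"http://my_data_sources/source2"}

name_provenance = combine_provenance(first_name_provenance, last_name_provenance)
print(sorted(name_provenance))
```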
In general, the provenance of output values is based on the provenance of input values. For example, in the case of an Infer by Relationship (multiple predicates) component, the provenance of an inferred literal predicate is equal to the combination of the provenance of the Aimed Predicate and the provenance of the Relationship Predicate. Two notable exceptions:
- Predicates only used in filters are not taken into account.
- In components which match values (Create Relationship and Map Classes) the deduced provenance will take into account the Source Predicate, but not the Matching Predicate (or the subject URI of the Matching Class in the case of Create Relationship (by identifier)).
In most component types it is possible to override the automatic provenance deduction by filling in one or more data source URIs in the component option Data Sources. All values produced by that component will receive the given provenance instead of the provenance they would receive by default.
The Merge Classes component retains the provenances of the merged values. For example, if class A has a predicate P with provenance {http://my_data_sources/source1} and class B has the same predicate P with provenance {http://my_data_sources/source2}, and you merge class B into A, then predicate P in A will become multi-provenance, i.e. its values can have different provenances (either {http://my_data_sources/source1} or {http://my_data_sources/source2}).
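The difference with combining provenances is that merging keeps the provenance per value. A minimal sketch, assuming a predicate is modelled as a list of (value, provenance) pairs; this representation and the function name are hypothetical:

```python
SOURCE1 = frozenset({"http://my_data_sources/source1"})
SOURCE2 = frozenset({"http://my_data_sources/source2"})

def merge_predicate_values(target_values, source_values):
    """Merge (value, provenance) pairs; each value keeps its own provenance."""
    return list(target_values) + list(source_values)

p_in_a = [("value from A", SOURCE1)]
p_in_b = [("value from B", SOURCE2)]

merged = merge_predicate_values(p_in_a, p_in_b)
provenances = {provenance for _, provenance in merged}
print(len(provenances))  # more than one distinct provenance: P is multi-provenance
```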
Note that the provenance of Preferred URIs and Preferred Labels is not tracked.
Publish in DISQOVER will attach provenance to properties and facets configured in Configure Canonical Type components, based on the provenance of the predicates involved. The provenance of these properties can be seen in DISQOVER. See also section 3.2.2 and section 6.
It is also possible to see provenances in the Data Ingestion Engine via the Fetch Resource Data data inspection tool (see section 9.9.7).
URI Templates¶
By default, the Data Ingestion Engine automatically generates URIs for certain resources. It does this for datasources, canonical types, facets, properties, typed links, relation types and subinstance types.
The default templates which are used for the generation of the URIs are:
Datasource | http://ns.disqover.com/datasource/${data_source_label} |
Canonical Type | http://ns.disqover.com/ct/${canonical_type_label} |
Facet | http://ns.disqover.com/facet/${canonical_type_label}/${facet_label} |
Property | http://ns.disqover.com/property/${canonical_type_label}/${property_label} |
Typed Link | http://ns.disqover.com/typed_link/${source_canonical_type_label}/${destination_canonical_type_label} |
Relation Type | http://ns.disqover.com/relation_type/${source_canonical_type_label}/${destination_canonical_type_label} |
Subinstance Type | http://ns.disqover.com/subinstance/${subinstance_type_label} |
If you want to use a different template for generating the URIs, you can either change the default templates which will be used for all pipelines, or you can configure specific URI templates for a specific pipeline.
Alternatively, you can choose to manually overwrite the URI in the options of a pipeline component.
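The ${...} placeholders in these templates happen to follow the same syntax as Python's string.Template, which makes it easy to illustrate how a template expands. This is only a sketch: the labels used are hypothetical, and how the Data Ingestion Engine encodes labels containing spaces or special characters is not covered here.

```python
from string import Template

# Default templates copied from the table above.
DEFAULT_TEMPLATES = {
    "Datasource": "http://ns.disqover.com/datasource/${data_source_label}",
    "Canonical Type": "http://ns.disqover.com/ct/${canonical_type_label}",
    "Facet": "http://ns.disqover.com/facet/${canonical_type_label}/${facet_label}",
    "Property": "http://ns.disqover.com/property/${canonical_type_label}/${property_label}",
    "Subinstance Type": "http://ns.disqover.com/subinstance/${subinstance_type_label}",
}

def render_uri(kind, **labels):
    """Fill in the placeholders of the template for the given kind of resource."""
    return Template(DEFAULT_TEMPLATES[kind]).substitute(**labels)

print(render_uri("Canonical Type", canonical_type_label="animals"))
print(render_uri("Facet", canonical_type_label="animals", facet_label="species"))
```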
Auxiliary predicates¶
When a predicate is created in a particular Class, it can be for two reasons. Either this predicate is used in the configuration of one or more Canonical Types for a property and/or facet, or the predicate is used as the basis for data transformations or linkage further in the pipeline. Of course a predicate can also serve both goals at the same time.
If a predicate should not be published to DISQOVER or RDS, it is advisable to make this predicate auxiliary. The import components, as well as the Transform Literals component, have the option to create these special predicates. In the importers, each created predicate can be made auxiliary separately. The Transform Literals component has an option Make Auxiliary that will cause all generated predicates to be auxiliary.
Auxiliary predicates can still be used like regular predicates in subsequent components. However, whenever they are ‘moved’ to a different class they are, by default, ignored. Specifically, this means that if the Source Class of a Merge Classes component contains an auxiliary predicate, the Target Class will not contain that predicate after merging. However, the component has an option Include auxiliary predicates to change this default behaviour. The Extract Class component also ignores the auxiliary predicates when moving resources from the Source Class to the Target class, but this behaviour is not (yet) configurable. The same goes for the Create Compact Class component, which also ignores disabled resources.
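The default behaviour of Merge Classes with respect to auxiliary predicates can be sketched as follows. The function merge_classes and its include_auxiliary parameter are hypothetical stand-ins for the component and its option Include auxiliary predicates; classes are modelled as dictionaries keyed by resource URI.

```python
def merge_classes(source, target, auxiliary, include_auxiliary=False):
    """Merge the resources of the source class into the target class.

    Predicates listed in `auxiliary` are skipped unless include_auxiliary is set.
    """
    for uri, resource in source.items():
        merged = target.setdefault(uri, {})
        for predicate, values in resource.items():
            if predicate in auxiliary and not include_auxiliary:
                continue  # auxiliary predicates are ignored by default
            merged.setdefault(predicate, []).extend(values)
    return target

source_class = {
    "http://disney.org/pluto": {"name.lit": ["Pluto"], "raw_date.lit": ["01011930"]},
}
auxiliary_predicates = {"raw_date.lit"}

default_result = merge_classes(source_class, {}, auxiliary_predicates)
full_result = merge_classes(source_class, {}, auxiliary_predicates, include_auxiliary=True)
print(default_result)
print(full_result)
```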
The use of auxiliary predicates ultimately increases the performance of the pipeline. Therefore, it is a good practice to make every predicate that is not used directly in a property or facet auxiliary, so that it is only part of the data ingestion process while it is necessary.
Especially in evolving pipelines, it can be troublesome to keep track of which predicates are made auxiliary and which are not. Therefore, the Data Ingestion Engine has the option to choose which predicates to make auxiliary itself. More specifically, the Publish In DISQOVER component has the option Automatically drop predicates. If enabled, the Data Ingestion Engine will analyse the pipeline and track which predicates are used for properties and facets and which are not. As a result, the Merge Classes component will ignore the predicates that it knows are unnecessary. Predicates can still be made auxiliary manually, so both methods are complementary.
Note
The automatic tagging of predicates as auxiliary by the Data Ingestion Engine is heavily dependent on the predecessorship between the components in the pipeline (the ‘connectors’). In a Configure Canonical Type component, a Canonical Type can be created from specific Classes, but might also be based on RDF type. For this reason, the Data Ingestion Engine cannot know a priori which Classes will contribute to a Canonical Type, and therefore cannot force you to add all components that transform this class as predecessors. The pipeline builder is responsible for creating all necessary connectors. If connectors are missing and the auto-auxiliary feature is enabled, predicates may unintentionally be ignored and property/facet values may be missing in DISQOVER.
When an auxiliary predicate is ignored by, for example, a Merge Classes component, it does not become a part of the Target Class at all. This means that if a new component is introduced downstream in the pipeline that requires that predicate as input, it cannot use these data. When the auxiliary predicate was configured manually, the pipeline builder has to revisit that setting and rerun the pipeline from there. The auto-auxiliary feature of the Data Ingestion Engine comes with its own correction mechanism. Upon each subsequent run (that is not a Full run), the Data Ingestion Engine finds the predicates that have become used in transformations or properties/facets, and traces back at which point upstream in the pipeline they were ignored. Any component from there onwards is automatically scheduled for execution.