9.2. Concepts

The search environment of DISQOVER is targeted to end users of various backgrounds who explore the data to gain new insights. Conversely, the data scientists in charge of the data possess extensive knowledge about the data and often have experience with different data models.

Targeted towards the latter audience, the Data Ingestion Engine adopts terminology stemming from different data models, such as RDF and Semantic Web. This section introduces the concepts and terminology used in the Data Ingestion Engine.

Figure 9.6 and Figure 9.7 show an example representation of a resource in the Data Ingestion Engine and the corresponding instance in DISQOVER.

../_images/resources_in_die.png

Figure 9.6 Tabular representation of resources in the Data Ingestion Engine.

../_images/instance_in_dq.png

Figure 9.7 An instance popout in DISQOVER.

9.2.1. Data representation

Data can be represented in many different ways and formats, depending on the data model used.

In a tabular model, data is represented as a table with rows and columns. Examples are CSV files and relational databases.

name, animal, first_appearance
Mickey Mouse, mouse, 1928
Pluto, dog, 1930

In an object-based model, data is represented as an object with properties stored as key-value pairs. Examples are JSON and XML.

[
    { "name": "Mickey Mouse",
      "animal": "mouse",
      "first_appearance": 1928
    },
    { "name": "Pluto",
      "animal": "dog",
      "first_appearance": 1930
    }
]

In a triple-based model, data is represented as a collection of triples of the form “subject predicate object”. RDF is an example.

(Mickey Mouse)   (first appeared)  1928
(Mickey Mouse)   (is animal)       mouse
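The three representations above carry the same information. As a minimal Python sketch (the function and variable names are illustrative and not part of DISQOVER), the tabular form can be converted into objects and then into triples:

```python
import csv
import io

# The tabular example from above, as CSV text.
CSV_TEXT = """name,animal,first_appearance
Mickey Mouse,mouse,1928
Pluto,dog,1930
"""

def table_to_objects(text):
    """Object-based model: one dict of key-value pairs per row."""
    return list(csv.DictReader(io.StringIO(text)))

def objects_to_triples(objects, subject_key="name"):
    """Triple-based model: (subject, predicate, object) for every other key."""
    return [
        (obj[subject_key], key, value)
        for obj in objects
        for key, value in obj.items()
        if key != subject_key
    ]

objects = table_to_objects(CSV_TEXT)
triples = objects_to_triples(objects)
for triple in triples:
    print(triple)
```

Note that the CSV reader yields every value as a string, so 1928 becomes "1928" in this sketch.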

Each data model uses its own terminology. For example, the terms “record”, “object” and “resource” refer to the same concept.

The terminology used in the Data Ingestion Engine is inspired by the RDF and Semantic Web data models. Among others, it adopts the terms “resource” and “predicate”. However, the internal representation follows the tabular model, as this improves performance and scalability.

9.2.2. Terminology

In this section, the different terms and concepts used in the Data Ingestion Engine are explained using two example data sets, represented as a table.

The first data set contains the names, species and first appearance dates of Disney characters:

name          animal  first_appearance
Mickey Mouse  mouse   1928
Pluto         dog     1930
Goofy         dog     1932
Donald Duck   duck    1934

The second data set contains the English and Latin names of animals:

name   species
dog    Canis lupus
mouse  Mus musculus
duck   Anas platyrhynchos

Resource

A resource is a piece of information. In the tabular representation of Disney characters and animals, each row is a resource. When publishing data from the Data Ingestion Engine into DISQOVER, resources become instances.

Predicate

A predicate is a property of a resource. In the tabular representation of Disney characters and animals, the columns are the predicates.

There are some important ground rules for predicates in the Data Ingestion Engine:

  1. Predicates are multi-valued lists of strings, which can have no value, [], a single value, [“mouse”], or more than one value, [“mouse”, “rat”].
  2. The order of values in a predicate is arbitrary.
  3. Components in the pipeline can add values to predicates, but they cannot remove or change values.
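The ground rules above can be illustrated with a small Python model. This is only a sketch: the Predicate class is hypothetical and says nothing about how the Data Ingestion Engine stores predicates internally.

```python
class Predicate:
    """Hypothetical model of a predicate: an add-only list of strings."""

    def __init__(self, name):
        self.name = name
        self._values = []  # rule 1: zero, one or many values

    def add(self, value):
        # Values are always strings, so numbers and dates are stored as text.
        self._values.append(str(value))

    @property
    def values(self):
        # Rule 3: hand out an immutable copy, so callers can add values
        # through add() but cannot remove or change existing ones.
        return tuple(self._values)

animal = Predicate("animal.lit")
print(animal.values)          # no value
animal.add("mouse")           # a single value
animal.add("rat")             # more than one value
# Rule 2: the order is arbitrary, so compare values without relying on it.
print(sorted(animal.values))
```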

The Data Ingestion Engine uses several predicate data types: literal, path, URI and HURI. Literals can be any kind of information, while URIs and HURIs are identifiers (discussed below). As there is no separate data type for literal predicates containing numbers or dates, these are also represented as strings, e.g. [“1928”]. Paths contain lists of HURIs, representing a path from the root node to a given node in a tree. They are used to create tree facets.

You can use the extension of a predicate to see its data type:

  • Literal: .lit, .err
  • URI: .uri
  • HURI: .huri, .fwd, .rev
  • Path: .path

In the example of Disney characters and animals, all predicates are literals. Their names in the Data Ingestion Engine could be for example name.lit, animal.lit and first_appearance.lit.
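As an illustration, the extension-to-type list above can be written as a lookup table. The function name below is an assumption made for this sketch and is not an API of the Data Ingestion Engine.

```python
# Extension-to-type table taken from the list above.
EXTENSION_TYPES = {
    "lit": "Literal",
    "err": "Literal",
    "uri": "URI",
    "huri": "HURI",
    "fwd": "HURI",
    "rev": "HURI",
    "path": "Path",
}

def predicate_data_type(predicate_name):
    """Return the data type implied by the text after the last dot."""
    _, dot, extension = predicate_name.rpartition(".")
    if not dot or extension not in EXTENSION_TYPES:
        raise ValueError(f"unknown predicate extension in {predicate_name!r}")
    return EXTENSION_TYPES[extension]

for name in ("name.lit", "disq:uri.uri", "disq:uri.huri", "isanimal.fwd"):
    print(name, "->", predicate_data_type(name))
```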

Class

A class is a set of resources with the same predicates. For example, the class containing Disney characters can be called disney_characters, and the class containing English and Latin names of species can be called animals. Within a class, there should be no duplicate resources and the order of resources is irrelevant.

When configuring the canonical types that are shown in DISQOVER, you can choose to attach one or more classes to a canonical type. In DISQOVER, resources from the class are then shown as instances belonging to that canonical type.

URI

Similar to DISQOVER, the Data Ingestion Engine uses Uniform Resource Identifiers or URIs to uniquely identify resources. There is a dedicated component, Add URI, that creates URIs for resources based on predicate values. They are stored in a special predicate, called disq:uri.uri. All resources need to have at least one URI. When publishing, there should be no duplicate URIs.

Typically, a URI is created by combining a predicate with a prefix. For the class disney_characters, the prefix “http://disney.org/” is a logical choice. For animals, you can choose “http://animals.org/”. In both cases, the “name” predicate can be used to construct the URI.
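A minimal sketch of this prefix-plus-value construction is shown below. The normalisation step (lower-casing and replacing whitespace with underscores) is an assumption chosen to mirror example URIs such as http://disney.org/mickey_mouse; the actual options of the Add URI component are not described here.

```python
from collections import Counter

def make_uri(prefix, value):
    """Combine a prefix with a predicate value to form a URI.

    Assumption: the value is lower-cased and runs of whitespace become "_".
    """
    identifier = "_".join(value.lower().split())
    return prefix + identifier

def duplicate_uris(uris):
    """Return the URIs that occur more than once; publishing expects none."""
    return sorted(uri for uri, count in Counter(uris).items() if count > 1)

names = ["Mickey Mouse", "Pluto", "Goofy", "Donald Duck"]
uris = [make_uri("http://disney.org/", name) for name in names]
print(uris)
print(duplicate_uris(uris))
```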

For efficiency reasons, the Data Ingestion Engine also uses Hashed URIs or HURIs. A HURI is a 128-bit hash of a URI. This is stored in the predicate disq:uri.huri. For example: the HURI corresponding to URI “http://ns.ontoforce.com/datasets/countries/BE” is 111438477708605542085115604658184524903, or 53d645266d35408f530d46f3416c1467 in hexadecimal notation. HURIs are automatically created and cannot be made by a user.
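The two notations of a HURI can be converted into each other as sketched below. The stand_in_huri function is purely illustrative: it uses MD5 only because MD5 produces 128 bits, and this section does not specify which hash function the Data Ingestion Engine actually uses, so its output should not be expected to match a real HURI.

```python
import hashlib

HURI_HEX = "53d645266d35408f530d46f3416c1467"  # hexadecimal form quoted above

def huri_hex_to_int(huri_hex):
    """Interpret a 32-digit hexadecimal HURI as an unsigned 128-bit integer."""
    return int(huri_hex, 16)

def huri_int_to_hex(huri_int):
    """Format a 128-bit integer as 32 zero-padded hexadecimal digits."""
    return format(huri_int, "032x")

def stand_in_huri(uri):
    """ASSUMPTION: MD5 is a stand-in, chosen only for its 128-bit output."""
    return hashlib.md5(uri.encode("utf-8")).hexdigest()

value = huri_hex_to_int(HURI_HEX)
print(value)                      # decimal notation
print(value.bit_length() <= 128)  # fits in 128 bits
print(stand_in_huri("http://disney.org/mickey_mouse"))
```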

When adding the URIs and HURIs, the tabular representation of the disney_characters and animals classes looks like this:

disq:uri.uri                    disq:uri.huri  name.lit      animal.lit  first_appearance.lit
http://disney.org/mickey_mouse  257ed34a4…     Mickey Mouse  mouse       1928
http://disney.org/pluto         dd77abb2b…     Pluto         dog         1930
http://disney.org/goofy         8dd9dc7cb…     Goofy         dog         1932
http://disney.org/donald_duck   f5d6fa0c2…     Donald Duck   duck        1934

and

disq:uri.uri              disq:uri.huri  name.lit  species.lit
http://animals.org/dog    2633f7265…     dog       Canis lupus
http://animals.org/mouse  008e7e077…     mouse     Mus musculus
http://animals.org/duck   ce37da7b6…     duck      Anas platyrhynchos

Label

Labels are human-readable names for resources, and are added using a dedicated component, Add Label. They are stored in the predicate disq:label.lit. A resource can have multiple labels, but only one preferred label. This preferred label is displayed as the name of the corresponding instance after publishing into DISQOVER. All other labels become synonyms.

In disney_characters, name.lit is a logical candidate to form the label, while name.lit and species.lit can both be used as labels for the class animals.
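The split between a preferred label and synonyms can be sketched as follows. How the preferred label is selected in the Add Label component is not covered here, so the preferred argument is an assumption of this example.

```python
def split_labels(labels, preferred):
    """Return (preferred_label, synonyms) for one resource."""
    if preferred not in labels:
        raise ValueError("the preferred label must be one of the labels")
    synonyms = [label for label in labels if label != preferred]
    return preferred, synonyms

# For the class animals, both name.lit and species.lit are used as labels.
labels = ["dog", "Canis lupus"]
preferred_label, synonyms = split_labels(labels, preferred="dog")
print(preferred_label, synonyms)
```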

Relationship

Relationships or links between resources are created by referring to the HURI of a resource. For example, it is possible to create a link between Mickey Mouse in disney_characters and mouse in animals, signifying that Mickey Mouse is a mouse.

This link is then stored as the predicate isanimal.fwd in class disney_characters and as isanimal.rev in class animals (or vice versa). Which class receives the .fwd predicate and which one the .rev predicate depends on the configuration of the link.

The value of the predicate is the HURI of the linked resource:

disq:uri.uri                    disq:uri.huri  name.lit      animal.lit  first_appearance.lit  isanimal.fwd
http://disney.org/mickey_mouse  257ed34a4…     Mickey Mouse  mouse       1928                  008e7e0776…
http://disney.org/pluto         dd77abb2b…     Pluto         dog         1930                  2633f72657…
http://disney.org/goofy         8dd9dc7cb…     Goofy         dog         1932                  2633f72657…
http://disney.org/donald_duck   f5d6fa0c2…     Donald Duck   duck        1934                  ce37da7b6f…

and

disq:uri.uri              disq:uri.huri  name.lit  species.lit         isanimal.rev
http://animals.org/dog    2633f72657…    dog       Canis lupus         dd77abb2b…, 8dd9dc7cb…
http://animals.org/mouse  008e7e0776…    mouse     Mus musculus        257ed34a4…
http://animals.org/duck   ce37da7b6f…    duck      Anas platyrhynchos  f5d6fa0c2…

Relationships can be used to infer information from one class to another. For example, by using the link between the Disney character Mickey Mouse and the animal mouse, you can infer that the Latin name for Mickey’s species is “Mus musculus”.
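The linking and inference steps can be sketched in Python as below. The HURIs are placeholders, and the two functions are simplified stand-ins written for this example; they do not reproduce the options of the actual pipeline components.

```python
# Placeholder HURIs; real HURIs are 128-bit hashes generated by the engine.
characters = [
    {"huri": "huri-mickey", "name.lit": ["Mickey Mouse"], "animal.lit": ["mouse"]},
    {"huri": "huri-pluto", "name.lit": ["Pluto"], "animal.lit": ["dog"]},
    {"huri": "huri-goofy", "name.lit": ["Goofy"], "animal.lit": ["dog"]},
]
animals = [
    {"huri": "huri-dog", "name.lit": ["dog"], "species.lit": ["Canis lupus"]},
    {"huri": "huri-mouse", "name.lit": ["mouse"], "species.lit": ["Mus musculus"]},
]

def create_relationship(sources, targets, source_pred, match_pred, link_name):
    """Link each source to every target that shares a value with it."""
    index = {}
    for target in targets:
        for value in target.get(match_pred, []):
            index.setdefault(value, []).append(target)
    for source in sources:
        for value in source.get(source_pred, []):
            for target in index.get(value, []):
                # The link stores the HURI of the resource on the other side.
                source.setdefault(link_name + ".fwd", []).append(target["huri"])
                target.setdefault(link_name + ".rev", []).append(source["huri"])

def infer(sources, targets, link_pred, aimed_pred, new_pred):
    """Copy aimed_pred values from linked targets onto each source."""
    by_huri = {target["huri"]: target for target in targets}
    for source in sources:
        for huri in source.get(link_pred, []):
            inferred = by_huri[huri].get(aimed_pred, [])
            source.setdefault(new_pred, []).extend(inferred)

create_relationship(characters, animals, "animal.lit", "name.lit", "isanimal")
infer(characters, animals, "isanimal.fwd", "species.lit", "species.lit")
print(characters[0]["species.lit"])
```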

Resource Type

Resource Types are similar to classes, but they can be used to assign individual resources from the Data Ingestion Engine to a canonical type in DISQOVER.

For example, if you want to show two canonical types in DISQOVER, Movie characters and Animals, it suffices to assign the classes disney_characters and animals to their respective canonical types.

However, if you want to show two different canonical types, Mammals and Birds, there is no longer a direct link between the classes and the canonical types. In that case, you can assign a resource type “mammal” or “bird” to each resource that defines to which canonical type the corresponding instance belongs. In the configuration, you then assign the resource types to the canonical types.

Resource types are stored in the special predicate rdf:type.lit. It is good practice to use a prefix for resource types, for example “http://ns.company.com/ontology/classes/”:

disq:uri.uri                    disq:uri.huri  name.lit      animal.lit  first_appearance.lit  rdf:type.lit
http://disney.org/mickey_mouse  257ed34a4…     Mickey Mouse  mouse       1928                  http://ns.company.com/ontology/classes/mammal
http://disney.org/pluto         dd77abb2b…     Pluto         dog         1930                  http://ns.company.com/ontology/classes/mammal
http://disney.org/goofy         8dd9dc7cb…     Goofy         dog         1932                  http://ns.company.com/ontology/classes/mammal
http://disney.org/donald_duck   f5d6fa0c2…     Donald Duck   duck        1934                  http://ns.company.com/ontology/classes/bird

and

disq:uri.uri              disq:uri.huri  name.lit  species.lit         rdf:type.lit
http://animals.org/dog    2633f72657…    dog       Canis lupus         http://ns.company.com/ontology/classes/mammal
http://animals.org/mouse  008e7e0776…    mouse     Mus musculus        http://ns.company.com/ontology/classes/mammal
http://animals.org/duck   ce37da7b6f…    duck      Anas platyrhynchos  http://ns.company.com/ontology/classes/bird
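The effect of resource types can be sketched as a grouping step: resources from both classes end up under a canonical type according to their rdf:type.lit value, not according to their class. The configuration dictionary below is a hypothetical stand-in for the canonical type configuration.

```python
PREFIX = "http://ns.company.com/ontology/classes/"

# Hypothetical configuration: which resource types feed which canonical type.
CANONICAL_TYPES = {
    "Mammals": {PREFIX + "mammal"},
    "Birds": {PREFIX + "bird"},
}

# A few resources taken from both example classes.
resources = [
    {"name.lit": ["Mickey Mouse"], "rdf:type.lit": [PREFIX + "mammal"]},
    {"name.lit": ["Donald Duck"], "rdf:type.lit": [PREFIX + "bird"]},
    {"name.lit": ["dog"], "rdf:type.lit": [PREFIX + "mammal"]},
]

def group_by_canonical_type(resources, canonical_types):
    """Assign each resource to the canonical types that accept its type."""
    grouped = {name: [] for name in canonical_types}
    for resource in resources:
        types = set(resource.get("rdf:type.lit", []))
        for name, accepted in canonical_types.items():
            if types & accepted:
                grouped[name].append(resource["name.lit"][0])
    return grouped

grouped = group_by_canonical_type(resources, CANONICAL_TYPES)
print(grouped)
```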

9.2.3. Naming conventions

You can make your pipeline easier to understand and ensure the uniqueness of identifiers by adhering to some naming conventions. Names cannot contain whitespace, and the use of numbers and special characters is discouraged.

Canonical Types:
 

When using federation, the federated canonical types need to have the same URIs as those on www.disqover.com. You can find an overview of canonical types and their URIs in the data tab (see section 6.3). If the canonical type is not federated, the recommended URI format is http://ns.[company].com/disqover.ontology/canonical_type/[ct_name].

Classes:

[datasource_name]_[class_name]

The names of classes should reflect their content. When the pipeline contains multiple data sources, it is helpful to include the name of the data source.

Predicates:

[future_canonical_type]:[predicate_name].lit

The names of predicates use the class name as a prefix.

Resource URIs:

URIs need to be unique. To create unique resource URIs, you can attach the identifier to a prefix. The following recommendations apply to URI prefixes:

  1. Adhere to existing guidelines for identifiers, for example using www.identifiers.org.
  2. Use the same scheme as disqover.com.
  3. Use an existing internal format.
  4. In the absence of internal or external examples, use the prefix http://ns.[company].com/[datasource_name]/[class_name]/.

Data Sources:

http://ns.[company].com/disqover.dataset/[name]

The data sources that are used to track the provenance of information also need a unique identifier, following the format above.

Resource Types:

The following recommendations apply to Resource Types:

  1. Adhere to existing ontologies, such as foaf, dcat, …
  2. In the absence of internal or external examples, use the format http://ns.[company].com/ontology/classes/[resource_type_name].
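The fallback formats recommended in this section can be captured as small helper functions. This is a sketch only: the function names are made up for this example, and the helpers simply fill in the bracketed placeholders of the formats given above.

```python
import re

def check_name(name):
    """Names cannot contain whitespace."""
    if re.search(r"\s", name):
        raise ValueError(f"name contains whitespace: {name!r}")
    return name

def class_name(datasource, cls):
    return f"{check_name(datasource)}_{check_name(cls)}"

def canonical_type_uri(company, ct_name):
    return f"http://ns.{company}.com/disqover.ontology/canonical_type/{check_name(ct_name)}"

def resource_uri_prefix(company, datasource, cls):
    return f"http://ns.{company}.com/{check_name(datasource)}/{check_name(cls)}/"

def data_source_uri(company, name):
    return f"http://ns.{company}.com/disqover.dataset/{check_name(name)}"

def resource_type_uri(company, resource_type):
    return f"http://ns.{company}.com/ontology/classes/{check_name(resource_type)}"

print(class_name("disney", "characters"))
print(resource_uri_prefix("company", "disney", "characters"))
print(resource_type_uri("company", "mammal"))
```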

9.2.4. Advanced concepts

Provenance

The Data Ingestion Engine tracks the provenance of each predicate value. The provenance of a value is a piece of metadata that specifies the set of data sources that contributed to this value, either directly (by importing) or indirectly (e.g. by a transformation).

Data sources are defined in Define Datasource components. Each data source has a unique URI, as well as a name, description and so on.

In an import component, the Data Source option must contain the URI of a data source defined in a preceding Define Datasource component. All imported values get a provenance equal to that data source URI.

The Import Remote Data Set is a special case: it imports not only the data but also the data source definitions in the Remote Data Set, so no separate Define Datasource component is needed. Values will automatically have the correct provenance.

Each Data Ingestion Engine component that outputs predicates will add provenance to the written values. How this provenance is determined depends on the type of component.

The Transform Literals component treats predicates separately. For example, given the transformation

set @country = StrSplit($$country_list, ",");
set @iso_date = Map($raw_date, _el, IsoDater(_el, "%m%d%Y"));

the provenance of country.lit will be set equal to that of country_list.lit and the provenance of iso_date.lit will be set equal to that of raw_date.lit.

Provenances can also be combined. For example:

set @name = [ $$first_name + $$last_name ];

Suppose first_name.lit has provenance {http://my_data_sources/source1} and last_name.lit has provenance {http://my_data_sources/source2}. Then name.lit will get a provenance equal to the union of those sets: {http://my_data_sources/source1, http://my_data_sources/source2}.
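Since a provenance is a set of data source URIs, combining provenances amounts to a set union. A minimal Python illustration (the helper name is an assumption of this sketch):

```python
def combine_provenance(*provenances):
    """Return the union of any number of provenance sets."""
    return frozenset().union(*provenances)

first_name_provenance = {"http://my_data_sources/source1"}
last_name_provenance = {"http://my_data_sources/source2"}

name_provenance = combine_provenance(first_name_provenance, last_name_provenance)
print(sorted(name_provenance))
```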

In general, the provenance of output values is based on the provenance of input values. For example, in the case of an Infer by Relationship (multiple predicates) component, the provenance of an inferred literal predicate is equal to the combination of the provenance of the Aimed Predicate and the provenance of the Relationship Predicate. Two notable exceptions:

  • Predicates only used in filters are not taken into account.
  • In components which match values (Create Relationship and Map Classes) the deduced provenance will take into account the Source Predicate, but not the Matching Predicate (or the subject URI of the Matching Class in the case of Create Relationship (by identifier)).

In most component types, it is possible to override the automatic provenance deduction by filling in one or more data source URIs in the component option Data Sources. All values produced by that component will receive the given provenance instead of the provenance they would receive by default.

The Merge Classes component retains the provenances of the merged values. For example, if class A has a predicate P with provenance {http://my_data_sources/source1} and class B has the same predicate P with provenance {http://my_data_sources/source2}, and you merge class B into A, then predicate P in A will become multi-provenance, i.e. its values can have different provenances (either {http://my_data_sources/source1} or {http://my_data_sources/source2}).
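To make the multi-provenance case concrete, the sketch below attaches a provenance set to each individual value and merges two classes. The data layout is an assumption made for this illustration, not the engine's internal format.

```python
SOURCE1 = frozenset({"http://my_data_sources/source1"})
SOURCE2 = frozenset({"http://my_data_sources/source2"})

# Each value is stored together with its own provenance set.
class_a = {"P": [("value from A", SOURCE1)]}
class_b = {"P": [("value from B", SOURCE2)]}

def merge_with_provenance(target, source):
    """Merge source into a copy of target, keeping each value's provenance."""
    merged = {pred: list(values) for pred, values in target.items()}
    for pred, values in source.items():
        merged.setdefault(pred, []).extend(values)
    return merged

merged = merge_with_provenance(class_a, class_b)
provenances = {provenance for _, provenance in merged["P"]}
print(len(merged["P"]), len(provenances))
```

After the merge, predicate P holds two values whose provenances differ, which is what the text calls multi-provenance.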

Note that the provenance of Preferred URIs and Preferred Labels is not tracked.

Publish in DISQOVER will attach provenance to properties and facets configured in Configure Canonical Type components, based on the provenance of the predicates involved. The provenance of these properties can be seen in DISQOVER. See also section 3.2.2 and section 6.

It is also possible to see provenances in the Data Ingestion Engine via the Fetch Resource Data data inspection tool (see section 9.9.7).

URI Templates

By default, the Data Ingestion Engine automatically generates URIs for certain resources. It does this for data sources, canonical types, facets, properties, typed links, relation types and subinstance types.

The default templates which are used for the generation of the URIs are:

Datasource        http://ns.disqover.com/datasource/${data_source_label}
Canonical Type    http://ns.disqover.com/ct/${canonical_type_label}
Facet             http://ns.disqover.com/facet/${canonical_type_label}/${facet_label}
Property          http://ns.disqover.com/property/${canonical_type_label}/${property_label}
Typed Link        http://ns.disqover.com/typed_link/${source_canonical_type_label}/${destination_canonical_type_label}
Relation Type     http://ns.disqover.com/relation_type/${source_canonical_type_label}/${destination_canonical_type_label}
Subinstance Type  http://ns.disqover.com/subinstance/${subinstance_type_label}
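The placeholder substitution in these templates can be mimicked with Python's string.Template, which happens to use the same ${name} syntax. This is an illustration only: whether and how the Data Ingestion Engine normalises labels (for example spaces or capitals) before filling them in is not covered here, so the labels below are passed through unchanged.

```python
from string import Template

# A subset of the default templates listed above.
DEFAULT_TEMPLATES = {
    "Datasource": "http://ns.disqover.com/datasource/${data_source_label}",
    "Canonical Type": "http://ns.disqover.com/ct/${canonical_type_label}",
    "Facet": "http://ns.disqover.com/facet/${canonical_type_label}/${facet_label}",
    "Property": "http://ns.disqover.com/property/${canonical_type_label}/${property_label}",
}

def generate_uri(kind, **labels):
    """Fill in a template; a missing label raises KeyError."""
    return Template(DEFAULT_TEMPLATES[kind]).substitute(**labels)

print(generate_uri("Canonical Type", canonical_type_label="animal"))
print(generate_uri("Facet", canonical_type_label="animal", facet_label="species"))
```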

If you want to use a different template for generating the URIs, you can either change the default templates which will be used for all pipelines, or you can configure specific URI templates for a specific pipeline.

Alternatively, you can choose to manually overwrite the URI in the options of a pipeline component.

Auxiliary predicates

When a predicate is created in a particular Class, it can be for two reasons. Either this predicate is used in the configuration of one or more Canonical Types for a property and/or facet, or the predicate is used as the basis for data transformations or linkage further in the pipeline. Of course a predicate can also serve both goals at the same time.

If a predicate should not be published to DISQOVER or RDS, it is advisable to make this predicate auxiliary. The import components, as well as the Transform Literals component, have the option to create these special predicates. In the importers, each created predicate can be made auxiliary separately. The Transform Literals component has an option Make Auxiliary that will cause all generated predicates to be auxiliary.

Auxiliary predicates can still be used like regular predicates in subsequent components. However, whenever they are ‘moved’ to a different class they are, by default, ignored. Specifically, this means that if the Source Class of a Merge Classes component contains an auxiliary predicate, the Target Class will not contain that predicate after merging. However, the component has an option Include auxiliary predicates to change this default behaviour. The Extract Class component also ignores the auxiliary predicates when moving resources from the Source Class to the Target class, but this behaviour is not (yet) configurable. The same goes for the Create Compact Class component, which also ignores disabled resources.
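The default behaviour of merging, and the effect of including auxiliary predicates, can be sketched as follows. A class is simplified here to a mapping from predicate name to values, and the include_auxiliary flag is a stand-in for the Include auxiliary predicates option; the real component works on resources and has more options.

```python
def merge_classes(target, source, auxiliary, include_auxiliary=False):
    """Merge source into a copy of target.

    auxiliary is the set of predicate names marked as auxiliary in the
    source class. They are skipped unless include_auxiliary is True.
    """
    merged = {pred: list(values) for pred, values in target.items()}
    for pred, values in source.items():
        if pred in auxiliary and not include_auxiliary:
            continue  # default: auxiliary predicates are ignored
        merged.setdefault(pred, []).extend(values)
    return merged

source_class = {"name.lit": ["Pluto"], "raw_date.lit": ["09051930"]}
target_class = {"name.lit": ["Mickey Mouse"]}
auxiliary = {"raw_date.lit"}

default_merge = merge_classes(target_class, source_class, auxiliary)
full_merge = merge_classes(target_class, source_class, auxiliary, include_auxiliary=True)
print(sorted(default_merge))
print(sorted(full_merge))
```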

The use of auxiliary predicates ultimately increases the performance of the pipeline. Therefore, it is a good practice to make every predicate that is not used directly in a property or facet auxiliary, so that it is only part of the data ingestion process while it is necessary.

Especially in pipelines that are still evolving, it can be troublesome to keep track of which predicates are made auxiliary and which are not. Therefore, the Data Ingestion Engine can choose by itself which predicates to make auxiliary. More specifically, the Publish in DISQOVER component has the option Automatically drop predicates. If enabled, the Data Ingestion Engine will analyse the pipeline and track which predicates are used for properties and facets and which are not. As a result, the Merge Classes component will ignore the predicates that it knows are unnecessary. Predicates can still be made auxiliary manually, so both methods are complementary.

Note

The automatic tagging of predicates as auxiliary by the Data Ingestion Engine depends heavily on the predecessor relations between the components in the pipeline (the ‘connectors’). In a Configure Canonical Type component, a Canonical Type can be created from specific Classes, but might also be based on RDF type. For this reason, the Data Ingestion Engine cannot know a priori which Classes will contribute to a Canonical Type, and therefore cannot force you to add all components that transform these classes as predecessors. The pipeline builder is responsible for creating all necessary connectors. If connectors are missing and the auto-auxiliary feature is enabled, predicates may unintentionally be ignored and property or facet values may be missing in DISQOVER.

When an auxiliary predicate is ignored by, for example, a Merge Classes component, it does not become part of the Target Class at all. This means that if a new component that requires that predicate as input is introduced downstream in the pipeline, it cannot use this data. When the auxiliary predicate was configured manually, the pipeline builder has to revisit that setting and rerun the pipeline from there. The auto-auxiliary feature of the Data Ingestion Engine comes with its own correction mechanism. Upon each subsequent run (that is not a Full run), the Data Ingestion Engine finds the predicates that have since become used in transformations or properties/facets, and traces back to the point upstream in the pipeline where they were ignored. Any component from there onwards is automatically scheduled for execution.