9.1. Introduction

DISQOVER provides extensive capabilities for searching, filtering and following links through data from various sources. The platform owes its scalability and near-real-time performance for full-text queries to the indexed document store from which it serves data.

In order to build this indexed document store, DISQOVER is equipped with a powerful Data Ingestion Engine that allows you to ingest and link data. The process begins with importing data from private, public or third-party sources, which you can standardize and transform depending on your needs. Using your thorough understanding of the field, you can then create a network of linked data and generate new insights by inferring information, resulting in a tightly-knit knowledge graph. This knowledge graph can be explored in DISQOVER or exported into TTL format and used elsewhere.

../_images/system_overview.png

Figure 9.1 Overview of the data flow in DISQOVER.

9.1.1. The data ingestion pipeline

The Data Ingestion Engine visualizes the data ingestion process as a pipeline. Every step in the process, such as importing source files or creating links, is represented by a component in the pipeline. Connectors between components define the chronological order in which the actions occur. Figure 9.2 shows an example of such a pipeline.

Because of this simple and flexible approach, you do not need an extensive background in coding to ingest and link data. The visual pipeline and reusable components make it easier to understand the data integration process and to contribute to it.
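Conceptually, such a pipeline is a directed graph: components are the nodes and connectors are the edges that determine when each component may run. The sketch below is a purely hypothetical Python illustration (not DISQOVER's internal API; the component names are invented) of how a topological sort over the connectors yields the execution order.

    from graphlib import TopologicalSorter

    # Hypothetical pipeline: each key is a component, each value is the set of
    # components whose connectors feed into it (its predecessors).
    pipeline = {
        "import_source_files": {"define_data_source"},
        "set_uri_and_label": {"import_source_files"},
        "create_links": {"set_uri_and_label"},
        "publish": {"create_links"},
    }

    # The connectors define the chronological order in which the components run.
    for component in TopologicalSorter(pipeline).static_order():
        print("run:", component)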

../_images/pipeline_example_overview.png

Figure 9.2 Screenshot of a pipeline in the Data Ingestion Engine.

The data ingestion process typically consists of three phases, with components facilitating actions that commonly occur in each phase:

  • Setting the stage
  • Integrating
  • Configuring and Publishing

Setting the stage

The first phase consists of importing the source files and preparing the imported data for linking or display in DISQOVER. Every pipeline starts with components that define the data sources to be used. Next, the necessary source files are imported and the data is stored in classes. This approach allows the provenance of each piece of information to be tracked throughout the pipeline and later displayed in DISQOVER. Additional components set the URI and labels for resources in the classes, which allow the data to be identified. If necessary, the data can also be pruned and transformed.
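As a rough illustration of what setting a URI and label and tagging provenance amounts to, the snippet below builds a single resource with the open-source rdflib library and prints it as Turtle (TTL). The namespace, class name and provenance predicate are invented for the example and do not reflect DISQOVER's internal data model.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/")  # hypothetical namespace
    g = Graph()

    # One imported record, stored as a resource of a (made-up) "Gene" class.
    gene = EX["gene/BRCA2"]
    g.add((gene, RDF.type, EX.Gene))
    g.add((gene, RDFS.label, Literal("BRCA2")))              # label used for display
    g.add((gene, EX.sourceDataSource, Literal("ensembl")))   # illustrative provenance tag

    print(g.serialize(format="turtle"))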

Components often used in this phase:

Integrating

In the second phase, the prepared data from the various sources is integrated into a densely linked knowledge graph. Links are made between related classes, after which information can be inferred from one linked class to another. Classes can also be merged or extracted.
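The sketch below gives a schematic idea of what linking and inferring mean here, using plain Python dictionaries rather than DISQOVER components; all class names, identifiers and properties are fabricated for the example. Two classes are linked on a shared identifier, and a property is then inferred (copied) across that link.

    # Two prepared classes (hypothetical data).
    genes = [{"uri": "ex:gene/BRCA2", "label": "BRCA2", "symbol": "BRCA2"}]
    diseases = [{"uri": "ex:disease/breast-cancer", "label": "Breast cancer",
                 "associated_gene_symbol": "BRCA2"}]

    # Linking: relate diseases to genes through a shared identifier (the gene symbol).
    links = [(d["uri"], gene["uri"])
             for d in diseases
             for gene in genes
             if d["associated_gene_symbol"] == gene["symbol"]]

    # Inferring: propagate information across the link, e.g. copy each linked
    # disease label onto the gene as an "associated disease" property.
    disease_by_uri = {d["uri"]: d for d in diseases}
    for disease_uri, gene_uri in links:
        gene = next(item for item in genes if item["uri"] == gene_uri)
        gene.setdefault("associated_diseases", []).append(disease_by_uri[disease_uri]["label"])

    print(genes)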

Components often used in this phase:

Configuring and Publishing

The final phase consists of preparing the data for display in DISQOVER by configuring canonical types, facets and properties. The final component then publishes the entire configuration to DISQOVER or exports it to TTL.

Components often used in this phase:

Figure 9.3 shows the chronology of these three phases in a simple pipeline. As pipelines become more complex, the three phases become less well-defined. For example, it is often necessary to add a label and URI to an extracted class or transform inferred data. This pattern of first preparing data and then merging, linking and inferring can be repeated multiple times.

The following sections contain a tutorial on building pipelines and detailed explanations of components and other functionalities (sections 9.3 to 9.7).

../_images/pipeline_three_phases.png

Figure 9.3 Illustration of the chronology of the three phases.

9.1.2. Advanced Data Integration

In the Data Ingestion Engine, you can enrich your own data sources with data from the public domain or with data that was ingested by another DISQOVER installation.

There are two distinct mechanisms that allow such advanced data integration: Remote Data Subscription and Federation.

Remote Data Subscription

Remote Data Subscription (RDS) is a novel way to integrate data that has been prepared by other DISQOVER installations.

Figure 9.4 shows a schematic overview of the interaction of two DISQOVER installations with Remote Data Subscription.

  • The Publisher is a DISQOVER installation that prepares the data by ingesting data sources and configuring canonical types available for RDS. The canonical types that are published to RDS are called data sets.
  • The Subscriber is a DISQOVER installation that subscribes to one or more Publishers and retrieves their data sets. These data sets can then be ingested in a pipeline like any other data source. Data retrievals can be scheduled.

ONTOFORCE uses RDS to publish data sets that contain information from the public domain. Customers of ONTOFORCE can then subscribe to retrieve those data sets and use them in their own pipelines. You can also set up RDS between two or more of your own DISQOVER installations. For example, one installation can be used to collect and ingest data that has been prepared by other installations. A single DISQOVER installation can be a Publisher and a Subscriber at the same time.

For a step-by-step walkthrough of setting up Remote Data Subscription, see section 9.3.9. For detailed information, see section 10.

../_images/rds_overview1.png

Figure 9.4 Schematic overview of Remote Data Subscription.

Federation

Federation is a way to make public data that has been ingested and provided by ONTOFORCE available to a customer’s DISQOVER installation.

Figure 9.5 shows a schematic overview of Federation.

  • ONTOFORCE ingests public data sources and makes them available to their customers.
  • The customer DISQOVER installation can enable Federation. In federated mode, the customer DISQOVER installation is enriched with public data:
    • Customer users can see and search ONTOFORCE’s canonical types.
    • Data from the customer data sources can be linked to public canonical types.
    • Customer canonical types can be enriched with data from public canonical types.

When the data is queried, the customer server searches the public data made available by ONTOFORCE, and the results are then sent back to the customer installation. With Federation, it is not possible to integrate the public data within the customer’s Data Ingestion Engine.

For a step-by-step walkthrough of setting up Federation, see section 9.3.10.

../_images/federation_overview.png

Figure 9.5 Schematic overview of Federation.

Difference between RDS and Federation

The two mechanisms of ingesting external data each have their merits and can be used in different circumstances.

Remote Data Subscription | Federation
Ingesting data from any DISQOVER installation | Linking and displaying data from ONTOFORCE
Data is retrieved and stored on the server | Data is not stored on the server
No limitation on analytics for data with mixed provenance | No analytics widgets for mixed data sets
Full flexibility in data ingestion | No flexibility
No information leaves the Subscriber installation | Information leaves the customer’s premises

9.1.3. Transparency and traceability

A transparent data ingestion process is key to keeping track of the provenance of information and retaining control over the data. In addition to the visual interface, which makes the process and the chronology of events easier to understand, the Data Ingestion Engine offers several built-in tools that facilitate this transparency. A complete overview of these tools is given in section 9.9.

Bi-directional lineage analysis

The Data Ingestion Engine tracks which data sources contributed to each piece of data. This makes it possible to trace every property shown in DISQOVER back to the data sources that contributed to it, and to restrict information to groups of users based on their access rights to those data sources.

At the start of a pipeline, one or more data sources are defined, and throughout the pipeline each piece of information is “tagged” with the data source or data sources it is derived from. After loading into DISQOVER, this provenance is featured prominently (see also section 6).

Similar to tracking provenance, the Data Ingestion Engine is capable of tracking data dependencies throughout the pipeline. You can see all facets and properties an individual data source contributed to, and how that information was transformed and inferred.
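As a simplified, hypothetical sketch of this kind of lineage tracking (not DISQOVER's implementation), each value below carries the set of data sources it was derived from, and a transformation that combines two values also merges their source tags, so the result can always be traced back and access-controlled per source.

    # Each piece of information is tagged with the data source(s) it came from.
    record_a = {"value": "BRCA2", "sources": {"ensembl"}}
    record_b = {"value": "brca2", "sources": {"uniprot"}}

    def merge(a, b):
        """Combine two records; the result carries the union of their source tags."""
        return {
            "value": a["value"],                     # e.g. keep the preferred spelling
            "sources": a["sources"] | b["sources"],  # provenance of both inputs survives
        }

    merged = merge(record_a, record_b)
    print(merged)  # value 'BRCA2', sources {'ensembl', 'uniprot'} (set order may vary)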

Quality Control

The Data Ingestion Engine has built-in tools for data quality control (QC), such as automatically calculated metrics and customizable pipeline components to validate the data. QC can therefore occur at different stages of the data ingestion process.

As real-world data is never fully complete or accurate, the Data Ingestion Engine uses a tolerance-based approach to QC. You can choose to set warning and error thresholds that allow for an expected number of imperfections in the data.
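The snippet below sketches what such a tolerance-based check could look like in principle; the thresholds, field name and sample data are invented and do not correspond to DISQOVER's built-in QC components.

    # Illustrative tolerance-based QC check on a required "label" property.
    WARNING_THRESHOLD = 0.05  # warn when more than 5% of records lack a label
    ERROR_THRESHOLD = 0.20    # fail when more than 20% do

    records = [{"label": None}] + [{"label": f"GENE{i}"} for i in range(9)]

    missing_rate = sum(1 for r in records if not r.get("label")) / len(records)

    if missing_rate > ERROR_THRESHOLD:
        raise ValueError(f"QC error: {missing_rate:.0%} of records have no label")
    elif missing_rate > WARNING_THRESHOLD:
        print(f"QC warning: {missing_rate:.0%} of records have no label")
    else:
        print("QC passed")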