9.8. Starting a run

In the Tutorial, you learned how to start a pipeline with the ‘Start pipeline execution’ window. You may have noticed that the option ‘Publish to DISQOVER’ is switched on by default, but can be turned off if desired. If you set up your pipeline to publish to RDS as described in the section Using Remote Data Subscription, you will also have switched on the option ‘Publish for RDS’, found in the same window. This option is not switched on by default.

Besides these two switches, there are a few more options in the ‘Start pipeline execution’ window. These are shown in the screenshot below.

image

Figure 9.97 The ‘Start pipeline execution’ window with the default options.

9.8.1. General

The first option is the ‘Run modus’. The run modus determines which components will be executed during the pipeline run and which optimizations will be used.

  • Full run: all components will be executed. This option is useful for small pipelines or during pipeline development, but it becomes impractical for larger pipelines. If you run such pipelines regularly, consider selecting another run modus.
  • Run outdated: only outdated components will be executed. These are components that are new or have changed since the last run. If the data source used in a Define Datasource component was updated, this counts as a change in the component. All successors of out-of-date components are also executed, as illustrated in the sketch after this list. The Publish in DISQOVER component is always executed in this run modus.
  • Differential: only outdated components will be executed, and the Publish in DISQOVER component only processes instances that have changed compared to the last run. This is a further optimization of “Run outdated”, and an optimal run modus for pipelines with occasional data updates and component changes.
  • Incremental: only changed resources in the source data are processed throughout the pipeline. This is an optimization of “Run outdated”, and an optimal run modus for stable pipelines with source data updates. For more information, see Incremental data ingestion.
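
To make the ‘Run outdated’ behaviour concrete, the following minimal Python sketch propagates the outdated status of changed components to all their successors in a pipeline graph. The pipeline layout, component names and change detection shown here are illustrative assumptions; the actual scheduling logic is internal to the Data Ingestion Engine.

    # Minimal sketch: propagate "outdated" status through a pipeline DAG.
    # The pipeline and the set of changed components are made up for
    # illustration only.
    pipeline = {
        "Define Datasource": ["Importer"],
        "Importer": ["Add URI"],
        "Add URI": ["Add Label"],
        "Add Label": ["Publish in DISQOVER"],
        "Publish in DISQOVER": [],
    }

    changed = {"Importer"}  # suppose change detection flagged the importer

    def outdated(pipeline, changed):
        """Return the changed components plus all their successors."""
        to_run = set(changed)
        stack = list(changed)
        while stack:
            for successor in pipeline[stack.pop()]:
                if successor not in to_run:
                    to_run.add(successor)
                    stack.append(successor)
        to_run.add("Publish in DISQOVER")  # always runs in this run modus
        return to_run

    print(sorted(outdated(pipeline, changed)))
    # ['Add Label', 'Add URI', 'Importer', 'Publish in DISQOVER']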

If the option Publish to DISQOVER is switched on, the data and configuration will be published to DISQOVER and become accessible to end users. This will also trigger the DISQOVER data updated event. In the Event actions tab of the Admin panel, you can see that by default this event triggers a re-evaluation of all watched dashboards (the default action ‘DISQOVER data updated -> evaluate watched dashboards’). A user with access to the Server tab in the Admin panel can also trigger this event manually, by clicking the ‘DISQOVER data updated’ button.

9.8.2. Debugging options

If one or more pipeline components produced an error during the run, the data is not published to RDS or DISQOVER by default. If desired, you can still publish the data (but be aware that it will be partial) by switching on the Proceed in case of pipeline execution errors option.

In the tutorial, you saw that you must give the resources a URI and label by using the Add URI and Add Label components respectively. If a resource does not obtain a label or URI, for example because the input predicate is empty for that resource, it will not be published. The same goes for resources that share one or more URIs: if a resource has a URI that is not unique, it is not published. When debugging these cases, however, it can be useful to still publish these instances to DISQOVER and search for them. If you switch on the Publish Malformed Instances option, instances without a label and/or with duplicate URI(s) will be published [1]. These instances can then be found in DISQOVER by searching on the special labels ‘[MISSING LABEL]’, ‘[DUPLICATE URI]’ or ‘[DUPLICATE PREFERRED URI]’.
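
The publishing rules above can be summarized in a short sketch. The instance records and helper below are hypothetical and only illustrate which instances are withheld or relabelled; DISQOVER performs these checks internally.

    # Hypothetical instance records; only the rules are of interest here.
    from collections import Counter

    instances = [
        {"uri": "ex:1", "label": "Aspirin"},
        {"uri": "ex:2", "label": None},      # missing label
        {"uri": "ex:3", "label": "Ibuprofen"},
        {"uri": "ex:3", "label": "Brufen"},  # duplicate URI
        {"uri": None, "label": "Orphan"},    # no URI: never published [1]
    ]

    def publishable(instances, publish_malformed=False):
        uri_counts = Counter(i["uri"] for i in instances if i["uri"])
        for inst in instances:
            if inst["uri"] is None:
                continue  # instances without a URI are never published
            malformed = inst["label"] is None or uri_counts[inst["uri"]] > 1
            if not malformed:
                yield inst
            elif publish_malformed:
                # Published under a special label so it can be searched for.
                label = ("[DUPLICATE URI]" if uri_counts[inst["uri"]] > 1
                         else "[MISSING LABEL]")
                yield {**inst, "label": label}

    print([i["label"] for i in publishable(instances)])
    # ['Aspirin']
    print([i["label"] for i in publishable(instances, publish_malformed=True)])
    # ['Aspirin', '[MISSING LABEL]', '[DUPLICATE URI]', '[DUPLICATE URI]']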

Another option that is useful for testing a pipeline under construction is the Subset parameter. A value from 0 (1 record per importer) to 1 (all records) can be used, as sketched below. Using a subset can significantly shrink the pipeline execution time while still running all components. For more information, see Subset parameter.
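
As an illustration, the sketch below applies a subset fraction to a list of records. How DISQOVER actually samples records is not documented here; the function only illustrates the contract that 0 keeps one record per importer and 1 keeps all of them.

    def take_subset(records, subset):
        """Keep a fraction of the records, but always at least one."""
        if not 0 <= subset <= 1:
            raise ValueError("subset must be between 0 and 1")
        keep = max(1, round(subset * len(records)))
        return records[:keep]

    records = list(range(1000))
    print(len(take_subset(records, 0)))    # 1: one record per importer
    print(len(take_subset(records, 0.1)))  # 100
    print(len(take_subset(records, 1)))    # 1000: all records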

9.8.3. Solr indexing

If the option Publish to DISQOVER is turned on, the option Solr Endpoint URL allows the user to select a Solr instance or SolrCloud cluster to use for indexing the DISQOVER data. Format: http://<hostname>:<port>/solr/<collection_name>.

Note that setting a collection name is only supported for SolrCloud. When indexing to a stand-alone Solr instance, the collection name must be set to core_data.

If the option Solr Endpoint URL is not set, the DISQOVER indexing component uses the URL set in Admin -> System Settings -> Solr -> Solr Endpoint URL. If the system setting is also empty, the DISQOVER indexing component falls back to indexing to the local Solr instance.
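
The resolution order can be summarized as a simple fallback chain. The example URLs and the local default below are illustrative assumptions.

    def resolve_solr_endpoint(run_option, system_setting,
                              local="http://localhost:8983/solr/core_data"):
        """Run option first, then the system setting, then local Solr."""
        return run_option or system_setting or local

    # SolrCloud cluster with an explicit collection name:
    print(resolve_solr_endpoint(
        "http://solrcloud.example.com:8983/solr/my_collection", None))
    # Stand-alone Solr: the collection name must be core_data.
    print(resolve_solr_endpoint(
        None, "http://solr.example.com:8983/solr/core_data"))
    # Neither set: fall back to the local Solr instance.
    print(resolve_solr_endpoint(None, None))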

9.8.4. Other options

If the option Publish for RDS is turned on, the selected canonical types will be made available for Remote Data Subscription. You first need to specify which canonical types should be published to RDS within the Configure Canonical Type components.

If publishing to RDS is enabled, the option Clear before publishing becomes visible. When switched on, all data published in a previous pipeline run is removed first. Use this option when you have made changes to the publishing pipeline. Note that if this option is turned on, the first pipeline run of a Subscriber that uses your published data will not be incremental.

To export the pipeline output to RDF files, you can use the option Export data to file (Turtle format). If switched on, an Export Path option will appear where you must specify a directory for the exported files.
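
For reference, an exported Turtle file could look roughly like the snippet below. The prefixes, predicates and values are illustrative; the actual output depends on your pipeline configuration.

    # Illustrative Turtle only; not an actual DISQOVER export.
    @prefix ex:   <http://example.com/resource/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    ex:1 rdfs:label "Aspirin" ;
         ex:synonym "Acetylsalicylic acid" .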

9.8.5. Incremental data ingestion

As explained in section The data ingestion pipeline, a Data Ingestion Engine pipeline consists of individual components that, apart from configuration components, import, transform or link data. The structure of the pipeline defines the flow of the data from the import components to the Publish in DISQOVER component.

When options in any of the components have changed, the Data Ingestion Engine can detect this on the next pipeline run, and will only trigger the affected components [2]: the modified components and all their successors. As discussed in Define Datasource, a Define Datasource component will also be re-executed if the associated source data has been updated. If, for example, new files are added for a specific data source (and the corresponding Info File is updated), all importers and other successor components are triggered for execution upon the next run. This ensures that the new resources are indexed to DISQOVER or published as part of an RDS Data Set. Differential indexing ensures that only the new or changed resources are published, saving a substantial amount of time in the Publish in DISQOVER component.
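
The idea behind differential publishing can be sketched as follows: keep a fingerprint per instance from the previous run and only publish instances whose fingerprint changed or that are new. This is only an illustration of the concept; DISQOVER’s actual change tracking is internal to the engine.

    import hashlib
    import json

    def fingerprint(instance):
        """Stable content hash of an instance's properties."""
        payload = json.dumps(instance, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    previous_run = {
        "ex:1": fingerprint({"label": "Aspirin"}),
        "ex:2": fingerprint({"label": "Ibuprofen"}),
    }

    current_run = {
        "ex:1": {"label": "Aspirin"},          # unchanged: skipped
        "ex:2": {"label": "Ibuprofen (INN)"},  # changed: republished
        "ex:3": {"label": "Paracetamol"},      # new: published
    }

    to_publish = [uri for uri, inst in current_run.items()
                  if previous_run.get(uri) != fingerprint(inst)]
    print(to_publish)  # ['ex:2', 'ex:3']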

If no options have been changed in the pipeline, only new or updated data remains to be processed by the Data Ingestion Engine. This data has to be ingested and transformed in the same manner as the data processed in the previous pipeline run. If the corresponding run modus is selected, the Data Ingestion Engine can perform the pipeline run in so-called Incremental mode. In this mode, each component performs only those operations that are strictly necessary to integrate the updated or new data with the data that was imported during the previous run. Incremental Data Ingestion is thus an optimization that can shorten the overall pipeline execution time considerably, depending on the number of data sources that were updated and the amount of linkage between the data.

To enable Incremental mode, select ‘Incremental’ as the ‘Run modus’ in the ‘Start pipeline execution’ window. This will attempt to trigger the incremental execution of the out-of-date Define Datasource components and their successors, and the differential publishing to DISQOVER and RDS. As mentioned above, a prerequisite for this to work is that none of the options in the pipeline have been changed. If Incremental mode is not possible, the Data Ingestion Engine will perform a partial run, meaning that the out-of-date Define Datasource components and their successors will be run in normal mode.

Conditions

Besides the requirements that the ‘Incremental’ run modus is used (1) and that the pipeline did not change (2), there are other conditions for incremental data ingestion. We list these conditions below.

The most fundamental condition is that the pipeline has been run before (3). All components should have run, possibly across subsequent (partial) runs, and the components cannot have produced errors (4). If publishing to DISQOVER is enabled in the ‘Start pipeline execution’ window, the previous results should also have been published in DISQOVER. For example, if this option was not enabled during the previous runs, or if another pipeline has published in the meantime, incremental pipeline execution will not be possible (5). Publishing to DISQOVER does not necessarily have to be enabled (one can also just publish to RDS), but the pipeline must contain the publishing component (6).

The ‘Start pipeline execution’ window also allows ‘Execute only selected component(s)’ if one or several components in the pipeline are manually selected. This is incompatible with incremental data ingestion (7). The run window also has the option ‘Publish Malformed Instances’, meaning that instances without a label or instances with non-unique URIs will be published to DISQOVER (by default, they are not). If this option was enabled in the previous run, it also has to be enabled for the current run, and vice versa (8). Moreover, there should not be any instances with duplicate (preferred) URIs in the previous run (9).

A last condition for incremental execution is that the Subset parameter was 1 in the previous run, and that it is also 1 in the present run (10). Whether or not the pipeline has run incrementally is visible in the Pipeline execution log and the Data Ingestion Report, which also list the reasons in case it did not.

For clarity, we briefly reiterate all necessary conditions for Incremental mode below.

  1. ‘Incremental’ run modus is used
  2. Pipeline was not changed
  3. Pipeline has run before
  4. Pipeline has run successfully
  5. Differential indexing possible
  6. Publishing component present
  7. No selected components
  8. Same value for ‘Publish Malformed Instances’
  9. No instances with duplicate (preferred) URIs
  10. Subset parameter is 1
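
Folded into a single check, the checklist could look like the sketch below. All field names are hypothetical; DISQOVER evaluates these conditions internally and reports the failing ones in the Data Ingestion Report.

    def can_run_incrementally(state):
        """Return (eligible, failed_conditions) for a run `state` dict."""
        conditions = {
            "1. run modus is Incremental": state["run_modus"] == "Incremental",
            "2. pipeline unchanged": not state["pipeline_changed"],
            "3. pipeline has run before": state["has_previous_run"],
            "4. previous run succeeded": state["previous_run_ok"],
            "5. differential indexing possible": state["differential_possible"],
            "6. publishing component present": state["has_publish_component"],
            "7. no selected components": not state["selected_components"],
            "8. same 'Publish Malformed Instances' value":
                state["publish_malformed"] == state["prev_publish_malformed"],
            "9. no duplicate (preferred) URIs": not state["had_duplicate_uris"],
            "10. subset is 1 in both runs":
                state["subset"] == 1 and state["prev_subset"] == 1,
        }
        failed = [name for name, ok in conditions.items() if not ok]
        return not failed, failed

    example = dict(run_modus="Incremental", pipeline_changed=False,
                   has_previous_run=True, previous_run_ok=True,
                   differential_possible=True, has_publish_component=True,
                   selected_components=[], publish_malformed=False,
                   prev_publish_malformed=False, had_duplicate_uris=False,
                   subset=1, prev_subset=1)
    print(can_run_incrementally(example))  # (True, [])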

Limitations

Incremental Data Ingestion can be very beneficial for the performance of relatively stable pipelines, i.e. pipelines in which the source data is updated much more frequently than the component options are changed. For example, Incremental Data Ingestion is a great shortcut for publishing updated data to DISQOVER or RDS on a regular basis. Needless to say, every shortcut comes with certain limitations.

One limitation is that preferred labels and preferred URIs are immutable during incremental runs. If either of those changes for an instance during an incremental pipeline run, a warning will be issued by the Publish in DISQOVER component, and not all data in DISQOVER will necessarily reflect the change. This means, for example, that the label of a link from one instance to another will not always be up-to-date with a change in the label of the latter instance. In order to see those changes in DISQOVER, a Full run is required.

Similarly, during incremental runs the structure of hierarchical facets is immutable: although the hierarchy can be extended (new child nodes can be added), existing nodes cannot get extra or new parent nodes. When this happens, a warning will be issued by the Extract Hierarchical Class component that created the tree Class. Labels in the tree may also not be updated in DISQOVER. As above, these inconsistencies are always resolved by a full pipeline run.

9.8.6. Scheduled pipeline runs

Apart from running a pipeline manually, you can also schedule a pipeline to run at a certain time. From the pipeline overview, click the ‘Schedules’ tab at the top and then ‘Add scheduler’. You can then choose a name, description, pipeline, time and options for the scheduled pipeline run. After saving the schedule, it becomes active and the pipeline will run at the chosen time.

To check the status of a pipeline run, you can click the clock icon in the ‘Schedules’ tab. To temporarily disable (pause) a schedule, you can use the toggle.

image

Figure 9.98 The overview of pipeline schedules.

image

Figure 9.99 Configuring a pipeline schedule.

[1] Instances without a URI will never be published.
[2] Except when ‘Full run’ is selected as the ‘Run modus’ in the ‘Start pipeline execution’ window.