10.3. Publisher

Remote Data Sets can be retrieved from public or private RDS publishers such as ontoforce.com. You can, however, also publish your own Remote Data Sets, i.e. make them available for subscription and retrieval.

If you import a Remote Data Set, the data is imported into a class. However, if you publish a Remote Data Set, its data is based on a Canonical Type, not on a class. This is explained in more detail below.

10.3.1. Configuring a Remote Data Set for publishing

The Publish in DISQOVER component can publish Remote Data Sets. Each Remote Data Set is based on a Canonical Type and is configured in a Configure Canonical Type component.

To publish a Canonical Type as a Remote Data Set, turn on the option Publish for Remote Data Subscription in the Configure Canonical Type component, and execute the pipeline with the option Publish for RDS turned on.

The name of the Remote Data Set is the last part of the URI of the Canonical Type. For example, if the URI of a Canonical Type is http://my_company.com/ct/Person, then the name of the Data Set will be Person. You can override this behavior by specifying an explicit name in the option Data Set Name.
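
The naming rule can be illustrated with a minimal sketch. It is not DISQOVER code: the function name data_set_name and its explicit_name parameter are hypothetical, and the sketch assumes that "last part of the URI" means the text after the final "/".

```python
def data_set_name(canonical_type_uri, explicit_name=None):
    """Return the Data Set name for a Canonical Type URI.

    Assumption: the "last part" of the URI is the text after the final "/".
    explicit_name stands in for the Data Set Name option (hypothetical parameter).
    """
    if explicit_name:
        # An explicit name overrides the name derived from the URI.
        return explicit_name
    return canonical_type_uri.rstrip("/").rsplit("/", 1)[-1]


print(data_set_name("http://my_company.com/ct/Person"))  # Person
```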

Instances

All instances of the Canonical Type will be published in the Remote Data Set. When importing the Remote Data Set each instance becomes a resource.

Recall that a Canonical Type can be populated with instances from multiple classes, as specified in the options Resource Types and Classes. In the most common case all instances stem from a single class.

Important remark: In a pipeline it is possible that an instance is included in two (or more) Canonical Types (for example an actor might be present in Canonical Types Person and Actor). If both Canonical Types are published as Remote Data Sets the instance data will be duplicated. That is not a problem as such, but if both Remote Data Sets are imported in an importing pipeline, then that pipeline will end up with resources with the same (preferred) URI and (preferred) label in two classes. This is likely to cause problems in the importer’s Publish in DISQOVER component, unless appropriate countermeasures are taken, such as merging both classes or removing the duplicates.

Predicates based on Properties

By default each property becomes a predicate in the Remote Data Set, unless you turn off option Publish for Remote Data Subscription for that property. The name of the predicate is the last part of the property URI. For example, if you have a property with URI http://my_company.com/prop/Person/first_name then the predicate name will be first_name.

For each instance the value of an RDS predicate is the concatenation of the values of all predicates of the corresponding property. For example, suppose property http://…/first_name has predicates name.lit and nickname.lit, and for some instance the value of name.lit is [‘Thomas’, ‘Tom’] and the value of nickname.lit is [‘Tommy’], then the value of the published RDS predicate first_name.lit will be [‘Thomas’, ‘Tom’, ‘Tommy’].
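
The concatenation in this example can be sketched as follows. The function rds_predicate_values and the dictionary layout of the instance are illustrative assumptions, not the publisher's actual data model. The sketch keeps values in predicate order and does not remove duplicates, which the text above does not specify.

```python
def rds_predicate_values(property_predicates, instance):
    """Concatenate the values of all predicates that make up one property.

    property_predicates: predicate names of the property, in order.
    instance: dict mapping predicate name -> list of values (assumed layout).
    """
    values = []
    for predicate in property_predicates:
        # Predicates without a value for this instance contribute nothing.
        values.extend(instance.get(predicate, []))
    return values


instance = {"name.lit": ["Thomas", "Tom"], "nickname.lit": ["Tommy"]}
first_name = rds_predicate_values(["name.lit", "nickname.lit"], instance)
print(first_name)  # ['Thomas', 'Tom', 'Tommy']
```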

The type of the RDS predicate depends on the type of the property predicates. In the example the resulting predicate is a literal (.lit) because the composing predicates are literals too.

If the property predicates are links (.fwd or .rev), then the resulting predicate will be a link predicate too, but note that the type will always be .fwd. For example, if a property http://…/CausedBy has predicate caused_by.fwd, then the resulting RDS predicate will be CausedBy.fwd. If a property http://…/Causes has predicate caused_by.rev, then the resulting RDS predicate will be Causes.fwd. If a property consists of link predicates of mixed types (some .fwd, some .rev), then a single .fwd RDS predicate is produced.

However, if a property’s predicates are a mixture of literal and link types, then two RDS predicates will be generated, one a .lit and one a .fwd.
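
The type rules described above can be summarized in a short sketch. The function rds_predicate_names is hypothetical, the example.org URIs are placeholders invented for illustration, and the sketch assumes that the predicate type is the suffix after the last "." and that the property name is the text after the last "/" of its URI.

```python
def rds_predicate_names(property_uri, property_predicates):
    """Return the RDS predicate name(s) generated for one property.

    Assumptions: the type is the suffix after the last "." of a predicate
    name, and the RDS name is the text after the last "/" of the property URI.
    """
    name = property_uri.rstrip("/").rsplit("/", 1)[-1]
    suffixes = {predicate.rsplit(".", 1)[-1] for predicate in property_predicates}
    result = []
    if "lit" in suffixes:
        # Literal predicates produce a literal RDS predicate.
        result.append(name + ".lit")
    if suffixes & {"fwd", "rev"}:
        # Link predicates always produce a .fwd RDS predicate,
        # whether they are .fwd, .rev, or a mix of both.
        result.append(name + ".fwd")
    return result


# Placeholder URIs, not taken from a real pipeline.
print(rds_predicate_names("http://example.org/prop/CausedBy", ["caused_by.fwd"]))
print(rds_predicate_names("http://example.org/prop/Causes", ["caused_by.rev"]))
```

A property that mixes literal and link predicates yields two names in this sketch, one ending in .lit and one ending in .fwd.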

It is possible that different properties in a Canonical Type refer to the same predicate. In such a case the values of that predicate will be duplicated in each RDS predicate.

Other predicates

The following predicates are published if present:

  • URI (disq:uri.uri) and its hashed counterpart (disq:uri.huri)
  • Preferred URI (disq:uri.puri) and its hashed counterpart (disq:uri.phuri)
  • Label (disq:label.lit)
  • Preferred Label (disq:pref_label.plabel)
  • Resource Type (rdf:type.lit)
  • Instance Data Source (disq:data_source.lit)

Link predicates used in typed links will also be published to both the source and destination Data Sets, if the option Publish for Remote Data Subscription is enabled for the Configure Typed Link component. The predicate name will be derived from the URI of the relation type, but can be changed by using the Custom direct predicate used in published data set and Custom inverse predicate used in published data set options.

Predicates which are used as parent predicate (option facet_parent_predicate) in a Hierarchical Facet will also be published. If such a predicate is mentioned in a Canonical Type’s property, then the name of the published predicate will be derived from the URI of that property. Otherwise, its own name will be used.

All Link predicates in the Canonical Type’s classes which don’t appear in any property or typed link will be collected and published as a single predicate called anonymous_links.fwd.

Note: facets which are not configured as properties are not published in the Data Set.

10.3.2. Setting Data Set Visibility

Each Data Set can be made available to one or more user groups, by setting the ‘Remote Data Groups’ on the Canonical Type configuration component. If one or more user groups are entered, only Remote Data Subscriber user accounts belonging to those groups will be able to view the Data Set and launch data retrievals for it.

By default, the ‘Remote Data Groups’ option is empty and a published Data Set is available to all Remote Data Subscribers.
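
The visibility rule can be expressed as a small check. This is only a sketch of the rule as described here: the function can_view and its arguments are hypothetical and do not correspond to an actual DISQOVER API.

```python
def can_view(remote_data_groups, subscriber_groups):
    """Decide whether a subscriber account may view a published Data Set.

    remote_data_groups: groups entered on the Canonical Type configuration.
    subscriber_groups: groups the Remote Data Subscriber account belongs to.
    """
    if not remote_data_groups:
        # Empty setting: the Data Set is available to all subscribers.
        return True
    # Otherwise the subscriber must belong to at least one listed group.
    return bool(set(remote_data_groups) & set(subscriber_groups))


print(can_view([], ["partners"]))            # True
print(can_view(["internal"], ["partners"]))  # False
```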

It is not possible to change the visibility of a Data Set after it is published.

10.3.3. Publishing Data Sets in differential mode

The Remote Data Subscription Publisher works differentially by design: records that have already been written to a published Data Set are not rewritten or deleted when the corresponding instance is updated or removed. Only their state changes.

If an instance is no longer present after rerunning a pipeline, the state of the existing Data Set record is set to “REMOVED”. When the instance is updated, the state of the original record is set to “REMOVED” and a new record (in a new file) is written to the Data Set to represent the new record data.
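
The state handling can be illustrated with a simplified in-memory model. The record layout (uri, data, state), the function publish_differential, and the "NEW" state label are assumptions made for this sketch. Only the "REMOVED" state is taken from the description above, and the real publisher writes records to files rather than to a list.

```python
def publish_differential(records, current_instances):
    """Apply one pipeline run to the existing records of a Data Set.

    records: list of dicts with keys "uri", "data" and "state" (assumed layout).
    current_instances: dict mapping instance URI -> data after the rerun.
    """
    active = {r["uri"]: r for r in records if r["state"] != "REMOVED"}

    # Existing records are never rewritten; only their state changes.
    for uri, record in active.items():
        if uri not in current_instances or current_instances[uri] != record["data"]:
            record["state"] = "REMOVED"

    # Added and updated instances are appended as new records.
    for uri, data in current_instances.items():
        old = active.get(uri)
        if old is None or old["state"] == "REMOVED":
            records.append({"uri": uri, "data": data, "state": "NEW"})
    return records


records = [
    {"uri": "a", "data": 1, "state": "NEW"},
    {"uri": "b", "data": 2, "state": "NEW"},
    {"uri": "c", "data": 3, "state": "NEW"},
]
# "a" is unchanged, "b" is updated, "c" is no longer present.
for record in publish_differential(records, {"a": 1, "b": 20}):
    print(record)
```

In this model the unchanged record for "a" is left alone, the old records for "b" and "c" are marked "REMOVED", and one new record is appended for the updated "b".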

This approach minimizes the file transfers required between Remote Data Subscription Publisher and Subscriber when running incremental schedules: only the state and new data are downloaded, and existing data can be skipped.

To force the publisher to use ‘non-differential’ mode, use the Clear before publishing flag in the pipeline run parameters to clear the data of existing Data Sets in the pipeline. This will result in a new set of files, including only “new” records. The Subscriber will be forced to download the entire Data Set again (even if running an incremental schedule).

10.3.4. Publishing Data Sets in incremental mode

In incremental mode, only added, updated and removed instances are sent through the Remote Data Subscription Publisher. As the publisher works in differential mode by design, incremental mode follows the same design principles: only new data and additional state for existing data are written to the Data Sets.

10.3.5. Publishing Data Sets from multiple pipelines

When publishing Data Sets from multiple pipelines, each Data Set needs a unique name across those pipelines. A unique name violation will result in the Remote Data Subscription Publisher failing with the error message: Data set <name> is already created by another pipeline. The name of the Data Set can be specified in the Canonical Type configuration component.
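
Name clashes can be checked before running the pipelines. The following is a hedged sketch of such a check, not part of DISQOVER: the function find_name_conflicts and its input, a mapping from pipeline name to the Data Set names it publishes, are hypothetical and would have to be collected by hand.

```python
def find_name_conflicts(pipelines):
    """Return Data Set names that are published by more than one pipeline.

    pipelines: dict mapping pipeline name -> list of Data Set names (assumed input).
    Returns a dict mapping each conflicting name -> list of pipelines using it.
    """
    owners = {}
    conflicts = {}
    for pipeline, names in pipelines.items():
        for name in names:
            # The first pipeline that uses a name is recorded as its owner.
            owner = owners.setdefault(name, pipeline)
            if owner != pipeline:
                conflicts.setdefault(name, [owner]).append(pipeline)
    return conflicts


print(find_name_conflicts({"pipeline_1": ["Person", "Drug"], "pipeline_2": ["Person"]}))
# {'Person': ['pipeline_1', 'pipeline_2']}
```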

If the name of the Data Set cannot be changed and the pipeline that published the Data Set has been removed, the Data Set must be cleaned up manually by removing the directory /disqover/data/sync_data/<name of the Data Set> on the DISQOVER server.