Dataset API Discovery 0.2

Draft Community Group Report

Latest editor's draft:
https://openactive.io/dataset-api-discovery/EditorsDraft/
Editors:
Leigh Dodds (Open Data Institute)
Nick Evans (Open Data Institute)
Timothy Hill (Open Data Institute)
Participate:
GitHub openactive/dataset-api-discovery
File a bug
Commit history
Pull requests
Version:
0.1

Abstract

This document specifies a dataset site and embedded JSON-LD document that together describe an open data dataset and define related APIs that are available to manipulate it.

Status of This Document

This specification was published by the OpenActive Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.

Contributions to this document are not only welcomed but are actively solicited and should be made via GitHub Issues and pull requests. The source code is available on GitHub.

Note

This document represents an early Editors Draft of the planned API design. It is likely to change between now and its final version. These early drafts are intended to help developers provide feedback by developing proof-of-concept implementations. We encourage developers to explore this API and contribute to the development of the specification.

If you wish to make comments regarding this document, please send them to [email protected] (subscribe, archives).

1. Introduction

This section is non-normative.

The document is an output of the OpenActive Community Group. As part of the OpenActive initiative, the community group is developing standards that will promote the publication and use of open opportunity data in helping people to become more physically active.

This specification aims to build on existing work by the WebAPI Discovery Community Group, the W3C's DCAT standard, and the Schema.org discussion, providing a profile and guidance specifically for both open data publishers and implementers of Web APIs that manipulate openly available datasets. It also aims to provide conformance rules to ensure that implementers include the details necessary to allow a Data Consumer to reliably integrate with standards-compliant services without human intervention.

The specification defines both the requirements of a Dataset Site provided by a Data Publisher (server) for use by a Data Consumer (client), and of any Data Catalog designed to enable discovery of such sites. In addition, it includes high level requirements for human-readable content, and detailed requirements and conformance rules for machine-readable content.

1.1 Scope and requirements

Dataset Sites that conform to this specification will be:

Dataset Sites will also provide the following information about the implementation of the APIs they describe:

Data Catalogs published in accordance with this standard will be:

Note that although this specification is an output of the OpenActive Community Group, it is designed to apply to any open dataset where an API is available to manipulate it.

1.1.1 Functionality that is out of scope

By design this specification will not define some types of functionality.

These have been declared as permanently out of scope because they are adequately covered by existing specifications:

  • API documentation - each system should provide its own documentation, the specification only requires that such documentation exist.
  • API specification and API definition - many existing formats are available for API specification (e.g. Swagger, RAML, etc.).

1.2 Audience

The document is primarily intended for the following audiences:

2. Conformance

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

The key words MUST, OPTIONAL, RECOMMENDED, and REQUIRED in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This specification makes use of the compact IRI syntax; please refer to the Compact IRIs section of [JSON-LD].

3. Typographical Conventions

The following typographic conventions are used in this specification:

markup
Markup (elements, attributes, properties), machine processable values (string, characters, media types), property name, or a file name is in a monospace font.
Definition
A definition of a term, to be used elsewhere in this or other specifications, is underlined and in black.
hyperlink
A hyperlink is underlined and in blue.
[reference]
A document reference (normative or informative) is enclosed in square brackets and links to the references section.
Note

Notes are in light green boxes with a green left border and with a "Note" header in green. Notes are normative or informative depending on whether they are in a normative or informative section, respectively.

Examples are in light khaki boxes, with khaki left border, and with a
numbered "Example" header in khaki. Examples are always informative.
The content of the example is in monospace font and may be syntax colored.

4. Key Actors

Data Publisher
The application that publishes the Dataset Site and also makes open data endpoints and API endpoints available for use by the Data Consumer.
Data Consumer
The application or human that is reading the Dataset Site in order to make use of the available open data endpoints and API endpoints.

5. Definitions

5.1 Dataset Sites

A Dataset Site is a human and machine readable web page ("Dataset Page") that describes a dataset and the APIs available to interact with it, with associated functionality that allows for feedback to be provided about the dataset.

5.2 Data Catalogs

A Data Catalog is a JSON structure that supports and enables the discoverability of Dataset Sites. It does so by providing metadata and links, either to Dataset Sites directly or to other Data Catalogs.

5.3 Purpose

5.3.1 Dataset Sites

The purpose of a Dataset Site is to provide:

  • A web page that can be referenced when discussing the dataset.
  • A human and machine readable licence associated with the data (the Dataset Page contains invisible metadata which allows its details to be read automatically).
  • A human and machine readable rights statement to specify how dataset users (innovators who want to build on top of/use data) should attribute your data.
  • An accessible "single point of truth" that explains where the data can be found.
  • Details ("documentation") and historical record ("changelog") relating to the format of the data, including the specifications it follows, and the data fields it contains.
  • A place where the community can contribute with comments, and raise issues.
  • A mechanism by which Data Consumers can subscribe to get updates about changes to the data format, specifications and fields.
  • A human and machine readable description of any APIs that can be used to manipulate the data, and the process to gain access to such APIs.

5.3.2 Data Catalogs

The purpose of a Data Catalog is to provide:

  • machine-readable metadata about dataset sites
  • URL(s) pointing to dataset sites
  • machine-readable metadata about, and URLs pointing to, other Data Catalogs where appropriate
  • a start-point for spidering of data collections

5.4 Dataset Sites

5.4.1 HTML Content

5.4.1.1 Human-readable content

With the exception of licensing information, there are no strong requirements for the human-readable content of dataset pages, and implementers may provide whatever information they see fit here. For the convenience of end-users, however, it is normally expected that a Dataset Page will provide at least the following information and markup:

  • the name of the organisation publishing the data
  • the standards to which the published data conforms (e.g., the Opportunity standard)
  • the version of each of these standards (e.g., '2.0')
  • where a link to a data feed is provided, its text should refer to the entity types the feed contains (e.g. SessionSeries, Slots)
  • an appropriately-labelled link to documentation relevant to the data feed(s)
  • an appropriately-labelled link to a discussion channel for the data feed(s)
  • licensing information

Note that, of the list above, only licensing information is REQUIRED to be available in human-readable form, and this license MUST be a Creative Commons Attribution 4.0 International License (often abbreviated as 'cc-by').

Note further that, in the event that you are republishing OpenActive data from another source, the original publisher must be credited as per the terms of this license.

5.4.1.2 HTML meta tags

In addition to the directly-readable content of the HTML body, information contained in the HTML head may sometimes be used by search engines and social-media platforms to aid findability and provide snippets.

It is accordingly RECOMMENDED that the following <meta> tags be supplied.

Property | Value
title | The name of the publishing organisation, followed by the string ' Open Data'.
identifier | The URL of the dataset site.
keywords | Short, descriptive words or phrases to aid discoverability.
description | A human-readable description of the dataset.
language | The language of the dataset site.

5.4.1.3 OpenGraph <meta> tags

OpenGraph is a protocol created by Facebook that allows useful snippets to be extracted on social media platforms, including LinkedIn and Twitter.

The following OpenGraph properties are RECOMMENDED for use in <meta> elements in the HTML head of Dataset Pages.

Property | Value
og:title | The name of the publishing organisation, followed by the string ' Open Data'.
og:description | A human-readable description of the dataset.
og:locale | For publishers within the UK, this should be 'en_GB'.
og:url | The URL of the dataset site.
og:image | The logo of the publishing organisation.
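The two tables of RECOMMENDED tags can be turned into markup mechanically. The following is a minimal sketch only: the function name and its parameters are hypothetical, joining keywords with commas is an assumption, and the plain tags are assumed to use the `name` attribute while the OpenGraph tags use the `property` attribute, as the Open Graph protocol does.

```python
from html import escape


def render_meta_tags(org_name, site_url, description, keywords, language, locale, logo_url):
    """Render the RECOMMENDED <meta> elements as a list of HTML strings."""
    # Both title properties are the organisation name followed by ' Open Data'.
    title = f"{org_name} Open Data"
    plain = {
        "title": title,
        "identifier": site_url,
        "keywords": ", ".join(keywords),  # assumption: comma-separated keywords
        "description": description,
        "language": language,
    }
    open_graph = {
        "og:title": title,
        "og:description": description,
        "og:locale": locale,
        "og:url": site_url,
        "og:image": logo_url,
    }
    # escape() also escapes quotes, so values are safe inside attributes.
    tags = [f'<meta name="{escape(k)}" content="{escape(v)}">' for k, v in plain.items()]
    tags += [f'<meta property="{escape(k)}" content="{escape(v)}">' for k, v in open_graph.items()]
    return tags
```

Calling `render_meta_tags("Example.com", "https://data.example.com/", ...)` yields ten strings that can be written into the HTML head of the Dataset Page.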

5.4.2 Embedded JSON

Dataset Sites must be machine-readable via embedded JSON-LD.

Property | Status | Type | Notes
@context | REQUIRED | Array of URL values | Note that, in conformity with RFC3986, trailing slashes MUST be supplied.
@type | REQUIRED | Text | Dataset
@id | REQUIRED | URL | A URL uniquely identifying the dataset site resource. May be the URL of the Dataset Site itself.
url | REQUIRED | URL | Typically the URL of the dataset site itself.
name | REQUIRED | Text | The name of the collection of datasets referenced by the site. Often this will simply be the name of the publishing organisation.
description | RECOMMENDED | Text | A human-readable description of the datasets referenced by the site.
keywords | OPTIONAL | Array of Text | Short descriptive metadata tags for the dataset collection.
license | REQUIRED | URL | A URL reference to the license under which the dataset site is published. For OpenActive dataset sites this should be https://creativecommons.org/licenses/by/4.0/.
distribution | REQUIRED | Array of dcat:Distribution object | See below, Describing Individual Feeds.
discussionUrl | RECOMMENDED | URL | A link to a resource for discussing and raising issues with the published datasets. Typically, although not necessarily, this will be a link to a GitHub repository.
documentation | RECOMMENDED | Array of URL | Link(s) to further resources concerning the dataset site and its referenced datasets - e.g., GitHub READMEs or status summaries.
inLanguage | RECOMMENDED | String | The language of the dataset. Should be expressed as an ISO 639-2 language code.
publisher | REQUIRED | schema:Organization | The organization responsible for publishing the collection of datasets linked to by the dataset site. For further information, see below, Describing Organizations.
datePublished | REQUIRED | schema:Date | The date the dataset site was published.
schemaVersion | REQUIRED | URL | The version of the dataset site specification to which the site conforms.
softwareVersion | RECOMMENDED | URL | A link to a repository holding the code by which the site was generated.

The MIME-type of this JSON object MUST be application/ld+json.
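A Data Consumer can locate the embedded object by looking for that MIME-type. The sketch below assumes the JSON-LD is carried in a `<script type="application/ld+json">` element, which is the usual embedding but is not stated in this section; the class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # json.loads raises ValueError if the embedded text is not valid JSON.
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_dataset(html):
    """Return the first embedded JSON-LD object whose @type is Dataset, or None."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for doc in parser.documents:
        if isinstance(doc, dict) and doc.get("@type") == "Dataset":
            return doc
    return None
```

Only the standard library is used here; a production consumer would probably also want to handle pages that embed several JSON-LD objects or an `@graph`.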

Note: Trailing slashes and @context

It is common practice to reference https://schema.org without a trailing / within @context. However, to be consistent with the OpenActive Modelling Opportunity Data specification, which uses the full URI of https://openactive.io/ (including a path as per RFC 3986), the specification requires the schema.org context to be referenced with a trailing slash, i.e. https://schema.org/.
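A check for this rule might look like the sketch below. It assumes that the requirement is aimed at context URLs consisting of a bare host with no path (such as https://schema.org), since a URL like https://openactive.io/ns-beta already carries a path; the function name is hypothetical.

```python
from urllib.parse import urlsplit


def contexts_missing_trailing_slash(context):
    """Return the @context URLs that have an empty path, i.e. no trailing slash."""
    # @context may be a single string or an array of strings.
    urls = [context] if isinstance(context, str) else list(context)
    return [u for u in urls if isinstance(u, str) and urlsplit(u).path == ""]
```

An empty list means every context URL includes a path component.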

5.4.2.1 Describing Individual Feeds (dcat:Distribution objects)

Property | Status | Type | Notes
@type | REQUIRED | Text | DataDownload
name | REQUIRED | Text | A human-readable name for the dataset.
additionalType | RECOMMENDED | URL | A link to a definition of the type of the feed - e.g., of ScheduledSessions or CourseInstances.
encodingFormat | RECOMMENDED | Text or URL | The MIME-type of the data accessible via the contentUrl.
contentUrl | REQUIRED | URL | The URL of the feed containing the dataset.
totalItems | RECOMMENDED | Integer | The total number of items that can be expected in the feed. Note that this number will often be approximate only, given the rapidity with which updates may be made to backend datastores.

5.4.2.2 Supporting Booking (schema:webAPI)

In addition to the above markup for discoverability, dataset sites that support Open Booking API functionality MUST indicate this with markup enabling discovery and use of the relevant API endpoints.

Property | Status | Type | Notes
@type | REQUIRED | Text | WebAPI
name | RECOMMENDED | Text | A human-readable name for the API.
description | OPTIONAL | Text | A human-readable description of the API.
documentation | RECOMMENDED | URL or schema:CreativeWork | Human-readable API documentation. See Describing API Endpoints, below.
termsOfService | REQUIRED | Text or URL | Human-readable terms of service documentation.
provider | REQUIRED | schema:Organization | The Organization providing the API endpoint.
endpointUrl | REQUIRED | URL | The root location or primary endpoint of the API.
conformsTo | RECOMMENDED | URL | The URL reference of an established standard to which the described API conforms.
version | RECOMMENDED | Text | The version of the API.
license | REQUIRED | URL | A URL reference to the license under which the dataset site is published. For OpenActive dataset sites this should be https://creativecommons.org/licenses/by/4.0/.
endpointDescription | RECOMMENDED | schema:CreativeWork | A machine-readable description of the API. See Describing API Endpoints, below.
bookingService | RECOMMENDED | schema:SoftwareApplication | The software system responsible for handling booking over the Open Booking API.

5.4.2.3 Describing API Endpoints (schema:CreativeWork)

Supporting documentation is crucial for the successful uptake and use of APIs. Ideally, both human-readable freetext and machine-readable structured data are made available.

The schema.org objects for human- and machine-readable documents are largely identical in terms of content and structure. However, the MIME-type associated with each will normally differ.

Property | Status | Type | Notes
@type | REQUIRED | Text | CreativeWork
url | REQUIRED | URL | A URL pointing to supporting documentation for the API.
encodingFormat | RECOMMENDED | Text | The MIME-type delivered by the url. For human-readable documentation (schema:documentation) this will normally be text/html; for machine-readable documentation (schema:endpointDescription), application/json or a more-specific subtype of this.
beta:ordersFeedURL | OPTIONAL | URL | In cases where the Open Booking Orders Feed is found at a URL outside the domain of the other API endpoints, its absolute URL should be supplied here. For further information see the Open Booking API specification.

5.4.2.4 Describing Booking Services (schema:SoftwareApplication)

Property | Status | Type | Notes
@type | REQUIRED | Text | SoftwareApplication
name | REQUIRED | Text | The name of the software application.
version | REQUIRED | Text | The version of the software application.
url | OPTIONAL | schema:URL | The URL of a human-readable web-page providing further information about the software.
featureList | RECOMMENDED | Array of schema:URL | A URL or URLs pointing to a machine-readable description of the Open Booking API features implemented by the system, e.g. as generated by the OpenActive Test Suite.

Note

The schema:WebAPI specification has been assigned Pending status by the schema.org organisation, and is scheduled for release in schema version 10.0. While schema:WebAPI is relatively stable, points of detail are still subject to review, and this specification may therefore change at short notice.

5.4.2.4.1 Worked Example

The example below illustrates a Dataset Site pointing to feeds consisting of ScheduledSessions, SessionSeries, and Events. As the presence of the accessService property (an object of type WebAPI) indicates, data items from these feeds are bookable.

{
   "@context":[
      "https://schema.org/",
      "https://openactive.io/",
      "https://openactive.io/ns-beta"
   ],
   "@type":"Dataset",
   "@id":"https://data.example.com/",
   "name":"Example Sessions and Events",
   "description":"Near real-time availability and rich descriptions relating to sessions and events available from Example.com",
   "url":"https://data.example.com/",
   "dateModified":"2019-08-25T11:23:27+00:00",
   "keywords":[
      "Courses",
      "Sessions",
      "Events",
      "Activities",
      "Sports",
      "Physical Activity",
      "OpenActive"
   ],
   "schemaVersion":"https://www.openactive.io/modelling-opportunity-data/2.0/",
   "license":"https://creativecommons.org/licenses/by/4.0/",
   "publisher":{
      "@type":"Organization",
      "name":"Example.com",
      "description":"Example.com makes it easy to get active!",
      "url":"https://example.com/home",
      "legalName":"Example Ltd",
      "logo":{
         "@type":"ImageObject",
         "url":"https://cdn.example.com/assets/logo.png"
      },
      "email":"[email protected]"
   },
   "discussionUrl":"https://github.com/example/repo/issues",
   "datePublished":"2019-07-11T00:00:00+00:00",
   "inLanguage":[
      "en-GB"
   ],
   "distribution":[
      {
         "@type":"DataDownload",
         "name":"ScheduledSession",
         "additionalType":"https://openactive.io/ScheduledSession",
         "encodingFormat":"application/vnd.openactive.rpde+json; version=1",
         "contentUrl":"https://example.com/api/openactive/scheduledsessions"
      },
      {
         "@type":"DataDownload",
         "name":"SessionSeries",
         "additionalType":"https://openactive.io/SessionSeries",
         "encodingFormat":"application/vnd.openactive.rpde+json; version=1",
         "contentUrl":"https://example.com/api/openactive/sessionseries"
      },
      {
         "@type":"DataDownload",
         "name":"Event",
         "additionalType":"https://schema.org/Event",
         "encodingFormat":"application/vnd.openactive.rpde+json; version=1",
         "contentUrl":"https://example.com/api/openactive/events"
      }
   ],
   "backgroundImage":{
      "@type":"ImageObject",
      "url":"https://cdn.example.com/images/background.jpg"
   },
   "documentation":"https://developer.openactive.io/",
   "accessService":{
      "@type":"WebAPI",
      "name":"Open Booking API",
      "description":"The Open Booking API lets you book OpenActive Opportunities. The API uses standard schema.org types and is compliant with the JSON-LD specification.",
      "documentation":"https://openactive.io/open-booking-api/EditorsDraft",
      "termsOfService":"https://example.com/api/booking/documentation/terms-of-service",
      "endpointUrl":"https://example.com/api/booking/",
      "conformsTo":[
         "https://www.openactive.io/open-booking-api/2.0/"
      ],
      "endpointDescription":"https://www.openactive.io/open-booking-api/2.0/swagger.json"
   }
}
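A document shaped like the worked example can be checked against the REQUIRED properties listed in the tables above. The following is a rough sketch rather than a conformance tool: the function names are hypothetical, only the presence of keys is tested, and the booking endpoint is read from the accessService property because that is where the worked example places the WebAPI object.

```python
# REQUIRED properties taken from the Dataset and DataDownload tables.
REQUIRED_DATASET = (
    "@context", "@type", "@id", "url", "name", "license",
    "distribution", "publisher", "datePublished", "schemaVersion",
)
REQUIRED_DISTRIBUTION = ("@type", "name", "contentUrl")


def missing_required(dataset):
    """Return the paths of REQUIRED properties absent from a Dataset object."""
    problems = [key for key in REQUIRED_DATASET if key not in dataset]
    for index, feed in enumerate(dataset.get("distribution", [])):
        problems += [
            f"distribution[{index}].{key}"
            for key in REQUIRED_DISTRIBUTION
            if key not in feed
        ]
    return problems


def booking_endpoint(dataset):
    """Return the endpointUrl of the WebAPI under accessService, or None."""
    service = dataset.get("accessService")
    if isinstance(service, dict) and service.get("@type") == "WebAPI":
        return service.get("endpointUrl")
    return None
```

For the worked example, `missing_required` would return an empty list and `booking_endpoint` would return the endpointUrl given under accessService.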

5.4.3 Discoverability and Dataset Sites (using schema:DataCatalog)

Data Catalogs will normally be published as JSON-LD objects accessible via a URL.

Property | Status | Type | Notes
@context | REQUIRED | Array of URL values | Will normally consist only of the value https://schema.org/. Note that, in conformity with RFC3986, trailing slashes MUST be supplied where appropriate.
@type | REQUIRED | String | DataCatalog
@id | RECOMMENDED | URL | A unique identifier for the DataCatalog, often identical to the URL at which the DataCatalog is found.
datePublished | RECOMMENDED | schema:Date | The date the DataCatalog was published.
publisher | RECOMMENDED | schema:Organization | The Organization responsible for publishing the DataCatalog.
license | REQUIRED | URL | A URL reference to the license under which the dataset site is published. For OpenActive dataset sites this should be https://creativecommons.org/licenses/by/4.0/.
dataset | REQUIRED if hasPart is absent, OPTIONAL otherwise. | Array of URL | One or more URLs pointing to OpenActive Dataset Sites.
hasPart | REQUIRED if dataset is absent, OPTIONAL otherwise. | Array of URL | One or more URLs pointing to other OpenActive DataCatalogs.

Note: Trailing slashes and @context

It is common practice to reference https://schema.org without a trailing / within @context. However, to be consistent with the OpenActive Modelling Opportunity Data specification, which uses the full URI of https://openactive.io/ (including a path as per RFC 3986), the specification requires the schema.org context to be referenced with a trailing slash, i.e. https://schema.org/.
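The conditional rule for dataset and hasPart (each is REQUIRED only when the other is absent) is easy to get wrong, so a small check may help. This is a sketch under the assumption that an empty array counts as absent; the function name and the wording of the messages are hypothetical.

```python
def catalog_problems(catalog):
    """Return a list of problems found in a DataCatalog object."""
    problems = []
    if "@context" not in catalog:
        problems.append("@context is REQUIRED")
    if catalog.get("@type") != "DataCatalog":
        problems.append("@type must be DataCatalog")
    if "license" not in catalog:
        problems.append("license is REQUIRED")
    # dataset and hasPart are each REQUIRED only if the other is absent.
    if not catalog.get("dataset") and not catalog.get("hasPart"):
        problems.append("one of dataset or hasPart is REQUIRED")
    return problems
```

A catalog that only links to other catalogs through hasPart passes this check, as does one that only lists Dataset Sites.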

5.4.3.1 Worked example

The below is an example of a DataCatalog JSON object.

{
     "@context": "https://schema.org/",
     "@type": "DataCatalog",
     "@id": "https://opendata.example.live/api/datacatalog",
     "dataset": [
          "https://api.example.org.uk/OpenActive/",
          "https://booking.example.co.uk/OpenActive/",
          "https://active.example.net/OpenActive/",
          "https://camp.example.net/OpenActive/"
     ],
     "datePublished": "2020-10-21T12:28:09.7981681+00:00",
     "publisher": {
          "@type": "Organization",
          "name": "Example.com",
          "url": "https://www.example.com/systems"
     },
     "license": "https://creativecommons.org/licenses/by/4.0/"
}
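Because a DataCatalog may point to other DataCatalogs through hasPart, a consumer that uses one as a start-point for spidering has to walk the resulting tree. The sketch below shows one way to do that, assuming a caller-supplied `fetch_json` function (hypothetical) that returns the parsed JSON found at a URL; catalogs already visited are skipped so that circular references terminate.

```python
def collect_dataset_sites(catalog_url, fetch_json, seen=None):
    """Walk DataCatalogs depth-first and return the Dataset Site URLs found."""
    seen = set() if seen is None else seen
    if catalog_url in seen:
        return []  # already visited; avoids infinite loops on cycles
    seen.add(catalog_url)
    catalog = fetch_json(catalog_url)
    # Dataset Sites listed directly by this catalog come first.
    sites = list(catalog.get("dataset", []))
    # Then descend into any child catalogs.
    for child_url in catalog.get("hasPart", []):
        sites.extend(collect_dataset_sites(child_url, fetch_json, seen))
    return sites
```

Passing the fetcher in as a parameter keeps the traversal independent of any HTTP library and makes it straightforward to exercise with canned data.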

5.4.4 Describing Publishers (schema:Organization)

Property Status Type Notes
@type REQUIRED Text Organization
name RECOMMENDED Text The name of the Organization publishing the datasets.
logo OPTIONAL URL A link to the publishing Organization's logo.
url RECOMMENDED URL A link to the publishing Organization's website.

5.4.5 Removing data feeds

5.4.5.1 Data Publishers

In the event that a feed is to be removed permanently, publishers MUST:

  1. Remove the link to the feed from their dataset site
  2. Ensure that the feed URL returns a 404 ('Not Found') status code. This response should be returned for a period of at least seven (7) days from the date of initial removal, in order to ensure that regularly-consuming applications receive an explicit indication of removal within a reasonable timeframe.

5.4.5.2 Data Consumers

In the event that a consuming application receives a 404 response from a previously-harvested feed URL, all records associated with that feed MUST be purged from its datastore. This is to ensure data privacy and compliance with related legislation, such as the General Data Protection Regulation (GDPR).
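The purge rule can be sketched as follows. The datastore is modelled here as a plain dictionary keyed by feed URL, and `get_status` is a caller-supplied function returning the HTTP status code for a URL; both are hypothetical stand-ins for whatever storage and HTTP client a consumer actually uses.

```python
def purge_removed_feeds(records_by_feed, get_status):
    """Delete all stored records for every feed whose URL now returns 404.

    records_by_feed maps a feed URL to the records harvested from it.
    Returns the list of feed URLs that were purged.
    """
    # Decide what to purge first, so the dict is not mutated while iterating.
    purged = [url for url in records_by_feed if get_status(url) == 404]
    for url in purged:
        del records_by_feed[url]
    return purged
```

Only an explicit 404 triggers the purge in this sketch; other error responses leave the stored records untouched.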

6. Future versions of this API

Future iterations of the specification will be shaped by the OpenActive community, and we encourage you to get involved.

A. Acknowledgements

This section is non-normative.

The editors thank all members of the OpenActive Community Group for their contributions.

B. References

B.1 Normative references

[JSON-LD]
JSON-LD 1.0. W3C. W3C Recommendation. URL: https://www.w3.org/TR/json-ld/
[RFC2119]
Key words for use in RFCs to Indicate Requirement Levels. S. Bradner. IETF. March 1997. Best Current Practice. URL: https://tools.ietf.org/html/rfc2119
[RFC8174]
Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words. B. Leiba. IETF. May 2017. Best Current Practice. URL: https://tools.ietf.org/html/rfc8174