This document is also available in this non-normative format: diff to previous version
Copyright © 2015-2016 the Contributors to the Realtime Paged Data Exchange 0.2.4 Specification, published by the Openactive Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
This specification tackles the generic use-case of unidirectional real-time data synchronisation between two systems, where the receiving system requires only summary data from the origin system.
This specification was published by the Openactive Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
This document is an early draft and is likely to change. It should therefore only be used to guide implementations in conversation with the Openactive community, where the quick wins of data sharing and shared learning are high priorities.
Comments on this document are not only welcomed but are actively solicited and should be made on GitHub Issues. The source code is available on GitHub.
If you wish to make comments regarding this document, please send them to [email protected] (subscribe, archives).
This section is non-normative.
The W3C Openactive Community Group was established with the objective of facilitating the sharing and use of physical activity data. The motivation is that physical activity data is currently largely closed, and opening up and sharing this data has enormous potential.
Openactive specifications are modular, each focussing on a specific use case regarding physical activity data.
Other aspects of the Community Group's work include the collation and specification of test suites to assess conformance to referenced standards, and the specification of a framework within which such conformance might be assessed.
Booking systems contain session data (a specific event at a time in a location). Examples include bookable squash courts, yoga classes, and running groups.
This session data is frequently updated as sessions are booked and descriptions are changed. The data must be available as close to real-time as possible.
Many booking systems are maintained by an agency or supplier, and changes are funded by organisations with small budgets. Hence simplicity and speed of implementation are paramount.
Due to the cost of work on the booking system side, this interface must not be a constraint to innovation. Innovators should be able to easily implement novel solutions on top of the data without additional effort on the booking system side. This means ideally pushing all query complexity out of the booking system in order to maximise flexibility, and providing a simple sync.
This specification is tightly scoped to cover data exchange and synchronisation itself: the real-time exchange of generic entities between two systems.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The key words MUST, MUST NOT, and SHOULD are to be interpreted as described in [RFC2119].
This specification describes the conformance criteria for Openactive data synchronisation endpoints.
In order to create a simple specification that is robust and scalable, the transport mechanism is separated from the paged exchange specifics. By applying paging to all transport alternatives, the approach is inherently scalable.
The specification is split into the three labelled elements of Fig. 1 Elements of this specification.
A paged approach to [JSON] data exchange is used, requiring minimum traffic for real-time or near-real-time synchronisation.
The paged [JSON] data exchange standard is incredibly simple to implement, but conceptually requires some explanation.
Consider an ordered list as shown in Fig. 2 Illustration of the deterministic list, with the following invariants:
A continuous list of records that MUST be sorted deterministically and chronologically (in the order they were updated): either (i) ordered first by modified timestamp and second by ID, or (ii) ordered by an incrementing counter where records are assigned a new unique value on each update.
Every record MUST only be represented once in this list at a given moment, with its position in the list depending on when it was last updated. Records can freely move position in the list as they are updated.
This deterministic ordering based on timestamp allows for pages of arbitrary size to be sent without concern for race conditions; if a record is updated during the transfer of a page it MUST appear on a subsequent page (i.e. simply reappear further down the list).
Pages are defined using a "next page URL", which MUST contain enough information to identify a position in the list (e.g. by a "timestamp" and "ID" combination). It MUST NOT reference a specific record, as that record can change position in the list.
If a "next page URL" is not used to access the list, the first page MUST be returned.
If the consumer reaches the end of the list they consider themselves up-to-date at that moment, and can frequently revisit the end of the list in order to retrieve further updates.
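By way of non-normative illustration, a consumer's synchronisation loop might look like the following sketch, where applyItem is a hypothetical function that writes an item into the consumer's local store (a fuller sketch of it appears later in this document):

async function sync(nextUrl) {
  while (true) {
    const response = await fetch(nextUrl);
    const page = await response.json();
    page.items.forEach(applyItem); // hypothetical local-store update, sketched later
    if (page.items.length === 0) {
      // End of the list reached: the consumer is up to date, so wait before revisiting
      await new Promise(r => setTimeout(r, 60 * 1000));
    }
    nextUrl = page.next; // an opaque position in the list, never a specific record
  }
}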
Two example implementations are described below, both of which adhere to the specified invariants. One of these two examples SHOULD be used as a basis for any implementation, unless the invariants are well understood and the implementation can be thoroughly tested.
The content of a [JSON] page could be defined by two parameters: afterTimestamp and afterId.
In this example, afterTimestamp=1453931925&afterId=12 would return Page 2 as labelled in Fig. 2 Illustration of the deterministic list. Hence the last record of a returned page of results can be used as parameters to retrieve the next page of results. This paging allows for an ongoing data synchronisation that synchronises all data, and which can be replayed arbitrarily by the client.
Note that the timestamp does not need to reflect actual time, as long as it provides a chronological ordering. An integer, string or other representation is sufficient, provided it is comparable in a way that supports the invariants.
For a single entity type (e.g. sessions), the data returned can be defined as follows:
var query;
if (queryParams.afterTimestamp && queryParams.afterId) {
  // Resume from the position given by the last record of the previous page
  query = Session.query().filter(
    (Session.modified == queryParams.afterTimestamp && Session.id > queryParams.afterId) ||
    (Session.modified > queryParams.afterTimestamp)
  );
} else {
  // No position supplied: return the first page of the list
  query = Session.query();
}
return query.sort([Session.modified, Session.id]);
The content of a [JSON] page could be defined by just one parameter:
The ID MUST provide a deterministic chronological ordering within the scope of the endpoint. A database-wide counter is sufficient (such as SQL Server's timestamp or rowversion). The consumer ("System 2") simply maintains the "next page URL", so the detail of the ID is not constrained.
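By analogy with the two-parameter example above, a non-normative sketch of this single-parameter variant might be as follows; the parameter name afterChange and the incrementing counter column change are assumptions, since this standard does not constrain them:

var query;
if (queryParams.afterChange) {
  // Resume from the position given by the last record of the previous page
  query = Session.query().filter(Session.change > queryParams.afterChange);
} else {
  // No position supplied: return the first page of the list
  query = Session.query();
}
return query.sort([Session.change]);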
This specification can be implemented for each relevant entity type within System 1 (e.g. clubs, courses, sessions).
Data from related entities can either be:
Embedded inside the parent entity. This requires the parent "updated" field to reflect the maximum value of all children - achievable via database triggers or similar.
Provided by separate endpoints implementing this specification, the combination of which can be reassembled by System 2 (a reassembly sketch follows this list).
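For the second option, a non-normative sketch of the reassembly on the System 2 side, assuming one synchronised store per "kind", might be:

// Each "kind" is synchronised into its own store; related entities are
// reassembled by ID on the System 2 side.
const stores = { session: new Map(), club: new Map() };

function sessionWithClub(sessionId) {
  const session = stores.session.get(sessionId);
  const club = session ? stores.club.get(session.data.clubId) : undefined;
  return { session: session, club: club };
}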
It should be noted that although the <data> element is open to conformance with [JSON-LD], the paging has been deemed too trivial to require the application of any specific serialization.
<response>
<response> => {
  items: [<item>, <item>, <item>, ...],
  next: "/getSessions?afterTimestamp=<date>&afterId=<id>"
}
A generic response specification is included here, the idea being that we standardise the transport encapsulation for the records (paging and polling logic) across entities and systems so that we can genericise this logic on both sides. A sketch of a handler assembling this envelope follows the table below.
Key | Description |
---|---|
items | An array of <item>, which should simply be empty [] if no results are returned. |
next | For polling, the "next" URL in the response is precomputed by the server and is called by the client to retrieve the next page (after a delay, if the previous page returned no data). Note that "polling" and "paging" are differentiated only by the duration between requests. Although an example endpoint name is provided, endpoint naming is outside the scope of this standard. |
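The following non-normative sketch assembles this envelope from one page of the deterministic list; the endpoint name is illustrative, as noted above:

function buildResponse(items, afterTimestamp, afterId) {
  // If the page is empty the consumer is up to date, so "next" repeats the
  // current position and can simply be polled again after a delay.
  const last = items[items.length - 1];
  const next = last
    ? '/getSessions?afterTimestamp=' + last.modified + '&afterId=' + last.id
    : '/getSessions?afterTimestamp=' + afterTimestamp + '&afterId=' + afterId;
  return { items: items, next: next };
}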
<item>
<item> => {
  state: 'updated' | 'deleted',
  kind: "session",
  id: "{21EC2020-3AEA-4069-A2DD-08002B303123}",
  modified: Date(a),
  data: <data>
}
Key | Description |
---|---|
state | Deleted items are included in the response with a "deleted" state, but no <data> associated. |
kind | The "kind" attribute allows for the representation of different entity types. The standard does not advocate embedding of child entities if they change more frequently than the parent. Each entity type ("kind") can be synchronised separately, this allows us to decouple the sync logic from the data structure, and allows us to reassemble the data structure on the client side. It also makes the implementation very simple. |
id | Although the IDs shown here are GUIDs and those in the earlier example are numeric, and although the example above shows Unix timestamps, the standard does not prescribe a specific format for either. |
modified | Modified timestamp of the item. Values must be comparable with one another (a consumer sketch using this follows this table). |
data | Note this key is not included if state is 'deleted' |
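By way of non-normative illustration, a consumer might apply each <item> to the per-kind stores sketched earlier as follows, using the modified value to ignore stale updates (as recommended for the transports described later):

function applyItem(item) {
  const store = stores[item.kind]; // per-kind stores, as sketched earlier
  const existing = store.get(item.id);
  if (existing && existing.modified >= item.modified) {
    return; // stale: records are only ever overwritten with newer data
  }
  if (item.state === 'deleted') {
    store.delete(item.id); // deleted items carry no <data>
  } else {
    store.set(item.id, item);
  }
}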
<data>
Note that this element is non-normative.
<data> => {
  lat: 51.5072,
  lng: -0.1275,
  name: 'Acrobatics with Dave',
  groupId: "{0657FD6D-A4AB-43C4-84E5-0933C84B4F4F}",
  clubId: "{38A52BE4-9352-453E-AF97-5C3B448652F0}"
}
The <data> part of the response is then open for whatever data structure is most appropriate for the particular entity type ("kind"). Other Openactive specifications will cover this content.
Although IDs are provided for related entities here, this example structure is not part of the standard.
A full example REST response from polling:
/getSessions?afterTimestamp=1453931101&afterId={c15814e5-8931-470c-8a16-ef45afedaece}
-> {
  items: [
    {
      state: "updated",
      kind: "session",
      id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
      modified: 1453931101,
      data: {
        lat: 51.5072,
        lng: -0.1275,
        name: 'Acrobatics with Dave',
        clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
      }
    },
    {
      state: "deleted",
      kind: "session",
      id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
      modified: 1453931925
    }
  ],
  next: '/getSessions?afterTimestamp=1453931925&afterId={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}'
}
These transport options cover different levels of complexity and data volume. Note that in all cases polling must be implemented to support a full cache refresh and data download. The real-time transport mechanisms work alongside infrequent polling to keep the data current.
In the case of real time transport failure, a production client implementation can fall back to polling.
Transport Options | Advantages | Disadvantages | Primary Use Case |
---|---|---|---|
Polling (Simple download) | Simple to implement | Does not provide a real-time feed, and heuristic polling will result in patchy sync | Full cache refresh (also can be used in isolation for prototype implementation). |
Webhooks (Real-time) | Less traffic than polling, more server-side control, allows for real-time, uses standard REST interface | Uses many high-latency connections | Basic production implementation |
Server-Sent Events (Real-time) | Optimisation over webhooks as uses one connection, so can handle higher volume | Requires additional libraries | High volume production implementation |
AMQP (Real-time) | Pages can be handed off to the queue to facilitate even higher volume than Server-Sent Events | Requires additional infrastructure | Very high volume production implementation |
A basic REST endpoint which accepts the afterTimestamp and afterId parameters is required to allow a full cache refresh / data download on demand (e.g. /getSessions?afterTimestamp=Date(b)&afterId={d97f73fb}).
For cases where only a polling endpoint is available, the client will poll the endpoint using heuristic backoff.
Implementation of at least one other type of endpoint is recommended in order to enable real-time updates.
Webhooks use the same mechanism as polling, except that pages are pushed from server to client, rather than requested explicitly by the client from the server.
The client registers an endpoint with the server, and the server repeatedly sends subsequent pages to the client. Using the same paging features as with polling allows the server to batch items to increase throughput.
When sending a particular page to the client, the server is expected to wait for a successful acknowledgement of the page before sending the next page. If sending a page fails, that page should be retried continuously with an exponential backoff. The server should only proceed to the following page after successfully sending the previous one.
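A non-normative sketch of this server-side behaviour follows; sendPage (a POST to the client's registered endpoint that resolves only on a successful acknowledgement), getNextPage and savePosition are assumptions:

async function pushPages(clientEndpoint, position) {
  while (true) {
    const page = getNextPage(position); // hypothetical: one page of the deterministic list
    if (page.items.length === 0) {
      await new Promise(r => setTimeout(r, 60 * 1000)); // nothing new: wait and check again
      continue;
    }
    let delay = 1000;
    while (true) {
      try {
        await sendPage(clientEndpoint, page); // hypothetical; resolves on acknowledgement
        break;
      } catch (e) {
        await new Promise(r => setTimeout(r, delay)); // retry the same page
        delay = Math.min(delay * 2, 60 * 60 * 1000);  // exponential backoff, capped
      }
    }
    position = page.lastPosition; // advance only after a successful send
    savePosition(clientEndpoint, position); // hypothetical per-client cursor storage
  }
}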
Note that during a full cache refresh the client will page the server for all data, and may simultaneously be receiving webhook requests from the server. Using the timestamp of each record to ensure records are only updated with newer data, the client is able to perform the full cache refresh and receive webhook requests simultaneously. Alternatively the client can choose to drop webhook requests until its full cache refresh is complete, which should trigger the exponential backoff behaviour from the server, ensuring a good crossover of items between the end of the cache refresh and the webhook updates resuming.
Example payload for the webhook:
/putSessions
-> {
  items: [
    {
      state: 'updated',
      kind: "session",
      id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
      modified: Date(a),
      data: {
        lat: 51.5072,
        lng: -0.1275,
        name: 'Acrobatics with Dave',
        clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
      }
    },
    {
      state: 'deleted',
      kind: "session",
      id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
      modified: Date(b)
    }
  ]
}
Note that the "next" from the response is not required here. Instead, from and after should be stored for each client of the server, in order that the server is able to send the relevant next page to each client.
Server-Sent Events ([eventsource]) (over HTTPS) provide a simple and efficient channel through which a high volume of updates can pass. Although an additional library may be required depending on the server's platform, it is a very light implementation.
Server-Sent Events require System 1 to implement a system to keep track of events and trigger the sending of data through this channel. This non-trivial implementation, together with the complexities of library support in various languages and frameworks compared with the straightforward [JSON] API implementation used in polling and webhooks, is the reason that the other transport mechanisms are provided.
The responsibility is on the client to reestablish a connection to the server and inform it of the last retrieved record in order to continue the stream, which is closer to polling than to webhooks.
Response grammar / example:
/stream?afterTimestamp=Date(b)&afterId={d97f73fb-4718-48ee-a6a9-9c7d717ebd85}
->
event: itemupdate
data: <item>
Only the <item> from the <response> is required here, and is passed as "data" (the Server-Sent Event specification's meaning of "data", which is different from the "data" part of the item). The explicit paging used with polling and webhooks is made redundant, as this is a continuous stream over one connection.
The event type is always set to "itemupdate", as the state of each item is set with the "state" field consistent with polling.
An example stream is below:
/stream?afterTimestamp=Date(a)&afterId={a97f73fb-4718-48ee-a6a9-9c7d717ebd85}
->
event: itemupdate
data: {
  state: 'updated',
  kind: "session",
  id: "{c15814e5-8931-470c-8a16-ef45afedaece}",
  modified: Date(a),
  data: {
    lat: 51.5072,
    lng: -0.1275,
    name: 'Acrobatics with Dave',
    clubId: "{fc1f0f87-0538-4b05-96a0-cee88b9c3377}"
  }
}

event: itemupdate
data: {
  state: 'deleted',
  kind: "session",
  id: "{d97f73fb-4718-48ee-a6a9-9c7d717ebd85}",
  modified: Date(b)
}
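As a non-normative illustration, a client might consume this stream with the standard EventSource API, resuming from the last retrieved item when the connection drops; applyItem is the consumer sketch from earlier:

// Track the last retrieved item so the connection can be re-established
// from the correct position in the list.
let last = { timestamp: 0, id: '' };

function connect() {
  const source = new EventSource(
    '/stream?afterTimestamp=' + last.timestamp + '&afterId=' + last.id);
  source.addEventListener('itemupdate', function (event) {
    const item = JSON.parse(event.data); // the <item>; SSE "data" is distinct from the item's "data"
    applyItem(item);                     // as sketched earlier
    last = { timestamp: item.modified, id: item.id };
  });
  source.onerror = function () {
    source.close();
    setTimeout(connect, 5000); // reconnect from the last retrieved item
  };
}
connect();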
[OASIS AMQP] (RabbitMQ et al.) provides a two-way channel for events which includes buffering and multiple connections to increase throughput.
As with Server-Sent Events, only the <item> from the <response> is required to be sent in the message, as AMQP makes the explicit paging redundant.
For very high volumes, this allows the server to send multiple pages in parallel, as it can calculate the "afterTimestamp" and "afterId" parameters and control the send. The client can also process these in parallel (particularly useful in the case of shared or NoSQL data stores, with scaling queue processors), using the timestamp of each record to ensure records are only updated with newer data. However, this requires additional infrastructure and the use of certificates (more complex to configure than HTTPS).
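As a non-normative illustration, a consumer built on the amqplib Node library (one of several AMQP clients) might look like the sketch below; the broker URL, queue name and message format (one [JSON] <item> per message) are assumptions:

const amqp = require('amqplib');

async function consume() {
  const connection = await amqp.connect('amqps://broker.example.com'); // assumed broker
  const channel = await connection.createChannel();
  await channel.assertQueue('session-items'); // assumed queue name
  channel.consume('session-items', function (message) {
    const item = JSON.parse(message.content.toString());
    applyItem(item); // the "modified" guard makes parallel consumers safe
    channel.ack(message);
  });
}
consume();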
This section is non-normative.
Three common patterns of implementation are presented, along with specific advantages and disadvantages of each, together with miscellaneous notes.
Create a cache table which is written to on each entity change (or related entity change, if a calculated field is created), either via an application or database trigger (which can also be used to trigger the webhook). The table contains the rendered [JSON] <item>, along with the modified timestamp and the ID.
Entries in the table overwrite old items with a newer modified timestamp.
This table can be easily parsed into output for the client. This has the advantage of allowing one endpoint and one process to manage the real-time sync by watching this single table, as the table can maintain a sort and page across all entity "kinds".
The [JSON] is generated from each table individually at the point that it is requested by either the webhook or poll.
An endpoint will be required for each entity "kind", as the sort cannot efficiently happen across tables. These endpoints would require separate webhook / polling processes to keep them in sync (though the webhooks can all share the same endpoint on the client).
A paging table could be created that contains only the kind, modified timestamp and ID. This table is then updated with each entity update; however, the [JSON] is only generated on demand, by getting the next page from the paging table and rendering [JSON] for each of the IDs of the "kinds" returned.
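A non-normative sketch of this on-demand rendering step follows; pagingTable.nextPage and the per-kind renderItem function are assumptions:

async function getPage(afterTimestamp, afterId) {
  // Hypothetical query over the paging table of (kind, modified timestamp, ID)
  const rows = await pagingTable.nextPage(afterTimestamp, afterId);
  // Render the [JSON] <item> for each row only now, on demand
  const items = await Promise.all(rows.map(function (row) {
    return renderItem(row.kind, row.id, row.modified); // hypothetical per-kind renderer
  }));
  return buildResponse(items, afterTimestamp, afterId); // as sketched earlier
}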
Some entities will not need to be synchronised, but fields calculated from them will need to be known to the client (e.g. the "tickets" table may not need to be synchronised, but the "available tickets" calculated field on the "sessions" table will be required).
The suggested approach is to calculate the field "available tickets" and store it in the "sessions" table on each ticket sale. This has three advantages:
An alternative would be to calculate it on each synchronisation; however, this would slow down the sync. Assuming that reads of this calculated data will occur more frequently than writes, caching the calculated field is recommended.
For high-throughput AMQP, when updating data in the client's index, the timestamp of each record will be used to ensure records are only updated with newer data. Although this technique can be used specifically for AMQP, all other methods of transport must adhere to a strict ordering of items by modified timestamp and ID (per kind) in order to ensure data consistency. Ordering between kinds is not important.
If using a timestamp/rowversion field on the parent table as the "modified timestamp", using the trigger below for each child table will update the rowversion field in the relevant rows on the parent table when the child table is updated.
The SET SomeColumn = SomeColumn part of this trigger could easily be replaced with setting materialised calculated fields (e.g. "total number of tickets sold") which contain the summary data required by "System 2". This reduces the need to join the child table during the API endpoint response, which helps to optimise the endpoint.
The example below has been adapted from here; see here for an explanation of the mechanics.
CREATE TRIGGER tgUpdateParentRowVersion ON ChildTable FOR INSERT, DELETE, UPDATE
AS
BEGIN
  /*
  The updates below force the update of the parent table rowversion
  */
  /* Materialised field calculation goes here */
  UPDATE a
  SET SomeColumn = SomeColumn
  FROM ParentTable a
  JOIN inserted i ON a.pkParentTable = i.fkParentTable

  UPDATE a
  SET SomeColumn = SomeColumn
  FROM ParentTable a
  JOIN deleted d ON a.pkParentTable = d.fkParentTable
END
This section is non-normative.
The editors thank all members of the Openactive CG for contributions of various kinds.