Wednesday, October 18, 2017

Universal Data Replication

One of the Darwino piece of code that I'm the most proud of is the replication engine. If a majority of our customers see it as a Domino to JSON replication engine, it goes far beyond that. In reality, it can replicate between virtually any data sources.
It is a true, two ways, multi-point replication engine, borrowing some ideas from IBM Domino but going beyond its venerable ancestor in multiple places.

The architecture

The main idea is that any data set can be represented as a set decorated JSON objects. Decorations include meta-data (ex: data security, replication attributes...) and binary attachments. The latest are stored apart from the JSON itself for performance reasons.
As an example, a Domino document can be transformed to a JSON object where every Domino document item is converted to a JSON attribute in the object, along with the attachments.
A relational table can also be transformed to a set of JSON objects or arrays, one per row. Even more interesting, a set of rows coming from multiple tables can be grouped to create a consistent, atomic  JSON object. This way, the replication granularity is not limited to a row, but can be a whole business object even if the physical storage split it into multiple records

Dawino's engine is mostly made of two components:

  1. A universal replication engine
    This engine does *not* know about the data sources. It contains all the replication logic while it accesses the data through a generic connector API
  2. Connectors
    A connector encapsulates the physical access to the data, via an API. Its API provides a "JSON" vision of the data that we'll be consumed by the engine
As of today, the Darwino replication engine comes with ready to use connector implementations:

  • Darwino JSON store, running atop RDBMS
  • IBM Domino
  • FlowBuilder XML store, used by ProjExec
  • HTTP to wrap a local connector and use HTTP/HTTPS as the transport
Some other connectors have been started:
  • Pure RDBMS used, for example, to replicate to DashDB
  • MongoDB
  • MS Sharepoint List

We are also waiting for a first beta release of IBM LiveGrid to replicate these data to your mobile devices and get a true offline expereince.

Writing a basic connector is a pretty easy operation. Then, the connector can be enhanced incrementally to provide fine tuned capabilities (better conflict detection....)

Darwino replication strengths


If the concept of replication feels easy to understand, there are many details that make Darwino unique so far:

Data transformation

The data can be transformed two ways during the replication process. These transformations can happen at the basic field level (ex: transforming a Domino name to its LDAP counterpart) or at a more global level, like grouping a set of Domino multiple fields into a JSON array. This way, the replicated JSON is clean, and it does not carry any legacy tricks from the source platform.
The transformation can obviously be coded, in Java or Groovy, but Darwino already comes with a pretty large set of standard transformations you can apply to any data. Thus, setting up the transformation is mostly about scripting the pre-defined transformers, unless you have more advanced need.
One of the benefit of data transformation is data normalization: you can ensure that the fields in a JSON document are converted to the expected data type. Domino is well know have inconsistent fields across the whole database, as they evolved over time. Darwino can normalized them.

Selective replication

Sometimes you want to limit the data replicated to a target system. It can be because of storage requirements (ex: your mobile device can only handle a subset of the data), or security (ex: you only want some data to be available on the cloud).
Darwino handles the selective replication by either providing selection formulas based on data. It also provides a unique feature called "Business Replication" where related documents can be grouped and replicated as a whole.
Selective replication poses a problem: how to propagate deletions as the data used to select the documents no longer exists, or how to remove a document from the target when the formula no longer selects that document? Well, we solved this problem with an efficient and reliable method that I'll be happy to talk about around a beer :-) This is a pretty complex problem.

Transaction and error recovery

If the target database supports transactions, then the engine can take advantage of them. If the replication fails for any reason, then it can rollback to the initial state, and eventually attempts another time. But this is not enough: when you have a large data set to replicate, you might not want to restart from scratch while you already replicated 98% of the data. For that purpose, the engine features commit points. If it fails at any time, it only restarts from the latest successful commit point.

Conflict detection

This is one of the area were Darwino excels. It has multiple conflict detection systems, going way beyond Domino. A particular connector might only implement a subset of them, depending on the data source capabilities. In this happens, then the engine will degrade gracefully and deal with what is available.
The existing mechanisms include: the last modification date, a sequence id (similar to Domino or some RDBMS), a change string carrying the update history, an update id...
Once a conflict is detected, an API is called and the developer can choose to execute a predefined behavior (ignore, last wins, create a copy data set...) or do a custom resolution like a data merge.
Finally, Darwino has an interesting mechanism to deal with records deletion. If the source database has deletion stubs, it will effectively use them. Else, it provides some helpers to find the deleted records, in particular within relational tables.

Performance

The whole Darwino replication engine is designed to handle very large datasets. The Domino connector is mostly written in "C", thus calling the native "C" API. Not only this allows critical features that are not available with the regular back-end classes, but it also provide high performance data replication. On my laptop, it replicates ~400 complex documents/second, including attachments and rich text conversion. For simpler documents, with only basic data, it goes to the range of 1000-2000 a second! In many cases, you can get your existing NSF replicated in minutes.

Domino data accuracy

A replication engine only makes sense if it is high fidelity. From a Domino perspective, the Darwino replication should act like a regular Domino replication, maintaining the same data (sequence id, creation and modification dates, list of authors...). We can do that thanks to the "C" API.
Darwino can replicate most of the Domino field types: string, number, name, rich text, mime, date/time... with maximum precision. For example, it deals particularly well with the date/time and the time zones.
Any domino document can be replicated, including design elements or profile documents. If you choose to replicate the database ACL document, then it will then be used by the Darwino runtime to apply the same security. Talking about security, we also handle the reader/author fields.

Aggregation

Data can be aggregated during replication: multiple sources of data can be replicated in one single target database, thus allowing a more global view of the whole data set. Think about it: you can replicate many NSF into a single set of relational tables and then run global queries on top of these tables. Isn't that a dream? Darwino liberates you Domino data!


As you see, the Darwino replication engine is a very complete engine that is flexible to handle many data sources. Out of the box, it can replicate your databases as is. But it also give you a lot of options to select and transform your data.

0 comments:

Post a Comment