A deep-dive into managing your SaaS data

At DeltaXML we are actively working on Software as a Service (SaaS) products. These are services for JSON and XML where we provide both the comparison and merge algorithms and the cloud compute resources to run them. They operate on your data and produce results that show how your data has changed, or how we have processed your changes in operations such as merge or graft.

In these SaaS processing scenarios, users will rightly ask how we manage and secure their data. You may be concerned about security, confidentiality, or data protection regulations.

When we talk about data, this includes the comparison or merge input data, and also the results, which contain fragments or copies of the input data.

This blog post discusses the ways in which you can use our services without us storing any of your data (where we only process it in memory), and shows how you can manage data storage in the workflows where we do use persistent storage for your result data.

Synchronous Operations

One of the simplest ways of using our online services is through the Synchronous API endpoints. Our JSON service has several synchronous endpoints for operations including compare, merge and graft.

In these operations the result of the processing is returned in the HTTP response to your REST request. We parse the data, load it into memory, and align and process it there; the result is then returned in the REST response directly from memory. All the processing is done without needing persistent storage. We may record metadata, for example in order to charge for that operation in certain subscription tiers, and generally as part of the operation of the system – more on that metadata later.
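To make this concrete, here is a minimal Python sketch of a synchronous compare call. The endpoint path follows the examples later in this post, but the `"inline"` input shape, the `build_sync_compare_body` helper and the session handling are illustrative assumptions rather than the documented request schema:

```python
import json

# Hypothetical base URL -- adjust for your account and subscription tier.
API_BASE = "https://api.deltaxml.com/api/json/v1"

def build_sync_compare_body(a_doc, b_doc, word_by_word=True):
    """Build a request body for a synchronous JSON compare.

    The inputs are carried inline in the request itself, so the service
    can process them entirely in memory and return the delta in the
    HTTP response -- nothing needs to be persisted server-side.
    """
    return {
        "a": {"type": "inline", "content": json.dumps(a_doc)},
        "b": {"type": "inline", "content": json.dumps(b_doc)},
        "wordByWord": word_by_word,
    }

def sync_compare(session, body):
    """POST the compare and return the delta from the response body.

    `session` is a requests.Session (or anything with a compatible
    .post method); the delta comes back directly in this response.
    """
    resp = session.post(f"{API_BASE}/compare", json=body, timeout=60)
    resp.raise_for_status()
    return resp.json()

body = build_sync_compare_body({"total": 10}, {"total": 12})
```

The key point is that the whole round trip is request in, delta out: there is no job object to poll and no stored result to clean up afterwards.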

Asynchronous with Cloud IO

The synchronous API endpoints discussed above are suited to small files and certain use cases. For larger files and asynchronous use cases there are additional endpoints, which can be used in two ways. Firstly, let’s look at using them with ‘Cloud IO’, where the inputs of our processes are read from cloud storage and the outputs are written back to it. We’ve designed the system so that you can use your own cloud storage: you provide the locations and access rights that allow the service to read and write the data.

Here’s an example of a JSON comparison using data stored in AWS S3 buckets:

POST /api/json/v1/compare
{
  "a": {
    "type": "awsS3",
    "accessKeyId": "AKIZHTKZ2120XHSJKWEU3A",
    "secretAccessKey": "**********************",
    "region": "eu-west-2",
    "bucket": "deltaxml-inputs",
    "fileName": "jan-report.json"
  },
  "b": {
    "type": "awsS3",
    "accessKeyId": "AKIZHTKZ2120XHSJKWEU3A",
    "secretAccessKey": "**********************",
    "region": "eu-west-2",
    "bucket": "deltaxml-inputs",
    "fileName": "feb-report.json"
  },
  "wordByWord": true,
  "async": {
    "output": {
      "type": "awsS3",
      "accessKeyId": "ATIIKTKZ2120333W33",
      "secretAccessKey": "**********************",
      "region": "eu-west-2",
      "bucket": "deltaxml-results",
      "fileName": "jan-feb-diff-report.json"
    }
  }
}

In the example above, different AWS buckets and access keys are provided for the inputs and outputs of the comparison. This is by design and allows the use of different IAM security policies. The policy used for the inputs would allow files to be read from the input bucket, while a different bucket is used for the output, with a policy that allows files to be written but not read. Additionally, there is no need to allow listing the bucket contents in either policy.
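As an illustration, the input-side IAM policy could be as narrow as read-only access to objects in the inputs bucket. The bucket name follows the example above; the exact statements are our own sketch, not a recommendation:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadComparisonInputs",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::deltaxml-inputs/*"
    }
  ]
}
```

The output-side policy would mirror this, granting only `s3:PutObject` on `deltaxml-results/*`, and note that neither policy needs `s3:ListBucket`.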

The example above shows JSON comparison with AWS S3 bucket storage and, behind the scenes, the use of IAM policies. The same principles apply across our SaaS services and operations; we currently support Cloud IO for the AWS, Azure and Google clouds, which have similar access and policy mechanisms.

Asynchronous results

The final type of REST use to consider is one that’s asynchronous and does involve data storage. We provide an asynchronous REST API because some users will want to compare large amounts of data, which could take several minutes and possibly exceed connection timeouts. In the asynchronous API you include, or point to, the inputs and describe the configuration options. When you POST the operation there is an immediate HTTP 202 (Accepted) response and processing begins. When you submit the ‘job’ you get a URI you can then poll to find out when the processing has finished, or you can register a callback hook to be notified. Upon completion, this is what the job result looks like when you don’t use our Cloud IO mechanisms:

GET /api/json/v1/jobs/12346124
{
  "startTime": "2020-02-04T11:24:03.734+0000",
  "creationTime": "2020-02-04T11:24:03.734+0000",
  "finishedTime": "2020-02-04T11:24:10.846+0000",
  "comparisonTime": "7112 ms",
  "jobId": "12346124",
  "numberOfStages": 19,
  "progressInPercentage": 100.0,
  "pipelineStage": "jsondelta",
  "jobStatus": "FINISHED",
  "output": {
    "type": "http",
    "size": 1802959,
    "uri": "https://api.deltaxml.com/api/json/v1/downloads/12346124"
  }
}

The result has been stored and there’s a URI where an HTTP GET will obtain the result of your operation. Once you’ve retrieved the file, you can then issue an HTTP DELETE on the jobs endpoint, for example:

DELETE /api/json/v1/jobs/12346124

This will delete the job object from the system together with the storage we are using for the result.

We also need housekeeping processes in place, such as deleting old jobs and result files (we can only suggest that users follow the process above and make use of HTTP DELETE), and so the SaaS subscription tiers provide different job/file retention times and storage limits.

Metadata

So far we’ve talked about your actual data. But there’s also metadata. We store this to:

  • charge you for usage – we capture data sizes and comparison times
  • improve the service – we capture information about which options you select for your various operations
  • maintain logs to analyze failures and to respond to security incidents

But this metadata is about your data; we avoid storing your data itself except in the asynchronous API workflow described above, and we provide mechanisms to allow you to manage your data in that REST workflow.

Please do raise support cases or comment here if you have any concerns about data storage or security.
