System integration reviewed
The system integration tips presented here aim at data integrity, performance, quality, smooth cooperation, and, ultimately, fewer support issues.
The tips are based on integrations built around web service communication and data transfer between at least two parties.
If you prefer a bullet-point version of the best practices, scroll down to the end of this page.
Never make assumptions
First of all, never assume anything. Unless an action has been tested by all parties, the only safe assumption is that “it does not work yet”.
Data contract should be detailed and updated
When providing documentation, we need to present all of the details.
Over the years, problems have usually appeared in the same places:
- maximum length not agreed and different in the two systems,
- numeric types not agreed and different in the two systems,
- capital letter handling not agreed, or inconsistent across documents provided by the same vendor,
- missing loops in the code – when the data contract allows a list, use loops and write test cases for 0, 1 and n elements.
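To illustrate the last point, a minimal sketch (the field names `lines` and `qty` are made up for the example): loop over the list from the contract, and cover the 0, 1 and n element cases in tests.

```python
# Hypothetical order structure: the contract allows a list of order lines.
# Looping (instead of assuming exactly one element) makes the 0, 1 and
# n cases all work; tests should cover each of them.
def total_quantity(order: dict) -> int:
    return sum(line["qty"] for line in order.get("lines", []))

# The three cases the text recommends testing:
assert total_quantity({"lines": []}) == 0                        # 0 elements
assert total_quantity({"lines": [{"qty": 2}]}) == 2              # 1 element
assert total_quantity({"lines": [{"qty": 2}, {"qty": 3}]}) == 5  # n elements
```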
Dependency on previous packages
When designing an integration, often one package will depend on data from the previous package. When a series of packages are sent, one option to ensure data integrity would be to assign a version (timestamp) to each package. If further packages do not find a dependent package with correct versions, they will be rejected.
Without versioning, a missed update to a dependent package simply goes unnoticed until the next update arrives, while any operation that requires the ID of that missing dependent package will crash.
An example could be an updated promotion with a new description, a new percentage and a new list of items. If the list of items does not arrive (as an update) but the promotion does, the promotion can be successfully passed on to a live store, with the faulty item package left for support to fix later. This would result in people getting high discounts on items that were not supposed to be discounted that much. No database ID could save us here, short of forbidding promotion changes after the first insert, which would increase the data upload.
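The versioning idea above could be sketched like this (all names are illustrative, not from any real system): a package is rejected unless the package it depends on was applied with exactly the version it expects.

```python
from dataclasses import dataclass, field

@dataclass
class PackageStore:
    # package_id -> version (e.g. timestamp) of the package we applied
    applied: dict = field(default_factory=dict)

    def try_apply(self, package_id, version, depends_on=None):
        """depends_on: optional (package_id, version) this package requires."""
        if depends_on is not None:
            dep_id, dep_version = depends_on
            if self.applied.get(dep_id) != dep_version:
                return "rejected"  # dependency missing or wrong version
        self.applied[package_id] = version
        return "applied"
```

With this check, a promotion whose new item list never arrived is rejected instead of going live against stale items.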
Until the data transfer finishes the full cycle, it should be considered “in progress”. Even if we have inserted the data into our database, if we have not sent the response and received confirmation that our response arrived, the data is still “in progress”. It can only be marked completed when both parties mark it completed on their ends.
The status is useful when there is no confirmation from the other end:
- We have sent items but got no response. We do not know whether they were added.
- We have received items and sent a response, but it was never confirmed. We do not know whether the other party knows that we have added the items.
In those situations, packages can be rejected, retried, or reconciled until their status is confirmed.
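A minimal sketch of that lifecycle (the three flags are an assumption about what each side tracks): a transfer is only “completed” once the data was received, the response was sent, and its receipt was confirmed.

```python
# A transfer stays "in progress" until every step of the cycle is done:
# data stored, response sent, and the response confirmed by the sender.
class Transfer:
    def __init__(self):
        self.data_received = False
        self.response_sent = False
        self.response_confirmed = False

    @property
    def status(self):
        done = self.data_received and self.response_sent and self.response_confirmed
        return "completed" if done else "in progress"
```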
A ghost from the past
An example: an item change was performed at 12:01 and queued to be sent. Then the same item was changed again at 12:02. The first request to arrive at the target system was the one from 12:02; the one from 12:01 arrived later.
The target system must use versioning (e.g. a timestamp) to reject any outdated information. If the item is stored as “updated with origin date: 12:02”, a request carrying 12:01 can be easily rejected.
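A sketch of that rejection rule (class and method names are made up): keep the origin timestamp of the last applied change per item, and reject anything older, regardless of arrival order.

```python
from datetime import datetime

class ItemState:
    def __init__(self):
        self._last_origin = {}  # item_id -> origin timestamp of last applied change
        self._data = {}

    def apply(self, item_id, origin_ts, payload):
        last = self._last_origin.get(item_id)
        if last is not None and origin_ts <= last:
            return "rejected"  # a ghost from the past: older than what we have
        self._last_origin[item_id] = origin_ts
        self._data[item_id] = payload
        return "applied"
```

If the 12:02 change is applied first, the late-arriving 12:01 request is rejected and the item keeps its newest state.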
To reduce the hassle and improve performance, data can be bulk imported into temporary (staging) tables. No application validation is performed, and there are no delays from per-row database validation. Validation can be done first on the client side.
The temporary table can also be used to send data to the other system, not only to receive it. That way a complicated data set which takes longer to prepare (e.g. items and stock values for each store, or financial data that needs to be calculated) can be gathered with a timestamp (by an automatic job an hour before the data synchronization) and then sent to the other system. An additional benefit of this solution is full control over which values were sent to the other system.
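A toy illustration of the staging-table idea using SQLite (table and column names are invented): raw rows are bulk-loaded without validation, then only the rows passing a single bulk check move into the target table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging_items (sku TEXT, price REAL)")
conn.execute("CREATE TABLE items (sku TEXT, price REAL)")

# Bulk insert into staging: no per-row application validation, no delays.
rows = [("A-1", 9.99), ("A-2", -5.0), ("A-3", 19.99)]  # -5.0 is invalid
conn.executemany("INSERT INTO staging_items VALUES (?, ?)", rows)

# Validation runs once, in bulk, when moving rows to the real table.
conn.execute("INSERT INTO items SELECT sku, price FROM staging_items WHERE price >= 0")
valid = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

The staging table also keeps a full record of exactly which values were exchanged.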
Data processing – errors
Are we sure it processed?
To be sure of a package’s status (correct processing), the party responsible for receiving and processing data could provide an API where other parties can ask about a certain package or the data directly, e.g. an order status. Otherwise, we rely on the response information, and sometimes a request processes correctly but the response fails.
In the case described above, we can try to undo the package we sent, or mark it as a success on our side. After a given time, we could ask the service which did not reply about that status or the data directly, and then update our status accordingly. We could also repeat the operation, assuming the other service will not allow a duplicate entry to appear.
The other service, in turn, could keep trying to send the response or a data reversal to us until it succeeds (retrying at some intervals).
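The reconciliation logic above might look like this sketch (the status values and the status endpoint are assumptions): only an “unknown” local status triggers a query to the other side.

```python
def reconcile(local_status, query_remote_status):
    """Resolve an unknown package status against the other party's API."""
    if local_status != "unknown":
        return local_status
    remote = query_remote_status()  # e.g. a hypothetical GET /orders/{id}/status
    if remote == "processed":
        return "success"  # remote processed it; only our response was lost
    if remote == "not found":
        return "retry"    # remote never saw it; safe to resend
    return "unknown"      # still undecided; ask again later
```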
An example could be a payment approval. It is sent from one system to the payment operator, but no response from the operator arrives. It is possible that the operator processed the payment and charged the client, but we did not get the response.
In that case, the operator can wait for our confirmation and cancel the operation when it is not received, or ask us about that particular operation. Alternatively, we can send a cancellation request and cancel the payment on our side. The operator would then cancel it on its side, and if the payment never reached the operator, nothing would happen. That way data integrity stays intact. Of course, a longer power outage can break data integrity for a longer time.
To reduce problems coming from unknown statuses (stuck responses), timeouts should be configured correctly along the path the request travels. If multiple applications forward the request, their timeouts must be aligned (or the timeout taken dynamically from the original request).
An example could be a payment request with a 30-second timeout, sent to a gateway application which has a 60-second timeout configured towards the payment provider. After 30 seconds, the original client closes the connection with a timeout error, but 20 seconds later the gateway application actually completes its request to the payment provider.
If the gateway application had, say, 25 seconds to complete its operation, that could never happen, and the data would remain consistent on all ends.
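A sketch of aligning timeouts by budget (the 5-second overhead and 1-second floor are arbitrary assumptions): each hop subtracts its own overhead from the caller’s timeout, so no intermediary outlives the original request.

```python
def downstream_timeout(remaining_budget_s, hop_overhead_s=5.0, floor_s=1.0):
    """Timeout to use for the next hop, given the caller's remaining budget."""
    return max(remaining_budget_s - hop_overhead_s, floor_s)

# The client allows 30 s, so the gateway gives the payment provider only
# 25 s and can still answer the client before the client's timeout fires.
assert downstream_timeout(30.0) == 25.0
```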
How should errors be transmitted? One way, when communicating over HTTP, is to return a 4xx or 5xx error; error details can then be extracted from the response.
Another way is to always return a 200 response, but include a Status field and provide error details in the body.
The second option sounds better to me, as the error handling path on the client is then reserved for client-side errors only, while all server errors follow a different path.
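A possible shape for such an always-200 envelope (the field names are my assumption, not a standard): transport-level failures stay on the client’s error path, while business errors travel in the Status field.

```python
import json

def make_response(status, data=None, error_code=None, error_message=None):
    """Build an always-HTTP-200 response body with an explicit Status field."""
    return json.dumps({
        "Status": status,            # "OK" or "Error"
        "Data": data,
        "ErrorCode": error_code,
        "ErrorMessage": error_message,
    })
```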
For packages of data, there should always be a shared understanding of how both ends handle an error in the middle of package processing. E.g. if we processed 15 of 20 items and failed at the 16th, should we save the 15 and report an error? Or try to process the remaining ones and report errors only for those that failed? Will the other party correctly mark the items on their end? A great thing to do in these cases is to test the agreed way of failing packages.
Error event IDs
It may be useful to send the other party a unique error event ID, so that when a specific case is investigated, it is done not only by the general error message and a somewhat accurate timestamp, but by a specific ID.
For example, an item was not uploaded correctly, producing event ID 1445. That ID is logged to the log file/database with a timestamp, a stack trace and all other details. When the other party asks what happened, they provide the ID. A similar error might occur two seconds later, and without a specific ID it would be hard to investigate.
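A minimal sketch of attaching such an event ID (the short-ID format is an arbitrary choice): the ID goes both into the local log entry and back to the other party.

```python
import logging
import uuid

def report_error(logger, message, **details):
    """Log an error under a unique event ID and return the ID to the caller."""
    event_id = uuid.uuid4().hex[:8]  # short random ID, unique enough per incident
    logger.error("event=%s %s details=%s", event_id, message, details)
    return event_id  # sent to the other party for later investigation
```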
Another option is to assign a request ID to each request, to track its progress.
Sometimes it is easier to handle an edge case with a simple error message like “Operation error 49 – contact Support” and describe it in the design document. If the case occurs once in five years, just make sure to catch it, without fully handling it.
All hands on deck
During integration testing – E2E testing, where all components of all parties are used – all parties should be present, or at least review the test results (data in, data out, logs, screenshots, etc.). Without one party present, the other parties can make assumptions or look only at their own components and declare E2E testing complete and successful, while there was a bug in one component: the response was successful, but no data had been saved. There was simply nobody to check that.
All hands destroy the API
Integration testing in an E2E scenario gives a unique opportunity to realistically simulate error and offline scenarios. Each party, having control over its own components and the data it sends, can simulate wrong input, a server being offline, connectivity errors, or database errors.
Testing error scenarios is just as important as testing the success path. If the error path is not tested, it is possible that, for example, a request fails but the application still shows a success message. That does not sound so bad until the example becomes a credit card payment, or an order for production of a complicated, luxury item. A card payment not taken (or charged twice), or a factory not getting the data to produce a part on time (or producing two parts when only one is required), are the results of not testing error scenarios.
I have faced one major failure where many of the principles described in this article were not followed, with no error testing on top of that (all packages were marked as a success, regardless of their actual state).
One of the agreements can be around testing tools. If there is a problem with a client application (a general request issue or performance), an external tool can be agreed on as a benchmark, e.g. SOAP UI with a certain version number.
Another agreement, for performance, should be to measure it using logs along the whole path a request travels (client click, request start, leaving the client, load balancer, server arrival, application arrival, processing started, processing finished, and the response all the way back).
Generating a documentation schema for two hundred fields can take time, so if possible, find a tool to automate the job. It will be useful again after changes to the schema, when a new document is required.
There are also tools to convert JSON data examples into C# classes, e.g. json2csharp.com.
When relying on external services, they may stop working during the crucial QA test phase, forcing a delivery delay. Downtime can happen during development as well, even multiple times and for different reasons (network, power, or hardware on either our end or the service’s).
At least for development purposes, a mock server or mock client can be created to simulate the other party. When the service is offline, we can switch on our internal one for internal E2E tests, with configuration prepared to simulate success or error responses from the service.
That way we can also work around long processing times on the service side, or the lack of the responses or data we need (assuming the other party is developing its solution simultaneously).
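One way such a mock could look (the service name and response shapes are invented): a switchable mode simulates success, a business error, or a connectivity failure.

```python
class MockPaymentService:
    """Stand-in for the real external service during internal E2E tests."""

    def __init__(self, mode="success"):
        self.mode = mode  # "success", "error" or "timeout"

    def charge(self, amount):
        if self.mode == "timeout":
            raise TimeoutError("simulated connectivity failure")
        if self.mode == "error":
            return {"Status": "Error", "ErrorCode": 500}
        return {"Status": "OK", "Amount": amount}
```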
And unit tests
As for unit testing, it can be an alternative or a complement to the mock server/client. We can prepare test requests and responses and use them without any dependency on the external service.
With automated tests, there is no need to click through anything manually, and integration testing usually brings a lot of cases to cover.
Processes for the future
Is the system going to be changed without clients / other parties informed?
That is important information for other parties, so they can establish a way to stay up to date with recent versions of the API. Sometimes even small changes, even ones not related to the schema, may have some impact. Also, other parties may be required to retest after a major update, to ensure everything still works correctly.
Have data contract available online, preferably automatically updated
That way, any change in a field name, type, being required, maximum length, will be automatically reflected in the online documentation.
To ensure other parties use the same data contract, versioning can be implemented. When we update our schema, its structure can be fingerprinted by a checksum, build number or timestamp.
Our own application should then fail if the automatically derived version does not match the manually maintained one, so a developer cannot forget to update the documentation and the version field.
Any request whose header carries an outdated schema version will then be rejected.
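A sketch of checksum-based schema versioning (the truncated SHA-256 format is my assumption): any change to the contract changes the version, and stale request headers can be rejected.

```python
import hashlib
import json

def schema_version(schema: dict) -> str:
    """Derive a version string from the schema content itself."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

def check_request(header_version: str, current_schema: dict) -> bool:
    """Accept a request only if it was built against the current schema."""
    return header_version == schema_version(current_schema)
```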
Also, dictionaries (e.g. error numbers, translations, order types, statuses, etc.) are part of the data contract, so they should be monitored as well.
Versioning can be applied to certain parts of the system, as there is no need to reject Item requests while only the Customer data changed.
Logs in place
Having logs is obvious. For some period of time, requests and responses can be logged (and preferably encrypted), but once performance issues arise, one element of the logs turns out to be important.
Data exchanged between systems, e.g. sent requests, should be timestamped to the millisecond. If we measure only in seconds, there is no difference between 0.01 s and 0.99 s. Time types are not rounded automatically, and for a good reason: rounding could make an action performed in one year appear as performed in another, depending on the system and the rounding used.
So with truncation to whole seconds, the same reported difference can hide very different actual times – just look at the examples:
- start date 1:33:15.9999 and end date 1:33:16.0001 -> 1 s when comparing whole seconds, 0.0002 s actual time
- start date 1:33:15.0001 and end date 1:33:16.9999 -> 1 s when comparing whole seconds, almost 2 s actual time
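The two examples above can be reproduced in a few lines (a sketch; the helper names are mine): measuring in whole seconds makes a 0.0002 s call and an almost 2 s call look identical.

```python
from datetime import datetime

def elapsed_s(start, end):
    """Difference in whole seconds, as a second-precision log would show it."""
    return int((end.replace(microsecond=0) - start.replace(microsecond=0)).total_seconds())

def elapsed_ms(start, end):
    """Actual difference in milliseconds."""
    return (end - start).total_seconds() * 1000

a1 = datetime(2024, 1, 1, 1, 33, 15, 999900)  # 1:33:15.9999
b1 = datetime(2024, 1, 1, 1, 33, 16, 100)     # 1:33:16.0001
a2 = datetime(2024, 1, 1, 1, 33, 15, 100)     # 1:33:15.0001
b2 = datetime(2024, 1, 1, 1, 33, 16, 999900)  # 1:33:16.9999

# Both pairs report "1 second" at second precision...
assert elapsed_s(a1, b1) == 1 and elapsed_s(a2, b2) == 1
# ...but the actual times differ by four orders of magnitude.
```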
It may also be useful to log each status change with a timestamp (in progress, success, error).
The human aspect, system configurations
Having documentation in place does not mean it describes how the system actually works, and it does not replace testing.
Documentation is often outdated, as the usual process is to deliver a new version first, then create documentation and share it with other parties. It may be useful to expose the main API version somewhere (a header, an online page, etc.) so other parties can compare the current version with the documentation they have.
Different environments can run different versions of the various components, making issues harder to reproduce and fix. This is a general rule, but with integrations it becomes even more problematic.
To handle many of the environmental problems, it may be good to have a script for each environment which updates the configuration in one step.
Applications should handle the case where the configuration has not been filled in, and not blindly call https://PLACEHOLDER and throw some general error. The error should be very specific.
When multiple parties are involved, each of them will say “our component works fine”, while clearly something is wrong. It is best if all parties treat the integrated system as their own system, share their technical details, and suggest solutions without fear of stepping over the line.
If there is too much politics, a Product Owner should step in to resolve the stalemate.
The same applies in the design phase: sharing technical details, suggesting better ways for other parties to handle their tasks, and requiring detailed documentation and high-quality work.
In the design phase, a party stuck on a big problem should always contact the others. Sometimes a change in the client logic or UI is very simple, while the logically equivalent change in the API would be big trouble (one hour versus one day, or one day versus one week of effort). The party with the easiest path to the solution should implement it.
- Use temporary tables for bulk upload (validation on the client side).
- Send small amounts of data (JSON format for requests, no data duplication).
- Balance the load – do not send millions of items at the same time (either as one package or as a million async requests with one item each).
- Update only changed fields (this requires additional logic for field update timestamps and data storage).
Data transfer failures
Error handling and tests
Always up to date
Ready for production