Open Workout Challenge: How we tackled a hefty flow of transactions, dozens of restrictive rules, and synchronous operation, teaching Smart Routing to avoid unnecessary ‘push-ups’.
In our system, a default timeframe of 30 seconds is allocated for a transaction to start and finish, meaning it should obtain a final status of “Successful” or “Declined.” The transaction begins when the API request initiating it enters our payment gateway. From the payment gateway, the transaction passes through several modules: first Smart Routing, then the connector, and finally the ultimate acquirer or PSP, from which we expect the final response. The 30 seconds cover this entire path, from the moment the API request enters our system to the moment we receive the final response from the acquirer or PSP.
These 30 seconds are necessary to avoid creating queues in synchronous integration mode. While awaiting a response from the acquirer or PSP, we maintain a network connection with them. The number of simultaneous connections is always limited, so if connections were held open indefinitely, we would run out of network sockets, nobody else could connect, and the service would become unavailable. Such a time limit is common practice when the acquirer or PSP does not support asynchronous integration and our system interacts with the acquirer’s payment gateway in synchronous mode. That is exactly the scenario described here.
Synchronous mode is when we establish a network connection with the other party and keep it throughout, starting from the first request to the server until receiving the final result. The server’s number of simultaneous connections is always limited by its hardware and software configuration. Therefore, situations cannot be ruled out where the limit on simultaneous connections is reached, and if someone tries to contact us at that moment, they will receive a service denial because opening another connection won’t be possible.
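The connection limit described above can be illustrated with a minimal sketch. It assumes a hypothetical `ConnectionPool` class and a limit of two slots; neither comes from our system, they only show how a caller is refused once every slot is held by a synchronous call.

```python
import threading

class ConnectionPool:
    """Hypothetical fixed-size pool of connection slots (illustration only)."""

    def __init__(self, max_connections: int) -> None:
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self) -> bool:
        # Non-blocking: returns False when every slot is already held.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Called when a synchronous exchange ends or times out.
        self._slots.release()

pool = ConnectionPool(max_connections=2)
results = [pool.try_acquire() for _ in range(3)]  # the third caller is refused
pool.release()                                    # one synchronous call finishes
after_release = pool.try_acquire()                # a slot is available again
```

The longer each synchronous call holds its slot, the sooner new callers are refused, which is why a hard time limit per transaction matters.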
Asynchronous mode is a mode of interaction in which we send a request and do not keep the connection open. Instead, we wait for a callback, i.e., a notification about the result. This mode is intended for highly loaded systems: the request is sent and the connection is closed, freeing up resources for other requests to the server. We expect the other party, once it has processed our request, to call the address we specified, deliver the result, and close the communication channel.
For high-load systems, asynchronous mode is advantageous, as it optimizes resource usage. Usually, 30 seconds in synchronous integration mode is more than enough because the processing of any transaction takes an average of no more than 10 seconds. However, if the acquirer (in our case, the PSP) takes a long time to respond, and we cannot obtain the final status within 30 seconds, the transaction in our system remains in the “Incomplete” status. Later, it can be manually assigned a final status after reconciliation with the acquirer and/or PSP.
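A minimal sketch of this timeout behavior follows. It is not our gateway code: `run_with_deadline`, `fast_psp`, and `slow_psp` are hypothetical names, the sketch only times the PSP call rather than the whole path through the gateway, and the delays are scaled down so the example runs quickly.

```python
import concurrent.futures
import time

DEFAULT_TIMEOUT_S = 30.0  # the default budget described in the article

def run_with_deadline(call_psp, timeout_s=DEFAULT_TIMEOUT_S):
    """Return the PSP's final status, or "Incomplete" if the deadline passes."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call_psp)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # We stop waiting here; the PSP may still finalize the payment on its side.
        return "Incomplete"
    finally:
        executor.shutdown(wait=False)  # do not block on the abandoned call

def fast_psp():
    return "Successful"

def slow_psp():
    time.sleep(0.5)  # scaled-down stand-in for a response that arrives too late
    return "Successful"

fast_status = run_with_deadline(fast_psp, timeout_s=2.0)
slow_status = run_with_deadline(slow_psp, timeout_s=0.05)
```

Note that in the slow case the PSP still returns a success after we have stopped waiting, which is why “Incomplete” must not be read as a final status.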
Our system’s tenant faced an issue with a PSP (specifically a PSP, not a bank acquirer) that did not support asynchronous mode. The tenant had a major merchant in their system who, for some reason, interpreted our “Incomplete” statuses as final unsuccessful statuses. Instead of investigating why the transactions were hanging, the merchant cascaded them to the next external acquirer (not to our tenant). As a result, the buyer was debited twice. The transaction had been finalized with the first PSP, but since we terminated the communication channel after 30 seconds, we were unaware of it. Because the merchant misinterpreted the status, the cascaded transaction went through and was successfully finalized with the second external acquirer as well.
The “Incomplete” status is an interim status that a transaction receives immediately after creation in the system and holds until it obtains the final status (“Successful” or “Declined”). While a transaction has the “Incomplete” status, the merchant does not know whether the funds have been debited or not. It is therefore not advisable to cascade a transaction while it has the “Incomplete” status. It is necessary to wait for the final status or contact the PSP to find out why the transaction did not receive one. This major merchant, however, believed that “Incomplete” equaled “Declined.”
We started investigating why the system did not complete the transaction processing within 30 seconds. The response logs on the PSP side showed that it responded to us in 18 seconds and even provided a final successful response. In theory, our tenant’s system should have completed the transaction, but for some reason, it did not. We examined where the transaction was delayed for more than 12 seconds and found that it was delayed in our Smart Routing. Its processing on our side took almost half of the time allocated for transaction processing. This is very long, considering that Smart Routing should handle all rules in milliseconds.
We continued to study the case and discovered that our tenant had created 30 rules at the PSP (system-wide) level. These rules were applied to every transaction coming from any merchant. They were aggregative, restricting rules of the form: “If the sum of transactions is greater than N for the current month where: transaction currency = A, transaction type = payment, transaction status = success, then decline.” Our tenant, a PSP, did not want the total of successful payment transactions passing through its system to exceed a monthly amount that it set for itself in each currency. Therefore, it created such a restricting rule for each of the 30 currencies it dealt with.
However, the logic of Smart Routing is such that first, all calculations necessary to check the rule are performed, and then the transaction is checked against this rule. The correspondence of the transaction currency to the currency specified in the rule is checked during the transaction check against this rule, that is, after all necessary preliminary calculations have been performed.
As a result, each time a transaction entered the system, the system calculated, in real time, the total amount of successful payments for each of the 30 currencies at that moment. This process was repeated with each new incoming transaction. From a hardware resource perspective, these were quite “heavy” rules. For each rule, the system accessed the database, collected all data for the rule’s currency from the beginning of the month, and performed an aggregation calculation. And there were 30 currencies, each with its own rule.
An aggregation calculation is one in which the system computes a value over the set of previous payments, such as their number or their total amount.
We saw that for each transaction, the system accessed the database 30 times to calculate how many payments had been made up to the moment of processing. The merchant had more than a million transactions per month (and overall, our system processes tens of millions of transactions per month), so each of these queries had to select the merchant’s successful transactions from a very large set, and this was done for all 30 currencies. However, the calculation was needed only for the transaction currency.
For example, if the transaction is in euros, the system only needs to calculate and check whether the tenant has exceeded 10 million euros in the current month. However, due to the Smart Routing logic described above, calculations were performed for the transaction currency and for all 29 other currencies, even though it would have been sufficient to apply only one rule out of 30. It was on these operations that we lost more than 12 seconds.
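The evaluation order that caused the problem can be sketched as follows. This is an illustration under stated assumptions, not the Smart Routing code: `LimitRule`, `Ledger`, and `check_eager` are hypothetical names, the database is replaced by an in-memory list, and the currency codes are placeholders.

```python
from dataclasses import dataclass

@dataclass
class LimitRule:
    currency: str
    monthly_limit: float

class Ledger:
    """In-memory stand-in for the transactions database."""

    def __init__(self, successful_payments):
        self.successful_payments = successful_payments  # list of (currency, amount)
        self.aggregations = 0

    def monthly_total(self, currency):
        self.aggregations += 1  # each call models one heavy aggregation query
        return sum(amount for cur, amount in self.successful_payments if cur == currency)

def check_eager(tx_currency, rules, ledger):
    """Aggregate for every rule first; compare currencies only afterwards."""
    declined = False
    for rule in rules:
        total = ledger.monthly_total(rule.currency)  # runs even for other currencies
        if rule.currency == tx_currency and total > rule.monthly_limit:
            declined = True
    return declined

currencies = [f"C{i:02d}" for i in range(30)]  # 30 placeholder currency codes
rules = [LimitRule(cur, 10_000_000.0) for cur in currencies]
ledger = Ledger([("C00", 4_000_000.0), ("C01", 250.0)])

declined = check_eager("C00", rules, ledger)
```

A single transaction in one currency triggers 30 aggregations here, although only one of them can affect the outcome.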
When the acquirer responded within standard limits, at most 5 seconds, the transaction was still finalized within the allocated time even though Smart Routing took 12 seconds. However, in this specific case, when the acquirer took 18 seconds to respond, timeouts started in our system.
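The time budget can be checked with simple arithmetic. The sketch below assumes that Smart Routing time and acquirer response time are the only significant contributions and that the total must stay strictly below the 30-second limit; `finishes_in_time` is a hypothetical helper.

```python
BUDGET_S = 30  # total time allotted per transaction

def finishes_in_time(routing_s, acquirer_s, budget_s=BUDGET_S):
    """True if routing plus acquirer time fits strictly inside the budget."""
    return routing_s + acquirer_s < budget_s

typical = finishes_in_time(routing_s=12, acquirer_s=5)   # 17 s in total
slow = finishes_in_time(routing_s=12, acquirer_s=18)     # 30 s in total
```

With a typical acquirer, 13 seconds of slack remain, which is why the slow rule processing went unnoticed until an acquirer took 18 seconds.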
To address this situation, we added the ability to set a precondition for a rule in Smart Routing so that the system could determine whether to execute this rule (and consequently start calculations of all data needed for checking this rule) or not.
If the tenant wants to set a limit on the aggregated payment amount, they should specify in the precondition of the limiting rule the transaction currency to which the rule applies.
If they do not want to exceed a certain amount in each currency, then a rule of the following type should be created for each currency: “If transaction currency is A, if the sum of transactions is greater than N for the current month where: transaction currency = A, transaction type = payment, transaction status = success, then decline.”
Now, the system looks at the transaction currency in the precondition of the rule, compares it with the currency of the transaction being checked, and if they match, the calculation of the necessary aggregated values needed for further transaction checking under this rule is triggered. Otherwise, the rule is not applied to this transaction, and accordingly, the calculation of the necessary aggregated values is not performed.
Thus, now out of all 30 rules, only one is effectively triggered and processed – the calculation is performed only once for the specific currency of the specific transaction. Smart Routing has stopped performing unnecessary work and processes all aggregative limiting rules in less than a second, as expected.
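The precondition mechanism can be sketched as a cheap check that guards an expensive one. The code below is an illustration, not the Smart Routing implementation: `Rule`, `should_decline`, `make_rule`, and the monthly totals are hypothetical, and the aggregation is a dictionary lookup standing in for a database query.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    precondition: Callable[[dict], bool]  # cheap check on the transaction itself
    aggregate: Callable[[], float]        # expensive query, run only if needed
    limit: float

def should_decline(tx, rules):
    for rule in rules:
        if not rule.precondition(tx):
            continue  # rule does not apply, so its aggregation is skipped
        if rule.aggregate() > rule.limit:
            return True
    return False

calls = []  # records which aggregations actually ran
totals = {"EUR": 10_500_000.0, "USD": 3_000_000.0, "GBP": 1_000_000.0}

def make_rule(currency, limit):
    def aggregate():
        calls.append(currency)
        return totals[currency]  # stand-in for a monthly SUM query
    return Rule(lambda tx: tx["currency"] == currency, aggregate, limit)

rules = [make_rule(cur, 10_000_000.0) for cur in totals]
eur_declined = should_decline({"currency": "EUR", "amount": 50.0}, rules)
usd_declined = should_decline({"currency": "USD", "amount": 50.0}, rules)
```

Each transaction triggers exactly one aggregation, for its own currency, regardless of how many rules are configured.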
We discovered this problem because a series of circumstances coincided:
- A large flow of payment transactions for our tenant.
- Dozens of limiting rules at the level of the entire PSP (not for a specific merchant or store) for our tenant. With just one or two rules, we wouldn’t have encountered such a timeout, as the processing of Smart Routing rules, even with a large transaction volume, would have taken 1-2 seconds.
- A slow response (more than 15 seconds) from the acquirer or PSP with which our tenant works.
- Interaction with the acquirer in synchronous mode. If communication with the acquirer had been asynchronous, we would have received a final response from it and would have completed the transaction in 35-40 seconds.