Quartz.NET in UiPath
UiPath uses Quartz.NET for triggering different jobs in Orchestrator. The solution works well in a standalone Orchestrator setup. But in a highly available setup, with two or more Orchestrator instances, Orchestrator needs to coordinate which jobs are to be executed and by which node.
Orchestrator uses Quartz's built-in clustering, where the nodes monitor each other. Each node periodically checks in with the cluster, with a default reporting interval of 7.5 seconds in Quartz. If a node has not checked in within those 7.5 seconds, the other nodes assume it has failed and remove it from the cluster.
For reasons that are not yet fully clear to me, or to UiPath, there can be issues causing instability in the UiPath platform where robot requests to Orchestrator suddenly time out.
For example, a robot made a POST to Assets/GetRobotAssetByNameForRobotKey which resulted in an HTTP 500 Server Error and eventually caused the whole process to fail. Whenever we witnessed this happening, we also saw the following Orchestrator event log message:
“[Quartz] This scheduler instance (Node1.636936211131625835) is still active but was recovered by another instance in the cluster. This may cause inconsistent behavior.”
This happens when the other nodes in the Quartz cluster decide to remove a node from the cluster. The mechanism is based on the [QRTZ_SCHEDULER_STATE] table: the [LAST_CHECKIN_TIME] field tells the other cluster nodes when a node last reported itself to be “alive”, and the [CHECKIN_INTERVAL] field, with a default value of 7500 (7.5 seconds), is used by the cluster nodes to determine by when each node should have checked in.
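The detection rule described above can be sketched in a few lines. This is a simplified illustration of the comparison against the two columns, not Quartz.NET's actual internal code (the real implementation has additional nuances, such as accounting for the detecting node's own check-in cadence):

```python
# Illustrative sketch of the Quartz cluster failure-detection rule,
# based on the QRTZ_SCHEDULER_STATE columns described above.
CHECKIN_INTERVAL_MS = 7500  # default value stored in [CHECKIN_INTERVAL]

def is_node_presumed_failed(last_checkin_time_ms: int, now_ms: int,
                            checkin_interval_ms: int = CHECKIN_INTERVAL_MS) -> bool:
    """A node is presumed failed when [LAST_CHECKIN_TIME] is older
    than its advertised [CHECKIN_INTERVAL]."""
    return now_ms - last_checkin_time_ms > checkin_interval_ms

# A node that last checked in 7.4 s ago is still considered alive:
print(is_node_presumed_failed(last_checkin_time_ms=0, now_ms=7400))  # False
# But a node only 100 ms past the interval is removed from the cluster:
print(is_node_presumed_failed(last_checkin_time_ms=0, now_ms=7600))  # True
```

Because nodes are also told to check in exactly every CHECKIN_INTERVAL milliseconds, any delay at all pushes them past this threshold.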
The PROBLEM here is that the same CHECKIN_INTERVAL is used both by the cluster to check when a node should have reported in, and to instruct that node WHEN to report to the cluster.
Now, if one of the Orchestrator nodes is late checking in, it is deemed to be down and removed from the cluster. The value can be raised, for example from 7500 to 15000 (15 seconds), but then the other nodes check whether each node has reported in within 15 seconds while the Orchestrator nodes are also instructed to report in every 15 seconds, so there is still 0 ms of allowed latency. This can lead to inconsistent behavior in Orchestrator. NOTE: We know this can be set, and we have tested and verified the finding using the configuration key quartz.jobStore.clusterCheckinInterval.
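For reference, raising the interval looks roughly like this. The key name is the standard Quartz.NET AdoJobStore property; whether and where Orchestrator picks it up (we used web.config appSettings in our on-premises install) may vary by version:

```xml
<!-- Illustrative web.config fragment; value is in milliseconds. -->
<appSettings>
  <add key="quartz.jobStore.clusterCheckinInterval" value="15000" />
</appSettings>
```

As explained above, this only moves the threshold; it does not create any slack between when nodes report in and when they are expected to have reported in.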
SO, the issue is that there is no allowance at all for even minor latency (caused by load, Azure networking, or multi-region failover replication). It looks like this is something of a known issue: https://stackoverflow.com/a/48705751
My proposal is to add an additional value: one value would inform nodes WHEN TO REPORT IN, and the other would define WHEN A NODE SHOULD HAVE REPORTED IN. For example, a CHECKIN_INTERVAL of 7500 (7.5 seconds) would tell the Quartz cluster nodes by when each machine should have reported in, and a second value, REPORTING_INTERVAL, of 7000 would tell the nodes to report in every 7 seconds. That would leave a 500 ms grace period for nodes to report in, without taking them out of the cluster and causing processes to fail.
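The proposed split can be sketched as follows. REPORTING_INTERVAL is a hypothetical field that does not exist in Quartz today; the point is simply that decoupling the two values creates tolerance for small delays:

```python
# Sketch of the proposed two-interval scheme (illustrative names,
# not real Quartz fields).
REPORTING_INTERVAL_MS = 7000  # hypothetical: how often a node reports in
CHECKIN_INTERVAL_MS = 7500    # existing: when peers presume it has failed
GRACE_PERIOD_MS = CHECKIN_INTERVAL_MS - REPORTING_INTERVAL_MS  # 500 ms slack

def is_node_presumed_failed(last_checkin_time_ms: int, now_ms: int) -> bool:
    """Peers still expire a node against CHECKIN_INTERVAL, but since the
    node reports in more frequently, minor latency no longer evicts it."""
    return now_ms - last_checkin_time_ms > CHECKIN_INTERVAL_MS

# A check-in delayed by 300 ms (e.g. a brief SQL or network hiccup)
# would no longer get the node removed from the cluster:
print(is_node_presumed_failed(0, REPORTING_INTERVAL_MS + 300))  # False
# Only a node more than 500 ms late would actually be expired:
print(is_node_presumed_failed(0, CHECKIN_INTERVAL_MS + 100))    # True
```

With today's single interval, both constants are the same value, so the grace period is always zero.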
These delays can occur for several reasons in a cloud environment. For example, with Azure SQL Database, which is a PaaS offering from Azure, we cannot control the clock of the SQL cluster, and even with servers using the same NTP source there can be small time differences. SQL blocking or deadlock events can cause delays. Network latency can also be an issue when the availability cluster spans different Azure regions with SQL replication between them.
Unfortunately the only advice I’ve gotten from UiPath is to “raise the SQL performance”.
I have always hated it when software issues (in a product that should work as specified by its own hardware requirements) are to be “fixed” by purchasing more capacity than is needed, just because of poor architecture design. Customers want to utilize cost-effective cloud PaaS capacity for the backend components and focus on the automations, and as they are also buying licenses for the software, they expect more than a recommendation to buy more capacity. BUT, that is all we have for now: run a highly available single web instance, OR buy two web instances and an overly powerful SQL tier to work around this issue.