Hi there, I am experiencing an issue in AX 2012 where a batch job that I've written seems to hang indefinitely occasionally. For a bit of context, my batch job is posting around 500 journals. This batch job is run weekly and takes around 6 to 8 hours while it sequentially posts all of the journals. I recently rewrote the batch job so to use multiple batch tasks so that multiple journals can be posted in parallel and had good success getting this process down to around 1 hour. I'm using the top picking technique described here: https://www.linkedin.com/pulse/batch-parallelism-d365-finance-operations-valentyn-lysenko
Each batch task performs a pessimisticlock select with readpast on a custom table that holds the list of journals that need to be posted, posts the journal, and updates the status in the custom journal table to posted.
A large majority of the time this process has worked great and the time to post our weekly batch of journals is greatly reduced. However, a few times during the testing, we've experienced an issue where the batch job hangs indefinitely. Using some database monitoring tools like sp_who2 I can see that a SPID for one of the batch tasks is blocking all the rest of the tasks but for some reason that SPID is in a /sleeping/ status and has a command of /AWAITING COMMAND/. This is not a deadlock, it's just that the other batch tasks are waiting for the first batch task to complete it's transaction but the first batch task is sitting there doing nothing.
I've talked to my database administrators and they do not have any suggestion on what could cause this. One of the times this happened, we killed the sleeping SPID. This threw an exception to the batch task and the rest of the tasks became unblocked and were able to complete their processing. We have run this updated batch process in a test environment about 50 times and this sleeping SPID issue has happened 3 times. Does anyone have any suggestion on what may be causing this?
This post seems to suggest that the caller (the AOS in this case?) may have had a client side command time out which left the transaction open so SQL will wait indefinitely for a response but since the client connection failed, it will never call back to close out it's transaction: https://dba.stackexchange.com/questions/41709/sleeping-spid-blocking-other-transactions
If this were the case, wouldn't a SQL error like a command time out be thrown to the batch job and cause an error in the batch? Does anyone have any suggestion on how to troubleshoot an issue like this? ​
Thank you for your reply! To ensure a great experience for everyone, your content is awaiting approval by our Community Managers. Please check back later.