Calling all SQL Performance Experts! If you want a challenge, this is it! This is a LONG read, but I believe you should have everything you need to provide recommendations. The TL;DR version is at the bottom
I have a 24/7 manufacturing customer that has recently started to experience massive/crippling locking timeouts + horrendous posting times. I'll setup the scenario for all to see, and then I'll discuss the details:
- NAV 2009 R2 Classic with SQL Option
- SQL Server 2008 R2, Version 10.50.2500.0. 32GB RAM allocated to SQL Server.
- Fully VMware environment. 1 server for NAV DB on SQL, 1 server for RDP.
- 23 concurrent users. 50% through RPD login, 50% through local clients.
- Database is about 500 GB on Drive D, with 20% free space. Log file is always truncated on Drive E.
- Only one TempDB database exists, and is also on Drive D.
- RAID 10 configuration, 5400RPM drives.
- Regarding Fragmentation: The highest fragmented index larger than 100 MB is Item Ledger Entry$16, at 18.5%
- Largest tables: GL Entry: 61GB, Value Entry: 54GB, Prod. Order Comp.: 27GB, Prod. Order Routing Line: 24GB, Item Ledger Entry: 17GB. Then a huge drop-off in sizes.
So now the scenario:
Cost adjustment is scheduled to be run every weekend at specific known times. During those times, users are aware that they cannot ship/invoice anything, nor can they post journals, because of the severe likelihood they'll be locked out by the cost adjustment process. Once cost adjustment is finished/stopped, a scheduled routine that posts all "declared" consumption and finished goods populates the consumption and output journals and then proceeds to post them. Over time, this auto-posting routine has gotten slower and slower, to the point where now it *may* post one journal line per second. This posting performance is steady throughout the week, but more pronounced after cost adjustment since the auto-posting at that point needs to catch up over the many hours it hasn't run while cost adjustment was running (we use the NAS for scheduled jobs).
While the production floor can safely "declare" their finished products without locking, just this week they started receiving locking timeouts on the Serial No. Info. table while the auto-posting routine was happening. This is unusual and critical because it prevents finished goods from being on hand.
So this is a MAJOR show-stopper issue. I did some digging into it and I found out that while the auto-posting routine is running, it is frequently flagged with a PAGEIOLATCH wait type in SQL. Checking the Resource Monitor, this wait type is corresponding in MASSIVE data read surges with little to no write activity on the DB NDF file. We're talking about reading 20MB/sec for several minutes if not hours in a row. So while this PAGEIOLATCH is being flagged on BOTH cost adjustment and auto-posting routines, other users on the production floor are being blocked by the former processes through a LCK_M_U wait type. This locking wait type is occurring on several tables, but primarily the Serial No. Info table (as explained above).
After some digging I settled on the DBCC TRACEON(1224) flag recommendation by Jorg, since it was not enabled. This IMMEDIATELY solved the problem and we do not receive any more locking errors on the Serial No. Info table.
However, crappy DB performance still remains. As I'm writing this I'm looking at the Resource monitor, and I see high read I/O but almost 0% CPU activity on NASSQL.exe and SQLSERVER.exe. And this...goes on and on for hours, until suddenly the Read I/Os drop down to almost zero, and then database gets into a rhythm of writing about 30KB/sec to the LDF and NDF files, while NASSQL.exe and SQLSERVER.exe are between 5-20% CPU utilization.
We're coming up on year-end and we haven't ran a full cost adjustment in 3 weeks due to lack of time and the locking timeouts mentioned above. Above all, right now we're trying to make sure there are no bottlenecks in our processing, but I keep seeing the Disk I/O as the biggest problem. I have repeatedly told the customer we need to upgrade their Disks, which are more than 5 years old, as well as to do full index rebuilding on their primary tables (GLE, VE, ILE, IAE), but the customer shot it down because putting the production floor on hold is not an option. So between now and the Christmas break there may not be a single down day, including the Christmas break (they work through all holidays!).
At this point I'm at a loss of what to do to make auto-posting faster. I've tried everything I can think of, except of course doing hardware upgrades and rebuilding indexes (as I said, customer can't have downtime). I need suggestions on what I can consider now, or if anyone can guide me to the true source of this problem.TL;DR version
: Are there any NAV/SQL performance consultants in the Greater Toronto Area that want to come onboard and help solve this problem?