Calling all SQL Performance Experts! If you want a challenge, this is it! This is a LONG read, but I believe you should have everything you need to provide recommendations. The TL;DR version is at the bottom
I have a 24/7 manufacturing customer that has recently started to experience massive/crippling locking timeouts + horrendous posting times. I'll setup the scenario for all to see, and then I'll discuss the details:
- NAV 2009 R2 Classic with SQL Option
- SQL Server 2008 R2, Version 10.50.2500.0. 32GB RAM allocated to SQL Server.
- Fully VMware environment. 1 server for NAV DB on SQL, 1 server for RDP.
- 23 concurrent users. 50% through RPD login, 50% through local clients.
- Database is about 500 GB on Drive D, with 20% free space. Log file is always truncated on Drive E.
- Only one TempDB database exists, and is also on Drive D.
- RAID 10 configuration, 5400RPM drives.
- Regarding Fragmentation: The highest fragmented index larger than 100 MB is Item Ledger Entry$16, at 18.5%
- Largest tables: GL Entry: 61GB, Value Entry: 54GB, Prod. Order Comp.: 27GB, Prod. Order Routing Line: 24GB, Item Ledger Entry: 17GB. Then a huge drop-off in sizes.
So now the scenario:
Cost adjustment is scheduled to be run every weekend at specific known times. During those times, users are aware that they cannot ship/invoice anything, nor can they post journals, because of the severe likelihood they'll be locked out by the cost adjustment process. Once cost adjustment is finished/stopped, a scheduled routine that posts all "declared" consumption and finished goods populates the consumption and output journals and then proceeds to post them. Over time, this auto-posting routine has gotten slower and slower, to the point where now it *may* post one journal line per second. This posting performance is steady throughout the week, but more pronounced after cost adjustment since the auto-posting at that point needs to catch up over the many hours it hasn't run while cost adjustment was running (we use the NAS for scheduled jobs).
While the production floor can safely "declare" their finished products without locking, just this week they started receiving locking timeouts on the Serial No. Info. table while the auto-posting routine was happening. This is unusual and critical because it prevents finished goods from being on hand.
So this is a MAJOR show-stopper issue. I did some digging into it and I found out that while the auto-posting routine is running, it is frequently flagged with a PAGEIOLATCH wait type in SQL. Checking the Resource Monitor, this wait type is corresponding in MASSIVE data read surges with little to no write activity on the DB NDF file. We're talking about reading 20MB/sec for several minutes if not hours in a row. So while this PAGEIOLATCH is being flagged on BOTH cost adjustment and auto-posting routines, other users on the production floor are being blocked by the former processes through a LCK_M_U wait type. This locking wait type is occurring on several tables, but primarily the Serial No. Info table (as explained above).
After some digging I settled on the DBCC TRACEON(1224) flag recommendation by Jorg, since it was not enabled. This IMMEDIATELY solved the problem and we do not receive any more locking errors on the Serial No. Info table.
However, crappy DB performance still remains. As I'm writing this I'm looking at the Resource monitor, and I see high read I/O but almost 0% CPU activity on NASSQL.exe and SQLSERVER.exe. And this...goes on and on for hours, until suddenly the Read I/Os drop down to almost zero, and then database gets into a rhythm of writing about 30KB/sec to the LDF and NDF files, while NASSQL.exe and SQLSERVER.exe are between 5-20% CPU utilization.
We're coming up on year-end and we haven't ran a full cost adjustment in 3 weeks due to lack of time and the locking timeouts mentioned above. Above all, right now we're trying to make sure there are no bottlenecks in our processing, but I keep seeing the Disk I/O as the biggest problem. I have repeatedly told the customer we need to upgrade their Disks, which are more than 5 years old, as well as to do full index rebuilding on their primary tables (GLE, VE, ILE, IAE), but the customer shot it down because putting the production floor on hold is not an option. So between now and the Christmas break there may not be a single down day, including the Christmas break (they work through all holidays!).
At this point I'm at a loss of what to do to make auto-posting faster. I've tried everything I can think of, except of course doing hardware upgrades and rebuilding indexes (as I said, customer can't have downtime). I need suggestions on what I can consider now, or if anyone can guide me to the true source of this problem.TL;DR version
: Are there any NAV/SQL performance consultants in the Greater Toronto Area that want to come onboard and help solve this problem?
I'll give that a try. I want to check the cache hit-ratio to see if it's low. What's a good ratio range to be at? And how do I check it and/or log it?
Although, I have my doubts that it's RAM. As I said, lots of data read I/O, and then shortly after that, just a trickling of data (and this goes on for hours).
G.R. & Associates, Inc.
Interesting problem indeed, I had similar issue with 360 GB database 5 or 6 years ago.
Installing SSD drives will improve the things but it will not resolve the problem. More memory will not help much.
My approach to resolve this would be.
1. I would install the script explained here
to monitor Blocks and locks.
2. I would keep the script running to to get some useful information.
3. When you look at the result it should provide what tables are conflicting and when. This information can then be used to optimise reports or posting routines in NAV.
4. If the step above does not help, then I would look at Data Compression. This would "delete" old Ledger entries. (Group them)
You can also create a test database and then ask some of the users to run some activities in the TEST database in order to collect information. All optimisation can be done tested in this TEST database before deploying into Production.
By the way, your statement already identifies the problem very nicely.
it is frequently flagged with a PAGEIOLATCH wait type in SQL. Checking the Resource Monitor, this wait type is corresponding in MASSIVE data read surges with little to no write activity on the DB NDF file.
Very often in Reports, you can find statement like
where one record is read for every record from e.g. Value entry table.
In most cases if you change the report you can improve the performance dramatically.
I hope this helps.
Max RAM set, default index fill factor set, MaxDOP, cost threshold, is the index being rebuilt often are wait stats being updated daily?
also large number of reads leading to a Pageiolatch problem may be due to some parameter sniffing.
Might be a few execution plans that need to be cleared out. see sp_configure or DBCC FREEPROCCACHE.
I'd also highly recommend the free ebook 'Accidental DBA' as well as looking into PAL Microsofts performance analysis of logs tool to get a base line on your servers wait stats and identify trends down the line.
I checked and we currently have about 8GB set aside for the NAV Server, while 32 GB are set for SQL Server: 28GB for Server Memory, 3GB for Cache, and 1GB for Optimizer. As far as I can tell, my cache hit ratio on the Optimizer is 95%, while the Buffer Cache Hit Ratio is 100%. The surprising thing is that these stats above are occurring while the CPU is constantly at 25% utilization on SQLSERVER.exe while only writing about 9KB/sec to the database LDF file! And as I said before, this can go on for hours! There's no reason for this to happen when the system isn't taxed and no other processes are running. Even FINSQL.exe is at 0% CPU utilization in the above scenario.
Am I missing something?
G.R. & Associates, Inc.
I'll try to use the performance toolbox by Jorg. Thanks for the tip. Waiting for him to send me the download link since it's by request only.
As far as blocking goes, I know exactly what tables get blocked, and which snippets of code they fail on. However, once I set traceflag 1224 those issues disappeared. At this point I'm just concerned about the crippling slowdown in item journal posting performance. As I explained in the post above, it really is that bad. But it gets worse...
It is very fast and brisk with posting finished goods (serialized and specific costing method), it is very fast with posting new FIFO raw materials and FIFO lot-tracked material). It is "okay" with Average Cost lot-tracked material, and it is EXTREMELY SLOW with older Average cost non-lot-tracked material (less than 0.2 lines/second).
In the last category (Average cost non-lot-tracked) a massive I/O read occurs, then it goes to zero, replaced with an ever-so-slow 8KB/sec I/O write to the NAV DB log file and TempDB log file. And as I said, this stays like this for hours.
G.R. & Associates, Inc.
Any other ideas?
G.R. & Associates, Inc.
Maybe there is something in there that you can use.
Just a side note: SQL Server 2016 SP1 Standard edition supports 128GB memory for the buffer cache alone. Just saying...