Periodic Audits

In my last full-time job, a mission-critical environment, I performed daily audits of the company's SAN, both of its high-volume, high-availability database servers, and its file server. The failure of any one of them would have shut down production.

Because of the tools and techniques I have perfected, these audits rarely consumed more than 45 minutes per day. They ran longer only when they uncovered something suspicious that warranted investigation.

The audits encompassed several elements:

  1. I executed a small suite of about 15 SQL queries that, at a glance, informed me of the history and state of every critical element, both hardware and software, associated with the company’s 2 SQL Servers. The same queries confirmed that all SQL Agent jobs had executed successfully and reported their elapsed execution times.
  2. I executed several business intelligence queries that were designed to inform company management of the level of traffic on, and utilization of, the company’s servers.
  3. I inspected the Windows logs on all Windows servers (the 2 database servers and the NAS file storage server) for warnings or errors of any type.
  4. I inspected critical hardware and software components of the company’s Compellent SAN. 
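
As an illustration of the SQL Agent check in step 1, here is a minimal sketch in Python. It is not one of the actual audit queries, which are not reproduced in this article. The query text reads the standard msdb job history tables, where rows with step_id = 0 record the outcome of the whole job and run_duration is an integer encoded as HHMMSS. The names JobRun, summarize and failed_jobs are hypothetical, and how the rows are fetched from the server is left to whatever SQL Server client you use.

```python
from dataclasses import dataclass

# Reference query against the standard msdb tables. Rows with step_id = 0
# hold the outcome of the job as a whole rather than a single step.
JOB_HISTORY_SQL = """
SELECT j.name, h.run_date, h.run_time, h.run_status, h.run_duration
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobhistory AS h ON h.job_id = j.job_id
WHERE h.step_id = 0
ORDER BY h.run_date DESC, h.run_time DESC;
"""

# Meaning of sysjobhistory.run_status values.
RUN_STATUS = {0: "Failed", 1: "Succeeded", 2: "Retry", 3: "Canceled", 4: "In progress"}


def duration_seconds(run_duration: int) -> int:
    """Convert msdb's HHMMSS-encoded integer into a number of seconds."""
    hours, rest = divmod(run_duration, 10000)
    minutes, seconds = divmod(rest, 100)
    return hours * 3600 + minutes * 60 + seconds


@dataclass
class JobRun:
    """Hypothetical container for one row returned by JOB_HISTORY_SQL."""
    name: str
    run_status: int
    run_duration: int


def failed_jobs(runs):
    """Return the names of jobs whose last outcome was anything but success."""
    return [run.name for run in runs if run.run_status != 1]


def summarize(runs):
    """Build one report line per job, flagging anything that did not succeed."""
    lines = []
    for run in runs:
        status = RUN_STATUS.get(run.run_status, "Unknown")
        flag = "" if run.run_status == 1 else "  <-- investigate"
        lines.append(f"{run.name}: {status}, {duration_seconds(run.run_duration)}s{flag}")
    return lines
```

Feeding the fetched rows to summarize gives a one-line-per-job view that can be scanned at a glance, and comparing the elapsed seconds from day to day makes a job that is slowly getting slower easy to spot.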

After each daily audit, I composed a simple report showing the state of each server and the SAN, any critical elements in which management was interested, and the levels of traffic and server utilization. 

More than once, this simple daily audit intercepted developing problems, and more than once it prevented catastrophic failures.

For instance, although our SAN was fully alarmed to report any condition out of the ordinary, I once discovered during a routine daily audit that the negative DC voltage from the power supply to an entire bank of 16 high-speed (15K) Fibre Channel disk drives was running 2 volts below normal. This was a disaster waiting to happen. The low voltage was producing extra heat and would eventually have burned out the DC motors driving the disk drives. Had that happened, an entire bank of SAN disk drives could have failed, potentially not only taking the company offline but putting it out of business. The SAN never produced any alerts or alarms, even though it should have. I was able to contact our SAN vendor and schedule replacement of the failing power supply before any damage to the disk drive motors occurred.

This is a perfect example of why automation can never be fully trusted. You must automate, but you must also verify that the automation is working as expected. Both are best practices.

It is also a perfect example of the value, and the return on the investment of time and energy, that proactive server administration and management practices produce.

Contact me to schedule a free consultation on performance tuning your SQL Server.