In my last full-time job, in a mission-critical environment, I performed daily audits of the company's SAN, both of its high-volume, high-availability database servers, and its file server. A failure of any one of them would have shut down production.
Because of the tools and techniques I have perfected, these audits rarely consumed more than 45 minutes per day. They ran longer only when an audit uncovered something suspicious that warranted investigation.
After each daily audit, I composed a simple report showing the state of each server and the SAN, any critical elements in which management was interested, and the levels of traffic and server utilization.
More than once, this simple daily audit intercepted developing problems and prevented catastrophic failures.
For instance, although our SAN was fully alarmed to report any out-of-the-ordinary condition, I once discovered during a routine daily audit that the negative DC voltage from the power supply feeding an entire bank of 16 high-speed (15K) Fibre Channel disk drives was running 2 volts below normal. This was a disaster waiting to happen. The low voltage was producing extra heat and would eventually have burned out the DC motors driving the disk drives. Had that happened, an entire bank of SAN disk drives could have failed, potentially not only taking the company offline but putting it out of business. The SAN never produced any alerts or alarms, even though it should have. I was able to contact our SAN vendor and schedule replacement of the failing power supply before any damage to the disk drive motors occurred.
This is a perfect example of why automation can never be fully trusted. You must automate, but you must also verify that the automation is working as expected. Both are best practices.
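To make the "automate, but verify" idea concrete, here is a minimal Python sketch of an audit check that compares each measured value against its expected value directly, rather than relying on whether the device raised its own alarm. Everything in it is an assumption for illustration: the `Reading` fields, the sensor names, the -12 V nominal rail, and the 0.6 V tolerance are hypothetical and are not taken from any real SAN or vendor tool. In practice the readings would be collected by hand or with whatever interface your hardware provides.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    name: str           # hypothetical sensor label
    nominal: float      # expected value (assumed, e.g. -12.0 volts)
    measured: float     # value observed during the audit
    tolerance: float    # allowed absolute deviation, same units as nominal
    alarm_raised: bool  # whether the device's own alerting fired


def audit(readings):
    """Return (name, deviation, status) for every out-of-tolerance reading.

    A reading that is out of tolerance but raised no alarm is marked
    "SILENT": the automation failed to report it.
    """
    findings = []
    for r in readings:
        deviation = abs(r.measured - r.nominal)
        if deviation > r.tolerance:
            status = "alarmed" if r.alarm_raised else "SILENT"
            findings.append((r.name, deviation, status))
    return findings


# Illustrative data only: a negative rail running 2 volts off nominal
# with no alarm, next to a healthy positive rail.
readings = [
    Reading("bank3_negative_rail", -12.0, -10.0, 0.6, False),
    Reading("bank3_positive_rail", 12.0, 12.1, 0.6, False),
]

for name, deviation, status in audit(readings):
    print(f"{name}: {deviation:.1f} V off nominal ({status})")
# prints "bank3_negative_rail: 2.0 V off nominal (SILENT)"
```

The "SILENT" findings are the ones that matter most here, because they show the alarms themselves cannot be trusted for that component.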
And it is also a perfect example of the value, and the return on invested time and energy, that proactive server administration and management practices produce.