Database integrity is a serious issue. If database integrity is compromised, the ripple effects it can cause may not be realized for days, weeks, or even months. It can infect a database like a slow growing cancer, the scope of the damage constantly spreading and not realized until it's too late for data recovery.
What causes database corruption?
- Primarily, disk I/O system hardware failures.
- Secondly, software bugs in SQL Server itself. SQL Server is one of Microsoft’s best and most robust products but like all software, it has bugs. Software states degrade with time. The longer a server runs without a reboot, the higher the likelihood of encountering a software bug. As a best practice, all servers should be rebooted as frequently as is practical. “Fresh” software has less chance of triggering the execution of a software bug.
Once database corruption is discovered, there are only two solutions:
- There is a way to “repair” a database. I place the word “repair” in double quotation marks because the repair operation comes with a simple warning from Microsoft: it guarantees data loss. Exactly what that loss will be is unpredictable.
- The only other solution is as onerous as a database “repair” operation. It, too, guarantees data loss. If a database has become corrupted, the only option other than “repair” is to restore an older copy of the database which will cause you to lose all of the data since the older, database backup or SAN snapshot was created. Now, the question becomes “How old should the database backup or SAN snapshot be?” One hour? One day? One week? One month? How much data and production time can you afford to lose? How to tell if the older copy of the database backup or SAN snapshot contains the same corruption? The ugly answer to all of these questions is that the process to recovery will be very time consuming and your database will be down while a solution to the problem is executed.
And the business question: How much data and time can you afford to lose and still remain in business?
There are some best practices that are designed to lower and manage risk:
(I say “lower and manage” risk because it can never be eliminated, only mitigated.)
- SQL Server has tools that allow a DBA to perform a procedure to verify the integrity of all logical and physical database structures, to certify the integrity of your databases. This procedure, once executed to completion, guarantees a clean bill of health for your databases if it produces no error messages. But the procedure has a very undesirable side-effect because it must read the entirety of all of your databases. If executed in a OLTP (online transaction processing environment), the procedure can severely and negatively impact production, especially if your server has large databases. Realistically, the procedure must be executed during non-production hours, or on just small databases, or on a machine that is not in production that can afford to execute it (you can restore a database backup to an alternate machine, not in production, and execute the procedure there). To have a viable database integrity verification process will require a corporate commitment to execute the procedure on a regular basis, with the emphasis being on how much data (in days) the company can afford to lose. For instance, if your company can only afford to lose one day’s worth of data, the procedure must be executed daily.
- All database backup files can and should be verified for checksum integrity during the database backup process, although this is not a substitute for the tools mentioned in item #1. This verification process only verifies the checksums of the individual database backup records but this process at least guarantees that your database backup file can be read without finding checksum errors, which would render it useless. Until SQL Server 2008, this option was not available through the GUI used to create database backup jobs. A knowledgeable DBA will know how to write a SQL script to verify the checksums of database backup file records during the database backup process.
- There is a place in SQL Server that can be examined every few seconds to see if a database page checksum error exists anywhere on your server. (A page checksum error is a hard error and means that a particular database page on disk is now corrupt.) If any data is ever stored in this location, it is a company emergency. It is the first indication of database corruption. A SQL Server Agent job can be implemented to execute this examination procedure and immediately notify a DBA and/or other administrators of a discovered error. The process of examination takes less than a millisecond, so executing the procedure frequently has no impact on server efficiency or operations.
- Reboot your database servers as frequently as is practical.
- Have your System Administrator and/or SAN Administrator perform diagnostics on your disk I/O system hardware as frequently as is practical.
If you need help with performance tuning your SQL Server, contact me to schedule a free consultation.