Data Validation Overview
Data validation is the process of confirming the integrity and completeness of digital information. For digital photographers, data validation generally addresses three questions. Is the archive complete? Did the files transfer properly? And are the files uncorrupted?
Do I really have to do this?
Is the collection complete?
Did the files transfer properly?
Are the files uncorrupted?
MD-5 checksum
Write-once file validation
DNG validation
General file validation steps
Do I really have to do this?
Yes, you really have to do this. To maintain a successful archive, you'll need to validate your data periodically if you want to be sure that everything survives. While any particular file is not in great danger, over the life of an archive there's a good possibility that some silent corruption will creep into your collection. It is important that you validate both primary and backup storage regularly. Proper data validation can spot problems early, before they lead to loss of your image files. Let's look at the components.
Is the collection complete?
This is probably the easiest validation to do, at least for people with a reasonably organized collection. The first step is to have a comprehensive catalog (or catalogs) of the collection. There's simply no way to know if the collection is complete without having some kind of record of what is supposed to be there.
Catalog software has the ability to check on the completeness of an archive, although some programs make it easier than others. Figure 1 shows the Find Missing Items command in Expression Media (now Media Pro). We suggest running a periodic check of the archive to make sure nothing has gone missing.
In order to have a complete catalog, you must first create a primary version of your image collection. If your files are widely scattered and mixed in with backup files, it will be nearly impossible to make a workable comprehensive catalog.
Figure 1 Catalog software can look through your collection in order to spot images that have gone missing.
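The idea behind a missing-items check can be sketched in a few lines of Python. This is only an illustration, not how any particular catalog program works: it assumes the catalog can be exported as a list of file paths relative to the archive's top folder, and the function names `find_missing` and `find_uncataloged` are made up for this example.

```python
from pathlib import Path


def find_missing(catalog_paths, root):
    """Return catalog entries (relative paths) that no longer exist under root."""
    root = Path(root)
    return sorted(p for p in catalog_paths if not (root / p).is_file())


def find_uncataloged(catalog_paths, root):
    """Return files under root that the catalog does not list."""
    root = Path(root)
    known = {Path(p).as_posix() for p in catalog_paths}
    on_disk = {
        f.relative_to(root).as_posix() for f in root.rglob("*") if f.is_file()
    }
    return sorted(on_disk - known)
```

The first function answers "has anything gone missing?", while the second catches the opposite problem: files sitting in the archive that were never cataloged and so could vanish unnoticed.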
Did the files transfer properly?
Missing or corrupted files are most often the result of a transfer error. The best way to prevent transfer errors is to use validated transfers whenever you copy files from one storage medium to another.
Your operating system performs some basic checks when files are copied from one place to another, but it does not do a thorough validation. The only way to be absolutely sure that everything was transferred properly is to perform a bit-for-bit comparison between the original and the copy. While the operating system does not do this, many backup software packages do. Figure 2 shows an example of a file that was corrupted when it was transferred using the operating system.
Figure 2 This image was fine on the media card, but became corrupted when it was copied to a hard drive with a bad bridge board. The operating system did not notice the error. The bad bridge board in the drive enclosure was eventually diagnosed because validated transfers kept flagging transfer errors.
We suggest using a utility to perform a validated transfer whenever you do an important transfer from one drive to another. Most backup software has the ability to do a bit-for-bit compare when copying files.
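For readers curious what a validated transfer involves, here is a minimal Python sketch: copy the file, then compare the copy against the original byte for byte. It uses only the standard library; the function name `validated_copy` is an invention for this example, and real backup utilities may work quite differently.

```python
import filecmp
import shutil
from pathlib import Path


def validated_copy(src, dst):
    """Copy src to dst, then compare the two files byte for byte.

    Raises IOError if the copy does not match the original.
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)  # copies contents plus timestamps where possible
    # shallow=False forces a full content comparison instead of trusting
    # file size and modification time.
    if not filecmp.cmp(src, dst, shallow=False):
        raise IOError(f"Copy of {src} does not match the original: {dst}")
    return dst
```

One caveat with a sketch like this: a comparison made immediately after copying may be answered partly from the operating system's cache rather than from the destination drive, so a dedicated utility is still the safer choice for important transfers.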
Read more about validated transfers in Data Transfer
Are the files uncorrupted?
Unlike completeness and transfer validation, checking on the integrity of individual files in your collection can be a more complicated process. For DNG files, the process can be quite simple, and lets you skip a number of the steps outlined below. Likewise, write-once copies of image files on CD, DVD and Blu-ray discs can be validated with a simple procedure if you take some steps prior to burning the disc. All other files, however, require a more difficult validation workflow. There are steps you can take to check on the files, and they are presented below.
MD-5 checksum
One of the most useful tools in data validation is the MD-5 checksum (or MD-5 hash), a publicly documented algorithm with many free implementations. An MD-5 checksum is made by running all the 1s and 0s of a digital file through an equation that produces a 128-bit number as a result. For practical purposes, that number is unique to the file's contents. This "hash" can be used to tell if even one bit in the file has changed, providing confirmation that the file is exactly the same as when it was first hashed.
Of course, this process will produce a hash mismatch if you've done anything to the file at all, even added a keyword, so it's really only useful for data that you never expect to change. There are two good places to employ this: in the DNG, and in write-once media.
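As a small illustration, Python's standard `hashlib` module can compute an MD-5 hash of a file. The helper name `md5_of_file` and the chunk size are choices made for this sketch; the file is read in pieces so that large image files do not have to fit in memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the same file twice gives the same 32-character result; changing even a single bit of the file, or adding a keyword to its metadata, gives a different one.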
Write-once file validation
If you create a validation hash when you burn your optical discs (or put away additive backups on a hard drive), then you can check on the integrity of all the files by running a hash-checking utility. Any hash mismatch indicates a problem with the storage media that should be tracked down right away.
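The workflow can be sketched in Python as two steps: record a hash for every file before burning, then re-hash and compare later. This is an assumption-laden illustration rather than a replacement for a dedicated utility; the function names and the one-line-per-file manifest layout (hash, two spaces, relative path) are choices made for this example.

```python
import hashlib
from pathlib import Path


def _md5(path):
    """Return the MD5 hex digest of one file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Record an MD5 hash for every file under root (run before burning)."""
    root = Path(root)
    manifest_path = Path(manifest_path)
    lines = []
    for f in sorted(root.rglob("*")):
        # Skip the manifest itself if it happens to live inside root.
        if f.is_file() and f.resolve() != manifest_path.resolve():
            lines.append(f"{_md5(f)}  {f.relative_to(root).as_posix()}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(root, manifest_path):
    """Return relative paths that are missing or whose hash no longer matches."""
    root = Path(root)
    problems = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = root / rel
        if not target.is_file() or _md5(target) != expected:
            problems.append(rel)
    return problems
```

An empty list from `verify_manifest` means every recorded file is present and unchanged. Keeping a copy of the manifest somewhere other than the disc itself guards against the manifest being damaged along with the files.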
Read more in Write-once Media Validation
DNG validation
The DNG file format offers a special opportunity to use an MD-5 hash with added flexibility. The source image inside a DNG file is never supposed to change, even though metadata can be added to the file itself, and the embedded preview can be updated to show the current file adjustments. As of the DNG specification 1.2, an MD-5 checksum has been placed inside all DNGs. This hash refers only to the source image, and can provide a reliable report of file integrity, even as the file is reworked.
It is therefore possible to validate an entire collection of DNG files without having to do any visual inspection.
Figure 3 The DNG converter can show three results on conversion. Converted means the file was successfully parsed and converted. The second message indicates a checksum mismatch. The third message is generated when the converter can't open the file at all.
General file validation steps
If you want to determine the condition of files that are neither DNG nor on write-once media, there are several steps you can take. They are listed here from easiest to most time-consuming.
Volume and directory validation
The first step for general validation of image storage is to check the media for problems with directory or volume corruption. If these check out fine, you know that the basic organization on the drive is intact.
Storage media integrity
The next step is to check on the media itself. For a hard drive, that means a media scan that checks the drive for bad sectors. If this validates, then you know the drive itself is not failing.
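A true media scan is the job of a disk utility, and the sketch below does not replace one. As a rough, file-level complement, though, a script can read every file from start to finish and note any that the drive fails to return. This only exercises the parts of the drive that currently hold files, and the function name `read_test` is made up for this example.

```python
from pathlib import Path


def read_test(root, chunk_size=1024 * 1024):
    """Read every file under root in full; return (bytes_read, failures)."""
    bytes_read = 0
    failures = []
    for f in sorted(Path(root).rglob("*")):
        if not f.is_file():
            continue
        try:
            with open(f, "rb") as handle:
                while True:
                    chunk = handle.read(chunk_size)
                    if not chunk:
                        break
                    bytes_read += len(chunk)
        except OSError as err:
            # The operating system reported a read problem for this file.
            failures.append((str(f), str(err)))
    return bytes_read, failures
```

Any entry in `failures` is a file worth restoring from backup, and a reason to run a proper scan of the drive.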
Proprietary raw file integrity
You can use the DNG converter to do a basic check of the structural integrity of proprietary raw files. If the DNG converter can make a conversion, then the file is at least structurally sound enough to parse. It could still have corruption in the file as shown in Figure 2 above.
Visual inspection
The most certain, and most time-consuming, way to validate files is to make a visual inspection. Of course, this is a monumental task for an archive of thousands of images. We suggest that the most efficient way to do this is to catalog the set of images, and inspect the results after you've finished. This can be tricky, though, because what you see may not be what's really there.