## Experimental Validation Using Ground Truth Data

Validating the performance of automatic vessel segmentation algorithms is not straightforward, primarily because of the difficulty of identifying the "ground truth," that is, establishing the "correct answer" that a computer segmentation is expected to produce. A secondary issue is that the acceptable degree of error varies with the application, so a way to quantify the error needs to be developed. For example, topological applications for studies of the vasculature may place a high emphasis on detection of all vascular segments or on precision in determining the vessel boundaries, possibly to sub-pixel accuracy. On the other hand, applications that require vessel segmentation for registration purposes may have the opposite emphasis. To validate a segmentation algorithm, the ground truth, the definition of what is being compared, and a measure are all required [78].

Arriving at "ground truth" is a known hard problem in image analysis and pattern recognition systems [54]. With retinal images, the ground truth is simply unavailable and can only be approximated by creating a "gold standard" against which computer-generated results are measured. In the context of this chapter, we define a gold standard as a binary segmentation, denoted G, where each pixel, denoted G(x, y), assumes a value of 0 or 1 for background or vessel, respectively. A computer-generated segmentation, denoted C, can then be compared and evaluated against the gold standard G. In evaluating each pixel, there are four possible cases. The case when C(x, y) = G(x, y) = 0 is called a "true negative." The case when C(x, y) = G(x, y) = 1 is called a "true positive." The case when C(x, y) = 1 and G(x, y) = 0 is called a "false positive." Finally, the case when C(x, y) = 0 and G(x, y) = 1 is called a "false negative." The frequency of these cases provides data that can be used as an indication of an algorithm's performance.
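The four per-pixel cases can be tallied directly. The following is a minimal sketch, assuming both segmentations are stored as equally sized nested lists of 0/1 values; the function name `confusion_counts` and the small example masks are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

Mask = List[List[int]]  # assumed representation: rows of 0 (background) / 1 (vessel)


def confusion_counts(c: Mask, g: Mask) -> Dict[str, int]:
    """Count the four per-pixel cases of C compared against the gold standard G."""
    if len(c) != len(g) or any(len(rc) != len(rg) for rc, rg in zip(c, g)):
        raise ValueError("C and G must have the same dimensions")

    counts = {"TN": 0, "TP": 0, "FP": 0, "FN": 0}
    for row_c, row_g in zip(c, g):
        for pc, pg in zip(row_c, row_g):
            if pc == 0 and pg == 0:
                counts["TN"] += 1  # C(x, y) = G(x, y) = 0
            elif pc == 1 and pg == 1:
                counts["TP"] += 1  # C(x, y) = G(x, y) = 1
            elif pc == 1 and pg == 0:
                counts["FP"] += 1  # C marks a vessel that G does not
            elif pc == 0 and pg == 1:
                counts["FN"] += 1  # C misses a vessel that G marks
            else:
                raise ValueError("pixel values must be 0 or 1")
    return counts


# Hypothetical 2x3 masks, for illustration only.
G = [[0, 1, 1],
     [0, 0, 1]]
C = [[0, 1, 0],
     [1, 0, 1]]

counts = confusion_counts(C, G)
print(counts)  # {'TN': 2, 'TP': 2, 'FP': 1, 'FN': 1}
```

The four counts always sum to the number of pixels in the image, which gives a quick sanity check on any implementation.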

Generating a gold standard is often a costly and time-consuming process. In addition, even expert human observers are known to be subjective and prone to a variety of errors. For example, one observer may label a vessel that another missed, and vice versa. This inconsistency is referred to as inter-observer variability. Likewise, the same human observer is very likely to generate different segmentations for the same image, which is known as intra-observer variability. Consequently, a single human expert's annotations are unreliable and should be considered inadequate for generating a gold standard. One approach to creating a gold standard is therefore to combine multiple human-generated manual segmentations. From a set of multiple observers' manual segmentations H, with each individual segmentation denoted H_i, we wish to obtain a single binary segmentation G that will be considered the gold standard. We know that these segmentations will differ, and thus a strategy for resolving these differences must be devised.
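As one illustration of such a strategy, the sketch below combines the manual segmentations by a per-pixel majority vote. This is an assumption made for the sake of example, not necessarily the resolution rule adopted in this chapter; the function name `majority_vote`, the tie-breaking choice (ties go to background), and the example masks are all hypothetical.

```python
from typing import List

Mask = List[List[int]]  # assumed representation: rows of 0 (background) / 1 (vessel)


def majority_vote(segmentations: List[Mask]) -> Mask:
    """Combine manual segmentations H_i into one binary mask G by per-pixel majority.

    Assumption: a pixel is labeled vessel only if strictly more than half of the
    observers marked it, so ties (possible with an even number of observers)
    resolve to background.
    """
    if not segmentations:
        raise ValueError("at least one segmentation is required")

    n = len(segmentations)
    rows = len(segmentations[0])
    cols = len(segmentations[0][0]) if rows else 0
    for h in segmentations:
        if len(h) != rows or any(len(row) != cols for row in h):
            raise ValueError("all segmentations must have the same dimensions")

    gold = []
    for y in range(rows):
        gold_row = []
        for x in range(cols):
            votes = sum(h[y][x] for h in segmentations)  # observers marking vessel
            gold_row.append(1 if 2 * votes > n else 0)
        gold.append(gold_row)
    return gold


# Three hypothetical observers segmenting the same 2x3 image.
H1 = [[0, 1, 1],
      [0, 0, 1]]
H2 = [[0, 1, 0],
      [1, 0, 1]]
H3 = [[0, 0, 1],
      [0, 0, 1]]

gold_standard = majority_vote([H1, H2, H3])
print(gold_standard)  # [[0, 1, 1], [0, 0, 1]]
```

A vote of this kind treats all observers as equally reliable; other resolution strategies could weight observers differently or require unanimous agreement.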
