## Statistical Analysis Of Reporting Delays

There can be considerable delays before newly diagnosed AIDS cases are reported to central registries. For example, AIDS cases diagnosed in the United States are first reported to the staff of state health departments, who, in turn, report the cases to the national AIDS surveillance system at the Centers for Disease Control. Reporting delays may falsely give the appearance of a recent decline in disease incidence when, in fact, the apparent decline in incidence is due to incomplete reporting of the most recently diagnosed cases. In order to accurately monitor disease incidence trends it is necessary to adjust for delays in disease reporting. Figure 7.1 is a graph of numbers of cases in the United States by date of diagnosis for cases reported by March 31, 1989 (Harris, 1990a). The lower curve is unadjusted for reporting delays and suggests a decline in recent AIDS incidence. When the curve is appropriately adjusted for reporting delays, a different picture of rising disease incidence emerges.

Figure 7.1 Monthly AIDS incidence data in the U.S., adjusted and unadjusted for reporting delays. Error bars represent 95% confidence intervals. Horizontal axis: calendar month of AIDS diagnosis. (Source: Harris, 1990a.)

The statistical analysis of reporting delays is complicated by the fact that the data are right truncated. This truncation refers to the exclusion from a data set of individuals with long reporting delays. Very similar statistical problems arise in the analysis of incubation periods of transfusion-associated AIDS cases (Chapter 4). Generally, failure to account for such right truncation will lead to biased results that underestimate reporting delays. The following discussion is adapted from Brookmeyer and Liao (1990b).

The data available for analysis consist of all cases reported to a registry as of the current calendar time, say $C$. The calendar time of diagnosis, $u_i$, and the calendar time of report to the registry, $R_i$, are recorded on each case ($i$ indexes the case). The reporting delay for the $i$th case is $d_i = R_i - u_i$. We also record covariate values for each case, such as geographic region of diagnosis, risk group, or calendar time of diagnosis. Such covariates may explain heterogeneity in reporting delays. Again, the main complexity with the analysis of reporting delays is that the data are right truncated; that is, a case is included in the registry only if the reporting delay $d_i$ is less than or equal to $T_i = C - u_i$. We call $T_i$ the truncation time for the $i$th individual.

The cumulative probability that the reporting delay is less than or equal to $t$ is called the reporting delay distribution, and we use the notation $F(t) = P(d \le t)$. The most important limitation of our data for estimating $F(t)$ is that we cannot observe reporting delays larger than the maximum truncation time. For example, suppose a disease registry is examined on January 1, 1989, and it is found that the earliest diagnosis time of a reported case is January 1, 1983. The maximum truncation time is 6 years. The best we can do is estimate the reporting delay distribution conditional on the delay being less than or equal to 6 years. We call this the conditional reporting delay distribution, $F^*(t)$, and it is related to $F(t)$ according to $F^*(t) = F(t)/F(6)$.

Our subsequent discussion refers to methods for estimating the conditional reporting delay distribution $F^*(t) = F(t)/F(t_m)$ for some value $t_m$ which is not greater than the maximum truncation time. If $t_m$ is sufficiently large, then $F^*$ may be a good approximation to $F$. In our analysis of AIDS reporting delays (Section 7.3.3), we will estimate the reporting delay distribution conditionally on the delay being less than 4 years. An important point is that there is no information in the data about the proportion of AIDS cases with long reporting delays that exceed the maximum truncation time or the proportion of AIDS cases who are never reported (Lagakos, Barraj, and DeGruttola, 1988; Kalbfleisch and Lawless, 1989; Brookmeyer and Liao, 1990b). Additional epidemiological studies, such as death certificate reviews, are needed to address these important issues, and we return to this point in Section 7.4. An important caveat associated with an analysis based on the conditional distribution $F^*$ is that it could indicate delays are longer in subgroup A than subgroup B, but nevertheless, the proportion of cases that are never reported or have very long delays (greater than $t_m$) could be larger in subgroup B than subgroup A.

We note that a simple linear regression analysis of observed delays $d_i$ on calendar time of diagnosis $u_i$ can be highly misleading, because we only get to observe AIDS cases if $d_i$ is no larger than $T_i = C - u_i$. Thus, a naive regression analysis of this type will show a trend of reporting delays becoming shorter over time, even if, in fact, there were no trends in the reporting delay distribution or possibly even if there were a trend toward larger delays.
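The bias of the naive regression can be seen in a small simulation. The sketch below is illustrative only: the uniform diagnosis times, the exponential delay distribution, the 48-month cutoff, and the function names are all assumptions made for the example, not part of the original analysis. Delays are generated with no time trend at all, yet the least squares slope among the cases that survive right truncation is negative.

```python
import random

def simulate_truncated_delays(n_cases, cutoff, mean_delay, seed=0):
    """Simulate (diagnosis time u, delay d) pairs with NO trend in delays.

    A case enters the registry only if it is reported by the cutoff C,
    that is, only if d <= C - u (right truncation).
    All distributional choices here are hypothetical.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_cases):
        u = rng.uniform(0.0, cutoff)             # calendar time of diagnosis
        d = rng.expovariate(1.0 / mean_delay)    # reporting delay, same law for every u
        if u + d <= cutoff:                      # otherwise the case is not yet in the registry
            observed.append((u, d))
    return observed

def ols_slope(pairs):
    """Least squares slope from regressing delay on diagnosis time."""
    n = len(pairs)
    mean_u = sum(u for u, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    sxy = sum((u - mean_u) * (d - mean_d) for u, d in pairs)
    sxx = sum((u - mean_u) ** 2 for u, _ in pairs)
    return sxy / sxx

cases = simulate_truncated_delays(n_cases=20000, cutoff=48.0, mean_delay=6.0, seed=1)
slope = ols_slope(cases)
mean_observed = sum(d for _, d in cases) / len(cases)
print(f"naive slope of delay on diagnosis time: {slope:.3f} months per month")
print(f"mean observed delay: {mean_observed:.2f} (true mean is 6.00)")
```

The negative slope and the shortened mean delay are produced entirely by the selection of cases, which is why the methods of the following sections condition on the truncation times.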

### 7.3.1 Nonparametric Estimation

The method for finding the nonparametric estimate of the conditional reporting delay distribution adapts survival analysis and life table techniques for use with right truncated data. This approach involves expressing the conditional reporting delay distribution, $F^*(t)$, as the product of conditional probabilities. We define the conditional probability $p_j$ to be the probability that the reporting delay is equal to $t_j$ given it is less than or equal to $t_j$, that is, $p_j = P(d = t_j \mid d \le t_j)$. Then,

$$F^*(t_s) = \prod_{j=s+1}^{m} (1 - p_j), \qquad s = 1, \ldots, m-1,$$

where $(1 - p_j) = P(d < t_j \mid d \le t_j)$ and $F^*(t_m) = 1.0$. The estimate of the reporting delay distribution is obtained by substituting estimates of the conditional probabilities into the above expression. The only cases who can contribute information about $p_j$ are those cases whose truncation times are greater than or equal to $t_j$ and whose reporting delays are less than or equal to $t_j$. We call these individuals the "risk set at $t_j$" to emphasize the analogy with life table analysis. The number of individuals in the risk set at $t_j$ is called $n_j$, and the number of cases with reporting delays of duration $t_j$ is called $T_j$. Thus, an estimate of $p_j$ is $T_j/n_j$. Then substituting this into the expression for $F^*$, the nonparametric estimate of $F^*$ is

$$\hat{F}^*(t_s) = \prod_{j=s+1}^{m} \left(1 - \frac{T_j}{n_j}\right). \tag{7.1}$$

The variance is given by

$$\operatorname{var}\{\hat{F}^*(t_s)\} = [\hat{F}^*(t_s)]^2 \sum_{j=s+1}^{m} \frac{T_j}{n_j (n_j - T_j)}, \tag{7.2}$$

which is analogous to Greenwood's formula.

These methods for nonparametrically estimating the reporting delay distribution are illustrated with hypothetical reporting delay data given in Table 7.1, in which cases are cross classified by the month of diagnosis and the reporting delay (in months). Since the maximum reporting delay which could possibly be observed was 5 months, the best that can be done (nonparametrically) is to estimate the conditional reporting delay distribution given that the reporting delay is not greater than 5 months.

Table 7.1 Hypothetical Data of Cases Reported by April 30, 1991: Illustration of Reporting Delay Calculations

| Month of Diagnosis | Delay 1 | Delay 2 | Delay 3 | Delay 4 | Delay 5 |
|---|---|---|---|---|---|
| Dec. '90 | 50 | 20 | 10 | 6 | 2 |
| Jan. '91 | 100 | 55 | 20 | 12 | |
| Feb. '91 | 171 | 115 | 45 | | |
| Mar. '91 | 207 | 118 | | | |
| April '91 | 220 | | | | |

Column headings give the reporting delay in months. Note: Numbers in parentheses in the original table are total cases in rectangular boxes, that is, the risk sets: (836) for delays of at most 2 months, (586) for delays of at most 3 months, (273) for delays of at most 4 months, and (88) for delays of at most 5 months.

The conditional probabilities $p_j$ are

$$p_5 = P(d = 5 \mid d \le 5) = \frac{2}{88} = .023, \qquad p_4 = P(d = 4 \mid d \le 4) = \frac{18}{273} = .066,$$

$$p_3 = P(d = 3 \mid d \le 3) = \frac{75}{586} = .128, \qquad p_2 = P(d = 2 \mid d \le 2) = \frac{308}{836} = .368.$$

Then the reporting delay distribution given that the delay is less than or equal to 5 months is calculated from equation (7.1) as follows:

$$\begin{aligned}
F^*(4) &= 1 - .023 = .977 \\
F^*(3) &= .977(1 - .066) = .913 \\
F^*(2) &= .913(1 - .128) = .796 \\
F^*(1) &= .796(1 - .368) = .503
\end{aligned}$$
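The hand calculation can be reproduced with a short script. This is a minimal sketch rather than the authors' software: the list layout of the table and the function names are assumptions made for illustration. Each row holds the counts for one diagnosis month, so the length of a row is that month's truncation time.

```python
# Table 7.1: one row per month of diagnosis (Dec. '90 first);
# entry [u - 1] is the number of cases reported with a delay of u months.
TABLE_7_1 = [
    [50, 20, 10, 6, 2],   # Dec. '90, truncation time 5
    [100, 55, 20, 12],    # Jan. '91, truncation time 4
    [171, 115, 45],       # Feb. '91, truncation time 3
    [207, 118],           # Mar. '91, truncation time 2
    [220],                # April '91, truncation time 1
]

def conditional_delay_distribution(table):
    """Return {s: F*(s)} using the reverse-time product of (1 - T_j / n_j)."""
    m = max(len(row) for row in table)
    dist = {m: 1.0}
    for j in range(m, 1, -1):
        eligible = [row for row in table if len(row) >= j]    # truncation time >= j
        events = sum(row[j - 1] for row in eligible)          # T_j: delays equal to j
        risk_set = sum(sum(row[:j]) for row in eligible)      # n_j: delays <= j
        dist[j - 1] = dist[j] * (1.0 - events / risk_set)
    return dist

def adjusted_incidence(table, dist):
    """Divide each month's observed count by F* at that month's truncation time."""
    return [sum(row) / dist[len(row)] for row in table]

F_STAR = conditional_delay_distribution(TABLE_7_1)
ADJUSTED = adjusted_incidence(TABLE_7_1, F_STAR)
for s in sorted(F_STAR):
    print(f"F*({s}) = {F_STAR[s]:.3f}")
print([round(z, 1) for z in ADJUSTED])
```

Because the script carries full precision rather than three-decimal intermediate values, an adjusted count can differ by about one case from a hand calculation that uses rounded values of $F^*$.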

The incidence data adjusted for reporting delays are obtained by dividing the observed cases by the appropriate value of the reporting delay distribution. If $Z$ is the number of reported cases who were diagnosed $u$ time units ago, then the adjusted incidence is $Z^* = Z/\hat{F}^*(u)$. Again, this adjusted number does not account for cases with reporting delays longer than $t_m$ (in the hypothetical example, $t_m = 5$ months). The calculations are illustrated in Table 7.2. There are two sources of uncertainty associated with the adjusted incidence data $Z^*$: uncertainty in the estimated reporting delay distribution, and uncertainty due to binomial variation. The variance of $Z^* = Z/\hat{F}^*$ is approximately (for large $Z$)

$$\operatorname{var}(Z^*) \approx \frac{Z(1 - \hat{F}^*)}{(\hat{F}^*)^2} + \frac{Z^2 \sigma^2}{(\hat{F}^*)^4}, \tag{7.3}$$

where $\sigma^2$, the estimated variance of $\hat{F}^*$, is obtained from equation (7.2) (see Brookmeyer and Liao, 1990). This formula is obtained by applying the delta method (Fienberg, 1980) to the ratio $Z/\hat{F}^*$ with the binomial variance for $Z$ and the variance of $\hat{F}^*$.

There are alternative approaches for estimating the reporting delay distribution. An alternative computational approach which produces the same estimate given by equation (7.1) is based on Poisson regression methods for the analysis of triangular incomplete contingency tables. Rosenberg (1990) gives a noniterative computational approach. These approaches maximize a conditional likelihood that is conditional on the numbers of cases that were reported to be diagnosed at each calendar time (Brookmeyer and Damiano, 1989; Harris, 1990a; Kalbfleisch and Lawless, 1989; Lagakos, Barraj, and DeGruttola, 1988). Another approach assumes a parametric model for the AIDS incidence curve and maximizes an unconditional likelihood. This approach is considered in Section 7.5.
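The agreement between the contingency table approach and the product form estimate can be checked numerically. The sketch below fits the Poisson model with one effect per diagnosis month and one effect per delay to the observed triangle of the hypothetical Table 7.1 data. It uses iterative proportional fitting, which is a substitute chosen here for brevity and is not the algorithm of the cited papers; the function name and data layout are also assumptions.

```python
# Hypothetical Table 7.1 counts: rows are diagnosis months, columns are delays 1..5.
TRIANGLE = [
    [50, 20, 10, 6, 2],
    [100, 55, 20, 12],
    [171, 115, 45],
    [207, 118],
    [220],
]

def fit_delay_probabilities(table, tol=1e-13, max_iter=10000):
    """Fit E(count) = a_t * b_u on the observed triangle by proportional fitting.

    Returns the delay effects b_u normalized to sum to one, which estimate
    the conditional probabilities P(d = u | d <= m).
    """
    n_cols = max(len(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[u] for row in table if len(row) > u) for u in range(n_cols)]
    probs = [1.0 / n_cols] * n_cols
    for _ in range(max_iter):
        # Row step: match each diagnosis month's observed total.
        a = [row_totals[t] / sum(probs[:len(table[t])]) for t in range(len(table))]
        # Column step: match each delay's observed total over the months that can show it.
        b = [col_totals[u] / sum(a[t] for t in range(len(table)) if len(table[t]) > u)
             for u in range(n_cols)]
        total = sum(b)
        new_probs = [x / total for x in b]
        change = max(abs(x - y) for x, y in zip(new_probs, probs))
        probs = new_probs
        if change < tol:
            break
    return probs

probs = fit_delay_probabilities(TRIANGLE)
cumulative = [sum(probs[:s]) for s in range(1, len(probs) + 1)]
print([round(c, 3) for c in cumulative])
```

The cumulative sums of the fitted delay probabilities agree with the values of $F^*$ obtained from the product of $(1 - T_j/n_j)$ terms for the same data.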

Table 7.2

| Month of Diagnosis | Observed Incidence | Adjusted Incidence |
|---|---|---|
| Dec. 90 | 88 | 88 |
| Jan. 91 | 187 | 187/.977 = 191 |
| Feb. 91 | 331 | 331/.913 = 363 |
| Mar. 91 | 325 | 325/.796 = 408 |
| April 91 | 220 | 220/.503 = 437 |

### 7.3.2 Regression Analysis

In this section, statistical techniques for the regression analysis of reporting delays are briefly outlined. Regression techniques are important for assessing calendar time (secular) trends in reporting delay and identifying covariates, such as geographic region or risk group, that affect the reporting delay distribution. The main statistical problem is that the data are right truncated. We assume that there are $r$ covariates $(X_1, \ldots, X_r)$ and that each covariate can take a finite number of values. Thus, each individual can be grouped into one of $K$ strata defined by the values of the covariates. We label these strata by the covariate vector $X_k$, $k = 1, \ldots, K$. The approach is to model the conditional probabilities $p_j$ as a function of covariates.

We extend the notation in the preceding section by adding an additional subscript $k$ to index the covariate strata. Thus, in the $k$th stratum, the conditional probability at $t_j$ is called $p_{jk}$; the number of cases with reporting delay equal to $t_j$ is called $T_{jk}$; and the size of the risk set at $t_j$ is called $n_{jk}$. Our model is that, conditional on $n_{jk}$, the $T_{jk}$ have independent binomial distributions, that is,

$$T_{jk} \sim \text{binomial}(n_{jk}, p_{jk}), \qquad j = 1, \ldots, m-1, \quad k = 1, \ldots, K. \tag{7.4}$$

Although the independence assumption is not strictly correct, it can be shown that the maximum likelihood estimates and their estimated variances are not affected by this assumption (Efron, 1988). In order to allow reporting delays to depend on covariates, we model the binomial probabilities as follows:

$$g(p_{jk}) = \alpha_j + \beta X_k, \tag{7.5}$$

where $\alpha_j$ and $\beta = (\beta_1, \ldots, \beta_r)$ are parameters to be estimated. The function $g$ is called the link function in the theory of generalized linear models (McCullagh and Nelder, 1989).

Two leading choices for the link function are the logistic link, $g(p_{jk}) = \log\{p_{jk}/(1 - p_{jk})\}$, and the complementary log-log link, $g(p_{jk}) = \log\{-\log(1 - p_{jk})\}$. The model given by equations (7.4) and (7.5) with a logistic link has been termed a continuation ratio model in the contingency table literature (Fienberg, 1980; McCullagh and Nelder, 1989). The complementary log-log link is especially attractive because it induces a simple relation among the distribution functions of reporting delays:

$$F_k^*(t_s) = \{F_0^*(t_s)\}^{\theta_k}, \qquad \theta_k = \exp(\beta X_k), \qquad s = 1, \ldots, m-1, \tag{7.6}$$

where

$$F_0^*(t_s) = \prod_{j=s+1}^{m} \exp\{-\exp(\alpha_j)\}$$

is the reporting delay distribution function when all the covariates are 0. The interpretation of a regression coefficient $\beta_i$ associated with one of the covariates $X_i$ is as follows: a positive $\beta_i$ indicates delays are longer with increasing values of $X_i$ after controlling for the other covariates; a negative $\beta_i$ indicates delays are shorter with increasing values of $X_i$; and a $\beta_i$ equal to zero suggests no association between delays and the covariate $X_i$.
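The power relation induced by the complementary log-log link can be verified numerically. In the sketch below the intercepts, the value of the linear predictor, and the function names are hypothetical values chosen only for illustration. The conditional probabilities are computed from the inverse link, the distribution functions are built as products of one minus those probabilities, and the stratum distribution is compared with the baseline distribution raised to the power $\exp(\beta X_k)$.

```python
import math

def conditional_probs(alphas, beta_x):
    """Inverse complementary log-log link: p = 1 - exp(-exp(alpha_j + beta * x))."""
    return [1.0 - math.exp(-math.exp(a + beta_x)) for a in alphas]

def delay_distribution(probs):
    """Given p_2, ..., p_m, return F*(t_1), ..., F*(t_{m-1}) as products of (1 - p_j)."""
    values = []
    running = 1.0
    for p in reversed(probs):      # start from p_m and work back in time
        running *= 1.0 - p
        values.append(running)
    return list(reversed(values))

ALPHAS = [-0.5, -2.0, -2.7, -3.8]  # hypothetical intercepts alpha_2, ..., alpha_5
BETA_X = 0.4                       # hypothetical value of beta * X_k for one stratum

baseline = delay_distribution(conditional_probs(ALPHAS, 0.0))
stratum = delay_distribution(conditional_probs(ALPHAS, BETA_X))
theta = math.exp(BETA_X)

for s, (f0, fk) in enumerate(zip(baseline, stratum), start=1):
    print(f"t_{s}: F_k = {fk:.4f}, F_0 ** theta = {f0 ** theta:.4f}")
```

Since $\theta_k > 1$ here, every value of the stratum distribution lies below the baseline value, which is the sense in which a positive coefficient corresponds to longer delays.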

### 7.3.3 Reporting Delays in the United States

The methods described in Sections 7.3.1 and 7.3.2 were used to analyze reporting delays of AIDS in the United States (Brookmeyer and Liao, 1990b). The analyses were based on the October 1989 AIDS Public Information Data Set. This data set included 109,168 AIDS cases diagnosed in the United States and reported to the CDC before October 1, 1989. The analysis included only cases who met the pre-1987 AIDS surveillance definition. This restriction is important because some cases who met only the expanded 1987 surveillance definition were reported to have been diagnosed months, and in some cases years, before the new definition went into effect, which could artificially give the appearance of long reporting delays.

Table 7.3 presents the nonparametric estimates of the reporting delay distribution for the entire United States and for each of the six geographic regions (using equation (7.1)). Overall, 51% of cases are reported within 3 months of diagnosis, and 84% within 12 months. The fastest reporting occurred in the Northeast and the slowest in the South. For example, the proportion of cases reported within three months of diagnosis ranged from 0.56 in the Northeast to only 0.39 in the South.

An important question is whether the distribution of reporting delays has changed over calendar time. The regression methods of Section 7.3.2 were used to evaluate the separate effects of calendar time, risk group, and geography on the reporting delay distribution. These analyses suggested significant geographic variation. The influences of risk groups and calendar year of diagnosis were not consistent across each of the geographic regions. Variation among risk groups was attributed primarily to slower reporting of transfusion-associated and pediatric AIDS cases. An overall trend toward longer delays with calendar time of diagnosis was attributed primarily to a trend toward longer delays in the Northeast.

Table 7.3

| Reporting Delay in Months (a) | U.S. | Northeast | Central | West | South | Mid-Atlantic | Other |
|---|---|---|---|---|---|---|---|
| (number of cases) (b) | (88,037) (c) | (22,738) | (6,375) | (18,471) | (11,342) | (5,585) | (22,249) |
| 1 | 0.05 | 0.07 | 0.05 | 0.05 | 0.02 | 0.07 | 0.04 |
| 2 | 0.31 | 0.35 | 0.32 | 0.35 | 0.19 | 0.32 | 0.28 |
| 3 | 0.51 | 0.56 | 0.53 | 0.54 | 0.39 | 0.51 | 0.49 |
| 4 | 0.62 | 0.67 | 0.63 | 0.64 | 0.52 | 0.62 | 0.59 |
| 5 | 0.68 | 0.73 | 0.69 | 0.70 | 0.61 | 0.69 | 0.65 |
| 6 | 0.72 | 0.76 | 0.73 | 0.74 | 0.67 | 0.73 | 0.69 |
| 7 | 0.75 | 0.79 | 0.76 | 0.77 | 0.70 | 0.76 | 0.72 |
| 8 | 0.78 | 0.82 | 0.78 | 0.79 | 0.74 | 0.78 | 0.75 |
| 9 | 0.80 | 0.83 | 0.80 | 0.81 | 0.76 | 0.80 | 0.77 |
| 10 | 0.82 | 0.85 | 0.82 | 0.83 | 0.78 | 0.82 | 0.79 |
| 11 | 0.83 | 0.86 | 0.83 | 0.84 | 0.81 | 0.83 | 0.80 |
| 12 | 0.84 | 0.87 | 0.84 | 0.85 | 0.83 | 0.85 | 0.82 |
| 18 | 0.90 | 0.92 | 0.90 | 0.90 | 0.90 | 0.90 | 0.88 |
| 24 | 0.93 | 0.95 | 0.93 | 0.93 | 0.94 | 0.93 | 0.92 |
| 36 | 0.97 | 0.98 | 0.97 | 0.97 | 0.98 | 0.98 | 0.97 |
| 48 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |

Source: Brookmeyer and Liao, 1990b.

(a) Cumulative probability $F^*(t)$ of a reporting delay less than or equal to $t$ months given the delay is less than or equal to 48 months. Cases reported within same month of diagnosis coded 1; cases reported in month following diagnosis coded 2, etc.

(b) Number of cases reported to the CDC before October 1, 1989, that met the pre-1987 surveillance definition.

(c) Includes an additional 1277 pediatric AIDS cases with missing geographic region information.

Adjusted AIDS incidence in the most recent month is very uncertain. For example, suppose the reporting delay distribution were known precisely, so that $\sigma = 0$ in equation (7.3). Then, even with a true monthly incidence as large as 200 cases/month, the coefficient of variation of the adjusted incidence in the most recent month would be over 30% (from equation (7.3)). For this reason, adjusted AIDS incidence for the most recent month should typically be ignored when assessing trends. Many analysts recommend ignoring the most recent 3 or even 6 months of adjusted incidence data.
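The coefficient of variation quoted for the most recent month can be checked with a few lines of arithmetic. The sketch assumes that the delta-method variance takes the form $Z(1 - F^*)/(F^*)^2 + Z^2\sigma^2/(F^*)^4$ (binomial variation in $Z$ plus uncertainty in $F^*$); that form, the function name, and the use of the U.S. value $F^*(1) = 0.05$ from Table 7.3 are assumptions of this illustration.

```python
import math

def adjusted_incidence_variance(z, f_star, sigma2=0.0):
    """Approximate variance of Z* = Z / F* (assumed delta-method form).

    z      : observed (reported) count
    f_star : conditional probability of being reported by now
    sigma2 : variance of the estimated F*; zero if F* is treated as known
    """
    binomial_part = z * (1.0 - f_star) / f_star ** 2
    delay_part = z ** 2 * sigma2 / f_star ** 4
    return binomial_part + delay_part

TRUE_INCIDENCE = 200.0   # cases diagnosed in the most recent month
F_STAR_1 = 0.05          # U.S. value for a delay code of 1 in Table 7.3

z_observed = TRUE_INCIDENCE * F_STAR_1   # about 10 cases reported so far
z_adjusted = z_observed / F_STAR_1       # adjusted incidence, 200
variance = adjusted_incidence_variance(z_observed, F_STAR_1)
cv = math.sqrt(variance) / z_adjusted
print(f"coefficient of variation: {cv:.1%}")
```

With only about 10 of the 200 cases reported so far, binomial variation alone gives a coefficient of variation just above 30%, before any uncertainty in the delay distribution is added.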

### 7.4 UNDERREPORTING OF AIDS CASES

Some AIDS cases may never be reported, and these cases are not accounted for by the reporting delay adjustments described in Section 7.3. Cases may not be reported either because they were never properly diagnosed, or, if diagnosed, may not have been reported to the surveillance system (General Accounting Office, 1989).

One method for assessing underreporting of cases is to identify individuals who died from AIDS by reviewing death certificates and to match these individuals to the AIDS surveillance registry. A measure of the completeness of reporting, $f$, is the fraction of cases identified by the death certificate review who are found in the AIDS surveillance registry:

$$f = \frac{\text{Number of death certificate cases also in registry}}{\text{Number of AIDS cases found from death certificates}}$$

Hardy, Starcher, and Morgan (1987) used this methodology and searched death certificates from 1985 in four U.S. cities. They report $f = 487/548 = .89$. A source of uncertainty with this methodology is error in identification of all AIDS-related deaths from death certificates. AIDS cases that are not reported to the surveillance system may also be less likely to be correctly classified as an AIDS-related death on the death certificate.

A disadvantage of the above methodology is that it requires an identifier (record linkage) to match cases from death certificates with cases in the registry. An alternate methodology that does not require identifiers was proposed by Remis and Palmer (1991). Remis and Palmer assessed the completeness of reporting in the Quebec AIDS surveillance program by comparing deaths modelled from reported AIDS cases to AIDS mortality based on death certificates. They predicted AIDS deaths by propagating forward reported AIDS cases to obtain predicted dates of death, according to a survival distribution for AIDS patients. Based on a comparison of predicted and observed AIDS deaths in 1987-88, they estimated that the completeness of reporting was 92%.

A separate issue from the underreporting of cases concerns individuals with severe HIV disease who do not meet the AIDS surveillance definition. Stoneburner, Des Jarlais, and Benezra (1988) studied the increased mortality in the 1980s among New York intravenous drug users. The death certificates among the IV drug users for whom the causes of death were AIDS-related were matched to the New York City Department of Health AIDS Surveillance Registry. They concluded that there was a large spectrum of severe HIV disease that does not qualify as AIDS.

An additional problem with tracking trends in individual risk groups is that the risk group for some cases may be unknown. The practice in the United States is to temporarily assign these individuals to a category with undetermined risk ("no identified risk"). Some of these individuals may then be reclassified to other risk groups, following the results of more intensive investigation and interviewing. As a result, trends in the undetermined group typically show pronounced growth in the most recent past. This "growth" must be interpreted very cautiously since it is likely due to using the undetermined classification as a temporary holding category. For example, as of November 1990, 9920 AIDS cases in the United States were initially reported as no identified risk. Additional interviews and follow-up information were collected from 4863 of these cases, of which 4416 were eventually reclassified. Only 447 remained classified as no identified risk/other (Centers for Disease Control, 1990d).

In order to monitor risk group specific trends in AIDS incidence, it is necessary to redistribute cases with undetermined risk. Green, Karon, and Nwanyanwu (1992) have performed an analysis to examine past trends in these redistribution fractions among cases initially with undetermined risk for whom additional follow-up information and interviews were obtained. For example, among adult white males with initially undetermined risk, Green, Karon, and Nwanyanwu (1992) report that eventually 72% are reclassified as gay, 7% as IVDU, 3% as gay/IVDU, 6% as heterosexual, 4% as transfusion-related, and 8% as no identified risk/other.
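The redistribution step amounts to multiplying the number of cases with undetermined risk by the reclassification fractions. The sketch below applies the fractions quoted for adult white males to a hypothetical count of 1,000 undetermined cases; the count and the function name are assumptions made for illustration.

```python
# Reclassification fractions for adult white males with initially undetermined
# risk, as quoted from Green, Karon, and Nwanyanwu (1992).
FRACTIONS = {
    "gay": 0.72,
    "IVDU": 0.07,
    "gay/IVDU": 0.03,
    "heterosexual": 0.06,
    "transfusion-related": 0.04,
    "no identified risk/other": 0.08,
}

def redistribute(n_undetermined, fractions):
    """Expected number of undetermined-risk cases allocated to each risk group."""
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("redistribution fractions must sum to one")
    return {group: n_undetermined * share for group, share in fractions.items()}

# Hypothetical example: 1,000 cases currently held in the undetermined category.
allocation = redistribute(1000, FRACTIONS)
for group, expected in allocation.items():
    print(f"{group}: {expected:.0f}")
```

The allocated counts would then be added to the counts already assigned to each risk group before trends are examined.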

### 7.5 CHANGES IN THE SURVEILLANCE DEFINITION

The AIDS surveillance definition in the U.S. has been revised several times. These revisions reflect increasing knowledge of the pathogenesis of HIV infection and a desire to make sure that the surveillance definition reflects current diagnostic practice. Before 1985, the surveillance definition was not based on a positive HIV antibody test but required pathological evidence of AIDS-defining conditions. In 1985, the definition was expanded to include individuals who also had diseases such as disseminated histoplasmosis, chronic isosporiasis, and high grade or B-cell non-Hodgkin's lymphoma (Centers for Disease Control, 1985b). A more significant revision occurred in 1987 (Centers for Disease Control, 1987d) when the definition was expanded to include HIV positive persons with diseases such as extrapulmonary tuberculosis, HIV dementia, and HIV wasting syndrome. The 1987 definition also included individuals who were HIV antibody positive and who were diagnosed with certain diseases, such as Pneumocystis carinii pneumonia and cerebral toxoplasmosis, on a presumptive clinical basis rather than by histological proof. In 1993, the CDC revised the surveillance definition of AIDS. The case definition was expanded to include all HIV-infected persons who have fewer than 200 CD4+ T cells per microliter or a CD4+ T-lymphocyte percentage of total lymphocytes of less than 14. The expansion includes 3 clinical conditions (pulmonary tuberculosis, recurrent pneumonia, and invasive cervical cancer) as well as 23 specific clinical conditions in the 1987 AIDS surveillance definition (Centers for Disease Control, 1992b).

The impact of the definitional changes on reported AIDS incidence is illustrated graphically in Figure 7.2. Selik, Buehler, Karon, et al. (1990) report that about 28% of cases diagnosed and reported from September 1, 1987, to December 31, 1988, met only the new criteria of the 1987 revision. This proportion was highest among heterosexual intravenous drug users (43%) and lowest among male homosexuals (21%).

Figure 7.2 U.S. AIDS cases by quarter year of report by surveillance definition category of case. Horizontal axis: quarter-year of report, 1981 to 1986; the legend includes the category "Only New Criteria". (Source: Selik, Buehler, Karon, et al., 1990.)

In general, the effect of broadening the surveillance definition is to cause an abrupt increase in AIDS incidence (Figure 7.2). An important objective of the broadened definition is to capture a wider spectrum of HIV-related disease. Unfortunately, sudden changes in the surveillance definition make it difficult to interpret trends in AIDS incidence. There have been several attempts to reconstruct the AIDS incidence curve that would have been observed if there had been no definitional changes. Karon, Dondero, and Curran (1988) suggested the concept of a "consistent" AIDS case series. These cases include individuals who were diagnosed, either presumptively or definitively, with any one of the 1985 AIDS defining conditions. The consistent case series excluded AIDS cases who were diagnosed on the basis of the new 1987 AIDS defining conditions such as wasting syndrome or HIV encephalopathy.

A limitation of the consistent series is that it does not account for individuals who are diagnosed with AIDS by one of the new criteria and then subsequently develop a disease included in the old criteria. To address this concern, Gail, Rosenberg, and Goedert (1990a) introduced the concept of an "augmented consistent" case series. The idea is based on a three-state competing-risk model. The model is illustrated in Figure 7.3 where $\lambda_1$ is the hazard of death for individuals diagnosed under the new definition and $\lambda_2$ is the hazard of progression to the old definition for individuals diagnosed under the new definition. Then, the probability $p_t$ that an individual diagnosed by the new definitional criteria would subsequently qualify for diagnosis with the old criteria in the $t$th month following the first diagnosis is

Figure 7.3 Multistate model of the effect of a change in the surveillance definition. (State labels recoverable from the figure: Infection; AIDS Diagnosis Under Old Criteria; Death.)

Equation (7.7) is used to obtain an augmented case series as follows: Suppose $n_i$ individuals are diagnosed under the new criteria in calendar month $i$. Then the diagnosis times of these individuals are reallocated forward in time so that $n_i p_t$ individuals are considered to have been diagnosed under the old definition at calendar month $(i + t)$. Figure 7.4 shows trends in both overall AIDS incidence and the augmented consistent case series among gay men in the United States.
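The reallocation step can be sketched as a discrete convolution. In the example below the monthly counts and the progression probabilities are hypothetical placeholders: the probabilities stand in for the values that equation (7.7) would supply and are not computed from it, and the function name is an assumption.

```python
def augment_consistent_series(consistent, new_only, progression_probs):
    """Add reallocated new-criteria cases to a consistent case series.

    consistent[i]        : cases diagnosed under the old criteria in month i
    new_only[i]          : cases diagnosed only under the new criteria in month i
    progression_probs[t-1]: probability of qualifying under the old criteria
                            in the t-th month after the first diagnosis
    Cases reallocated beyond the last month of the series are dropped.
    """
    augmented = [float(count) for count in consistent]
    for i, n_i in enumerate(new_only):
        for t, p_t in enumerate(progression_probs, start=1):
            if i + t < len(augmented):
                augmented[i + t] += n_i * p_t
    return augmented

# Hypothetical monthly counts and hypothetical progression probabilities.
consistent_counts = [100, 100, 100, 100]
new_criteria_counts = [0, 40, 0, 0]
p = [0.10, 0.05]

print(augment_consistent_series(consistent_counts, new_criteria_counts, p))
```

In this toy series the 40 new-criteria cases diagnosed in the second month contribute 4 expected cases to the third month and 2 to the fourth.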

### 7.6 EMPIRICAL EXTRAPOLATION OF AIDS INCIDENCE

### 7.6.1 General Considerations

The simplest approach for obtaining projections of AIDS incidence is extrapolation of the AIDS incidence curve. The first Public Health Service projection made in 1986 estimated 270,000 cumulative AIDS cases in the United States by the end of 1991 (Public Health Service, 1986; Morgan and Curran, 1986). This projection was based on the extrapolation of a quadratic polynomial model for transformed monthly AIDS incidence. AIDS incidence was transformed using a Box-Cox transformation before fitting the polynomial model. The projections of annual U.S. AIDS incidence were 45,000, 58,000, and 74,000 in 1989, 1990, and 1991, respectively (Morgan and Curran, 1986).

The most serious limitation with extrapolation is that the projections depend crucially on the mathematical function used as the basis for the extrapolation. Furthermore, as discussed in the preceding sections, AIDS incidence data are subject to a number of sources of uncertainty including reporting delays, underreporting, and changes in the surveillance definition. In this section, we discuss approaches for extrapolating the AIDS incidence curve and situations when extrapolation can produce useful short term projections of AIDS incidence.

Figure 7.4 Projected and observed quarterly AIDS incidence among homosexual and bisexual men in the United States. Projections were based on consistently defined AIDS incidence counts through June 30, 1987 (open squares) without constraints (solid line) and under the constraints that no infections occurred after July 1, 1985 (dot-dash line), as described by Gail, Rosenberg, and Goedert (1990a). Vertical lines indicate 95% confidence intervals, and solid squares depict augmented consistently defined quarterly AIDS incidence beginning in July, 1987. Solid circles depict all AIDS incidence beginning in July 1, 1987. Horizontal axis: quarter of diagnosis. (Source: figure 1 in Gail, Rosenberg, and Goedert, 1990a.)

The first step is to adjust AIDS incidence data for reporting delays as described in Section 7.3. Figure 1.1 displays the delay-adjusted AIDS incidence data by calendar quarter of diagnosis separately by risk group. This figure was based on all cases reported to the CDC by March 31, 1990. A simple log-linear model for extrapolation implies exponential growth in AIDS incidence. If $E(T_t)$ is the expected AIDS incidence at calendar time $t$, then

$$\log E(T_t) = b_0 + b_1 t. \tag{7.8}$$

The regression parameters $b_0$ and $b_1$ can be estimated from statistical computing algorithms for Poisson regression (GLIM, for example, Payne, 1986) under the assumption that AIDS incidence approximates a nonhomogeneous Poisson process. Equation (7.8) forces the expected AIDS incidence to keep growing exponentially, which is not consistent with epidemic theory. Indeed, Figure 1.1 exhibits subexponential growth beginning in the early 1980s. Accordingly, it is necessary to add quadratic time terms to equation (7.8) to obtain

$$\log E(T_t) = b_0 + b_1 t + b_2 t^2.$$
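A Poisson regression of this kind can be fit with iteratively reweighted least squares. The sketch below is a minimal standard-library implementation written for illustration; it is not GLIM, and the quarterly counts are hypothetical values generated from assumed coefficients. It fits both the log-linear and the log-quadratic model and compares their extrapolations beyond the observed range.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_poisson_polynomial(times, counts, degree, n_iter=25):
    """IRLS fit of log E(count) = b0 + b1 t + ... ; counts must be positive."""
    design = [[t ** k for k in range(degree + 1)] for t in times]
    p = degree + 1
    eta = [math.log(y) for y in counts]          # start from the observed counts
    coef = [0.0] * p
    for _ in range(n_iter):
        mu = [math.exp(e) for e in eta]
        z = [e + (y - w) / w for e, y, w in zip(eta, counts, mu)]   # working response
        xtwx = [[sum(row[j] * row[k] * w for row, w in zip(design, mu))
                 for k in range(p)] for j in range(p)]
        xtwz = [sum(row[j] * w * zi for row, w, zi in zip(design, mu, z))
                for j in range(p)]
        coef = solve_linear(xtwx, xtwz)
        eta = [sum(c * x for c, x in zip(coef, row)) for row in design]
    return coef

def predict(coef, t):
    """Expected incidence at time t under the fitted polynomial model."""
    return math.exp(sum(c * t ** k for k, c in enumerate(coef)))

# Hypothetical quarterly counts; time is in years, centered on the observation period.
times = [(q - 7.5) / 4.0 for q in range(16)]
counts = [round(math.exp(6.9 + 0.5 * t - 0.05 * t * t)) for t in times]

linear = fit_poisson_polynomial(times, counts, degree=1)
quadratic = fit_poisson_polynomial(times, counts, degree=2)
future = 3.0   # a little over one year past the last observed quarter
print("log-linear projection:   ", round(predict(linear, future)))
print("log-quadratic projection:", round(predict(quadratic, future)))
```

Both models describe the observed quarters closely, yet their projections separate quickly outside the observed range, which illustrates how strongly an extrapolation depends on the assumed functional form.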

The Public Health Service (PHS) projection in 1986 that there would be 270,000 cumulative AIDS cases by the end of 1991 was based on an extrapolation of a model in which power-transformed monthly AIDS incidence is a quadratic polynomial in calendar time, where the power $a$ defines the (Box-Cox) transformation. This model was applied to reporting-delay-corrected AIDS incidence data. The projection agreed with observed AIDS incidence until 1989, when the projections began to exceed AIDS incidence. By April 1, 1992, 215,263 cumulative AIDS cases with diagnoses through December 31, 1991, had been reported to the CDC, and the delay-corrected cumulative projection of AIDS cases through 1991 was 240,000.

The addition to equation (7.9) of higher order terms in time (e.g., $t^3$ or $t^4$) is not usually recommended. The most recent data points can have high influence on these regression coefficients and could result in AIDS incidence curves that predict dramatic changes in incidence in the short term.

Even if the mathematical function used for the extrapolation agrees perfectly with observed counts of AIDS cases, the assumption that the mathematical function will agree with future AIDS incidence cannot be verified. Furthermore, any one of a number of statistical models may fit the observed AIDS incidence data equally well but give radically different long-term projections. Indeed, extrapolation of some models may yield anomalous and misleading results. For example, extrapolation of U.S. AIDS incidence data through 1987 using a normal density curve predicted sharp decreases in AIDS incidence and a cumulative final number of AIDS cases of about 200,000 (Bregman and Langmuir, 1990). The basis for this anomalous result was that although AIDS incidence was still increasing, it was increasing more slowly than previously (Gail and Brookmeyer, 1990a).

Conventional confidence intervals for the expected AIDS incidence at a future time, based on extrapolation methods, reflect the statistical uncertainty in the estimated regression parameters, but do not reflect the uncertainty in selecting the assumed parametric regression model. For example, the confidence bounds reported for the 1986 PHS projections are correct provided the assumed parametric model is correct. Conventional confidence intervals also do not reflect the random variation in future AIDS incidence.

Although these problems with extrapolation limit its usefulness for obtaining reliable long-term projections, extrapolation may still be useful for short-term projections. Observed trends in AIDS incidence may persist over the short term because even abrupt changes in the underlying transmission of HIV infection would not be seen in counts of AIDS cases for many years and then only gradually. This is because the AIDS incidence curve is smoothed when changes in the infection rate are convolved with the incubation period distribution (Chapter 8). Smooth curves may lend themselves to simple extrapolation. However, in some circumstances, even short-term extrapolation can be in error. For example, some events may have an immediate impact on AIDS incidence, such as the advent of a new therapy to prevent or delay AIDS, a sudden increase in reporting delays, or a major change in the surveillance definitions. In these situations, historical trends in incidence cannot be reliably extrapolated.

### 7.6.2 Joint Modeling of Reporting Delays and AIDS Incidence

An alternative to first estimating the reporting delay distribution and then empirically modelling the delay-adjusted incidence data is to jointly model reporting delays and AIDS incidence (Harris, 1990a; Zeger, See, and Diggle, 1989). The basic idea is to model $T_{tu}$, the number of AIDS cases diagnosed in calendar time (month) $t$ and reported $u$ time units (months) later. Zeger, See, and Diggle (1989) proposed a Poisson model for $T_{tu}$ with expectation given by

$$\log E(T_{tu}) = f(t; \beta) + d(u; \alpha),$$

where $f(t; \beta)$ is a function of calendar time ($t$) with unknown parameters $\beta$, which describes the calendar time trends in AIDS incidence, and $d(u; \alpha)$ is a function of the reporting delay ($u$) with unknown parameters $\alpha$, which describes the reporting delay distribution. Specifically, $\exp\{d(u; \alpha)\}$ is the probability of a reporting delay equal to $u$ months (actually, it is the conditional probability given that the delay is less than or equal to the maximum observed reporting delay; see Section 7.3). A simple choice for $f(t)$ is a quadratic function $f(t; \beta) = \beta_0 + \beta_1 t + \beta_2 t^2$. Zeger, See, and Diggle (1989) suggest a cubic spline. A very flexible model for the delay function is the step function model $d(u; \alpha) = \alpha_u$. Models such as these can be fit using Poisson regression methods (e.g., GLIM; Payne, 1986).

There are several advantages to joint modeling of reporting delays and AIDS incidence. First, a correctly specified parametric model for AIDS incidence can increase the precision of delay-adjusted AIDS incidence considerably for the most recent time periods. Second, joint modeling accounts for two sources of uncertainty: uncertainty in the reporting delay adjustments and uncertainty in the estimated regression coefficients.

Fitted models of AIDS incidence will be unreliable in small subgroups with small numbers of AIDS cases (for example, subgroups defined by small geographic areas and risk groups). Zeger, See, and Diggle (1989) propose an empirical Bayes approach to predict AIDS incidence in small subgroups. The basic idea is to borrow strength from other similar subgroups to improve the trend estimates for a given subgroup. The empirical Bayes estimate for a given subgroup is a weighted average of two quantities: a trend estimate obtained by modelling geographic and risk group effects using data from many subgroups, and the trend estimate obtained only from the given subgroup.
