Environmental project, 1005 – Time Series Study of Air Pollution Health Effects in COPSAC Children

Time Series Study of Air Pollution Health Effects in COPSAC Children

3 Statistical Methods

3.1 Generalized Additive Model (GAM)
3.2 Modelling Lag Structure

3.1 Generalized Additive Model (GAM)

A generalized additive Poisson regression model was fit modeling the logarithm of the expected value of daily counts of respiratory symptoms as a sum of linear and smooth functions of the predictor variables (Hastie et al., 1990; Schwartz, 1996). The generalized additive model allows regressions to include nonparametric smooth functions to model the potential nonlinear dependence of the daily respiratory symptoms on weather and season. The model assumes:

where Y is the daily count of respiratory symptoms in COPSAC children, E(Y) is the expected value of that count, X_i is the covariate, and S_i is the smooth function. For the Si we used smoothing spline. This approach is standard in air pollution time-series modeling (Schwartz, 1994).

Smoothing spline is a type of smoother, which is a nonparametric tool for summarizing the trend of a response measurement Y as a function of a predictor measurements X. Smoothers serve as a descriptive tool, depicting the shape of a relationship between X and Y, and as a building block of the estimation of additive models. The simple example of a smoother is a running mean (or moving average), while others types of smoothers include polynomial, loess, Gaussian kernel, regression spline, natural spline, etc. Smoothing spline creates a smooth curve through the data, where the level of smoothness is adjusted by a varying parameter that changes the curve from a least squares-line Approximation to a cubic spline interpolant. The smoothing spline s is constructed for the specified weights wi . The smoothing spline minimizes

If the weights are not specified, they are assumed to be 1 for all data points. p is defined between 0 and 1, p = 0 produces a least squares straight line fit to the data, while p = 1 produces a cubic spline interpolant. When non-specified, the smoothing parameter is automatically selected within `interesting range' near 1/(1+h³/6) where h is the average spacing of the data points, and is tipically much smaller than the allowed range of the parameters. Because smoothing splines have an associated parameter, these fits can be considered to be parametric. However, smoothing splines are also piecewise polynomials like cubic splines or shape-perserving interpolants and are thus most often considered a nonparametric fit type.

Our final model was:

where Y_t denotes the daily count of new respiratory symptoms (incidence) in COPSAC children, assumed to be Poisson distributed, β₁ denotes the log relative rate of morbidity associated with an unit increase in mean daily pollution, P_t pollutant of interest, S₂(T,5) is a smooth function of temperature with 5 degrees of freedom, S₃(time,8) a smooth function of time with 8 degrees of freedom, and offset(log(cohort)) term that incorporates weighing the daily number of new respiratory symptoms over the daily number of children in the cohort.

Single pollutant model was fit for each of the pollutants in Table 2.2.1, for all three populations defined in the section 2.1. Missing values were excluded from the analysis. Each pollutant was treated as having linear association with respiratory symptoms. To reduce sensitivity to outliers in the pollution levels the analysis excluded extremely high values. Exclusion of the extreme values was described in section 2.2.

The daily count of incident (new) cases of respiratory symptoms in COPSAC children was the main outcome in the study as registered on the first day of the event. A pilot study was performed with counts of prevalent cases in Population 1 and 2 together, and lead to the conclusion that incidence is most relevant clinical outcome in this study. Prevalence of symptoms on a given day would be highly dependent on prevalence of symptoms in the subject on the previous day and prevalence would be difficult to interpret with respect to lag-time in relation to air pollution levels.

Smoothing functions were used to capture seasonal and other short-term and long-term trends in the data. Temperature was used to capture potential short-term confounding. Other meteorological variables did not enter the final model due to correlation with temperature (Relative Humidity) or non significance (Wind Speed and Global Radiation). Smooth functions of time were used to remove the basic long-term pattern in the data. The span for the smooth function of time was chosen to remove seasonal and long-term trends and to minimize autocorrelation in the residuals, by methods described previously (Schwartz, 1999). The day of the week indicator, a standard variable used in GAM models to capture short-term trends, was not significant confounder in this study. Influenza epidemics variable, defined as a weekly percent of total physician visits, was not significant confounder in this study (Appendix B).

Analyses were performed by gam function, mgcv package in R statistical software.

3.2 Modelling Lag Structure

To evaluate the impact of air pollutants for up to 5 days after exposure several lag modeling methods were implemented. We chose a maximum lag of 5 days before the respiratory symptom incidence for the air pollution variable, because the goal of this analysis is to estimate the short-term effects of air pollution, and because previous studies have shown that longer lag had little correlation with development of respiratory symptoms.

Distributed lag models have been commonly used in social sciences and recently implemented in epidemiology by Pope and Schwartz (Judge et al., 1980; Pope et al., 1996). The motivation for the distributed lag model is the realization that air pollution can affect the development of respiratory symptoms in children not only on the same day but also on several consequent days. Unconstrained distributed lag model assumes:

where Xt-is the pollutant concentration q days before the occurrence of respiratory symptom. The overall effect of a unit increase in air pollution on a single day is its impact on that day plus that on subsequent days, that is sum of 0+ . . .+ q. Thus equation above can be rewritten as:

where the ω_i are weights that sum to 1, and β* is β₀+ . . .+ β_q. That is, β* is interpretable as the marginal effect of a unit increase in a weighted average pollution variable. Because a unit increase in pollution on a single day increases the weighted average on all subsequent days, the effect of that single day's increase will be β*ωi on each of the q subsequent days, or β* overall.

We fit unconstrained distributed lag model adapted to the GAM model from section 3.1:

where P_t is the air pollution level on the day t of the incidence occurrence, and P_t-5 is the air pollution level 5 days prior to incidence occurrence.

Because there is a substantial correlation between air pollution concentrations on the days close together, the above regression may have a high degree of colinearity, that results in unstable, but unbiased estimates of β's.

Next, to gain more insight into the shape of the distribution of the effect over lag, we fit single day's air pollutant exposure model, where pollutant level at day t and on subsequent days for up to 5 days are fit separately. This is an alternative approach to the unconstrained distributed lag model which is a constrained lag model, with a very restrictive constrain where β₁= β₂ = . . .= β_q=0. As we are not quite sure that the effects of pollution are limited to a single day, these constraints are much more restrictive than those in the undistributed lag model, and are likely to introduce bias in the estimated overall effect. They, however, give a marginal effect of a single day exposure and thus provide useful insight into the shape of the distribution of the effect over lag.

We finally also fit a traditional moving average approach with a 6-day moving average, to compare overall effect estimates with those from undistributed lag model.