Missing data can lead to biased and inefficient parameter estimates in statistical models, depending on the missing data mechanism. Count regression models are no exception, with missing data leading to incorrect inferences about the effects of explanatory variables. A convenient approach for dealing with missing data is to remove observations with incomplete records prior to the analysis, often referred to as case-wise deletion. Removing incomplete records, however, reduces the sample size, increases standard errors and, if data are not missing completely at random, produces biased parameter estimates. A more complex approach is multiple imputation, which provides an estimate of the modelling uncertainty created by the data ‘missing-ness’, as distinct from the natural variation in the data. However, multiple imputation produces biased parameter estimates if the probability of missing data is related to the unobserved data, that is, if the missing-ness is endogenous. Latent variable modelling has recently been introduced as an alternative approach for dealing with missing data, but it comes at a high computational cost and complexity. Despite fairly extensive methodological advancements in the statistical literature, case-wise deletion is commonly employed to deal with missing data in statistical models of transport, while the multiple imputation and latent variable approaches remain relatively unexplored. More importantly, the performance of these approaches has not been tested across different types of data missing-ness. To address these gaps, this study contrasts case-wise deletion with the multiple imputation and latent variable approaches to dealing with missing data in count regression models. We compare the performance of these three approaches using crash count models estimated on empirical data obtained from state-controlled roads in Queensland, Australia.
A quasi-experimental evaluation of data missing-ness is then conducted by extracting three data subsets from the original dataset, each with a unique missing data mechanism (with terminology adopted from the statistical literature): missing completely at random, missing at random, and missing not at random. The three approaches are then applied to each data subset and the results are compared in terms of bias, precision of parameter estimates, and goodness-of-fit. The findings indicate that multiple imputation is the most effective approach when data are missing either completely at random or at random, whereas the latent variable approach is more effective when data are missing not at random. However, the effectiveness of the latent variable approach depends on the availability of suitable variables as instruments in the data.
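The three missing data mechanisms named above can be illustrated with a minimal sketch on synthetic data. This is not the study's procedure or dataset: the covariates `x1` and `x2`, the coefficient values in `true_beta`, and the missingness probabilities are all hypothetical, chosen only to show how each mechanism might be imposed on a complete dataset before applying case-wise deletion to a Poisson count model. The Poisson fit is a plain Newton-Raphson routine written for the example rather than a library call.

```python
import numpy as np

rng = np.random.default_rng(42)

def fit_poisson(X, y, n_iter=100, tol=1e-10):
    """Fit a log-link Poisson regression by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start the intercept at the log of the mean count
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)            # score vector
        hess = X.T @ (X * mu[:, None])   # Fisher information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical complete data: x1 is always observed, x2 is subject to missingness.
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
true_beta = np.array([0.5, 0.3, 0.4])  # assumed values, for illustration only
y = rng.poisson(np.exp(X @ true_beta))

# One missingness indicator for x2 per mechanism (True = x2 is missing).
masks = {
    # MCAR: missingness is unrelated to any variable.
    "MCAR": rng.random(n) < 0.3,
    # MAR: missingness depends only on observed data (here the count y).
    "MAR": rng.random(n) < logistic(-1.5 + 0.4 * y),
    # MNAR: missingness also depends on the unobserved value of x2 itself.
    "MNAR": rng.random(n) < logistic(-1.5 + 1.0 * x2 + 0.4 * y),
}

# Case-wise deletion: drop every record in which x2 is missing, then refit.
estimates = {}
for name, missing in masks.items():
    keep = ~missing
    estimates[name] = fit_poisson(X[keep], y[keep])
    print(name, f"missing={missing.mean():.2f}", np.round(estimates[name], 3))
```

Comparing each set of estimates against `true_beta` gives a rough sense of how case-wise deletion behaves under each mechanism. A fuller comparison along the lines of the study would fit multiple imputation and latent variable models to the same masked data.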