An important tool in forensic science is the likelihood ratio (LR), which quantifies the strength of evidence. It does so by comparing the probabilities of the evidence under two mutually exclusive hypotheses: the prosecution hypothesis Hp and the defense hypothesis Hd. However, if the underlying probability model used to determine these probabilities is incorrect, the resulting LR-values can be misleading, for example biased towards one of the hypotheses. The ability of an LR-system to produce LR-values that reflect the true probabilities of the evidence under the hypotheses is called `consistency'. Ensuring the consistency of LR-systems is therefore necessary to prevent biases and inaccuracies.
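In symbols, with E denoting the evidence, the LR is the ratio of the two conditional probabilities (this is the standard definition from the LR literature, not a formula specific to this thesis):

$$ \mathrm{LR} = \frac{P(E \mid H_p)}{P(E \mid H_d)}, $$

so that LR > 1 supports Hp and LR < 1 supports Hd.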
Several methods to evaluate the consistency of LR-systems have been developed over the past decade, but thorough comparisons identifying which is most effective on real case data are lacking. This thesis aims to fill that gap by developing and optimizing the existing methods and comparing them to one another. An in-depth comparative analysis is conducted of several existing metrics, as well as some newly introduced ones, to evaluate the consistency of LR-systems.
This study evaluates the consistency metrics Cllrcal and devPAV. Each metric is first optimized individually before the metrics are compared to one another. A third metric, named Fid and based on advanced calibration techniques, is introduced and compared to the other two to determine which performs best on different datasets. Performance is evaluated on three criteria: the ability to distinguish between consistent and inconsistent LR-systems, the reliability of the metric's output, and the sensitivity to dataset size. Several different datasets are used for this evaluation.
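To illustrate what such a metric computes, the sketch below implements Cllrcal as it is commonly defined in the calibration literature: the log-likelihood-ratio cost Cllr of the raw LRs minus the Cllr that remains after optimal recalibration with the pool-adjacent-violators (PAV) algorithm. This is a minimal sketch under that assumption, not the implementation used in the thesis; the function names and the use of scikit-learn's IsotonicRegression are illustrative choices.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def cllr(lr_ss, lr_ds):
    """Log-likelihood-ratio cost of same-source (lr_ss) and
    different-source (lr_ds) likelihood ratios."""
    return 0.5 * (np.mean(np.log2(1 + 1 / lr_ss))
                  + np.mean(np.log2(1 + lr_ds)))

def cllr_cal(lr_ss, lr_ds, eps=1e-10):
    """Calibration loss: Cllr of the raw LRs minus the Cllr that
    remains after optimal (PAV) recalibration."""
    log_lrs = np.log(np.concatenate([lr_ss, lr_ds]))
    labels = np.concatenate([np.ones(len(lr_ss)), np.zeros(len(lr_ds))])
    # PAV fits a monotone map from log-LR to an empirical posterior
    # probability; eps-clipping avoids infinite recalibrated LRs.
    pav = IsotonicRegression(y_min=eps, y_max=1 - eps, out_of_bounds="clip")
    post = pav.fit_transform(log_lrs, labels)
    # Convert posteriors back to LRs, dividing out the class proportion
    # of the pooled data, which acts as the effective prior odds.
    prior_odds = len(lr_ss) / len(lr_ds)
    lr_cal = post / (1 - post) / prior_odds
    cllr_min = cllr(lr_cal[labels == 1], lr_cal[labels == 0])
    return cllr(lr_ss, lr_ds) - cllr_min

# Hypothetical demo on synthetic LRs: same-source LRs tend to exceed 1.
rng = np.random.default_rng(0)
lr_ss = np.exp(rng.normal(2.0, 1.0, 500))
lr_ds = np.exp(rng.normal(-2.0, 1.0, 500))
print(cllr_cal(lr_ss, lr_ds))
```

Under this definition, a consistent LR-system yields a Cllrcal close to zero, since PAV recalibration can then barely improve on the raw LRs.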
The results show that Cllrcal outperforms the other metrics in distinguishing between consistent and inconsistent LR-systems. However, it falls short in terms of reliability, assigning different values to LR-systems that are all consistent. devPAV, on the other hand, demonstrates high reliability, both across different datasets and across different dataset sizes. The Fid metric performs similarly to devPAV but has the disadvantage of not working on smaller datasets. As a metric it may therefore not be preferred, although the underlying method does offer interesting insights into the consistency of LR-systems.
These findings improve the tools available for forensic evidence interpretation, helping to make forensic practice more accurate and reliable. By identifying the most suitable metric for evaluating consistency, this research contributes to a fairer and more precise criminal justice system.