Effectiveness of propensity score methods with density estimation in identifying overlap for causal inference

Bachelor thesis (2023)

Authors

J.K.K. Tjong Electrical Engineering, Mathematics and Computer Science

Contributors

J.H. Krijthe Pattern Recognition and Bioinformatics - (mentor)

R.K.A. Karlsson Pattern Recognition and Bioinformatics - (mentor)

F.A. Oliehoek Interactive Intelligence - (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:58a511b6-f773-42ce-845c-4785a1293faa

Published Date

28-06-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Causal inference requires sufficient overlap between the treatment and control groups. Propensity scores can be used together with the positivity assumption to check that overlap is present. However, positivity alone is not enough to properly identify the region of overlap. For this, propensity scores need to be combined with density estimation. This project evaluates that combined method to discover in which scenarios it succeeds or fails at identifying the region of overlap. More specifically, it examines how the method scales with more features or outliers, and how the choice of classifier affects performance. The method was tested on samples from a simulated dataset, and the predicted overlap was compared with the true overlap of the known distributions.
In the experiments, the method seems to perform best when the treatment and control groups share a single region of overlap. In this case, logistic regression works best of the classifiers that were tested. Overall performance drops when the two groups have multiple regions of overlap, and in that case the random forest classifier performs best instead. Across all scenarios, the performance of the model drops with increasing dimensionality. A small percentage of outliers only slightly affects the model. With more outliers, logistic regression is the only classifier that is affected further.
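
The idea of combining a propensity score with a density estimate can be illustrated with a small sketch. It is not the thesis implementation. The one-dimensional Gaussian groups, the hand-written logistic regression, the kernel bandwidth, the thresholds `eps` and `min_density`, and all function names are assumptions made for illustration only.

```python
import math
import random


def fit_logistic_1d(xs, ts, lr=0.5, steps=1000):
    """Fit P(T=1 | x) = sigmoid(w*x + b) by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in zip(xs, ts):
            err = 1.0 / (1.0 + math.exp(-(w * x + b))) - t
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def propensity(x, w, b):
    """Estimated probability of treatment at covariate value x."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def kde(x, sample, bandwidth=0.3):
    """Gaussian kernel density estimate of the pooled sample at x."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm


def in_overlap(x, w, b, sample, eps=0.1, min_density=0.01):
    """Assumed rule: propensity away from 0 and 1, and enough data nearby."""
    e = propensity(x, w, b)
    return eps < e < 1.0 - eps and kde(x, sample) > min_density


# Simulated data (assumed setup): control ~ N(-1, 1), treated ~ N(1, 1).
random.seed(0)
control = [random.gauss(-1.0, 1.0) for _ in range(200)]
treated = [random.gauss(1.0, 1.0) for _ in range(200)]
xs = control + treated
ts = [0] * len(control) + [1] * len(treated)

w, b = fit_logistic_1d(xs, ts)

for x in (-3.0, 0.0, 3.0):
    print(x, in_overlap(x, w, b, xs))
```

In this sketch the propensity bounds play the role of the positivity check. The density threshold is meant to exclude points where the estimated propensity looks balanced but few or no samples were observed. The thresholds are arbitrary here and would need tuning in any real use.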

Files

Rp_final_paper.pdf

(pdf | 1.3 Mb)