A Framework for Identifying Evolution Patterns of Open-Source Software Projects


Abstract

Research on open-source software evolution has gained popularity over the last decade, focusing on its theoretical determining factors. Other works have modeled growth patterns using time series techniques, either on small samples of projects and metrics or on larger datasets that are not openly available. The limited reproducibility and scalability of these methodologies add to the lack of research on time series methods applied to open-source software evolution. Time series approaches from other domains are therefore needed to address the multivariate nature of larger and more variable samples of open-source projects and their metric time series. This thesis aims to provide a reproducible and scalable framework that supports researchers in studying open-source software evolution using pattern modeling, time series merging, multivariate time series clustering, and multivariate time series forecasting. An openly available dataset of 1328 projects is built using relevant metrics drawn from a systematic literature review. The metric time series are segmented and clustered to obtain three generalized growth patterns: Steep, Shallow, and Plateau. The sequence of patterns and their correlation are used to form three project clusters, on which prediction models for all metrics are trained to perform multivariate time series forecasting. Experimental results support the reproducibility and scalability of the framework and show how pattern shifts can be linked to real events in projects' histories. The thesis provides an additional perspective on open-source software evolution and can serve as a starting point for further studies.