Clustering small and medium sized Dutch enterprises using hybrid intelligence
Abstract
We are living in a world full of data. Data capture the characteristics of the entities around us, such as living species or the properties of scientific experiments. Moreover, data provide a basis for further analysis, reasoning, and decision-making. One of the most common applications of data analysis is to group data into a set of clusters to understand the underlying structure of a given data set. Classification can be either supervised or unsupervised, depending on whether it assigns a data object to one of a finite list of discrete labels or to unlabeled categories. In unsupervised classification, also called clustering, no labeled data is available. State-of-the-art clustering algorithms are developed as general-purpose methods, without targeting any specific application, and are used in a variety of application domains. However, because these algorithms lack domain-specific information and user-specific input, they do not always produce relevant results. Categorical features in the dataset also make clustering harder because they lack semantics. Moreover, unlike classification tasks, which are evaluated using well-defined target labels, clustering is an intrinsically subjective task, as it depends on the interpretation, needs, and interests of users. The lack of ground truth makes cluster validation harder in an unsupervised setting, and there is no universally adopted approach for choosing features or clustering schemes. To tackle these challenges, there is an increasing need for methods that engage humans in the clustering process to tailor it to specific application domains and allow it to adapt continuously to their preferences. Such an approach, in which human knowledge is used to achieve better computational results, falls under the domain of hybrid intelligence. This thesis explores the possibility of designing an end-to-end cluster analysis workflow using hybrid intelligence.
The thesis aims to answer the following research question: How can we use hybrid intelligence in a cluster analysis workflow to generate user-specific clusters and evaluate them? We address this question by introducing multiple novel methods that target the following challenges: evaluating clusters, creating semantics in categorical features, and performing user-specific cluster analysis. We apply the developed methodologies to real-time financial data to cluster small and medium-sized Dutch enterprises (SMEs). Our experiments show that we can cluster Dutch SMEs according to user-specific goals based on their financial behavior. We believe that we achieve these results by creating semantics in the categorical features. The clustering results are further analyzed from a user requirement perspective. Our proposed cluster validation game enables us to validate cluster objects using human intelligence. The associated experimental results give us confidence in our hypothesis that hybrid intelligence is one way to address these clustering challenges.