T.J. Viering
26 records found
Malware Evolution
Unraveling Malware Genomics: Synergistic Approach using Deep Learning and Phylogenetic Analysis for Evolutionary Insights
The rapid advancement of artificial intelligence technologies has significantly increased the complexity of polymorphic and metamorphic malware, presenting new challenges to cybersecurity defenses. Our study introduces a novel bioinformatics-inspired approach, leveraging dee ...
Learning curves illustrate the relationship between the performance of learning algorithms and the increasing volume of training data [1, 2, 3]. While the concept of learning curves is well-established, clustering these curves based on fitting parameters remains an underexplored
...
Learning curves are useful to determine the amount of data needed for a certain performance. The conventional belief is that increasing the amount of data improves performance. However, recent work challenges this assumption, and shows nonmonotonic behaviors of certain learners o
...
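As a rough, generic sketch of how an empirical learning curve of the kind studied here can be computed (the learner, dataset, and size grid below are illustrative assumptions, not the setup used in the work above):

```python
# Minimal sketch of computing an empirical learning curve with scikit-learn.
# The learner, dataset, and evaluation grid are illustrative choices only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),  # fractions of the training split
    cv=5,
    scoring="accuracy",
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"n={n:5d}  mean test accuracy={score:.3f}")
```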
Learning Curve Extrapolation using Machine Learning
Benefits and Limitations of using LCPFN for Learning Curve Extrapolation
This study explores the extrapolation of learning curves, a crucial aspect in evaluating learner performance with varying dataset sample sizes. We use the Learning Curve Prior Fitted Network (LC-PFN), a transformer pre-trained on synthetic data with proficiency in approximate Bay
...
Learning Curves
How do Data Imbalances affect the Learning Curves using the Nearest Mean Model?
This research investigates the impact of data imbalances on the learning curve using the nearest mean model. Learning curves are useful to represent the performance of the model as the training size increases. Imbalanced datasets are often encountered in real-life scenarios and p
...
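A minimal sketch of the kind of experiment this describes, using scikit-learn's NearestCentroid as the nearest mean model; the 9:1 class ratio, synthetic data generator, and sample-size grid are assumptions made for illustration, not the thesis's actual setup:

```python
# Learning curve of a nearest-mean (nearest-centroid) classifier on an
# imbalanced binary problem (illustrative setup only).
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestCentroid

X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

for n in [100, 200, 500, 1000, 2000]:
    clf = NearestCentroid().fit(X_pool[:n], y_pool[:n])
    acc = balanced_accuracy_score(y_test, clf.predict(X_test))
    print(f"n={n:4d}  balanced accuracy={acc:.3f}")
```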
Clustering Learning Curves in Machine Learning using K-Means Algorithm
Can patterns be identified amongst learning curves after the application of the K-Means algorithm using point and statistical vectors?
A learning curve can serve as an indicator of the “performance of trained models versus the training set size” [1]. Recent research on learning curve analysis has been carried out within the Learning Curve Database (LCDB) [2]. This paper will investigate if there are similarities
...
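A minimal sketch of clustering learning curves with K-Means; here each curve is represented by its fitted power-law parameters rather than by the point or statistical vectors the question refers to, and the synthetic curves and cluster count are assumptions for illustration, not the LCDB-based setup above:

```python
# Cluster synthetic learning curves via K-Means on fitted power-law
# parameters err(n) = a * n**(-b) + c (illustrative feature choice).
import numpy as np
from scipy.optimize import curve_fit
from sklearn.cluster import KMeans

def pow_law(n, a, b, c):
    return a * n**(-b) + c

rng = np.random.default_rng(0)
n = np.arange(10, 1000, 10, dtype=float)

# A few noisy synthetic curves with different shapes.
curves = [pow_law(n, a, b, c) + rng.normal(0, 0.005, n.size)
          for a, b, c in [(1.0, 0.5, 0.05), (0.8, 0.3, 0.10),
                          (1.2, 0.7, 0.02), (0.9, 0.31, 0.11)]]

# Represent each curve by its fitted parameter vector.
features = []
for y in curves:
    params, _ = curve_fit(pow_law, n, y, p0=[1.0, 0.5, 0.1], maxfev=10000)
    features.append(params)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.array(features))
print(labels)
```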
Machine learning algorithms (learners) are typically expected to produce monotone learning curves, meaning that their performance improves as the size of the training dataset increases. However, it is important to note that this behavior is not universally observed. Recently ...
“How Much Data is Enough?” Learning curves for machine learning
Investigating alternatives to the Levenberg-Marquardt algorithm for learning curve extrapolation
This research explores fitting algorithms for learning curves. Learning curves describe how the performance of a machine learning model changes with the size of the training input. Therefore, fitting these learning curves and extrapolating them can help determine the req
...
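For context, the Levenberg-Marquardt algorithm is the default unconstrained solver behind scipy.optimize.curve_fit. A minimal sketch of fitting a power-law learning curve with it; the curve model and the noisy synthetic data are illustrative assumptions:

```python
# Fit a power-law learning curve with the Levenberg-Marquardt solver
# exposed through scipy.optimize.curve_fit (method="lm").
import numpy as np
from scipy.optimize import curve_fit

def pow_law(n, a, b, c):
    # error(n) = a * n^(-b) + c, a common parametric learning-curve model
    return a * n**(-b) + c

rng = np.random.default_rng(1)
n = np.array([16, 32, 64, 128, 256, 512, 1024], dtype=float)
err = pow_law(n, 0.9, 0.4, 0.06) + rng.normal(0, 0.003, n.size)

params, cov = curve_fit(pow_law, n, err, p0=[1.0, 0.5, 0.1], method="lm")
a, b, c = params
print(f"fitted: a={a:.3f}, b={b:.3f}, c={c:.3f}")
print("extrapolated error at n=8192:", pow_law(8192, a, b, c))
```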
A Comparative Analysis of Learning Curve Models and their Applicability in Different Scenarios
Finding dataset patterns that lead to a certain parametric curve model
Learning curves display predictions of the chosen model’s performance for different training set sizes. They can help estimate the amount of data required to achieve a minimal error rate, thus aiding in reducing the cost of data collection. However, our understanding and knowledg
...
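A small sketch of what such a comparison can look like: a few common parametric forms (power law, exponential, logarithmic) are fitted to the same curve and ranked by error on held-out anchors. The data are synthetic and the fit/validation split is an illustrative assumption, not the paper's protocol:

```python
# Compare several parametric learning-curve models on synthetic data.
import numpy as np
from scipy.optimize import curve_fit

models = {
    "pow3": (lambda n, a, b, c: a * n**(-b) + c,        [1.0, 0.5, 0.1]),
    "exp3": (lambda n, a, b, c: a * np.exp(-b * n) + c, [1.0, 0.01, 0.1]),
    "log2": (lambda n, a, b:    -a * np.log(n) + b,     [0.1, 1.0]),
}

rng = np.random.default_rng(2)
n = np.array([16, 32, 64, 128, 256, 512, 1024, 2048], dtype=float)
err = 0.8 * n**(-0.35) + 0.05 + rng.normal(0, 0.003, n.size)

n_fit, err_fit = n[:6], err[:6]    # fit on the smaller anchors
n_val, err_val = n[6:], err[6:]    # validate on the largest anchors

for name, (f, p0) in models.items():
    params, _ = curve_fit(f, n_fit, err_fit, p0=p0, maxfev=20000)
    mse = np.mean((f(n_val, *params) - err_val) ** 2)
    print(f"{name}: validation MSE = {mse:.2e}")
```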
Learning curves in machine learning are graphical representations that depict the relationship between a model's performance and the amount of training data it has been exposed to. They play a fundamental role in understanding how knowledge and skills are acquired across a range of domains. Altho
...
Empirical Investigation of Learning Curves
Assessing Convexity Characteristics
Nonconvexity in learning curves is almost always undesirable. A machine learning model with a non-convex learning curve either requires a larger quantity of data before progress in its accuracy is observed or experiences an exponential decrease in accuracy at low sample sizes, with no im
...
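A minimal sketch of one way convexity of a discretely sampled curve can be assessed, by checking that segment slopes are non-decreasing; the sample curve below is a made-up illustration, not data from this work:

```python
# Assess convexity of a discretely sampled learning curve via slopes.
import numpy as np

n = np.array([16, 32, 64, 128, 256, 512], dtype=float)
err = np.array([0.42, 0.31, 0.24, 0.20, 0.18, 0.17])

slopes = np.diff(err) / np.diff(n)   # discrete slope on each segment
second = np.diff(slopes)             # change of slope between segments

print("segment slopes:", np.round(slopes, 5))
# A convex curve has non-decreasing slopes on the sampled anchors.
print("convex on sampled anchors:", bool(np.all(second >= 0)))
```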
Non-Monotonicity in Empirical Learning Curves
Identifying non-monotonicity through slope approximations on discrete points
Learning curves are used to characterize the performance of a Machine Learning (ML) model with respect to the size of its training set. It was commonly thought that adding more training samples would increase the model's accuracy (i.e., that learning curves are monotone), but recent works s
...
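A minimal sketch of flagging non-monotonic regions by approximating slopes between consecutive anchors and checking their sign; the error curve below (with a small bump) is a made-up illustration, not the data used in this work:

```python
# Detect non-monotonicity in an empirical error curve from discrete slopes.
import numpy as np

n = np.array([16, 32, 64, 128, 256, 512, 1024], dtype=float)
err = np.array([0.40, 0.30, 0.25, 0.27, 0.22, 0.20, 0.19])  # bump at n=128

slopes = np.diff(err) / np.diff(n)
# For an error curve, monotone improvement means all slopes are <= 0.
violations = np.where(slopes > 0)[0]

for i in violations:
    print(f"error increases between n={int(n[i])} and n={int(n[i + 1])}")
print("curve is monotone:", violations.size == 0)
```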
Learning curves have been used extensively to analyse learners' behaviour and to support practical tasks such as model selection, speeding up training, and tuning models. Nonetheless, we still have a relatively limited understanding of the behaviour of learning curves themselves, in particul
...
Learning curves display the accuracy or error on test data of a machine learning algorithm trained on different amounts of training data. They can be modeled by parametric curve models that help predict accuracy improvement through curve extrapolation methods. However,
...
Extrapolation of the learning curve provides an estimate of how much data is needed to achieve the desired performance. It can be beneficial when gathering data is complex or computational resources are limited. One of the essential processes of learning curve extrapolation is cur
...
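Once a parametric curve has been fitted, it can be inverted to estimate how much data is needed for a target error. A minimal sketch under a power-law model; the fitted parameter values below are assumed purely for illustration:

```python
# Invert a fitted power law err(n) = a * n^(-b) + c to estimate the
# number of samples needed to reach a target error.
a, b, c = 0.9, 0.4, 0.06   # illustrative fitted values
target_err = 0.10

if target_err <= c:
    print("target is below the estimated irreducible error c; unreachable under this model")
else:
    # Solve a * n^(-b) + c = target_err for n.
    n_needed = (a / (target_err - c)) ** (1.0 / b)
    print(f"estimated samples needed: ~{int(round(n_needed))}")
```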
The learning curve illustrates how the generalization performance of the learner evolves with more training data. It can predict the amount of data needed for decent accuracy and the highest achievable accuracy. However, the behavior of learning curves is not well understood. Man
...
Does a convolutional neural network (CNN) always have to be deep to learn a task? This is an important question as deeper networks are generally harder to train. We trained shallow and deep CNNs and evaluated their performance on simple regression tasks, such as computing the mea
...
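A rough sketch of the kind of comparison this describes, using PyTorch: a shallow and a deeper CNN are trained on the toy regression task of predicting the mean pixel value of an image. Architecture sizes, data, and training budget are assumptions for illustration, not the thesis's setup:

```python
# Compare a shallow and a deeper CNN on a simple regression task.
import torch
import torch.nn as nn

def make_cnn(depth):
    layers, ch = [], 1
    for _ in range(depth):
        layers += [nn.Conv2d(ch, 8, kernel_size=3, padding=1), nn.ReLU()]
        ch = 8
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, 1)]
    return nn.Sequential(*layers)

torch.manual_seed(0)
X = torch.rand(512, 1, 16, 16)               # random grayscale images
y = X.mean(dim=(1, 2, 3)).unsqueeze(1)       # target: mean pixel value

for depth in (1, 4):
    model = make_cnn(depth)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(200):                     # short full-batch training budget
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    print(f"depth={depth}: final training MSE = {loss.item():.5f}")
```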
This research provides an overview of how training Convolutional Neural Networks (CNNs) on imbalanced datasets affects their performance. Datasets can be imbalanced for several reasons; there are, for example, naturally fewer samples of rare diseases. Since the
...
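One common mitigation in this setting is to reweight the training loss by inverse class frequency. A minimal PyTorch sketch; the class counts are assumed for illustration and the CNN itself is omitted:

```python
# Reweight the cross-entropy loss by inverse class frequency.
import torch
import torch.nn as nn

class_counts = torch.tensor([900.0, 80.0, 20.0])   # e.g. a 45:4:1 imbalance
weights = class_counts.sum() / (len(class_counts) * class_counts)

# Pass the weights to the loss so rare classes contribute more per sample.
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(16, 3)                  # stand-in for CNN outputs
targets = torch.randint(0, 3, (16,))
print("weighted loss:", criterion(logits, targets).item())
```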
With an expected 8.3 trillion photos stored in 2021 [1], convolutional neural networks (CNNs) are becoming preeminent in the field of image recognition. However, with this type of deep neural network (DNN) still being seen as a black box, it is hard to fully employ its capabi
...
It sounds like Greek to me
Performance of phonetic representations for language identification
This paper compares the performance of two phonetic notations, IPA and ASJPcode, with the alphabetical notation for word-level language identification. Two machine learning models, a Multilayer Perceptron and a Logistic Regression model, are used to classify words using each o
...
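A minimal sketch of word-level language identification with the two classifier families mentioned above, using character n-gram features over whatever string representation (orthographic, IPA, or ASJPcode) is supplied. The tiny word list is a made-up stand-in for a real dataset:

```python
# Word-level language identification with character n-gram features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

words = ["water", "wasser", "agua", "eau", "apple", "apfel", "manzana", "pomme"]
langs = ["en", "de", "es", "fr", "en", "de", "es", "fr"]

for clf in (LogisticRegression(max_iter=1000),
            MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)):
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        clf,
    ).fit(words, langs)
    print(type(clf).__name__, model.predict(["wein", "vino"]))
```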