M. Loog | TU Delft Repository

On the usage of wavelet-based techniques for Synthetic Image Detection

Master thesis (2023) - A. Joyandeh (author) , G. Jongbloed (mentor) , M. Loog (mentor) , Hanne Kekkonen (mentor)

With the rise of zero-shot synthetic image generation models, such as Stability.ai's Stable Diffusion, OpenAI's DALLE or Google's Imagen, the need for powerful tools to detect synthetic generated images has never been higher. In this thesis we contribute to this goal by consideri ...

Towards a Linear-Data Monotone Wrapper Algorithm For Machine Learning Algorithms

Master thesis (2023) - B.H. Kam (author) , Jan Van Gemert (mentor) , M Loog (mentor) , Tom J. Viering (mentor)

Machine learning algorithms (learners) are typically expected to produce monotone learning curves, meaning that their performance improves as the size of the training dataset increases. However, it is important to note that this behavior is not universally observed. Recently ...

Computational models for clinical drug response prediction

Aligning transcriptomic data of patients and pre-clinical models

Doctoral thesis (2023) - S.M.C. Mourragui (author) , Lodewyk F. Wessels (promotor) , Marcel Reinders (promotor) , M. Loog (promotor)

Extensive efforts in cancer research over the past decades have markedly improved diagnosis and treatments, leading to better outcomes for cancer patients. Paradoxically, however, these discoveries have begun to shed light on a level of complexity that rules out the emergence of ...

In Search of Best Learning Curve Model

Bachelor thesis (2022) - D.V.Q. Nguyen (author) , Tom Viering (mentor) , Marco Loog (mentor) , Georgios Smaragdakis (graduation committee member)

Learning curves have been used extensively to analyse learners' behaviour and practical tasks such as model selection, speeding up training and tuning models. Nonetheless, we still have a relatively limited understanding of the behaviour of learning curves themselves, in particul ...

Different approaches to fitting and extrapolating the learning curve

Bachelor thesis (2022) - D. KIM (author) , Tom Viering (mentor) , Marco Loog (mentor) , Georgios Smaragdakis (graduation committee member)

Extrapolation of the learning curve provides an estimation of how much data is needed to achieve the desired performance. It can be beneficial when gathering data is complex, or computation resource is limited. One of the essential processes of learning curve extrapolation is cur ...

To Tune or not to Tune: Hyperparameter Influence on the Learning Curve

Bachelor thesis (2022) - P. Bhaskaran (author) , Tom Viering (mentor) , Marco Loog (mentor) , Georgios Smaragdakis (graduation committee member)

A learning curve displays the measure of accuracy/error on test data of a machine learning algorithm trained on different amounts of training data. They can be modeled by parametric curve models that help predict accuracy improvement through curve extrapolation methods. However, ...

Explain Strange Learning Curves in Machine Learning

Bachelor thesis (2022) - Z. Chen (author) , Tom Viering (mentor) , Marco Loog (mentor) , Georgios Smaragdakis (graduation committee member)

The learning curve illustrates how the generalization performance of the learner evolves with more training data. It can predict the amount of data needed for decent accuracy and the highest achievable accuracy. However, the behavior of learning curves is not well understood. Man ...

Pruning Latent Neurons in Autoencoders using Early-Bird Tickets

Master thesis (2022) - R.F. Klazinga (author) , M Loog (mentor) , Jan Van Gemert (graduation committee member) , Cynthia C.S. Liem (graduation committee member)

Autoencoders seek to encode their input into a bottleneck of latent neurons, and then decode it to reconstruct the input. However, if the input data has an intrinsic dimension (ID) smaller than the number of latent neurons in the bottleneck, this encoding becomes redundant.
...

Factors related to dataset that influence the shape of learning curves

Bachelor thesis (2022) - N.T. Bui (author) , T.J. Viering (mentor) , M. Loog (mentor) , G. Smaragdakis (graduation committee member)

Although there are many promising applications of a learning curve in machine learning, such as model selection, we still know very little about what factors influence their behaviours. The aim is to study the impact of the inherent characteristics of the datasets on the learning ...

Reduce model unfairness with maximal-correlation-based fairness optimization

Master thesis (2022) - W. Huang (author) , M Loog (mentor) , Jan Van Gemert (graduation committee member) , Christoph Lofi (graduation committee member)

Supervised machine learning is a growing assistive framework for professional decision-making. Yet bias that causes unfair discrimination has already been presented in the datasets. This research proposes a method to reduce model unfairness during the machine learning training pr ...

Are CNNs that Learn to Predict Image Statistics Invariant to Domain Shifts?

Bachelor thesis (2021) - J.P. Biesheuvel (author) , T.J. Viering (mentor) , Ziqi Wang (mentor) , David M. J. Tax (mentor) , M. Loog (mentor) , Klaus Hildebrandt (graduation committee member)

Yes, convolutional neural networks are domain-invariant, albeit to some limited extent. We explored the performance impact of domain shift for convolutional neural networks. We did this by designing new synthetic tasks, for which the network’s task was to map images to their mean ...

Is the batch size affecting the performance of Regression CNNs ?

Bachelor thesis (2021) - J.A.D. Lamon (author) , O.T. Turan (mentor) , Marco Loog (mentor) , DMJ Tax (mentor) , Tom Julian Viering (mentor) , Yuko Kato (mentor) , Z. Wang (mentor) , K.A. Hildebrandt (graduation committee member)

With an expectation of 8.3 trillion photos stored in 2021 [1], convolutional neural networks (CNN) are beginning to be preeminent in the field of image recognition. However, with this deep neural network (DNN) still being seen as a black box, it is hard to fully employ its capabi ...

Go Deep or Go Home?

Bachelor thesis (2021) - M.C. den Heijer (author) , Tom Viering (mentor) , Y. Kato (mentor) , O.T. Turan (mentor) , Ziqi Wang (mentor) , Marco Loog (mentor) , David M.J. Tax (mentor)

Does a convolutional neural network (CNN) always have to be deep to learn a task? This is an important question as deeper networks are generally harder to train. We trained shallow and deep CNNs and evaluated their performance on simple regression tasks, such as computing the mea ...

How does imbalanced data affect performance of regression CNNs?

Bachelor thesis (2021) - R.K. Thakoersingh (author) , Tom Viering (mentor) , Y. Kato (mentor) , M. Loog (mentor) , David M. J. Tax (mentor) , K.A. Hildebrandt (coach)

This research provides an overview on how training Convolutional Neural Networks (CNNs) on imbalanced datasets affect the performance of the CNNs. Datasets could be imbalanced as a result of several reasons. There are for example naturally less samples of rare diseases. Since the ...

It sounds like Greek to me

Performance of phonetic representations for language identification

Bachelor thesis (2021) - D.J. IJpma (author) , Tom Julian Viering (mentor) , M. Loog (coach) , S. Makrodimitris (mentor) , Arman Naseri (mentor) , Catharine Oertel (coach)

This paper compares the performance of two phonetic notations, IPA and ASJPcode, with the alphabetical notation for word-level language identification. Two machine learning models, a Multilayer Percerptron and a Logistic Regression model, are used to classify words using each o ...

Exploring the Potential of Performance Bounds in Multi-Source Domain Adaptation

Master thesis (2021) - W.J.W. Bons (author) , M. Loog (mentor) , Jan van van Gemert (graduation committee member) , R.C. Hendriks (graduation committee member) , Jorge Martinez (graduation committee member)

Currently, trained machine learning models are readily available, but their training data might not be (for example due to privacy reasons). This thesis investigates how pre-trained models can be combined for performance on all their source domains, without access to data. This p ...

Currently, trained machine learning models are readily available, but their training data might not be (for example due to privacy reasons). This thesis investigates how pre-trained models can be combined for performance on all their source domains, without access to data. This problem is formulated as a Multiple-Source Domain Adaptation (MSA) problem setting, where models trained on source domains are combined so that the combiner is robust to application on any unknown target domain. This thesis explores the MSA setting and presents a perspective on MSA theory from literature. The issue in the MSA setting is that target models are not robust in general, leading to negative transfer. Firstly, this issue is illustrated by example of the source models and linearly weighted combinations of the source models. Next, existing theory that guarantees the existence of a robust model is investigated. It is argued that a performance bound has the potential to be extended from the perspective of additional knowledge: in addition to the in the MSA setting available source models, some additional knowledge of the source domains might be used by the model. Existing MSA theory’s assumptions are clarified and the theory is split in two. One half is inherent to the MSA setting and guarantees a model with as robustness property the loss on a matching mixture. The other half assumes a combiner that uses additional knowledge of the source domains, for which the robustness is proven satisfactory. Finally, it is investigated what makes additional knowledge in the MSA setting useful. Current literature assumes a specific target model--the distribution-weighted (DW) combiner--that is viewed as using the sources' joint distributions as additional knowledge. It is argued that knowledge of the training process of the source models can also be used as additional knowledge. In conclusion, this thesis discusses how robustness in the MSA setting can be improved from that of the source models by basing the combiner on additional knowledge of the source domains.

Sampling settings in active learning for investigating inconsistency

Master thesis (2020) - M. Li (author) , M Loog (mentor) , Sicco Verwer (graduation committee member) , Jan Van Gemert (graduation committee member)

Active learning has the potential to reduce labeling costs in terms of time and money. In practical use, active learning works as an efficient data labeling strategy. Another point of view to look at active learning is to consider active learning as a learning problem, where the ...

Photovoltaic Yield Nowcasting

For Residential Solar Systems in the Netherlands Using a Machine Learning Approach

Master thesis (2020) - D.W. Grzebyk (author) , Olindo Isabella (mentor) , H. Ziar (mentor) , Marco Loog (mentor) , Jaap Donker (mentor)

An increasing number of photovoltaic (PV) systems are being installed worldwide and residential sector is responsible for a large part of this growth. Small scale PV systems do not have complex measuring devices and their breakdowns are not spotted immediately by the system owner ...

An increasing number of photovoltaic (PV) systems are being installed worldwide and residential sector is responsible for a large part of this growth. Small scale PV systems do not have complex measuring devices and their breakdowns are not spotted immediately by the system owners. This might lead to prolonged time without generating power and creating both financial loss and environmental damage. This thesis presents a method of PV yield nowcasting laying foundations for remote monitoring. Early detection of faults is the first step towards eliminating the described issues. In this project four machine learning models for predicting solar yield were developed: ElasticNet, Polynomial Regression, Random Forest, and Extreme Gradient Boosting (XGBoost). The models were created both for daily and hourly data sets, as some inverters can log daily yields only. In both cases, the utilized data set consisted of data for the time between July 1st, 2018 and June 30th, 2019 and corresponding to 1,102 PV systems which is five times more than the largest data set studied in the found literature. The average PV system size in the data set is 4.44 kWp. Utilized inputs next to weather data and previous yields included shading factor describing fraction of direct light unable to reach PV system due to surrounding obstacles. Calculation of shading facor was based on 360⁰ pictures taken at the site. XGBoost algorithm turned out to be the most suitable for the task of PV yield nowcasting obtaining RMSE of 1.48 kWh and MAE of 0.877 kWh for hourly data aggregated to daily values and evaluated on future time steps. Currently used commercial software of Solar Monkey has RMSE equal 2.237 kWh and MAE equal 1.5 kWh. XGBoost model trained on daily data obtained RMSE 1.185 kWh and MAE 0.698 kWh outperforming hourly model most likely due to utilization of Hidden Markov Model for data cleaning. Next to overall performance, per system metrics were calculated for the hourly XGBoost. Mean individual RMSE for previously seen systems is 1.656 kWh while for unseen systems it equals 1.666 kWh. This means the model scales well to previously unseen systems and implies that its parallelized version is not necessary. Also, the model's learning saturates after seeing data corresponding to one year and 278 PV systems. Precalculation of GPOA worsened performance with respect to the model utilizing GHI. Hourly XGBoost has hourly RMSE of 0.281 kWh under clear sky and 0.377 kWh under partly cloudy sky which indicates it is more mistaken for cloudy conditions. This could be caused by low quality of cloud coverage data. The model also has large relative errors for small irradiance values which occur mostly in January and December, as well as just after sunrise and just before sunset. This issue is caused by using squared error as loss function during model training. Despite these shortcomings, the conclusive results recommend industrial implementation of the developed model.

BedBasedEcho

Bachelor thesis (2020) - J. Haas (author) , R.F. Klazinga (author) , N. van Stijn (author) , J. Teunissen (author) , Y. Zhang (author) , Marco Loog (mentor) , Eelko Ronner (mentor) , O.W. Visser (graduation committee member)

The core challenge of the BedBasedEcho BEP project is to create an algorithm to find the heart, and apply it on a robotic echocardiography solution. The team has found multiple complex solutions that are related to this problem, and has extracted useful information from these sol ...

Predicting financial time series with incomplete information due to late publications of financial reports

Master thesis (2020) - J.R. Esseveld (author) , M Loog (mentor) , E. Rambier (mentor) , Jan Van Gemert (coach) , Christoph Lofi (coach)

Records from ledgers of Dutch companies all across the Netherlands are used in this study. Records can be submitted in the ledgers with various lags, because the data of many different bookkeepers is involved with different workflows. Bookkeepers can be punctual or late, therefor ...