Privacy-preserving Clustering of Single-cell RNA Sequencing Data on Intel SGX
Abstract
This project implements an unsupervised clustering system for large-scale data that runs on Intel SGX. Intel SGX is a trusted computing technology comparable to ARM TrustZone, but aimed at desktop and server platforms; it is a set of security-related instruction codes built into some modern Intel central processing units (CPUs). The ScziDesk clustering algorithm serves as the core clustering method of the system. ScziDesk is a machine-learning-based clustering algorithm that combines an autoencoder, self-training K-means and KL divergence. AES-128 is used to encrypt data in transit. The project is divided into two parts: the data owner part and the SGX enclave (hardware) part. The data owner part covers pre-processing of the single-cell RNA-seq data, encryption of the data, and feeding the data into a TCP transmission stream. The SGX enclave part covers data reception, data decryption, neural network processing (including training of the ScziDesk clustering network and prediction of results), and return of the results. Because of the number of useful libraries available, the data pre-processing is written in Python; the raw data is an H5 file of single-cell RNA-seq results. All other parts are programmed in Rust. The data owner part runs outside the SGX enclave, and the SGX enclave part runs inside it. Real data sets are used to analyze the efficiency, accuracy and resource allocation of the system.
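To make the combination of self-training and KL divergence mentioned above more concrete, the following is a minimal sketch of a generic self-training clustering objective in the style of deep embedded clustering. It is an illustration under stated assumptions, not the project's actual code: the Student's t kernel, the sharpened target distribution, the function names and the toy embeddings are all assumptions, and ScziDesk's exact formulation may differ. The abstract states that the enclave side is written in Rust; Python is used here only for brevity.

```python
import math

def soft_assign(z, centers, alpha=1.0):
    """Soft assignment q_ij of each embedded cell to each cluster center,
    using a Student's t kernel (an assumption borrowed from DEC-style methods)."""
    q = []
    for point in z:
        weights = []
        for c in centers:
            d2 = sum((a - b) ** 2 for a, b in zip(point, c))
            weights.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(weights)
        q.append([w / total for w in weights])
    return q

def target_distribution(q):
    """Sharpened self-training targets: p_ij is proportional to q_ij^2 / f_j,
    where f_j = sum_i q_ij is the soft size of cluster j."""
    k = len(q[0])
    f = [sum(row[j] for row in q) for j in range(k)]
    p = []
    for row in q:
        w = [row[j] ** 2 / f[j] for j in range(k)]
        total = sum(w)
        p.append([x / total for x in w])
    return p

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all cells and clusters."""
    return sum(
        pij * math.log((pij + eps) / (qij + eps))
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
    )

# Hypothetical 2-D embeddings of four cells and two cluster centers.
z = [[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]]
centers = [[0.0, 0.0], [3.0, 3.0]]

q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_divergence(p, q)  # quantity a training loop would minimize

# Hard cluster label per cell: index of the largest soft assignment.
labels = [max(range(len(row)), key=row.__getitem__) for row in q]
```

In a full training loop, the targets would typically be held fixed for a number of steps while the encoder and cluster centers are updated to reduce the loss, usually alongside the autoencoder's reconstruction loss.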