Vector Processing in NUMA Systems
Abstract
Over the last decade, applications such as self-driving cars, image recognition and speech processing have had a growing impact on society. All of these applications are based on machine learning, and machine learning relies heavily on matrix and vector operations. For that reason, vector processors are attracting attention again.
Most previous research on vector processors focuses on single-core performance. However, most of today's large-scale computer systems use a non-uniform memory access (NUMA) architecture, and how to deploy vector processors efficiently in a NUMA environment remains an open problem. NUMA is a shared-memory model in which multiple memories are distributed across the system, usually one memory per NUMA node. A processor takes longer to access memory in another node than memory in its own node, which creates an opportunity to improve the performance of a NUMA system by accelerating remote memory accesses.
In this thesis, a subset of the PARSEC benchmark suite is vectorized for both the ARM Scalable Vector Extension (SVE) and the RISC-V Vector Extension (RVV), and the single-core performance of the two types of processors is compared on this benchmark subset. A NUMA system is then built for ARM SVE, and its memory access pattern is analysed with the gem5 simulator.