With the proliferation of IoT infrastructure, trajectory data is being generated continuously and in growing volumes. It originates from a variety of moving objects and carries large amounts of multi-dimensional information covering space, time and semantics. This information can create added value in scientific research, decision-making, emergency management and other applications.
However, the special properties of trajectory data, namely high frequency, cardinality, dimensionality and heterogeneity, make it difficult for traditional data management systems to handle. Although distributed and big data solutions exist in other fields, they were not designed with the modelling, access, distribution and querying characteristics of this spatio-temporal data in mind.
For related spatial data management problems, a clustering and indexing solution for high-dimensional Point Clouds has been developed, advocated and validated in a series of studies at the GDMC, TU Delft. The solution is based on a Space-filling Curve and takes the heterogeneous spatial distribution of the data into account.
However, it is uncertain whether this framework can be extended to other types of space-related phenomena. Whether distributed database techniques can be used, and what adjustments they would require, also remains to be explored. To some extent, this thesis extends the Point Cloud research mentioned above.
To address these data management challenges, this thesis focuses on trajectory data modelling and compression, indexing and clustering, and partitioning and distribution. Querying strategies are also studied. More specifically, the three main results of this thesis are:
Model the trajectory as a sequence split by semantic attributes and a spatio-temporal cube. This model takes proximity (locality) preservation and trajectory preservation into account at the same time, which balances flexibility against aggregation and reduces the storage burden through row-wise compression. Depending on the subdivision resolution (the depth of the Octree used for space partitioning), the compression ratio can reach up to 10.
Access the trajectory data by Space-filling Curve. The Space-filling Curve indexing method maps 3D (high-dimensional) indices to 1D (low-dimensional) indices, reconciling the conflict between high dimensionality and high cardinality. An adaptive Octree is used to mitigate the heterogeneity of the trajectory data. In the experiments, the optimal tree depth is 4 or 5. Query optimization (specifically, the range merging technique) is also explored in a preliminary way.
Distribute the data across multiple machines. The distributed deployment yields higher scalability (horizontal expansion of disk, memory and CPU resources, nearly linear in the experiment) and speed-up. The Space-filling Curve distribution strategy gives better load balancing. However, because the distributed database platform used (Greenplum) lacks flexibility, the localization of data and computation (such as local aggregation) is limited.
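To make the second result more concrete, the following is a minimal sketch of how a 3D cell index can be mapped to a 1D key and how key ranges can be merged before querying. It assumes a Morton (Z-order) curve, since the abstract does not state which Space-filling Curve is used, and the names morton3d, merge_ranges and max_gap are hypothetical. The merging rule shown is one plausible reading of range merging, not necessarily the rule used in the thesis.

```python
from typing import List, Tuple


def morton3d(x: int, y: int, t: int, bits: int) -> int:
    """Interleave the low `bits` bits of x, y and t into one Morton key.

    Assumption: x, y, t are non-negative integer cell indices at a fixed
    Octree depth (`bits` equals that depth). Bit i of x lands at position
    3*i, bit i of y at 3*i + 1, and bit i of t at 3*i + 2.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((t >> i) & 1) << (3 * i + 2)
    return code


def merge_ranges(ranges: List[Tuple[int, int]], max_gap: int = 0) -> List[Tuple[int, int]]:
    """Merge inclusive 1D key ranges that touch or lie within `max_gap` keys.

    Fewer, longer ranges mean fewer index scans, at the cost of reading
    some keys that fall outside the original query (when max_gap > 0).
    """
    merged: List[List[int]] = []
    for lo, hi in sorted(ranges):
        if merged and lo - merged[-1][1] <= max_gap + 1:
            # Adjacent, overlapping, or close enough: extend the last range.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [(lo, hi) for lo, hi in merged]


if __name__ == "__main__":
    # Keys of a few cells at depth 4 (a depth the abstract reports as near-optimal).
    for cell in [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (15, 15, 15)]:
        print(cell, morton3d(*cell, bits=4))
    print(merge_ranges([(0, 3), (4, 7), (10, 12)]))
    print(merge_ranges([(0, 3), (4, 7), (10, 12)], max_gap=2))
```

In this sketch, a larger max_gap trades extra scanned keys for fewer range lookups. The same 1D key could also serve as a distribution key, which is the idea behind the third result, although how the thesis configures this in Greenplum is not described here.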