Architecture for An Analytical Database of Objects

Analytical databases are often centered around metrics and measures, with dimensions organized into a rigid multi-dimensional cube structure. But this model struggles when the dimensions themselves need to change or evolve over time.

Architecture for An Analytical Database of Objects
Photo by Marcello Gennari / Unsplash

Analytical databases are often centred around metrics and measures, with dimensions organised into a rigid multi-dimensional cube structure. But this model struggles when the dimensions themselves need to change or evolve over time. To represent rapidly changing dimensions in an OLAP data warehouse, this model needs to be adapted to avoid recomputing entire cubes. The following is a mathematical basis for Sorted-Merge Join in MorselDB from first-principles.


Analytical databases (for OLAP purposes) are commonly centred around metrics (measures), with each metric occupying a cell in a multidimensional hypercube (OLAP cube). In such cubes, a dimension-to-metric projection is not usually associated with any specific object, but instead exists as a self-contained fact. Some databases approximate the object relation with a special dimension for an object identifier; this establishes a relationship between only objects and metrics but not objects and dimensions. Presented here is an architecture that retains most of the qualities of an OLAP DBMS while supporting both first-class object-dimension and object-metric relationships.

1 Introduction

The traditional OLAP cube lays out dimensions along individual axes, with each distinct value of a dimension occupying a discrete point in that dimension's axis and each cell at the intersections of such discrete axis positions holding a number that represents some metric. Multiple metrics are represented by their own cubes, each sharing the same shape. If the metrics' associations with objects are to be recorded, a dedicated axis is created to represent the objects.

This model, being metric-centric, can very well represent relations of metrics but not relations of the dimensions themselves. This is often not a problem since business intelligence use cases are almost always metric-centric, and the dimensions they use are often "final" and rarely need to be queried, or worse, changed.

When the dimensions do need to be changed, though, the cube model is completely unhelpful in applying the change. Thus, changes to dimensions often require total rebuilding of all cubes. Some OLAP implementations (such as, say, ROLAP) alleviate the problem to some extent by sharing the axes of multiple cubes. However, this does not eliminate the problem, as almost all the data is rewritten anyway.

To derive an architecture suited to interdimensional relationships, we must, of course, understand the use cases that would be helped, supported, or enabled by such a design in the context of how they are usually supported in traditional OLAP designs.

2 Mathematical Foundations

By the existing definition of an OLAP cube, each metric cube occupies an Euclidean space where the axes are the dimensions and a metric's position is a Cartesian coordinate composed of each of the dimension's values.

2.1 Addition of a Dimension

From classical geometry, we know that converting an Euclidean space of n dimensions (and Pn points) into a higher-dimensional space by adding a dimension of x points increases the total number of points by Pn × x. This can be particularly problematic for MOLAP databases. In fact, even ROLAP databases — which do not store the full Pn points but instead only those dimension values for which at least one measure exists and thus have considerably less work to do than MOLAP databases — still have to re-project all $n+1$-ary cartesian coordinates of all the metrics they do have.

However, this bleak analysis is naïve and limited to cases where the new dimension is not fully dependent on pre-existing dimensions.

2.2 Addition of a (Fully-)Dependent Dimension

If the new dimension is a function of one-or-more (say, m) of the pre-existing n dimensions (thus, m ≤ n), then the addition of this dimension does not change the Euclidean space of metrics. In fact, this "dimension" can be represented as a completely new space of m dimensions where the cells are the new dimensions values corresponding to the $m$-ary Cartesian coordinates of the dependency dimensions. Metric lookup by this new dimension's values can be translated, using the value's coordinates, into a metric lookup by the original m dimensions.

Essentially, each such subgroup of m independent dimensions form a separate cube shape, and each new dimension dependent on them forms its own cube of said shape. With this comes the realisation that the known methods of efficient storage and retrieval of OLAP cubes may apply just as well to these derived-dimension cubes.

This realisation is the foundation of the architecture presented here. As a result, the other use cases of interdimensional relationships readily collapse under this model.

2.3 Removal of a (Fully-Dependent) Dimension

Following 2.2, since a derived dimension is its own cube, its removal is simply the removal of the corresponding cube.

2.4 Replacement of a (Fully-Dependent) Dimension

Replacement is simply removal combined with addition, and following 2.2, neither the cube being removed nor the cube being added needs to modify the cube of metrics.

3 Object-based Analytical Databases

In OLAP databases that add an "object" dimension to their metric cubes, the values of such dimensions are often merely identifiers used for convenience, bearing little relevance to the values of other dimensions. This is an extreme, though natural, consequence of treating the various dimensions with the same level of importance.

To be precise, some dimensions are clearly relevant only to the measurement itself, and not to the object being measured. For example, time of observation, method of observation, accuracy of observation, etc. The general theme here is clear.

But there can also be dimensions that are more closely associated with the thing being observed, such as the colour of the car or the age/sex of the participant. Lumping these dimensions with those above, a practice often called "denormalisation,"colour of car, age/sex of participant makes it more difficult to derive further dimensions that may depend on only one set.

However, if we can treat these two sets of dimensions as separate groups, we can reap the benefits of 2.2. Say, a group for object-dependent dimensions and a group for metric-dependent dimensions. Adding/removing one of the former should have no interference from the latter group, and vice versa.

Presented below is an architecture that tries to achieve this separation while building on top of traditional OLAP primitives, thus retaining most of the positive qualities of efficient analytical processing already available today, and perhaps even unearthing a few new features made possible by said separation.

3.1 Low-Level Tables

Each subgroup is to get its own "table." These tables need not be different from traditional OLAP tables and can thus be implemented trivially from existing infrastructure. However, as we will see, some restrictions on the data layout may need to be placed to help coordinate them.

3.1.1 Object/Dimension Table

This table shall represent the object-dimension relations, mapping from object identifiers to dimension values associated with each object. In other words, the Primary Key of these tables shall be the object id. Naturally, dimensions that aren't associated with objects solely aren't a good fit here.

3.1.2 Metric Table

This table shall represent the object-metrics relations, but since a measurement may not be uniquely identified by an object alone and may have dimensions of its own (let's call these metric dimensions), this table shall be a mapping from a combination of object identifier and metric-dimension values to metric values. In other words, the Primary Key of these tables shall be a tuple of object id and dimension values.

3.2 High-Level "Tables"

A combined view of all the data may be presented by joining these together. When reading from this view, if the query can be fulfilled from only one of these tables, automatic optimisation can even avoid the join altogether. But of course, the majority of read queries aren't going to be of this kind, so the joining itself has to be efficient.

OLAP implementations have traditionally eschewed JOINs (in the Relational Algebra sense) due to the inherent difficulty of making general-purpose JOINs efficient. Fortunately, we don't have such broad requirements and can instead make do with a single, simple JOIN method if we apply some discipline to the data layout.

3.2.1 Impact on the Design of Low-Level Tables

Since the PK of the object tables is a left subset of the PK of the metric tables, we can avoid the classic pitfalls of inefficient JOINs (specifically, large storage requirements and long lookup times) by demanding that this Most Significant Subset (MSS) of the combined Primary Key of the object and metric tables is always maintained for combinable access methods. Specifically, the following primitives must always hold on to the action of accessing either table:

  1. The keyspace of the MSS should be easily decomposable to each individual table. If a key K1 in the view is to be accessed, it should be possible to easily access all keys from each view whose Most Significant Bit (MSB) equals K.
  2. Keyspace partitioning should be similarly conducive to fairness. If the view's keyspace is broken into a range K1..K2, each individual table's keyspace should also be able to be broken into key ranges whose MSB must be K1..K2.

3.2.2 JOINing data from Low-Level Tables

Since reads from the view can be trivially partitioned along the keyspace into respective reads from each table, joining the data from the tables can be carried out within each partition. Thus eliminating the need for global JOINs allows us to pick any efficient JOIN method. Additionally, if the data from each table is already ordered along the same PK as the view, application of a Sort-Merge Join collapses into a very efficient form of row-by-row Merge Join.


Pritam Baral is a key member of the software engineering team at Clarisights.