Why We Chose Real-Time Joins Over Flat Files (The MorselDB Story)
Clarisights Engineer Anmol Arora takes us through a key change in MorselDB

When I joined Clarisights, I learned one of the most valuable lessons of my engineering career: sometimes your fundamental assumptions are completely wrong, and the only way forward is to prove yourself wrong. Read on to learn about MorselDB’s Equi-Join ("Fused") implementation.
Before I dive into our technical journey, let me explain what performance marketers are actually trying to do. Imagine you're running digital campaigns across Google, Facebook, TikTok, and other platforms. You're spending millions of dollars, and you need to answer questions like:
- Which campaigns are profitable?
- How much did I spend to acquire each customer?
- Which creative variants perform best in which geographies?
- Is my ROI improving or declining week-over-week?
On the surface, these are simple questions. But appearances can be deceiving...
Each platform reports data differently. Facebook might tell you about clicks and costs. Google Analytics tells you about conversions. Your CRM knows the actual revenue. To answer even basic questions, you need to join data from all these sources.
But here's where it gets complicated. Marketing data involves roughly 250 dimensions (campaign hierarchies, geographies, device types, audience segments) and 2,500 metrics (costs, clicks, conversions across different attribution windows). Marketers need to slice and dice across any combination of these. They might ask: "Show me mobile conversion rates for video ads in Germany, broken down by age group, compared to last month."
That's the fundamental reporting challenge we're solving at Clarisights. Now let me tell you how we almost got this completely wrong.
Flat Files Seemed Like a Good Idea...
Like many analytics companies, we started with what seemed like an obvious truth: joins are expensive. Too expensive for real-time analytics. So we did what made sense at the time - we stored everything as denormalized documents (basically flat files) in MongoDB.
Picture this: for each data point, we'd store all dimensions (campaign names, geographies, device types - about 250 of them) alongside all metrics (costs, clicks, conversions across different attribution windows - around 2,500 of them). No joins needed…
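To make that concrete, here's a minimal sketch of what one of those denormalized documents looked like conceptually. The field names are invented for illustration; the real documents carried roughly 250 dimensions and 2,500 metrics.

```python
# Hypothetical sketch of the old denormalized ("flat file") layout.
# Every document carried every dimension AND every metric for one data point.
flat_document = {
    # ~250 dimension fields - rarely change
    "campaign_name": "Summer Sale 2023",
    "country": "Germany",
    "device_type": "mobile",
    # ... ~247 more dimension fields ...

    # ~2,500 metric fields - restated constantly as attribution settles
    "cost": 50_000.0,
    "clicks": 12_345,
    "conversions_7d": 812,
    "conversions_30d": 1_004,
    # ... ~2,496 more metric fields ...
}
```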
Except we were very wrong.
What we didn't fully appreciate was that metrics data changes constantly, while dimensions rarely do. Marketers aren't updating campaigns every second, but attribution data keeps shifting. In other words, metrics and dimensions change at very different rates.
Let me explain attribution for those who haven't dealt with marketing data. Attribution is how marketers figure out which ad deserves credit for a conversion. Here's a typical user journey:
- Monday: User sees a Facebook ad, doesn't click
- Wednesday: User searches on Google, clicks a search ad
- Friday: User finally makes a purchase
Which ad gets credit for the sale? At Clarisights, we support last-click attribution - the Google ad (last click before purchase) gets the credit. But here's the catch: that purchase on Friday might not show up in Google's data until Saturday, or even a week later. Why? Because:
- The user might return the product, changing the conversion value
- Payment processing might take time to confirm
- The platform needs time to match the conversion to the original click
- Attribution windows can extend up to 90 days
So data you thought was "final" on Monday might get updated on Tuesday, Thursday, or even weeks later as more conversions trickle in.
With our flat-file approach, every time a single metric updated we had to rewrite the entire document, which meant rewriting the dimensions even though they hadn't changed. The same happened when a dimension changed: we'd rewrite all the metrics too. The write amplification was crushing us.
Were Joins Actually Too Expensive?
Why were we so convinced joins were expensive? It wasn't just a random assumption. When you think about join algorithms, most of them have serious limitations at scale:
Hash joins seem great - create a hash table from one dataset, probe it with another. But that only works well when one table is small enough to fit in memory. When you're joining billions of metric rows with hundreds of millions of dimension rows, where's that hash table going to live?
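To make the memory problem concrete, here's a toy hash join in Python (purely illustrative; the names and shapes are invented). The entire build side has to sit in a hash table before you can probe it:

```python
# Toy hash join: build a hash table from one side, probe with the other.
# Fine when the build side fits in memory; with hundreds of millions of
# dimension rows, the hash table itself becomes the scaling problem.
def hash_join(dimension_rows, metric_rows, key):
    build = {d[key]: d for d in dimension_rows}   # whole build side in memory
    for m in metric_rows:                         # stream the probe side
        d = build.get(m[key])
        if d is not None:
            yield {**d, **m}
```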
Nested loop joins are even worse. With O(n×m) complexity, queries would literally take days with our data volumes.
Sort-merge joins looked interesting, but sorting terabytes of data at query time would kill performance.
So we stuck with our flat files, accepting the write amplification as a necessary evil - even as MongoDB's aggregation framework slowed to a crawl trying to process millions of these massive documents.
To ClickHouse!
Most of our performance improvements actually came from switching our workload from MongoDB to ClickHouse. MongoDB simply wasn’t the right fit for analytics at our scale, and moving to ClickHouse gave us the foundation for much faster queries and more predictable performance. However, once we had decided to make that transition, joins became the main architectural challenge we needed to solve. Everything changed when another engineer, Pritam Baral, presented a working design showing that joins could be inexpensive - if you store the data correctly.
The key insight came from understanding how marketers actually work. They're not day traders making split-second decisions. They're analysing trends, testing hypotheses, optimising campaigns. A typical workflow might be:
- Morning: Check yesterday's performance across all campaigns
- Identify underperforming segments
- Deep dive into specific geographies or demographics
- Adjust budgets and targeting for today
- Repeat tomorrow with fresh data
This workflow has a natural rhythm. The SLA we provide is simple: "Today you'll see yesterday's data." We can afford a couple of hours of delay because marketers are looking at daily, weekly, monthly trends, not real-time ticks.
This delay isn't a limitation - it's an opportunity. Here's how we redesigned our system:
Instead of storing everything together, we separated dimensions (which rarely change) from metrics (which update constantly). This immediately reduced our write amplification - updating a metric no longer meant rewriting all 250 dimensions.
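Conceptually, the split looks something like this (names invented for illustration): two tables that share a join key, so a metric update touches only the narrow metric row.

```python
# Hypothetical sketch of the split layout. Dimensions and metrics live in
# separate tables that share a join key.
dimension_row = {
    "join_key": ("campaign_123", "DE", "mobile"),
    "campaign_name": "Summer Sale 2023",
    "country": "Germany",
    "device_type": "mobile",
    # ... the other rarely-changing dimension columns ...
}

metric_row = {
    "join_key": ("campaign_123", "DE", "mobile"),
    "date": "2023-07-14",
    "cost": 50_000.0,
    "clicks": 12_345,
    "conversions_30d": 1_004,
    # ... the other frequently-restated metric columns ...
}
# Updating conversions_30d now rewrites only this small metric row,
# not 250 dimension columns.
```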
But now we needed to join them at query time. This is where sorted-merge joins come in.
The beauty of sort-merge joins is their simplicity. If both datasets are already sorted by the join key, you just walk through them together. One cursor here, one cursor there. If they match, join. If not, advance the lagging cursor. It's a simple linear scan - no massive hash tables, no quadratic complexity.
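Here's roughly what that two-cursor walk looks like as a Python sketch. This is just to show the shape of the algorithm, not how MorselDB implements it; it assumes dimension keys are unique, while metric keys may repeat.

```python
# Two-cursor sort-merge join. Both inputs must already be sorted by `key`.
# Assumes dimension keys are unique; metric keys may repeat (many metric
# rows can share one dimension row).
def sort_merge_join(metrics, dims, key):
    i, j = 0, 0
    while i < len(metrics) and j < len(dims):
        mk, dk = metrics[i][key], dims[j][key]
        if mk == dk:
            yield {**dims[j], **metrics[i]}
            i += 1          # next metric row may share the same dimension row
        elif mk < dk:
            i += 1          # metric row with no matching dimension
        else:
            j += 1          # advance the lagging (dimension) cursor
```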
The problem with sort-merge joins is the sorting cost. But remember that delay we can afford? We use it. Here's our approach:
- Write-Ahead Log: When data arrives, we dump it as fast as possible into a log. No sorting, just pure write speed.
- Post-Processing: During our processing window, we sort this data and write it to its table in sorted order. Both tables always remain sorted by their join keys.
- Query Time: When a query comes in, both datasets are already sorted. The join is just a linear two-cursor merge — beautiful in its efficiency.
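In very rough Python, the ingest side of that pipeline looks like the sketch below - a simplification with invented names, just to show where the sorting cost gets paid:

```python
import heapq

# Rough sketch of the ingest path: append fast now, sort during post-processing.
write_ahead_log = []      # unsorted, append-only: pure write speed
sorted_metric_table = []  # always kept sorted by the join key

def ingest(row):
    write_ahead_log.append(row)    # 1. write-ahead log: no sorting at write time

def post_process():
    # 2. During the processing window: sort the new rows, then merge them into
    #    the table so it stays sorted (in reality, a batched rewrite of parts).
    global write_ahead_log, sorted_metric_table
    batch = sorted(write_ahead_log, key=lambda r: r["join_key"])
    write_ahead_log = []
    sorted_metric_table = list(
        heapq.merge(sorted_metric_table, batch, key=lambda r: r["join_key"])
    )

# 3. At query time both tables are already sorted, so the join is the
#    two-cursor merge shown earlier.
```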
Why Two Different Joins
One thing that often confuses people about our architecture is that we actually perform two different types of joins at different times. Understanding this distinction is crucial to understanding why sorted-merge joins work so well for us.
Write-Time Joins: When conversion data arrives from Google Analytics, we need to figure out which ad campaign should get credit. This requires joining the conversion with our ad interaction data based on last-click attribution. This is complex logic - matching identifiers, resolving dimensions, applying attribution mappings. But it happens during data ingestion, not when marketers run reports.
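As a heavily simplified illustration (real attribution also matches identifiers across platforms and applies per-platform mapping rules), the core of last-click attribution is just "credit the most recent click before the purchase":

```python
# Heavily simplified write-time attribution: credit the conversion to the
# last ad click that happened before the purchase.
def attribute_last_click(conversion, interactions):
    prior_clicks = [
        i for i in interactions
        if i["type"] == "click" and i["timestamp"] <= conversion["purchased_at"]
    ]
    if not prior_clicks:
        return None                     # nothing to credit
    credited = max(prior_clicks, key=lambda i: i["timestamp"])
    return credited["campaign_id"]      # the conversion is then written
                                        # against this campaign's dimensions
```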
Read-Time Joins: This is where sorted-merge shines. When a marketer runs a report asking for "cost per acquisition by campaign by country," we need to join:
- Metric data (costs: $50,000, conversions: 1,000)
- Dimension data (campaign name: "Summer Sale 2023", country: "Germany")
Both tables are massive, both are pre-sorted by the same keys, and the join is a simple merge operation. No complex attribution logic, just matching keys and combining data.
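Conceptually, that report is just the merge sketch from earlier plus an aggregation - something like the following, with invented column names, reusing the `sort_merge_join` sketch above:

```python
# Conceptual "cost per acquisition by campaign by country", built on the
# sort_merge_join sketch above. Both inputs are pre-sorted by "join_key".
def cost_per_acquisition(metric_rows, dimension_rows):
    totals = {}
    for row in sort_merge_join(metric_rows, dimension_rows, key="join_key"):
        group = (row["campaign_name"], row["country"])
        cost, conversions = totals.get(group, (0.0, 0))
        totals[group] = (cost + row["cost"], conversions + row["conversions_30d"])
    return {
        group: (cost / conversions if conversions else None)
        for group, (cost, conversions) in totals.items()
    }
```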
Building on ClickHouse
We built MorselDB on top of ClickHouse, but with significant modifications. ClickHouse is already fast, but it was designed for different patterns than ours.
One key difference: when ClickHouse inserts data, it makes it immediately visible by keeping it in memory. We don't do that. Remember, we can afford delay. Our data goes through that post-processing stage before becoming visible. This trade-off lets us batch update our tables while maintaining sort order for faster query performance.
The Results
The impact was dramatic:
- Query performance improved by orders of magnitude
- Resource costs (CPU, memory, I/O) became predictable and manageable
But the real victory was proving our initial assumption wrong. Joins aren't inherently expensive. They're expensive when you approach them wrong. By understanding our domain - marketing analytics with its natural delays and update patterns - we could design a system that turned conventional database wisdom on its head.
For our customers, this meant they could finally run complex queries across billions of rows without waiting minutes for results. A marketing manager could ask "Show me CAC (Customer Acquisition Cost - how much we spent to acquire each customer) by creative type by country for the last 90 days" and get an answer in seconds, not minutes. They could iterate quickly, test hypotheses, and make data-driven decisions at the pace modern marketing demands.
Looking back, the journey to MorselDB taught me several things:
- Question fundamental assumptions: We operated for years under the belief that joins were too expensive. It took Pritam's proof to show us otherwise.
- Understand your domain deeply: The insight about using natural delays in marketing data was crucial - and it only works because of how our domain behaves; the same trade-off won't fit every system.
- Simple algorithms can be powerful: Sort-merge joins aren't sophisticated. They're arguably the simplest join algorithm. But implemented correctly, they solved our biggest challenge.
- Write amplification matters: In analytical systems, people often focus on read performance. But when your data constantly updates, write efficiency becomes crucial.
MorselDB continues to evolve. As marketing platforms change their attribution models and privacy regulations reshape data collection, we need to keep adapting. But the foundation - efficient joins through careful data organisation - remains solid.
For engineers considering joining us, this is what makes the work exciting. You're not just optimising abstract algorithms. You're solving real problems that affect how companies analyse billions in marketing spend. And sometimes, solving those problems means questioning everything you thought you knew about database design.
Interested in building systems that challenge conventional database wisdom? Clarisights is hiring engineers who enjoy solving complex data problems at scale. Visit our careers page to learn more.