Buy or Build?: An Oral History of MorselDB (Part 1)
In this series of posts, the team at Clarisights tells you the story behind the making of MorselDB and how it will power the next generation of Clarisights customers.
Clarisights empowers performance marketing teams at some of the world’s most innovative companies to take meaningful decisions. We do this by democratising data through our revolutionary no-code BI reporting platform. Sitting at the heart of the Clarisights platform is MorselDB. MorselDB is the proprietary database technology that enables our customers to effortlessly turn billions of data points into infinitely customisable reports and actionable insights.
Clarisights Journal (CJ): When did the MorselDB project start at Clarisights? What happened to make the Engineering team think that they needed a new analytical database? Arun?
Arun Srinivasan (CEO): There were two parts to that. First, Ashu kept telling me that we needed a new database. We always knew this. We knew what we had was not going to work in the long term. But the big question was: What was going to replace our existing database? That was the part we weren’t really clear about.
Ashu Pachauri (CTO): Let me add some historical perspective to this. The day I joined Clarisights, I think on that day itself, I discussed our databases with Ankur. I told him that our existing database, MongoDB, was not built for large-scale analytics.
But until that point it made sense to use MongoDB because we were dealing with so much free-form data. Customers keep creating dimensions, and they keep deleting dimensions. On other databases you have to do a lot of modelling to enable this. So I understood why Clarisights had chosen to go with MongoDB initially. But it quickly became clear that it would never give us the latency and throughput we needed.
Arun: Plus there was an incident with Delivery Hero (DH), who was our first large enterprise customer. As always we carried out a Proof of Concept period with them. And they loved the product and then came on board as a full paying customer. They started using it internally with a lot of enthusiasm.
One day I got a call from DH from Dubai. Delivery Hero had organized an annual company meeting in Dubai. This was where all the business heads across the DH group would come together for meetings and discussions. And our Customer Team at DH wanted to demo Clarisights at this huge meeting. Unfortunately, our platform was very slow and then it just crashed. It was embarrassing for them and therefore embarrassing for us.
They pinged me on WhatsApp and said, ‘This was a real embarrassment. It didn't work. And so please go and do something to fix it so that this doesn't happen again.’ And that was when the alarm bells really started ringing.
Arun: And there were those Tuesday problems! Every Tuesday the entire team at Delivery Hero used to have their weekly meetings. Unlike many other products in this space, we don’t sell user licenses or seats. Clarisights customers can create as many users as they want. This is a core part of our pricing and product philosophy.
On Tuesday all these users would load their Clarisights dashboards simultaneously. And of course, we were not really prepared for this, even though we had done a lot of homework to prepare for this kind of intense use. For instance, we added a lot of dimension-specific indexes. At one point Ankur used to use Elasticsearch to identify which dimensional queries were slow and then add the matching indexes manually to the database. Later Ashu wrote a script to automatically find relevant indexes to add. Then we did even more debugging. Initially, we used to think that if we added the indexes, then it would just work on its own. But then we discovered that MongoDB could do some silly things.
Ashu: Let me give you a little technical perspective on this. So basically what MongoDB does is it actually runs any query that you throw at it, even to figure out what index to use. It actually runs the query first. It runs and tries to get the first 100 or so records from the database to figure out which index is going to be faster. So what it does is… let's say there are 10 indexes. MongoDB will actually run 10 parallel queries trying to use all these 10 indexes.
And it'll try to find out which one of these gives results the fastest. So you see the problem, right? If it's already a slow query, you are basically just throwing this unnecessary load on the system, just to figure out what index to use. And the additional challenge for us is that our queries sometimes end up being very selective. So just to get those 100 records from MongoDB, it sometimes takes non-trivial amounts of time. And even then it would come up with the wrong index. You remember this, Ankur?
Pritam Baral (Engineering Lead): Let me explain for a non-technical audience. The method Mongo used to select an index was like picking the tallest kid to play basketball. "Hey, you're tall. You should play basketball well, yeah?" That is stupid, especially in databases.
Ashu: On top of that, Mongo would measure the heights of every child every single time it had to pick someone to play basketball. Even if you gave MongoDB a list of all the kids with their heights, Mongo would say no thanks, I am going to measure everyone all over again. To be fair, MongoDB has a cache to prevent itself from doing this all over again. However, sizing this cache gracefully for the kind of scale that Clarisights functions at turned out to be a big challenge.
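The behaviour Ashu and Pritam describe can be illustrated with a small Python sketch. This is a toy model written for this article, not MongoDB's actual planner code: the class names, index names and work figures are all hypothetical, and the trial is simplified to "every candidate advances in lockstep until the fastest one has returned its first batch of records". It shows two things from the conversation above: the trial work grows with the number of candidate indexes, and the index that is quickest to return the first records is not necessarily the cheapest for the whole query. The cache only helps once a query shape has been seen before.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CandidatePlan:
    index_name: str
    work_per_early_result: int  # hypothetical work units per record during the trial
    full_query_work: int        # hypothetical work units to finish the whole query


@dataclass
class ToyPlanner:
    trial_results: int = 100                   # records a plan must return to win the trial
    cache: dict = field(default_factory=dict)  # query shape -> winning index name
    trial_work_spent: int = 0                  # overhead spent on trials, not on real answers

    def choose_index(self, query_shape: str, candidates: list) -> str:
        # A cached winner skips the trial entirely.
        if query_shape in self.cache:
            return self.cache[query_shape]
        # The winner is whichever plan returns its first records with the least work.
        winner = min(candidates, key=lambda p: p.work_per_early_result)
        # All candidates run side by side until the winner finishes its trial,
        # so the overhead is multiplied by the number of candidate indexes.
        work_until_winner_done = winner.work_per_early_result * self.trial_results
        self.trial_work_spent += work_until_winner_done * len(candidates)
        self.cache[query_shape] = winner.index_name
        return winner.index_name


# Hypothetical indexes: "by_date" starts fast but is expensive overall.
candidates = [
    CandidatePlan("by_date", work_per_early_result=2, full_query_work=900_000),
    CandidatePlan("by_campaign", work_per_early_result=5, full_query_work=40_000),
]
planner = ToyPlanner()
chosen = planner.choose_index("spend_by_campaign", candidates)
best_overall = min(candidates, key=lambda p: p.full_query_work).index_name
print(f"trial picked {chosen}, cheapest overall would be {best_overall}")
```

In this made-up example the trial picks `by_date` because it is the "tallest kid", even though `by_campaign` would finish the full query with far less work. Calling `choose_index` again with the same query shape returns the cached answer without spending more trial work, which is why sizing that cache matters so much in practice.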
Arun: There was another factor as well. I suddenly noticed that our server costs were just going through the roof. Something in the range of USD 14,000-15,000 a month. This was a lot of money for us. And this was long before we raised our Series A round. Back in those days we used to depend on Google Cloud credits. Google used to give away free credits of 100,000 dollars if you were a new company. This meant that there was no incentive to keep an eye on server costs.
Meanwhile, we were still waiting for investments to come in. And then came the perfect storm. Customers wanted more dependability. Of course, for us this is non-negotiable. We do whatever it takes for our customers. So Ashu and Ankur kept adding servers and indexes. And suddenly our server costs tripled. It was panic stations for us. And all this without a commensurate increase in revenues.
The final nail in this coffin was that despite all these efforts our customers weren’t happier. To this day I am grateful to Delivery Hero for trusting in us. Today we are looking to double our customers by the end of this year, and we have happy investors and an aggressive recruitment plan. But in those days… the fact they kept trusting us to get better was just amazing.
Ashu: By this point we had to solve our database problem as soon as possible. Immediately we had the option of picking something else off the shelf. We went through several databases. There was a long list. But this raised another huge challenge we had to solve: How do you really identify if something is actually going to work or not, or whether you need to build something of your own?
Build or buy? How do you model that choice? I can break that decision down into three parts.
The databases that we are talking about fall into three categories: OLAP databases, OLTP databases and, a relatively new category, HTAP databases, which are in many ways a combination of OLAP and OLTP.
So OLAP databases are primarily used for large scales of data where you are just trying to do analytics. The kind of analytics we are doing, but you don't do a lot of updates to the data. And that was not our use case. We do a lot of updates to the data.
Ankur (Co-founder and Engineering Lead): So then that became a proposition. A metric for us to choose. We needed a database that was farther along the read-write spectrum of databases.
Ashu: Exactly. And that's where the other spectrum of databases lies, which is the OLTP type of databases. They excel at updates and making pointed queries. But if you want to do large scale analytics, they can't really scale.
Which leaves us with the third category of databases. Which is the HTAP database. HTAP tries to combine both of these use cases. And that's where our spectrum lies. And that's where we were trying to find solutions for our problem.
Ankur: I just want to point out that HTAP is still pretty new. Really cutting edge. But back in 2019 it was a totally fresh field.
Ashu: A lot of it is still in research. There were some databases that had come out back then. That claimed to be HTAP. And they did a decent job for some use cases. But we found lots of limitations in these HTAP databases for our use.
CJ: What were some of these HTAP databases you looked at?
Ashu: First of all there is Apache Kudu. Kudu is quite popular in the big data ecosystem for doing this kind of workload. But Kudu has limitations. First, there is a soft limit on the number of columns that you can create. At Clarisights we end up having several hundreds, even thousands of columns just for a single table. For many customers we end up creating close to 2,500 metrics per data source. And then around 250 dimensions. And Kudu had never been used anywhere close to this scale. We still did benchmark tests on Kudu and it didn’t meet our standards at all. We also looked at Apache Hudi. Both Kudu and Hudi had limitations that kept them from scaling to the complexity and demands of our system.
An example of a limitation of Hudi is that it only supports traditional file formats like Parquet and ORC, which are not built to support a true and efficient primary key concept. And an efficient primary key is essential for a large scale update workload, as well as truly uniform workload distribution in analytics queries. Most of these file formats suffer from functional and performance limitations that have arisen out of limitations within the Hadoop ecosystem.
On the other side of the spectrum in the HTAP category are commercial databases. Snowflake is probably the most prominent there. There is also Firebolt, which is newer. I think both of these are quite close to the design we eventually came up with. But then we only have a limited view into the internals of these databases. They are both proprietary.
CJ: Did you consider using Snowflake?
Ashu: Pritam carried out a lot of research into Snowflake to see if we could bend it to our requirements. But eventually we realised that Snowflake’s closed architecture just made that kind of customisation impossible.
For example, our data comes from a lot of different data sources. We wanted to have separate controls on the velocity at which we ingest data for these different sources, and the velocity at which this data becomes available for analytics. There are tons of other customisations and efficiency gains we needed on top of a general-purpose database like Snowflake. But we had so little control over what happened inside Snowflake that we couldn’t get that level of granular control. We were left with just one choice: build our own database.
Arun: It was a very scary choice because… this is almost company-breaking. Because the current system was not solving the problem that we had for our customers. This was a high-risk solution to a high-stakes problem. But if we didn't solve this problem, there would be no Clarisights. Plus of course server costs were so high, we were bleeding money.
Pritam: And I can understand why Arun thought that this was a major cost because that's what I was told. Then one day I came in and asked: "Let's get those actual numbers. How much is this contributing? Even if you could completely wipe out Mongo and the new database cost… and had nothing to run, how much would we really save?"
And then we ran that analysis and we found cost centres in so many other places that had nothing to do with the database. So that's when we became aware that, okay, yeah there are other sections that we can focus on and get some easy wins.
But the main challenge remained. It was time to build a new database.
CJ: When you decide to build a new database, where do you start?
Pritam: You start with the use case. You always start with the use case. And that is where we started. What is the use case for this new database? As we just discussed there are so many choices when it comes to databases. So many nuances to be aware of, and so many trade-offs that you can make.
This means that you start not with the solution but with the real problem. Which means really having to understand the use cases. What do you need out of this thing? And then you design solutions for that problem statement. If there exists a tool that fits that solution already? Fantastic! Then you really don’t want to design a new database because that's actually a lot of effort. So understanding the use cases is what I first did. I came down to Bangalore and talked to the Customer Success team. I tried to make sense of the product, make sense of the actual use cases on the database. I think we spent close to a month just understanding use cases and listing them down. And then I went back and did my research.
CJ: Can you expand on this research process a little bit?
Pritam: So first, like I said, I started by collecting all this information on use cases and customer requirements. And used this information to construct a non-technical problem statement. This is really starting from first principles. And then I began translating this non-technical problem statement into mathematical terms. How do you do this?
As Ashu and Ankur already said, different databases are good at different things. Some are good at writing new data but not very good at rewriting old data. So one of my first mathematical questions was: What portion of our data are we going to rewrite? And in how much time? And how does that affect the read quality? This then became a question of trade-offs. The first objective of my mathematical analysis then was to figure out how to structure the database problem in such a way that we minimised the right database costs but ignored the costs that didn’t really matter to us.
For Clarisights this trade-off was between the database’s ability to crunch numbers, which it had to do at tremendous scale, and the speed of data updates, which wasn’t so critical. It became clear that we could afford for the data to update a little bit slowly. Plus we didn’t really need very high levels of atomicity. Clarisights wasn’t really suffering from MongoDB’s problems with atomicity. This is when I came to realise that in many ways Clarisights had already built a custom database. We weren’t using MongoDB as a database per se, but as part of a database. In addition to MongoDB this database also included lots of functions built on top in Ruby, using Postgres in some cases, Redis in other cases and so on.
Altogether Clarisights had configured a proper second level database and analytical database. The problem then became a bit simpler. We didn’t really have to reinvent the wheel. Instead we had to take all these trade-offs and improvisations and build them all into a single cohesive unit. A single cohesive box. Eventually MorselDB is what became that box.
CJ: That is still a few months down the road right? At this point you have the mathematical formulation in place. And you have a sense of what the solution should look like. What was the next step?
Pritam: In some sense this was the easy part of the story. The next part was the really hard bit. It was a little bit like the early stages of the Manhattan Project. We all knew what we had to do. We knew where we wanted to reach. But how were we going to get there? The more important question, I guess, was: Could this even be done? Even if it was mathematically feasible, was it technologically feasible for us to pull off?
Next Time: In Part Two of The MorselDB Story we look at the engineering challenges that had to be overcome to build the new database. Meanwhile, in the midst of this big, scary engineering project, Arun had new business challenges to deal with.