Systems and Architecture
Recently Ooyala has launched its new OVP analytics product, IQ, as a standalone product offering. The product is powered by our in-house OLAP system which is built on top of Spark and Cassandra technology. This article discusses the experience of developing the system and some lessons learned from the development of the OLAP engine for IQ.
Ooyala IQ is a bold analytics product. The product requirement challenges the boundaries of a big data analytics query in the following ways:
Requires 360-degree view of the customer’s analytics by showing a constant multi-facet view (video, geo, device, traffic and more) at all times.
This means the product shows daily breakdown, top video, top geo, top devices, top traffic domain in single page for the customer so they always get the stats for their business within single page.
Allows deep analytics by providing interactive slice-and-dice of 360-degree views for date ranges up to a year.
This means the customer can select any filter to apply to view above. E.g The customer can answer questions like "what is my top video played in California, US using IOS from my sports domain" with just a few clicks on the UI instead of waiting for generated report like other analytics systems.
IQ - Business Intelligence page with 2 filters (device and country).
To understand why this is a challenge we have to first understand the scale of data we are talking about and its implication on computation requirements:
Ooyala receives a couple billion analytics events daily and the volume is growing rapidly.
With the events received, we produce close to hundred million rows of multi-dimensional analytics facts (that IQ needed). This translates to a few billion analytics facts yearly for some of our biggest providers.
Allowing flexible slice-and-dice means the data required for specific view cannot be pre-generated (since the number of possible views is astronomical) and we need to compute the data on the fly within the page SLA.
The so-called 360-degree view requires multiple simultaneous queries to request data for different dimensions of the analytics facts.
If we look at some benchmarks performed by AmpLab (creator of Apache Spark), we can pretty much see why this is a ‘Big’ challenge. For example, in the benchmark, it would take 111 seconds for Shark (SQL-on-Spark) to aggregate 100 Million rows of data using 5 m2.4xlarge instances (82 seconds with updated recent benchmark). To enable the level of interaction put forth by the product, this requires loading and aggregating millions of multidimensional analytics facts on the fly, and not just one of them but many of them with a single page view. While the problems of high dimension queries are a well-studied problem in OLAP, the advent of Big Data has already pushed the boundary much further. We knew immediately that to support the product, we need something at the cutting edge of the Big Data field.
We started the project by evaluating our options. This included looking at some MPP databases (GreenPlum, Vertica), open-source analytics query engines (e.g. Drill, Druid, Impala, Shark), and cloud-based analytics infrastructure offering includes AWS Redshift or Google BigQuery. There were so many different new technologies around Big Data queries, that it was as if everyone was trying to solve the same problem. After reviewing a lot of benchmarks and case studies, we decided in the end that most of them are either immature, uneconomical, or unsuitable for our long term analytics product vision (which would have even higher requirements). We decided to build our own solution based on Spark and Cassandra.
The reason Cassandra was chosen was because we had been a Cassandra powerhouse for quite some time, and are confident with the performance of Cassandra. We also made the assumption that we cannot have data locality, so Cassandra would be a good choice for the data storage as Cassandra has built-in caching and this would help in performance. Apache Spark, on the other hand, is a fast and flexible distributed data processing framework that is very promising. While we had a feeling that our use of spark was non-typical (it is proven not, in many ways), we believed with some talent and persistence we had a chance of building just what we needed and more for the product.
Our journey of building the OLAP platform obviously didn’t end at selecting the technologies. We spent a significant amount of time thinking about how to support low latency high dimensional queries with high QPS (query per second). The first thing we focused on was an efficient representation of OLAP data-cube for efficient storage, filtering and processing. We opted for columnar-based datacube structure due to its efficiency and versatility. At that moment, parquet was still in its infancy and was strongly coupled with HDFS and Hadoop. We coded everything from the ground up in Scala with [Saddle](https://saddle.github.io/) library for primitive data type support. Later on, we also optimized few metric structure including hyperloglog and histogram when we discovered they occupied more than half of our data storage. While we would not vouch that our columnar data-cube technology is ground-breaking, it is first of its kind in terms of ‘columnar over column-oriented-store’. By storing the columnar data using C*, we directly gain column storage separation and column level caching for query performance. By processing data in columnar fashion using Apache Spark, we also achieve great performance result. It has been benchmarked that we are conservatively 2x faster than Spark-SQL with simple n-gram(3) count even on spark 2.0-snapshot (March snapshot) where Spark SQL are significantly improved.
Benchmark of in memory word (n-gram) count vs Spark SQL 2.0
Spark is a wonderful technology built on top of great ideas. The concepts of lazy transformation, work optimization and distribution, parallelism control, and cacheable datasets all make distributed data processing much easier. We had few challenges, though, in our non-typical use of spark for data processing:
We are running a long-running (say months) and highly-concurrent spark context as a query engine, which is less typical than batch processing spark workflow and subject to more Spark issues including a race condition that we discovered with a highly concurrent load
We are actively managing core/parallelism as we need to maintain low latency while sustaining high QPS. High parallelism means a higher scheduling cost and potential (virtual) core starvation for concurrent requests, while with low parallelism the query latency will suffer. So we have to do query cost estimation to manage parallelism for each query.
We need custom HA with multiple drivers. As we know, spark drivers are not HA, so we need to come up with some novel way so that we are not wastefully duplicating data cache over multiple drivers (load-sharding with balance degradation).
There’s more to spark related considerations, but this blog post is not intended to be a technical deep dive into the details. In general, we are very happy with our choice on spark for the capability that it provides, as well as an active spark community that provides many solutions to our problems.
There were two schools of ideal during the lifetime of the project:
Building the fastest query engine that processed all the raw analytics facts as quick as possible.
Building a query engine that is adequate enough, but trying to minimize the data we need to process through other means (semi-pre-aggregation).
At the initial phase we were adopting approach #1 and tried to go against the law of physics (or pretended we have infinite computational resources at our expense). As much as we made great progress in terms of technology breakthrough, it was simply not enough. The approach hit the walls pretty soon (but not soon enough) when we realized that we were going to have a bad page response time, in terms of minutes. Fortunately, we turned course to approach #2. While we had to do much more work than with approach #1, it saved us from the impossibility of launching the product. Today we use shades of semi-pre-aggregated data to fulfill the same requirements for the product. Each query still requires a lot of processing using spark but is with more manageable data volume. The main moral here is we should never bet the success of the product on technology breakthrough that hasn't been materialized. The risk factor is simply too high. A nice corollary is that a good technical implementation is one that makes the product successful - even if it involves changes in approach. This is what being 'agile' about anyway.
The OLAP engine has been running in production for more than a year now. IQ has been a success due to the flexibility provided by the OLAP engine. The product allows the customer to visualize multidimensional data faster and gain actionable insights to their business at the speed of thought. We are still constantly seeking better performance as the product become a standalone product offering and has third-party players and devices supports through its JSON API. For the future, we will be still working on technology stack upgrades (Cassandra driver and Apache Spark version), architectural changes (multiple smaller Jobserver/Apache Spark clusters), and algorithmic improvements on columnar processing.
In this blog, we went through our experience of building OLAP engine for a unique analytics product, IQ. We explained our architectural choice, the technical challenges we went through using Apache Spark, and a lesson learned about being pragmatic. We think that this is not only an achievement for the company but also a great milestone for open source technology, especially Apache Spark, to tackle not just in-house analytics but also customer facing applications. We’ll be excited to share more in future so stay tuned!