Choosing the right data warehouse solution

Tools are generally purpose-built to address specific challenges. Recognizing the problem space and selecting suitable tools are therefore pivotal steps in simplifying the solution. In this article, we'll delve into the critical factors to consider when selecting data warehousing tools for analytics solutions.

Data freshness

Understanding the problem starts with knowing how quickly data needs to be updated. We can break this down into three main groups:

  • Real-time insights: Data must be available almost instantly.
  • Near real-time insights: A delay of a few seconds is acceptable.
  • Delayed insights: It's fine if data takes minutes, hours, or even days to update.

Let's look at some examples:

During big sales events, online stores want to offer discounts to shoppers quickly, based on real-time demand patterns. Any delay in showing the right discount to the customer at the right time could mean losing a sale. Streaming engines like Flink and Spark can process data as it arrives, making them perfect for these real-time tasks.
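To make this concrete, here's a minimal sketch of how such a demand signal could be computed with Spark Structured Streaming. It assumes a Kafka topic of product-view events (the broker address, topic name, and JSON payload are all hypothetical) and that the Spark–Kafka connector package is available; output goes to the console rather than a real pricing service.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("realtime-demand").getOrCreate()

# Assumed payload: JSON events carrying a product_id and an event_time.
schema = StructType([
    StructField("product_id", StringType()),
    StructField("event_time", TimestampType()),
])

views = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
    .option("subscribe", "product_views")                # hypothetical topic
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Count views per product over a sliding one-minute window as a crude demand signal.
demand = (
    views.withWatermark("event_time", "2 minutes")
    .groupBy(F.window("event_time", "1 minute", "30 seconds"), "product_id")
    .count()
)

# In practice this would feed a pricing/discount service; the console keeps the sketch simple.
query = demand.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```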

Next, let's talk about delivery apps. They need to quickly assign orders to nearby delivery agents. While a small delay of a few seconds is okay here, long delays hurt service quality. Real-time databases like ClickHouse and Apache Pinot are great for these near real-time tasks.
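As an illustration, a near real-time lookup against ClickHouse might look something like the sketch below. The agent_locations table, its columns, and the connection details are assumptions; the query simply ranks agents that have reported a location in the last 30 seconds by distance to the order.

```python
import clickhouse_connect

# Hypothetical connection details and table layout.
client = clickhouse_connect.get_client(host="localhost", port=8123)

order_lat, order_lon = 12.9716, 77.5946  # pickup location of the incoming order

result = client.query(
    """
    SELECT agent_id,
           geoDistance(lon, lat, %(lon)s, %(lat)s) AS distance_m
    FROM agent_locations
    WHERE updated_at > now() - INTERVAL 30 SECOND
    ORDER BY distance_m ASC
    LIMIT 5
    """,
    parameters={"lat": order_lat, "lon": order_lon},
)

for agent_id, distance_m in result.result_rows:
    print(agent_id, round(distance_m), "metres away")
```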

Finally, think about financial companies that have to analyse vast amounts of transaction data to detect patterns, trends, and anomalies. In this case, waiting hours or even days for updated data is acceptable, since the analysis is retrospective. Both real-time databases like ClickHouse and data warehouse solutions like Athena or Redshift can handle these tasks, depending on the volume of historical data.
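For the retrospective case, a batch query submitted to Athena through boto3 could look like the following sketch. The finance database, the partitioned transactions table, and the S3 results bucket are all hypothetical.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Per-merchant summary over a quarter of data; dt is an assumed partition column.
query = """
SELECT merchant_id,
       count(*)                        AS txn_count,
       avg(amount)                     AS avg_amount,
       approx_percentile(amount, 0.99) AS p99_amount
FROM transactions
WHERE dt BETWEEN '2024-01-01' AND '2024-03-31'
GROUP BY merchant_id
ORDER BY p99_amount DESC
LIMIT 100
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "finance"},                       # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},   # hypothetical bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes; hours-old data is fine for this kind of look-back analysis.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
```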

It's also important to recognise a common pattern: the greater the need for up-to-date data, the less the system depends on older data. For example, a delivery app doesn't need historical data to assign a nearby agent quickly, but a financial analysis needs months or even years of data to surface trends properly. The aforementioned tools align with this pattern and excel within their designated problem space: streaming engines efficiently handle small windows of historical context, real-time databases manage longer periods, and data warehouse solutions handle extensive historical data.

Ingestion throughput

Streaming solutions and real-time databases typically boast impressive ingestion throughput capabilities, often handling thousands to millions of events per second. In contrast, data warehouses like Redshift and BigQuery offer relatively lower throughput rates.

In my view, excluding outliers, the need for ingestion throughput closely aligns with data freshness requirements. Scenarios with less stringent data freshness requirements rarely demand high ingestion rates. This again reinforces the point that the aforementioned tools excel within their designated problem space.

Model complexity

Streaming solutions and real-time databases thrive in environments with flat table structures, where table joins are unnecessary. However, scenarios involving complex modelling, such as organization-wide data lakes, typically demand intricate business logic and the joining of multiple tables to derive insights. In such cases, data warehouses like Athena and Redshift are better suited to handle the complexity and facilitate the necessary data manipulations.
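To give a flavour of what this kind of complexity looks like, below is a hypothetical star-schema query that joins a fact table against several dimensions. It's the sort of statement you might submit through the Athena client shown earlier, or run directly on Redshift; every table and column name here is illustrative.

```python
# A hypothetical star-schema query: one fact table joined against several dimensions.
monthly_revenue_by_region = """
SELECT d.year_month,
       r.region_name,
       c.segment,
       SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
JOIN dim_region r   ON c.region_key = r.region_key
WHERE d.year_month >= '2024-01'
GROUP BY d.year_month, r.region_name, c.segment
"""

print(monthly_revenue_by_region)
```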

It's worth noting that the above considerations primarily apply to problems involving large volumes of data. For smaller datasets, any big data tool should perform adequately regardless of model complexity.

Moreover, similar to the correlation between data freshness and ingestion throughput needs, the complexity of the model also closely aligns with data freshness requirements. Scenarios with strict data freshness requirements rarely necessitate complex models, again reinforcing the suitability of the aforementioned tools within their designated problem space.

Data size

Cloud data warehouses such as Redshift, BigQuery, and Snowflake are designed to effectively handle petabyte-scale datasets with reasonable latency, often within seconds.

On the other hand, tools like ClickHouse, Trino, and Spark excel at managing gigabyte- to terabyte-scale data. Despite this, I believe they are well-equipped to tackle the majority of big data challenges faced by companies, barring those operating at FAANG scale. This confidence arises from the recognition that even with extensive datasets, querying typically targets specific segments rather than the entire dataset. With appropriate partitioning and distribution techniques, these tools can be optimized to perform well across a wide range of problem domains.
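As a rough sketch of what such partitioning looks like in practice, the PySpark snippet below writes events partitioned by date so that later queries only scan the days they filter on. The S3 paths and column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

events = spark.read.json("s3a://my-bucket/raw/events/")       # hypothetical source path
events = events.withColumn("dt", F.to_date("event_time"))

# Partition by date on write; engines like Spark, Trino, and Athena can then prune
# partitions and read only the days a query actually filters on.
(events.write
    .mode("overwrite")
    .partitionBy("dt")
    .parquet("s3a://my-bucket/curated/events/"))

# A later query filtering on dt scans only the matching partitions, not the whole dataset.
recent = (
    spark.read.parquet("s3a://my-bucket/curated/events/")
    .where(F.col("dt") >= "2024-03-01")
)
```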

Diverse data formats

Some challenges require the ability to ingest data from various sources, each employing different data formats. This scenario is prevalent among companies aiming to establish organization-wide data lakes that span multiple business functions, storing operational data and extracting insights from it.

To address this, emerging solutions like Apache Iceberg and Apache Hudi, coupled with query engines such as Trino and Dremio, are gaining traction. These solutions offer the flexibility of decoupling storage from compute, schema evolution, and distributed processing capabilities. As a result, they provide a hybrid approach that combines the strengths of both data lakes and data warehouses, making them well-suited to this diverse data format challenge.
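A minimal sketch of this hybrid approach, using Apache Iceberg on Spark, is shown below. It assumes the Iceberg Spark runtime jar is on the classpath; the catalog name, warehouse path, and table layout are hypothetical. The same table can then be queried from Trino or Dremio through their Iceberg connectors.

```python
from pyspark.sql import SparkSession

# "lake" is a hypothetical catalog name; the warehouse path is an assumption.
spark = (
    SparkSession.builder.appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/lakehouse/")
    .getOrCreate()
)

# The table format lives with the files, decoupled from compute: Spark writes it here,
# while other engines can read the same table through their own Iceberg connectors.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DOUBLE,
        order_ts    TIMESTAMP
    ) USING iceberg
    PARTITIONED BY (days(order_ts))
""")

# Schema evolution is a metadata-only operation; existing data files are left untouched.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMN channel STRING")
```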

Cost considerations

Data tools often place significant demands on infrastructure, resulting in higher costs. While many cloud data warehouse solutions, such as Redshift, offer serverless options, most of these still operate as standalone clusters with automated scaling capabilities. Moreover, deploying the data solution across multiple regions may require users to maintain multiple expensive clusters.

There are exceptions to this cost model, such as Google BigQuery and AWS Athena, which charge based on usage: customers are billed for the amount of data scanned during query execution. However, even with such systems, it's easy to accumulate hefty bills if the data pipeline isn't carefully designed. For instance, poorly designed pre-aggregation queries that repeatedly scan large tables every hour can lead to unnecessary costs.
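One lightweight guard against this, sketched below with the google-cloud-bigquery client, is to dry-run a query and inspect how many bytes it would scan before scheduling it. The project, dataset, and partition column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Filtering on the partitioning column (assumed here to be event_date) keeps the scan small;
# dropping the WHERE clause would bill for the full table every time the job runs.
sql = """
SELECT page, COUNT(*) AS views
FROM `my-project.analytics.page_views`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
GROUP BY page
"""

# A dry run estimates bytes processed without executing the query or incurring cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"This query would process {job.total_bytes_processed / 1e9:.2f} GB")
```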

Maintenance overhead

Managed cloud-based data warehouses typically require minimal maintenance overhead. However, if the solution involves on-premise self-managed hosting, the choice of the right data tool can significantly influence the maintenance burden.

In my view, data warehouses that couple the data and compute layers (e.g., ClickHouse) deliver their performance benefits at the cost of increased maintenance overhead. Additionally, open-source solutions often lack robust automation features for hosting, necessitating careful consideration when selecting a tool.

From my perspective, Trino serves as an excellent example of an open-source tool that decouples storage and compute, resulting in a nearly stateless system that is relatively easier to maintain.

In conclusion, while the tools discussed above possess inherent strengths, they can often be leveraged beyond their intended capabilities. However, the extent to which these tools can be stretched raises important questions about the complexity of the solutions. Deviating too far from their core strengths can lead to increasingly intricate implementations.

Furthermore, navigating the evolving landscape of the problem space presents its own set of challenges. Rarely clear-cut in its initial stages, the problem space tends to evolve over time, further complicating the tool selection process. Given the lasting impact of decisions made in the early stages, meticulous consideration in tool selection becomes paramount.

Ultimately, striking the right balance between leveraging a tool's versatility and aligning it with the problem requirements is crucial for successful implementation and long-term viability.