Why I started DataPelago: Rajan Goyal, Founder and CEO

Here’s a technology cycle that might sound familiar: Demand for compute exceeds current processing power, and everyone frets. Then, exciting new technology shatters the performance barrier while also enabling previously unworkable business models. A rush of wealth and opportunity ensues. As the technology matures, improvements become incremental. Another performance ceiling comes into view, people fret, and the cycle repeats.

I have helped crack these performance limits numerous times, including back in the early aughts, when web traffic was largely unsecured, data hacks were rampant, and the push for widespread end-to-end security was limited by slow network hardware. I led the architecture team at Cavium to champion hardware-software co-design for its flagship processor family, which allowed layer 4 to layer 7 services at internet scale and enabled the explosive growth of SaaS and cloud hosting.

Later on, we invented data processing units (DPUs) to solve the data movement latency that came with growing east-west traffic and the limits of cloud infrastructure virtualization. Together with the CPU and the GPU, the DPU has become the de facto third pillar of computing and has made the disaggregated architecture of modern cloud computing feasible.

Today’s concern about data processing limitations is not a crisis—it’s an opportunity. We have reached an inflection point where the current generation of CPU hardware and software can’t match the relentless growth of the global datasphere.

Decades of experience have shaped my core belief that only non-linear thinking will allow us to shatter the limits that data processing faces today in performance, cost, and scalability. The performance bottleneck is no longer just I/O; it has migrated up the stack to compute itself. With this realization, I set out to apply the same hardware/software co-design principles from the data movement problem to the data processing problem.

Here are the three trends that will continue to reshape data storage and the data processing industry:

The rise of unstructured data
Moore’s Law and accelerated computing
Open-source compute engines & frameworks

Let’s examine each in turn.

The Rise of Unstructured Data

All data volumes are rising, but unstructured data is a tsunami. While it was a fraction of structured and semi-structured data 20 years ago, unstructured data now accounts for 90% of all data created—and it’s still growing by double-digit percentages each year. The world’s coping mechanism for these unprecedented data volumes can be summed up as Store and Ignore: 30% of data is estimated to have real business value, yet far less than 1% is ever analyzed.

The tension between rising unstructured data volumes and the challenge of processing them quickly and cost-effectively is perhaps most pronounced in the race for artificial intelligence. The data that informs Large Language Models (LLMs) is primarily unstructured, and preparing and processing that data is often the most time-consuming and costly part of a project. We need a unified platform that enables data pipelines in a time- and cost-efficient manner.

Moore’s Law and Accelerated Computing

To maintain the exponential increase in compute speed that Gordon Moore predicted in 1965, the semiconductor industry has regularly shifted strategies. Increasing clock speed worked until we hit the frequency wall in the early 2000s. Then we turned to multiple cores until we hit the die size limit in the 2010s. The focus then shifted to increasing the number of servers until we hit the scale-out wall. The next step function will be provided by accelerated computing, which is a combination of domain-specific hardware and refactored software. Optimizing software for accelerated computing is “insanely hard,” as NVIDIA Founder & CEO Jensen Huang describes it, but it represents the next wave of compute acceleration.

Open-source Compute Engines & Frameworks

Over the past decade, open-source compute engines like Spark and Trino/Presto, along with the emergence of Lakehouses, have matured and transformed the data landscape. These innovations empower customers to escape vendor lock-in, enabling them to choose the best solutions tailored to their unique needs. However, these open-source engines often struggle with performance when compared to premium managed but closed engines built on the same frameworks. It’s time to enhance these open-source solutions with accelerated computing capabilities, enabling them to leapfrog the performance of premium engines while avoiding any change to applications, tools, and workflows.

The Path Forward

“Data processing is the single largest computing demand on the planet today,” says Jensen Huang, arguably the biggest beneficiary of these expenditures, adding that “all of this should be accelerated.” To achieve a price/performance ratio that is orders of magnitude better than what is available today, we need to fundamentally reimagine the data system. This means an approach that:

Leverages all the variety of accelerated computing instances offered by cloud service providers, including GPUs, FPGAs, CPU/SIMD, and XPUs.
Is built from the ground up to natively handle structured, semi-structured, and unstructured data.
Works with powerful frameworks, including open-source ones like Spark and Trino, so that users can continue to use existing tools, workflows, and platforms with no change in applications while taking advantage of accelerated computing.

It was clear to me that building such a breakthrough platform would require a team from multiple disciplines: system, architecture, data movement, data analytics, cloud SaaS, UX, and more. With this cross-functional team, there would be no barrier we couldn’t scale as we developed a computing platform of the future. Even better, done right, a team of this scope and caliber would create a high barrier to entry for others.

Building teams like that to create solutions as described above is exactly what I’ve done my whole career.

That’s why I founded DataPelago!
