
Commanding Big Data on Databricks: Optimised Pipeline Runtime from 12.5 Hours to 2 Hours

Learn how we optimized a 12.5-hour Java Spark pipeline on Google Dataproc to a 2-hour PySpark solution on Databricks through refactoring and performance tuning.

Mar 11, 2025

When a leading marketing analytics platform needed to modernize their data infrastructure, we faced a significant technical challenge. Their existing system, a ~3 TB data transformation pipeline running on Google Dataproc, was processing vital marketing attribution data through Java Spark but required over 12 hours to complete.

This performance bottleneck was impacting their ability to deliver timely insights to their customers. Attempting to execute the identical Java code on Databricks proved just as ineffective, with runtimes stretching to 15 hours.

This led our team to optimise the pipeline by redesigning it in PySpark, resulting in a dramatic reduction in execution time to under 2.5 hours!

Problem Overview

The original pipeline had several issues that contributed to its inefficiency:

  1. Unnecessary Complex Functions: The Java code had overly complex functions, some of which were redundant or poorly designed for scalability.
  2. Array Processing: Heavy use of array processing led to inefficiencies in Spark’s distributed architecture.
  3. Custom UDFs: Many transformations relied on custom User Defined Functions (UDFs), which are notoriously slower than Spark’s built-in functions.

When attempting to run this code on Databricks, it became evident that it was not optimized for the platform’s strengths. Without a significant rewrite, the pipeline would remain a bottleneck.

Understanding the Business Requirements

Before getting into optimisation, we spent time studying the business logic driving the transformations. This phase was critical to ensure that the revised code met all functional requirements while improving performance. The key tasks involved:

  • Cleaning and standardising input data
  • Aggregating huge datasets
  • Applying bespoke business rules for data enrichment

Optimization Approach

1. Transitioning from Java to PySpark

While Java is a powerful language, PySpark provides a more concise syntax and better integration with Databricks’ ecosystem. By leveraging PySpark, we:

  • Reduced code complexity, making it easier to maintain.
  • Took advantage of Databricks’ built-in optimizations for Python workloads.

2. Setting Relevant Spark Configurations

Tuning Spark configurations was critical to handle the large dataset efficiently. Some key settings included:

  • spark.executor.memory and spark.executor.cores: Adjusted to maximize resource utilization on the cluster.
  • spark.sql.shuffle.partitions: Set to an appropriate number based on the dataset size to optimize shuffle operations.
  • spark.dynamicAllocation.enabled: Enabled to ensure efficient resource allocation.
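
To make the tuning concrete, here is a minimal sketch of how such settings can be applied when building a Spark session. The values are placeholders rather than the tuned production numbers, and on Databricks several of these are typically set in the cluster’s Spark configuration instead of in code.

from pyspark.sql import SparkSession

# Illustrative values only -- the real settings were tuned to the cluster and the ~3 TB workload.
spark = (
    SparkSession.builder
    .appName("marketing-attribution-pipeline")           # hypothetical job name
    .config("spark.executor.memory", "32g")              # sized to worker memory (assumed value)
    .config("spark.executor.cores", "8")                 # cores per executor (assumed value)
    .config("spark.sql.shuffle.partitions", "2000")      # scaled to the dataset size (assumed value)
    .config("spark.dynamicAllocation.enabled", "true")   # let Spark scale executors with the workload
    .getOrCreate()
)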

3. Replacing Inefficient Operations

a. Removing Unnecessary Functions

The Java code included redundant computations and excessive intermediate steps. By streamlining the transformations, we reduced the overall computational load.

b. Using Window Functions

Array processing in the Java code was replaced with Spark’s window functions.

For example, calculating cumulative sums and rankings over partitions became significantly faster using window specifications, as in the sketch below.
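
A minimal sketch of the pattern, assuming hypothetical column names such as campaign_id, event_ts, and spend:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running totals and rankings per partition, replacing manual array handling.
w = Window.partitionBy("campaign_id").orderBy("event_ts")

df_ranked = (
    df
    .withColumn("cumulative_spend", F.sum("spend").over(w))   # cumulative sum up to the current row
    .withColumn("touch_rank", F.row_number().over(w))         # position of each event within its partition
)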

c. Replacing UDFs with Built-in Functions

Custom UDFs in the Java code were rewritten using PySpark’s built-in functions, which are optimized for distributed computation.

For instance, instead of a UDF to calculate the median of an array, we used percentile_approx on partitions. Built-in functions not only improved performance but also enhanced code readability.
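
As an illustration, with hypothetical column names, the UDF-based median can be expressed as a built-in aggregate (available in Spark 3.1+); note that percentile_approx is approximate, which was acceptable here:

from pyspark.sql import functions as F

# Approximate median per group via a built-in aggregate instead of a custom UDF.
medians = (
    df.groupBy("customer_id")
      .agg(F.percentile_approx("order_value", 0.5).alias("median_order_value"))
)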

4. Debugging and Profiling

Throughout the rewrite process, we utilised Databricks’ Spark UI to profile the pipeline and identify bottlenecks. This iterative debugging process helped fine-tune the optimizations.
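
As a complement to the Spark UI (a general Spark technique rather than anything specific to this pipeline), inspecting the physical plan of an intermediate DataFrame is a quick way to spot unexpected shuffles or lingering UDF calls:

# df is any intermediate DataFrame in the pipeline.
df.explain(mode="formatted")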

Results

The optimized PySpark pipeline cut end-to-end runtime from over 12 hours on Google Dataproc to under 2.5 hours on Databricks.

Migrating and optimising a legacy pipeline is a challenging but rewarding task. By rewriting the Java Spark code in PySpark and applying targeted optimisations, we were able to unlock the full potential of Databricks for this use case. Always strive to balance performance, maintainability, and scalability when migrating or building new pipelines.

Tags:

Databricks, Migration, Data Analytics, AI, Bigdata, Runtime Reduction
