How to analyze Databricks performance using Ganglia

  • What is Ganglia? Ganglia is a monitoring system designed for high-performance computing environments. It provides real-time data on CPU usage, memory usage, network activity, and more.
  • Purpose of Ganglia: By examining the metrics on Ganglia, you can assess whether your Databricks cluster is under-provisioned (too few resources), over-provisioned (too many resources), or just right. This helps ensure efficient use of resources.
  • Detailed Insights: Ganglia displays detailed, node-level metrics like CPU consumption, memory usage, disk usage, and network I/O. These metrics help you understand each node’s performance and stability.
  • Management Actions: With insights from Ganglia, you can decide whether to manually scale your resources, enable auto-scaling, or switch to different instance types to optimize performance.
  • Limitations: Ganglia does not provide job-specific insights or support for pipelines. Its output options are also limited: capturing the current system status often means taking a screenshot and saving it as a JPEG or PNG image.
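
The sizing judgment described above can be sketched in code. This is only a minimal illustration, assuming you have read CPU and memory utilization percentages off the Ganglia graphs yourself. The function name `classify_provisioning` and the 30% and 80% thresholds are hypothetical choices for this example, not values defined by Ganglia or Databricks.

```python
from statistics import mean


def classify_provisioning(cpu_pct, mem_pct, low=30.0, high=80.0):
    """Classify a cluster from utilization samples (percentages, 0-100).

    The `low` and `high` thresholds are illustrative assumptions.
    """
    if not cpu_pct or not mem_pct:
        raise ValueError("need at least one CPU and one memory sample")
    avg_cpu = mean(cpu_pct)
    avg_mem = mean(mem_pct)
    # Either resource running hot suggests too few resources.
    if avg_cpu > high or avg_mem > high:
        return "under-provisioned"
    # Both resources mostly idle suggests too many resources.
    if avg_cpu < low and avg_mem < low:
        return "over-provisioned"
    return "balanced"


print(classify_provisioning([92, 88, 95], [60, 65, 70]))  # under-provisioned
print(classify_provisioning([10, 12, 8], [20, 18, 25]))   # over-provisioned
print(classify_provisioning([55, 60, 50], [45, 50, 40]))  # balanced
```

Suitable thresholds depend on the workload, so treat the labels as a prompt to look closer at the graphs rather than as a verdict.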

You can find Ganglia under the Metrics section of a Databricks cluster, as shown below.

Figure 1: Ganglia metrics and their interpretation

The diagram above shows an example of a balanced server load distribution. The diagram below shows an unbalanced one. Look out for red squares: they mark hot spots, the nodes carrying a heavier load than the rest.

Figure 2: Example of an unbalanced server load distribution
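
The idea of a hot spot can also be expressed numerically. The sketch below assumes you have a per-node load value for each node (for example, noted down from the Ganglia node views) and flags nodes whose load is well above the cluster average. The function name `find_hot_nodes`, the node names, and the 1.5x factor are hypothetical; this does not reproduce the rule Ganglia uses to color its squares.

```python
from statistics import mean


def find_hot_nodes(load_by_node, factor=1.5):
    """Return the names of nodes whose load exceeds `factor` times the mean.

    `factor` is an illustrative assumption, not a Ganglia setting.
    """
    if not load_by_node:
        return []
    avg = mean(load_by_node.values())
    # A node counts as "hot" when it carries clearly more than its share.
    return sorted(name for name, load in load_by_node.items()
                  if load > factor * avg)


loads = {"node-1": 1.0, "node-2": 1.2, "node-3": 0.9, "node-4": 4.9}
print(find_hot_nodes(loads))  # ['node-4']
```

An empty result corresponds to the balanced case, where no node stands out from the others.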

Also watch for memory swapping, which indicates pressure on RAM. The diagram below highlights an example.

Figure 3: Indicator of RAM pressure

Best practices for using Ganglia

A.   Check the cluster Memory last hour and cluster CPU last hour dashboards. Look for periods of high usage and periods of idle time.

B.   Check the Server Load Distribution map to see how evenly the workload is spread across the nodes. An absence of red squares indicates a well-balanced load.

C.   Watch for memory swapping. In the cluster Memory last hour dashboard it appears as a thin purple line above a red line, as shown in Figure 3. This indicates memory pressure.
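
Swap usage can also be checked directly on a node as a cross-check against the dashboard. The sketch below is a rough illustration that assumes a Linux node, where `/proc/meminfo` reports `SwapTotal` and `SwapFree` in kB. The helper name `swap_used_kb` and the sample figures are made up for this example, and the result only describes the machine the text was read from, not the whole cluster.

```python
def swap_used_kb(meminfo_text):
    """Return the amount of swap in use (kB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values["SwapTotal"] - values["SwapFree"]


# Made-up sample in the /proc/meminfo layout.
sample = """MemTotal:       16384000 kB
MemFree:         1024000 kB
SwapTotal:       2097148 kB
SwapFree:        1572860 kB
"""

used = swap_used_kb(sample)
print(used)  # 524288
print("memory pressure" if used > 0 else "no swap in use")
```

On a real node you would pass in the live file contents, for example `swap_used_kb(open("/proc/meminfo").read())`.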
