Verifying a memory issue
SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 30) (10.139.64.114 executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
These error messages, however, are often generic and can have other causes. So, if you suspect a memory issue, you can verify it by doubling the memory per core and seeing whether that changes the outcome.
For example, if you have a worker type with 4 cores and 16GB of memory, you can try switching to a worker type with 4 cores and 32GB of memory. That gives you 8GB per core compared to the 4GB per core you had before. It’s the ratio of memory to cores that matters here. If the job takes longer to fail with the extra memory, or doesn’t fail at all, that’s a good sign that you’re on the right track.
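On a managed platform you would make this change by picking a different worker type for the cluster; if you are launching Spark yourself, the equivalent knobs are the executor core and memory settings. Here is a minimal sketch in PySpark, assuming the settings are applied when the application is launched (the app name and sizes are illustrative):

```python
from pyspark.sql import SparkSession

# Keep the core count fixed and double the memory, going from
# 4 GB per core (4 cores / 16g) to 8 GB per core (4 cores / 32g).
# These settings only take effect if applied at launch time,
# not against an already-running cluster.
spark = (
    SparkSession.builder
    .appName("memory-verification")           # illustrative name
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "32g")   # previously "16g"
    .getOrCreate()
)
```

The point of the experiment is not to adopt this configuration permanently, only to see whether more memory per core changes how and when the job fails.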
If you can fix your issue by increasing the memory, great! Maybe that’s the solution. If it doesn’t fix the issue, or you can’t bear the extra cost, you should dig deeper.
Possible causes
There are a lot of potential reasons for memory problems:
- Too few shuffle partitions
- Large broadcast
- UDFs
- Window function without a PARTITION BY clause (see the sketch after this list)
- Skew
- Streaming state
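To illustrate one of these, a window function without a PARTITION BY clause forces Spark to move every row into a single partition on one executor, which is a common way to run out of memory. A minimal PySpark sketch, with made-up column names:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window-example").getOrCreate()
df = spark.range(1_000_000).withColumn("category", F.col("id") % 10)

# Problematic: no partitionBy, so Spark moves ALL rows to a single
# partition (and logs a warning about doing so).
global_window = Window.orderBy("id")
ranked = df.withColumn("rank", F.row_number().over(global_window))

# Safer: partition the window so each task only holds one category's rows.
partitioned_window = Window.partitionBy("category").orderBy("id")
ranked = df.withColumn("rank", F.row_number().over(partitioned_window))
```

Each of the other causes has its own diagnosis path, but the general pattern is the same: find the operation that concentrates too much data on one executor or on the driver.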