I love using the Eclipse Java debugger to trace and step through code. For remote systems I use Java's remote attach capability, which is especially handy for Hadoop. I wanted to do the same thing for SPARK, so after some digging I figured out how.

I’m running Fedora in virtual machines that connect back to my Eclipse instance on an OS X host machine. Here are the steps:

  • On your SPARK cluster, download a pre-built Spark release. I used 1.1.0, which is compatible with Hadoop 2.3. Unzip it somewhere like /opt/spark-1.1.0.
  • On your SPARK cluster, download Hadoop 2.3.
  • Unzip and configure Hadoop to your liking. I prefer a Gluster configuration over HDFS (more info on gluster+hadoop). A good spot for Hadoop is /opt/hadoop-2.3.0.
  • To configure SPARK, edit #SPARK_INSTALL#/conf/spark-env.sh and add an env variable pointing to the Hadoop install: HADOOP_CONF_DIR=/opt/hadoop-2.3.0/. For full disclosure, I’m also using Tachyon in front of Gluster, so I have an extra CLASSPATH entry in spark-env.sh:

    export SPARK_CLASSPATH=/opt/tachyon/core/target/tachyon-0.6.0-SNAPSHOT-jar-with-dependencies.jar:/opt/hadoop-2.3.0/share/hadoop/common/lib/glusterfs-hadoop-2.3.10.jar:$SPARK_CLASSPATH

    Now start SPARK: #SPARK_INSTALL#/sbin/start-all.sh

  • Create an example SPARK application. I used the WordCount example that ships in the SPARK archive.
  • Package your SPARK application as a JAR and upload it to your SPARK cluster or standalone node.
  • I like to create a debug shell script to launch the application and supply the appropriate Java debugger parameters. Mine looks like:
    #!/bin/sh
    export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5431"
    ./bin/spark-submit \
    --class "org.apache.spark.examples.JavaWordCount" \
    --master local \
    /root/spark-example-0.0.2.jar tachyon://127.0.0.1:19998/er/root/in-dir/words1
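
    For reference, here is what each piece of that JDWP agent string does; suspend=y is the reason the application sits and waits until Eclipse attaches:

```shell
# JDWP agent options used in the debug script above:
#   transport=dt_socket  -> debug over a plain TCP socket
#   server=y             -> the JVM listens for the debugger (rather than dialing out)
#   suspend=y            -> the JVM pauses before main() until a debugger attaches
#   address=5431         -> the TCP port to listen on; must match the Eclipse config
export SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5431"
```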
  • Back in Eclipse, download the source for anything you want to debug. The SPARK source is a good candidate, as is the Hadoop source. Be sure your versions match in source + bin! Import the source as projects; details can be found on the Apache SPARK and Hadoop pages. Also create a project for your sample application and import its source, so you can trace through it.
  • In Eclipse, under Debug Configurations, select “Remote Java Application” and click New. Select all of the source projects you set up in the previous step, and enter the hostname or IP of your SPARK master node in the Host field. Use port 5431 in the Port field (you can change the port per the debug script above).

That’s it! Set some breakpoints in your sample application, run the debug shell script on your SPARK cluster, then launch the “Remote Java Application” debug configuration in Eclipse. Eclipse should connect to your cluster, which will start the SPARK application and halt at any breakpoints you may have set.
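One practical wrinkle: with suspend=y the JVM blocks until something attaches, so it helps to confirm the debug port is actually listening before you launch the Eclipse configuration. A small helper for that (my own sketch, assuming bash and the port from the script above):

```shell
# wait_for_jdwp HOST PORT [TRIES]
# Polls once per second until something accepts TCP connections on HOST:PORT,
# e.g. the JDWP socket opened by the debug script above. Uses bash's /dev/tcp
# redirection, so no extra tools are needed on the cluster node.
wait_for_jdwp() {
  host=$1; port=$2; tries=${3:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    # The subshell opens (and immediately closes) a TCP connection.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}
```

Usage: `wait_for_jdwp spark-master 5431 && echo "ready to attach"` (hostname and port here match my setup; adjust to yours).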

Next, debugging the SPARK daemon processes…..
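
As a head start on that, here is what I expect to try (an untested sketch on my part): the standalone daemons pick up SPARK_DAEMON_JAVA_OPTS from spark-env.sh, so the same agent string should work there, with suspend=n so the daemons boot normally and a different port than the application uses:

```shell
# Hypothetical spark-env.sh addition for debugging the standalone daemons
# (untested sketch): suspend=n so the master/worker start without waiting,
# and port 5432 (my arbitrary choice) to avoid colliding with the
# application's 5431. Note that a master and a worker on the same host
# would both try to bind this port, so debug one daemon at a time.
export SPARK_DAEMON_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5432"
```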

dictated but not read -bc