In a previous article, we discussed how to prepare the setup of a Hadoop/Spark cluster. Preparing the cluster was only the beginning, though. While there are plenty of resources about setting up Hadoop on a cluster, as a beginner I found it confusing at times and spent a good dozen hours putting it all together. Therefore, in this article, I’ll document the issues I encountered while setting up the Hadoop/Spark cluster and how I solved them. If you wish to give it a try yourself, please review the tutorial I followed to set this all up, which I wholeheartedly recommend.

Now, to give you some context, my setup consists of a laptop (which serves as both the name node and a data node) and two Raspberry Pis as additional data nodes.

Note that I will be running Spark 2.4.5 and Hadoop 3.2.1.

So, we’ve downloaded and unpacked Hadoop, and moved it to /opt/hadoop.

Let’s try to start it:

hadoop version

Oops!

JAVA_HOME_NOT_SET

First of all, please check that you have Java installed. If you plan to add Spark down the road, keep in mind that the current stable Spark version is not compatible with Java 11, so install Java 8:

sudo apt install openjdk-8-jre-headless -y

Then, let’s look at the environment variables. The tutorial (built for Raspbian) recommended setting this environment variable to something like

export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")

whereas the following, inspired by the answer here, is what worked for me (on Ubuntu):

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64

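Please also note the environment variables I’ve added to .bashrc (the JAVA_HOME suffix, amd64 or arm64, has to match the architecture of the machine you are configuring):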

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-arm64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

Also, we need to add JAVA_HOME as above to

/opt/hadoop/etc/hadoop/hadoop-env.sh
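Before retrying, it can help to confirm that the value you exported really points at a Java installation. The snippet below is a minimal sketch of such a check. `check_java_home` is a hypothetical helper name, not something Hadoop ships, and it only tests that an executable `bin/java` exists under the given directory.

```shell
# Hypothetical helper: succeeds if the given directory looks like a Java home,
# i.e. it is non-empty and contains an executable bin/java.
check_java_home() {
  [ -n "$1" ] && [ -x "$1/bin/java" ]
}

# Report on whatever JAVA_HOME is currently exported (it may be unset).
if check_java_home "${JAVA_HOME:-}"; then
  echo "JAVA_HOME looks valid: $JAVA_HOME"
else
  echo "JAVA_HOME is unset or has no bin/java: ${JAVA_HOME:-<unset>}"
fi
```

If the check fails, listing /usr/lib/jvm shows which directories are actually available on that machine.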

Let’s try again now:

hadoop version

Works! Moving forward.

Permission denied (publickey,password).

So, everything is ready to spin up the Hadoop cluster. The only thing is that when running

start-dfs.sh && start-yarn.sh

I keep receiving the following error:

localhost: Permission denied (publickey,password).

Turns out, multiple things are happening here. First, my user had a different name than the one on the slave nodes, so my SSH authentication was failing. To sort that out, on the master, I had to set up a config file at ~/.ssh/config as follows:

Host rpi-3
  HostName rpi-3
  User ubuntu
  IdentityFile ~/.ssh/id_rsa

Host rpi-4
  HostName rpi-4
  User ubuntu
  IdentityFile ~/.ssh/id_rsa

This made it clear which user I want to use when connecting over SSH.

But that wasn’t all. More importantly, it turned out I needed to SSH into localhost first to be able to run start-dfs.sh and start-yarn.sh, since the error above is raised for localhost itself. Moving forward.
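One common way to make SSH into localhost work without a password is to authorize the master’s own public key. I’m assuming here that this is what was missing. The sketch below uses a hypothetical helper, `authorize_key`, that appends a public key to an authorized_keys file only if it is not already there, so it is safe to run more than once.

```shell
# Hypothetical helper: append a public key to an authorized_keys file,
# but only if that exact line is not already present.
authorize_key() {
  pub_file="$1"
  auth_file="$2"
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # -x: match whole lines only, -F: treat the key as a fixed string
  grep -qxF -e "$(cat "$pub_file")" "$auth_file" || cat "$pub_file" >> "$auth_file"
  # Keep the file private to the owner
  chmod 600 "$auth_file"
}
```

On the master this would be called as `authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys` (assuming the public key sits next to the id_rsa used above), followed by a manual `ssh localhost` to confirm the login and accept the host key.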

Hadoop cluster setup — java.net.ConnectException: Connection refused

The next error on our list took me a good evening to debug. When trying to start the cluster with start-dfs.sh, I was getting a connection refused error. At first, without thinking too much about it, I followed the advice from here and set the default server address to 0.0.0.0.

The correct answer was in fact to set it to my name node server’s address (in core-site.xml) AND to make sure there isn’t an entry in /etc/hosts tying that to 127.0.0.1 or localhost. Hadoop doesn’t like that, and I’ve been warned.
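To check the /etc/hosts part quickly, something like the following can be used. It is a sketch under assumptions: `loopback_mapped` is a hypothetical helper, and it simply looks for the given host name on any line whose address starts with 127. Debian and Ubuntu commonly add a `127.0.1.1 <hostname>` line, which is exactly the kind of entry to look for.

```shell
# Hypothetical helper: succeeds if the host name is mapped to a 127.x.x.x
# address in a hosts file (defaults to /etc/hosts).
loopback_mapped() {
  host="$1"
  hosts_file="${2:-/etc/hosts}"
  awk -v h="$host" '
    { sub(/#.*/, "") }                      # drop comments
    $1 ~ /^127\./ {                         # loopback lines only
      for (i = 2; i <= NF; i++) if ($i == h) found = 1
    }
    END { exit (found ? 0 : 1) }
  ' "$hosts_file"
}
```

Running `loopback_mapped your-name-node` on the master should fail (exit non-zero) once the offending entry is gone. The address Hadoop itself uses can be printed with `hdfs getconf -confKey fs.defaultFS`, which reads the fs.defaultFS property from core-site.xml.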

Nodes not showing up

In a multi-node setup, missing nodes can stem from a wide array of causes. Wrong configuration, differences between environments and even firewalls can cause issues. Here are a couple of tips on how to make heads or tails of it.

Check the UIs first

There are several web interfaces exposed in a typical Hadoop stack. Two of them are the YARN UI on port 8088 and the Name Node Information UI on port 9870. In my case, HDFS and YARN reported a different number of nodes, which pointed me towards an issue with my YARN settings.

Use jps to see which services are running

jps is a Java tool, but you can use it to see which Hadoop services are up on a particular machine.
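jps prints one line per running JVM in the form `<pid> <ClassName>`, so its output is easy to filter. The helper below is a small sketch (`has_service` is a hypothetical name) that reads jps output on standard input and succeeds only if the named service is listed.

```shell
# Hypothetical helper: read jps output on stdin ("<pid> <ClassName>" per line)
# and succeed if the named service appears in the second column.
has_service() {
  awk -v svc="$1" '$2 == svc { found = 1 } END { exit (found ? 0 : 1) }'
}
```

For example, `ssh rpi-3 jps | has_service DataNode || echo "rpi-3: DataNode not running"` checks one of the data nodes from the master. This assumes jps is on the PATH of a non-interactive SSH session on that node, which may not be the case if the PATH is only set in .bashrc.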

Is the DataNode up on the given node? Let’s move forward.

Dive into Hadoop logs

Hadoop logs are located in the logs subfolder, so in my case /opt/hadoop/logs. Don’t see a particular data node? SSH into that machine and analyze the logs for that particular service.
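The logs can get long, so I find it easier to pull out only the most recent errors. A minimal sketch, assuming the log lines carry the usual ERROR and FATAL level markers; `recent_errors` is a hypothetical helper name.

```shell
# Hypothetical helper: print the last N lines (default 20) of a log file
# that mention ERROR or FATAL.
recent_errors() {
  log_file="$1"
  max_lines="${2:-20}"
  grep -E 'ERROR|FATAL' "$log_file" | tail -n "$max_lines"
}
```

The exact log file name depends on the user, the service and the host, so list /opt/hadoop/logs first and then pass the file for the service you are debugging, for example `recent_errors /opt/hadoop/logs/<datanode log file> 5`.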

Check the configuration files

That’s where most issues stem from. Review the logs, read up on what the correct configuration should be and apply it. The configuration files are located in /opt/hadoop/etc/hadoop. Start with the file corresponding to the service you’re trying to debug.

OK, so we’re more or less fine with Hadoop now. What about Spark? I downloaded Spark (without Hadoop) and unpacked it. Now what? Smooth sailing? Not so fast.

Spark fails to start

If you’ve downloaded the Spark build without Hadoop, chances are you’ll bump into the same issue as I did, so be sure to add the following to conf/spark-env.sh:

export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

Credits for this solution go here.

Pyspark error — Unsupported class file major version 55

Seriously, I’ve warned you above that the current stable Spark version is only compatible with Java 8. More details on that.
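As a quick way to read the error: class file major versions are offset by 44 from the Java release number, so 55 means the classes are being loaded by Java 11, while Java 8 corresponds to 52. The one-liner below just does that arithmetic; `class_major_to_java` is a hypothetical helper name.

```shell
# Class file major versions are offset by 44 from the Java release number
# (52 is Java 8, 55 is Java 11).
class_major_to_java() {
  echo $(( $1 - 44 ))
}

class_major_to_java 55   # the version from the error message; prints 11
```

Running `java -version` on the affected machine shows which Java is actually being picked up. On Ubuntu, `sudo update-alternatives --config java` is one way to switch back to Java 8 if several versions are installed.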

java.lang.NumberFormatException: For input string: "0x100"

Next, this harmless but annoying message that you’d see once starting the spark-shell can be easily fixed by adding an environment variable to .bashrc.

export TERM=xterm-color

Spark UI has broken CSS

When starting a spark-shell or submitting a Spark job, a spark context Web UI is made available at port 4040 on the namenode. In my case, the issue was that the UI had a broken interface, which made using it impossible.

Once again, StackOverflow was my friend. To sort this one out, start a spark-shell and run the following command:

sys.props.update("spark.ui.proxyBase", "")

Credits to the solution here.

Property spark.yarn.jars

When running Spark in cluster mode on top of a YARN cluster, the Spark .jar files need to be shipped across to the other nodes that don’t have Spark installed. It probably took me more tinkering to sort out than it should have, but in my case, the solution was to upload them to a location inside HDFS and let the nodes get them from there when they need them.

Creating the archive:

jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .

Upload to HDFS

hdfs dfs -put spark-libs.jar /some/path/

Then, in spark-defaults.conf, set the property to the same HDFS location you uploaded the archive to:

spark.yarn.archive hdfs://your-name-node:9000/some/path/spark-libs.jar

Credits to the solution and explanation here.

Conclusion

This concludes our recap of some of the errors I encountered while setting up Hadoop and Spark as a beginner. We’ll give our cluster a spin, test it out and report on it in a future article. Thanks for reading and stay tuned!

Found it useful? Subscribe to my Analytics newsletter at notjustsql.com.

9 issues I’ve encountered when setting up a Hadoop/Spark cluster for the first time