9 issues I’ve encountered when setting up a Hadoop/Spark cluster for the first time
In a previous article, we discussed how to prepare the setup of a Hadoop/Spark cluster. Preparing the cluster, however, was only the beginning. While there are plenty of resources about setting up Hadoop on a cluster, as a beginner I found it confusing at times and spent a good dozen hours putting it all together. In this article, I'll therefore document the issues I encountered while setting up the Hadoop/Spark cluster and how to solve them. If you wish to give it a try yourself, please review the tutorial I followed to set this all up, which I wholeheartedly recommend.
Now, to give you some context, my setup consists of a laptop (which will serve both as a name node and a data node) and two Raspberry Pis as additional data nodes, as follows:
Note that I will be running Spark 2.4.5 and Hadoop 3.2.1.
So, we've downloaded Hadoop, unpacked it and moved it into place. Let's try to start it.
First of all, please do check that you have Java installed. If you'd like to run Spark down the road, keep in mind that the current stable Spark version is not compatible with Java 11.
sudo apt install openjdk-8-jre-headless -y
Then, let's look at the environment variables. The tutorial (built for Raspbian) recommended setting this environment variable to something like
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
whereas the following, inspired by the approach here, made it work for me (on Ubuntu):
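For reference, on Ubuntu with OpenJDK 8 this typically resolves to a path along these lines (the exact directory is an assumption and depends on your distribution and architecture):

```shell
# JAVA_HOME pointing at the JDK install directory (path may differ on your system)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```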
Please also note the environment variables I've added to .bashrc:
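A minimal sketch of what those .bashrc additions look like, assuming Hadoop lives in /opt/hadoop as elsewhere in this article (the Spark location is an assumption):

```shell
# Hadoop lives in /opt/hadoop (as elsewhere in this article); /opt/spark is an assumption
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
# Make the hadoop/hdfs/yarn/spark binaries available on the PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin
```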
Also, we need to add JAVA_HOME, as above, to $HADOOP_HOME/etc/hadoop/hadoop-env.sh.
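Hadoop reads this setting from etc/hadoop/hadoop-env.sh rather than from your shell environment, so the line to add there looks like this (JDK path assumed, as before):

```shell
# Hadoop does not inherit JAVA_HOME from your shell; set it explicitly here
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```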
Let’s try again now:
Works! Moving forward.
Permission denied (publickey,password).
So, everything is ready to spin off the Hadoop cluster. The only thing is when running
start-dfs.sh && start-yarn.sh
I keep receiving the following error:
localhost: Permission denied (publickey,password).
Turns out, multiple things are happening here. First, my user had a different name than the one on the slave nodes, so my SSH authentication was failing. To sort that out, on the master, I had to set up a config file at ~/.ssh/config as follows:
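The hostnames, addresses and user below are hypothetical; the important part is the User directive, which tells SSH which remote account to log in as:

```
Host pi1
    HostName 192.168.0.101
    User ubuntu
Host pi2
    HostName 192.168.0.102
    User ubuntu
```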
This made it clear which user I want to use when connecting over SSH.
Hadoop cluster setup — java.net.ConnectException: Connection refused
The next error on our list took me a good evening to debug. When trying to start the cluster with start-dfs.sh, I was receiving a connection refused error. At first, without thinking too much about it, I followed the advice from here and set the default server to 0.0.0.0.
The correct answer was in fact to set it to my name node server’s address (in core-site.xml) AND to make sure there isn’t an entry in /etc/hosts tying that to 127.0.0.1 or localhost. Hadoop doesn’t like that, and I’ve been warned.
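For illustration, the relevant core-site.xml property would look something like the following, where namenode-host is a placeholder for whatever hostname resolves to your name node's network address (and, again, must not map to 127.0.0.1 in /etc/hosts); the port is the conventional 9000:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- must resolve to the name node's network address, not 127.0.0.1 -->
    <value>hdfs://namenode-host:9000</value>
  </property>
</configuration>
```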
Nodes not showing up
In a multi-node setup, missing nodes can stem from a wide array of causes: wrong configuration, differences between environments and even firewalls can all cause issues. Here are a couple of tips on how to make heads or tails of it.
Check the UIs first
There are several web interfaces exposed in a typical Hadoop stack. Two of them are the YARN UI, exposed on port 8088, and the Name Node Information UI, exposed on port 9870. In my case, I had a different number of nodes in HDFS than in YARN, which pointed me to investigate and find the issue in my YARN settings.
Use jps to see which services are running
jps is a JDK tool, but you can use it to see which Hadoop services are up on a particular machine.
Is the DataNode up on the given node? Let’s move forward.
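On the machine acting as both name node and data node, a healthy output looks roughly like this (the process IDs and exact service list are illustrative and will vary with your setup):

```shell
$ jps
2145 NameNode
2398 DataNode
2687 ResourceManager
2841 NodeManager
3012 Jps
```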
Dive into Hadoop logs
Hadoop logs are located in the logs subfolder, so in my case /opt/hadoop/logs. Don't see a particular data node? SSH into that machine and analyze the logs for that particular service.
Check the configuration files
That's where most issues stem from. Review the logs, look up what the correct configuration should be and apply it. The configuration files are located in /opt/hadoop/etc/hadoop. Start with the file corresponding to the service you're trying to debug.
OK, so we’re more or less fine with Hadoop now. What about Spark? I downloaded Spark (without Hadoop) and unpacked it. Now what? Smooth sailing? Not so fast.
Spark fails to start
If you've downloaded Spark without bundled Hadoop, chances are you'll bump into the same issue I did, so be sure to add the following to $SPARK_HOME/conf/spark-env.sh:
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)
Credits for this solution go here.
Pyspark error — Unsupported class file major version 55
Seriously, I warned you above that the current stable Spark version is only compatible with Java 8; class file major version 55 corresponds to Java 11. More details on that here.
java.lang.NumberFormatException: For input string: “0x100”
Next, this harmless but annoying message that you'd see when starting the spark-shell can be easily fixed by adding an environment variable to .bashrc.
Read more about it here.
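I believe the variable in question is TERM: the exception comes from jline tripping over 256-color terminal definitions, and the commonly cited workaround (an assumption on my part that this is the exact fix meant here) is forcing a plainer value:

```shell
# Workaround for jline's NumberFormatException on 256-color terminals (assumed fix)
export TERM=xterm-color
```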
Spark UI has broken CSS
When starting a spark-shell or submitting a Spark job, a Spark context web UI is made available on port 4040 of the name node. In my case, the UI's CSS was broken, which made using it impossible.
Once again, StackOverflow was my friend. To sort this one out, one needs to start a spark-shell and run the following command:
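I can't be certain this is the exact command referenced, but the commonly cited fix for a mangled UI behind the YARN proxy is clearing the spark.ui.proxyBase property from within the shell:

```scala
// Clear the stale proxy base that breaks the UI's CSS/JS paths (assumed fix)
sys.props.update("spark.ui.proxyBase", "")
```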
Credits to the solution here.
Spark jars need to be shipped across the YARN cluster
When running Spark in cluster mode on top of a YARN cluster, the Spark .jar classes need to be shipped to the other nodes, which don't have Spark installed. It probably took me more tinkering to sort out than it should have, but in my case the solution was to upload them to a location inside HDFS and let the nodes fetch them from there when needed.
Creating the archive:
jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
Upload to HDFS
hdfs dfs -put spark-libs.jar /some/path/
Then, in spark-defaults.conf set
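With the archive uploaded to /some/path/ as above, the spark-defaults.conf entry would look like this (spark.yarn.archive is the standard property for pointing YARN at a jar archive):

```
spark.yarn.archive    hdfs:///some/path/spark-libs.jar
```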
Credits to the solution and explanation here.
This concludes our recap of some errors encountered while setting up Hadoop and Spark as a beginner. We'll give our cluster a spin, test it out and report on it in a future article. Thanks for reading and stay tuned!