Hadoop HDFS

2017-06-19. Category & Tags: Default Hadoop, HDFS

(Update: tested v2.7.2 on Ubuntu 18)

Install #

OBS: security warning! Note: change the core-site.xml and hdfs-site.xml content before running.
Note: change the HDUSER username before running.

# w/ java
curl https://raw.githubusercontent.com/SunnyBingoMe/install-hadoop/master/setup-hadoop | bash
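In line with the notes above, a more cautious sketch is to download the script first, review and edit it (the HDUSER username and the embedded core-site.xml / hdfs-site.xml content are assumed to be editable inside the script, per the notes above), and only then run it:

curl -o setup-hadoop https://raw.githubusercontent.com/SunnyBingoMe/install-hadoop/master/setup-hadoop
nano setup-hadoop   # change the HDUSER username and the embedded xml config here
bash setup-hadoop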

Env & Hd-Structure Config (Multi-Node Only) #

Make sure the master node's Hadoop user, e.g. $hduser, can access the worker nodes using passphrase-less SSH keys.
If a different username ($hduser) than the installer user ($user) is to run Hadoop, run setup_profile and setup_environment after su $hduser (bash will print some minor errors; no worries).
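For the passphrase-less SSH requirement, a minimal sketch (assuming the user is hduser and the worker hostnames from the /etc/hosts example below):

# on the master, as the hadoop user:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa   # empty passphrase
ssh-copy-id hduser@hd-node1
ssh-copy-id hduser@hd-node2
ssh-copy-id hduser@hd-node3
ssh hduser@hd-node1 hostname               # should print hd-node1 without asking for a password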

For multi-node usage, on each node, append /etc/hosts with:
E.g.:

. . .
192.168.1.0 hd-master
192.168.1.1 hd-node1
192.168.1.2 hd-node2
192.168.1.3 hd-node3

For multi-node usage, on master, config $HADOOP_HOME/etc/hadoop/slaves as:

hd-node1
hd-node2
hd-node3

(By default the slaves file contains a single line, localhost, i.e. the master considers localhost the only DataNode.)

Explanation (via Sunny's experience): running start-dfs.sh on the master causes the master to read slaves and run hadoop-daemon.sh start datanode on each slave via SSH. Each DataNode then relies on its own config: if the NameNode address (“fs.default.name” in core-site.xml) on a DataNode is wrong, that DataNode will not be able to connect to the NameNode. Also, stop-dfs.sh canNOT stop a DataNode that is not listed in slaves (and such a DataNode will keep trying to re-connect until max-retry). The slaves file is only read by the master when needed, so modifications made while HDFS is running take effect whenever the master next reads it, e.g. during start-dfs.sh / stop-dfs.sh.
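For reference, the per-node daemon commands that start-dfs.sh / stop-dfs.sh invoke via SSH (Hadoop 2.x script names, run as the hadoop user on the node in question) can also be run by hand, which is useful for a DataNode that is not, or not yet, listed in slaves:

# on a DataNode:
hadoop-daemon.sh start datanode
hadoop-daemon.sh stop datanode
# on the master:
hadoop-daemon.sh start namenode
hadoop-daemon.sh stop namenode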

Tip 1: [safe mode] If HDFS has been formatted & tested in single-node mode and is then changed to multi-node with some files lost in the process, it canNOT leave safe mode (missing blocks keep it there). The safe-mode status is shown in the first line of hdfs dfsadmin -report. The cluster is in safe mode when started, but leaves safe mode once the file checks pass, usually within a few seconds.
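To check or force-leave safe mode, the standard dfsadmin subcommands are:

hdfs dfsadmin -safemode get    # prints "Safe mode is ON/OFF"
hdfs dfsadmin -safemode leave  # force the NameNode out of safe mode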

Tip 2: [adding a node] does not require restarting HDFS / the NameNode (see “Add A New DataNode” below for how to add one).

Hadoop Config (Advanced Usage Only) #

(The curl-bash script already includes this; this part can be ignored when just starting to try things out.)

On each node, set NameNode Location by updating $HADOOP_HOME/etc/hadoop/core-site.xml.
E.g.:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>hadoop.tmp.dir</name>
		<value>/dev/shm/</value>
		<description>optional, temporary directories.</description>
	</property>
	<property>
		<name>fs.default.name</name>
		<value>hdfs://hd-master:9000</value>
		<description>optional, default 9000. IPC port for jar job submission.</description>
	</property>
</configuration>
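To double-check which NameNode address the daemons will actually use (fs.defaultFS is the current name of the deprecated fs.default.name key), hdfs getconf reads the effective configuration on the local node:

hdfs getconf -confKey fs.defaultFS   # expected: hdfs://hd-master:9000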

On each node, set path for HDFS, by updating $HADOOP_HOME/etc/hadoop/hdfs-site.xml.
E.g.:

<configuration>
    <!-- property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/hduser/data/nameNode</value>
			<description>optional.</description>
    </property>
    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/hduser/data/dataNode</value>
			<description>optional.</description>
    </property -->
    <property>
            <name>dfs.replication</name>
            <value>1</value>
			<description>optional; the HDFS default is 3, set to 1 here for a small/test cluster.</description>
    </property>
</configuration>
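Note that dfs.replication only sets the default for newly written files; existing files keep their replication factor unless it is changed explicitly. A quick check/adjust sketch (the /foo/bar paths are the test files created in the Test section below):

hdfs getconf -confKey dfs.replication      # configured default for new files
hdfs dfs -ls /foo/bar                      # the 2nd column of each file line is its replication factor
hdfs dfs -setrep -w 2 /foo/bar/file1.txt   # change an existing file's replication, wait until done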

Format FS #

On master:

hdfs namenode -format

Tip: formatting is always needed after a fresh installation and only the NameNode needs formatting.
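With the example core-site.xml above (hadoop.tmp.dir = /dev/shm/) and no explicit dfs.namenode.name.dir, the formatted metadata lands under ${hadoop.tmp.dir}/dfs/name (note: /dev/shm is tmpfs and is wiped on reboot, so that choice is for testing only). A quick sanity check after formatting:

ls /dev/shm/dfs/name/current/   # should list VERSION plus fsimage_* files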

Run #

On the master, run (from any directory; the script should already be on PATH):

start-dfs.sh
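If a daemon does not come up, the logs (written to $HADOOP_HOME/logs by default) are the first place to look; stop-dfs.sh is the matching shutdown command:

jps                                                    # expect NameNode + SecondaryNameNode on the master (plus DataNode in single-node mode)
tail -n 50 $HADOOP_HOME/logs/hadoop-*-namenode-*.log   # check for errors
stop-dfs.sh                                            # counterpart of start-dfs.sh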

Test #

(For either single or multi-node clusters.)

bash (local) #

# admin test:
hdfs dfsadmin -report
hdfs dfsadmin -help
jps

# folder io test:
hdfs dfs -ls / # test read
hdfs dfs -mkdir -p /foo/bar && hdfs dfs -ls /foo/bar && hdfs dfs -ls /  # test write

# file io test:
cat /etc/*release > file1.txt
date > file2.txt
hdfs dfs -put file1.txt file2.txt /foo/bar && rm file1.txt file2.txt && hdfs dfs -cat /foo/bar/file2.txt # write local to hdfs
ls *.txt; hdfs dfs -get /foo/bar/file1.txt; ls *.txt # read hdfs to local

# rm:
hdfs dfs -rm /foo/bar/file1.txt
hdfs dfs -rm -r /foo/bar/

py (py2, local & remote) #

pip install hdfs

Write:

import datetime
from hdfs import InsecureClient

t_datetime = datetime.datetime.now()
content = 'asdf\n ddddd' + str(t_datetime)
print(content)
client = InsecureClient('http://hd-master:50070')
with client.write('/foo/bar/file3.txt', overwrite=True) as pf_write:
    pf_write.write(content)

(Note: port 50070 serves both the Web UI and the WebHDFS REST API.)
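The same port therefore also answers plain WebHDFS REST calls, which is a quick way to check connectivity without the Python client:

curl -s "http://hd-master:50070/webhdfs/v1/foo/bar?op=LISTSTATUS"        # list a directory
curl -sL "http://hd-master:50070/webhdfs/v1/foo/bar/file3.txt?op=OPEN"   # read a file (follows the redirect to a DataNode)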

Read:

from hdfs import InsecureClient
client = InsecureClient('http://hd-master:50070')
with client.read('/foo/bar/file3.txt') as pf_read:
    content_read = pf_read.read()
    print(content_read)

OBS: by default, HDFS permission checking is enabled, so remote writes from an arbitrary user (e.g. via the Python client above) may be rejected.
To allow them by disabling permission checking:

sed -i "s/<configuration>/<configuration><property><name>dfs.permissions.enabled<\/name><value>false<\/value><\/property>/" /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml

WARN: for test only, insecure!
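The NameNode only reads dfs.permissions.enabled at startup, so restart HDFS on the master after the change (a minimal sketch):

stop-dfs.sh && start-dfs.sh
hdfs getconf -confKey dfs.permissions.enabled   # should print false now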

Web UI / GUI #

Default port 50070 on master (any interface IP will work).
http://hd-master:50070
Note 1: this port also serves the WebHDFS REST API.
Note 2: the port can be changed via dfs.namenode.http-address in hdfs-site.xml if needed.
See the screenshot in the PS section at the bottom.
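A hedged example of overriding the HTTP address/port, using the same sed-into-xml style as elsewhere in this post (dfs.namenode.http-address is assumed to be the relevant Hadoop 2.x property; replace the value as needed):

sed -i "s/<configuration>/<configuration><property><name>dfs.namenode.http-address<\/name><value>hd-master:50070<\/value><\/property>/" $HADOOP_HOME/etc/hadoop/hdfs-site.xml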

Change Single-Node To Multi-Node #

The curl-bash script is for single-node only.
To have a multi-node cluster:

  • The NameNode needs to listen on a public address (“fs.default.name” in core-site.xml on the master) instead of localhost (which requires /etc/hosts entries if hostnames rather than IPs are used)
  • The DataNode:s need to know the master’s address (“fs.default.name” in core-site.xml on the slaves) (which also requires /etc/hosts entries if hostnames rather than IPs are used)
  • To let the NameNode start the DataNode:s automatically when start-dfs.sh runs, the DataNode:s must be listed in the $HADOOP_HOME/etc/hadoop/slaves file.
  • All automatic start/stop:s are done via passphrase-less SSH.

To apply the first two points (the public NameNode address):

ansibleall 'ADDR="hd-master:9000" && \
HADOOP_HOME="/usr/local/hadoop/hadoop-2.7.2/" && \
sed -i "s/.*<value>hdfs:\/\/.*/\t<value>hdfs:\/\/${ADDR}<\/value>/" $HADOOP_HOME/etc/hadoop/core-site.xml'

(For non-ansible users: on each node, set ADDR and HADOOP_HOME and then run the sed command, as in the sketch below.)
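The equivalent per-node commands (the same sed as above, run manually on every node):

ADDR="hd-master:9000"
HADOOP_HOME="/usr/local/hadoop/hadoop-2.7.2/"
sed -i "s/.*<value>hdfs:\/\/.*/\t<value>hdfs:\/\/${ADDR}<\/value>/" $HADOOP_HOME/etc/hadoop/core-site.xml
grep -A 1 fs.default.name $HADOOP_HOME/etc/hadoop/core-site.xml   # verify the new value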

Add A New DataNode #

Summary: install the same way, use the same config, start the DataNode, and it is ready to use.
Steps (a consolidated sketch follows the list):

  1. Make sure the NameNode is listening on a public address.
  2. Install HDFS the same way, e.g. via the curl-bash script.
  3. Use the same config (e.g. just copy the folder $HADOOP_HOME/etc/hadoop/ to the new DataNode).
  4. On the new DataNode: hadoop-daemon.sh start datanode.
  5. Remember to add the new DataNode to $HADOOP_HOME/etc/hadoop/slaves on the master so that it can be stopped/started by the master.
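A consolidated sketch of steps 3-5, assuming a hypothetical new node hd-node4 that is already installed (step 2), resolvable via /etc/hosts, and reachable by passphrase-less SSH:

# on the master:
scp -r $HADOOP_HOME/etc/hadoop/ hduser@hd-node4:$HADOOP_HOME/etc/    # step 3: copy config (assumes the same HADOOP_HOME path on both nodes)
ssh hduser@hd-node4 '/usr/local/hadoop/hadoop-2.7.2/sbin/hadoop-daemon.sh start datanode'   # step 4
echo "hd-node4" >> $HADOOP_HOME/etc/hadoop/slaves                    # step 5: let the master start/stop it
hdfs dfsadmin -report                                                # the new DataNode should appear in the report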

Spark Usage #

Make sure all Spark nodes have permission to access HDFS.

Detailed Config & Performance Tuning #

HA #

Ref #

Main ref: linode
See more in doc

PS #

Hadoop Web UI (50070):
(screenshot: Hadoop NameNode web UI)