Hadoop HDFS
(updated: tested Hadoop 3.1.3 on Ubuntu 18.04)
Install #
sudo mkdir -p /usr/lib/jvm
sudo tar -zxvf ./jdk-8u162-x64.tar.gz -C /usr/lib/jvm
vim ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_162
export JRE_HOME=${JAVA_HOME}/jre
export CLASSPATH=.:${JAVA_HOME}/lib:${JRE_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
# new terminal
java -version
echo $JAVA_HOME
sudo tar -zxf hadoop-3.1.3.tar.gz -C /usr/local
sudo mv /usr/local/hadoop-3.1.3/ /usr/local/hadoop
sudo chown -R hadoop:hadoop /usr/local/hadoop
/usr/local/hadoop/bin/hadoop version
Config (pseudo-distributed) #
cd /usr/local/hadoop
Note: once the core-site.xml settings below are added, Hadoop can no longer run in stand-alone (single-machine) mode(?).
gedit ./etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
gedit ./etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/usr/local/hadoop/tmp/dfs/data</value>
    </property>
</configuration>
Q: "Error: JAVA_HOME is not set and could not be found." (even with the .bashrc setup above)
A: set JAVA_HOME explicitly in hadoop-env.sh:
vim /usr/local/hadoop/etc/hadoop/hadoop-env.sh
# modify / add:
export JAVA_HOME=/usr/lib/jvm/default-java   # or the JDK path from the install above, e.g. /usr/lib/jvm/jdk1.8.0_162
Format storage (initialize the NameNode):
cd /usr/local/hadoop
bin/hdfs namenode -format
Start Hadoop (with HDFS) #
cd /usr/local/hadoop
./sbin/start-dfs.sh
jps # check status
You should see 3 Java processes (excluding jps itself): NameNode, DataNode, and SecondaryNameNode.
Web UI: http://localhost:9870
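A quick sanity check (a minimal sketch; it only greps the jps output for the three daemons expected in pseudo-distributed mode):
for d in NameNode DataNode SecondaryNameNode; do
  jps | grep -q "$d" && echo "OK: $d" || echo "MISSING: $d"
done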
PS: if the DataNode keeps failing to start, you can stop HDFS, delete all data, and reformat the NameNode:
cd /usr/local/hadoop
./sbin/stop-dfs.sh
rm -r /usr/local/hadoop/tmp
./bin/hdfs namenode -format
./sbin/start-dfs.sh
Using HDFS #
Create a home directory for the current user (optional; analogous to /home/xxx on Ubuntu):
./bin/hdfs dfs -mkdir -p /user/hadoop
-p auto-creates the directory (and its parents) if it does not exist.
This directory is the default prefix for HDFS paths that do not start with /.
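For example, once the input directory is created below, these two commands list the same HDFS directory (the relative path is resolved against /user/hadoop):
./bin/hdfs dfs -ls input
./bin/hdfs dfs -ls /user/hadoop/input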
Copy files into HDFS with put:
cd /usr/local/hadoop
./bin/hdfs dfs -mkdir input
./bin/hdfs dfs -ls input
./bin/hdfs dfs -put ./etc/hadoop/*.xml input # cp: FS > HDFS
./bin/hdfs dfs -ls input
Test a simple MapReduce job (grep example: count words matching a regex):
cd /usr/local/hadoop
# ./bin/hdfs dfs -rm -r output
./bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep input output 'dfs[a-z.]+' # find words containing "dfs" and count their frequency
Tip: the output directory is created automatically.
Tip: if you get "Output directory hdfs://localhost:9000/user/hadoop/output already exists", delete the existing output directory first; unlike a regular Linux filesystem, HDFS jobs do not overwrite existing output by default.
View the results in HDFS:
./bin/hdfs dfs -ls output
Copy the HDFS output to the local (Ubuntu) filesystem and view it:
# rm -r ./output # if exists
./bin/hdfs dfs -get output output # cp: HDFS > Ubuntu FS
cat ./output/*
# stop HDFS
/usr/local/hadoop/sbin/stop-dfs.sh
Python (Anaconda) + HDFS #
install anaconda (optional) #
https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/?C=M&O=D
bash ~/anaconda.sh -b -p $HOME/anaconda && \
eval "$(~/anaconda/bin/conda shell.bash hook)" && \
$HOME/anaconda/bin/conda init
ln -s $HOME/anaconda/ $HOME/anaconda3
install hdfs libs #
# python==3.10
pip install pandas pyarrow
conda install libhdfs3
conda install libgcrypt
conda install libprotobuf
conda update libhdfs3
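A quick sanity check that the Python-side packages above import cleanly (libhdfs3 itself is a native library, so only pandas/pyarrow are checked here):
python -c "import pandas, pyarrow; print(pandas.__version__, pyarrow.__version__)"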
Python WordCount Demo #
Input file (words.txt):
hello world hello
world is beautiful
hello python world
Expected output file (wordCount.txt/part-00000):
beautiful 1
hello 3
is 1
python 1
world 3
Mapper script (mapper.py):
#!/usr/bin/env python3
import sys
import re

# read from standard input (Hadoop Streaming feeds the HDFS file in line by line)
for line in sys.stdin:
    # strip leading/trailing whitespace, lowercase, and remove punctuation
    line = line.strip().lower()
    line = re.sub(r'[^\w\s]', '', line)
    # split into words
    words = line.split()
    # emit (word, 1) key-value pairs
    for word in words:
        if word:  # make sure the word is non-empty
            print(f"{word}\t1")
Reducer script (reducer.py):
#!/usr/bin/env python3
import sys
from collections import defaultdict

# dictionary for accumulating word counts
word_counts = defaultdict(int)

# read from standard input (the Mapper's output)
for line in sys.stdin:
    # parse the input line, formatted as "word\tcount"
    try:
        word, count = line.strip().split('\t', 1)
        word_counts[word] += int(count)
    except ValueError:
        continue  # skip malformed lines

# write the final result to standard output
for word, count in sorted(word_counts.items()):
    print(f"{word}\t{count}")
Driver script (count.sh):
chmod +x mapper.py reducer.py
/usr/local/hadoop/bin/hdfs dfs -rm -r ./wordCount.txt # Hadoop does not allow overwriting an existing output directory
/usr/local/hadoop/bin/hdfs dfs -put words.txt /user/hadoop/words.txt
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.3.jar \
-input ./words.txt \
-output ./wordCount.txt \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py" \
-file mapper.py \
-file reducer.py
/usr/local/hadoop/bin/hdfs dfs -ls ./wordCount.txt
/usr/local/hadoop/bin/hdfs dfs -cat ./wordCount.txt/*
Tips:
- Hadoop Streaming lets Python scripts act as the Mapper and Reducer, exchanging data with Hadoop via standard input/output. The -file option ships mapper.py and reducer.py to every node in the cluster.
- Hadoop always creates an output directory; the actual output text file is at ./wordCount.txt/part-00000.
- Mapper: reads the HDFS file line by line (via sys.stdin); lowercases each line, removes punctuation, and splits it into words; emits one word\t1 key-value pair per line.
- Reducer: receives the Mapper's output in word\tcount format; accumulates the count of each word with a defaultdict; prints the final result in lexicographic order as word\tcount.
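To pull the result down as a single local file, one option is -getmerge, which concatenates all part-* files of the output directory (the local file name wordCount_local.txt here is just an example):
/usr/local/hadoop/bin/hdfs dfs -getmerge ./wordCount.txt ./wordCount_local.txt
cat ./wordCount_local.txt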
======================================================= #
(below updated: tested Hadoop 2.7.2 on Ubuntu 18) #
Install #
OBS: security warning! Before running, change the core-site.xml and hdfs-site.xml content, and change the HDUSER username.
# w/ java
curl https://raw.githubusercontent.com/SunnyBingoMe/install-hadoop/master/setup-hadoop | bash
Env & Hd-Structure Config (Multi-Node Only) #
Make sure the master node and the hadoop user (e.g. $hduser) can access the worker nodes using passphrase-less ssh keys. If a different username ($hduser) rather than the installer user ($user) is to run hadoop, run setup_profile and setup_environment after su $hduser (bash will print some minor errors; no worries).
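A minimal check of the passphrase-less ssh requirement (assuming the example hostnames used below; run it on the master as the hadoop user):
for h in hd-node1 hd-node2 hd-node3; do
  ssh -o BatchMode=yes "$h" hostname && echo "ssh to $h: OK" || echo "ssh to $h: FAILED"
done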
For multi-node usage, on each node, append /etc/hosts with entries for all nodes, e.g.:
. . .
192.168.1.0 hd-master
192.168.1.1 hd-node1
192.168.1.2 hd-node2
192.168.1.3 hd-node3
For multi-node usage, on the master, config $HADOOP_HOME/etc/hadoop/slaves as:
hd-node1
hd-node2
hd-node3
(By default the slaves file has a single line, localhost, i.e. the master treats localhost as the only DataNode.)
Explanation (from Sunny's experience): running start-dfs.sh on the master causes the master to read slaves and run hadoop-daemon.sh start datanode on each slave via ssh. Each DataNode then relies on its own config: if the NameNode address ("fs.default.name" in core-site.xml) on a DataNode is wrong, that DataNode will not be able to connect to the NameNode. Also, stop-dfs.sh canNOT stop a DataNode that is not listed in slaves (and such a DataNode will keep trying to re-connect until max-retry). The slaves file is only read by the master when needed, so modifications made while HDFS is running take effect the next time the master reads it, e.g. when start-dfs.sh / stop-dfs.sh is run.
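For reference, the per-node daemon control that start-dfs.sh / stop-dfs.sh performs over ssh can also be run by hand on a worker (Hadoop 2.x script names; $HADOOP_HOME as installed above):
# on a worker node, as the hadoop user
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode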
Tip 1: [safe mode] If HDFS was formatted and tested in single-node mode and is then switched to multi-node so that some files/blocks are missing, it canNOT leave safe mode. The safe-mode status is shown in the first line of hdfs dfsadmin -report. The cluster starts in safe mode, but leaves it a few seconds after the file checks pass.
Tip 2: [adding node] Adding a DataNode does not require restarting HDFS/the NameNode (see "Add A New DataNode" below for how).
Hadoop Config (Advanced Usage Only) #
(The curl-bash script already includes this; it can be skipped when just getting started.)
On each node, set the NameNode location by updating $HADOOP_HOME/etc/hadoop/core-site.xml.
E.g.:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/dev/shm/</value>
        <description>optional, temporary directories.</description>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hd-master:9000</value>
        <description>optional, default 9000. IPC port for jar job submission.</description>
    </property>
</configuration>
On each node, set the paths for HDFS by updating $HADOOP_HOME/etc/hadoop/hdfs-site.xml.
E.g.:
<configuration>
    <!-- property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hduser/data/nameNode</value>
        <description>optional.</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hduser/data/dataNode</value>
        <description>optional.</description>
    </property -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>optional; the HDFS default is 3.</description>
    </property>
</configuration>
Format FS #
On master:
hdfs namenode -format
WARN: if the NameNode is reformatted, the data on each DataNode must be removed manually:
find the data dir according to "dfs.datanode.data.dir" in hdfs-site.xml (e.g. file:/usr/local/hadoop/tmp/dfs/data), then remove everything within the "data" dir (but not the "data" dir itself): rm -rf /usr/local/hadoop/tmp/dfs/data/*
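On a multi-node cluster this cleanup has to happen on every DataNode; a sketch, assuming the example hostnames above and the single-node data dir:
for h in hd-node1 hd-node2 hd-node3; do
  ssh "$h" 'rm -rf /usr/local/hadoop/tmp/dfs/data/*'
done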
Tip: formatting is always needed after a fresh installation and only the NameNode needs formatting.
Run #
On the master, run (from anywhere; it should already be on PATH):
start-dfs.sh
Test #
(For either single or multi-node clusters.)
bash (local) #
# admin test:
hdfs dfsadmin -report
hdfs dfsadmin -help
jps
# folder io test:
hdfs dfs -ls / # test read
hdfs dfs -mkdir -p /foo/bar && hdfs dfs -ls /foo/bar && hdfs dfs -ls / # test write
# file io test:
cat /etc/*release > file1.txt
date > file2.txt
hdfs dfs -put file1.txt file2.txt /foo/bar && rm file1.txt file2.txt && hdfs dfs -cat /foo/bar/file2.txt # write local to hdfs
ls *.txt; hdfs dfs -get /foo/bar/file1.txt; ls *.txt # read hdfs to local
# rm:
hdfs dfs -rm /foo/bar/file1.txt
hdfs dfs -rm -r /foo/bar/
py (py2, local & remote) #
pip install hdfs
Write:
import datetime
from hdfs import InsecureClient
t_datetime = datetime.datetime.now()
content = 'asdf\n ddddd' + str(t_datetime)
print(content)
client = InsecureClient('http://hd-master:50070')
with client.write('/foo/bar/file3.txt', overwrite=True) as pf_write:
    pf_write.write(content)
(Note: port 50070 serves both the Web UI and the RESTful API (WebHDFS).)
Read:
from hdfs import InsecureClient
client = InsecureClient('http://hd-master:50070')
with client.read('/foo/bar/file3.txt') as pf_read:
    content_read = pf_read.read()
print(content_read)
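The same files can also be reached without the Python client, since port 50070 exposes the WebHDFS REST API; a quick check from any machine that can resolve hd-master:
curl -s "http://hd-master:50070/webhdfs/v1/foo/bar?op=LISTSTATUS"        # list the directory
curl -sL "http://hd-master:50070/webhdfs/v1/foo/bar/file3.txt?op=OPEN"   # read the file (follows the redirect to a DataNode)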
OBS: by default, remote users do not have write permission (HDFS permission checking is enabled).
To disable permission checking:
sed -i "s/<configuration>/<configuration><property><name>dfs.permissions.enabled<\/name><value>false<\/value><\/property>/" /usr/local/hadoop/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
WARN: for test only, insecure!
Web UI / GUI #
Default port 50070 on master (any interface IP will work).
http://hd-master:50070
Note 1: this port is also used for rest-api.
Note 2: the port can be changed via dfs.namenode.http-address in hdfs-site.xml if needed.
See the screenshot in the PS section at the bottom.
Change Single-Node To Multi-Node #
The curl-bash script is for single-node only.
To have a multi-node cluster:
- The NameNode needs to listen on a public address instead of localhost ("fs.default.name" in core-site.xml on the master); this requires /etc/hosts entries if hostnames rather than IPs are used.
- The DataNodes need to know the master's address ("fs.default.name" in core-site.xml on the slaves); this also requires /etc/hosts entries if hostnames rather than IPs are used.
- To let the NameNode start the DataNodes automatically on start-dfs.sh, the DataNodes must be listed in the $HADOOP_HOME/etc/hadoop/slaves file.
- All automatic starts/stops are done via passphrase-less ssh.
To change the first 2 points wrt the public address:
ansibleall 'ADDR="hd-master:9000" && \
HADOOP_HOME="/usr/local/hadoop/hadoop-2.7.2/" && \
sed -i "s/.*<value>hdfs:\/\/.*/\t<value>hdfs:\/\/${ADDR}<\/value>/" $HADOOP_HOME/etc/hadoop/core-site.xml'
(For non-ansible users: on each node, set ADDR and HADOOP_HOME and then run the sed command.)
Add A New DataNode #
Summary: install the same software, use the same config, start the DataNode, and it is ready to use.
Steps (a combined sketch follows this list):
- Make sure the NameNode is listening on a public address.
- Install the same HDFS version, e.g. via the curl-bash script.
- Use the same config (e.g. just copy the folder $HADOOP_HOME/etc/hadoop/ to the new DataNode).
- On the new DataNode: hadoop-daemon.sh start datanode.
- Remember to add the new DataNode to $HADOOP_HOME/etc/hadoop/slaves on the master so that it can be stopped/started by the master.
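A sketch of the whole procedure, run from the master (hd-node4 is a hypothetical new worker; paths assume the curl-bash layout used above):
NEW=hd-node4                                                     # hypothetical new DataNode (must resolve via /etc/hosts)
HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.2
scp -r $HADOOP_HOME/etc/hadoop/ $NEW:$HADOOP_HOME/etc/           # copy the config
ssh $NEW "$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode"     # start the DataNode on the new node
echo $NEW >> $HADOOP_HOME/etc/hadoop/slaves                      # let the master stop/start it later
hdfs dfsadmin -report | head                                     # verify the new node has joined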
Spark Usage #
Make sure all Spark nodes have permission to access HDFS.
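A minimal smoke test, assuming spark-shell is installed on the node and the NameNode address used above; it counts the lines of the file written in the Python test:
echo 'println(sc.textFile("hdfs://hd-master:9000/foo/bar/file3.txt").count())' | spark-shell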
Detailed Config & Performance Tuning #
HA #
Ref #
xmu
Main ref: linode
See more in doc
PS #
Hadoop Web UI (50070):