Hadoop Distributed File System (HDFS) Tutorial
HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored in a redundant fashion across multiple machines to ensure their durability in the face of failures and their high availability to highly parallel applications.
It is useful to contrast HDFS with NFS here. One of the primary advantages of NFS is its transparency: clients do not need to be particularly aware that they are working on files stored remotely, and the existing standard library methods like open(), close(), fread(), etc. will work on files hosted over NFS. HDFS, by contrast, is not mounted like an ordinary file system; clients access it through the Hadoop API or the hadoop fs command-line tools.
The HDFS configuration can be found in the conf/ folder of your Hadoop installation. The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop; this file is considered read-only. You override this configuration by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster.
Configuration settings are a set of key-value pairs of the format:
<property>
  <name>property-name</name>
  <value>property-value</value>
</property>
The following settings are necessary to configure HDFS:
fs.default.name : This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster, e.g., hdfs://thys.michels.com:9000
dfs.data.dir : This is the path on the local file system in which the DataNode instance should store its data, e.g., /home/username/hdfs/data
dfs.name.dir : This is the path on the local file system of the NameNode instance where the NameNode metadata is stored, e.g., /home/username/hdfs/name
The resulting conf/hadoop-site.xml will look as follows:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>
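A malformed conf/hadoop-site.xml will cause errors at startup, so it can be worth sanity-checking that the file is well-formed XML before bringing the cluster up. A minimal sketch (the file is written to the current directory for illustration; Python's standard-library XML parser is used since xmllint may not be installed):

```shell
# Sketch: write an example hadoop-site.xml and verify it is well-formed XML.
# The property value below is the example value from this tutorial.
cat > hadoop-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
</configuration>
EOF

# A parse error here means a broken config; silence means the XML is valid.
python3 -c "import xml.etree.ElementTree as ET; ET.parse('hadoop-site.xml')" \
  && echo "hadoop-site.xml is well-formed"
```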
The master node needs to know the addresses of all the machines to use as DataNodes; the startup scripts depend on this. Also in the conf/ directory, edit the file slaves so that it contains a list of fully-qualified hostnames for the slave instances, one host per line. On a multi-node setup, the master node (e.g., localhost) is not usually present in this file.
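The startup scripts essentially iterate over the slaves file, connecting to each listed host over ssh to start a DataNode. A rough sketch of that loop (hostnames are illustrative, and echo stands in for the actual ssh invocation so the sketch is runnable anywhere):

```shell
# Sketch: a slaves file with one fully-qualified DataNode hostname per line.
cat > slaves <<'EOF'
datanode1.your.server.name.com
datanode2.your.server.name.com
EOF

# Loop over each slave host; the real startup scripts would run the
# DataNode start command on $host via ssh rather than just echoing.
while read -r host; do
  echo "would start DataNode on $host"
done < slaves
```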
The next step is to create the directories in which HDFS will store its data. On each DataNode:
user@machine$ mkdir -p $HOME/hdfs/data
And on the NameNode:
user@namenode$ mkdir -p $HOME/hdfs/name
These folders need read and write access (e.g., via chmod +rw) for every user that will run Hadoop on this node. Best practice is to create a dedicated hadoop user and group. It is not recommended that you run Hadoop as root.
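Putting the last two steps together, a minimal sketch of preparing the local directories on a node (the paths follow the dfs.data.dir and dfs.name.dir values above; the chown to a dedicated hadoop user requires root, so it is left commented out):

```shell
# Create the local storage directories referenced by dfs.data.dir
# and dfs.name.dir in conf/hadoop-site.xml.
mkdir -p "$HOME/hdfs/data" "$HOME/hdfs/name"

# Grant read/write access to the owner and group that will run Hadoop.
chmod 770 "$HOME/hdfs/data" "$HOME/hdfs/name"

# Best practice: have a dedicated hadoop user/group own these directories
# (requires root, so shown commented out here):
# sudo chown -R hadoop:hadoop "$HOME/hdfs"
```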