Hadoop Distributed File System (HDFS) Tutorial

HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and provide high-throughput access to this information. Files are stored redundantly across multiple machines to ensure durability in the face of failures and high availability to highly parallel applications.

One of the primary advantages of HDFS is its relative transparency. Clients do not need to be particularly aware of where file blocks are physically stored. Through Hadoop's FileSystem API, familiar operations such as open(), close(), and read() work on files hosted in HDFS much as their standard-library counterparts do on local files.
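For illustration, here is a minimal sketch of a client reading an HDFS file through Hadoop's Java FileSystem API (the NameNode URI, class name, and file path below are placeholder assumptions, not values from this tutorial):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the cluster's NameNode; this mirrors the
    // fs.default.name setting described below.
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://your.server.name.com:9000"); // placeholder host

    FileSystem fs = FileSystem.get(conf);

    // fs.open() behaves much like a local open(); the NameNode resolves
    // which DataNodes hold the file's blocks behind the scenes.
    FSDataInputStream in = fs.open(new Path("/user/username/example.txt")); // placeholder path
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}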

Configuring HDFS

The HDFS configuration can be found in the conf/ folder of your Hadoop installation. The conf/hadoop-defaults.xml file contains default values for every parameter in Hadoop; treat this file as read-only. You override this configuration by setting new values in conf/hadoop-site.xml. This file should be replicated consistently across all machines in the cluster.

Configuration settings are a set of key-value pairs of the format:

  <property>
    <name>property-name</name>
    <value>property-value</value>
  </property>

The following settings are necessary to configure HDFS:
fs.default.name : This is the URI (protocol specifier, hostname, and port) that describes the NameNode for the cluster, e.g., hdfs://thys.michels.com:9000
dfs.data.dir : This is the path on the local file system in which the DataNode instance should store its data, e.g., /home/username/hdfs/data
dfs.name.dir : This is the path on the local file system of the NameNode instance where the NameNode metadata is stored, e.g., /home/username/hdfs/name

The resulting conf/hadoop-site.xml looks as follows:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://your.server.name.com:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/username/hdfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/username/hdfs/name</value>
  </property>
</configuration>

The master node needs to know the addresses of all the machines to use as DataNodes; the startup scripts depend on this. Also in the conf/ directory, edit the file slaves so that it contains a list of fully-qualified hostnames for the slave instances, one host per line. On a multi-node setup, the master node (e.g., localhost) is not usually present in this file.
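For example, on a cluster with three DataNodes, conf/slaves might contain the following (the hostnames here are placeholders):

slave01.your.server.name.com
slave02.your.server.name.com
slave03.your.server.name.com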

The next step is to create the DataNode and NameNode storage directories on the respective machines:

user@machine$ mkdir -p $HOME/hdfs/data

user@namenode$ mkdir -p $HOME/hdfs/name

These folders need read and write access (chmod +rw) for every user that will use this node. Best practice is to create a dedicated hadoop user and group. It is not recommended that you run Hadoop as root.
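As a sketch, assuming you create such a hadoop user and group, the ownership and permissions on the HDFS directories could be set as follows (adapt the paths and group membership to your installation):

user@machine$ sudo groupadd hadoop
user@machine$ sudo useradd -m -g hadoop hadoop
user@machine$ sudo chown -R hadoop:hadoop $HOME/hdfs
user@machine$ sudo chmod -R ug+rw $HOME/hdfs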
