Hadoop cluster setup using Apache Ambari
Apache Ambari:
Apache Ambari is an open source tool from the Apache family that is used for provisioning, building, and monitoring Apache Hadoop clusters.
Apache Hadoop:
Apache Hadoop is an open source framework from the Apache family that is used to analyze very large volumes of data. It is a massively parallel processing analytics tool. An Apache Hadoop cluster is a commodity cluster.
Apache Hadoop is a framework in which you can store large volumes of data and run your analysis on it. Many other tools come under the Hadoop ecosystem, and you can use them for analytics and for storing data. We will discuss these tools later (refer to the topic Happiestminds).
Cluster:
A group of machines connected through a network is called a cluster. The machines communicate with each other through a switch. A switch has at least 24 ports, and most companies use 48-port switches in their clusters.
A 48-port, 2 GB network switch costs Rs. 15,000; this is where you have to make your choices for the performance of the cluster. Many other factors affect cluster performance, and we will discuss them later.
Rack:
A rack holds a group of machines in a single place. Building a Hadoop cluster normally needs multiple racks. We are not using racks for now, since this is just a five-node (small) cluster.
Happiestminds:
This is the name of the cluster we will discuss now. Happiestminds is a cluster built using Apache Ambari. It has five nodes: one acts as the master node and the other four act as slave nodes.
Happiestminds has most of the Hadoop ecosystem tools in it. The list of tools is as follows.
Service | Version | Description
HDFS | 2.0.6 | Apache Hadoop Distributed File System.
YARN + MAPREDUCE | 2.0.6 | Apache Hadoop NextGen MapReduce (YARN).
Nagios | 3.5.0 | Nagios monitoring and alerting system.
Ganglia | 3.5.0 | Ganglia metrics collection system.
Hive | 0.12.0 | Data warehouse system for ad-hoc queries & analysis of large datasets & table & storage management services.
HBase | 0.96.0 | Non-relational distributed database & centralized service for configuration management & synchronization.
Pig | 0.12.0 | Scripting platform for analyzing large datasets.
Sqoop | 1.4.4 | Tool for transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
Oozie | 4.0.0 | System for workflow coordination and execution of Apache Hadoop jobs.
Zookeeper | 3.4.5 | Centralized service which provides highly reliable distributed coordination.
Table 1
These are the Hadoop ecosystem tools that are installed when we build the cluster, each with a short description.
Hadoop Services in the Cluster:
Hadoop runs five services in the cluster:
Master Node | Hadoop NameNode, Resource Manager, Secondary NameNode.
Slave Node | Hadoop DataNode, Node Manager.
Table 2
Master Node:
In a Hadoop cluster, the master node plays the key role because the NameNode runs on it. The master takes care of most things in the cluster, and it manages the data stored on the slave nodes.
Slave Node:
There can be “n” slave nodes in the cluster, depending on your resources and requirements. The data is stored on the DataNodes.
I will not discuss all the components here, just the cluster setup.
Installing Apache Ambari
Agenda:
To build a Hadoop cluster with Ambari, and to monitor and provision the cluster.
Prerequisites:
You need to check for existing installations. Existing installations of any of the tools listed in Table 1 may cause problems.
Set up passwordless SSH; this gives the master node access to all the slave nodes.
The steps involved in setting up passwordless SSH are as follows:
- ssh-keygen
- cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
- chmod 700 ~/.ssh
- chmod 600 ~/.ssh/authorized_keys
- ssh-copy-id -i ~/.ssh/id_rsa.pub root@ipaddress (slave node’s)
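With four slave nodes, the last step has to be repeated once per node. Here is a minimal sketch of that loop. The hostnames in SLAVES are placeholders (not the real Happiestminds hosts) and distribute_key is a hypothetical helper name. By default it only prints the commands it would run.

```shell
# Sketch: copy the master's public key to every slave node.
# SLAVES holds placeholder hostnames; replace them with your own.
SLAVES="slave1.example.com slave2.example.com slave3.example.com slave4.example.com"

# Print (DRY_RUN=1, the default) or run ssh-copy-id for each slave.
# $1 is the public key file; it defaults to ~/.ssh/id_rsa.pub.
distribute_key() {
    local key="${1:-$HOME/.ssh/id_rsa.pub}"
    local host
    for host in $SLAVES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh-copy-id -i $key root@$host"
        else
            ssh-copy-id -i "$key" "root@$host"
        fi
    done
}
```

Calling distribute_key on its own shows the commands; DRY_RUN=0 distribute_key actually copies the key and asks for each slave's root password.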
You have to add the hostnames of all the slave nodes to the /etc/hosts file.
NOTE: Each hostname should be a fully qualified domain name (FQDN).
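Short hostnames are easy to add by mistake, so it can help to build the /etc/hosts entries with a small check. The sketch below is only an illustration: is_fqdn and hosts_line are hypothetical helper names, the IP address and hostname in the usage line are placeholders, and the check only requires at least two dot-separated labels.

```shell
# Return 0 if the name has at least two dot-separated labels
# made of letters, digits, and inner hyphens.
is_fqdn() {
    printf '%s\n' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$'
}

# Print one /etc/hosts line: IP, FQDN, then the short alias.
# Fails with a message on stderr if the name is not fully qualified.
hosts_line() {
    local ip="$1"
    local fqdn="$2"
    if ! is_fqdn "$fqdn"; then
        echo "not a fully qualified name: $fqdn" >&2
        return 1
    fi
    printf '%s %s %s\n' "$ip" "$fqdn" "${fqdn%%.*}"
}
```

For example, hosts_line 192.168.1.11 slave1.example.com >> /etc/hosts appends one entry on the master.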
You have to edit the network config file /etc/sysconfig/network.
- Append the following: NETWORKING_IPV6=no
You have to turn off iptables for now.
- chkconfig iptables off
- service iptables stop (chkconfig only changes the boot-time setting)
You have to disable SELinux.
- setenforce 0
Check the PackageKit plugin config /etc/yum/pluginconf.d/refresh-packagekit.conf and make the following change.
- enabled=0
Make sure umask is set to 022.
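Before moving on, the file settings and the umask can be verified on each node. This is a rough sketch; has_setting and preflight are hypothetical helper names, and the script does not check SSH, iptables, or SELinux.

```shell
# Return 0 if file $1 contains the exact line KEY=VALUE ($2=$3).
has_setting() {
    grep -qxF "$2=$3" "$1" 2>/dev/null
}

# Report the file-based prerequisites and the umask as OK or FAIL.
preflight() {
    has_setting /etc/sysconfig/network NETWORKING_IPV6 no \
        && echo "OK   IPv6 disabled" || echo "FAIL IPv6 disabled"
    has_setting /etc/yum/pluginconf.d/refresh-packagekit.conf enabled 0 \
        && echo "OK   PackageKit refresh off" || echo "FAIL PackageKit refresh off"
    [ "$(umask)" = "0022" ] \
        && echo "OK   umask 022" || echo "FAIL umask 022"
}
```

Running preflight on a node prints one line per check, so any FAIL line points at the setting that still needs to be changed.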
Installation steps:
We downloaded all the rpms onto one local machine and created our own repository.
The steps involved in creating a local repository are as follows:
- yum install createrepo
- cd /var/www/html
- mkdir HDP
- move all the downloaded rpms
- mv rpms /var/www/html/HDP
- createrepo -dv /var/www/html/HDP
- repodata will be created
- yum clean all
- yum repolist (should list the HDP repository)
NOTE: In /etc/yum.repos.d, the repository's .repo file must point to the path of the local repository, that is, the directory in which createrepo generated repodata/repomd.xml.
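The repository definition itself is a small text file. The sketch below prints one. The repository id HDP-local, the host name master.example.com, gpgcheck=0, and the helper name hdp_repo are all assumptions for illustration. The default URL also assumes a web server is serving /var/www/html, and a file:// URL can be passed instead when yum runs on the same machine.

```shell
# Print a yum .repo definition for the local HDP repository.
# $1 is the base URL; the default host name is a placeholder.
hdp_repo() {
    local baseurl="${1:-http://master.example.com/HDP}"
    cat <<EOF
[HDP-local]
name=Local HDP repository
baseurl=$baseurl
enabled=1
gpgcheck=0
EOF
}
```

For example, hdp_repo file:///var/www/html/HDP > /etc/yum.repos.d/HDP.repo writes the file on the repository machine; after that, yum clean all and yum repolist should show the new repository.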
Install Ambari-server
- yum install ambari-server
ambari-server setup is the command that sets up the server; it has many options, for example:
- ambari-server setup -s -j /path/of/the/jdk
Ambari-server has now been installed successfully; configuring it is the next part. Start the server before logging in:
- ambari-server start
Log into Apache Ambari
http://{hostname}:8080
Welcome:
Give the cluster name
Select Stack:
You have to select the stack if you are installing without a local repository, but in our case we are using the local repository.
Confirm Hosts:
Ambari-Agent:
The Ambari-Agent comes into the picture when a host is registered. It is the key component in the communication between the master and the slaves.
Choose services:
You are free to choose which services to install on your system.
Assign Masters:
You can assign services to the master node. If you are registering a slave, you can assign services to the slave as well.
Assign Slaves and Clients:
Assign the services to
the slave node.
Customize Services:
Set the required
configurations for the services.
Review:
This step gives you a brief summary of the installation.
Install Start and Test:
Summary:
Services
Dashboard:
Heatmaps:
Hosts:
Add Host:
If you want to add another node to the cluster, click the Add Host button and follow the steps in the wizard.
NOTES:
If you restart your cluster, you need to start the ambari-agent on the slave machines manually.
Check the permissions of the SSH key and that ssh-copy-id has been run.
Add the slave hostnames to the hosts file of the master.
Check for network issues (make sure there are no loose connections).
Memory usage plays an important role (make sure there is no extra burden on the CPU).
Check the HDP.repo file and make sure that it refers to the local repository that you have created.
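For the first note, the agents can be started from the master over the passwordless SSH set up earlier. Here is a minimal sketch, again with placeholder hostnames in SLAVES and a hypothetical helper name start_agents. It prints the commands unless DRY_RUN=0 is set.

```shell
# Sketch: start ambari-agent on every slave after a cluster restart.
# SLAVES holds placeholder hostnames; replace them with your own.
SLAVES="slave1.example.com slave2.example.com slave3.example.com slave4.example.com"

# Print (DRY_RUN=1, the default) or run the start command on each slave.
start_agents() {
    local host
    for host in $SLAVES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$host ambari-agent start"
        else
            ssh "root@$host" ambari-agent start
        fi
    done
}
```

Running DRY_RUN=0 start_agents on the master starts the agent on each slave in turn.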