According to popular articles, Hadoop uses the concept of parallelism to upload split data and solve the Velocity problem.

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop’s open source distributed processing software on low-cost commodity computers.

Prasantmahato
3 min read · Nov 8, 2020

To test this statement, I used the tcpdump command throughout the practical.

What did I plan to do?

Step 1

Create a Hadoop cluster with 2 data nodes, configure everything, and make it ready for use.

Hadoop Cluster with 2 Data nodes

Step 2

Upload a file from the client side.

Uploading a File

At the same time, both data nodes should be running the tcpdump command so that the packets can be read later.
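The capture on each data node could be started with something like the sketch below. It is not the exact command used in this practical: the interface name `eth0` and port `50010` (the default DataNode data-transfer port in Hadoop 1.x and 2.x; Hadoop 3.x uses 9866) are assumptions, and `build_tcpdump_cmd` is a hypothetical helper name.

```python
import shlex


def build_tcpdump_cmd(interface="eth0", port=50010):
    """Return the argument list for a tcpdump capture on one data node.

    Assumptions (not from the original practical): the interface name
    and the port. -i selects the interface, and -n disables name
    resolution so that raw IP addresses appear in the output.
    """
    return ["tcpdump", "-i", interface, "-n", "tcp", "port", str(port)]


cmd = build_tcpdump_cmd()
print(shlex.join(cmd))  # tcpdump -i eth0 -n tcp port 50010
```

To actually capture packets, the list can be passed to `subprocess.run(cmd)`; tcpdump usually needs root privileges for that.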

Now, I uploaded a file from the client side and received packets on both data nodes.

On the first data node, I received packets.

Packets in First DataNode

On the second data node, I received packets.

Packets in second Data Node

I started reading these packets with the help of the IPs below.

Datanode 2: Public_ip: 13.233.110.102, Private_ip: 172.31.36.159
Datanode 1: Public_ip: 65.0.74.41, Private_ip: 172.31.42.168
Namenode: Public_ip: 15.206.93.255, Private_ip: 172.31.34.233
Client: Public_ip: 13.233.73.201, Private_ip: 172.31.41.169
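Reading the captures by hand means matching every source and destination IP against the list above. A minimal sketch of that lookup is shown here, assuming the usual `IP src.port > dst.port:` layout that tcpdump prints with `-n`. The sample line is synthetic, not taken from my capture, and `label_flow` is a hypothetical helper name.

```python
import re

# Private IPs from the list above, mapped to their roles.
ROLES = {
    "172.31.41.169": "Client",
    "172.31.34.233": "Namenode",
    "172.31.42.168": "Datanode 1",
    "172.31.36.159": "Datanode 2",
}

# Matches the "IP src.port > dst.port:" part of a tcpdump -n line.
FLOW_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def label_flow(line):
    """Return (source role, destination role) for one line, or None."""
    match = FLOW_RE.search(line)
    if match is None:
        return None
    src_ip, _src_port, dst_ip, _dst_port = match.groups()
    # Unknown addresses fall back to the raw IP.
    return ROLES.get(src_ip, src_ip), ROLES.get(dst_ip, dst_ip)


# Synthetic example line (an assumption, not real captured output).
sample = (
    "10:15:01.000001 IP 172.31.41.169.45678 > 172.31.42.168.50010: "
    "Flags [P.], length 1448"
)
print(label_flow(sample))  # ('Client', 'Datanode 1')
```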

Result

What I found is very interesting.

I found that the client uploads data only to the first data node, and the remaining replicas are made by the data nodes themselves. While the client is uploading the data to the first data node, the first data node creates a replica on the second data node, and the second data node creates a replica on the third data node.
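The chain of hops described above can be sketched as a small model. This is only an illustration of the order in which data moves, not Hadoop's actual implementation, and `replication_hops` is a hypothetical helper name.

```python
def replication_hops(client, datanodes):
    """Return the (sender, receiver) hops for a pipelined write.

    The client sends only to the first data node; each data node then
    forwards the data to the next one in the list.
    """
    chain = [client] + list(datanodes)
    return list(zip(chain, chain[1:]))


hops = replication_hops("Client", ["Datanode 1", "Datanode 2"])
for sender, receiver in hops:
    print(f"{sender} -> {receiver}")
# Client -> Datanode 1
# Datanode 1 -> Datanode 2
```

Note that the client appears as a sender exactly once, whatever the number of data nodes in the chain.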

I also noticed that the first data node always pings the name node, so that the name node knows there is a live node to which data can be transferred. Likewise, the second data node pings the first, and the third data node pings the second, so that each is marked as a live data node.

One more important point: Hadoop does not use the concept of parallelism for the upload. The method I explained above is what Hadoop uses to address the Velocity problem.

Thank you

For any queries and suggestions, DM me on LinkedIn.
