
Thursday, 26 December 2019

Hadoop - Understanding Hadoop Architecture

Hadoop is an open-source framework written in Java that allows distributed processing of large datasets across clusters of computers.
Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

The Hadoop architecture has four modules.

  1. Hadoop Common: The Java libraries that provide the filesystem and OS-level abstractions required to start Hadoop.
  2. Hadoop YARN: The framework used for job scheduling and cluster resource management.
  3. Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data and is suitable for applications that have large datasets.
  4. Hadoop MapReduce: A YARN-based system for parallel processing of large datasets.

Question: What is MapReduce?
MapReduce is a software framework for easily writing applications that process large amounts of data in parallel on large clusters. In classic (pre-YARN) MapReduce, it consists of a single master JobTracker and one slave TaskTracker per cluster node. The JobTracker is responsible for resource management and for scheduling tasks. Each TaskTracker is responsible for executing the tasks assigned to it.

Question: How does Hadoop work at a very basic level?
  1. The client submits a job to Hadoop.
  2. The input and output files reside in the distributed file system, and Hadoop is initialized using the Hadoop Common libraries.
  3. The job specifies the map and reduce functions to run.
  4. The JobTracker schedules the tasks.
  5. The TaskTrackers execute the tasks.

Friday, 11 December 2015

Hadoop Basic Interview Questions and Answers

Question: Name a few companies that use Hadoop?
  1. Facebook
  2. Amazon
  3. Twitter
  4. eBay
  5. Adobe
  6. Netflix
  7. Hulu
  8. Rubikloud

Question: Differentiate between structured and unstructured data?
Data that is properly categorized and can be easily searched and updated is known as structured data.
Data that is not categorized and cannot be easily searched or updated is known as unstructured data.

Question: On which concepts does the Hadoop framework work?
HDFS: The Hadoop Distributed File System is a Java-based file system for scalable and reliable storage of large datasets. Data in HDFS is stored in the form of blocks, and it operates on a master-slave architecture.
Hadoop MapReduce: MapReduce distributes the workload into various tasks that run in parallel. A Hadoop job performs two separate tasks. The map task breaks the datasets down into key-value pairs, or tuples. The reduce task then takes the output of the map task and combines those tuples into a smaller set of tuples. The reduce task always runs after the map task has executed.
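
The map and reduce steps described above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not Hadoop code: the function names `map_phase`, `reduce_phase` and `run_job` are made up for illustration, and a real cluster would run the map calls in parallel on many nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single, smaller pair."""
    return (word, sum(counts))

def run_job(lines):
    # Run every map call first; on a cluster these would run in parallel.
    mapped = []
    for line in lines:
        mapped.extend(map_phase(line))

    # Group the intermediate pairs by key, as happens between map and reduce.
    grouped = defaultdict(list)
    for word, count in mapped:
        grouped[word].append(count)

    # The reduce step runs only once the map output is available.
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())

result = run_job(["big data big cluster", "data node"])
print(result)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

Six intermediate pairs from the map step are combined into four output pairs by the reduce step.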

Question: What is Hadoop streaming?
Hadoop Streaming is a generic programming interface, included in the Hadoop distribution, that lets you write map and reduce code in any programming language, such as PHP, Python, Perl, or Ruby.
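
As a rough illustration of the streaming style, the sketch below writes a word-count mapper and reducer in Python that exchange tab-separated "key<TAB>value" text lines. The function names `mapper` and `reducer` are hypothetical, and the call to `sorted()` is an assumption standing in for the sort the framework performs between the two steps. In a real streaming job each function would be its own script, reading from standard input and writing to standard output.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum the counts per word; the input lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() stands in for the framework's sort between the steps.
map_output = list(mapper(["php python", "python perl"]))
reduce_output = list(reducer(sorted(map_output)))
print(reduce_output)  # ['perl\t1', 'php\t1', 'python\t2']
```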

Question: What is block and block scanner in HDFS?
Block: The minimum amount of data that can be read or written is known as a "block". The default size is 64 MB in Hadoop 1.x and 128 MB from Hadoop 2.x onward.
Block Scanner: It tracks the list of blocks present on a DataNode and verifies them to detect checksum errors.
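
To make the block size concrete, here is a small arithmetic sketch of how a file divides into blocks. It assumes the 64 MB default quoted above, and the helper name `split_into_blocks` is made up for illustration. The last block holds only the leftover data rather than being padded to the full block size.

```python
BLOCK_SIZE_MB = 64  # default quoted in the text; the value is configurable

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes

blocks = split_into_blocks(200)
print(len(blocks), blocks)  # 4 [64, 64, 64, 8]
```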

Question: What is commodity hardware?
Commodity hardware is inexpensive, widely available hardware that does not offer high availability. A Hadoop cluster can consist of thousands of such machines, which are used to execute jobs.

Question: Can we write MapReduce in languages other than Java?
Yes, you can write MapReduce tasks in other languages such as PHP, Perl, etc.

Question: What are the primary phases of a Reducer?
  1. Shuffle
  2. Sort
  3. Reduce
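
The three phases can be sketched in plain Python as follows. This is a simplified model, not Hadoop's implementation: the function names are hypothetical, the CRC32-based choice of reducer is an assumption standing in for Hadoop's partitioner, and the sample (year, temperature) pairs are invented. In Hadoop itself, shuffle and sort overlap rather than running strictly one after the other.

```python
import zlib
from itertools import groupby

def shuffle(map_outputs, num_reducers):
    """Shuffle: route each (key, value) pair to a reducer chosen by hashing the key."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            index = zlib.crc32(key.encode("utf-8")) % num_reducers
            partitions[index].append((key, value))
    return partitions

def sort_and_reduce(partition):
    """Sort: order pairs by key. Reduce: collapse each key's values to their maximum."""
    ordered = sorted(partition, key=lambda kv: kv[0])
    return {key: max(value for _, value in group)
            for key, group in groupby(ordered, key=lambda kv: kv[0])}

# Output of two map tasks: (year, temperature) pairs.
map_outputs = [[("2014", 31), ("2015", 28)], [("2014", 35), ("2015", 22)]]

results = {}
for partition in shuffle(map_outputs, num_reducers=2):
    results.update(sort_and_reduce(partition))
print(sorted(results.items()))  # [('2014', 35), ('2015', 28)]
```

Because every pair with the same key is routed to the same reducer, each reducer sees all the values for its keys and can produce a final answer for them independently.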

Question: What is Big Data?
Big data refers to datasets so large or complex that they exceed the processing capacity of traditional database systems.

Question: What is NoSQL?
NoSQL refers to non-relational databases, which can store unstructured data.

Question: What problems can Hadoop solve?
It solves the following problems.
  1. A database that has grown too large and exceeds the limits of a traditional system.
  2. High server costs, which it reduces by running on inexpensive hardware.

Question: Name the modes in which Hadoop can run?
Hadoop can be run in one of three modes:
  1. Standalone (or local) mode
  2. Pseudo-distributed mode
  3. Fully distributed mode

Question: What is the full form of HDFS?
Hadoop Distributed File System

Question: What is DataNode and NameNode in Hadoop?
The NameNode is the node that stores the filesystem metadata, that is, which blocks make up each file and which DataNodes hold those blocks.
DataNodes are the nodes that store the actual data blocks.
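
A toy model may help show the split between metadata and data. The classes below are illustrative only and share nothing with Hadoop's real classes beyond their names. The block ids, node names and file path are invented, and replication is ignored. The NameNode-like object keeps only the mappings, while the DataNode-like objects keep the bytes.

```python
class NameNode:
    """Toy metadata store: file -> block ids, and block id -> DataNodes holding it."""
    def __init__(self):
        self.file_to_blocks = {}
        self.block_locations = {}

    def add_file(self, path, placements):
        # placements: list of (block_id, [datanode names])
        self.file_to_blocks[path] = [block_id for block_id, _ in placements]
        for block_id, nodes in placements:
            self.block_locations[block_id] = list(nodes)

    def locate(self, path):
        """Answer a lookup: which DataNodes hold each block of the file."""
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[path]]

class DataNode:
    """Toy block store: holds the actual bytes and knows nothing about file names."""
    def __init__(self):
        self.blocks = {}

datanodes = {"dn1": DataNode(), "dn2": DataNode()}
datanodes["dn1"].blocks["blk_1"] = b"hello "
datanodes["dn2"].blocks["blk_2"] = b"hadoop"

namenode = NameNode()
namenode.add_file("/demo.txt", [("blk_1", ["dn1"]), ("blk_2", ["dn2"])])

# A read asks the NameNode where the blocks are, then fetches bytes from DataNodes.
content = b"".join(datanodes[nodes[0]].blocks[block_id]
                   for block_id, nodes in namenode.locate("/demo.txt"))
print(content)  # b'hello hadoop'
```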

Question: What are the Hadoop configuration files?
  1. hdfs-site.xml
  2. core-site.xml
  3. mapred-site.xml
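
These files share a simple XML layout: a `configuration` element containing `property` entries, each with a `name` and a `value`. The sketch below generates that layout with Python's standard library. The helper `build_site_xml` is hypothetical, and the example values (a single-machine `hdfs://localhost:9000` address, a replication factor of 1) are assumptions for illustration, not recommended settings.

```python
import xml.etree.ElementTree as ET

def build_site_xml(properties):
    """Render a dict of settings in the <configuration>/<property> layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# Example values are assumptions for a single-machine setup.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = build_site_xml({"dfs.replication": 1})
mapred_site = build_site_xml({"mapreduce.framework.name": "yarn"})
print(core_site)
```

Each string could then be written to the file of the matching name in the Hadoop configuration directory.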