Index based Approach in Hadoop Ecosystem for Performance Improvement
Today, the term Big Data is familiar to most professionals and academics. One possible definition is: "Data that is so large that it is beyond the processing capacity of a single computer or a small group of computers." The two important aspects of Big Data are the storage and the processing of the data. Big Data can therefore also be viewed as a problem to be solved.
Apache Hadoop, on the other hand, offers a solution to the Big Data problem. Hadoop is an open-source framework maintained by the Apache Software Foundation for storing and processing large datasets; it is not recommended for small datasets. Its two main components are HDFS (Hadoop Distributed File System) for storage and MapReduce for processing.
HDFS is a file system specially designed to store large datasets on a cluster of commodity hardware with a streaming access pattern, while MapReduce is responsible for parallel processing of the datasets stored in HDFS.
To search for specific data in Hadoop, a MapReduce program has to go through all the data blocks in HDFS, because Hadoop stores the entire dataset as data blocks distributed across the DataNodes of the cluster.
This paper presents a strategy for searching for specific data in Hadoop in minimal time. To this end, we introduce a new index approach in the Hadoop ecosystem, with which only those data blocks that contain the desired data need to be read, rather than all data blocks.
Keywords - Apache Hadoop EcoSystem, HDFS, Indexing in HDFS, InputSplits.
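
The contrast between scanning every data block and consulting an index first can be illustrated with a minimal sketch. It is not the implementation proposed in this paper: the in-memory dictionary standing in for HDFS, the block identifiers, the sample records, and the function names (key_of, full_scan, build_index, indexed_search) are all assumptions made for illustration, and the search key is assumed to be the first comma-separated field of each record.

```python
from collections import defaultdict

# Hypothetical stand-in for HDFS: block id -> records stored in that block.
# Block ids and records are invented for illustration only.
blocks = {
    "blk_0": ["alice,34", "bob,27"],
    "blk_1": ["carol,41", "dave,19"],
    "blk_2": ["erin,52", "bob,63"],
}


def key_of(record):
    """Assumed record layout: the search key is the first comma-separated field."""
    return record.split(",")[0]


def full_scan(blocks, key):
    """Search without an index: every block has to be read."""
    matches = []
    blocks_read = 0
    for records in blocks.values():
        blocks_read += 1
        matches.extend(r for r in records if key_of(r) == key)
    return matches, blocks_read


def build_index(blocks):
    """Build a key -> set of block ids mapping in one pass over the data."""
    index = defaultdict(set)
    for block_id, records in blocks.items():
        for record in records:
            index[key_of(record)].add(block_id)
    return index


def indexed_search(blocks, index, key):
    """Search with the index: only blocks listed for the key are read."""
    block_ids = sorted(index.get(key, ()))
    matches = []
    for block_id in block_ids:
        matches.extend(r for r in blocks[block_id] if key_of(r) == key)
    return matches, len(block_ids)


if __name__ == "__main__":
    index = build_index(blocks)
    print(full_scan(blocks, "bob"))
    print(indexed_search(blocks, index, "bob"))
```

In this sketch both searches return the same records, but the indexed search touches only the blocks recorded for the key. Building the index still requires one full pass over the data, so the benefit appears only when the same dataset is searched repeatedly.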