MapReduce InputSplit vs. Blocks: A Comparison Guide
InputSplit: An InputSplit represents the chunk of data that a single mapper processes, so the number of input splits equals the number of map tasks. Each mapper works through the records of its split one at a time, as sketched below.
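To make the per-record flow concrete, here is a minimal Java sketch of a mapper (the class name and the emitted key/value choice are hypothetical). With the standard TextInputFormat, the framework calls map() once for every line in the split, passing the line's byte offset as the key and the line text as the value.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch: the framework invokes map() once per record inside the
// InputSplit assigned to this mapper.
public class LineLengthMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Hypothetical logic: emit each line keyed to its length in bytes.
        context.write(line, new IntWritable(line.getLength()));
    }
}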
The initial data for a MapReduce job is stored in input files, which typically live in HDFS. How those files are divided and read is defined by the InputFormat, and it is the InputFormat that creates the InputSplits.
In the MapReduce programming model, an InputSplit is a logical representation of the portion of the input data assigned to one mapper. For file-based input, a split (FileSplit) records the file, the start offset, and the length of the region whose records that mapper will process.
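The sketch below, under assumed defaults and a hypothetical input path, asks a TextInputFormat for its splits and prints each split's logical boundaries (file, start offset, length), illustrating that a split is metadata about a region of the input rather than the data itself.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: the InputFormat computes the splits; each FileSplit only carries
// the path, start offset, and length of the region a mapper will read.
public class SplitInspector {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-inspector");
        FileInputFormat.addInputPath(job, new Path("/data/input"));  // hypothetical path

        TextInputFormat inputFormat = new TextInputFormat();
        List<InputSplit> splits = inputFormat.getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.printf("%s  start=%d  length=%d%n",
                    fileSplit.getPath(), fileSplit.getStart(), fileSplit.getLength());
        }
    }
}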
On the other hand, a block is a physical unit of data storage in a distributed file system such as the Hadoop Distributed File System (HDFS). A block is typically a large chunk of data (e.g., 128 MB) that is stored contiguously on a single machine and replicated to other machines for fault tolerance.
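As a small illustration of the physical side, the following sketch writes a file to HDFS with an explicit block size. The 128 MB value mirrors the common dfs.blocksize default; the path, buffer size, and replication factor are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: create a file with a chosen HDFS block size; the file's data is
// physically stored as blocks of this size across the cluster.
public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        long blockSize = 128L * 1024 * 1024;  // 128 MB, the usual HDFS default
        short replication = 3;                // typical default replication factor
        int bufferSize = 4096;

        Path out = new Path("/data/demo.txt");  // hypothetical path
        try (FSDataOutputStream stream =
                fs.create(out, true, bufferSize, replication, blockSize)) {
            stream.writeBytes("hello hdfs\n");
        }
    }
}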
In HDFS, the input data for a MapReduce job is stored as a set of blocks. The InputSplits for the job are derived from those blocks, by default roughly one split per block. Each InputSplit is assigned to a mapper, which processes the records that fall within that split's boundaries.
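The link between block size and split size is often described with the formula max(minSize, min(maxSize, blockSize)), which FileInputFormat-style sizing follows: with default minimum and maximum split sizes, each split ends up the size of one block. The standalone sketch below, with hypothetical values, just reproduces that arithmetic.

// Sketch of the commonly cited split-size rule; not taken verbatim from
// Hadoop source, only mirroring the formula described above.
public class SplitSizeSketch {

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block
        long minSize = 1L;                     // assumed default minimum split size
        long maxSize = Long.MAX_VALUE;         // assumed default maximum split size

        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        System.out.println("split size = " + splitSize + " bytes");
        // With the defaults above, each split covers one block: 134217728 bytes.
    }
}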
Overall, an InputSplit is a logical representation of a portion of the input data that is assigned to a mapper for processing, while a block is a physical unit of data storage in a distributed file system. InputSplits are created based on blocks of data in the file system, and each InputSplit is assigned to a mapper for processing.