What is Mapper in MapReduce?

In the MapReduce programming model, a mapper is a function that processes a set of input key-value pairs and produces a set of intermediate key-value pairs. The mapper is typically responsible for filtering and transforming the input records into intermediate key-value pairs that can be processed by the reducer.

The RecordReader produces input records, which the Hadoop Mapper processes and converts into intermediate key-value pairs. The types of the intermediate output pair may be entirely different from those of the input pair.

The mapper's output is the complete collection of intermediate key-value pairs. Before each map task writes its output, the output is partitioned by key. Partitioning ensures that all the values for a given key are clustered together and sent to the same reducer.
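The key-based partitioning step can be sketched as a simple hash function. This is an assumed, simplified model (Hadoop's actual default is the Java `HashPartitioner` class), but it shows the essential property: the same key always maps to the same partition.

```python
def partition(key, num_reducers):
    # Hash the key so that every occurrence of the same key lands in
    # the same partition, and therefore reaches the same reducer.
    return hash(key) % num_reducers

# Every call with the same key and reducer count returns the same
# partition index, so all values for that key are grouped together.
```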

For each InputSplit, Hadoop MapReduce creates a separate map task.

Here is an example of a simple mapper function in MapReduce:

def mapper(key, value):
    # Process the input key-value pair and generate
    # a set of intermediate key-value pairs
    intermediate_key = ...
    intermediate_value = ...
    yield intermediate_key, intermediate_value

In this example, the mapper function takes a key and a value as input and produces a set of intermediate key-value pairs as output. 
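To make the template concrete, here is the classic word-count mapper, which emits a `(word, 1)` pair for every word in an input line. The key/value naming follows the generic template; treating the key as a line offset is an assumption borrowed from Hadoop's default `TextInputFormat`.

```python
def mapper(key, value):
    # key: the line's byte offset in the file (ignored here)
    # value: the text of the line
    for word in value.split():
        # Emit (word, 1) for every occurrence; the reducer will
        # later sum these counts per word.
        yield word, 1

# Example: mapper(0, "hello world hello") yields
# ("hello", 1), ("world", 1), ("hello", 1)
```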

For the Mapper, an InputSplit transforms the blocks' physical representation into a logical one. For instance, with a 64 MB block size, a 100 MB file spans two blocks, so reading it requires two InputSplits. The framework creates one InputSplit for each block, and one mapper for each InputSplit.

The number of input splits does not have to match the number of data blocks. By adjusting the mapred.max.split.size parameter during job execution, we can modify the split count.
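The resulting split count is just a ceiling division of the file size by the maximum split size. The helper below is illustrative arithmetic, not a Hadoop API:

```python
import math

def split_count(file_size_bytes, max_split_bytes):
    # Number of input splits (and therefore map tasks) for one file,
    # given a maximum split size such as mapred.max.split.size.
    return math.ceil(file_size_bytes / max_split_bytes)

# A 100 MB file with a 64 MB maximum split size yields 2 splits,
# matching the two-InputSplit example above.
```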

The mapper function can perform any necessary processing on the input data and generate the intermediate key-value pairs as needed.

The intermediate key-value pairs produced by the mapper are then passed to the reducer, which aggregates and processes the data further. The final output of the MapReduce job is produced by the reducer.
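The hand-off from mapper to reducer can be sketched in a few lines: sort the intermediate pairs, group them by key (a stand-in for Hadoop's shuffle phase), and apply a reducer to each group. The word-count-style summing reducer shown here is an assumption for illustration.

```python
from itertools import groupby
from operator import itemgetter

def reducer(key, values):
    # Aggregate all values for one key; here, a simple sum.
    yield key, sum(values)

def run_reduce(pairs):
    # Simplified shuffle: sorting and grouping by key stands in for
    # Hadoop's partition/sort/merge machinery.
    results = {}
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        for k, total in reducer(key, (v for _, v in group)):
            results[k] = total
    return results

# run_reduce([("hello", 1), ("world", 1), ("hello", 1)])
# returns {"hello": 2, "world": 1}
```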

Overall, the mapper is an important component of the MapReduce programming model, as it is responsible for processing and grouping the data that will be processed by the reducer.
