Tuesday 14 June 2016

Hadoop Best Practices

1. Avoid small files (files smaller than one HDFS block, typically 128 MB) combined with one map task processing a single file.
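
Where many small files already exist, one option (a minimal sketch assuming the org.apache.hadoop.mapreduce API; the input path and split size are illustrative) is CombineTextInputFormat, which packs several small files into each input split so one map task handles many files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

    public class CombineSmallFiles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            // Pack many small files into each split instead of one map per file.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at roughly one HDFS block (128 MB).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path("/data/small-files"));
            // ...set mapper, reducer, output path and submit as usual...
        }
    }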

2. Maintain an optimal block size (at least 128 MB) to avoid spawning tens of thousands of map tasks when processing large data sets.
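
Block size is a per-file property fixed when the file is written, so large inputs should be created with a suitably large block. A minimal sketch, assuming Hadoop 2.x property names and a hypothetical output path:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithLargeBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Request 256 MB blocks for files created by this client.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
            FileSystem fs = FileSystem.get(conf);
            try (FSDataOutputStream out = fs.create(new Path("/data/large/input.dat"))) {
                out.writeBytes("block size is fixed per file at create time\n");
            }
        }
    }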

3. Applications processing large data sets should use an optimal number of reducers and avoid cases such as funnelling all the data through only a few reducers.
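
In plain MapReduce the reducer count is set explicitly on the job (the default is a single reducer). A sketch, with an illustrative count that would be tuned to the data volume and cluster:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReducerSizing {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "large-aggregation");
            // The default is 1 reducer; size the reduce phase to the shuffled
            // data so no single reducer becomes the bottleneck.
            job.setNumReduceTasks(32); // illustrative value, not a recommendation
            // ...set input/output formats, mapper, reducer and submit...
        }
    }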

4. Avoid applications writing out multiple small output files from each reducer.
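
One way to keep reducers from leaving behind empty or near-empty part files, assuming the standard FileOutputFormat family, is LazyOutputFormat, which creates a part file only when the first record is actually written:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class LazyOutputExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "lazy-output");
            // Part files are created only when a reducer writes its first record,
            // so reducers that receive no data leave no empty files behind.
            LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
            // ...set mapper, reducer, paths and submit...
        }
    }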

5. Use the PARALLEL keyword in Pig scripts when processing large data sets, so reduce-side operators run with an appropriate number of reducers.

6. Use the DistributedCache to distribute artifacts of around tens of MB each.
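
A minimal sketch of the mechanism, using Job.addCacheFile, the newer entry point to the same distributed-cache machinery; the lookup file path is hypothetical:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SideDataExample {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "side-data");
            // Ship a small lookup file once to every node that runs a task,
            // rather than re-reading it from HDFS inside every map task.
            // The "#countries.tsv" fragment exposes it as a local symlink.
            job.addCacheFile(new URI("/apps/lookup/countries.tsv#countries.tsv"));
            // In Mapper.setup(), open the local file "countries.tsv".
        }
    }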

7. Enable compression for both the intermediate map output and the final reducer output.
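
A sketch of both settings, assuming Snappy is available on the cluster and using the Hadoop 2.x property names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CompressionSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Compress the intermediate map output shuffled to the reducers.
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf, "compressed-output");
            // Compress the final reducer output written to HDFS.
            FileOutputFormat.setCompressOutput(job, true);
            FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
            // ...set mapper, reducer, paths and submit...
        }
    }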

8. Do not implement automated processes that screen-scrape the web UI; this is strictly prohibited as it leads to severe performance issues.

9. Applications should not perform metadata operations on the file system from the back end.

10. Use counters sparingly; they are very expensive because the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application.
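
A sketch of restrained counter use in a mapper: the enum keeps the counter set small and fixed rather than generating counter names dynamically (the tab-separated input layout is illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        // A small, fixed set of counters; each named counter is tracked
        // centrally for every task for the life of the job.
        enum Quality { MALFORMED_RECORDS }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.getLength() == 0) {
                context.getCounter(Quality.MALFORMED_RECORDS).increment(1);
                return;
            }
            // Emit the first tab-separated field with a count of 1.
            context.write(new Text(value.toString().split("\t")[0]), new LongWritable(1));
        }
    }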