1. Avoid small files (smaller than one HDFS block, typically 128 MB) with one map task processing a single file; see the small-files sketch after this list.
2. Maintain an optimal block size (>= 128 MB) to avoid tens of thousands of map tasks when processing large data sets; see the block-size sketch below.
3. Process large data sets with an optimal number of reducers, avoiding degenerate cases such as using only a few reducers; see the reducer sketch below.
4. Avoid writing out multiple small output files from each reducer.
5. Use the PARALLEL keyword in Pig scripts when processing large data sets, e.g. GROUP logs BY user PARALLEL 50; to request 50 reducers for that operator.
6. Use the DistributedCache to distribute artifacts that are only around tens of MB each; see the cache sketch below.
7. Enable compression for both the intermediate map output and the final reducer output; see the compression sketch below.
8. Do not implement automated processes that screen-scrape the web UI; this is strictly prohibited as it leads to severe performance issues.
9. Applications should not perform metadata operations on the file system from the back-end.
10. Use counters sparingly: they are very expensive, since the JobTracker has to maintain every counter of every map/reduce task for the entire duration of the application; see the counter sketch below.
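
Sketch for item 1: one way to stop the one-map-per-tiny-file pattern is CombineTextInputFormat, which packs many small files into block-sized splits. This is a minimal driver sketch assuming the Hadoop 2.x MapReduce API; the input path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        // Pack many small files into fewer splits instead of one map per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly one HDFS block (128 MB).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/data/small-files")); // hypothetical path
        // ... set mapper/reducer classes and output path, then submit ...
    }
}
```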
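
Sketch for item 2: the block size is fixed when a file is written, so larger blocks must be requested by the writing client. This assumes Hadoop 2.x, where the property is dfs.blocksize (older releases use dfs.block.size); the path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request 256 MB blocks: a 1 TB input then yields ~4,096 map tasks
        // instead of ~8,192 with the 128 MB default.
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/data/large-input.txt"))) { // hypothetical path
            out.writeBytes("record data ...");
        }
    }
}
```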
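
Sketch for item 3: set the reducer count explicitly in the driver rather than accepting the default of a single reducer. The figure below is an assumption; a common rule of thumb is 0.95 or 1.75 times (nodes x reduce slots per node).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sized-reduce-phase");
        // Assumed cluster: ~50 nodes x 4 reduce slots; 0.95 x 200 gives ~190
        // reducers, so all reduces launch in one wave with room for stragglers.
        job.setNumReduceTasks(190);
        // ... configure input/output and submit ...
    }
}
```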
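
Sketch for item 6: ship a moderate-sized side file (tens of MB) with the DistributedCache instead of having every task read it from HDFS. This uses the Job.addCacheFile form of the API (older releases use org.apache.hadoop.filecache.DistributedCache); the artifact path and size are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheFileDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed-cache-demo");
        // A ~20 MB lookup file (hypothetical); the fragment after '#' creates
        // a symlink named "geo.dat" in each task's working directory.
        job.addCacheFile(new URI("/apps/lookup/geo.dat#geo.dat"));
        // Tasks can then open new File("geo.dat") locally in setup().
    }
}
```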
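
Sketch for item 7: enable both kinds of compression from the driver. Property names assume Hadoop 2.x (mapreduce.map.output.compress; Hadoop 1.x used mapred.compress.map.output), and SnappyCodec requires the native Snappy libraries on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output: less data shuffled over the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf, "compression-demo");
        // Compress the final reducer output written to HDFS.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // ... configure input/output paths and submit ...
    }
}
```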
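
Sketch for item 10: when counters are needed, keep the set small and fixed. An enum bounds the number of counters no matter how much data flows through; the validity check here is hypothetical.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ParsingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // A fixed enum bounds the counter count; never derive counter names
    // from data values (e.g., one counter per user ID).
    enum Records { GOOD, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws java.io.IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("\t")) {  // hypothetical validity check
            context.getCounter(Records.GOOD).increment(1);
            // ... parse and context.write(...) the record ...
        } else {
            context.getCounter(Records.MALFORMED).increment(1);
        }
    }
}
```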