tree 4df834775d34831097827ddc3addac59fce525d7
parent 12e0ca6b243838e74ce0f9132a3019ee960cad97
author Steve Loughran <stevel@cloudera.com> 1715631154 +0100
committer GitHub <noreply@github.com> 1715631154 +0100
gpgsig -----BEGIN PGP SIGNATURE-----
 
 wsFcBAABCAAQBQJmQnQyCRC1aQ7uu5UhlAAAk4EQAIeCYYR5NqkwdaQeIvTiR039
 X/Q4+jzD1xpCNTkkC1I5XbeOUh/8pInTxdbnUVvNSNlTXwz/KHFIZGSyUL5sBBaP
 8/L2SqJ3X5u/vx7ZWyTSLdT0eQM9As8wxFzpqEK8kmMgOX7DIqXtLrt+8qTHUlu7
 XPtbLFp//5UnmBXm+mcDCpGQFlj4L8D85ommGMUaJi5BiDvqMiiMsaqXR075ZJMU
 dVY6AQZ/0tM3Q8tGt9teyzInpSmUinTziBt36Q1dbvCszlIN33X9BC5OFQAGonJv
 XAT0GbxSCM1koarUNhRl+d8UedmAv+CVhg5UuIJ41PXPjAU3uPGV5pSGQZy8Ij3h
 n8gJyCKznJMofxDCRLXOcrPJW7DHPXN/Bpk0OcKAO1M+MZ8VWD1WcxElaiyhd0hJ
 fzQxWfQk2l5Ol4vRTHp7m+xbMR2bAaOycmOGr0AGpiTj4fUg2fiTtadINRhV3THT
 VEi4MyDkEs025Ta7TLwUJBq5BpDuLxtTz50SP9OhtjAvzNdBQsk2XiYl4k6tSo5b
 u1qgpc5q2MIgbWuIS8f/BfMzihLsVWFP+AwN0f7Nx6Mnvrq9EvLq3jIBuKbMB6pP
 8A8bgqPlJKIemy7zWtqtWNRqdiv/hJUTGGzvfmsetUHISx8fprozqESsPRr/CvVL
 n4qFnZ0WY4Xwijy0u7KB
 =akNw
 -----END PGP SIGNATURE-----
 

MAPREDUCE-7474. Improve Manifest committer resilience (#6716)


Improve task commit resilience everywhere
and add an option to reduce delete IO requests on
job cleanup (relevant for ABFS and HDFS).

Task Commit Resilience
----------------------

Task manifest saving is re-attempted on failure; the number of 
attempts made is configurable with the option:

  mapreduce.manifest.committer.manifest.save.attempts

* The default is 5.
* The minimum is 1; asking for less is ignored.
* A retry policy adds 500ms of sleep per attempt.
* The manifest is now renamed with commitFile() rather than classic
  rename(), after calling getFileStatus() to get its length and
  possibly etag. This still becomes a rename() on gcs/hdfs, but on abfs
  it reaches the ResilientCommitByRename callbacks, which report the
  outcome to the caller, where it is logged at WARN.
* New statistic task_stage_save_summary_file to distinguish task
  manifest saves from other saving operations (job success/report
  file). This is only saved to the manifest on task commit retries, and
  provides statistics on all previous unsuccessful attempts to save
  the manifest.
* Test changes to match the codepath changes, including improvements
  in fault injection.
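
The retry behaviour above can be sketched roughly as follows. This is
an illustration only, not the committer's code: the class and method
names are hypothetical, it assumes a requested attempt count below 1 is
treated as 1, and it assumes the sleep grows by 500ms with each failed
attempt. The sleep is passed in as a callback so the demo does not
actually wait.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

/** Hypothetical sketch of a bounded save-with-retries loop. */
class ManifestSaveRetrySketch {

  /** Assumed sleep increment per failed attempt, in milliseconds. */
  static final long SLEEP_STEP_MS = 500;

  /** Assumption: values below the minimum of 1 are treated as 1. */
  static int effectiveAttempts(int requested) {
    return Math.max(1, requested);
  }

  /**
   * Invoke the save operation until it succeeds or attempts run out.
   * After failed attempt n (except the last), sleep n * 500ms.
   */
  static <T> T saveWithRetries(Callable<T> save, int requestedAttempts,
      LongConsumer sleeper) throws Exception {
    int attempts = effectiveAttempts(requestedAttempts);
    Exception last = null;
    for (int attempt = 1; attempt <= attempts; attempt++) {
      try {
        return save.call();
      } catch (Exception e) {
        last = e;
        if (attempt < attempts) {
          sleeper.accept(attempt * SLEEP_STEP_MS);
        }
      }
    }
    // every attempt failed: rethrow the most recent failure
    throw last;
  }

  public static void main(String[] args) throws Exception {
    List<Long> sleeps = new ArrayList<>();
    int[] calls = {0};
    // simulated save which fails twice, then succeeds
    String result = saveWithRetries(() -> {
      if (++calls[0] < 3) {
        throw new java.io.IOException("simulated save failure");
      }
      return "saved";
    }, 5, ms -> sleeps.add(ms));
    System.out.println(result + " after " + calls[0]
        + " attempts; sleeps requested: " + sleeps);
  }
}
```

A real caller would pass a sleeper which blocks, such as one wrapping
Thread.sleep().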

Directory size for deletion
---------------------------

New option

  mapreduce.manifest.committer.cleanup.parallel.delete.base.first

This makes an initial attempt to delete the base dir, only falling
back to parallel deletes if there's a timeout.

This option is disabled by default; consider enabling it for abfs to
reduce IO load. Consult the documentation for more details.
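
The cleanup sequence can be sketched roughly as follows. This is a
simplified illustration under stated assumptions, not the committer's
implementation: the Store interface, the cleanup() method and its
string results are all hypothetical, and a timeout is modelled as a
java.util.concurrent.TimeoutException.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of "delete base dir first" job cleanup. */
class BaseFirstCleanupSketch {

  /** Hypothetical stand-in for the filesystem operations needed. */
  interface Store {
    void delete(String path) throws Exception; // recursive delete
    List<String> listChildren(String path) throws Exception;
  }

  /**
   * Optionally try a single delete of the base dir; on timeout (or if
   * the option is off) delete the children in parallel, then the base.
   */
  static String cleanup(Store store, String baseDir, boolean baseFirst,
      ExecutorService pool) throws Exception {
    if (baseFirst) {
      try {
        store.delete(baseDir);
        return "base-delete";
      } catch (TimeoutException e) {
        // fall back to the parallel delete below
      }
    }
    List<Future<?>> pending = new ArrayList<>();
    for (String child : store.listChildren(baseDir)) {
      pending.add(pool.submit(() -> {
        store.delete(child);
        return null;
      }));
    }
    for (Future<?> f : pending) {
      f.get(); // propagate any failure of a child delete
    }
    store.delete(baseDir);
    return "parallel-delete";
  }

  public static void main(String[] args) throws Exception {
    // in-memory store whose deletes always succeed
    Store store = new Store() {
      public void delete(String path) {
        System.out.println("delete " + path);
      }
      public List<String> listChildren(String path) {
        return Arrays.asList(path + "/task_01", path + "/task_02");
      }
    };
    ExecutorService pool = Executors.newFixedThreadPool(2);
    try {
      System.out.println(cleanup(store, "/job", true, pool));
    } finally {
      pool.shutdown();
    }
  }
}
```

When the first delete succeeds only one delete request is issued,
which is where the reduction in IO comes from.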

Success file printing
---------------------

The command to print a JSON _SUCCESS file from this committer and
any S3A committer can now be invoked from the mapred command:

  mapred successfile <path to file>

Contributed by Steve Loughran