The Datanode uses DU to determine the disk space used by Ozone on a volume. DU is a time-consuming operation when run over a heavily occupied disk, for example over a disk path holding tens of TB of container data, so the operation becomes slow.
Challenges:
Based on the above concern, it is not feasible to run DU over the non-ozone path.
Ozone space usage is the sum of:
Used space = sum(<all containers data size>, <du over ozone path excluding container data>)
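The formula above can be sketched in code. This is an illustrative sketch only, not the actual Ozone implementation; the method and parameter names are assumptions, with container sizes assumed to come from container metadata and the DU figure from a run over the ozone path that excludes the container data directory.

```java
import java.util.List;

// Hypothetical sketch of: used space = sum(container data sizes)
//                                    + du over ozone path excluding container data.
public class UsedSpaceSketch {
    static long usedSpace(List<Long> containerDataSizes,
                          long duOverOzonePathExcludingContainers) {
        long sum = 0;
        // Container sizes are tracked in container metadata, so no DU
        // is needed over the (large) container data directory itself.
        for (long size : containerDataSizes) {
            sum += size;
        }
        return sum + duOverOzonePathExcludingContainers;
    }
}
```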
Space is not counted as Ozone space in the below cases:
These spaces are added to the non-ozone used space; in particular, un-accounted containers need to be cleaned up.
In the future, container space consumed due to duplication or corruption needs to be reclaimed based on error logs.
This inaccuracy has little impact on the solution (it is also present in the existing implementation), since DU runs asynchronously while parallel write operations are in progress.
The approach that fits this scenario best can be provided as OptimizedDU, while keeping the previous support as well.
Hadoop has a similar concept, where used space covers only the actual data blocks and the calculation is simple, i.e.
Used Space = number of blocks * block size
Similar to this, Ozone can also calculate used space as:
Used Space = sum of all container data sizes
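The two calculations can be put side by side as follows. This is an illustrative sketch, not the actual Hadoop or Ozone API; method names and parameters are assumptions for the comparison only.

```java
import java.util.List;

// Sketch contrasting the Hadoop-style and the proposed Ozone-style
// used-space calculations.
public class UsedSpaceComparison {
    // Hadoop: used space = number of blocks * block size.
    static long hadoopUsedSpace(long numBlocks, long blockSize) {
        return numBlocks * blockSize;
    }

    // Proposed Ozone: used space = sum of all container data sizes,
    // read from container metadata rather than measured with DU.
    static long ozoneUsedSpace(List<Long> containerDataSizes) {
        long total = 0;
        for (long size : containerDataSizes) {
            total += size;
        }
        return total;
    }
}
```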
Space will not be counted:
These spaces will be counted as reserved space, and this data is small in nature (except container data). Corrupted container data / containers need to be removed manually, or a code fix is needed to remove them automatically.
Considering the above impact, this may be a simple solution that provides higher performance.
Approach 2 (run DU over the meta path only, excluding the container dir path) is preferred over the others, as it identifies Ozone used space with accuracy closest to the DU implementation while handling the time issue.
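The excluded-path walk of Approach 2 can be sketched as below. This is a minimal sketch, not the actual Ozone implementation: the directory names and the `duExcluding` helper are assumptions, and the real code would add the excluded container directory's size from container metadata rather than measuring it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Sketch of a DU-style size computation over a volume path that skips
// the container data directory (whose size is known from metadata).
public class MetaPathDu {
    static long duExcluding(Path root, Path containerDir) throws IOException {
        try (Stream<Path> files = Files.walk(root)) {
            return files
                .filter(p -> !p.startsWith(containerDir)) // skip container data
                .filter(Files::isRegularFile)
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }
}
```

Because only the (small) metadata tree is walked, the operation stays fast even when the excluded container directory holds tens of TB.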