Proposal
Problem statement
This ACP is technically two different features, but they are both related to my motivating example, so to avoid making redundant proposals, I'll present both of them here.
Currently, in std::fs, we can construct a ReadDir iterator through using std::fs::read_dir passing in a AsRef<Path> value. However, while the ReadDir iterator stores the directory path it's reading directory entries from, we do not have an accessor method to get this directory path. For context and clarity, this is the internal implementation of ReadDir for unix platforms as seen in library/std/src/sys/fs/unix.rs:
// all DirEntry's will have a reference to this struct
struct InnerReadDir {
dirp: DirStream,
root: PathBuf,
}
pub struct ReadDir {
inner: Arc<InnerReadDir>,
end_of_stream: bool,
}
impl ReadDir {
fn new(inner: InnerReadDir) -> Self {
Self { inner: Arc::new(inner), end_of_stream: false }
}
}
The problem that arises here when a ReadDir struct does not exposing a path reference to its internal PathBuf is that iterative implementations that constructs a ReadDir on nested directories are forced to redundantly allocate a PathBuf to remember this path (see motivating example); in iterative implementations we are unable to hold path references to child directories because the lifetime of those path references will not live through the next iteration.
The second issue to point out is in regards to what std::fs::read_dir does internally. Underneath the hood, std::fs::read_dir wraps around the different platform implementations of readdir, which the common thing they do across different platforms is they take a &Path argument and allocate memory to convert this path reference to an owned PathBuf. This is the unix platform implementation of readdir:
pub fn readdir(path: &Path) -> io::Result<ReadDir> {
let ptr = run_path_with_cstr(path, &|p| unsafe { Ok(libc::opendir(p.as_ptr())) })?;
if ptr.is_null() {
Err(Error::last_os_error())
} else {
let root = path.to_path_buf();
let inner = InnerReadDir { dirp: DirStream(ptr), root };
Ok(ReadDir::new(inner))
}
}
The thing is for iterative/recursive approaches on nested directories, the nested directory DirEntry produced by ReadDir iterator returns an owned PathBuf when accessing this entry's path via DirEntry::path. The problem that arises from this is that constructing a ReadDir for these directories allocates the same PathBuf again instead of taking ownership of the nested directories' PathBuf. This could be expensive if we have a bunch of nested directory paths and we're just cloning a PathBuf for each nested directory.
It would help to have a way to access the path associated with the ReadDir iterator for an optimized iterative solution aiming to operate on a directory and nested directories as well as a way to print what path contains DirEntry error results (as ReadDir produces Result<DirEntry, Error> items). Having a way for the ReadDir iterator to be constructed through an owned PathBuf would help avoid a redundant allocation for nested directories.
Motivating examples or use cases
My motivating example comes from my PR in trying to change the implementation of std::fs::remove_dir_all for unix/uefi platform from recursive to an iterative version.
The current implementation of std::fs::remove_dir_all for the mentioned platforms are as follows:
pub fn remove_dir_all(path: &Path) -> io::Result<()> {
let filetype = fs::symlink_metadata(path)?.file_type();
if filetype.is_symlink() { fs::remove_file(path) } else { remove_dir_all_recursive(path) }
}
fn remove_dir_all_recursive(path: &Path) -> io::Result<()> {
for child in fs::read_dir(path)? {
let result: io::Result<()> = try {
let child = child?;
if child.file_type()?.is_dir() {
remove_dir_all_recursive(&child.path())?;
} else {
fs::remove_file(&child.path())?;
}
};
// ignore internal NotFound errors to prevent race conditions
if let Err(err) = &result
&& err.kind() != io::ErrorKind::NotFound
{
return result;
}
}
ignore_notfound(fs::remove_dir(path))
}
My iterative implementation in the PR is as follows:
pub fn remove_dir_all(path: &Path) -> io::Result<()> {
let filetype = fs::symlink_metadata(path)?.file_type();
if filetype.is_symlink() { fs::remove_file(path) } else { remove_dir_all_iterative(path) }
}
fn remove_dir_all_iterative(path: &Path) -> io::Result<()> {
let mut directories = VecDeque::new();
directories.push_front((path.to_path_buf(), fs::read_dir(path)?));
while !directories.is_empty() {
let (parent_path, read_dir) = &mut directories[0];
let child = read_dir.next();
if let Some(child) = child {
let result: io::Result<()> = try {
let child = child?;
let child_path = child.path();
if child.file_type()?.is_dir() {
let child_readdir = fs::read_dir(&child_path)?;
directories.push_front((child_path, child_readdir));
} else {
fs::remove_file(&child_path)?;
}
};
// ignore internal NotFound errors to prevent race conditions
if let Err(err) = &result
&& err.kind() != io::ErrorKind::NotFound
{
return result;
}
} else {
ignore_notfound(fs::remove_dir(parent_path))?;
directories.pop_front();
}
}
Ok(())
}
In my iterative approach, to preserve the order that the entries are removed, I use a VecDeque to store a tuple of the PathBuf and the associated ReadDir iterator; holding a PathBuf value here is necessary in the tuple if I want to remove the path associated with the ReadDir iterator when I exhaust the iterator. Moreover, I am unable to store a &Path value in the tuple because the child.path() returns an owned PathBuf that will be deallocated at the end of the while loop iteration if I don't transfer ownership to it to the VecDeque. This storing of PathBuf, as mentioned earlier, is redundant because our ReadDir iterator already has that information. Moreover, because our ReadDir iterator has that information, the current recursive implementation does utilize more memory with each path: &Path arguments passed in on recursive calls, when we only need to use that path reference value at the time of removing the associated directory of the ReadDir iterator.
Also note: I am aware that I could use a LinkedList here if resizing costs is undesirable for this VecDeque.
Unfortunately, both the current recursive and iterative implementation has an extra allocation on PathBuf when using std::fs::read_dir with the nested directories. In an ideal scenario, an iterative version of remove_dir_all would be as follows:
fn remove_dir_all_iterative(path: &Path) -> io::Result<()> {
let mut directories = VecDeque::new();
directories.push_front(fs::read_dir(path)?);
while !directories.is_empty() {
// directory is not empty
let read_dir = directories.front_mut().unwrap();
let child = read_dir.next();
if let Some(child) = child {
let result: io::Result<()> = try {
let child = child?;
if child.file_type()?.is_dir() {
directories
.push_front(fs::owned_read_dir(child.path())?);
} else {
fs::remove_file(&child.path())?;
}
};
// ignore internal NotFound errors to prevent race conditions
if let Err(err) = &result
&& err.kind() != io::ErrorKind::NotFound
{
return result;
}
} else {
ignore_notfound(fs::remove_dir(read_dir.root()))?;
directories.pop_front();
}
}
Ok(())
}
This solution assumes that we have an accessor method to the ReadDir inner struct root field and a std::fs::owned_read_dir()) method that operates exactly like std::fs::read_dir() returning a Result<ReadDir>, but accepts an owned PathBuf instead of a path reference.
Solution sketch
In std::fs, we can introduce the following methods:
/// Takes ownership of the path and constructs an iterator that
/// iterates through the directory entries of the path
pub fn owned_read_dir(path: PathBuf) -> io::Result<ReadDir> {
fs_imp::read_dir(path).map(ReadDir)
}
impl ReadDir {
/// Returns the path that this iterator is reading entries from
pub fn root(&self) -> &Path {
self.0.root()
}
}
This solution also means that we have to change the fs_imp::read_dir signature from currently:
pub fn read_dir(path: &Path) -> io::Result<ReadDir> {
// FIXME: use with_native_path on all platforms
imp::readdir(path)
}
to
pub fn read_dir(path: PathBuf) -> io::Result<ReadDir> {
imp::readdir(path) // platform impl would also take in a `PathBuf`
}
Or if this is undesirable, we can create a separate owned_read_dir function with the trade off that it introduces an owned_read_dir function that we need to separately implement/maintain per platform.
As for std::fs::ReadDir::root, we would have to create a wrapper accessor method per platform implementation of ReadDir struct.
Alternatives
The current working approach I can think of right now for iterative implementations that perform work on a given directory and nested directories is to use a tuple that holds the path and the associated ReadDir iterator; you can also get the parent of the child paths through Path::parent, but this doesn't seem ideal and won't be useful if you exhaust the ReadDir iterator. Otherwise, just use a recursive implementation.
The downside with the solution I propose is that it introduces more function that we need to implement for different platforms, and for any future platforms that we support.
Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
- We think this problem seems worth solving, and the standard library might be the right place to solve it.
- We think that this probably doesn't belong in the standard library.
Second, if there's a concrete solution:
- We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
- We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
Proposal
Problem statement
This ACP is technically two different features, but they are both related to my motivating example, so to avoid making redundant proposals, I'll present both of them here.
Currently, in
std::fs, we can construct aReadDiriterator through usingstd::fs::read_dirpassing in aAsRef<Path>value. However, while theReadDiriterator stores the directory path it's reading directory entries from, we do not have an accessor method to get this directory path. For context and clarity, this is the internal implementation ofReadDirfor unix platforms as seen inlibrary/std/src/sys/fs/unix.rs:The problem that arises here when a
ReadDirstruct does not exposing a path reference to its internalPathBufis that iterative implementations that constructs aReadDiron nested directories are forced to redundantly allocate aPathBufto remember this path (see motivating example); in iterative implementations we are unable to hold path references to child directories because the lifetime of those path references will not live through the next iteration.The second issue to point out is in regards to what
std::fs::read_dirdoes internally. Underneath the hood,std::fs::read_dirwraps around the different platform implementations ofreaddir, which the common thing they do across different platforms is they take a&Pathargument and allocate memory to convert this path reference to an ownedPathBuf. This is the unix platform implementation of readdir:The thing is for iterative/recursive approaches on nested directories, the nested directory
DirEntryproduced byReadDiriterator returns an ownedPathBufwhen accessing this entry's path viaDirEntry::path. The problem that arises from this is that constructing aReadDirfor these directories allocates the samePathBufagain instead of taking ownership of the nested directories'PathBuf. This could be expensive if we have a bunch of nested directory paths and we're just cloning aPathBuffor each nested directory.It would help to have a way to access the path associated with the
ReadDiriterator for an optimized iterative solution aiming to operate on a directory and nested directories as well as a way to print what path containsDirEntryerror results (asReadDirproducesResult<DirEntry, Error>items). Having a way for theReadDiriterator to be constructed through an ownedPathBufwould help avoid a redundant allocation for nested directories.Motivating examples or use cases
My motivating example comes from my PR in trying to change the implementation of
std::fs::remove_dir_allfor unix/uefi platform from recursive to an iterative version.The current implementation of
std::fs::remove_dir_allfor the mentioned platforms are as follows:My iterative implementation in the PR is as follows:
In my iterative approach, to preserve the order that the entries are removed, I use a
VecDequeto store a tuple of thePathBufand the associatedReadDiriterator; holding aPathBufvalue here is necessary in the tuple if I want to remove the path associated with theReadDiriterator when I exhaust the iterator. Moreover, I am unable to store a&Pathvalue in the tuple because thechild.path()returns an ownedPathBufthat will be deallocated at the end of the while loop iteration if I don't transfer ownership to it to the VecDeque. This storing ofPathBuf, as mentioned earlier, is redundant because ourReadDiriterator already has that information. Moreover, because ourReadDiriterator has that information, the current recursive implementation does utilize more memory with eachpath: &Patharguments passed in on recursive calls, when we only need to use that path reference value at the time of removing the associated directory of theReadDiriterator.Also note: I am aware that I could use a
LinkedListhere if resizing costs is undesirable for thisVecDeque.Unfortunately, both the current recursive and iterative implementation has an extra allocation on
PathBufwhen usingstd::fs::read_dirwith the nested directories. In an ideal scenario, an iterative version ofremove_dir_allwould be as follows:This solution assumes that we have an accessor method to the
ReadDirinner structrootfield and astd::fs::owned_read_dir())method that operates exactly likestd::fs::read_dir()returning aResult<ReadDir>, but accepts an ownedPathBufinstead of a path reference.Solution sketch
In
std::fs, we can introduce the following methods:This solution also means that we have to change the
fs_imp::read_dirsignature from currently:to
Or if this is undesirable, we can create a separate
owned_read_dirfunction with the trade off that it introduces anowned_read_dirfunction that we need to separately implement/maintain per platform.As for
std::fs::ReadDir::root, we would have to create a wrapper accessor method per platform implementation ofReadDirstruct.Alternatives
The current working approach I can think of right now for iterative implementations that perform work on a given directory and nested directories is to use a tuple that holds the path and the associated
ReadDiriterator; you can also get the parent of the child paths throughPath::parent, but this doesn't seem ideal and won't be useful if you exhaust theReadDiriterator. Otherwise, just use a recursive implementation.The downside with the solution I propose is that it introduces more function that we need to implement for different platforms, and for any future platforms that we support.
Links and related work
What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: