Previously, `initialize()` allowed creating the MPI COMM_WORLD after `import distributed`:
```python
from distributed import Client, Nanny, Scheduler
from distributed.utils import import_term
...
def initialize(
...):
    if comm is None:
        from mpi4py import MPI
        comm = MPI.COMM_WORLD
```
However, as I tested on large-scale clusters with hundreds of nodes, the `initialize()` function hangs as the number of workers increases, typically once it exceeds 32, because some worker processes fail to receive the broadcast scheduler address.
After a long time of debugging, I found that this is strongly related to the order of `import mpi4py` and `import distributed` (or `from distributed import`). My guess is that `distributed` first makes some communication environment settings, which then conflict when `mpi4py` tries to bootstrap `MPI.COMM_WORLD` afterwards.
By strictly requiring the user to create `MPI.COMM_WORLD` before calling `initialize()`, the problem no longer occurs. In my tests it scales out to more than 128 workers (maybe more; my resources are limited) without any hanging.
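The safe ordering described above can be sketched as follows. This is a hedged illustration, not the PR's exact code: the `start_dask_over_mpi` function name and the `--launch` flag are hypothetical, and the sketch assumes `dask_mpi.initialize()` accepts a `comm=` parameter as shown in the snippet quoted earlier.

```python
import sys


def start_dask_over_mpi():
    # The crucial detail from the PR: import mpi4py and create COMM_WORLD
    # *before* anything from distributed/dask_mpi is imported.
    from mpi4py import MPI
    comm = MPI.COMM_WORLD

    from dask_mpi import initialize  # pulls in distributed only now

    # Passing the pre-created communicator avoids the internal comm=None
    # path that hung at larger worker counts.
    initialize(comm=comm)
    return comm


if __name__ == "__main__" and "--launch" in sys.argv:
    # Intended to run under an MPI launcher, e.g. `mpirun -n 64 python ...`
    start_dask_over_mpi()
```

The point is purely about import order: `mpi4py` bootstraps MPI first, and `distributed` is only imported afterwards, so its environment setup cannot interfere with the creation of `MPI.COMM_WORLD`.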
@YJHMITWEB: You are proposing that `comm=None` (the current default) be entirely disallowed, even though it works in many setups. This seems a bit restrictive to me. Is it fair to say that the documentation should mention that setting the `comm` parameter to an existing `mpi4py.MPI.Intracomm` can prevent hanging on some systems when scaling to large numbers of MPI ranks? Perhaps this should be less of a Draconian modification and more of a documentation update.
Another modification that I might be in favor of is to keep the current design but check `comm.Get_size()` after the internal comm is created (i.e., in the `comm=None` case). If `comm.Get_size()` is larger than the limit you have tested (e.g., above 32), then issue a warning telling the user why they might be experiencing hanging (and what the fix is).
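The reviewer's warning idea could look roughly like the sketch below. The helper name `warn_if_large_comm` and the threshold constant are hypothetical; in the real `comm=None` branch the size would come from `comm.Get_size()` on the freshly created `MPI.COMM_WORLD`, and the 32-rank threshold is the scale at which the PR author observed hangs.

```python
import warnings

# Scale at which hangs were observed in the PR author's tests (assumption:
# the project might want to make this configurable rather than hard-coded).
SAFE_SIZE_LIMIT = 32


def warn_if_large_comm(comm_size, limit=SAFE_SIZE_LIMIT):
    """Warn when an internally created communicator exceeds the tested size.

    Returns True if a warning was issued, False otherwise.
    """
    if comm_size > limit:
        warnings.warn(
            f"initialize() created MPI.COMM_WORLD internally with "
            f"{comm_size} ranks (more than {limit}). On some systems this "
            "can hang because workers fail to receive the broadcast "
            "scheduler address. Consider importing mpi4py first and "
            "passing an existing MPI.Intracomm via the comm= parameter.",
            UserWarning,
        )
        return True
    return False
```

In the `comm=None` branch this would be called right after `comm = MPI.COMM_WORLD`, with `warn_if_large_comm(comm.Get_size())`, so small runs keep the convenient default while large runs get an actionable hint instead of a silent hang.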