Are you perchance oversubscribing your nodes?
Open MPI does not currently cope well when you initially
undersubscribe your nodes but then, through spawning, end up
oversubscribing them. In that case, OMPI keeps polling aggressively in
all processes, not realizing that the node is now oversubscribed and
that it should be yielding the processor so that other processes can run.
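If that is what is happening, one thing worth trying is to tell Open
MPI up front to yield when idle, via the mpi_yield_when_idle MCA
parameter. A minimal sketch, assuming your build exposes that parameter
and that the spawned children pick up the setting (I have not verified
this for your setup); "./parent" is only a placeholder for your own
spawning program:

```shell
# Ask Open MPI to yield the processor when idle instead of spinning.
# "./parent" is a hypothetical name for the program that calls spawn.
mpirun --mca mpi_yield_when_idle 1 -np 1 ./parent
```

The same parameter can also be set through the environment as
OMPI_MCA_mpi_yield_when_idle=1. Yielding costs some latency, so this is
a diagnostic and a workaround rather than a fix.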
On Oct 30, 2007, at 10:57 AM, Murat Knecht wrote:
> Does someone know whether there is a special requirement on the order
> of spawning processes and the subsequent merge of the
> intercommunicators?
> I have two hosts, let's name them local and remote, and a parent
> process on local that goes on spawning one process on each of the two
> hosts. After each spawn, the parent process and all existing children
> participate in merging the created intercommunicator into an
> intracommunicator that connects - in the end - all three processes.
> The weird thing is, though, that when I spawn them in the order local,
> remote, at the second (the last) spawn all three processes block when
> encountering MPI_Merge. However, when I switch the order around to
> spawning first the process on remote and then on local, everything
> works out: the two processes are spawned and the intracommunicators
> created from the merge. Everything goes well, too, if I decide to spawn both
> processes on either one of the machines. (The existing children are
> informed via a message that they shall participate in the Spawn and
> Merge since these are collective operations.)
> Is there some implicit developer-level knowledge that explains why the
> order defines the outcome? Logically, there ought to be no difference.
> Btw, I work with two Linux nodes and an ordinary Ethernet-TCP
> connection between them.