On 5/2/07 1:28 AM, "Ole Holm Nielsen" <Ole.H.Nielsen_at_[hidden]> wrote:
> Bas hit the nail on the head: When using OpenMPI's mpirun under
> Torque TM one apparently *must* omit the "-machinefile $PBS_NODEFILE"
> flags and only specify "-np 2", presumably because TM knows all
> about the machines under its control.
> This behavior is new to me: Is this a feature or a bug in OpenMPI?
> At least a better behavior of mpirun could be expected when you
> specify both -np and -machinefile.
Thanks Bas - I spaced out on the command line.
We would consider it a "feature" that OpenMPI is integrated with Torque. We
actually read the PBS_NODEFILE internally ourselves. I believe the problem
here is that specifying the "machinefile" prevents us from using that
Torque-integrated code and forces us down a different code path that doesn't
correctly interpret the PBS_NODEFILE format.
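
To see what Torque actually handed to a job before launching, something like the following may help. It is only a rough sketch, assuming the usual PBS_NODEFILE layout of one hostname per line, repeated once per allocated slot. The helper name summarize_nodefile is made up for illustration and is not part of OpenMPI or Torque. The launch line is left commented out and simply mirrors the working command quoted below, with no -machinefile flag.

```shell
# Summarise a Torque node file as one "host slots=N" line per distinct host.
# Assumption: the file lists one hostname per line, repeated per slot.
# summarize_nodefile is a hypothetical helper, not an OpenMPI or Torque tool.
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2 " slots=" $1 }'
}

# Inside a Torque job, print the allocation for inspection only.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    summarize_nodefile "$PBS_NODEFILE"
    # Then launch without -machinefile so the TM integration is used:
    # mpiexec -np 2 ./a.out
fi
```

The summary is for checking the allocation by eye; it is not meant to be passed back to mpirun.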
We probably should consider your observation a "bug" - frankly, it wasn't
something anyone anticipated a user ever doing, so nobody I know of ever
tested it. I'd have to dig into the internals to understand how you wound up
in that particular error mode.
> Bas van der Vlies wrote:
>> You must use the following command:
>> $ mpiexec -np 2 ./a.out
>> whello, i am 0 of 2
>> whello, i am 1 of 2
>> all is well that ends well
>> $ mpiexec -np 2 -machinefile $PBS_NODEFILE ./a.out
>> [ib-r6n19.irc.sara.nl:04999] pls:tm: failed to poll for a spawned proc,
>> return status = 17002
>> [ib-r6n19.irc.sara.nl:04999] [0,0,0] ORTE_ERROR_LOG: In errno in file
>> rmgr_urm.c at line 462
>> [ib-r6n19.irc.sara.nl:04999] mpiexec: spawn failed with errno=-11