Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-05-02 07:48:42


I guess I am now totally confused, so I will have to ask for your patience
with a few questions.

On 5/2/07 4:55 AM, "Ole Holm Nielsen" <Ole.H.Nielsen_at_[hidden]> wrote:

> Ralph Castain wrote:
>> We would consider it a "feature" that OpenMPI is integrated with Torque. We
>> actually read the PBS_NODEFILE internally ourselves. I believe the problem
>> here is that specifying the "machinefile" prevents us from using that
>> Torque-integrated code and forces us down a different code path that doesn't
>> correctly interpret the PBS_NODEFILE format.
>>
>> We probably should consider your observation a "bug" - frankly, it wasn't
>> something anyone anticipated a user ever doing, so nobody I know of ever
>> tested it. I'd have to dig into the internals to understand how you wound up
>> in that particular error mode.
>
> I'd say that this behavior of mpirun under Torque TM should be considered as
> a bug. Ideally, users should not have to design their scripts differently
> according to whether the sysadmin decided to configure in TM or not.
> Also, for interactive tests one doesn't have TM. I think that mpirun just
> ought to work no matter what...
>

In your prior notes, you indicated that you had in fact configured TM
support into OpenMPI. The issue, therefore, was that you were somehow
getting an error from TM indicating that the tm spawn command was unable to
launch our daemon on a specified node.

Your comment above, however, talks about the problem of NOT having TM
configured into OpenMPI, even though you are running on a Torque-based
system. This is a significantly different scenario - could you please
clarify?

BTW: We run interactive tests under TM all the time - there is no TM
requirement prohibiting you from this mode of operation. I would guess,
therefore, that this may be something your sysadmin has imposed.

Given your comment, however, I must ask: did you get an allocation for the
nodes in your PBS_NODEFILE prior to executing mpirun?

I need to know since your observed errors could easily be explained by an
attempt to execute on nodes that are not allocated to you. For example, if
you either used a PBS_NODEFILE from a prior (possibly batch) execution, or
created one yourself, Torque would refuse to execute on the specified nodes
since they aren't allocated to you - i.e., the system would refuse to
execute the specified executable on that node because you don't have
permission to do so.

In that case, we could improve the error message, but the system is actually
doing everything correctly.
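
One way to rule this scenario out before calling mpirun is to compare the
hosts in the machinefile against the hosts Torque actually allocated. The
following is a minimal Python sketch, assuming the current allocation is
described by the file named in the PBS_NODEFILE environment variable and that
both files put the hostname first on each line. The helper names are
hypothetical, and the comparison is a plain string match, so short names and
fully qualified names would not be treated as equal.

```python
import os


def read_hosts(path):
    """Return the set of hostnames (first token of each non-comment line)."""
    with open(path) as f:
        return {
            line.split()[0]
            for line in f
            if line.strip() and not line.lstrip().startswith("#")
        }


def unallocated_hosts(machinefile, nodefile=None):
    """List machinefile hosts that are absent from the Torque nodefile."""
    nodefile = nodefile or os.environ.get("PBS_NODEFILE")
    if not nodefile or not os.path.exists(nodefile):
        # No usable nodefile: most likely we are not inside an allocation.
        raise RuntimeError("PBS_NODEFILE is not set or does not exist")
    return sorted(read_hosts(machinefile) - read_hosts(nodefile))
```

An empty result means every host in the machinefile is part of the current
allocation; any names returned are nodes that Torque would be expected to
refuse.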

Appreciate the help in tracking this down.

> So I'd strongly propose that "-machinefile" should at least be tolerated
> when mpirun executes under TM. You might issue a warning about -machinefile
> being ignored under TM, but the code should never bomb out, IMHO.

It didn't "bomb"; it simply printed an error message (perhaps to be
improved) and exited, which is (IMHO) correct behavior. ;-)

> Such behavior would be much easier for users (and sysadmins :-) to
> understand than the present situation.
>
> Thanks again,
> Ole