
Subject: Re: [OMPI users] Displaying Selected MCA Modules
From: Joshua Bernstein (jbernstein_at_[hidden])
Date: 2008-06-24 21:13:18


Ralph Castain wrote:
> Hmmm....well, the problem is as I suspected. The system doesn't see any
> allocation of nodes to your job, and so it aborts with a crummy error
> message that doesn't really tell you the problem. We are working on
> improving them.
>
> How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain
> info on the nodes that are to be used?

BEOWULF_JOB_MAP is an array of integers, separated by colons, that
contains the node mapping information. The easiest way to explain it is
with an example:

BEOWULF_JOB_MAP=0:0

This is a two-process job, with each process running on node 0.

BEOWULF_JOB_MAP=0:1:1

A three-process job, with the first process on node 0 and the next two
on node 1.
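
For illustration only, here is a minimal C sketch of how a launcher
could turn that string into a rank-to-node mapping (hypothetical code,
not part of Scyld or Open MPI):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: print the node each rank would run on,
 * based on the colon-separated BEOWULF_JOB_MAP described above. */
int main(void)
{
    const char *map = getenv("BEOWULF_JOB_MAP");
    if (map == NULL) {
        fprintf(stderr, "BEOWULF_JOB_MAP is not set\n");
        return 1;
    }

    char *copy = strdup(map);   /* strtok modifies its argument */
    int rank = 0;
    for (char *tok = strtok(copy, ":"); tok != NULL; tok = strtok(NULL, ":")) {
        printf("rank %d -> node %s\n", rank, tok);
        rank++;
    }
    printf("total processes: %d\n", rank);

    free(copy);
    return 0;
}

With BEOWULF_JOB_MAP=0:1:1 this would report rank 0 on node 0 and ranks
1 and 2 on node 1, matching the second example above.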

All said, this is of little consequence right now, and we/I can worry
about adding support for it later.

> One of the biggest headaches with bproc is that there is no adhered-to
> standard for describing the node allocation. What we implemented will
> support LSF+Bproc (since that is what was being used here) and BJS. It
> sounds like you are using something different - true?

Understood. We aren't using BJS; we deprecated BJS long ago in favor of
bundling TORQUE with Scyld instead, though legacy support for envars
like NP, NO_LOCAL, and BEOWULF_JOB_MAP is present in the MPICH
extensions we've put together.

> If so, we can work around it by just mapping enviro variables to what the
> system is seeking. Or, IIRC, we could use the hostfile option (have to check
> on that one).

Exactly. For now, if I make sure the NODES envar is set up correctly,
make sure the Open MPI installation is NFS mounted, and copy out the
MCA libraries by hand (libcache doesn't seem to work), I actually end
up with something running!

[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.005377
Process 1 on n0
Hangup

It seems the -H option and using a hostfile aren't honored with BProc,
correct? So the only thing I can use to derive the host mapping with
BProc support is the BJS RAS MCA (via the NODES envar?)

-Josh