Ralph Castain wrote:
> Hmmm....well, the problem is as I suspected. The system doesn't see any
> allocation of nodes to your job, and so it aborts with a crummy error
> message that doesn't really tell you the problem. We are working on
> improving them.
> How are you allocating nodes to the job? Does this BEOWULF_JOB_MAP contain
> info on the nodes that are to be used?
BEOWULF_JOB_MAP is a colon-separated list of integers that contains
node mapping information. The easiest way to explain it is by example:
a value of 0:0 is a two-process job, with each process running on
node 0, and a value of 0:1:1 is a three-process job with the first
process on node 0 and the next two on node 1.
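To make the format concrete, here is a minimal Python sketch of how such
a map could be read, assuming only what is described above: one integer
node number per process, separated by colons, in rank order. The function
names and the sample strings "0:0" and "0:1:1" are illustrative and are
not taken from the Scyld or MPICH sources.

```python
from collections import Counter


def parse_job_map(job_map: str) -> list[int]:
    """Return the node number for each process, in rank order.

    Assumes a colon-separated list of integers, e.g. "0:1:1".
    """
    return [int(field) for field in job_map.split(":")]


def ranks_per_node(job_map: str) -> dict[int, int]:
    """Count how many processes the map places on each node."""
    return dict(Counter(parse_job_map(job_map)))


if __name__ == "__main__":
    # Hypothetical sample values matching the two jobs described above.
    for example in ("0:0", "0:1:1"):
        print(example, parse_job_map(example), ranks_per_node(example))
```

The list index is the process rank and the value is the node it runs on,
so the per-node counts fall out of a simple tally.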
All said, this is of little consequence right now, and we/I can worry
about adding support for this later.
> One of the biggest headaches with bproc is that there is no adhered-to
> standard for describing the node allocation. What we implemented will
> support LSF+Bproc (since that is what was being used here) and BJS. It
> sounds like you are using something different - true?
Understood. We aren't using BJS; we deprecated it long ago in favor of
bundling TORQUE with Scyld instead, though legacy support for envars
such as NP, NO_LOCAL, and BEOWULF_JOB_MAP is still present in the MPICH
extensions we've put together.
> If so, we can work around it by just mapping enviro variables to what the
> system is seeking. Or, IIRC, we could use the hostfile option (have to check
> on that one).
Exactly. For now, if I make sure the NODES envar is set up correctly,
make sure Open MPI is NFS mounted, and copy the MCA libraries out
myself (libcache doesn't seem to work), I actually end up with
something running!
[ats@goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 2 ./cpi
Process 0 on n0
pi is approximately 3.1416009869231241, Error is 0.0000083333333309
wall clock time = 0.005377
Process 1 on n0
It seems the -H option and hostfiles aren't honored with BProc,
correct? So the only thing I can use to derive the host mapping with
BProc support is the BJS RAS MCA (via the NODES envar)?
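As a rough illustration of the "map one environment variable onto
another" workaround discussed above, the sketch below derives a NODES
value from a BEOWULF_JOB_MAP string. It assumes that NODES is a
comma-separated list of node numbers and that each node should appear
once; both are assumptions that would need checking against what the
BJS support actually reads. The function name is hypothetical.

```python
def nodes_from_job_map(job_map: str) -> str:
    """Build a NODES-style string from a colon-separated job map.

    Assumption: NODES is a comma-separated list of distinct node
    numbers, kept in the order they first appear in the map.
    """
    seen: list[int] = []
    for field in job_map.split(":"):
        node = int(field)
        if node not in seen:
            seen.append(node)
    return ",".join(str(node) for node in seen)


if __name__ == "__main__":
    # Hypothetical three-process map: rank 0 on node 0, ranks 1-2 on node 1.
    print(nodes_from_job_map("0:1:1"))
```

Note that this translation drops the per-rank placement: it only says
which nodes are available, not which process goes where.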