Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Displaying Selected MCA Modules
From: Joshua Bernstein (jbernstein_at_[hidden])
Date: 2008-06-24 20:11:10


Ralph,

        I really appreciate all of your help and guidance on this.

Ralph H Castain wrote:
> Of more interest would be understanding why your build isn't working in
> bproc. Could you send me the error you are getting? I'm betting that the
> problem lies in determining the node allocation as that is the usual place
> we hit problems - not much is "standard" about how allocations are
> communicated in the bproc world, though we did try to support a few of the
> more common methods.

Alright, I've been playing around a bit more, and I think I'm starting
to understand what is going on. It seems that, for whatever reason, the
ORTE daemon is failing to launch on a remote node, and I'm left with:

[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
--------------------------------------------------------------------------
A daemon (pid 4208) launched by the bproc PLS component on node 0 died
unexpectedly so we are aborting.

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 717
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file pls_bproc.c at line 1164
[goldstar.penguincomputing.com:04207] [0,0,0] ORTE_ERROR_LOG: Error in
file rmgr_urm.c at line 462
[goldstar.penguincomputing.com:04207] mpirun: spawn failed with errno=-1
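
For reference, the advice in that help message amounts to something like
this on the head node (the install prefix here is just an example; it
would be whatever path the nodes actually use):

[ats_at_goldstar mpi]$ export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH
[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi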

So I take the advice suggested in the note and double-check to make
sure our library caching is working. It picks up the libraries nicely
once they are staged on the compute nodes, but now mpirun just dies:

[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[goldstar.penguincomputing.com:09335] [0,0,0] ORTE_ERROR_LOG: Not
available in file ras_bjs.c at line 247
[ats_at_goldstar mpi]$

I thought maybe it was actually working but I/O forwarding wasn't set up
properly; however, checking the exit code shows that it in fact crashed:

[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
[ats_at_goldstar mpi]$ echo $?
1

Any ideas here?

If I use the NODES envar, though, I can run a job on the head node:

[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi
Process 0 on goldstar.penguincomputing.com
pi is approximately 3.1416009869231254, Error is 0.0000083333333323
wall clock time = 0.000097
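
In other words, something along these lines (I'm glossing over the exact
node number; as I recall, under bproc the master, i.e. the head node, is
node -1):

[ats_at_goldstar mpi]$ export NODES=-1
[ats_at_goldstar mpi]$ mpirun --mca btl ^openib,udapl -np 1 ./cpi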

What is also interesting, and as you correctly suspected, is that only
the NODES envar is being honored; things like BEOWULF_JOB_MAP are not.
That is probably correct, as I imagine the BEOWULF_JOB_MAP envar is
Scyld-specific and likely not implemented. This isn't a big issue,
though; it's something I'll likely add later on.
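
(If memory serves, BEOWULF_JOB_MAP is just a colon-separated list with
one node number per rank, something like:

[ats_at_goldstar mpi]$ echo $BEOWULF_JOB_MAP
-1:0:0:1

so mapping it onto an allocation shouldn't be much work.)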

-Joshua Bernstein
Software Engineer
Penguin Computing