Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] mtt IBM SPAWN error
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-06-30 09:01:35

That's correct - and is precisely the behavior it should exhibit. The reasons:

1. when you specify -host, we assume max_slots is infinite since you cannot
provide any info to the contrary. We therefore allow you to oversubscribe
the node to your heart's desire. However, note one problem: if your original
launch is only one proc, we will set it to be aggressive in terms of
yielding the processor. Your subsequent comm_spawn'd procs will therefore
suffer degraded performance if they oversubscribe the node.

Can't be helped - there is no way to pass enough info with -host for us to
do better.

2. when you run with -hostfile, your hostfile is telling us to allow no more
than 4 procs on the node. You used three in your original launch, leaving
only one slot available. Since each of the procs in the IBM test attempts to
spawn another, your job will fail.
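
To make the slot arithmetic in the two cases concrete, here is a rough
sketch in Python. It is purely illustrative and is not the actual mapper
code: the function name spawn_fits and its parameters are made up for this
example, and it assumes each of the 3 original procs spawns exactly one
child, as the IBM test does.

```python
import math


def spawn_fits(max_slots, initial_procs, spawned_procs):
    """Return True if the spawned procs fit under the node's slot ceiling.

    Hypothetical helper for illustration only; it models nothing beyond
    the simple count described in the two cases above.
    """
    free_slots = max_slots - initial_procs
    return spawned_procs <= free_slots


# Case 1: -H witch2. No slot info can be given, so the ceiling is
# treated as unbounded and oversubscription is allowed.
host_case = spawn_fits(math.inf, initial_procs=3, spawned_procs=3)

# Case 2: hostfile "witch2 slots=4 max_slots=4". Three procs are already
# running, so only one slot is left for the three spawned procs.
hostfile_case = spawn_fits(4, initial_procs=3, spawned_procs=3)
```

Under these assumptions host_case comes out true and hostfile_case false,
which matches the "not enough slots available" message for the 3 requested
slots.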

We can always do more to improve the error messaging...

On 6/30/08 12:38 AM, "Lenny Verkhovsky" <lenny.verkhovsky_at_[hidden]> wrote:

> Hi,
> While running MTT, I failed to run the IBM spawn test. It fails only when
> using a hostfile, and not when using a host list.
> ( OMPI from TRUNK )
> This is working :
> #mpirun -np 3 -H witch2 dynamic/spawn
> This Fails:
> # cat hostfile
> witch2 slots=4 max_slots=4
> #mpirun -np 3 -hostfile hostfile dynamic/spawn
> [witch1:12392]
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
> dynamic/spawn
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
> [witch1:12392]
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
> There may be more information reported by the environment (see above).
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
> Using hostfile1 also works
> #cat hostfile1
> witch2
> witch2
> witch2
> Best Regards
> Lenny.