Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI_Comm_spawn_multiple
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-02-21 18:05:09


I very much doubt that either of those mappers has ever been tested against comm_spawn. Just glancing through them, I don't see an immediate reason why loadbalance wouldn't work, but the error indicates that the system wound up mapping one or more processes to an unknown node.

We are revising the mappers at this time, so I doubt we'll try to fix it for 1.5.2. You might try the 1.4 series to see if it behaves differently, though I suspect those mappers weren't tested against comm_spawn there either.
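For context, a minimal sketch of the kind of MPI_Comm_spawn_multiple call under discussion: a manager pinning one worker to each node via the standard "host" Info key. This is an illustrative reconstruction, not the poster's actual code; the worker binary name and host names are taken from the log output below, and the attached tarball contains the real source.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Illustrative only: two commands (the same worker binary), one
       instance each, each directed to a host via the "host" Info key. */
    char *cmds[2]     = { "./mpi2_worker", "./mpi2_worker" };
    int   maxprocs[2] = { 1, 1 };
    MPI_Info infos[2];

    MPI_Info_create(&infos[0]);
    MPI_Info_set(infos[0], "host", "cu2n29");  /* host names from the log */
    MPI_Info_create(&infos[1]);
    MPI_Info_set(infos[1], "host", "cu2n30");

    MPI_Comm intercomm;
    int errcodes[2];
    MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, maxprocs, infos,
                            0 /* root rank */, MPI_COMM_WORLD,
                            &intercomm, errcodes);

    /* Report any per-command spawn failures. */
    for (int i = 0; i < 2; i++)
        if (errcodes[i] != MPI_SUCCESS)
            fprintf(stderr, "spawn of command %d failed\n", i);

    MPI_Info_free(&infos[0]);
    MPI_Info_free(&infos[1]);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}
```

The call itself is independent of how the parents were mapped; the failure reported below appears only when the parent job was launched with -loadbalance or -npernode, which points at the mapper state rather than the spawn arguments.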

On Feb 21, 2011, at 12:59 PM, Skouson, Gary B wrote:

> I'm trying to use MPI_Comm_spawn_multiple and it doesn't seem to always work like I'd expect.
>
> The simple test code I have starts a couple of master processes and then tries to spawn a couple of worker processes on each of the nodes running the master processes.
>
> I was using 1.5.1, but gave 1.5.2rc2 a try too.
>
> If I do:
> [skouson_at_cu2n29 mpi2_example]$ mpirun -hostfile hostfile -n 2 -bynode ./mpi2_manager
> MPI Initialized=0, Finalized=0
> MPI Initialized=0, Finalized=0
> I'm manager 0 of 2 on cu2n29 running MPI 2.1
> setting up host cu2n29 - ./mpi2_worker
> setting up host cu2n30 - ./mpi2_worker
> Spawning 2 worker processes running ./mpi2_worker
> Sleeping for a bit...
> I'm manager 1 of 2 on cu2n30 running MPI 2.1
> **** I'm worker 0 of 2 on cu2n29 running MPI 2.1
> **** Worker 0: number of parents = 2
> **** Worker 0: Success!
> **** I'm worker 1 of 2 on cu2n30 running MPI 2.1
> **** Worker 1: number of parents = 2
> **** Worker 1: Success!
> **** Worker 0: Value recd = 25
> 1: MPI Initialized=1, Finalized=1
> 0: MPI Initialized=1, Finalized=1
>
> It seems to work as expected. However, if I use -loadbalance or -npernode 1 rather than the -bynode flag, I get an obscure error and things hang until I ctrl-c out of it.
>
> [skouson_at_cu2n29 mpi2_example]$ mpirun -hostfile hostfile -n 2 -loadbalance ./mpi2_manager
> MPI Initialized=0, Finalized=0
> MPI Initialized=0, Finalized=0
> I'm manager 1 of 2 on cu2n30 running MPI 2.1
> I'm manager 0 of 2 on cu2n29 running MPI 2.1
> setting up host cu2n29 - ./mpi2_worker
> setting up host cu2n30 - ./mpi2_worker
> Spawning 2 worker processes running ./mpi2_worker
> Sleeping for a bit...
> [cu2n29:03088] [[62875,0],0] ORTE_ERROR_LOG: Not found in file base/odls_base_default_fns.c at line 906
> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>
> My environment has:
> OMPI_MCA_btl_openib_ib_retry_count=7
> OMPI_MCA_mpi_keep_peer_hostnames=1
> OMPI_MCA_btl_openib_ib_timeout=31
>
> I've included the sample code, along with config.log etc.
>
> If anyone can point out what I'm missing to be able to run with the -loadbalance flag, I'd appreciate it.
>
> -----
> Gary Skouson
> <mpi2_example.tar.bz2>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users