Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] MPI_Comm_spawn under Torque
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-20 23:30:12


On Feb 20, 2014, at 7:05 PM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:

> Thanks Ralph!
>
> I should have mentioned, though: without the Torque environment, spawning with ssh works fine. But under the Torque environment, it does not.

Ah, no - you forgot to mention that point.

>
> I started simple_spawn with 3 processes and spawned 9 processes (3 per node on 3 nodes).
>
> The Torque launch itself is not the problem, since all 9 processes are started on their respective nodes. But the parent's MPI_Comm_spawn and the children's MPI_Init "sometimes" don't return!

Seems odd - the launch environment has nothing to do with MPI_Init, so if the processes are indeed being started, they should run. One possibility is that they aren't correctly getting some wireup info.

Can you configure OMPI with --enable-debug and then rerun the example with "-mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5" on the command line?
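
For reference, the whole sequence would look something like the following (the install prefix and the path to simple_spawn are just placeholders for your own setup):

  ./configure --prefix=$HOME/ompi-debug --enable-debug
  make -j4 install
  mpiexec -np 3 -mca plm_base_verbose 5 -mca ess_base_verbose 5 -mca grpcomm_base_verbose 5 ./simple_spawn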

>
> This is the output of simple_spawn, which confirms the above:
>
> [pid 31208] starting up!
> [pid 31209] starting up!
> [pid 31210] starting up!
> 0 completed MPI_Init
> Parent [pid 31208] about to spawn!
> 1 completed MPI_Init
> Parent [pid 31209] about to spawn!
> 2 completed MPI_Init
> Parent [pid 31210] about to spawn!
> [pid 28630] starting up!
> [pid 28631] starting up!
> [pid 9846] starting up!
> [pid 9847] starting up!
> [pid 9848] starting up!
> [pid 6363] starting up!
> [pid 6361] starting up!
> [pid 6362] starting up!
> [pid 28632] starting up!
>
> Any hints?
>
> Best,
> Suraj
>
> On Feb 21, 2014, at 3:44 AM, Ralph Castain wrote:
>
>> Hmmm...I don't see anything immediately glaring. What do you mean by "doesn't work"? Is there some specific behavior you see?
>>
>> You might try the attached program. It's a simple spawn test we use - 1.7.4 seems happy with it.
>>
>> <simple_spawn.c>
>>
>> On Feb 20, 2014, at 10:14 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>
>>> I am using 1.7.4!
>>>
>>> On Feb 20, 2014, at 7:00 PM, Ralph Castain wrote:
>>>
>>>> What OMPI version are you using?
>>>>
>>>> On Feb 20, 2014, at 7:56 AM, Suraj Prabhakaran <suraj.prabhakaran_at_[hidden]> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I am having a problem using MPI_Comm_spawn under Torque. It doesn't work when spawning more than 12 processes across multiple nodes. To be more precise, "sometimes" it works and "sometimes" it doesn't!
>>>>>
>>>>> Here is my case: I obtain 5 nodes with 3 cores per node, and my $PBS_NODEFILE looks like this:
>>>>>
>>>>> node1
>>>>> node1
>>>>> node1
>>>>> node2
>>>>> node2
>>>>> node2
>>>>> node3
>>>>> node3
>>>>> node3
>>>>> node4
>>>>> node4
>>>>> node4
>>>>> node5
>>>>> node5
>>>>> node5
>>>>>
>>>>> I started a hello program (which just spawns itself; the spawned children, of course, don't spawn again) with
>>>>>
>>>>> mpiexec -np 3 ./hello
>>>>>
>>>>> Spawning 3 more processes (on node 2) - works!
>>>>> Spawning 6 more processes (nodes 2 and 3) - works!
>>>>> Spawning 9 processes (nodes 2, 3, 4) - "sometimes" OK, "sometimes" not!
>>>>> Spawning 12 processes (nodes 2, 3, 4, 5) - "mostly" not!
>>>>>
>>>>> Ideally, I want to spawn about 32 processes across a larger number of nodes, but at the moment this is impossible. I have attached my hello program to this email.
>>>>>
>>>>> I will be happy to provide any more info or verbose outputs if you could please tell me what exactly you would like to see.
>>>>>
>>>>> Best,
>>>>> Suraj
>>>>>
>>>>> <hello.c>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel