Open MPI User's Mailing List Archives

From: Ralph Castain (rhc_at_[hidden])
Date: 2007-04-02 13:00:27


Hi Prakash

The MPI_ERR_ARG message is telling you that one of the arguments passed to
MPI_Comm_spawn itself is invalid. I am no expert there, so I'll have to let
someone else identify which one.

There are no limits to launching on nodes in a hostfile - they are all
automatically considered "allocated" when the file is read. If you had the
node name in the file, then there is no "dynamic" addition of nodes going
on.
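
Since every host in the file counts as allocated, a quick sanity check is to
confirm that the spawn target really is listed there. The sketch below is
only an illustration of that check, not how OpenRTE parses hostfiles. It
assumes the usual layout of one host name per line, optionally followed by
fields such as "slots=2", with "#" starting a comment. The function name
host_listed is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if `host` is the first token on any line of `hostfile_text`.
 * Assumed format: one host per line, optional trailing fields, '#' comments. */
bool host_listed(const char *hostfile_text, const char *host)
{
    size_t host_len = strlen(host);
    const char *line = hostfile_text;

    if (host_len == 0) {
        return false;
    }

    while (*line != '\0') {
        size_t line_len = strcspn(line, "\n");

        /* Skip leading blanks; strspn stops at '\n', so we stay on this line. */
        const char *tok = line + strspn(line, " \t");

        /* The host name ends at whitespace, a comment marker, or end of line. */
        size_t tok_len = strcspn(tok, " \t\r\n#");

        if (tok_len == host_len && strncmp(tok, host, host_len) == 0) {
            return true;
        }

        /* Advance to the start of the next line, if there is one. */
        line += line_len;
        if (*line == '\n') {
            line++;
        }
    }
    return false;
}
```

Reading machinefile into a string and calling host_listed(text, "wins08")
would show whether the target host is part of what mpirun was given.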

In the meantime, I am going to send you a different solution for dynamically
adding nodes under Torque (or any other resource manager).

Ralph

On 4/2/07 10:53 AM, "Prakash Velayutham" <Prakash.Velayutham_at_[hidden]>
wrote:

> Hello,
>
> Thanks for the patch. I still do not know the internals of Open MPI, so can't
> test this right away. But here is another test I ran and that failed too.
>
> I have now removed Torque from the equation. I am NOT requesting nodes through
> Torque. I SSH to a compute node and start up the code as below.
>
> prakash_at_wins04:~/thesis/CS/Samples>mpirun -np 4 --bynode --hostfile
> machinefile ./parallel.laplace
>
> [wins01:17699] *** An error occurred in MPI_Comm_spawn
> [wins01:17699] *** on communicator MPI_COMM_WORLD
> [wins01:17699] *** MPI_ERR_ARG: invalid argument of some other kind
> [wins01:17699] *** MPI_ERRORS_ARE_FATAL (goodbye)
> mpirun noticed that job rank 1 with PID 25074 on node wins02 exited on signal
> 15 (Terminated).
> 2 additional processes aborted (not shown)
>
> What happened here? Why would orted not let me spawn on new nodes? What kind
> of restrictions apply in this case? I even have the new node name in the
> hostfile (machinefile), just in case.
>
> Thanks,
> Prakash
>
>
>>>> jbuisson_at_[hidden] 04/02/07 12:34 PM >>>
> Ralph Castain wrote:
>> The runtime underneath Open MPI (called OpenRTE) will not allow you to spawn
>> processes on nodes outside of your allocation. This is for several reasons,
>> but primarily because (a) we only know about the nodes that were allocated,
>> so we have no idea how to spawn a process anywhere else, and (b) most
>> resource managers wouldn't let us do it anyway.
>>
>> I gather you have some node that you know about and have hard-coded into
>> your application? How do you know the name of the node if it isn't in your
>> allocation??
>
> Because I can give those names to OpenMPI (or OpenRTE, or whatever). I
> would also like to do the same, and I don't want OpenMPI to restrict me to
> what it thinks the allocation is, when I am sure I know better than it
> does what I am doing.
> The concept of nodes being in allocations fixed at launch time is really
> rigid, and it prevents the application (or anything else) from modifying
> the allocation at runtime, which could be quite nice.
>
> Here is an ugly patch I quickly wrote for my own use. It changes the
> round-robin rmaps so that it first allocates the hosts to the rmgr, using
> code copied and pasted from the dash_host ras component. It is far from
> bug-free, but it can be a starting point for hacking.
>
> Jeremy
>
>> Ralph
>>
>>
>> On 4/2/07 10:05 AM, "Prakash Velayutham" <Prakash.Velayutham_at_[hidden]>
>> wrote:
>>
>>> Hello,
>>>
>>> I have built Open MPI (1.2) with run-time environment enabled for Torque
>>> (2.1.6) resource manager. Initially I am requesting 4 nodes (1 CPU each)
>>> from Torque. Then, from inside my MPI code, I am trying to spawn more
>>> processes on nodes outside of the Torque-assigned nodes using
>>> MPI_Comm_spawn, but this is failing with the error below:
>>>
>>> [wins04:13564] *** An error occurred in MPI_Comm_spawn
>>> [wins04:13564] *** on communicator MPI_COMM_WORLD
>>> [wins04:13564] *** MPI_ERR_ARG: invalid argument of some other kind
>>> [wins04:13564] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>> mpirun noticed that job rank 1 with PID 15070 on node wins03 exited on
>>> signal 15 (Terminated).
>>> 2 additional processes aborted (not shown)
>>>
>>> #################################
>>>
>>> MPI_Info info;
>>> MPI_Comm comm, *intercomm;
>>> ...
>>> ...
>>> char *key, *value;
>>> key = "host";
>>> value = "wins08";
>>> rc1 = MPI_Info_create(&info);
>>> rc1 = MPI_Info_set(info, key, value);
>>> rc1 = MPI_Comm_spawn(slave,MPI_ARGV_NULL, 1, info, 0,
>>> MPI_COMM_WORLD, intercomm, arr);
>>> ...
>>> }
>>>
>>> ###################################################
>>>
>>> Would this work as it is or is something wrong with my assumption? Is
>>> OpenRTE stopping me from spawning processes outside of the initially
>>> allocated nodes through Torque?
>>>
>>> Thanks,
>>> Prakash
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users