
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] MPI_Com_spawn
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-10-29 10:49:39


Hi George

Your help is welcome! See below for some thoughts

On Oct 29, 2008, at 8:12 AM, George Bosilca wrote:

> Thanks Ralph, this indeed fixed my problem. However, I ran into more
> trouble ...
>
> I have a simple application that keeps spawning MPI processes,
> exchanges some data, and then the children disconnect and vanish. I
> keep doing this in a loop ... absolutely legal from the MPI
> standard's perspective. However, with the Open MPI trunk I run into
> two kinds of trouble:
>
> 1. I run out of fds. Apparently the orteds don't close the
> connections when the children disconnect, and after a few iterations
> I exhaust the available fds, the orteds start complaining, and
> everything ends up being killed. If I check with lsof I can see the
> pending fds (in an invalid state) still attached to the orted.
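
(The kind of loop being described is roughly the following - a minimal
sketch, not George's actual test program; the "./slave" name is borrowed
from the error output quoted later in this message, and the broadcast
just stands in for "exchange some data".)

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      for (int i = 0; i < 100; i++) {
          MPI_Comm child;
          int data = i;

          /* spawn a couple of children each time around the loop */
          MPI_Comm_spawn("./slave", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                         0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);

          /* exchange some data over the intercommunicator */
          MPI_Bcast(&data, 1, MPI_INT, MPI_ROOT, child);

          /* the children disconnect and exit; the parent keeps
             looping and spawning again */
          MPI_Comm_disconnect(&child);
      }

      MPI_Finalize();
      return 0;
  }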

Good point - this was actually the case with the old system too, IIRC.
We didn't have a mechanism by which the orted could reach down into
the iof and "close" the file descriptors from a child process when it
terminates.

Here is what I suggest:

1. In orte/mca/odls/base/odls_base_default_fns.c, in the
odls_base_default_wait_local_proc function, add a call to
orte_iof.close(child->name).

2. In orte/mca/iof/orted/iof_orted.c:orted_close, look up the read
events and sinks that refer to that process and close those fds. Be
sure to also terminate the read events, clean up any outputs still on
the sink's write event, and release those objects (see the sketch
below).

That should do the trick.
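
Roughly, something like this (a sketch only - the orte_iof.close call
and the cleanup steps are from the suggestion above; any other names are
illustrative, not the actual odls/iof internals):

  /* (1) orte/mca/odls/base/odls_base_default_fns.c,
   *     in odls_base_default_wait_local_proc(): once the child has been
   *     declared terminated, tell the IOF to drop its descriptors. */
  orte_iof.close(child->name);

  /* (2) orte/mca/iof/orted/iof_orted.c, new orted_close() entry point:
   *     for every read event and sink this orted holds that refers to
   *     the named process:
   *       - stop (delete) the read event so the fd is no longer polled,
   *       - close the underlying file descriptor,
   *       - discard any output still queued on the sink's write event,
   *       - OBJ_RELEASE the read-event and sink objects. */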

>
>
> 2. I tried to be helpful and provide a host file describing the
> cluster. I even annotated the nodes with the number of slots and
> max-slots. When we spawn processes we correctly load-balance them
> across the available nodes, but when they finish we do not release
> the resources. After a few iterations we run out of available nodes,
> and the application exits with the following error:
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2
> slots
> that were requested by the application:
> ./slave
>
> Either request fewer slots for your application, or make more slots
> available
> for use.
> --------------------------------------------------------------------------
>
> However, at this point there is only one MPI process running, the
> master. All other resources are fully available for the children.

This isn't an IOF issue, but rather a problem in how we track resource
usage in mpirun. When a job completes, we don't "release" its
resources back to the node pool.

Been that way since day one, now that I think about it - just nobody
noticed! :-)

Here is what I suggest. In orte/mca/plm/base/plm_base_launch_support.c:
orte_plm_base_check_job_completed - this is where we detect that a job
has actually completed. You could add a function call here to a new
routine that:

1. calls orte_rmaps.get_job_map(job) to get the map for this job -
that will tell you exactly which nodes, and how many slots on each of
those nodes, were used

2. the nodes in the map are stored as pointers to the corresponding
orte_node_t objects in the global orte_node_pool. So all you would need
to do is cycle through the resulting array of node pointers,
decrementing their slots_in_use by the appropriate amount (see the
sketch below).

That should do the trick. I can't think of anything else that would be
required, though I can't swear I didn't miss something.
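
Something along these lines (a sketch only - get_job_map, orte_node_t,
and slots_in_use are taken from the description above; procs_on_node()
is a hypothetical stand-in for however you determine how many procs of
this job landed on each node, and the exact struct members may differ):

  /* Called from orte_plm_base_check_job_completed() once the job is
   * seen to be done. Hypothetical sketch - field and helper names may
   * not match the actual structs. */
  static void release_job_resources(orte_jobid_t job)
  {
      orte_job_map_t *map;
      orte_node_t *node;
      orte_std_cntr_t i;

      /* get the map that was used to place this job's procs */
      if (NULL == (map = orte_rmaps.get_job_map(job))) {
          return;
      }

      /* the map stores pointers to the orte_node_t objects in the
       * global orte_node_pool, so adjusting them here updates the
       * pool itself */
      for (i = 0; i < map->num_nodes; i++) {
          if (NULL == (node = (orte_node_t *)
                       opal_pointer_array_get_item(map->nodes, i))) {
              continue;
          }
          /* give back the slots this job was occupying on that node;
           * procs_on_node() is a hypothetical helper for "how many
           * procs of this job ran here" */
          node->slots_in_use -= procs_on_node(map, node);
          if (node->slots_in_use < 0) {
              node->slots_in_use = 0;
          }
      }
  }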

Thanks!
Ralph

>
>
> I would like to get involved in this and help fix the two problems.
> But I have a hard time figuring out where to start. Any pointers
> will be welcomed.
>
> Thanks,
> george.
>
> On Oct 28, 2008, at 10:50 AM, Ralph Castain wrote:
>
>> Done...r19820
>>
>> On Oct 28, 2008, at 8:37 AM, Ralph Castain wrote:
>>
>>> Yes, of course it does - the problem is in a sanity check I just
>>> installed over the weekend.
>>>
>>> Easily fixed...
>>>
>>>
>>> On Oct 28, 2008, at 8:33 AM, George Bosilca wrote:
>>>
>>>> Ralph,
>>>>
>>>> I ran into trouble with the new IO framework when I spawn a new
>>>> process. The following error message is dumped and the job is
>>>> aborted.
>>>>
>>>> --------------------------------------------------------------------------
>>>> The requested stdin target is out of range for this job - it points
>>>> to a process rank that is greater than the number of process in the
>>>> job.
>>>>
>>>> Specified target: INVALID
>>>> Number of procs: 2
>>>>
>>>> This could be caused by specifying a negative number for the stdin
>>>> target, or by mistyping the desired rank. Please correct the cmd
>>>> line
>>>> and try again.
>>>> --------------------------------------------------------------------------
>>>>
>>>> Is the new IO framework supposed to support MPI-2 dynamics?
>>>>
>>>> Thanks,
>>>> george.
>>>>