Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r17941
From: Ralph H Castain (rhc_at_[hidden])
Date: 2008-03-27 13:40:05


Appears fixed with r17992 - at least, it works on TM, slurm (odin), and Mac.

On 3/27/08 11:06 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:

> Found the problem - should have a fix committed soon. Issue is with
> differences in the number of daemons launched by the various plms (whether
> or not procs are launched local to mpirun).
>
>
>
> On 3/27/08 10:39 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
>
>> Hmmm...puzzling. It is working fine for me on TM machines and on my Mac.
>> However, Galen reports it borked on alps as well.
>>
>> I'll have to dig a little to check this out and see if there is something
>> missing on those PLMs. Will get back shortly.
>>
>> Sorry for the problem.
>>
>>
>> On 3/27/08 10:28 AM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>
>>> Unfortunately now with r17988 I cannot run any mpi programs, they seem
>>> to hang in the modex.
>>>
>>> Tim
>>>
>>> Ralph H Castain wrote:
>>>> Thanks Tim - I found the problem and will commit a fix shortly.
>>>>
>>>> Appreciate your testing and reporting!
>>>>
>>>>
>>>> On 3/27/08 8:24 AM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>
>>>>> This commit breaks things for me. Running on 3 nodes of odin:
>>>>>
>>>>> mpirun -mca btl tcp,sm,self examples/ring_c
>>>>>
>>>>> causes a hang. All of the processes are stuck in
>>>>> orte_grpcomm_base_barrier during MPI_Finalize. Not all programs hang,
>>>>> and the ring program does not hang all the time, but fairly often.
>>>>>
>>>>> Tim
>>>>>
>>>>> rhc_at_[hidden] wrote:
>>>>>> Author: rhc
>>>>>> Date: 2008-03-24 16:50:31 EDT (Mon, 24 Mar 2008)
>>>>>> New Revision: 17941
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/17941
>>>>>>
>>>>>> Log:
>>>>>> Fix the allgather and allgather_list functions to avoid deadlocks at large
>>>>>> node/proc counts. Violated the RML rules here - we received the allgather
>>>>>> buffer and then did an xcast, which causes a send to go out, and is then
>>>>>> subsequently received by the sender. This fix breaks that pattern by forcing
>>>>>> the recv to complete outside of the function itself - thus, the allgather and
>>>>>> allgather_list always complete their recvs before returning or sending.
>>>>>>
>>>>>> Reorganize the grpcomm code a little to provide support for soon-to-come new
>>>>>> grpcomm components. The revised organization puts what will be common code
>>>>>> elements in the base to avoid duplication, while allowing components that
>>>>>> don't need those functions to ignore them.
>>>>>>
>>>>>> Added:
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_allgather.c
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_barrier.c
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_modex.c
>>>>>> Text files modified:
>>>>>> trunk/orte/mca/grpcomm/base/Makefile.am | 5
>>>>>> trunk/orte/mca/grpcomm/base/base.h | 23 +
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_close.c | 4
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_open.c | 1
>>>>>> trunk/orte/mca/grpcomm/base/grpcomm_base_select.c | 121 ++---
>>>>>> trunk/orte/mca/grpcomm/basic/grpcomm_basic.h | 16
>>>>>> trunk/orte/mca/grpcomm/basic/grpcomm_basic_component.c | 30 -
>>>>>> trunk/orte/mca/grpcomm/basic/grpcomm_basic_module.c | 845 ++-------------------------------------
>>>>>> trunk/orte/mca/grpcomm/cnos/grpcomm_cnos.h | 8
>>>>>> trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_component.c | 8
>>>>>> trunk/orte/mca/grpcomm/cnos/grpcomm_cnos_module.c | 21
>>>>>> trunk/orte/mca/grpcomm/grpcomm.h | 45 +
>>>>>> trunk/orte/mca/rml/rml_types.h | 31
>>>>>> trunk/orte/orted/orted_comm.c | 27 +
>>>>>> 14 files changed, 226 insertions(+), 959 deletions(-)
>>>>>>
>>>>>>
>>>>>> Diff not shown due to size (92619 bytes).
>>>>>> To see the diff, run the following command:
>>>>>>
>>>>>> svn diff -r 17940:17941 --no-diff-deleted
>>>>>>
>>>>>> _______________________________________________
>>>>>> svn mailing list
>>>>>> svn_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
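
The r17941 log above describes the fix as completing the recv outside of the function before anything is sent. As a rough illustration of that ordering only, here is a minimal Python sketch. ToyMessageLayer, its methods, and allgather_then_xcast are hypothetical names invented for this example; they are not the ORTE RML or grpcomm API, and the real code is C. The sketch assumes a single-threaded message layer in which a receive callback must not issue a send.

```python
from collections import deque


class ToyMessageLayer:
    """Hypothetical single-threaded message layer (not the ORTE RML API)."""

    def __init__(self):
        self.queue = deque()      # pending (tag, payload) messages
        self.handlers = {}        # tag -> receive callback
        self.in_callback = False  # True while a receive callback is running

    def post_recv(self, tag, callback):
        self.handlers[tag] = callback

    def send(self, tag, payload):
        # Rule modeled here (an assumption): no sends from inside a callback.
        if self.in_callback:
            raise RuntimeError("send issued from inside a receive callback")
        self.queue.append((tag, payload))

    def progress(self):
        """Deliver one pending message; return False if none was pending."""
        if not self.queue:
            return False
        tag, payload = self.queue.popleft()
        self.in_callback = True
        try:
            self.handlers[tag](payload)
        finally:
            self.in_callback = False
        return True


def allgather_then_xcast(layer, contributions):
    """Gather every contribution first, then broadcast the combined result."""
    gathered = []
    delivered = []

    # The callbacks only record data; they never send.
    layer.post_recv("allgather", gathered.append)
    layer.post_recv("xcast", delivered.append)

    for item in contributions:
        layer.send("allgather", item)

    # Drive progress until the receive side is complete.
    while len(gathered) < len(contributions):
        if not layer.progress():
            raise RuntimeError("no pending messages; allgather cannot complete")

    # Only now, outside any callback, issue the broadcast.
    layer.send("xcast", list(gathered))
    while not delivered:
        layer.progress()
    return delivered[0]
```

Calling allgather_then_xcast(ToyMessageLayer(), ["a", "b", "c"]) returns the gathered list once the broadcast has been delivered. Registering a callback that calls send() directly raises RuntimeError in this toy model, which stands in for the broken pattern the log describes.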