Subject: Re: [OMPI users] Strange problem with 1.2.6
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-07-14 10:04:30


Maybe it's related to #1378, "PML ob1 deadlock for ping/ping"?

On 7/14/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> What application is it? The majority of the message-passing engine did not
> change in the 1.2 series; we did add a new option in 1.2.6 for disabling
> early completion:
>
>
> http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> See if that helps you out.
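>
> As a rough sketch of what that looks like in practice (the parameter name
> below, pml_ob1_use_early_completion, is the one I believe that FAQ entry
> refers to; verify it with ompi_info on your build, and ./your_app is just a
> placeholder):
>
>   # list the ob1 PML parameters to confirm the early-completion knob
>   ompi_info --param pml ob1
>
>   # re-run with early completion disabled
>   mpirun --mca pml_ob1_use_early_completion 0 -np 4 ./your_app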
>
> Note that I don't think many (any?) of us developers monitor the Beowulf
> list. Too much mail in our INBOXes already... :-(
>
>
> On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
>
> Hi folks:
>>
>> I am running into a strange problem with Open MPI 1.2.6, built using
>> gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish). The problem
>> appears to be that if I run using the tcp btl, with sm and openib disabled,
>> the run completes successfully (on several different platforms), and does so
>> repeatably.
>>
>> Conversely, if I enable either the openib or sm btl, the run does not
>> complete; it hangs at different places.
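>>
>> For concreteness, the two cases look roughly like this (./app and the
>> process count are placeholders for the real job):
>>
>>   # completes repeatably: TCP only, sm and openib excluded
>>   mpirun --mca btl tcp,self -np 8 ./app
>>
>>   # hangs: shared memory and/or InfiniBand enabled
>>   mpirun --mca btl openib,sm,self -np 8 ./app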
>>
>> An strace of the master thread while it is hanging shows it in a tight
>> loop:
>>
>> Process 15547 attached - interrupt to quit
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>>
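>> (The trace above came from attaching strace to the running master process,
>> roughly:
>>
>>   strace -p 15547
>>
>> where 15547 is the PID from this particular run.)
>>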
>> The code ran fine about 18 months ago with an earlier Open MPI release. This
>> is the same source and data that is known to work, and it has been
>> demonstrated to work on a few different platforms.
>>
>> When I posed the question on the Beowulf list, some people suggested turning
>> off sm and openib. The run does indeed complete repeatedly when we do that.
>> The suggestion was that there might be some sort of buffer-size issue on the
>> sm device.
>>
>> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>>
>> So, with all this, are there any tunables that I should be playing
>> with?
>>
>> I tried adjusting a few things by setting some MCA parameters in
>> $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
>> claimed it was going to ignore those anyway).
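>>
>> As an illustration of the file format (the parameters and values below are
>> only examples, not necessarily what I set):
>>
>>   # $HOME/.openmpi/mca-params.conf: one "name = value" pair per line
>>   btl = tcp,self
>>   btl_sm_eager_limit = 8192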
>>
>> Any clues? Thanks.
>>
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman_at_[hidden]
>> web : http://www.scalableinformatics.com
>> http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax : +1 866 888 3112
>> cell : +1 734 612 4615
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>