Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange problem with 1.2.6
From: Lenny Verkhovsky (lenny.verkhovsky_at_[hidden])
Date: 2008-07-14 10:04:30


Maybe it's related to #1378, "PML ob1 deadlock for ping/ping"?

On 7/14/08, Jeff Squyres <jsquyres_at_[hidden]> wrote:
>
> What application is it? The majority of the message passing engine did not
> change in the 1.2 series; we did add a new option into 1.2.6 for disabling
> early completion:
>
>
> http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion
>
> See if that helps you out.
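>
> For reference, the knob that FAQ entry describes is an MCA parameter; a
> sketch of disabling it on the command line (I'm quoting the parameter name
> from memory, so double-check it against the FAQ; the app name and process
> count are placeholders):
>
> ```shell
> # Disable early completion in the ob1 PML (option added in Open MPI 1.2.6)
> mpirun --mca pml_ob1_use_early_completion 0 -np 4 ./my_app
> ```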
>
> Note that I don't think many (any?) of us developers monitor the beowulf
> list. Too much mail in our INBOXes already... :-(
>
>
> On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:
>
> Hi folks:
>>
>> I am running into a strange problem with Open MPI 1.2.6, built using
>> gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish). The problem
>> appears to be that if I run using the tcp btl, disabling sm and openib, the
>> run completes successfully (on several different platforms), and does so
>> repeatably.
>>
>> Similarly, if I enable either openib or sm btl, the run does not
>> complete, hanging at different places.
>>
>> An strace of the master thread while it is hanging shows it in a tight
>> loop:
>>
>> Process 15547 attached - interrupt to quit
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
>> {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
>> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
>> {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
>> 0x2b8d766be130}, NULL, 8) = 0
>> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>>
>> The code ran fine about 18 months ago with an earlier Open MPI. This is
>> source and data identical to what is known to work, and it has been
>> demonstrated to work on a few different platforms.
>>
>> When I posed the question on the Beowulf list, some suggested turning off
>> sm and openib, and the run does complete repeatedly when we do so. The
>> suggestion was that there was some sort of buffer-size issue on the sm
>> device.
>>
>> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>>
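>> For concreteness, the runs that complete are the ones forcing the tcp btl
>> (plus the always-required "self" loopback btl); the app name and process
>> count below are placeholders:
>>
>> ```shell
>> # Run with only the TCP btl; "self" handles a rank's sends to itself.
>> # This excludes the sm and openib btls entirely.
>> mpirun --mca btl tcp,self -np 8 ./my_app
>> ```
>>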
>> So, with all this, are there any sort of tunables that I should be playing
>> with?
>>
>> I tried adjusting a few things by setting some MCA parameters in
>> $HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
>> claimed it was going to ignore those anyway).
>>
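>> The file format, as I understand it, is one "param = value" per line; what
>> I tried looked roughly like this (the values here are placeholders for
>> illustration, not a known fix):
>>
>> ```
>> # $HOME/.openmpi/mca-params.conf
>> # One MCA parameter per line, "name = value"
>> btl = tcp,self
>> btl_sm_eager_limit = 4096
>> ```
>>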
>> Any clues? Thanks.
>>
>> Joe
>> --
>> Joseph Landman, Ph.D
>> Founder and CEO
>> Scalable Informatics LLC,
>> email: landman_at_[hidden]
>> web : http://www.scalableinformatics.com
>> http://jackrabbit.scalableinformatics.com
>> phone: +1 734 786 8423
>> fax : +1 866 888 3112
>> cell : +1 734 612 4615
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>