Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange problem with 1.2.6
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-07-14 09:59:19


What application is it? The majority of the message passing engine
did not change in the 1.2 series; we did add a new option into 1.2.6
for disabling early completion:

     http://www.open-mpi.org/faq/?category=openfabrics#v1.2-use-early-completion

See if that helps you out.
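
A rough sketch of what trying that option might look like on the command line. The parameter name pml_ob1_use_early_completion is my reading of the FAQ entry linked above, so confirm it against that entry (or ompi_info) for your install. The helper function, process count, BTL list, and application path are hypothetical placeholders, not anything from the original report.

```shell
# Build (and print) an mpirun command line with early completion disabled.
# Assumption: the MCA parameter is named pml_ob1_use_early_completion,
# as in the FAQ entry linked above; verify before relying on it.
# Hypothetical arguments: process count, BTL list, application path.
early_completion_cmd() {
    np=$1
    btls=$2
    app=$3
    echo "mpirun -np $np --mca btl $btls --mca pml_ob1_use_early_completion 0 $app"
}

# Print the command for review rather than running it directly.
early_completion_cmd 4 openib,self ./my_app
```

Printing the command first makes it easy to compare the failing openib or sm configurations against the working tcp one before actually launching anything.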

Note that I don't think many (any?) of us developers monitor the
beowulf list. Too much mail in our INBOXes already... :-(

On Jul 10, 2008, at 11:04 PM, Joe Landman wrote:

> Hi folks:
>
> I am running into a strange problem with Open MPI 1.2.6, built
> using gcc/g++ and Intel ifort 10.1.015, atop an OFED stack
> (1.1-ish). If I run using the tcp btl, disabling sm and openib,
> the run completes successfully (on several different platforms),
> and does so repeatably.
>
> However, if I enable either the openib or sm btl, the run does not
> complete, hanging at different places.
>
> An strace of the master thread while it is hanging shows it in a
> tight loop:
>
> Process 15547 attached - interrupt to quit
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|
> SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|
> SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|
> SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|
> SA_RESTART, 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>
> The code ran fine about 18 months ago with an earlier Open MPI
> release. This is identical source and data to what is known to work,
> and demonstrated to work on a few different platforms.
>
> When I posed the question on the Beowulf list, some suggested
> turning off sm and openib, and the run does work repeatably when we
> do so. The suggestion was that there was some sort of buffer size
> issue on the sm device.
>
> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>
> So, with all this, are there any sort of tunables that I should be
> playing with?
>
> I tried adjusting a few things by setting some mca parameters in
> $HOME/.openmpi/mca-params.conf, but this had no effect (and
> mpirun claimed it was going to ignore those anyway).
>
> Any clues? Thanks.
>
> Joe
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman_at_[hidden]
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615

-- 
Jeff Squyres
Cisco Systems