Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange problem with 1.2.6
From: Willem Vermin (willem_at_[hidden])
Date: 2008-07-11 04:01:30


Hello Joe,

I have no solution, but I have the same problem; see
http://www.open-mpi.org/community/lists/users/2008/07/6007.php
There you will find a small program that demonstrates the problem.

I found that the problem does not exist on all hardware. I have the
impression that it manifests itself on systems with 2 or more
cores. I tried it on a single-core machine, and there was no problem.

Regards,

Willem

Joe Landman wrote:
> Hi folks:
>
> I am running into a strange problem with Open-MPI 1.2.6, built using
> gcc/g++ and intel ifort 10.1.015, atop an OFED stack (1.1-ish). The
> problem appears to be that if I run using the tcp btl, disabling sm and
> openib, the run completes successfully (on several different platforms),
> and does so repeatably.
>
> Similarly, if I enable either openib or sm btl, the run does not
> complete, hanging at different places.
>
> An strace of the master thread while it is hanging shows it in a
> tight loop
>
> Process 15547 attached - interrupt to quit
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
> POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
> rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
> 0x2b8d766be130}, NULL, 8) = 0
> rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
>
> The code ran fine about 18 months ago with earlier OpenMPI. This is
> identical source and data to what is known to work, and demonstrated to
> work on a few different platforms.
>
> When I posed the question on Beowulf, some suggested turning off sm and
> openib. The run works repeatably when we do as indicated. The
> suggestion was that there was some sort of buffer size issue on the sm
> device.
>
> Turning off sm and tcp, leaving only openib, also appears to loop forever.
>
> So, with all this, are there any sort of tunables that I should be
> playing with?
>
> I tried adjusting a few things by setting some mca parameters in
> $HOME/.openmpi/mca-params.conf, but this had no effect (and the mpirun
> claimed it was going to ignore those anyway).
>
> Any clues? Thanks.
>
> Joe
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman_at_[hidden]
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
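
For anyone who wants to repeat the comparison described in the quoted message, here is a rough sketch of how the BTL combinations could be turned into mpirun command lines. It only builds and prints the commands; it does not run anything. The program name ./my_app, the process count of 4, and the helper name mpirun_command are placeholders of mine, and adding the "self" component to each list is my assumption, not something stated in the original report.

```python
import shlex

# BTL selections corresponding to the cases in the quoted message.
# "self" (the loopback component) is added as an assumption; the
# original report does not say whether it was listed explicitly.
BTL_SETS = {
    "tcp only": ["tcp", "self"],
    "sm only": ["sm", "self"],
    "openib only": ["openib", "self"],
}


def mpirun_command(btls, nprocs, program):
    """Build an mpirun argument list that restricts Open MPI to the given BTLs."""
    return ["mpirun", "--mca", "btl", ",".join(btls), "-np", str(nprocs), program]


if __name__ == "__main__":
    for label, btls in BTL_SETS.items():
        # ./my_app and 4 processes are hypothetical placeholders.
        cmd = mpirun_command(btls, 4, "./my_app")
        print(f"{label}: {shlex.join(cmd)}")
```

Running each printed command in turn and noting which ones hang should show whether the failure follows a particular BTL, as the quoted message suggests.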

-- 
Willem Vermin         tel (31)20 5923054/5923000
SARA, Kruislaan 415   fax (31)20 6683167
1098 SJ Amsterdam     willem_at_[hidden]
Nederland