Open MPI User's Mailing List Archives

Subject: [OMPI users] Strange problem with 1.2.6
From: Joe Landman (landman_at_[hidden])
Date: 2008-07-10 23:04:30


Hi folks:

   I am running into a strange problem with Open MPI 1.2.6, built using
gcc/g++ and Intel ifort 10.1.015, atop an OFED stack (1.1-ish). If I run
using only the tcp btl, with sm and openib disabled, the run completes
successfully (on several different platforms), and does so repeatably.

   However, if I enable either the openib or the sm btl, the run does
not complete, and it hangs at different places.

   An strace of the master thread while it is hanging shows it in a
tight loop:

Process 15547 attached - interrupt to quit
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [CHLD], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6,
events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=
POLLIN}, {fd=10, events=POLLIN}], 6, 0) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0
rt_sigaction(SIGCHLD, {0x2b8d7587f9b2, [CHLD], SA_RESTORER|SA_RESTART,
0x2b8d766be130}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], NULL, 8) = 0

The code ran fine about 18 months ago with an earlier Open MPI release.
This is identical source and data to what is known to work, and has been
demonstrated to work on a few different platforms.

When I posed the question on the Beowulf list, some people suggested
turning off sm and openib. The run works repeatably when we do so. The
suggestion was that there was some sort of buffer size issue on the sm
device.

Turning off sm and tcp, leaving only openib, also appears to loop forever.
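
For reference, the BTL selections described above can be written as
mpirun command lines roughly as follows. This is only a sketch: the
application name ./my_app and the process count of 4 are placeholders,
not the actual job, and the helper btl_cmd is a hypothetical name that
just prints the command instead of running it. The "self" component is
added because an explicit BTL list needs it for a process to send to
itself.

```shell
#!/bin/sh
# Placeholders (assumptions): not the real application or process count.
APP=./my_app
NP=4

# Print the mpirun command line for a comma-separated BTL list.
# "self" (loopback) is appended because an explicit list must include it.
btl_cmd() {
    printf 'mpirun --mca btl %s,self -np %s %s\n' "$1" "$NP" "$APP"
}

btl_cmd tcp      # tcp only: the case reported to complete
btl_cmd openib   # openib only: the case reported to loop forever
```

Other combinations follow the same pattern, for example "btl_cmd sm,tcp".
To actually launch a job, run the printed line directly.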

So, with all this, are there any sort of tunables that I should be
playing with?

I tried adjusting a few things by setting some MCA parameters in
$HOME/.openmpi/mca-params.conf, but this had no effect (and mpirun
claimed it was going to ignore those anyway).
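
As a point of comparison, a parameter file of the kind mentioned above
takes one "name = value" pair per line. The sketch below writes such a
file into a scratch directory rather than $HOME/.openmpi, so nothing
existing is overwritten; the helper name write_mca_params and the
directory openmpi-demo are made up for illustration, and "btl = tcp,self"
is just an example value, not the settings actually tried. The same
parameter can also be set through an OMPI_MCA_-prefixed environment
variable, and values given on the mpirun command line take precedence
over both.

```shell
#!/bin/sh
# Write a minimal MCA parameter file into the directory given as $1.
# Open MPI reads the per-user copy from $HOME/.openmpi/mca-params.conf.
write_mca_params() {
    mkdir -p "$1" || return 1
    cat > "$1/mca-params.conf" <<'EOF'
# One "name = value" pair per line; '#' starts a comment.
btl = tcp,self
EOF
}

# Demonstrate against a scratch directory (hypothetical path) so that an
# existing $HOME/.openmpi/mca-params.conf is left alone.
DEMO_DIR="${TMPDIR:-/tmp}/openmpi-demo"
write_mca_params "$DEMO_DIR"
cat "$DEMO_DIR/mca-params.conf"

# The same setting through the environment: OMPI_MCA_<parameter name>.
export OMPI_MCA_btl=tcp,self
```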

Any clues? Thanks.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman_at_[hidden]
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615