Open MPI User's Mailing List Archives


From: Marcin Skoczylas (Marcin.Skoczylas_at_[hidden])
Date: 2007-06-20 05:51:11

I had almost the same situation when I upgraded Open MPI from a very old
version to 1.2.2. All processes seemed to be stuck in MPI_Barrier; as a
workaround I just commented out all MPI_Barrier calls in my
program and it started to work perfectly.

greets, Marcin

Chris Reeves wrote:
> (This time with attachments...)
> Hi there,
> I've had a look through the FAQ and searched the list archives and can't find
> any similar problems to this one.
> I'm running OpenMPI 1.2.2 on 10 Intel iMacs (Intel Core2 Duo CPU). I am
> specifying two slots per machine and starting my job with:
> /Network/Guanine/csr201/local-i386/opt/openmpi/bin/mpirun -np 20 --hostfile
> bhost.jobControl nice -19 /Network/Guanine/csr201/jobControl/
> /Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel
> The config.log and output of 'ompi_info --all' are attached.
> Also attached is a small patch that I wrote to work around some firewall
> limitations on the nodes (I don't know if there's a better way to do this -
> suggestions are welcome). The patch may or may not be relevant, but I'm not
> ruling out network issues and a bit of peer review never goes amiss in case
> I've done something very silly.
> The programme that I'm trying to run is fairly hefty, so I'm afraid that I
> can't provide you with a simple test case to highlight the problem. The best I
> can do is provide you with a description of where I'm at and then ask for some
> advice/suggestions...
> The code itself has run in the past with various versions of MPI/LAM and
> OpenMPI and hasn't, to my knowledge, undergone any significant changes
> recently. I have noticed delays before, both on this system and on others,
> when MPI_BARRIER is called but they don't always result in a permanent
> 'spinning' of the process.
> The 20-node job that I'm running right now is using 90-100% of every CPU, but
> hasn't made any progress for around 14 hours. I've used GDB to attach to each
> of these processes and verified that every single one of them is sitting
> inside a call to MPI_BARRIER. My understanding is that once every process hits
> the barrier, they should then move on to the next part of the code.
> Here's an example of what I see when I attach to one of these processes:
> ------------------------------------------------------------------------------
> Attaching to program: `/private/var/automount/Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel', process 29578.
> Reading symbols for shared libraries ..+++++.................................................................... done
> 0x9000121c in sigprocmask ()
> (gdb) where
> #0 0x9000121c in sigprocmask ()
> #1 0x01c46f96 in opal_evsignal_recalc ()
> #2 0x01c458c2 in opal_event_base_loop ()
> #3 0x01c45d32 in opal_event_loop ()
> #4 0x01c3e6f2 in opal_progress ()
> #5 0x01b6083e in ompi_request_wait_all ()
> #6 0x01ec68d8 in ompi_coll_tuned_sendrecv_actual ()
> #7 0x01ecbf64 in ompi_coll_tuned_barrier_intra_bruck ()
> #8 0x01b75590 in MPI_Barrier ()
> #9 0x01aec47a in mpi_barrier__ ()
> #10 0x0011c66c in MAIN_ ()
> #11 0x002870f9 in main (argc=1, argv=0xbfffe6ec)
> (gdb)
> ------------------------------------------------------------------------------
> Does anyone have any suggestions as to what might be happening here? Is there
> any way to 'tickle' the processes and get them to move on? What if some
> packets went missing on the network? Surely TCP should take care of this and
> resend? As implied by my line of questioning, my current thoughts are that
> some messages between nodes have somehow gone missing. Could this happen? What
> could cause this? All machines are on the same subnet.
> I'm sorry my question is so open, but I don't know much about the internals of
> OpenMPI and how it passes messages and I'm looking for some ideas on where to
> start searching!
> Thanks in advance for any help or suggestions that you can offer,
> Chris
> ------------------------------------------------------------------------
> _______________________________________________
> users mailing list
> users_at_[hidden]