
Open MPI User's Mailing List Archives


From: George Bosilca (bosilca_at_[hidden])
Date: 2007-06-19 14:28:33


Does the deadlock happen with or without your patch? If it happens with
your patch, the problem might come from the fact that you start 2
processes on each node, so they share the port range (because of your
patch).
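
(If the shared port range does turn out to be the problem: newer Open MPI
releases can restrict the TCP port range at run time through MCA
parameters, for example

   mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 100 ...

The parameter names above are taken from newer releases; I have not
checked whether 1.2.2 already supports them, so treat this as a pointer
rather than a tested recipe.)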

Please re-run either with 2 processes per node but without your patch,
or with only one process per node with your patch.
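
Something like the following should separate the two cases (here
"bhost.oneslot" is a hypothetical hostfile listing each of your ten
iMacs with "slots=1", and the paths are shortened):

   # test 1: patched build, one process per node
   mpirun -np 10 --hostfile bhost.oneslot nice -19 run_torus.pl torus.ompiosx-intel

   # test 2: unpatched build, two processes per node (your current hostfile)
   mpirun -np 20 --hostfile bhost.jobControl nice -19 run_torus.pl torus.ompiosx-intel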

   Thanks,
     george.

On Jun 19, 2007, at 6:18 AM, Chris Reeves wrote:

>
> (This time with attachments...)
>
> Hi there,
>
> I've had a look through the FAQ and searched the list archives and can't
> find any similar problems to this one.
>
> I'm running OpenMPI 1.2.2 on 10 Intel iMacs (Intel Core2 Duo CPU). I am
> specifying two slots per machine and starting my job with:
>
> /Network/Guanine/csr201/local-i386/opt/openmpi/bin/mpirun -np 20 \
>   --hostfile bhost.jobControl nice -19 \
>   /Network/Guanine/csr201/jobControl/run_torus.pl \
>   /Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel
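>
> (For reference, bhost.jobControl simply lists each of the ten iMacs with
> two slots, i.e. one line per machine of the form "somehost.local slots=2",
> where the host name here is only a placeholder.)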
>
> The config.log and output of 'ompi_info --all' are attached.
>
> Also attached is a small patch that I wrote to work around some firewall
> limitations on the nodes (I don't know if there's a better way to do this -
> suggestions are welcome). The patch may or may not be relevant, but I'm not
> ruling out network issues and a bit of peer review never goes amiss in case
> I've done something very silly.
>
> The programme that I'm trying to run is fairly hefty, so I'm afraid that I
> can't provide you with a simple test case to highlight the problem. The
> best I can do is provide you with a description of where I'm at and then
> ask for some advice/suggestions...
>
> The code itself has run in the past with various versions of LAM/MPI and
> OpenMPI and hasn't, to my knowledge, undergone any significant changes
> recently. I have noticed delays before, both on this system and on others,
> when MPI_BARRIER is called but they don't always result in a permanent
> 'spinning' of the process.
>
> The 20-process job that I'm running right now is using 90-100% of every
> CPU, but hasn't made any progress for around 14 hours. I've used GDB to
> attach to each of these processes and verified that every single one of
> them is sitting inside a call to MPI_BARRIER. My understanding is that once
> every process hits the barrier, they should then move on to the next part
> of the code.
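>
> The relevant pattern is nothing more exotic than this (a minimal
> standalone C sketch for illustration only - the real code is Fortran and
> far larger):
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* ... each rank does its share of the work ... */
>
>     /* no rank should return from this call until every rank in
>        MPI_COMM_WORLD has entered it */
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     printf("rank %d is past the barrier\n", rank);
>     MPI_Finalize();
>     return 0;
> }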
>
> Here's an example of what I see when I attach to one of these processes:
> ------------------------------------------------------------------------------
>
> Attaching to program: `/private/var/automount/Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel', process 29578.
> Reading symbols for shared libraries ..+++++.................................................................... done
> 0x9000121c in sigprocmask ()
> (gdb) where
> #0 0x9000121c in sigprocmask ()
> #1 0x01c46f96 in opal_evsignal_recalc ()
> #2 0x01c458c2 in opal_event_base_loop ()
> #3 0x01c45d32 in opal_event_loop ()
> #4 0x01c3e6f2 in opal_progress ()
> #5 0x01b6083e in ompi_request_wait_all ()
> #6 0x01ec68d8 in ompi_coll_tuned_sendrecv_actual ()
> #7 0x01ecbf64 in ompi_coll_tuned_barrier_intra_bruck ()
> #8 0x01b75590 in MPI_Barrier ()
> #9 0x01aec47a in mpi_barrier__ ()
> #10 0x0011c66c in MAIN_ ()
> #11 0x002870f9 in main (argc=1, argv=0xbfffe6ec)
> (gdb)
>
> ------------------------------------------------------------------------------
>
> Does anyone have any suggestions as to what might be happening here? Is
> there any way to 'tickle' the processes and get them to move on? What if
> some packets went missing on the network? Surely TCP should take care of
> this and resend? As implied by my line of questioning, my current thoughts
> are that some messages between nodes have somehow gone missing. Could this
> happen? What could cause this? All machines are on the same subnet.
>
> I'm sorry my question is so open, but I don't know much about the internals
> of OpenMPI and how it passes messages, and I'm looking for some ideas on
> where to start searching!
>
> Thanks in advance for any help or suggestions that you can offer,
> Chris
> <ompi_config.log.gz>
> <ompi_info.out.gz>
> <openmpi-cluster_firewall.patch>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users


