Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] CP2K mpi hang
From: Noam Bernstein (noam.bernstein_at_[hidden])
Date: 2009-05-19 14:01:04


On May 19, 2009, at 12:13 PM, Ashley Pittman wrote:
>
> That is indeed odd but it shouldn't be too hard to track down, how often
> does the failure occur? Presumably when you say you have three
> invocations of the program they communicate via files, is the location
> of these files changing?

Yay. We have a winner. The CP2K code doesn't do all I/O from
the head node. For most of the input files it wants, it crashes if
it can't find them on the other nodes. My script therefore copies
those input files to each node. However, there are two files that are
generated on the fly (output of one call as input for the next one) on
the head node. With one of them, apparently, CP2K will silently go on if
the file is missing, but then lock up in an MPI call (maybe it leaves some
variables uninitialized, and then uses them in the call to the MPI
function?). If I copy that file to each node, it seems to work fine.
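
For anyone hitting the same thing, a pre-flight check run on each node
before the MPI job starts would turn the silent hang into an immediate
error. The following is only a rough sketch in Python, not what my script
actually does: the function names and the file names in the usage note
are made up for illustration, and it assumes the list of required files
is known in advance.

```python
import os
import sys


def missing_inputs(required, directory="."):
    """Return the names in `required` that are not regular files in `directory`."""
    return [name for name in required
            if not os.path.isfile(os.path.join(directory, name))]


def preflight(required, directory="."):
    """Stop with an error message if any required input file is absent.

    Intended to be run on every node before launching the MPI program,
    so that a missing file is reported instead of causing a later hang.
    """
    missing = missing_inputs(required, directory)
    if missing:
        sys.exit("missing input files in %s: %s"
                 % (directory, ", ".join(missing)))
```

For example, preflight(["run.inp", "run.restart"], "/tmp/scratch") (both
names and the directory are hypothetical) would exit with a message
naming whichever of the two files is absent on that node.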

This interpretation is also confirmed by the observation that running
with '--mca btl ^openib' hangs in essentially the same place:

#0 0x0000003b8daca3ff in poll () from /lib64/libc.so.6
#1 0x00002b3c817ab2c6 in poll_dispatch () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#2 0x00002b3c817aa2a3 in opal_event_base_loop () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#3 0x00002b3c8179fb2e in opal_progress () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libopen-pal.so.0
#4 0x00002b3c812d9e55 in ompi_request_default_wait_all () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#5 0x00002b3c866d007a in ompi_coll_tuned_alltoall_intra_basic_linear () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/openmpi/mca_coll_tuned.so
#6 0x00002b3c812edb8f in PMPI_Alltoall () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi.so.0
#7 0x00002b3c81094af6 in pmpi_alltoall__ () from /share/apps/mpi/openmpi-1.3.2/intel-11.0.083/lib/libmpi_f77.so.0
#8 0x000000000078ca5b in message_passing_mp_mp_alltoall_i_ ()
#9 0x000000000116e6ab in cp_sm_fm_interactions_mp_fm_reshuffle_create_layout_ ()

Thank you all for your help, and I apologize for the red herring :)

Noam