
Open MPI User's Mailing List Archives


From: Peter Kjellström (cap_at_[hidden])
Date: 2005-09-02 04:38:13


Hello,

I'm playing with a copy of svn7132 that built and installed just fine. At
first everything seemed OK; unlike earlier builds, it now runs on mvapi
automagically :-)

But then a small test program failed, and then another. After scratching my
head for a while I realised the pattern: as soon as two ranks shared one node
and I used "mpi_leave_pinned 1", it broke (segfaulted).

Here is a bidirectional point-to-point test running two ranks on the same host
(this one actually starts, but segfaults halfway through):

NODEFILE is "n50 n50"
[cap_at_n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1
--np 2 mpibibench.ompi7132
Using Zero pattern.
starting _bidirect_ lat-bw test.
Latency: 1.8 µsec (total)Bandwidth: 0.0 bytes/s (0 x 10000)
Latency: 2.0 µsec (total)Bandwidth: 1.0 Mbytes/s (1 x 10000)
Latency: 2.0 µsec (total)Bandwidth: 2.0 Mbytes/s (2 x 10000)
Latency: 1.9 µsec (total)Bandwidth: 4.2 Mbytes/s (4 x 10000)
Latency: 2.0 µsec (total)Bandwidth: 8.1 Mbytes/s (8 x 10000)
Latency: 2.2 µsec (total)Bandwidth: 14.8 Mbytes/s (16 x 10000)
Latency: 2.0 µsec (total)Bandwidth: 31.7 Mbytes/s (32 x 10000)
Latency: 2.2 µsec (total)Bandwidth: 57.3 Mbytes/s (64 x 10000)
Latency: 2.2 µsec (total)Bandwidth: 114.3 Mbytes/s (128 x 10000)
Latency: 2.3 µsec (total)Bandwidth: 224.8 Mbytes/s (256 x 10000)
Latency: 2.8 µsec (total)Bandwidth: 369.8 Mbytes/s (512 x 10000)
mpirun noticed that job rank 0 with PID 5879 on node "n50" exited on signal
11.
1 additional process aborted (not shown)

from dmesg:
mpibibench.ompi[5879]: segfault at 0000000000000000 rip 0000000000000000 rsp
0000007fbfffe8e8 error 14

Running on more than one node seems to die instantly (a simple all-to-all app):

NODEFILE is "n50 n50 n49 n49"
[cap_at_n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1
--np 4 alltoall.ompi7132
mpirun noticed that job rank 3 with PID 27857 on node "n49" exited on signal
11.
3 additional processes aborted (not shown)

and with a similar segfault in dmesg.

Either running with one process per node or skipping mpi_leave_pinned makes it
work 100%. Is this expected?

tia,
 Peter

System config:
OS: centos-4.1 x86_64 2.6.9-11smp (el4u1)
ompi: svn7132 vpath build with recommended libtool/autoconf/automake
compilers: 64-bit icc/ifort 8.1-029
configure: ./configure --prefix=xxx --with-btl-mvapi=yyy --disable-cxx
--disable-f90 --disable-io-romio

-- 
------------------------------------------------------------
  Peter Kjellström               |
  National Supercomputer Centre  |
  Sweden                         | http://www.nsc.liu.se

