Open MPI User's Mailing List Archives

From: Tim S. Woodall (twoodall_at_[hidden])
Date: 2005-09-02 08:07:42


Thanks Peter,

We'll look into this...

Tim

Peter Kjellström wrote:
> Hello,
>
> I'm playing with a copy of svn7132 that built and installed just fine. At
> first everything seemed OK; unlike earlier revisions, it now runs on mvapi
> automagically :-)
>
> But then a small test program failed, and then another. After scratching my
> head for a while, I realised the pattern: as soon as two ranks shared one
> node and I used "mpi_leave_pinned 1", it broke... (segfaulted)
>
> Here is a bidirectional point-to-point test running two ranks on the same
> host (this one actually starts but segfaults halfway through):
>
> NODEFILE is "n50 n50"
> [cap_at_n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1
> --np 2 mpibibench.ompi7132
> Using Zero pattern.
> starting _bidirect_ lat-bw test.
> Latency: 1.8 µsec (total)Bandwidth: 0.0 bytes/s (0 x 10000)
> Latency: 2.0 µsec (total)Bandwidth: 1.0 Mbytes/s (1 x 10000)
> Latency: 2.0 µsec (total)Bandwidth: 2.0 Mbytes/s (2 x 10000)
> Latency: 1.9 µsec (total)Bandwidth: 4.2 Mbytes/s (4 x 10000)
> Latency: 2.0 µsec (total)Bandwidth: 8.1 Mbytes/s (8 x 10000)
> Latency: 2.2 µsec (total)Bandwidth: 14.8 Mbytes/s (16 x 10000)
> Latency: 2.0 µsec (total)Bandwidth: 31.7 Mbytes/s (32 x 10000)
> Latency: 2.2 µsec (total)Bandwidth: 57.3 Mbytes/s (64 x 10000)
> Latency: 2.2 µsec (total)Bandwidth: 114.3 Mbytes/s (128 x 10000)
> Latency: 2.3 µsec (total)Bandwidth: 224.8 Mbytes/s (256 x 10000)
> Latency: 2.8 µsec (total)Bandwidth: 369.8 Mbytes/s (512 x 10000)
> mpirun noticed that job rank 0 with PID 5879 on node "n50" exited on signal
> 11.
> 1 additional process aborted (not shown)
>
> from dmesg:
> mpibibench.ompi[5879]: segfault at 0000000000000000 rip 0000000000000000 rsp
> 0000007fbfffe8e8 error 14
>
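For reference (the mpibibench source is not part of this report), a minimal
two-rank bidirectional test of the same flavor might look roughly like the
sketch below; the message sizes and the 10000-iteration count come from the
output above, while the buffer handling and everything else are assumptions
rather than the actual benchmark code.

/* A minimal sketch of a two-rank bidirectional latency/bandwidth loop.
 * This is not the actual mpibibench source (not included above); the
 * message sizes and 10000-iteration count are taken from the output,
 * everything else is an assumption. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size, peer;
    const int iters = 10000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2)
        MPI_Abort(MPI_COMM_WORLD, 1);
    peer = 1 - rank;

    for (int len = 1; len <= 512; len *= 2) {
        char *sbuf = malloc(len), *rbuf = malloc(len);
        memset(sbuf, 0, len);                  /* "Zero pattern" */
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Request req[2];
            /* Both ranks send and receive at the same time, reusing the
             * same buffers every iteration. */
            MPI_Irecv(rbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Isend(sbuf, len, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
            MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        }
        double t = (MPI_Wtime() - t0) / iters;
        if (rank == 0)
            printf("Latency: %.1f usec (total)  Bandwidth: %.1f Mbytes/s (%d x %d)\n",
                   t * 1e6, len / t / 1e6, len, iters);
        free(sbuf);
        free(rbuf);
    }
    MPI_Finalize();
    return 0;
}

A loop like this reuses the same send and receive buffers on every iteration,
which is the buffer-reuse case that the registration cache behind
mpi_leave_pinned is meant to speed up.
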
> Running on more than one node seems to die instantly (a simple all-to-all app):
>
> NODEFILE is "n50 n50 n49 n49"
> [cap_at_n50 mpi]$ mpirun --machinefile $PBS_NODEFILE --mca mpi_leave_pinned 1
> --np 4 alltoall.ompi7132
> mpirun noticed that job rank 3 with PID 27857 on node "n49" exited on signal
> 11.
> 3 additional processes aborted (not shown)
>
> and with a similar segfault in dmesg.
>
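The alltoall.ompi7132 source is likewise not included; a minimal stand-in for
that kind of test might be the sketch below, with the per-peer element count
chosen arbitrarily.

/* A minimal sketch of a simple all-to-all test. This is not the actual
 * alltoall.ompi7132 source (not included above); the per-peer count is
 * an arbitrary assumption. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 1024;                     /* ints sent to each peer */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int *sbuf = malloc((size_t)size * count * sizeof(int));
    int *rbuf = malloc((size_t)size * count * sizeof(int));
    memset(sbuf, 0, (size_t)size * count * sizeof(int));

    /* One collective exchange among all ranks; per the report this dies
     * immediately when two ranks share a node and mpi_leave_pinned is set. */
    MPI_Alltoall(sbuf, count, MPI_INT, rbuf, count, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        printf("alltoall done with %d ranks\n", size);

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}
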
> Either running with one proc per node or skipping mpi_leave_pinned makes it
> work 100%. Is this expected?
>
> tia,
> Peter
>
> System config:
> OS: centos-4.1 x86_64 2.6.9-11smp (el4u1)
> ompi: svn7132 vpath build with recommended libtool/autoconf/automake
> compilers: 64-bit icc/ifort 8.1-029
> configure: ./configure --prefix=xxx --with-btl-mvapi=yyy --disable-cxx
> --disable-f90 --disable-io-romio