Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Runtime error only on one node.
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-05 17:25:34


Whoops; we shouldn't be seg faulting. :-\

The warning is exactly what it implies -- it found the OpenFabrics
network stack but no functioning OpenFabrics-capable hardware. You can
disable the warning (and the segv) by preventing the openib BTL from
running:

   mpirun --mca btl ^openib
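
FWIW, if you'd rather not type that on every command line, the same
setting can (as far as I recall) also be given through Open MPI's usual
MCA mechanisms -- an environment variable or the per-user parameter
file -- e.g. (bash syntax, default per-user config location assumed):

   # one-off, via the environment
   export OMPI_MCA_btl=^openib
   mpirun -machinefile no -np 6 ~/hello.x

   # or persistently, in the per-user MCA parameter file
   echo "btl = ^openib" >> $HOME/.openmpi/mca-params.conf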

But what I don't see is why we're segv'ing when calling
ibv_destroy_srq(). This is a function in the shutdown sequence of the
openib BTL, but that shouldn't be getting called with the error
message that you're seeing. Are you getting corefiles, perchance? If
so, could you get a stack trace with the file and line numbers in OMPI
where this is happening?
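
If corefiles aren't showing up, here's a rough sketch of how to get one
and pull a backtrace out of it (assuming bash and gdb on the node; the
corefile name varies by system -- core, core.<pid>, etc.):

   # allow corefiles in the shell that launches the job
   ulimit -c unlimited
   # after the crash, open the core in gdb and print a full backtrace
   gdb ~/hello.x core
   (gdb) bt full

Building Open MPI with --enable-debug should get you the file/line
numbers in the trace.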

Do you have OpenFabrics hardware on your cluster? According to your
error message, node18 is the one that doesn't find any OF-capable
hardware, but node66 is the one that segv's, which is darn weird...
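
If you're not sure, a quick check on each node is the libibverbs
utilities (ibv_devices / ibv_devinfo, usually shipped with the OFED /
libibverbs packages -- if they're not installed at all, that tells you
something too):

   # list any OpenFabrics devices libibverbs can see on this node
   ibv_devices
   # more detail, including port state
   ibv_devinfo

Run them on node18 and node66 and compare.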

On Mar 5, 2009, at 12:13 AM, Shinta Bonnefoy wrote:

> Hi,
>
> I am the admin of a small cluster (server running under SLES 10.1 and
> nodes on OSS 10.3), and I have just installed openmpi 1.3 on it.
>
> I'm trying to get a simple program (like hello world) running, but it
> fails all the time on one of the nodes and never on the others.
>
> I don't think it's related to the program, since it's the simplest one
> you can write.
>
> All the nodes are sharing the openmpi install directory (through NFS)
> and they all have the same profile.
>
> Here is the runtime error I get:
> mpirun -machinefile no -np 6 ~/hello.x
> --------------------------------------------------------------------------
> [[6735,1],0]: A high-performance Open MPI point-to-point messaging
> module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
> Host: node18
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> Hello world from process 3 of 6
> Hello world from process 1 of 6
> Hello world from process 4 of 6
> Hello world from process 2 of 6
> Hello world from process 5 of 6
> Hello world from process 0 of 6
> [node66:03997] *** Process received signal ***
> [node66:03997] Signal: Segmentation fault (11)
> [node66:03997] Signal code: Address not mapped (1)
> [node66:03997] Failing at address: (nil)
> [node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
> [node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0)
> [0x2b5e24ee0fa0]
> [node66:03997] [ 2]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so
> [0x2b5e250eb2dd]
> [node66:03997] [ 3]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87)
> [0x2b5e21aa2a67]
> [node66:03997] [ 4]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so
> [0x2b5e24cc39d2]
> [node66:03997] [ 5]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so
> [0x2b5e24aa2d0e]
> [node66:03997] [ 6]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b)
> [0x2b5e21aacd2f]
> [node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0
> [0x2b5e21a66a7b]
> [node66:03997] [ 8]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17)
> [0x2b5e21a84207]
> [node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
> [node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x2b5e229cfb54]
> [node66:03997] [11] /home/donald/hello.x [0x401ad9]
> [node66:03997] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 5 with PID 3997 on node node66 exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [node72:07895] 4 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
>
>
>
> Please advise,
> Thanks and regards,
> SB
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems