Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Runtime error only on one node.
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-05 17:25:34


Whoops; we shouldn't be seg faulting. :-\

The warning is exactly what it implies -- it found the OpenFabrics
network stack but no functioning OpenFabrics-capable hardware. You can
disable the warning (and the segv) by preventing the openib BTL from running:

   mpirun --mca btl ^openib
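
If that makes the segv go away, you can make the setting permanent
instead of typing it on every command line -- roughly like this, via
the MCA environment variable or a per-user params file (just a sketch;
adjust to your setup):

   # set for every mpirun launched from this shell
   export OMPI_MCA_btl=^openib

   # or put this line in $HOME/.openmpi/mca-params.conf
   btl = ^openib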

But what I don't see is why we're segv'ing when calling
ibv_destroy_srq(). This is a function in the shutdown sequence of the
openib BTL, but it shouldn't be getting called given the error
message that you're seeing. Are you getting corefiles, perchance?
Could you get a stack trace with the file and line numbers in OMPI
where this is happening?
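
For example, if node66 left a corefile behind, something along these
lines should show the file/line info (the corefile name here is just a
guess -- it depends on your core_pattern -- and you'll only see line
numbers if the build has debugging symbols):

   gdb ~/hello.x core.3997
   (gdb) bt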

Do you have OpenFabrics hardware on your cluster? According to your
error message, node18 is the one that doesn't find any OF-capable
hardware, but node66 is the one that segv's, which is darn weird...
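
A quick way to check is to run ibv_devinfo (from the libibverbs
utilities, assuming it's installed on the nodes) on node18 and node66
and compare what they report:

   ssh node18 ibv_devinfo
   ssh node66 ibv_devinfo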

On Mar 5, 2009, at 12:13 AM, Shinta Bonnefoy wrote:

> Hi,
>
> I am the admin of a small cluster (server running under SLES 10.1 and
> nodes on OSS 10.3), and I have just installed openmpi 1.3 on it.
>
> I'm trying to get a simple program (like hello world) running but it
> fails all the time on one of the nodes but never on the others.
>
> I don't think it's related to the program since it's the simplest one
> you can write.
>
> All the nodes share the openmpi install directory (through NFS)
> and all have the same profile.
>
> Here is the runtime error I got:
> mpirun -machinefile no -np 6 ~/hello.x
> --------------------------------------------------------------------------
> [[6735,1],0]: A high-performance Open MPI point-to-point messaging
> module
> was unable to find any relevant network interfaces:
>
> Module: OpenFabrics (openib)
> Host: node18
>
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> Hello world from process 3 of 6
> Hello world from process 1 of 6
> Hello world from process 4 of 6
> Hello world from process 2 of 6
> Hello world from process 5 of 6
> Hello world from process 0 of 6
> [node66:03997] *** Process received signal ***
> [node66:03997] Signal: Segmentation fault (11)
> [node66:03997] Signal code: Address not mapped (1)
> [node66:03997] Failing at address: (nil)
> [node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
> [node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0)
> [0x2b5e24ee0fa0]
> [node66:03997] [ 2]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so
> [0x2b5e250eb2dd]
> [node66:03997] [ 3]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close
> +0x87)
> [0x2b5e21aa2a67]
> [node66:03997] [ 4]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so
> [0x2b5e24cc39d2]
> [node66:03997] [ 5]
> /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so
> [0x2b5e24aa2d0e]
> [node66:03997] [ 6]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.
> 0(mca_pml_base_finalize+0x1b)
> [0x2b5e21aacd2f]
> [node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0
> [0x2b5e21a66a7b]
> [node66:03997] [ 8]
> /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17)
> [0x2b5e21a84207]
> [node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
> [node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4)
> [0x2b5e229cfb54]
> [node66:03997] [11] /home/donald/hello.x [0x401ad9]
> [node66:03997] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 5 with PID 3997 on node node66 exited
> on signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [node72:07895] 4 more processes have sent help message
> help-mpi-btl-base.txt / btl:no-nics
> [node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see
> all help / error messages
>
>
>
>
> Please advise,
> Thanks and regards,
> SB
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Cisco Systems