Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Runtime error only on one node.
From: Shinta Bonnefoy (shinta.bonnefoy_at_[hidden])
Date: 2009-03-05 19:05:10


Hi Jeff,

Thanks, the option --mca btl ^openib works fine!

Half of the cluster has Infiniband/OpenFabrics (from node49 to node96)
and the other half (nodes from 01 to 48) doesn't.

I just wanted to make openmpi run over ethernet/tcp first.

I will try to make it run using OpenFabrics, but I guess I need to
recompile another package to do so?

If I mix nodes that have OpenFabrics with nodes that don't, I should
use the option "--mca btl ^openib", right?
And if I use only similar nodes (either all non-OpenFabrics or all
OpenFabrics), I don't have to use the option anymore.
But over OpenFabrics, will openmpi automatically use the Infiniband
hardware?

Thanks a lot.
SB
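
For reference, the cases asked about above could be written out roughly as follows. This is a minimal sketch, not a confirmed answer from the list: the helper name btl_args and its mode labels are hypothetical, the BTL lists use openib from this thread plus the standard tcp, sm and self components, and the commands are only printed, not executed.

```shell
#!/bin/sh
# Hypothetical helper: print the --mca arguments for a given kind of run.
# The mode names (exclude-ib, tcp-only, ib-only) are invented for this sketch.
btl_args() {
    case "$1" in
        exclude-ib) printf '%s\n' '--mca btl ^openib' ;;        # every BTL except openib
        tcp-only)   printf '%s\n' '--mca btl tcp,sm,self' ;;    # TCP plus shared memory and loopback
        ib-only)    printf '%s\n' '--mca btl openib,sm,self' ;; # request the openib BTL explicitly
        *)          echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the command line for each case.
# The machinefile name and program path are the ones from the original report.
for mode in exclude-ib tcp-only ib-only; do
    echo "mpirun $(btl_args "$mode") -machinefile no -np 6 ~/hello.x"
done
```

When typing the exclusion form by hand, quoting the value ('^openib') is harmless and avoids surprises in shells that treat the caret specially. Whether the openib component was built into a given installation can be checked with "ompi_info | grep openib".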

users-request_at_[hidden] wrote:
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 5 Mar 2009 17:25:34 -0500
> From: Jeff Squyres <jsquyres_at_[hidden]>
> Subject: Re: [OMPI users] Runtime error only on one node.
> To: "Open MPI Users" <users_at_[hidden]>
> Message-ID: <70D31C29-B711-419F-9973-73B41FEB0DBC_at_[hidden]>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> Whoops; we shouldn't be seg faulting. :-\
>
> The warning is exactly what it implies -- it found the OpenFabrics
> network stack but no functioning OpenFabrics-capable hardware. You can
> disable it (and the segv) by disabling the openfabrics BTL from running:
>
> mpirun --mca btl ^openib
>
> But what I don't see is why we're segv'ing when calling
> ibv_destroy_srq(). This is a function in the shutdown sequence of the
> openib BTL, but that shouldn't be getting called with the error
> message that you're seeing. Are you getting corefiles, perchance?
> Could you get a stack trace with the file and line numbers in OMPI
> where this is happening, perchance?
>
> Do you have OpenFabrics hardware on your cluster? According to your
> error message, node18 is the one that doesn't find any OF-capable
> hardware, but node66 is the one that segv's, which is darn weird...
>
>
> On Mar 5, 2009, at 12:13 AM, Shinta Bonnefoy wrote:
>
>
>> Hi,
>>
>> I am the admin of a small cluster (server running SLES 10.1 and
>> nodes on OSS 10.3), and I have just installed openmpi 1.3 on it.
>>
>> I'm trying to get a simple program (like hello world) running, but it
>> fails every time on one of the nodes and never on the others.
>>
>> I don't think it's related to the program, since it's the simplest one
>> you can write.
>>
>> All the nodes share the openmpi install directory (through NFS)
>> and all have the same profile.
>>
>> Here is the runtime error I get:
>> mpirun -machinefile no -np 6 ~/hello.x
>> --------------------------------------------------------------------------
>> [[6735,1],0]: A high-performance Open MPI point-to-point messaging module
>> was unable to find any relevant network interfaces:
>>
>> Module: OpenFabrics (openib)
>> Host: node18
>>
>> Another transport will be used instead, although this may result in
>> lower performance.
>> --------------------------------------------------------------------------
>> Hello world from process 3 of 6
>> Hello world from process 1 of 6
>> Hello world from process 4 of 6
>> Hello world from process 2 of 6
>> Hello world from process 5 of 6
>> Hello world from process 0 of 6
>> [node66:03997] *** Process received signal ***
>> [node66:03997] Signal: Segmentation fault (11)
>> [node66:03997] Signal code: Address not mapped (1)
>> [node66:03997] Failing at address: (nil)
>> [node66:03997] [ 0] /lib64/libpthread.so.0 [0x2b5e227a4fb0]
>> [node66:03997] [ 1] /usr/lib64/libibverbs.so.1(ibv_destroy_srq+0) [0x2b5e24ee0fa0]
>> [node66:03997] [ 2] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_btl_openib.so [0x2b5e250eb2dd]
>> [node66:03997] [ 3] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_btl_base_close+0x87) [0x2b5e21aa2a67]
>> [node66:03997] [ 4] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_bml_r2.so [0x2b5e24cc39d2]
>> [node66:03997] [ 5] /opt/cluster/software/openmpi/1.3/lib/openmpi/mca_pml_ob1.so [0x2b5e24aa2d0e]
>> [node66:03997] [ 6] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(mca_pml_base_finalize+0x1b) [0x2b5e21aacd2f]
>> [node66:03997] [ 7] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0 [0x2b5e21a66a7b]
>> [node66:03997] [ 8] /opt/cluster/software/openmpi/1.3/lib/libmpi.so.0(MPI_Finalize+0x17) [0x2b5e21a84207]
>> [node66:03997] [ 9] /home/donald/hello.x(main+0x6d) [0x401bd5]
>> [node66:03997] [10] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b5e229cfb54]
>> [node66:03997] [11] /home/donald/hello.x [0x401ad9]
>> [node66:03997] *** End of error message ***
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 5 with PID 3997 on node node66 exited
>> on signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [node72:07895] 4 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
>> [node72:07895] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>
>>
>>
>>
>> Please advise,
>> Thanks and regards,
>> SB