Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2007-01-04 16:06:36


There is some confusion here. I see that you are trying to run using
the MTL, but you have the wrong MCA parameters. In order to activate
the MTL you should have "--mca pml cm --mca mtl mx" on the mpirun
command line. As you can see from your backtrace, it segfaults in the
BTL initialization, which means that you're using the BTL and not the MTL.
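
For example, reusing the hostfile and binary from your runs below
(just a sketch; adjust -np and the paths to your setup):

   mpirun --prefix /usr/local/openmpi-1.2b3r12981/ \
       -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 \
       -np 10 --mca pml cm --mca mtl mx ./cpi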

Second thing: from one of your previous emails, I see that MX is
configured with 4 instances per node, and you're running with exactly
4 processes on the first 2 nodes. Weird things might happen ...

Now, if you use the latest trunk, you can use the new MX BTL, which
provides support for shared memory and self communications. Add "--mca
pml ob1 --mca btl mx --mca btl_mx_shared_mem 1 --mca btl_mx_self 1"
in order to activate these new features. If you have 10G cards, I
suggest you add "--mca btl_mx_flags 2" as well.
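
Again as a sketch with the same hostfile and binary (the install
prefix here is hypothetical; use wherever you installed the trunk):

   mpirun --prefix /path/to/openmpi-trunk \
       -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 \
       -np 10 --mca pml ob1 --mca btl mx \
       --mca btl_mx_shared_mem 1 --mca btl_mx_self 1 ./cpi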

   Thanks,
     george.

PS: Is there any way you can attach to the processes with gdb? I
would like to see the backtrace as shown by gdb in order to figure
out what's wrong there.
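
Something along these lines should do, assuming you can grab the PID
of one of the cpi processes on the failing node (the PID below is a
placeholder):

   gdb -p <pid-of-cpi-process>
   (gdb) bt
   (gdb) thread apply all bt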

On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. (JSC-EV)[ESCG] wrote:

> I've grabbed last night's tarball (1.2b3r12981) and tried using the
> shared mem transport on btl and mx,self on mtl; same results. What I
> don't get is that sometimes it works and sometimes it doesn't (for
> either). For example, I can run it 10 times successfully, then
> increase the -np from 7 to 10 across 3 nodes, and it'll immediately
> fail.
>
> Here's an example of one run right after another.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
> Process 0 of 10 is on node-25
> Process 4 of 10 is on node-26
> Process 1 of 10 is on node-25
> Process 5 of 10 is on node-26
> Process 2 of 10 is on node-25
> Process 8 of 10 is on node-27
> Process 6 of 10 is on node-26
> Process 9 of 10 is on node-27
> Process 7 of 10 is on node-26
> Process 3 of 10 is on node-25
> pi is approximately 3.1415926544231256, Error is 0.0000000008333325
> wall clock time = 0.017513
>
> $ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 --mca mtl mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:(nil)
> [0] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8ddf3ccd3f]
> [1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 [0x2b8ddf3cb891]
> [2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
> [3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8de25bf2af]
> [4] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x5d7) [0x2b8de27dcd27]
> [5] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+0x156) [0x2b8ddf125b46]
> [6] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x11) [0x2b8de26d7491]
> [7] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x7d) [0x2b8ddf12543d]
> [8] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x6b) [0x2b8de23a4f8b]
> [9] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8ddf12cea3]
> [10] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a) [0x2b8ddf0f5bda]
> [11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8ddf116af3]
> [12] func:./cpi(main+0x42) [0x400cd5]
> [13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
> [14] func:./cpi [0x400bd9]
> *** End of error message ***
> mpirun noticed that job rank 0 with PID 0 on node node-25 exited on signal 11.
> 9 additional processes aborted (not shown)
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org] On Behalf Of Brian W. Barrett
> Sent: Tuesday, January 02, 2007 4:11 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Ompi failing on mx only
>
> Sorry to jump into the discussion late. The mx btl does not support
> communication between processes on the same node by itself, so you
> have to include the shared memory transport when using MX. This will
> eventually be fixed, but likely not for the 1.2 release. So if you do:
>
> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi
>
> It should work much better. As for the MTL, there is a bug in the MX
> MTL for v1.2 that could cause the random failures you were seeing; it
> has been fixed, but only after 1.2b2. It will work much better once
> 1.2b3 is released (or, if you are feeling really lucky, you can try
> out the 1.2 nightly tarballs).
>
> The MTL is a new feature in v1.2. It is a different communication
> abstraction designed to support interconnects that have matching
> implemented in the lower-level library or in hardware (Myrinet/MX,
> Portals, and InfiniPath are currently implemented). The MTL allows us
> to exploit the low latency and asynchronous progress these libraries
> can provide, but it does mean multi-NIC abilities are reduced.
> Further, the MTL is not well suited to interconnects like TCP or
> InfiniBand, so we will continue supporting the BTL interface as well.
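>
> In practice, the choice between the two stacks is just a pair of MCA
> parameters on the mpirun command line (a sketch combining the options
> already shown in this thread):
>
>   # MTL path: matching done by the MX library / hardware
>   mpirun --mca pml cm --mca mtl mx ./cpi
>
>   # BTL path: matching done inside Open MPI
>   mpirun --mca pml ob1 --mca btl mx,sm,self ./cpi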
>
> Brian
>
>
> On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. (JSC-EV)[ESCG] wrote:
>
>> About the -x, I've been trying it both ways and prefer the latter,
>> and the results for either are the same. But its value is correct.
>> I've attached the ompi_info from node-1 and node-2. Sorry for not
>> zipping them, but they were small and I think I'd have firewall
>> issues.
>>
>> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
>> [node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
>> [node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
>> ...
>>
>> Is there any info anywhere on the MTL? Anyway, I've run w/ mtl, and
>> it actually worked once. But now I can't reproduce it and it's
>> throwing sig 7's, 11's, and 4's depending upon the number of procs I
>> give it. But now that you mention the mapper, I take it that's what
>> SEGV_MAPERR might be referring to. I'm looking into the
>>
>> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
>> Process 4 of 5 is on node-2
>> Process 0 of 5 is on node-1
>> Process 1 of 5 is on node-1
>> Process 2 of 5 is on node-1
>> Process 3 of 5 is on node-1
>> pi is approximately 3.1415926544231225, Error is 0.0000000008333294
>> wall clock time = 0.019305
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b88243862be
>> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
>> 4 additional processes aborted (not shown)
>>
>> Or sometimes I'll get this error, just depending upon the number of procs ...
>>
>> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
>> Signal:7 info.si_errno:0(Success) si_code:2() Failing at addr:0x2aaaaaaab000
>> [0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
>> [1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
>> [2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
>> [3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
>> [4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
>> [5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
>> [6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
>> [7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
>> [8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
>> [9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
>> [10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
>> [11] func:./cpi(main+0x42) [0x400cd5]
>> [12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
>> [13] func:./cpi [0x400bd9]
>> *** End of error message ***
>> Process 4 of 7 is on node-2
>> Process 5 of 7 is on node-2
>> Process 6 of 7 is on node-2
>> Process 0 of 7 is on node-1
>> Process 1 of 7 is on node-1
>> Process 2 of 7 is on node-1
>> Process 3 of 7 is on node-1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.009331
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b4ba33652be
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b8685aba2be
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR) Failing at addr:0x2b304ffbe2be
>> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
>> 6 additional processes aborted (not shown)
>>
>> Ok, so I take it one is down. Would this be the cause for all the
>> different errors I'm seeing?
>>
>> $ fm_status
>> FMS Fabric status
>>
>> 17 hosts known
>> 16 FMAs found
>> 3 un-ACKed alerts
>> Mapping is complete, last map generated by node-20
>> Database generation not yet complete.
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org] On Behalf Of Reese Faucette
>> Sent: Tuesday, January 02, 2007 2:52 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Ompi failing on mx only
>>
>> Hi, Gary-
>> This looks like a config problem, and not a code problem yet.
>> Could you send the output of mx_info from node-1 and from node-2?
>> Also, forgive me for counter-asking a possibly dumb OMPI question,
>> but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x
>> LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if not
>> specifying a value defaults to this behavior, but I have to ask.)
>>
>> Also, have you tried the MX MTL as opposed to the BTL? "--mca pml cm
>> --mca mtl mx,self" (it looks like you did)
>>
>> "[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff "
>> makes it look like your fabric may not be fully mapped or that you
>> may
>
>> have a down link.
>>
>> thanks,
>> -reese
>> Myricom, Inc.
>>
>> I was initially using 1.1.2 and moved to 1.2b2 because of a hang on
>> MPI_Bcast() which 1.2b2 claims to fix, and seems to have done so.
>> My compute nodes are 2 dual-core Xeons on Myrinet with MX. The
>> problem is trying to get ompi running on mx only. My machine file is
>> as follows ...
>>
>> node-1 slots=4 max-slots=4
>> node-2 slots=4 max-slots=4
>> node-3 slots=4 max-slots=4
>>
>> 'mpirun' with the minimum number of processes in order to get the
>> error ...
>> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
>> --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
>>
>> I don't believe there's anything wrong w/ the hardware, as I can
>> ping over mx between this failed node and the master fine. So I
>> tried a different set of 3 nodes and got the same error; it always
>> fails on the 2nd node of any group of nodes I choose.
>>
>> <node-2.out>
>> <node-1.out>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Brian Barrett
> Open MPI Team, CCS-1
> Los Alamos National Laboratory
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users