
Open MPI User's Mailing List Archives


From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-01-04 14:18:23


FWIW, I think we may have broken something in last night's tarball
(this just came up on an internal development list, too). That is,
someone broke something that was fixed a little while later, but the
nightly tarball was created before the fix went in.

Sorry about that. :-( Such is the nature of nightly snapshots...

On Jan 4, 2007, at 2:00 PM, Grobe, Gary L. (JSC-EV) [ESCG] wrote:

> I've grabbed last night's tarball (1.2b3r12981) and tried using the
> shared memory transport on the btl and mx,self on the mtl; same
> results. What I don't get is that sometimes it works and sometimes it
> doesn't (for either). For example, I can run it 10 times successfully,
> then increase the -np from 7 to 10 across 3 nodes, and it'll
> immediately fail.
>
> Here's an example of one run right after another.
>
> $ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x \
>     LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 \
>     --mca mtl mx,self ./cpi
> Process 0 of 10 is on node-25
> Process 4 of 10 is on node-26
> Process 1 of 10 is on node-25
> Process 5 of 10 is on node-26
> Process 2 of 10 is on node-25
> Process 8 of 10 is on node-27
> Process 6 of 10 is on node-26
> Process 9 of 10 is on node-27
> Process 7 of 10 is on node-26
> Process 3 of 10 is on node-25
> pi is approximately 3.1415926544231256, Error is 0.0000000008333325
> wall clock time = 0.017513
>
> $ mpirun --prefix /usr/local/openmpi-1.2b3r12981/ -x \
>     LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h25-27 -np 10 \
>     --mca mtl mx,self ./cpi
> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> Failing at addr:(nil)
> [0] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b8ddf3ccd3f]
> [1] func:/usr/local/openmpi-1.2b3r12981/lib/libopen-pal.so.0 [0x2b8ddf3cb891]
> [2] func:/lib/libpthread.so.0 [0x2b8ddf98f6c0]
> [3] func:/opt/mx/lib/libmyriexpress.so(mx_open_endpoint+0x6df) [0x2b8de25bf2af]
> [4] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_btl_mx.so(mca_btl_mx_component_init+0x5d7) [0x2b8de27dcd27]
> [5] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_btl_base_select+0x156) [0x2b8ddf125b46]
> [6] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x11) [0x2b8de26d7491]
> [7] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_bml_base_init+0x7d) [0x2b8ddf12543d]
> [8] func:/usr/local/openmpi-1.2b3r12981/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_component_init+0x6b) [0x2b8de23a4f8b]
> [9] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(mca_pml_base_select+0x113) [0x2b8ddf12cea3]
> [10] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(ompi_mpi_init+0x45a) [0x2b8ddf0f5bda]
> [11] func:/usr/local/openmpi-1.2b3r12981/lib/libmpi.so.0(MPI_Init+0x83) [0x2b8ddf116af3]
> [12] func:./cpi(main+0x42) [0x400cd5]
> [13] func:/lib/libc.so.6(__libc_start_main+0xe3) [0x2b8ddfab50e3]
> [14] func:./cpi [0x400bd9]
> *** End of error message ***
> mpirun noticed that job rank 0 with PID 0 on node node-25 exited on
> signal 11.
> 9 additional processes aborted (not shown)
>
> -----Original Message-----
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org]
> On Behalf Of Brian W. Barrett
> Sent: Tuesday, January 02, 2007 4:11 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Ompi failing on mx only
>
> Sorry to jump into the discussion late. The mx btl does not support
> communication between processes on the same node by itself, so you
> have to include the shared memory transport when using MX. This will
> eventually be fixed, but likely not for the 1.2 release. So if you do:
>
> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
>     --hostfile ./h1-3 -np 2 --mca btl mx,sm,self ./cpi
>
> It should work much better. As for the MTL, there is a bug in the MX
> MTL for v1.2 that could cause the random failures you were seeing; it
> has been fixed, but not until after 1.2b2. It will work much better
> once 1.2b3 is released (or, if you are feeling really lucky, you can
> try out the 1.2 nightly tarballs).
>
> The MTL is a new feature in v1.2. It is a different communication
> abstraction designed to support interconnects that have matching
> implemented in the lower-level library or in hardware (Myrinet/MX,
> Portals, and InfiniPath are currently implemented). The MTL allows us
> to exploit the low latency and asynchronous progress these libraries
> can provide, but it does mean multi-NIC abilities are reduced.
> Further, the MTL is not well suited to interconnects like TCP or
> InfiniBand, so we will continue supporting the BTL interface as well.
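>
> For example (a sketch only, using the component names that appear
> elsewhere in this thread), the two paths can be selected explicitly
> on the mpirun command line:
>
>     mpirun --mca pml ob1 --mca btl mx,sm,self ./cpi   # BTL path
>     mpirun --mca pml cm  --mca mtl mx,self ./cpi      # MTL path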
>
> Brian
>
>
> On Jan 2, 2007, at 2:44 PM, Grobe, Gary L. (JSC-EV) [ESCG] wrote:
>
>> About the -x, I've been trying it both ways and prefer the latter,
>> and results for either are the same. But its value is correct.
>> I've attached the ompi_info from node-1 and node-2. Sorry for not
>> zipping them, but they were small and I think I'd have firewall
>> issues.
>>
>> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH \
>>     --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
>> [node-14:19260] mx_connect fail for node-14:0 with key aaaaffff
>>     (error Endpoint closed or not connectable!)
>> [node-14:19261] mx_connect fail for node-14:0 with key aaaaffff
>>     (error Endpoint closed or not connectable!)
>> ...
>>
>> Is there any info anywhere on the MTL? Anyway, I've run with the mtl,
>> and it actually worked once. But now I can't reproduce it, and it's
>> throwing sig 7's, 11's, and 4's depending upon the number of procs I
>> give it. But now that you mention the mapper, I take it that's what
>> SEGV_MAPERR might be referring to. I'm looking into the
>>
>> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x \
>>     LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 \
>>     --mca mtl mx,self ./cpi
>> Process 4 of 5 is on node-2
>> Process 0 of 5 is on node-1
>> Process 1 of 5 is on node-1
>> Process 2 of 5 is on node-1
>> Process 3 of 5 is on node-1
>> pi is approximately 3.1415926544231225, Error is 0.0000000008333294
>> wall clock time = 0.019305
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x2b88243862be
>> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
>> signal 1.
>> 4 additional processes aborted (not shown)
>>
>> Or sometimes I'll get this error, just depending upon the number of
>> procs ...
>>
>> $ mpirun --prefix /usr/local/openmpi-1.2b2 -x \
>>     LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 \
>>     --mca mtl mx,self ./cpi
>> Signal:7 info.si_errno:0(Success) si_code:2()
>> Failing at addr:0x2aaaaaaab000
>> [0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
>> [1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
>> [2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
>> [3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
>> [4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
>> [5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
>> [6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
>> [7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
>> [8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
>> [9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
>> [10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
>> [11] func:./cpi(main+0x42) [0x400cd5]
>> [12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
>> [13] func:./cpi [0x400bd9]
>> *** End of error message ***
>> Process 4 of 7 is on node-2
>> Process 5 of 7 is on node-2
>> Process 6 of 7 is on node-2
>> Process 0 of 7 is on node-1
>> Process 1 of 7 is on node-1
>> Process 2 of 7 is on node-1
>> Process 3 of 7 is on node-1
>> pi is approximately 3.1415926544231239, Error is 0.0000000008333307
>> wall clock time = 0.009331
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x2b4ba33652be
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x2b8685aba2be
>> Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
>> Failing at addr:0x2b304ffbe2be
>> mpirun noticed that job rank 0 with PID 0 on node node-1 exited on
>> signal 1.
>> 6 additional processes aborted (not shown)
>>
>> Ok, so I take it one is down. Would this be the cause for all the
>> different errors I'm seeing?
>>
>> $ fm_status
>> FMS Fabric status
>>
>> 17 hosts known
>> 16 FMAs found
>> 3 un-ACKed alerts
>> Mapping is complete, last map generated by node-20
>> Database generation not yet complete.
>>
>>
>> From: users-bounces_at_[hidden] [mailto:users-bounces_at_open-mpi.org]
>> On Behalf Of Reese Faucette
>> Sent: Tuesday, January 02, 2007 2:52 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Ompi failing on mx only
>>
>> Hi, Gary-
>> This looks like a config problem, and not a code problem yet.
>> Could you send the output of mx_info from node-1 and from node-2?
>> Also, forgive me counter-asking a possibly dumb OMPI question, but is
>> "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x
>> LD_LIBRARY_PATH=${LD_LIBRARY_PATH}" ? (I would not be surprised if
>> not specifying a value defaults to this behavior, but have to ask).
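>>
>> For example (just a sketch of the two spellings in question; whether
>> the no-value form forwards the caller's environment may vary by
>> version):
>>
>>     mpirun -x LD_LIBRARY_PATH ...                      # forward caller's value
>>     mpirun -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} ...   # pass value explicitly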
>>
>> Also, have you tried MX MTL as opposed to BTL? --mca pml cm --mca
>> mtl mx,self (it looks like you did)
>>
>> "[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff "
>> makes it look like your fabric may not be fully mapped or that you
>> may
>
>> have a down link.
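>>
>> For example (a quick check; exact output varies by MX/FMS version),
>> the tools already mentioned in this thread can confirm both:
>>
>>     mx_info     # on node-1 and node-2: check link and mapping status
>>     fm_status   # on the mapper node: confirm the fabric map is complete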
>>
>> thanks,
>> -reese
>> Myricom, Inc.
>>
>> I was initially using 1.1.2 and moved to 1.2b2 because of a hang in
>> MPI_Bcast() which 1.2b2 reportedly fixes, and seems to have done so.
>> My compute nodes are 2 dual-core Xeons on Myrinet with MX. The
>> problem is trying to get ompi running on mx only. My machine file is
>> as follows ...
>>
>> node-1 slots=4 max-slots=4
>> node-2 slots=4 max-slots=4
>> node-3 slots=4 max-slots=4
>>
>> 'mpirun' with the minimum number of processes in order to get the
>> error ...
>> mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH
>> --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
>>
>> I don't believe there's anything wrong with the hardware, as I can
>> ping over mx between this failed node and the master fine. So I tried
>> a different set of 3 nodes and got the same error; it always fails on
>> the 2nd node of any group of nodes I choose.
>>
>> <node-2.out>
>> <node-1.out>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Brian Barrett
> Open MPI Team, CCS-1
> Los Alamos National Laboratory
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems