About the -x: I've been trying it both ways and prefer the latter; the results are the same either way, and its value is correct. I've attached the ompi_info output from node-1 and node-2. Sorry for not zipping them, but they were small and I thought I'd have firewall issues.
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
...
 
Is there any info anywhere on the MTL? Anyway, I've run with the MTL, and it actually worked once, but now I can't reproduce it and it's throwing signal 7s, 11s, and 4s depending upon the number of procs I give it. Now that you mention the mapper, I take it that's what SEGV_MAPERR might be referring to. I'm looking into the following:
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.0000000008333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)
Or sometimes I'll get this error, just depending upon the number of procs ...
 
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaaaaaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)
 
OK, so I take it one is down (17 hosts known, but only 16 FMAs found). Would this be the cause of all the different errors I'm seeing?
 
$ fm_status
FMS Fabric status
 
17      hosts known
16      FMAs found
3       un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.

 

From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Reese Faucette
Sent: Tuesday, January 02, 2007 2:52 PM
To: Open MPI Users
Subject: Re: [OMPI users] Ompi failing on mx only

Hi, Gary-
This looks like a config problem, and not a code problem yet.  Could you send the output of mx_info from node-1 and from node-2?  Also, forgive me for counter-asking a possibly dumb OMPI question, but is "-x LD_LIBRARY_PATH" really what you want, as opposed to "-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"?  (I would not be surprised if not specifying a value defaults to this behavior, but I have to ask.)
 
Also, have you tried MX MTL as opposed to BTL?  --mca pml cm --mca mtl mx,self  (it looks like you did)
 
"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff " makes it look like your fabric may not be fully mapped or that you may have a down link.
 
thanks,
-reese
Myricom, Inc.

I was initially using 1.1.2 and moved to 1.2b2 because of a hang in MPI_Bcast() which 1.2b2 reportedly fixes, and it seems to have done so. My compute nodes each have two dual-core Xeons on Myrinet with MX. The problem is getting OMPI running on MX only. My machine file is as follows …

node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4

Running mpirun with the minimum number of processes needed to produce the error ...
        mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi

I don't believe there's anything wrong with the hardware, as I can ping over MX between this failed node and the master just fine. So I tried a different set of 3 nodes and got the same error; it always fails on the 2nd node of whatever group of nodes I choose.