About the -x: I've tried it both ways and prefer the latter; the
results are the same either way, and its value is correct. I've
attached the ompi_info output from node-1 and node-2. Sorry for not
zipping them, but they were small and I think I'd have firewall
issues.
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h13-15 -np 6 --mca pml cm ./cpi
[node-14:19260] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
[node-14:19261] mx_connect fail for node-14:0 with key aaaaffff (error Endpoint closed or not connectable!)
...
Is there any info anywhere on MTL? Anyway, I've run with the MTL, and
it actually worked once. But now I can't reproduce that, and it's
throwing signals 7, 11, and 4 depending on the number of procs I give
it. Now that you mention the mapper, I take it that's what SEGV_MAPERR
might be referring to. I'm looking into the
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 5 --mca mtl mx,self ./cpi
Process 4 of 5 is on node-2
Process 0 of 5 is on node-1
Process 1 of 5 is on node-1
Process 2 of 5 is on node-1
Process 3 of 5 is on node-1
pi is approximately 3.1415926544231225, Error is 0.0000000008333294
wall clock time = 0.019305
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b88243862be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
4 additional processes aborted (not shown)
Or sometimes I'll get this error, depending on the number of procs ...
$ mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH=${LD_LIBRARY_PATH} --hostfile ./h1-3 -np 7 --mca mtl mx,self ./cpi
Signal:7 info.si_errno:0(Success) si_code:2()
Failing at addr:0x2aaaaaaab000
[0] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0(opal_backtrace_print+0x1f) [0x2b9b7fa52d1f]
[1] func:/usr/local/openmpi-1.2b2/lib/libopen-pal.so.0 [0x2b9b7fa51871]
[2] func:/lib/libpthread.so.0 [0x2b9b80013d00]
[3] func:/usr/local/openmpi-1.2b2/lib/libmca_common_sm.so.0(mca_common_sm_mmap_init+0x1e3) [0x2b9b8270ef83]
[4] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_mpool_sm.so [0x2b9b8260d0ff]
[5] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(mca_mpool_base_module_create+0x70) [0x2b9b7f7afac0]
[6] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_btl_sm.so(mca_btl_sm_add_procs_same_base_addr+0x907) [0x2b9b83070517]
[7] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_bml_r2.so(mca_bml_r2_add_procs+0x206) [0x2b9b82d5f576]
[8] func:/usr/local/openmpi-1.2b2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_add_procs+0xe3) [0x2b9b82a2d0a3]
[9] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(ompi_mpi_init+0x697) [0x2b9b7f77be07]
[10] func:/usr/local/openmpi-1.2b2/lib/libmpi.so.0(MPI_Init+0x83) [0x2b9b7f79c943]
[11] func:./cpi(main+0x42) [0x400cd5]
[12] func:/lib/libc.so.6(__libc_start_main+0xf4) [0x2b9b8013a134]
[13] func:./cpi [0x400bd9]
*** End of error message ***
Process 4 of 7 is on node-2
Process 5 of 7 is on node-2
Process 6 of 7 is on node-2
Process 0 of 7 is on node-1
Process 1 of 7 is on node-1
Process 2 of 7 is on node-1
Process 3 of 7 is on node-1
pi is approximately 3.1415926544231239, Error is 0.0000000008333307
wall clock time = 0.009331
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b4ba33652be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b8685aba2be
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x2b304ffbe2be
mpirun noticed that job rank 0 with PID 0 on node node-1 exited on signal 1.
6 additional processes aborted (not shown)
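For reading the signal numbers in the traces above, here is a minimal Python sketch that turns a signal number and si_code into names. It assumes the Linux x86 numbering and the si_code values from the Linux signal headers; the tables and the describe() helper are illustrative only and are not part of Open MPI or MX. Going by those definitions, SEGV_MAPERR just means the faulting address was not mapped in the process, so on its own it probably does not point at the fabric mapper.

```python
# Assumption: Linux x86 signal numbers (they differ on other platforms).
LINUX_SIGNALS = {1: "SIGHUP", 4: "SIGILL", 7: "SIGBUS", 11: "SIGSEGV"}

# si_code meanings for the two signals that carry one in the logs above.
SI_CODES = {
    "SIGSEGV": {
        1: "SEGV_MAPERR (address not mapped to object)",
        2: "SEGV_ACCERR (invalid permissions for mapped object)",
    },
    "SIGBUS": {
        1: "BUS_ADRALN (invalid address alignment)",
        2: "BUS_ADRERR (nonexistent physical address)",
        3: "BUS_OBJERR (object-specific hardware error)",
    },
}


def describe(signum, si_code=None):
    """Return a readable name for a signal number and optional si_code."""
    name = LINUX_SIGNALS.get(signum, "signal %d" % signum)
    detail = SI_CODES.get(name, {}).get(si_code)
    return name if detail is None else "%s, %s" % (name, detail)


# The (signal, si_code) pairs that appear in this thread.
for signum, code in [(11, 1), (7, 2), (4, None)]:
    print(signum, "->", describe(signum, code))
```

Under these assumptions, the "Signal:7 ... si_code:2()" trace is a SIGBUS with BUS_ADRERR raised inside mca_common_sm_mmap_init, and "sig 4" is SIGILL.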
Ok, so I take it one is down. Would this be the cause of all the
different errors I'm seeing?
$ fm_status
FMS Fabric status
17 hosts known
16 FMAs found
3 un-ACKed alerts
Mapping is complete, last map generated by node-20
Database generation not yet complete.
Hi, Gary-
This looks like a config problem, and not a code problem yet. Could
you send the output of mx_info from node-1 and from node-2? Also,
forgive me for counter-asking a possibly dumb OMPI question, but is
"-x LD_LIBRARY_PATH" really what you want, as opposed to
"-x LD_LIBRARY_PATH=${LD_LIBRARY_PATH}"? (I would not be surprised if
not specifying a value defaults to this behavior, but I have to ask.)
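To make the two "-x" spellings concrete, here is a small Python sketch that builds both argument lists. It assumes the behavior described in the mpirun man page: a bare "-x NAME" forwards the value mpirun finds in its own environment, while "NAME=${NAME}" is expanded by the launching shell before mpirun ever sees it. The helper names and the /opt/mx/lib value are hypothetical and only serve the illustration.

```python
import os


def x_args(name, explicit=False, env=os.environ):
    """Return the '-x' arguments that forward one environment variable."""
    if explicit:
        # Like '-x NAME=${NAME}': the value is resolved locally, up front.
        return ["-x", "%s=%s" % (name, env.get(name, ""))]
    # Like '-x NAME': mpirun looks the value up in its own environment.
    return ["-x", name]


def mpirun_argv(x, hostfile="./h1-3", nprocs=2):
    """Assemble an mpirun command line using only flags seen in this thread."""
    return (["mpirun", "--prefix", "/usr/local/openmpi-1.2b2"] + x +
            ["--hostfile", hostfile, "-np", str(nprocs),
             "--mca", "btl", "mx,self", "./cpi"])


demo_env = {"LD_LIBRARY_PATH": "/opt/mx/lib"}  # hypothetical value
print(" ".join(mpirun_argv(x_args("LD_LIBRARY_PATH", env=demo_env))))
print(" ".join(mpirun_argv(x_args("LD_LIBRARY_PATH", explicit=True, env=demo_env))))
```

If the variable is set in the shell that launches mpirun, both forms should end up exporting the same value, which would match the report that results are the same either way.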
Also, have you tried the MX MTL as opposed to the BTL?
--mca pml cm --mca mtl mx,self (it looks like you did)
"[node-2:10464] mx_connect fail for node-2:0 with key aaaaffff" makes
it look like your fabric may not be fully mapped or that you may have
a down link.
thanks,
-reese
Myricom, Inc.
I was initially using 1.1.2 and moved to 1.2b2 because of a hang in
MPI_Bcast() which 1.2b2 is reported to fix, and it seems to have done
so. My compute nodes each have 2 dual-core Xeons, on Myrinet with MX.
The problem is trying to get ompi running on mx only. My machine file
is as follows …
node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4
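As a rough illustration of how a machine file like this maps onto ranks, here is a minimal Python sketch. It assumes ranks are placed "by slot", filling each node's slots before moving to the next node, which is consistent with the "Process N of 5 is on node-X" output earlier in the thread. The function names are illustrative and not part of Open MPI.

```python
def parse_hostfile(text):
    """Parse 'name slots=N max-slots=M' lines into (name, slots) pairs."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, *options = line.split()
        opts = dict(opt.split("=", 1) for opt in options if "=" in opt)
        hosts.append((name, int(opts.get("slots", 1))))
    return hosts


def map_by_slot(hosts, nprocs):
    """Assign ranks 0..nprocs-1 to hosts, filling each host's slots in order."""
    placement = []
    for name, slots in hosts:
        for _ in range(slots):
            if len(placement) == nprocs:
                return placement
            placement.append(name)
    if len(placement) < nprocs:
        raise ValueError("not enough slots for %d processes" % nprocs)
    return placement


HOSTFILE = """\
node-1 slots=4 max-slots=4
node-2 slots=4 max-slots=4
node-3 slots=4 max-slots=4
"""

hosts = parse_hostfile(HOSTFILE)
for rank, host in enumerate(map_by_slot(hosts, 5)):
    print("rank %d -> %s" % (rank, host))
```

Under this assumption, a run with 4 or fewer processes stays entirely on node-1, and the second node is only used from the fifth process onward.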
Here is 'mpirun' with the minimum number of processes needed to get
the error ...
mpirun --prefix /usr/local/openmpi-1.2b2 -x LD_LIBRARY_PATH --hostfile ./h1-3 -np 2 --mca btl mx,self ./cpi
I don't believe there's anything wrong with the hardware, as I can
ping over mx between this failed node and the master fine. So I tried
a different set of 3 nodes and got the same error; it always fails on
the 2nd node of any group of nodes I choose.