Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
I can't reproduce them manually, and they seem to happen in only a
very small fraction of overall MTT runs. I'm seeing at least 3
classes of errors:
1. btl_sm_add_procs.c:529 which is this:
j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
appears to have a valid value in it (i.e., successive .fifo entries are
x, x+offset, x+2*offset, and x+3*offset).
But gdb says:
(gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
Cannot access memory at address 0x2a96b73050
I see a fair number of these errors. This is unbelievable to me; if
we have a problem in the startup of the sm btl, how on earth has it
escaped for so long?
2. btl_sm_component.c:430 which is this:
reg->cbfunc(&mca_btl_sm.super, hdr->tag, &(Frag.base),
reg->cbfunc == NULL in this case. I only see a few of these.
3. ompi_fifo.h:422 which is this:
return_value = ompi_cb_fifo_read_from_tail(&fifo->tail->cb_fifo,
fifo->tail points to memory that gdb says we cannot access. I only
see a few of these.
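For the second class of error (reg->cbfunc == NULL), here is a minimal sketch of a guarded dispatch that reports the missing registration instead of jumping through a NULL pointer. This is not the sm btl code: recv_reg_t, count_cb, and dispatch_fragment are hypothetical names made up for illustration, and the real fix may well be in the registration ordering rather than at the call site.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a per-tag receive registration;
 * this is NOT the Open MPI structure. */
typedef void (*recv_cb_t)(int tag, void *frag, void *cbdata);

typedef struct {
    recv_cb_t cbfunc;   /* NULL means nothing was registered for the tag */
    void *cbdata;
} recv_reg_t;

/* Example callback: counts how many fragments were delivered. */
void count_cb(int tag, void *frag, void *cbdata)
{
    (void)tag;
    (void)frag;
    ++*(int *)cbdata;
}

/* Returns 0 if the callback ran, -1 if the tag is out of range or
 * no callback is registered for it. */
int dispatch_fragment(const recv_reg_t *table, size_t ntags,
                      int tag, void *frag)
{
    if (tag < 0 || (size_t)tag >= ntags) {
        fprintf(stderr, "tag %d out of range\n", tag);
        return -1;
    }
    const recv_reg_t *reg = &table[tag];
    if (reg->cbfunc == NULL) {
        /* Report the problem rather than segv'ing on the call. */
        fprintf(stderr, "no callback registered for tag %d\n", tag);
        return -1;
    }
    reg->cbfunc(tag, frag, reg->cbdata);
    return 0;
}
```

A check like this only turns the crash into a diagnosable error; it doesn't explain why the registration is missing in the first place.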
I'm running on RHEL4U6 with a variety of different classes of Xeon
machines. In one particular run, they were slightly older Xeon
machines, 2 core/2 socket machines.
I also found a segv in ibm/environment/finalize where a strlen() was
segv'ing, but I'm unable to diagnose that any further because the
char* argument passed to an asprintf() is the return value of a
function that should *never* be NULL. :-\
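One way to make a failure like this show up in the output rather than as a segv is to substitute a marker for a NULL string before formatting. A rough sketch follows, with assumptions stated up front: str_or_marker and format_pair are hypothetical helpers, not anything in the tree, and I've used a two-pass snprintf() instead of asprintf() so that the example stays within standard C.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: replace a NULL string with a visible marker so
 * that a "can't happen" NULL is printed instead of crashing strlen(). */
const char *str_or_marker(const char *s)
{
    return (s != NULL) ? s : "(null)";
}

/* Builds "<prefix>: <value>" in freshly allocated memory.
 * Returns NULL on formatting or allocation failure; the caller frees. */
char *format_pair(const char *prefix, const char *value)
{
    const char *p = str_or_marker(prefix);
    const char *v = str_or_marker(value);

    /* First pass: measure the output without writing anything. */
    int needed = snprintf(NULL, 0, "%s: %s", p, v);
    if (needed < 0) {
        return NULL;
    }

    char *out = malloc((size_t)needed + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Second pass: write into the correctly sized buffer. */
    snprintf(out, (size_t)needed + 1, "%s: %s", p, v);
    return out;
}
```

With this, format_pair("host", NULL) yields the string "host: (null)", which at least tells you which argument went missing.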
The one thing that these failures have in common is that they all
appear to come from builds compiled with icc. Here's the configure line:
CC=icc CXX=icpc F77=ifort FC=ifort "CFLAGS=-g -wd188" --enable-picky
--enable-debug --enable-mpirun-prefix-by-default --disable-dlopen
Here's a run line, but the MCA parameters vary wildly across the
failing tests (remember that I run 20+ variants of each test at
Cisco):
mpirun -np 8 --mca btl_openib_device_type ib --mca btl
Here's a slice of an MTT report that shows the problem:
(ignore any "svbu-mpiXXX - daemon did not report back when launched"
errors; that's SLURM mucking up)
I'm digging further... But help on this would be appreciated...