Open MPI Development Mailing List Archives

Subject: [OMPI devel] 1.3.1 -- bad MTT from Cisco
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-10 21:13:23


Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1 MTT. :-(
I can't reproduce them manually, and they seem to happen in only a
very small fraction of overall MTT runs. I'm seeing at least 3
classes of errors:

1. btl_sm_add_procs.c:529 which is this:

        if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {

j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
appears to have a valid value in it (i.e., .fifo[3][0] = x,
.fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
.fifo[3][3] = x+3*offset). But gdb says:

(gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
Cannot access memory at address 0x2a96b73050

I see a fair number of these errors. This is unbelievable to me; if
we have a problem in the startup of the sm btl, how on earth has it
escaped for so long?
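
One way to reason about this class of failure is to check whether a pointer that looks plausible actually lies inside the mapped shared-memory segment. The following is only a minimal sketch under stated assumptions: `seg_range_t` and `addr_in_segment()` are hypothetical names that do not exist in Open MPI, and the segment base and size would have to come from wherever the mapping is recorded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one mapped shared-memory segment
 * (not an Open MPI type). */
typedef struct {
    uintptr_t base;  /* address of the first mapped byte */
    size_t    size;  /* length of the mapping in bytes */
} seg_range_t;

/* Returns 1 if the byte range [addr, addr + len) lies entirely inside
 * the segment, 0 otherwise.  Written to avoid integer overflow. */
static int addr_in_segment(const seg_range_t *seg, uintptr_t addr, size_t len)
{
    assert(seg != NULL);
    if (addr < seg->base) {
        return 0;                       /* starts before the mapping */
    }
    uintptr_t off = addr - seg->base;   /* offset into the mapping */
    if (off > seg->size) {
        return 0;                       /* starts past the end */
    }
    return len <= seg->size - off;      /* does the whole range fit? */
}
```

A pointer such as x+offset can be well-formed as arithmetic and still fail this test if the segment it was derived from is smaller than expected, or was mapped at a different base in this process.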

2. btl_sm_component.c:430 which is this:

                 reg->cbfunc(&mca_btl_sm.super, hdr->tag, &(Frag.base),
                             reg->cbdata);

reg->cbfunc == NULL in this case. I only see a few of these.
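
To illustrate the failure mode (not the actual sm btl code): a tag-indexed callback table is dispatched without checking whether a callback was ever registered for that tag. The sketch below assumes a hypothetical table `recv_table`, a hypothetical `dispatch_frag()` helper, and a sample callback `count_cb`; the real Open MPI types and signatures differ.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAGS 256  /* hypothetical: one slot per possible 8-bit tag */

/* Hypothetical receive-callback signature (simplified). */
typedef void (*recv_cb_fn_t)(unsigned char tag, void *frag, void *cbdata);

typedef struct {
    recv_cb_fn_t cbfunc;  /* NULL until something registers for the tag */
    void        *cbdata;  /* opaque pointer handed back to the callback */
} recv_reg_t;

/* Static storage is zero-initialized, so every cbfunc starts out NULL. */
static recv_reg_t recv_table[MAX_TAGS];

/* Sample callback: counts invocations through cbdata. */
static void count_cb(unsigned char tag, void *frag, void *cbdata)
{
    (void)tag;
    (void)frag;
    assert(cbdata != NULL);
    ++*(int *)cbdata;
}

/* Dispatch a fragment by tag.  Returns 0 if a callback ran, or -1 if
 * nothing is registered (the case that would otherwise segfault). */
static int dispatch_frag(unsigned char tag, void *frag)
{
    const recv_reg_t *reg = &recv_table[tag];
    if (NULL == reg->cbfunc) {
        return -1;
    }
    reg->cbfunc(tag, frag, reg->cbdata);
    return 0;
}
```

A guard like this would only turn the crash into a detectable error; it would not explain why a fragment arrived with a tag that nobody registered, which could just as well be a corrupted header.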

3. ompi_fifo.h:422 which is this:

     return_value = ompi_cb_fifo_read_from_tail(&fifo->tail->cb_fifo,
             fifo->tail->cb_overflow, &queue_empty);

fifo->tail points to memory that gdb says we cannot access. I only
see a few of these.

I'm running on RHEL4U6 with a variety of different classes of Xeon
machines. In one particular run, they were slightly older Xeon
machines, 2 core/2 socket machines.

I also found a segv in ibm/environment/finalize where a strlen() was
segv'ing, but I'm unable to diagnose that any further because the
char* argument passed to an asprintf() is the return value of a
function that should *never* be NULL. :-\

The one thing that these failures have in common is that they all
appear to be compiled by icc. Here's the configure line:

     CC=icc CXX=icpc F77=ifort FC=ifort "CFLAGS=-g -wd188" --enable-picky --enable-debug --enable-mpirun-prefix-by-default --disable-dlopen

Here's a run line, but the MCA parameters vary wildly across the
tests that are failing (remember that I run 20+ variants of each
test at Cisco):

     mpirun -np 8 --mca btl_openib_device_type ib --mca btl sm,openib,self pt2pt/allocmem

Here's a slice of an MTT report that shows the problem:

     http://www.open-mpi.org/mtt/index.php?do_redir=970

(ignore any "svbu-mpiXXX - daemon did not report back when launched"
errors; that's SLURM mucking up)

I'm digging further... But help on this would be appreciated...

-- 
Jeff Squyres
Cisco Systems