You are correct - the Sun errors are in a version prior to the
insertion of the SM changes. We didn't relabel the version to 1.3.2
until -after- those changes went in, so you have to look for anything
with an r number >= 20839.
The sif errors are all in that group - I would suggest starting there.
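If it helps with sorting through the MTT rows, here is a minimal sketch of the r-number check described above. It assumes the version label ends in an "r<number>" suffix (for example "1.3.1rc5r20839"); that label format and the helper names are illustrative assumptions, not something taken from MTT itself.

```python
import re

# Assumption: the version label ends with "r<digits>", e.g. "1.3.1rc5r20839".
R_NUMBER = re.compile(r"r(\d+)$")

# Threshold taken from the message above: the SM changes are in r >= 20839.
SM_CHANGES_MIN_R = 20839


def r_number(version):
    """Return the trailing r number of a version label, or None if absent."""
    match = R_NUMBER.search(version)
    return int(match.group(1)) if match else None


def has_sm_changes(version):
    """True if the label carries an r number at or above the threshold."""
    r = r_number(version)
    return r is not None and r >= SM_CHANGES_MIN_R
```

A label with no r number is treated as not containing the SM changes, which seems like the safer default when deciding which failures to look at first.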
I suspect Josh or someone at IU could tell you the compiler. I would
be very surprised if it wasn't gcc, but I don't know what version. I
suspect they could even find a way to turn on some debugging for you,
if that would help.
The Cisco errors were caused by some config/fabric problems - Jeff is
physically there today, so hopefully those will get fixed and we'll
see his tests. IIRC, he was seeing these problems before, so we should
be able to tell whether they are still present.
On Mar 26, 2009, at 3:25 PM, Eugene Loh wrote:
> Ralph Castain wrote:
>> It looks like the SM revisions we inserted into 1.3.2 are a great
>> detector for shared memory init failures - it segfaulted 143 times
>> last night on IU's sif computer, 34 times on Sun/Linux, and 3 times
>> on Sun/SunOS...almost every single time due to "Address not
>> mapped" errors in the sm btl during init.
>> Might be worth someone looking at the MTT output stack traces - this
>> is something that now appears to be reproducible, and should be
>> addressed before we release 1.3.2 as it seems far more likely to
>> happen than in the past.
> Okay. I looked at http://www.open-mpi.org/mtt/index.php?do_redir=973
> If we start with the 3 Sun/SunOS failures (row #7), these seem to be
> labeled 1.3.1 ("MPI Version"). So, not 1.3.2. And, I don't know
> how to make sense of the stack trace... there is an
> "mca_common_sm_mmap_init" ftruncate problem and stuff apparently
> much later on. How can this be?
> The Sun/Linux problems must be row #6. Yes? Again, the "MPI
> Version" is labeled 1.3.1. Is that informative or misleading? Lots
> of stacks looking like this is happening during MPI_Init. I try
> running a code that just does MPI_Init on similar configs and seem
> unable to trigger this problem.
> How do I figure out the compiler used?
> I need help reproducing this problem.