Ralph Castain wrote:
> Hi folks
Er, perhaps pronounced "Eugene". :^(
> It looks like the SM revisions we inserted into 1.3.2 are a great
> detector for shared memory init failures
How delicately put! I appreciate the gentleness.
> - it segfaulted 143 times last night on IU's sif computer, 34 times
> on Sun/Linux, and 3 times on Sun/SunOS...almost every single time due
> to "Address not mapped" errors in the sm btl during init.
Any guess as to the frequency, or what it'd take for me to reproduce
this? I ran np=8 MPI_Init() jobs 200K times with 1.3.1 and saw no
failures. I'm starting now with a single-queue version, but wouldn't be
surprised if, again, I can't turn anything up.
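
For anyone who wants to try the same kind of hammering, a repeat-run
harness along these lines is one way to do it. This is a minimal sketch,
not the script actually used for the runs above: the helper name
count_failures and the ./init_test binary (imagined as a program that
only calls MPI_Init() and MPI_Finalize()) are assumptions made for
illustration.

```python
import subprocess
import sys


def count_failures(cmd, runs):
    """Run `cmd` `runs` times and collect (iteration, return code) for
    every run that did not exit cleanly."""
    failures = []
    for i in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative return code means the child was killed by
        # a signal (e.g. -11 for SIGSEGV). mpirun itself may instead
        # report a non-zero positive code when one of its ranks dies.
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures


# Hypothetical command line: ./init_test is an assumed test binary.
EXAMPLE_CMD = ["mpirun", "-np", "8", "./init_test"]
```

Calling count_failures(EXAMPLE_CMD, 1000) would launch the job 1000
times and return the failing iterations, if any. Output is discarded
here to keep the loop quiet; capturing stderr instead would preserve
whatever the failing run printed.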
> Might be worth someone looking at the MTT output stack traces - this
> is something that now appears to be reproducible, and should be
> addressed before we release 1.3.2 as it seems far more likely to
> happen than in the past.
Great (in a weird way, I guess). Can you tell me how to look at the MTT
output stack traces? I found http://www.open-mpi.org/projects/mtt/ but
expect it'll take me a while to wade through that.