Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenMPI locking up only on IB
From: Brock Palen (brockp_at_[hidden])
Date: 2008-07-03 11:20:32

Ok it looks like a bigger problem. The segfault is not related to
OMPI because when I go and rebuild 1.2 or another version we use with
IB all the time, it will now fail with a segfault when forcing IB.
The old libs of the same version still work. They of course do not
have the flag to turn off early completion.

Was there an older version of OpenMPI that did not suffer from the
early completion problem? We have many installed and for a quick test
latest and greatest would not be of much concern while we track down
the problem on our end.

We are on RHEL4 using the OFED provided by Red Hat. The error is
"address not mapped to object".

Brock Palen
Center for Advanced Computing

On Jul 3, 2008, at 8:38 AM, Jeff Squyres wrote:
> On Jul 2, 2008, at 11:51 PM, Pavel Shamis (Pasha) wrote:
>>> In trying to build 1.2.6 with the pgi compilers it makes an MPI
>>> library that works with tcp, sm. But it segfaults on openib.
>>> Both our intel compiler version and pgi version of 1.2.6 blow up
>>> like this when we force IB. So this is a new issue.
>> I have ompi 1.2.6 installed on my machines with Intel compiler
>> (version 10.1) and Pgi compiler (version 7.1-5), and both of them work
>> with IB without any problem. BTW Mellanox provides a Mellanox OFED
>> binary distribution that includes Intel and Pgi Open MPI 1.2.6 builds.
>> You can download it from here
>> ofed.php
>>> Is there a way to shut off early completion in 1.2.3?
>> Sure, just add "--mca pml_ob1_use_early_completion 0" to your
>> command line.
> Note that this flag was not added until v1.2.6; it has no effect in
> v1.2.3.
>>> Or is the above a known issue and I should use 1.2.7-pre or
>>> grab a 1.3 snapshot?
>> 1.2.6 should be ok.
> The upcoming v1.3 series works a little differently; there's no
> need to use this flag in the v1.3 series (i.e., this flag only
> exists in the v1.2 series starting with v1.2.6).
> --
> Jeff Squyres
> Cisco Systems