I finally had a chance to look at the log file.
Initially all QPs are created on port 1, and at the same time the alternative path is loaded (port 2, lids 4 and 2). I guess at some point you switched off port 1; the APM event is reported because the alternative path is now active, and for some reason an IB message is dropped.
You may ignore the APM warning. Essentially, since the alternative path is active now, OMPI is checking whether it can pre-load the next good path in case of a future failure on port 2. Since port 3 does not exist, it reports the warning.
My educated guess is that for some reason there is no direct connection path between lid 2 and lid 4. To prove it we have to look at the OpenSM routing information.
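As a quick first check before digging into the OpenSM tables, something like the sketch below could be used to trace the route between the two lids in both directions. It assumes the infiniband-diags package is installed so that the `ibtracert` tool is available (it usually needs root), and the helper names `build_trace_command` and `trace_path` are made up for illustration. I have not run this against your fabric.

```python
import shutil
import subprocess


def build_trace_command(src_lid, dst_lid, ca_port=None):
    """Build an ibtracert command line that traces src_lid -> dst_lid.

    ca_port is optional; when given it is passed with -P to select
    the local HCA port the query is sent from.
    """
    cmd = ["ibtracert"]
    if ca_port is not None:
        cmd += ["-P", str(ca_port)]
    cmd += [str(src_lid), str(dst_lid)]
    return cmd


def trace_path(src_lid, dst_lid, ca_port=None):
    """Run ibtracert and return (ok, output).

    ok is False when the tool is missing or exits with a non-zero status.
    """
    cmd = build_trace_command(src_lid, dst_lid, ca_port)
    if shutil.which(cmd[0]) is None:
        return False, "ibtracert not found in PATH"
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # Lids 2 and 4 are the ones seen in the log; check both directions.
    for src, dst in [(2, 4), (4, 2)]:
        ok, output = trace_path(src, dst)
        print(f"lid {src} -> lid {dst}: {'ok' if ok else 'FAILED'}")
        print(output)
```

If the trace fails in either direction, that would support the guess that the subnet manager has no route between the two lids.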
On the mailing list we have a representative from Mellanox who should be able to help us extract the routing information.
Can you please help?
Pavel (Pasha) Shamis
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Feb 29, 2012, at 5:38 PM, Jeremy wrote:
> Hi Pasha,
>> On Wed, Feb 29, 2012 at 11:02 AM, Shamis, Pavel <shamisp_at_[hidden]> wrote:
>> I would like to see the whole file.
>> Is 28MB the size after compression?
>> I think Gmail supports up to 25MB.
>> You may try creating a gzip file and then slicing it with the "split" command.
> See attached. At about line 151311 is when I unplugged the cable from
> Port 1. Then I see the APM error message at about line 178905.