
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem with openmpi and infiniband
From: Biagio Lucini (B.Lucini_at_[hidden])
Date: 2009-01-07 18:28:52


The test was in fact OK; I have also verified it on 30 processors.
Meanwhile I have tried OMPI 1.3 RC2, with which the application fails on
InfiniBand; I hope this will give some clue (or at least be useful for
finalising the release of Open MPI 1.3). I remind the mailing list that I
use the OFED 1.2.5 release. The only change with respect to the last time
is the use of OMPI 1.3 RC2 instead of 1.2.8. To avoid boring the mailing
list, I don't repeat details I have already provided (such as the
command-line parameters), on which we seem to have agreed that there is
no problem. However, if you want to know more, please ask.

The error file as produced by SGE is attached.

Thanks,
Biagio

Lenny Verkhovsky wrote:
> Hi, just to make sure:
>
> You wrote in the previous mail that you tested IMB-MPI1 and that it
> "reports for the last test" ... with results for "#processes = 6". Since
> you have 4- and 8-core machines, that test could have run entirely on a
> single 8-core machine over shared memory rather than over InfiniBand,
> as you suspected.
>
> You can rerun the IMB-MPI1 test with -mca btl self,openib to be sure
> that the test does not use shared memory or tcp.
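>
> For example (the machine file name below is only a placeholder; one
> process per node ensures every message has to cross the fabric):
>
>   mpirun -np 6 -machinefile hosts_one_per_node \
>       -mca btl self,openib ./IMB-MPI1 Barrier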
>
> Lenny.
>
>
>
> On 12/24/08, Biagio Lucini <B.Lucini_at_[hidden]> wrote:
>
>> Pavel Shamis (Pasha) wrote:
>>
>>
>>> Biagio Lucini wrote:
>>>
>>>
>>>> Hello,
>>>>
>>>> I am new to this list, where I hope to find a solution for a problem
>>>> that I have been having for quite a long time.
>>>>
>>>> I run various versions of Open MPI (from 1.1.2 to 1.2.8) on a cluster
>>>> with InfiniBand interconnects that I both use and administer. The
>>>> OpenFabrics stack is OFED-1.2.5, the compilers are gcc 4.2 and Intel.
>>>> The queue manager is SGE 6.0u8.
>>>>
>>>>
>>> Do you use the Open MPI version that is included in OFED? Were you able
>>> to run basic OFED/OMPI tests/benchmarks between two nodes?
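>>>
>>> For reference, one minimal verbs-level check between two nodes is the
>>> ibv_rc_pingpong utility shipped with libibverbs; the host names below
>>> are only placeholders:
>>>
>>>   node_a$ ibv_rc_pingpong          # start the server side
>>>   node_b$ ibv_rc_pingpong node_a   # point the client at the server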
>>>
>>>
>>>
>> Hi,
>>
>> yes to both questions: the OMPI version is the one that comes with OFED
>> (1.1.2-1) and the basic tests run fine. For instance, IMB-MPI1 (which is
>> more than basic, as far as I can see) reports for the last test:
>>
>> #---------------------------------------------------
>> # Benchmarking Barrier
>> # #processes = 6
>> #---------------------------------------------------
>> #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]
>>         1000         22.93         22.95         22.94
>>
>>
>> for the openib,self btl (6 processes, all processes on different nodes)
>> and
>>
>> #---------------------------------------------------
>> # Benchmarking Barrier
>> # #processes = 6
>> #---------------------------------------------------
>> #repetitions   t_min[usec]   t_max[usec]   t_avg[usec]
>>         1000        191.30        191.42        191.34
>>
>> for the tcp,self btl (same test)
>>
>> No anomalies for other tests (ping-pong, all-to-all etc.)
>>
>> Thanks,
>> Biagio
>>
>>
>> --
>> =========================================================
>>
>> Dr. Biagio Lucini
>> Department of Physics, Swansea University
>> Singleton Park, SA2 8PP Swansea (UK)
>> Tel. +44 (0)1792 602284
>>
>> =========================================================
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[[5963,1],13][btl_openib_component.c:2893:handle_wc] from node24 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],12][btl_openib_component.c:2893:handle_wc] from node23 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],8][btl_openib_component.c:2893:handle_wc] from node9 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],11][btl_openib_component.c:2893:handle_wc] from node20 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],9][btl_openib_component.c:2893:handle_wc] from node18 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],4][btl_openib_component.c:2893:handle_wc] from node13 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],3][btl_openib_component.c:2893:handle_wc] from node12 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],6][btl_openib_component.c:2893:handle_wc] from node15 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],1][btl_openib_component.c:2893:handle_wc] from node10 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],7][btl_openib_component.c:2893:handle_wc] from node16 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],5][btl_openib_component.c:2893:handle_wc] from node14 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],10][btl_openib_component.c:2893:handle_wc] from node21 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],14][btl_openib_component.c:2893:handle_wc] from node19 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
[[5963,1],2][btl_openib_component.c:2893:handle_wc] from node10 to: node11 error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR status number 13 for wr_id 37779456 opcode 0 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics "receiver not ready" retry count on a per-peer
connection between two MPI processes has been exceeded. In general,
this should not happen because Open MPI uses flow control on per-peer
connections to ensure that receivers are always ready when data is
sent.

This error usually means one of two things:

1. There is something awry within the network fabric itself.
2. A bug in Open MPI has caused flow control to malfunction.

#1 is usually more likely.  You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Below is some information about the host that raised the error and the
peer to which it was connected:

  Local host: node24
  Local device: mthca0
  Peer host: node11

You may need to consult with your system administrator to get this
problem fixed.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 4 with PID 18133 on
node node13.cluster exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
forrtl: error (78): process killed (SIGTERM)
[node11:21331] 13 more processes have sent help message help-mpi-btl-openib.txt / pp rnr retry exceeded
[node11:21331] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
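
If case 1 above (something awry in the fabric) is suspected, a first check
on the hosts named in the error (node24 and node11, device mthca0) is to
confirm that the HCA ports are Active and to sweep the fabric for error
counters. The commands below are standard OFED diagnostics and only a
sketch; adapt host and device names as needed:

  # on node24 and node11: port state, rate and firmware of the HCA
  ibstat mthca0
  ibv_devinfo -d mthca0

  # from any node with the diagnostics installed: scan the fabric for
  # ports whose error counters exceed their thresholds
  ibcheckerrors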