Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB error messages: reporting the default or telling you what's happening?
From: Shamis, Pavel (shamisp_at_[hidden])
Date: 2011-09-14 10:16:17


I would recommend that you verify that you don't have any bad cables or ports in your IB network. You can use one of the OFA network check tools, for example ibdiagnet.
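
To put the two values discussed in the quoted message below into perspective, the formula from the error text, 4.096 microseconds * (2^btl_openib_ib_timeout), can be evaluated directly. The following is only a minimal Python sketch of that arithmetic; the function name ack_timeout_us is made up for illustration and is not part of Open MPI, and the 0 to 31 range check is taken from the ompi_info output quoted below.

```python
def ack_timeout_us(exponent: int) -> float:
    """Return the local ACK timeout in microseconds for a given
    btl_openib_ib_timeout value, using the formula from the error text:
    4.096 microseconds * (2^btl_openib_ib_timeout).

    Hypothetical helper for illustration only, not an Open MPI API.
    """
    # Range taken from the ompi_info help text (must be >= 0 and <= 31).
    if not 0 <= exponent <= 31:
        raise ValueError("btl_openib_ib_timeout must be >= 0 and <= 31")
    return 4.096 * (2 ** exponent)


# Compare the value reported in the error output (10) with the
# value set in the global config file (20).
for value in (10, 20):
    timeout_ms = ack_timeout_us(value) / 1000.0
    print(f"btl_openib_ib_timeout={value}: {timeout_ms:.3f} ms")
```

Under this formula a setting of 10 gives a timeout of roughly 4.2 milliseconds, while 20 gives roughly 4.3 seconds, so which of the two values the job really ran with makes a difference of about a factor of 1000.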

Pavel (Pasha) Shamis

---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
On Sep 13, 2011, at 6:33 PM, <Kevin.Buckley_at_[hidden]> <Kevin.Buckley_at_[hidden]> wrote:
>> So the error output is not showing what you two think should be
>> the default value, 20, but then nor is it showing what I think I
>> have set it to globally, again, 20.
>> 
>> But anyroad, what I wanted from this is confirmation that the output
>> is telling me the value that the job was running with, 10.
>> 
>> Now to find out why it appears as 10, because,
>> 
>> a) that's not seemingly the default
>> b) it's not being set to 10 globally by me as the admin
>> c) it wasn't being set to anything by the user's submission script
>> 
>> I'll have a dig around and get back to the thread,
> 
> So, getting back,
> 
> there have been two runs of jobs that invoked the mpirun using these
> OpenMPI parameter setting flags (basically, these mimic what I have
> in the global config file)
> 
> -mca btl_openib_ib_timeout 20 -mca btl_openib_ib_min_rnr_timer 25
> 
> when both of the jobs failed, the error output was
> 
> -----8<----------8<----------8<----------8<----------8<-----
> 
> [[31705,1],77][btl_openib_component.c:2951:handle_wc] from
> scifachpc-c06n01 to: scifachpc-c06n03 error polling LP CQ with status
> RETRY EXCEEDED ERROR status number 12 for wr_id 294230912 opcode 1  vendor
> error 129 qp_idx 1
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
> 
>    The total number of times that the sender wishes the receiver to
>    retry timeout, packet sequence, etc. errors before posting a
>    completion error.
> 
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
> 
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
> 
> * btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 10).  The actual timeout value used is calculated as:
> 
>     4.096 microseconds * (2^btl_openib_ib_timeout)
> 
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> 
> Below is some information about the host that raised the error and the
> peer to which it was connected:
> 
>  Local host:   somename
>  Local device: mlx4_0
>  Peer host:    someothername
> 
> You may need to consult with your system administrator to get this
> problem fixed.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 77 with PID 14705 on
> node somename exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> --------------------------------------------------------------------------
> 
> 
> -----8<----------8<----------8<----------8<----------8<-----
> 
> 
> Note that the error output is still showing that mysterious "10"
> in there for the btl_openib_ib_timeout value.
> 
> When I run ompi_info from a login shell on the node I see
> 
> -----8<----------8<----------8<----------8<----------8<-----
> 
> ompi_info --param btl openib | grep ib_timeout
>                 MCA btl: parameter "btl_openib_ib_timeout" (current
> value: "20", data source: file
> [/usr/lib64/openmpi/1.4-gcc/etc/openmpi-mca-params.conf])
>                          InfiniBand transmit timeout, plugged into
> formula: 4.096 microseconds *
> (2^btl_openib_ib_timeout)(must be >= 0 and <=
> 31)
> 
> -----8<----------8<----------8<----------8<----------8<-----
> 
> For info,
> 
> the underlying IB kit is Mellanox Connect-X HCA running on a stock
> RHEL5.6 OS with the following OpenMPI
> 
> openmpi-1.4-4.el5
> 
> indeed, everything is pretty much out of the box here.
> 
> I have noticed that the same node is appearing in the error output
> each time, so I'll try taking that one out of the test PE that the
> jobs are executing in and seeing if I can tie it to hardware.
> 
> 
> -- 
> Kevin M. Buckley                                  Room:  CO327
> School of Engineering and                         Phone: +64 4 463 5971
> Computer Science
> Victoria University of Wellington
> New Zealand
> 