
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Job fails after hours of running on a specific node
From: Sangamesh B (forum.san_at_[hidden])
Date: 2009-12-07 04:14:13


Hello Pasha,

          Since the error was not occurring frequently, I did not look into the
issue for a long time. But now I have started to diagnose it:

Initially I tested with ibv_rc_pingpong (from the master node to all compute
nodes, using a for loop). It works for each of the nodes.
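
For reference, the loop was roughly like the sketch below (the node names
node-0-1 through node-0-24 and the use of password-less ssh are assumptions
for illustration, not the exact commands used):

  # Start the ibv_rc_pingpong server on each compute node over ssh, then
  # connect to it from the master node as the client.
  for n in node-0-{1..24}; do
      ssh "$n" ibv_rc_pingpong &   # server side, listens on the default port
      sleep 2                      # give the server time to start
      ibv_rc_pingpong "$n"         # client side, run from the master node
      wait                         # reap the background ssh before the next node
  done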

The files generated by the command "ibdiagnet -v -r -o ." are attached
herewith. ibcheckerrors shows the following error message:

# ibcheckerrors
#warn: counter RcvSwRelayErrors = 408 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port
all: FAILED
#warn: counter RcvSwRelayErrors = 179 (threshold 100)
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port
7: FAILED
# Checked Switch: nodeguid 0x000b8cffff00551b with failure

## Summary: 25 nodes checked, 0 bad nodes found
## 48 ports checked, 1 ports have errors beyond threshold
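
If it helps, the follow-up checks I had in mind are sketched below (standard
OFED diagnostic commands; the LID 2, port 7, and switch GUID values come from
the ibcheckerrors output above; please correct me if there is a better
approach):

  # Show the topology around the flagged switch to see which host is cabled
  # to port 7 of the switch with nodeguid 0x000b8cffff00551b.
  ibnetdiscover | grep -A 30 000b8cffff00551b

  # Read the error counters on that switch port (LID 2, port 7), then reset
  # them and re-check after a job run to see whether the errors grow.
  perfquery 2 7
  perfquery -R 2 7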

Are these messages helpful in finding the issue with node-0-2? Could you
please help us diagnose further?

Thanks,
Sangamesh

On Mon, Sep 21, 2009 at 1:36 PM, Pavel Shamis (Pasha) <pashash_at_[hidden]> wrote:

> Sangamesh,
>
> The IB tunings that you added to your command line only delay the problem;
> they do not resolve it.
> node-0-2.local gets the asynchronous event "IBV_EVENT_PORT_ERROR" and, as a
> result, the processes fail to deliver packets to some remote hosts, so you
> see a bunch of IB errors.
>
> The IBV_EVENT_PORT_ERROR error means that the IB port went from the ACTIVE
> state to the DOWN state.
> In other words, you have a problem with your IB network that is causing all
> these network errors.
> The root cause of such an issue may be a bad cable or a problematic port on
> the switch.
>
> For IB network debugging I propose you use ibdiagnet, an open-source IB
> network diagnostic tool:
> http://linux.die.net/man/1/ibdiagnet
> The tool is part of the OFED distribution.
>
> Pasha.
>
>
> Sangamesh B wrote:
>
>> Dear all,
>> The CPMD application, which is compiled with OpenMPI-1.3 (Intel 10.1
>> compilers) on CentOS-4.5, fails only when a specific node, i.e. node-0-2, is
>> involved. It runs well on the other nodes.
>> Initially the job failed after 5-10 minutes (on node-0-2 plus some other
>> nodes). After googling the error, I added the options "-mca
>> btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20" to the mpirun
>> command in the SGE script:
>> $ cat cpmdrun.sh
>> #!/bin/bash
>> #$ -N cpmd-acw
>> #$ -S /bin/bash
>> #$ -cwd
>> #$ -e err.$JOB_ID.$JOB_NAME
>> #$ -o out.$JOB_ID.$JOB_NAME
>> #$ -pe ib 32
>> unset SGE_ROOT
>> PP_LIBRARY=/home/user1/cpmdrun/wac/prod/PP
>> CPMD=/opt/apps/cpmd/3.11/ompi/SOURCE/cpmd311-ompi-mkl.x
>> MPIRUN=/opt/mpi/openmpi/1.3/intel/bin/mpirun
>> $MPIRUN -np $NSLOTS -hostfile $TMPDIR/machines \
>>   -mca btl_openib_ib_min_rnr_timer 25 -mca btl_openib_ib_timeout 20 \
>>   $CPMD wac_md26.in $PP_LIBRARY > wac_md26.out
>>
>> After adding these options, the job executed for 24+ hours and then failed
>> with the same error as earlier. The error is:
>> $ cat err.6186.cpmd-acw
>> --------------------------------------------------------------------------
>> The OpenFabrics stack has reported a network error event. Open MPI
>> will try to continue, but your job may end up failing.
>> Local host: node-0-2.local
>> MPI process PID: 11840
>> Error number: 10 (IBV_EVENT_PORT_ERR)
>> This error may indicate connectivity problems within the fabric;
>> please contact your system administrator.
>> --------------------------------------------------------------------------
>> [node-0-2.local:11836] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] Set MCA parameter "orte_base_help_aggregate" to 0
>> to see all help / error messages
>> [node-0-2.local:11836] 1 more process has sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 7 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 1 more process has sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 15 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [node-0-2.local:11836] 16 more processes have sent help message
>> help-mpi-btl-openib.txt / of error event
>> [[718,1],20][btl_openib_component.c:2902:handle_wc] from node-0-22.local
>> to: node-0-2
>> --------------------------------------------------------------------------
>> The InfiniBand retry count between two MPI processes has been
>> exceeded. "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>> The total number of times that the sender wishes the receiver to
>> retry timeout, packet sequence, etc. errors before posting a
>> completion error.
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself. You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>> * btl_openib_ib_retry_count - The number of times the sender will
>> attempt to retry (defaulted to 7, the maximum value).
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>> to 10). The actual timeout value used is calculated as:
>> 4.096 microseconds * (2^btl_openib_ib_timeout)
>> See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>> Below is some information about the host that raised the error and the
>> peer to which it was connected:
>> Local host: node-0-22.local
>> Local device: mthca0
>> Peer host: node-0-2
>> You may need to consult with your system administrator to get this
>> problem fixed.
>> --------------------------------------------------------------------------
>> error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for
>> wr_id 66384128 opcode 128 qp_idx 3
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 20 with PID 10425 on
>> node ibc22 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>> rm: cannot remove `/tmp/6186.1.iblong.q/rsh': No such file or directory
>> The openibd service is running fine:
>> $ service openibd status
>> HCA driver loaded
>> Configured devices:
>> ib0
>> Currently active devices:
>> ib0
>> The following OFED modules are loaded:
>> rdma_ucm
>> ib_sdp
>> rdma_cm
>> ib_addr
>> ib_ipoib
>> mlx4_core
>> mlx4_ib
>> ib_mthca
>> ib_uverbs
>> ib_umad
>> ib_ucm
>> ib_sa
>> ib_cm
>> ib_mad
>> ib_core
>> But still the job fails after hours of running, and only when that
>> particular node is involved. What is wrong with node-0-2? How can it be
>> resolved?
>> Thanks,
>> Sangamesh
>> ------------------------------------------------------------------------
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>