
From: Arif Ali (aali_at_[hidden])
Date: 2007-01-18 15:53:54


Hi List,

1. We have
HW
* 2xBladecenter H
* 2xCisco Infiniband Switch Modules
* 1xCisco Infiniband Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA

SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of the
validation process for the customer.

I first tried the OpenMPI that comes with OFED 1.1, which gave spurious
"Not Enough Memory" error messages. After looking through the FAQs (with
the help of Cisco) I was able to find the problems and fixes: I added
unlimited soft and hard limits for memlock and turned RDMA off by using
"--mca btl_openib_flags 1". This still did not work, and I still got the
memory problems.
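
For reference, the memlock limits were set roughly as follows on every
node (the exact limits.conf entries are from memory, so treat them as a
guide), and RDMA was disabled on the mpirun command line; the process
count, hostfile, and IMB binary path below are just placeholders:

# /etc/security/limits.conf -- unlimited soft and hard memlock
*   soft   memlock   unlimited
*   hard   memlock   unlimited

# run with RDMA disabled (send/recv only)
mpirun -np 64 -hostfile hosts --mca btl_openib_flags 1 ./IMB-MPI1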

I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed miserably.

I then tried the released version of OpenMPI-1.2b3, which got me further
than before. Now the benchmark goes through all the tests until
Allgatherv finishes, and then it appears to hang waiting to start
Alltoall; I waited about 12 hours to see if it would continue. I have
since managed to run Alltoall, and the rest of the benchmark, separately.

I have tried a few tunable parameters that were suggested by Cisco,
which improved the results, but the benchmark still hung. The parameters
that I have used to try to diagnose the problem are below. I used the
debug/verbose variables to see if I could get error messages while the
benchmark was running.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1
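
(For completeness: these can be set either per run on the mpirun command
line or collected in the per-user MCA parameter file; the file location
below is the standard one as far as I know:)

# on the command line, per run:
mpirun --mca mpi_leave_pinned 1 --mca btl_openib_flags 1 ...

# or in $HOME/.openmpi/mca-params.conf, one parameter per line:
btl_openib_flags = 1
mpi_leave_pinned = 1
mpool_base_use_mem_hooks = 1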

2. As a side note, I am having similar problems on another customer's
cluster, where the benchmark hangs, but at a different place each time.

HW specs
* 12x IBM 3455 machines, each with 2x dual-core CPUs and InfiniPath/PathScale HCAs
* 1x Voltaire Switch
SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2

A) In this case, I have also had warnings, which I was able to turn off
by setting btl_openib_warn_no_hca_params_found to 0, but I wasn't sure
if this was the right thing to do.
--------------------------------------------------------------------------
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:

    Hostname: node004
    HCA vendor ID: 0x1fc1
    HCA vendor part ID: 13

Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_hca_params_found to 0.
--------------------------------------------------------------------------
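
(The alternative would presumably be to add an entry for this HCA to the
file named by btl_openib_hca_param_files (the mca-btl-openib-hca-params.ini
under the OpenMPI install's share/openmpi directory, if I have that
right), rather than silencing the warning. A rough sketch -- the section
name is just a label I made up, and only the two IDs are taken from the
warning above:)

[PathScale InfiniPath]
vendor_id = 0x1fc1
vendor_part_id = 13
# plus whatever tuning keys (e.g. use_eager_rdma, mtu) suit this HCA
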
B) The runs on this machine would also hang, so I tried removing all the
unnecessary daemons to see if that would improve things. In about 75% of
cases it would then run longer, until Alltoall, but would hang there;
otherwise it would hang at various other places. At times I also got
errors regarding the retry count and timeout, both of which I increased,
to 14 and 20 respectively. I tried steps similar to those on the PPC
cluster to fix the freezing, but had no luck. Below are all the
parameters that I have used in this case:

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_warn_no_hca_params_found=0
btl_openib_flags=1
#mpi_preconnect_all=1
mpi_leave_pinned=1
btl_openib_ib_retry_count=14
btl_openib_ib_timeout=20
mpool_base_use_mem_hooks=1
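
(To isolate where it hangs, individual tests can be run one at a time,
roughly like this -- the process count, hostfile, and path to the IMB
binary are placeholders for whatever the real job uses:)

mpirun -np 48 -hostfile hosts \
    --mca btl_openib_ib_retry_count 14 --mca btl_openib_ib_timeout 20 \
    --mca mpi_leave_pinned 1 \
    ./IMB-MPI1 Alltoall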

I hope I have included all the relevant info; if anything else is
required, I should be able to provide it without a problem.

Thanks a lot in advance for your help.

-- 
regards,
Arif Ali
Software Engineer
OCF plc
Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  aali_at_[hidden]
Web:    http://www.ocf.co.uk
Skype:  arif_ali80
MSN:    aali_at_[hidden]