Open MPI User's Mailing List Archives

From: Arif Ali (aali_at_[hidden])
Date: 2007-01-18 15:53:54


Hi List,

1. We have
HW
* 2xBladecenter H
* 2xCisco Infiniband Switch Modules
* 1xCisco Infiniband Switch
* 16x PPC64 JS21 blades, each with 4 cores and a Cisco HCA

SW
* SLES 10
* OFED 1.1 w. OpenMPI 1.1.1

I am running the Intel MPI Benchmark (IMB) on the cluster as part of
the validation process for the customer.

I first tried the OpenMPI that comes with OFED 1.1, which gave spurious
"Not Enough Memory" error messages. After looking through the FAQs (with
the help of Cisco), I found the suggested fixes: I set unlimited soft and
hard limits for memlock, and turned RDMA off by using
"--mca btl_openib_flags 1". This did not work, and I still got the
memory errors.
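
For reference, a minimal sketch of how the memlock limits could be checked from inside a launched process. The limits set in the system configuration do not always reach processes that are started remotely, so it may be worth running something like this under mpirun on every node. The helper names (memlock_limits, describe) are illustrative only and are not part of Open MPI or OFED.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) locked-memory limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit):
    """Render a limit as 'unlimited' or as a byte count."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


if __name__ == "__main__":
    soft, hard = memlock_limits()
    # Both values should read 'unlimited' if the memlock change took effect
    # for this process.
    print(f"memlock soft={describe(soft)} hard={describe(hard)}")
```

If either value is a finite byte count on any node, the limits are probably not being applied to the MPI processes there.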

I tried the nightly snapshot of OpenMPI-1.2b4r13137, which failed miserably.

I then tried the released version, OpenMPI-1.2b3, which got me further
than before. Now the benchmark goes through all the tests until
Allgatherv finishes, and then appears to wait forever before starting
AlltoAll; I waited about 12 hours to see if it would continue. I have
since managed to run AlltoAll and the rest of the benchmark separately.
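
Running the tests one at a time can be scripted so that a hang is reported rather than waited on. The following is a rough sketch under several assumptions: the benchmark binary is called ./IMB-MPI1 and accepts a benchmark name as its argument, the launcher is mpirun, and 64 processes (16 blades x 4 cores) are wanted. The function names and the benchmark list are illustrative, not taken from the actual runs.

```python
import subprocess

# Illustrative subset of IMB-MPI1 benchmark names; adjust as needed.
BENCHMARKS = ["Allgatherv", "Alltoall", "Bcast", "Barrier"]


def imb_command(benchmark, nprocs=64, launcher="mpirun", binary="./IMB-MPI1"):
    """Build the command line for a single benchmark (paths are assumptions)."""
    return [launcher, "-np", str(nprocs), binary, benchmark]


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return 'ok', 'failed', or 'hung' if it exceeds timeout_s."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the launcher process when the timeout expires.
        return "hung"
    return "ok" if result.returncode == 0 else "failed"
```

Something like `run_with_timeout(imb_command("Alltoall"), 1800)` would then report "hung" after 30 minutes instead of blocking. Only the launcher itself is killed on timeout, so leftover ranks on the compute nodes may still need cleaning up by hand.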

I have tried a few tunable parameters that were suggested by Cisco.
These improved the results, but the benchmark still hung. The parameters
that I have used to try and diagnose the problem are below. I enabled the
debug/verbose variables (now commented out) to see if I could get any
error messages while the benchmark was running.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1
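
The same settings can also be passed per run as "--mca name value" pairs on the mpirun command line, which makes it easier to toggle one parameter at a time. A small sketch of that conversion follows; the function name mca_args is made up for illustration, and the parsing assumes simple name=value lines with "#" comments, as in the list above.

```python
def mca_args(params_text):
    """Turn 'name=value' lines into ['--mca', name, value, ...].

    Blank lines and lines starting with '#' are skipped.
    """
    args = []
    for raw in params_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        args += ["--mca", name.strip(), value.strip()]
    return args


PARAMS = """\
#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_flags=1
mpi_leave_pinned=1
mpool_base_use_mem_hooks=1
"""

if __name__ == "__main__":
    # Prints the active parameters as command-line arguments.
    print(" ".join(mca_args(PARAMS)))
```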

2. On another side note, I am having similar problems on another
customer's cluster, where the benchmark hangs but at a different place
each time.

HW specs
* 12x IBM 3455 machines, each 2x dual-core, with InfiniPath/PathScale HCAs
* 1x Voltaire Switch
SW
* master: RHEL 4 AS U3
* compute: RHEL 4 WS U3
* OFED 1.1.1 w. OpenMPI-1.1.2

a) In this case I also got the warning shown below, which I was able to
turn off by setting btl_openib_warn_no_hca_params_found to 0, but I
wasn't sure if this was the right thing to do.
--------------------------------------------------------------------------
WARNING: No HCA parameters were found for the HCA that Open MPI
detected:

    Hostname: node004
    HCA vendor ID: 0x1fc1
    HCA vendor part ID: 13

Default HCA parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_hca_param_files MCA parameter to set values for your HCA.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_hca_params_found to 0.
--------------------------------------------------------------------------
b) The runs on this cluster would also hang, so I stopped all the
unnecessary daemons to see if that would improve things. After that,
about 75% of the time it would run further, up to AlltoAll, but then
hang there; otherwise it would hang at various other places. At times I
also got errors regarding retry count and timeout, so I increased these
to 14 and 20 respectively. I tried similar steps to the PPC cluster to
fix the freezing, but had no luck. Below are all the parameters that I
have used in this case.

#orte_debug=1
#btl_openib_verbose=1
#mca_verbose=1
#btl_base_debug=1
btl_openib_warn_no_hca_params_found=0
btl_openib_flags=1
#mpi_preconnect_all=1
mpi_leave_pinned=1
btl_openib_ib_retry_count=14
btl_openib_ib_timeout=20
mpool_base_use_mem_hooks=1
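
For reference, a short worked calculation of what these two values mean, as far as I understand them. It assumes the usual InfiniBand convention that the local ACK timeout is 4.096 microseconds * 2^btl_openib_ib_timeout, and that the retry count is a 3-bit field (0 to 7), so a value of 14 may be clamped or rejected. Both points are assumptions worth checking against the Open MPI help text for this version; the function names are illustrative.

```python
# Assumption: the retry count is a 3-bit field, so 7 is the largest valid value.
MAX_RETRY_COUNT = 7


def local_ack_timeout_seconds(ib_timeout):
    """Local ACK timeout in seconds: 4.096 microseconds * 2**ib_timeout."""
    return 4.096e-6 * (2 ** ib_timeout)


def retry_count_in_range(retry_count):
    """True if retry_count fits the assumed 0..7 range."""
    return 0 <= retry_count <= MAX_RETRY_COUNT


if __name__ == "__main__":
    print(f"timeout for 20: {local_ack_timeout_seconds(20):.3f} s")
    print(f"retry count 14 in range: {retry_count_in_range(14)}")
```

With btl_openib_ib_timeout=20 this works out to roughly 4.3 seconds per attempt.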

I hope I have included all the relevant info. If anything else is
required, I should be able to provide it without a problem.

Thanks a lot in advance for your help.

-- 
regards,
Arif Ali
Software Engineer
OCF plc
Mobile: +44 (0)7970 148 122
Office: +44 (0)114 257 2200
Fax:    +44 (0)114 257 0022
Email:  aali_at_[hidden]
Web:    http://www.ocf.co.uk
Skype:  arif_ali80
MSN:    aali_at_[hidden]