Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Openib with > 32 cores per node
From: Robert Horton (r.horton_at_[hidden])
Date: 2011-05-20 10:19:55


Hi,

Thanks for getting back to me (and thanks to Jeff for the explanation
too).

On Thu, 2011-05-19 at 09:59 -0600, Samuel K. Gutierrez wrote:
> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote:
>
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >>
> >> Try the following QP parameters that only use shared receive queues.
> >>
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Any thoughts?
>
> How much memory does each node have? Does this happen at startup?

Each node has 64GB of RAM. The error happens fairly soon after the job
starts.

>
> Try adding:
>
> -mca btl_openib_cpc_include rdmacm

Ah - that looks much better. I can now run hpcc over all 15x48 cores. I
need to look at the performance in a bit more detail but it seems to be
"reasonable" at least :)
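
For reference, the two settings suggested in this thread can be combined on
one mpirun command line. The following is a minimal sketch, not the exact
command used here: the process count is derived from the 15x48 layout
mentioned above, and the ./hpcc path and the dry-run echo are assumptions
for illustration.

```shell
#!/bin/sh
# Sketch: build an mpirun command line that uses both MCA settings
# discussed in this thread. NP and APP are illustrative assumptions.

NP=$((15 * 48))   # 15 nodes x 48 cores per node
APP=./hpcc        # hypothetical path to the benchmark binary

# Shared receive queues only (the "S," entries), as suggested earlier.
RECEIVE_QUEUES="S,12288,128,64,32:S,65536,128,64,32"

# Select the RDMA CM connection method instead of the default.
CPC=rdmacm

CMD="mpirun -np $NP -mca btl_openib_receive_queues $RECEIVE_QUEUES -mca btl_openib_cpc_include $CPC $APP"

# Dry run: print the command so it can be inspected before launching.
echo "$CMD"
```

Printing the command first makes it easy to check the quoting of the
receive-queue string; once it looks right, the printed line can be run
directly or placed in a job script.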

One thing is puzzling me - when I compile Open MPI myself it seems to
lack rdmacm support - however the one compiled by the OFED install
process does include it. I'm compiling with:

'--prefix=/share/apps/openmpi/1.4.3/gcc' '--with-sge' '--with-openib' '--enable-openib-rdmacm'

Any idea what might be going on there?

> I'm not sure if your version of OFED supports this feature, but maybe using XRC may help. I **think** other tweaks are needed to get this going, but I'm not familiar with the details.

I'm using QLogic (QLE7340) rather than Mellanox cards, so that
doesn't seem to be an option for me (?). It would be interesting to know
how much difference it would make though...

Thanks again for your help and have a good weekend.

Rob

-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.horton_at_[hidden]  -  +44 (0) 20 7882 7345