Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Openib with > 32 cores per node
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-05-20 10:49:33


If you're using QLogic, you might want to try the native PSM Open MPI support rather than the verbs support. QLogic cards only "sorta" support verbs in order to say that they're OFED-compliant; their native PSM interface performs better than verbs for MPI.

Assuming you built OMPI with PSM support:

    mpirun --mca pml cm --mca mtl psm ....

(although probably just the pml/cm setting is sufficient -- the mtl/psm option will probably happen automatically)

See the OMPI README file for some more details about MTLs, PMLs, etc. (look for "psm"/i in the file)
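
As a rough sketch of the advice above (not something tested in this thread): the snippet below asks ompi_info whether the PSM MTL is present in the build and picks the MCA flags accordingly. If PSM is missing, it falls back to the openib settings that worked for Rob in the quoted thread below. The grep pattern assumes ompi_info lists the component on a line like "MCA mtl: psm"; the process count and ./my_mpi_app are placeholders.

```shell
#!/bin/sh
# Sketch: choose MCA flags based on whether this Open MPI build has the PSM MTL.
# Assumption: ompi_info prints a line containing "mtl: psm" when PSM is built in.

if command -v ompi_info >/dev/null 2>&1 && ompi_info | grep -qi 'mtl: *psm'; then
    # Native PSM path: the cm PML drives the psm MTL.
    MCA_ARGS="--mca pml cm --mca mtl psm"
else
    # Verbs fallback: SRQ-only receive queues plus the rdmacm connection
    # manager, i.e. the two settings reported to work in the quoted thread.
    MCA_ARGS="--mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 --mca btl_openib_cpc_include rdmacm"
    echo "PSM MTL not found in this build; using openib (verbs) settings" >&2
fi

# Print the command instead of running it; -np 4 and ./my_mpi_app are placeholders.
echo mpirun $MCA_ARGS -np 4 ./my_mpi_app
```

Drop the final echo to actually launch the job once the printed command looks right.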

On May 20, 2011, at 10:19 AM, Robert Horton wrote:

> Hi,
>
> Thanks for getting back to me (and thanks to Jeff for the explanation
> too).
>
> On Thu, 2011-05-19 at 09:59 -0600, Samuel K. Gutierrez wrote:
>> Hi,
>>
>> On May 19, 2011, at 9:37 AM, Robert Horton wrote:
>>
>>> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
>>>> Hi,
>>>>
>>>> Try the following QP parameters that only use shared receive queues.
>>>>
>>>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
>>>>
>>>
>>> Thanks for that. If I run the job over 2 x 48 cores it now works and the
>>> performance seems reasonable (I need to do some more tuning) but when I
>>> go up to 4 x 48 cores I'm getting the same problem:
>>>
>>> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
>>> [compute-1-7.local:18106] *** An error occurred in MPI_Isend
>>> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
>>> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
>>> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>>>
>>> Any thoughts?
>>
>> How much memory does each node have? Does this happen at startup?
>
> Each node has 64GB of RAM. The error happens fairly soon after the job
> starts.
>
>>
>> Try adding:
>>
>> -mca btl_openib_cpc_include rdmacm
>
> Ah - that looks much better. I can now run hpcc over all 15x48 cores. I
> need to look at the performance in a bit more detail but it seems to be
> "reasonable" at least :)
>
> One thing is puzzling me - when I compile OpenMPI myself it seems to
> lack rdmacm support - however the one compiled by the OFED install
> process does include it. I'm compiling with:
>
> '--prefix=/share/apps/openmpi/1.4.3/gcc' '--with-sge' '--with-openib' '--enable-openib-rdmacm'
>
> Any idea what might be going on there?
>
>> I'm not sure if your version of OFED supports this feature, but maybe using XRC may help. I **think** other tweaks are needed to get this going, but I'm not familiar with the details.
>
> I'm using the QLogic (QLE7340) rather than Mellanox cards so that
> doesn't seem to be an option to me (?). It would be interesting to know
> how much difference it would make though...
>
> Thanks again for your help and have a good weekend.
>
> Rob
>
> --
> Robert Horton
> System Administrator (Research Support) - School of Mathematical Sciences
> Queen Mary, University of London
> r.horton_at_[hidden] - +44 (0) 20 7882 7345
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/