
Subject: Re: [OMPI users] users Digest, Vol 1275, Issue 2; btl_openib_connect_oob.c:459:qp_create_one:errorcreating qp
From: Jose Gracia (gracia_at_[hidden])
Date: 2009-07-08 02:41:58


>
> Today's Topics:
>
> 1. Re: btl_openib_connect_oob.c:459:qp_create_one: errorcreating
> qp (Jeff Squyres)
> 2. Re: [OMPI users]
> btl_openib_connect_oob.c:459:qp_create_one:errorcreating qp
> (Jeff Squyres)
>
>

------------------------------

Message: 2
Date: Wed, 1 Jul 2009 08:56:50 -0400
From: Jeff Squyres <jsquyres_at_[hidden]>
Subject: Re: [OMPI users] btl_openib_connect_oob.c:459:qp_create_one:errorcreating qp
To: "Open MPI Users" <users_at_[hidden]>
Message-ID: <DDC91A3F-AED7-4244-8FA4-A00D4A3454FD_at_[hidden]>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Jul 1, 2009, at 8:01 AM, Jeff Squyres (jsquyres) wrote:

Thanks for the reply,

>> > >[n100501][[40339,1],6][../../../../../ompi/mca/btl/openib/connect/
>> > > btl_openib_connect_oob.c:459:qp_create_one]
>> > > error creating qp errno says Cannot allocate memory

> What kind of communication pattern does the application use? Does it
> use all-to-all?
I narrowed the location of the error down a bit. The application
calculates gravitational interaction between particles based on a tree
algorithm. The error is thrown in a loop over all levels, i.e. over the
number of tasks. Inside the loop each task potentially communicates via
a single call to MPI_Sendrecv, something like:

for (level = 0; level < nTasks; level++) {
    sendTask = ThisTask;
    recvTask = ThisTask ^ level;

    if (need_to_exchange_data()) {
        MPI_Sendrecv(buf1, count1, MPI_BYTE, recvTask, tag,
                     buf2, count2, MPI_BYTE, sendTask, tag,
                     MPI_COMM_WORLD, &status);
    }
}

Message sizes can be anything between 5 KB and a couple of MB.
Typically, the error appears around level >= 1030 (out of 2048).
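
For reference, a self-contained sketch of this exchange pattern is
below. The buffer sizes, the need_to_exchange_data() predicate,
starting at level 1 (skipping the self-exchange at level 0), and using
the XOR partner as both destination and source are placeholder
assumptions for illustration, not the application's actual code:

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder predicate: in the real application this decides whether
 * two tasks actually have particle data to exchange at this level. */
static int need_to_exchange_data(int recvTask, int nTasks)
{
    return recvTask < nTasks;   /* partner must be a valid rank */
}

int main(int argc, char **argv)
{
    int ThisTask, nTasks, level, tag = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &ThisTask);
    MPI_Comm_size(MPI_COMM_WORLD, &nTasks);

    /* Placeholder buffers; real message sizes range from ~5 KB to a few MB. */
    int count1 = 8192, count2 = 8192;
    char *buf1 = malloc(count1);
    char *buf2 = malloc(count2);
    memset(buf1, ThisTask & 0xff, count1);

    for (level = 1; level < nTasks; level++) {
        int recvTask = ThisTask ^ level;   /* pairwise XOR partner */

        if (need_to_exchange_data(recvTask, nTasks)) {
            /* Both send and receive use the same partner rank; each new
             * partner triggers a lazy openib connection (and its QPs). */
            MPI_Sendrecv(buf1, count1, MPI_BYTE, recvTask, tag,
                         buf2, count2, MPI_BYTE, recvTask, tag,
                         MPI_COMM_WORLD, &status);
        }
    }

    free(buf1);
    free(buf2);
    MPI_Finalize();
    return 0;
}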

>Open MPI makes OpenFabrics verbs (i.e., IB in your
>case) connections lazily, so you might not see these problems until
>OMPI is trying to make connections -- well past MPI_INIT -- and then
>failing when it runs out of HCA QP resources.

>> > > The cluster uses InfiniBand connections. I am aware only of the
>> > > following parameter changes (systemwide):
>> > > btl_openib_ib_min_rnr_timer = 25
>> > > btl_openib_ib_timeout = 20
>> > >
>> > > $> ulimit -l
>> > > unlimited
>> > >
>> > >
>> > > I attached:
>> > > 1) $> ompi_info --all > ompi_info.log
>> > > 2) stderr from the PBS: stderr.log
> >

>Open MPI v1.3 is actually quite flexible in how it creates and uses
>OpenFabrics QPs. By default, it likely creates 4 QPs (of varying
>buffer sizes) between each pair of MPI processes that connect to each
>other. You can tune this down to be 3, 2, or even 1 QP to reduce the
>number of QPs that are being opened (and therefore, presumably, not
>exhaust HCA QP resources).
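
As a concrete (illustrative, not recommended) example of tuning the QP
count down: keeping only the first and last entries of the default
value quoted further below would give two QPs per connection instead of
four, e.g.

$ mpirun -np 2048 -mca btl_openib_receive_queues \
    P,128,256,192,128:S,65536,256,128,32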

>Alternatively / additionally, you may wish to enable XRC if you have
>recent enough Mellanox HCAs. This should also save on QP resources.

>You can set both of these things via the mca_btl_openib_receive_queues
>MCA parameter. It takes a colon-delimited list of receive queues
>(which directly imply how many QP's to create). There are 3 kinds of
>entries: per-peer QPs, shared receive queues, and XRC receive queues.
>Here's a description of each:
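
(The per-kind description itself is trimmed from the quote above. For
orientation only: each colon-separated entry starts with a letter
naming its kind, followed by comma-separated sizing fields; my reading
of the default value shown below is roughly buffer size, buffer count,
low watermark, and a credit/window value, so treat the field meanings
as approximate.)

    P,128,256,192,128     per-peer QP entry
    S,65536,256,128,32    shared receive queue (SRQ) entry
    X,65536,256,128,32    XRC entry (as far as I know, XRC entries are
                          not mixed with P or S entries)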

I played around with the number of queues, number of buffers, and buffer
size, but nothing really helped. The default is:

$ ompi_info --param btl openib --parsable | grep receive_queues

mca:btl:openib:param:btl_openib_receive_queues:value:
P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32

I thought that running with
$ mpirun -np 2048 -mca mca_btl_openib_receive_queues
   P,128,3000:S,2048,3000:S,12288,3000:S,65536,3000

would do the trick, but it doesn't.
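
As an aside, the ompi_info output above reports the parameter name as
btl_openib_receive_queues, while the mpirun line carries an extra mca_
prefix; if that prefix was unintentional, the corresponding command
would presumably be

$ mpirun -np 2048 -mca btl_openib_receive_queues \
    P,128,3000:S,2048,3000:S,12288,3000:S,65536,3000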

Any other ideas?

> Hope this helps!
Yes, at least I understand the problem now. ;-)

Cheers,
Jose