
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] OFED question
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2011-01-27 20:09:46


Pasha -

Is there a way to tell which of the two happened, or to check the number of QPs available per node? The app likely does talk to a large number of peers from each process, and the nodes are fairly "fat": quad-socket, quad-core, running 16 MPI ranks per node.

Brian

On Jan 27, 2011, at 6:17 PM, Shamis, Pavel wrote:

> Unfortunately, verbose error reports are not so friendly... anyway, I can think of two issues:
>
> 1. You are trying to open too many QPs. By default, IB devices support a fairly large number of QPs, and it is quite hard to push them into this corner. But if your job is really huge, it may be the case, or, for example, if you share the compute nodes with other processes that create a lot of QPs. You can see the maximum number of supported QPs in ibv_devinfo.
>
> 2. The limit on registered memory is too low; as a result, the driver fails to allocate and register memory for the QP. This scenario is the most common. It just happened to me recently: the system folks pushed some crap into limits.conf.
>
> Regards,
>
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
>
> On Jan 27, 2011, at 5:56 PM, Barrett, Brian W wrote:
>
>> All -
>>
>> On one of our clusters, we're seeing the following from one of our applications, I believe using Open MPI 1.4.3:
>>
>> [xxx:27545] *** An error occurred in MPI_Scatterv
>> [xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
>> [xxx:27545] *** MPI_ERR_OTHER: known error not in list
>> [xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> [xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error creating qp errno says Resource temporarily unavailable
>> --------------------------------------------------------------------------
>> mpirun has exited due to process rank 0 with PID 27545 on
>> node rs1891 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpirun (as reported here).
>> --------------------------------------------------------------------------
>>
>>
>> The problem goes away if we modify the eager protocol message sizes so that only two QPs are necessary instead of the default four. Is there a way to bump up the number of QPs that can be created on a node, assuming the issue is just running out of available QPs? If not, any other thoughts on working around the problem?
>>
>> Thanks,
>>
>> Brian
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories