
Subject: Re: [OMPI users] Big job, InfiniBand, MPI_Alltoallv and ibv_create_qp failed
From: Paul Kapinos (kapinos_at_[hidden])
Date: 2013-08-01 03:30:32


Vanilla Linux OFED from the RPMs of Scientific Linux release 6.4 (Carbon)
(= RHEL 6.4).
No ofed_info available :-(
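
One way to identify the installed verbs stack without ofed_info is to query
the RPM database and the loaded driver directly - an untested sketch, and the
package names are only what I would expect on a stock RHEL 6.4 box:

$ rpm -qa | egrep -i 'rdma|libibverbs|libmlx4|opensm' | sort
$ modinfo mlx4_core | grep -E '^(version|srcversion)'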

On 07/31/13 16:59, Mike Dubman wrote:
> Hi,
> What OFED vendor and version do you use?
> Regards
> M
>
>
> On Tue, Jul 30, 2013 at 8:42 PM, Paul Kapinos <kapinos_at_[hidden]> wrote:
>
> Dear Open MPI experts,
>
> A user on our cluster has a problem running a rather big job:
> - the job using 3024 processes (12 per node, 252 nodes) runs fine;
> - the job using 4032 processes (12 per node, 336 nodes) produces the error
> attached below.
>
> Well, the FAQ entry at
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages is a
> well-known one; both recommended tweakables (the user limits and the
> registered memory size) are already at their maximum, yet some queue pairs
> still cannot be created.
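>
> For reference, these settings can be double-checked roughly as follows - an
> untested sketch, assuming the mlx4 driver exposes its parameters under /sys
> (log_num_mtt and log_mtts_per_seg are the registered-memory knobs the FAQ
> refers to):
>
> $ ulimit -l    # locked-memory limit of the MPI processes, should be "unlimited"
> $ cat /sys/module/mlx4_core/parameters/log_num_mtt
> $ cat /sys/module/mlx4_core/parameters/log_mtts_per_seg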
>
> Our blind guess is that the number of completion queues is exhausted.
>
> What happens when that value is raised from the default to its maximum?
> What is the largest Open MPI job size that has been seen in practice?
> What is the largest Open MPI job *using MPI_Alltoallv* that has been seen?
> Is there a way to manage the size/number of queue pairs? (XRC is not available.)
> Is there a way to tell MPI_Alltoallv to use fewer queue pairs, even if this
> could lead to a slow-down? (See the sketch below.)
>
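> One idea we have not tried for the last point (the parameter name is real,
> the values below are placeholders only): as far as we understand, the openib
> BTL opens one queue pair per entry of btl_openib_receive_queues for every
> peer connection, so shrinking that list to a single shared receive queue
> should cut the number of QPs per connection, at some performance cost:
>
> $ ompi_info --all | grep btl_openib_receive_queues    # show the current default
> $ mpirun --mca btl_openib_receive_queues S,65536,1024,1008,64 ...
>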
> There is a suspicious parameter in the mlx4_core module:
> $ modinfo mlx4_core | grep log_num_cq
> parm: log_num_cq:log maximum number of CQs per HCA (int)
>
> Is this the parameter to tweak?
> What are its default and maximum values?
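>
> An untested sketch of how we would inspect and raise it (the value 18 below
> is a placeholder, not a recommendation; the real maximum is driver and HCA
> specific):
>
> $ cat /sys/module/mlx4_core/parameters/log_num_cq    # current setting
> $ echo "options mlx4_core log_num_cq=18" >> /etc/modprobe.d/mlx4_core.conf
> # ...then unload and reload the mlx4 driver stack on each node.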
>
> Any help would be welcome...
>
> Best,
>
> Paul Kapinos
>
> P.S. There should be no connection problem between the nodes; a test job
> with one process on each node ran successfully just before the actual job
> was started, and the actual job also ran fine for a while - until it called
> MPI_Alltoallv.
>
>
>
>
>
>
> --------------------------------------------------------------------------
> A process failed to create a queue pair. This usually means either
> the device has run out of queue pairs (too many connections) or
> there are insufficient resources available to allocate a queue pair
> (out of memory). The latter can happen if either 1) insufficient
> memory is available, or 2) no more physical memory can be registered
> with the device.
>
> For more information on memory registration see the Open MPI FAQs at:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host: linuxbmc1156.rz.RWTH-Aachen.DE
> Local device: mlx4_0
> Queue pair type: Reliable connected (RC)
> --------------------------------------------------------------------------
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4021][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** An error occurred in MPI_Alltoallv
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** on communicator MPI_COMM_WORLD
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERR_OTHER: known error not in list
> [linuxbmc1156.rz.RWTH-Aachen.DE:9632] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4024][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.DE][[3703,1],4027][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],10][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE][[3703,1],1][connect/btl_openib_connect_oob.c:867:rml_recv_cb] error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],10] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],8] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],9] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] [[3703,0],0]-[[3703,1],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 9 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / ibv_create_qp failed
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> [linuxbmc0840.rz.RWTH-Aachen.DE:17696] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
>
>

-- 
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915