Open MPI Development Mailing List Archives

Subject: [OMPI devel] OFED question
From: Barrett, Brian W (bwbarre_at_[hidden])
Date: 2011-01-27 17:56:14

All -

On one of our clusters, one of our applications is failing with the following error. I believe it is using Open MPI 1.4.3:

[xxx:27545] *** An error occurred in MPI_Scatterv
[xxx:27545] *** on communicator MPI COMMUNICATOR 5 DUP FROM 4
[xxx:27545] *** MPI_ERR_OTHER: known error not in list
[xxx:27545] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[xxx][[31806,1],0][connect/btl_openib_connect_oob.c:857:qp_create_one] error creating qp errno says Resource temporarily unavailable
mpirun has exited due to process rank 0 with PID 27545 on
node rs1891 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

The problem goes away if we modify the eager protocol message sizes so that only two QPs are necessary instead of the default four. Assuming the issue is simply that we are running out of available QPs, is there a way to raise the number of QPs that can be created on a node? If not, any other thoughts on working around the problem?
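
To make the scaling concern concrete, here is a rough Python sketch of how QP demand might grow with the number of QPs configured. It assumes the colon-separated form used by the btl_openib_receive_queues MCA parameter, and it assumes (as a simplification, not something verified against the 1.4.3 source) that a process creates one QP per queue specification for each peer it connects to. The two example strings, the peer count, and the function names are illustrative only and are not taken from our actual configuration.

```python
def count_qps(receive_queues: str) -> int:
    """Count the queue specifications in a colon-separated string."""
    return len([spec for spec in receive_queues.split(":") if spec.strip()])


def estimate_qps_per_process(receive_queues: str, connected_peers: int) -> int:
    """Estimate QPs per process, assuming one QP per specification per peer."""
    return count_qps(receive_queues) * connected_peers


# Illustrative strings only: one with four queue specifications, one with two.
FOUR_QP = "P,128,256,192,128:S,2048,256,128,32:S,12288,256,128,32:S,65536,256,128,32"
TWO_QP = "P,128,256,192,128:S,65536,256,128,32"

if __name__ == "__main__":
    peers = 1000  # hypothetical number of connected peers
    for label, spec in (("four QPs", FOUR_QP), ("two QPs", TWO_QP)):
        print(label, estimate_qps_per_process(spec, peers))
```

Under these assumptions, halving the number of queue specifications halves the QPs each process needs, which would be consistent with the failure disappearing in the two-QP configuration.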



  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories