Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] OpenIB problem: error polling HP CQ...
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-06-04 14:27:28


We have made a *lot* of changes to the run-time support for spawn and
some changes to the FLUSH support in the openib BTL for the upcoming
v1.3 series.

Would it be possible for you to try a trunk nightly tarball snapshot,
perchance?

     http://www.open-mpi.org/nightly/trunk/

On May 29, 2008, at 3:50 AM, Matt Hughes wrote:

> I have a program which uses MPI::Comm::Spawn to start processes on
> compute nodes (c0-0, c0-1, etc). The communication between the
> compute nodes consists of Isend and Irecv pairs, while communication
> between the head node and the compute nodes consists of gather and
> bcast operations.
> After executing ~80 successful loops (gather/bcast pairs), I get this
> error message from the head node process during a gather call:
>
> [0,1,0][btl_openib_component.c:1332:btl_openib_component_progress]
> from headnode.local to: c0-0 error polling HP CQ with status WORK
> REQUEST FLUSHED ERROR status number 5 for wr_id 18504944 opcode 1
>
> The relevant environment variables:
> OMPI_MCA_btl_openib_rd_num=128
> OMPI_MCA_btl_openib_verbose=1
> OMPI_MCA_btl_base_verbose=1
> OMPI_MCA_btl_openib_rd_low=75
> OMPI_MCA_btl_base_debug=1
> OMPI_MCA_btl_openib_warn_no_hca_params_found=0
> OMPI_MCA_btl_openib_warn_default_gid_prefix=0
> OMPI_MCA_btl=self,openib
>
> If rd_low and rd_num are left at their default values, the program
> simply hangs in the gather call after about 20 iterations (a gather
> and a bcast).
>
> Can anyone shed any light on what this error message means or what
> might be done about it?
>
> Thanks,
> mch

-- 
Jeff Squyres
Cisco Systems
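
For anyone reproducing the report with a nightly snapshot: Open MPI reads MCA parameters from environment variables named OMPI_MCA_<param>, and the same settings can be passed to mpirun as "--mca <param> <value>". The sketch below only builds such a command line from the variables quoted above; it does not run anything. The program name ./head_node_program and the -np 1 launch (a single head process that spawns the rest) are assumptions, not details from the original message.

```python
import shlex

# MCA settings exactly as listed in the quoted report.
REPORTED_ENV = {
    "OMPI_MCA_btl_openib_rd_num": "128",
    "OMPI_MCA_btl_openib_verbose": "1",
    "OMPI_MCA_btl_base_verbose": "1",
    "OMPI_MCA_btl_openib_rd_low": "75",
    "OMPI_MCA_btl_base_debug": "1",
    "OMPI_MCA_btl_openib_warn_no_hca_params_found": "0",
    "OMPI_MCA_btl_openib_warn_default_gid_prefix": "0",
    "OMPI_MCA_btl": "self,openib",
}

PREFIX = "OMPI_MCA_"


def mca_args(env):
    """Turn OMPI_MCA_<name>=<value> entries into ['--mca', name, value, ...].

    Entries without the OMPI_MCA_ prefix are ignored.
    """
    args = []
    for key, value in env.items():
        if key.startswith(PREFIX):
            args += ["--mca", key[len(PREFIX):], value]
    return args


def mpirun_command(env, program, nprocs=1):
    """Build an mpirun argument list (program and nprocs are placeholders)."""
    return ["mpirun", "-np", str(nprocs)] + mca_args(env) + [program]


if __name__ == "__main__":
    # Hypothetical executable name, used only for illustration.
    print(shlex.join(mpirun_command(REPORTED_ENV, "./head_node_program")))
```

Dropping the rd_num and rd_low entries from the dictionary gives the command line for the default-value case described in the report, which makes the two behaviours easy to compare on the same build.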