
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
Date: 2006-06-02 23:08:29


Troy and I talked about this one off-list as well; we traced the issue
to problems with his local IB fabric.

The moral of the story here is that Open MPI's error messages need to
be a bit more descriptive (in this case, they should have said, "Help!
The sky is falling, the sky is falling!").
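
For the archives: the "status" numbers in Troy's output below are raw
ibv_wc_status codes from the completion queue. Per the libibverbs
enum, status 12 is IBV_WC_RETRY_EXC_ERR (transport retry count
exceeded, consistent with a misbehaving fabric) and status 5 is
IBV_WC_WR_FLUSH_ERR (work requests flushed after the QP entered the
error state). Here's a minimal sketch of how a verbs consumer can
report these by name instead of by number, assuming a libibverbs new
enough to ship ibv_wc_status_str(); this is illustrative, not Open
MPI's actual polling code:

    #include <stdio.h>
    #include <infiniband/verbs.h>

    /* Drain a completion queue and report errors readably.
     * 'cq' is assumed to be a valid CQ supplied by the caller. */
    void drain_cq(struct ibv_cq *cq)
    {
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
            if (wc.status != IBV_WC_SUCCESS) {
                /* e.g. status 12 decodes to a retry-exceeded error,
                 * status 5 to a flushed work request */
                fprintf(stderr, "wr_id %llu failed: %s (status %d)\n",
                        (unsigned long long) wc.wr_id,
                        ibv_wc_status_str(wc.status), (int) wc.status);
            }
        }
    }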
 

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Troy Telford
> Sent: Thursday, June 01, 2006 3:35 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Open MPI 1.0.2 and np >=64
>
> > Did you happen to have a chance to try to run the 1.0.3 or 1.1
> > nightly tarballs? I'm 50/50 on whether we've fixed these issues
> > already.
>
> OK, for ticket #40:
>
> With Open MPI 1.0.3 (nightly downloaded/built May 31st)
> (This time using presta's 'laten', since the source code + comments
> is < 1k lines of code)
>
> One note: There doesn't seem to be a specific number of nodes at
> which the error crops up. It almost seems like a case of probability:
> with -np 142, the test will succeed ~75% of the time. Lower -np
> values result in higher success rates, and larger values of -np
> increase the probability of failure. -np 148 fails > 90% of the
> time; -np 128 works pretty much all the time.
>
> Fiddling with the machinefile to try to narrow it down to misbehaving
> hardware (for instance, using only a specific set of nodes) had no
> effect.
>
> On to the results:
>
> [root_at_zartan1 tmp]# mpirun -v -prefix $MPIHOME -mca btl openib,sm,self -np 148 -machinefile machines /tmp/laten -o 10
>
> MPI Bidirectional latency test (Send/Recv)
> Processes Max Latency (us)
> ------------------------------------------
>
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 12 for wr_id 47120798794424 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121337969156 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338002208 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338035260 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338068312 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338101364 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338134416 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338167468 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338200520 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121338233572 opcode 0
> [0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 47121340387456 opcode 0
>
> If I use -np 145 (actually, any odd number of processes; that may
> just be a case of running 'laten' incorrectly), I get:
>
> MPI Bidirectional latency test (Send/Recv)
> Processes Max Latency (us)
> ------------------------------------------
> 2 8.249
> 4 15.795
> 8 21.803
> 16 23.353
> 32 21.601
> 64 31.900
> [zartan75:06723] *** An error occurred in MPI_Group_incl
> [zartan75:06723] *** on communicator MPI_COMM_WORLD
> [zartan75:06723] *** MPI_ERR_RANK: invalid rank
> [zartan75:06723] *** MPI_ERRORS_ARE_FATAL (goodbye)
>
> *** (and more of the same, with different nodes)
>
> 1 additional process aborted (not shown)
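
[A side note on the MPI_ERR_RANK abort above: one plausible mechanism
is that laten pairs ranks up, so an odd -np leaves the last "pair"
naming a rank that doesn't exist. Here's a minimal sketch of that
failure mode; the pairing loop is a guess for illustration, not
presta's actual code:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int i, size, pair[2];
        MPI_Group world_group, pair_group;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* Pair ranks (0,1), (2,3), ...  With an odd world size the
         * last pair names rank 'size', which is one past the largest
         * valid rank, so MPI_Group_incl raises MPI_ERR_RANK, which is
         * fatal under the default MPI_ERRORS_ARE_FATAL handler. */
        for (i = 0; i < (size + 1) / 2; ++i) {
            pair[0] = 2 * i;
            pair[1] = 2 * i + 1;   /* == size when size is odd */
            MPI_Group_incl(world_group, 2, pair, &pair_group);
            MPI_Group_free(&pair_group);
        }

        MPI_Group_free(&world_group);
        MPI_Finalize();
        return 0;
    }

With an even -np this completes cleanly; with an odd -np the last
iteration aborts exactly like the output above.]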
>
> ***************************
> With Open MPI 1.1:
> mpirun -v -np 150 -prefix $MPIHOME -mca btl openib,sm,self -machinefile machines laten -o 10
> MPI Bidirectional latency test (Send/Recv)
> Processes Max Latency (us)
> ------------------------------------------
> 2 21.648
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 12 for wr_id 5775790 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 5865600 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7954692 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7967282 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7979872 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 7992462 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8005052 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8017642 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8030232 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8042822 opcode 0
> [0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress] error polling HP CQ with status 5 for wr_id 8055412 opcode 0
> --
> Troy Telford
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>