
Open MPI User's Mailing List Archives


From: Troy Telford (ttelford_at_[hidden])
Date: 2006-06-01 15:34:59


> Did you happen to have a chance to try to run the 1.0.3 or 1.1
> nightly tarballs? I'm 50/50 on whether we've fixed these issues
> already.

OK, for ticket #40:

With Open MPI 1.0.3 (nightly downloaded/built May 31st)
(This time using presta's 'laten', since its source code plus comments is
under 1,000 lines)

One note: there doesn't seem to be a specific number of nodes at which the
error crops up; it looks more like a matter of probability. With -np 142,
the test succeeds roughly 75% of the time. Lower -np values give higher
success rates, and larger values increase the probability of failure:
-np 148 fails more than 90% of the time, while -np 128 works essentially
every time.

Fiddling with the machinefile (to try to narrow it down to misbehaving
hardware) -- for instance, using only a specific set of nodes -- had no
effect.

On to the results:

[root_at_zartan1 tmp]# mpirun -v -prefix $MPIHOME -mca btl openib,sm,self -np
148 -machinefile machines /tmp/laten -o 10

MPI Bidirectional latency test (Send/Recv)
              Processes Max Latency (us)
------------------------------------------

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 12 for wr_id 47120798794424 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121337969156 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338002208 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338035260 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338068312 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338101364 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338134416 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338167468 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338200520 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121338233572 opcode 0

[0,1,144][btl_openib_component.c:587:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 47121340387456 opcode 0

If I use -np 145 (or, it seems, any odd number of processes; that may just
be a case of running 'laten' incorrectly), I get:

MPI Bidirectional latency test (Send/Recv)
              Processes Max Latency (us)
------------------------------------------
                      2 8.249
                      4 15.795
                      8 21.803
                     16 23.353
                     32 21.601
                     64 31.900
[zartan75:06723] *** An error occurred in MPI_Group_incl
[zartan75:06723] *** on communicator MPI_COMM_WORLD
[zartan75:06723] *** MPI_ERR_RANK: invalid rank
[zartan75:06723] *** MPI_ERRORS_ARE_FATAL (goodbye)

(...and more of the same, with different nodes)

1 additional process aborted (not shown)

***************************
With Open MPI 1.1:
mpirun -v -np 150 -prefix $MPIHOME -mca btl openib,sm,self -machinefile
machines laten -o 10
MPI Bidirectional latency test (Send/Recv)
              Processes Max Latency (us)
------------------------------------------
                      2 21.648
[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 12 for wr_id 5775790 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 5865600 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 7954692 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 7967282 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 7979872 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 7992462 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 8005052 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 8017642 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 8030232 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 8042822 opcode 0

[0,1,144][btl_openib_component.c:782:mca_btl_openib_component_progress]
error polling HP CQ with status 5 for wr_id 8055412 opcode 0

--
Troy Telford