
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-09-11 08:24:41


Cisco is no longer an IB vendor, but I seem to recall that these kinds
of errors typically indicated a fabric problem. Have you run layer 0
and 1 diagnostics to ensure that the fabric is clean?

On Sep 11, 2009, at 8:09 AM, Rolf Vandevaart wrote:

> Hi, how exactly do you run this to get this error? I tried and it
> worked for me.
>
> burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0
> -mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17
> -mca btl_openib_ib_timeout 16 a.out
> I am 0 at 1252670691
> I am 1 at 1252670559
> I am 0 at 1252670692
> I am 1 at 1252670559
> burl-ct-x2200-16 51 =>
>
> Rolf
>
> On 09/11/09 07:18, Ake Sandgren wrote:
> > Hi!
> >
> > The following code misbehaves when running over openib.
> >
> > Open MPI: 1.3.3
> > With openib it dies with "error polling HP CQ with status WORK REQUEST
> > FLUSHED ERROR status number 5"; with tcp or shmem it works as expected.
> >
> >
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <time.h>
> > #include <unistd.h>   /* sleep() */
> > #include "mpi.h"
> >
> > int main(int argc, char *argv[])
> > {
> >     int rank;
> >     int n;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >
> >     /* Timestamp before the broadcast; time_t is printed as long. */
> >     fprintf(stderr, "I am %d at %ld\n", rank, (long)time(NULL));
> >     fflush(stderr);
> >
> >     /* Broadcast a single C int from rank 0 to all ranks. */
> >     n = 4;
> >     MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
> >     fprintf(stderr, "I am %d at %ld\n", rank, (long)time(NULL));
> >     fflush(stderr);
> >
> >     /* Rank 0 stays outside the MPI library for 60 s before the barrier. */
> >     if (rank == 0) {
> >         sleep(60);
> >     }
> >     MPI_Barrier(MPI_COMM_WORLD);
> >
> >     MPI_Finalize();
> >     exit(0);
> > }
> >
> > I know the internal Open MPI reason for this behaviour, but I think
> > the program should be allowed to behave as it does.
> >
> > This example is a bit engineered, but there are codes where a similar
> > situation can occur, i.e. the Bcast sender doing lots of other work
> > after the Bcast before the next MPI call. VASP is a candidate for this.
> >
>
>
> --
>
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================

-- 
Jeff Squyres
jsquyres_at_[hidden]
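
One possible workaround for the pattern Ake describes (the Bcast root doing a long stretch of non-MPI work before its next MPI call) is to break that work into slices and make a cheap MPI call between slices, so the library gets a chance to progress outstanding communication. The sketch below assumes the failure is tied to rank 0 staying out of the MPI library; the thread does not confirm that, and it would not help with a fabric problem of the kind Jeff suggests. The names run_with_progress, step_fn, progress_fn and count_poke are hypothetical, and the callback is kept generic so the block compiles without mpi.h.

```c
#include <stddef.h>

/* Hypothetical callback types, not part of MPI or Open MPI. */
typedef void (*step_fn)(int step, void *ctx);   /* one slice of real work */
typedef void (*progress_fn)(void *ctx);         /* brief call into MPI    */

/*
 * Run total_steps slices of work and call poke() after every `interval`
 * slices. Returns how many times poke() was invoked.
 */
int run_with_progress(int total_steps, int interval,
                      step_fn do_step, progress_fn poke, void *ctx)
{
    int pokes = 0;

    if (interval <= 0) {
        interval = 1;               /* guard against modulo by zero */
    }
    for (int step = 0; step < total_steps; ++step) {
        if (do_step != NULL) {
            do_step(step, ctx);
        }
        if ((step + 1) % interval == 0 && poke != NULL) {
            poke(ctx);
            ++pokes;
        }
    }
    return pokes;
}

/* Stand-in poke for trying the helper without MPI: counts its calls. */
void count_poke(void *ctx)
{
    ++*(int *)ctx;
}

/*
 * In an MPI program the poke could instead be something like this
 * (an assumption, needs "mpi.h"):
 *
 *   void poke_mpi(void *ctx)
 *   {
 *       int flag;
 *       (void)ctx;
 *       MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD,
 *                  &flag, MPI_STATUS_IGNORE);
 *   }
 */
```

In the reproducer above, the single sleep(60) on rank 0 would become 60 one-second slices with a poke after each, for example run_with_progress(60, 1, sleep_one_second, poke_mpi, NULL), where sleep_one_second is another hypothetical callback. Whether this is enough to avoid the error depends on the actual cause.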