
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SIGV at MPI_Cart_sub
From: Anas Al-Trad (anas.altrad_at_[hidden])
Date: 2012-01-10 12:10:05


It is a good question; I asked it myself at first, but then decided it should
be correct. Anyway, I want to confirm it. Here is the code snippet of the
program:
...
int ranks[size];
for(i = 0; i < size; ++i)
{
    ranks[i] = i;
}
...

for(p = 8; p <= size; p += 4)
{
    MPI_Barrier(MPI_COMM_WORLD);
    if(!grid_init(p, 1)) continue;
    if( (p >= m) || (p >= k) || (p >= n) )
        break;

    MPI_Group_incl(world_group, p, ranks, &working_group);
    MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);

    if(working_comm != MPI_COMM_NULL)
    {
        ...
        variant_run(&variant5, C, m, k, n, my_rank, p, working_comm);
        ...
        MPI_Group_free(&working_group);
        MPI_Comm_free(&working_comm);
    }
}
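
For reference, here is a minimal self-contained sketch of the same
subset-communicator pattern, assuming world_group was obtained earlier from
MPI_Comm_group (the variable names follow the snippet above):

/* Minimal sketch: build a communicator over the first p ranks of
   MPI_COMM_WORLD. Assumes 0 < p <= size. */
MPI_Group world_group, working_group;
MPI_Comm  working_comm;

MPI_Comm_group(MPI_COMM_WORLD, &world_group);
MPI_Group_incl(world_group, p, ranks, &working_group);
MPI_Comm_create(MPI_COMM_WORLD, working_group, &working_comm);

if(working_comm != MPI_COMM_NULL)
{
    /* ... work on the p-rank subset ... */
    MPI_Comm_free(&working_comm); /* only ranks inside the communicator */
}
MPI_Group_free(&working_group); /* valid on every rank that made the group */
MPI_Group_free(&world_group);

Note that MPI_Group_free may be called by every rank that created the group,
while MPI_Comm_free must only be called by ranks whose communicator handle is
not MPI_COMM_NULL.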

Inside variant_run, the following function is called; the error occurs inside it:
void Compute_SUMMA1(Matrix* A, Matrix* B, Matrix* C, size_t M, size_t K,
                    size_t N, size_t my_rank, size_t size, MPI_Comm comm)
{
    C->block_matrix = gsl_matrix_calloc(A->block_matrix->size1,
                                        B->block_matrix->size2);
    C->distribution_type = TwoD_Block;

    MPI_Comm grid_comm;
    int dim[2], period[2], reorder = 0, ndims = 2;
    int coord[2], id;

    dim[0] = global.PR; dim[1] = global.PC;
    period[0] = 0; period[1] = 0;

    int ss, rr;
    MPI_Group comm_group;
    MPI_Comm_group(comm, &comm_group);
    MPI_Group_size(comm_group, &ss);
    MPI_Group_rank(comm_group, &rr);
    if(ss == 6)
    {
        //printf("M %d K %d N %d
        //printf("my_rank in comm %d my_rank in world_comm %d\n", rr, my_rank);
        //printf(" comm size %d my_rank in comm %d my_rank in world_comm %d\n",
        //       ss, rr, my_rank);
        //printf("SUMMA ... PR %d PC %d\n", global.PR, global.PC);
    }
    //MPI_Barrier(comm);
    //if(my_rank == 0)
    //    printf("my_rank %d ndims %d dim[0] %d dim[1] %d period[0] %d period[1] %d reorder %d\n",
    //           my_rank, ndims, dim[0], dim[1], period[0], period[1], reorder);
    //if(comm == MPI_COMM_NULL)
    //    printf("my_rank %d comm is empty\n", my_rank);

    MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);

    MPI_Comm Acomm, Bcomm;

    // create column subgrids
    int remain[2]; //, mdims, dims[2], row_coords[2];
    remain[0] = 1;
    remain[1] = 0;
    MPI_Cart_sub(grid_comm, remain, &Bcomm);

    remain[0] = 0;
    remain[1] = 1;
    MPI_Cart_sub(grid_comm, remain, &Acomm);
    ...
}
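
Given that the backtrace quoted below shows an integer divide-by-zero inside
mca_topo_base_cart_coords, one thing worth ruling out is that dim[0] or dim[1]
ends up 0 on some rank, or that dim[0]*dim[1] does not match the size of comm.
A defensive check before the topology calls could look like this (a debugging
sketch, not part of the original code):

/* Debugging sketch: validate the grid dims on every rank before
   creating the Cartesian topology. */
int comm_size;
MPI_Comm_size(comm, &comm_size);
if(dim[0] <= 0 || dim[1] <= 0 || dim[0] * dim[1] != comm_size)
{
    fprintf(stderr, "rank %d: bad grid %d x %d for comm of size %d\n",
            rr, dim[0], dim[1], comm_size);
    MPI_Abort(comm, 1);
}

MPI_Cart_create(comm, ndims, dim, period, reorder, &grid_comm);
if(grid_comm == MPI_COMM_NULL)
    return; /* this rank is outside the grid; calling MPI_Cart_sub
               on MPI_COMM_NULL would be erroneous */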

As you can see, all ranks call grid_init, a global function that computes the
grid dimensions: for p = 24 it produces 4x6, for p = 96 it produces 8x12, and
it stores the result in a global structure as PR and PC. It is executed by all
processes; I checked the result for rank 0 and some other processes and it is
correct, so I assume it is correct for all the other processes as well.
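
grid_init itself is not shown here; a hypothetical sketch consistent with the
behavior described above (24 -> 4x6, 96 -> 8x12, i.e. the most nearly square
factorization) might look like this (needs <math.h>; the name grid_init_sketch
and the return convention are illustrative, not the real code):

/* Hypothetical sketch of grid_init: pick the most nearly square
   factorization PR x PC = p, e.g. 24 -> 4x6, 96 -> 8x12. */
int grid_init_sketch(int p)
{
    int pr;
    for(pr = (int)sqrt((double)p); pr >= 1; --pr)
        if(p % pr == 0)
            break;
    if(pr < 1)
        return 0;         /* no valid factorization (p < 1) */
    global.PR = pr;       /* grid rows    */
    global.PC = p / pr;   /* grid columns */
    return 1;
}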

So grid_comm, the input to MPI_Cart_sub, should be correct. The ranks in
working_comm and in MPI_COMM_WORLD should also be the same, given how the
ranks array is filled at the beginning of the code snippet above.
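
Since only rank 0 and a few other ranks were checked, one cheap way to verify
that claim on every rank is to reduce the minimum and maximum of PR and PC
over the communicator (a debugging sketch, assuming the global struct from the
code above):

/* Debugging sketch: confirm that every rank in comm computed the
   same PR and PC. If min != max for either, some rank disagrees. */
int mine[2] = { global.PR, global.PC };
int gmin[2], gmax[2];
MPI_Allreduce(mine, gmin, 2, MPI_INT, MPI_MIN, comm);
MPI_Allreduce(mine, gmax, 2, MPI_INT, MPI_MAX, comm);
if(gmin[0] != gmax[0] || gmin[1] != gmax[1])
    printf("grid dims differ across ranks: PR %d..%d PC %d..%d\n",
           gmin[0], gmax[0], gmin[1], gmax[1]);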

On Tue, Jan 10, 2012 at 5:25 PM, Jeff Squyres <jsquyres_at_[hidden]> wrote:

> This may be a dumb question, but are you 100% sure that the input values
> are correct?
>
> On Jan 10, 2012, at 8:16 AM, Anas Al-Trad wrote:
>
> > Hi Ralph, I changed the Intel icc module from 12.1.0 to 11.1.069, the
> > previous default one used on the Neolith cluster. I submitted the job
> > and I am still waiting for the result. Here is the message of the
> > segmentation fault:
> >
> > [n764:29867] *** Process received signal ***
> > [n764:29867] Signal: Floating point exception (8)
> > [n764:29867] Signal code: Integer divide-by-zero (1)
> > [n764:29867] Failing at address: 0x2ba640e74627
> > [n764:29867] [ 0] /lib64/libc.so.6 [0x2ba641e162d0]
> > [n764:29867] [ 1] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_coords+0x43) [0x2ba640e74627]
> > [n764:29867] [ 2] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(mca_topo_base_cart_sub+0x1d5) [0x2ba640e74acd]
> > [n764:29867] [ 3] /software/mpi/openmpi/1.4.1/i101011/lib/libmpi.so.0(MPI_Cart_sub+0x35) [0x2ba640e472d9]
> > [n764:29867] [ 4] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(Compute_SUMMA1+0x226) [0x4088da]
> > [n764:29867] [ 5] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(variant_run+0xb2) [0x409058]
> > [n764:29867] [ 6] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o(main+0xf90) [0x40eeba]
> > [n764:29867] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2ba641e03994]
> > [n764:29867] [ 8] /home/x_anaal/thesis/cimple/tst_chng_p/v5/r2/output.o [0x403fd9]
> > [n764:29867] *** End of error message ***
> >
> > When I run my application, sometimes I get this error and sometimes it
> > gets stuck in the middle.
> >
>
>
> --
> Jeff Squyres
> jsquyres_at_[hidden]
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/