
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
From: Simone Pellegrini (spellegrini_at_[hidden])
Date: 2009-05-04 04:37:00


Hi,
sorry for the delay, but I did some additional experiments to find out
whether the problem was OpenMPI or gcc!

Attached you will find the program that causes the problem mentioned before.
I compile the program with the following line:

$HOME/openmpi-1.3.2-gcc44/bin/mpicc -O3 -g -Wall -fmessage-length=0 -m64
bug.c -o bug

When I run the program using OpenMPI 1.3.2 compiled with gcc 4.4 in the
following way:

$HOME/openmpi-1.3.2-gcc44/bin/mpirun --mca btl self,sm --np 32 ./bug 1024

The program just hangs and never terminates! I am running on an SMP
machine with 32 cores; it is a Sun Fire X4600 X2 (8 quad-core
Barcelona AMD chips), the OS is CentOS 5, and the kernel is
2.6.18-92.el5.src-PAPI (patched with PAPI).
I use an N of 1024, and if I print out the value of the iterator i,
sometimes it stops around 165, other times around 520... it doesn't
make any sense.
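
One caveat about reading those iteration numbers: with 32 ranks sharing
stdout, buffered output can make the last printed value of i lag behind the
iteration a rank actually reached. Below is a minimal sketch of a rank-tagged,
flushed progress print; it is an illustration only, not the code in the
attached bug.c, and the communication/update step is elided.

/* Illustration only (not the attached bug.c): tag each progress line with
 * the MPI rank and flush immediately, so the highest iteration printed per
 * rank really is the last one that rank reached before the hang. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    int n = 1024;                       /* same N as in the reported runs */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < n - 1; i++) {
        fprintf(stderr, "rank %d at iteration %d\n", rank, i);
        fflush(stderr);                 /* keep stdio buffering from hiding progress */
        /* ... MPI_Sendrecv / row update as in the kernel quoted below ... */
    }

    MPI_Finalize();
    return 0;
}

Launched the same way (mpirun --mca btl self,sm --np 32 ...), the last line
printed per rank then shows exactly how far each rank got before the hang.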

If I run the program with the mpirun from a different MPI version (it is
important to note that I don't recompile it), the program works fine. I did
some experiments over the weekend, and if I use openmpi-1.3.2 compiled with
gcc 4.3.3 everything works fine.

So I really think the problem is strictly related to the use of
gcc-4.4.0! ...and it doesn't depend on OpenMPI, as the program hangs
even when I use OpenMPI 1.3.1 compiled with gcc 4.4!

I hope everything is clear now.

regards, Simone

Eugene Loh wrote:
> So far, I'm unable to reproduce this problem. I haven't exactly
> reproduced your test conditions, but then I can't: at a minimum, I
> don't have exactly the code you ran (and I'm not convinced I want to!). So:
>
> *) Can you reproduce the problem with the stand-alone test case I sent
> out?
> *) Does the problem correlate with OMPI version? (I.e., 1.3.1 versus
> 1.3.2.)
> *) Does the problem occur at lower np?
> *) Does the problem correlate with the compiler version? (I.e., GCC
> 4.4 versus 4.3.3.)
> *) What is the failure rate? How many times should I expect to run to
> see failures?
> *) How large is N?
>
> Eugene Loh wrote:
>
>> Simone Pellegrini wrote:
>>
>>> Dear all,
>>> I have successfully compiled and installed openmpi 1.3.2 on an
>>> 8-socket quad-core machine from Sun.
>>>
>>> I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase,
>>> but when I try to run simple MPI programs the processes hang. Actually,
>>> this is the kernel of the application I am trying to run:
>>>
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> total = MPI_Wtime();
>>> for (i = 0; i < N-1; i++) {
>>>     // printf("%d\n", i);
>>>     if (i > 0)
>>>         MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N,
>>>                      MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>>>     for (k = 0; k < N; k++)
>>>         A[i][k] = (A[i][k] + A[i+1][k] + row[k]) / 3;
>>> }
>>> MPI_Barrier(MPI_COMM_WORLD);
>>> total = MPI_Wtime() - total;
>>
>>
>> Do you know if this kernel is sufficient to reproduce the problem?
>> How large is N? Evidently, it's greater than 1600, but I'm still
>> curious how big. What are top and bottom? Are they rank+1 and rank-1?
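
For readers without the attachment, here is a self-contained sketch of a
reproducer built around the quoted kernel. It is only a guess at the missing
pieces: the ring definitions of top and bottom, the sizes and initialisation
of A and row, and reading N from the command line are all assumptions, and
the actual bug.c attached to this thread may differ.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int np, rank, i, k;
    int N = (argc > 1) ? atoi(argv[1]) : 1024;   /* e.g. ./bug 1024 */
    MPI_Status status;
    double total;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Assumed ring neighbours; bug.c may define top and bottom differently. */
    int top    = (rank + 1) % np;
    int bottom = (rank - 1 + np) % np;

    /* Assumed data layout: A is an N x N block per rank, row holds one
     * received row of N floats.  (Uses gcc's C99/GNU extensions, which the
     * posted mpicc line enables by default.) */
    float (*A)[N] = malloc((size_t)N * N * sizeof(float));
    float *row    = malloc(N * sizeof(float));
    for (i = 0; i < N; i++)
        for (k = 0; k < N; k++)
            A[i][k] = 1.0f;
    for (k = 0; k < N; k++)
        row[k] = 0.0f;

    MPI_Barrier(MPI_COMM_WORLD);
    total = MPI_Wtime();
    for (i = 0; i < N - 1; i++) {
        if (i > 0)
            MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0,
                         row, N, MPI_FLOAT, bottom, 0,
                         MPI_COMM_WORLD, &status);
        for (k = 0; k < N; k++)
            A[i][k] = (A[i][k] + A[i+1][k] + row[k]) / 3;
    }
    MPI_Barrier(MPI_COMM_WORLD);
    total = MPI_Wtime() - total;

    if (rank == 0)
        printf("elapsed: %f s\n", total);

    free(A);
    free(row);
    MPI_Finalize();
    return 0;
}

With this communication pattern, every rank sends to top and receives from
bottom, so the Sendrecv calls form a closed ring and should not deadlock by
construction; that is what makes the intermittent hang suspicious.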
>>
>>> Sometimes the program terminates correctly, sometimes it doesn't!
>>
>>
>> Roughly, what fraction of runs hang? 50%? 1%? <0.1%?
>>
>>> I am running the program using the shared memory module because I am
>>> using just one multi-core machine, with the following command:
>>>
>>> mpirun --mca btl self,sm --np 32 ./my_prog prob_size
>>
>>
>> Any idea if this fails at lower np?
>>
>>> If I print the index number during the program execution, I can see
>>> that the program stops running around index value 1600... but it
>>> doesn't actually crash. It just stops! :(
>>>
>>> I run the program under strace to see what's going on and this is
>>> the output:
>>> [...]
>>> futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
>>> futex(0x2aaaaafcf2b0, FUTEX_WAKE, 1) = 0
>>> readv(100,
>>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
>>> 36}], 1) = 36
>>> readv(100,
>>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}],
>>> 1) = 28
>>> futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
>>> futex(0x2aaaaafcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource
>>> temporarily unavailable)
>>> futex(0x2aaaaafcf5e0, FUTEX_WAKE, 1) = 0
>>> writev(102,
>>> [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"...,
>>> 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0",
>>> 28}], 2) = 64
>>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN},
>>> {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25,
>>> events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN},
>>> {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44,
>>> events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN},
>>> {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
>>> events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN},
>>> {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81,
>>> events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN},
>>> {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99,
>>> events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN},
>>> {fd=0, events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN},
>>> ...], 39, 1000) = 1
>>> readv(100,
>>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
>>> 36}], 1) = 36
>>> readv(100,
>>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}],
>>> 1) = 28
>>> futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
>>> writev(109,
>>> [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"...,
>>> 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0",
>>> 28}], 2) = 64
>>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN},
>>> {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25,
>>> events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN},
>>> {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44,
>>> events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN},
>>> {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
>>> events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN},
>>> {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81,
>>> events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN},
>>> {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99,
>>> events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN},
>>> {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
>>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN},
>>> {fd=11, events=POLLIN}, {fd=21, events=POLLIN}, {fd=25,
>>> events=POLLIN}, {fd=27, events=POLLIN}, {fd=33, events=POLLIN},
>>> {fd=37, events=POLLIN}, {fd=39, events=POLLIN}, {fd=44,
>>> events=POLLIN}, {fd=48, events=POLLIN}, {fd=50, events=POLLIN},
>>> {fd=55, events=POLLIN}, {fd=59, events=POLLIN}, {fd=61,
>>> events=POLLIN}, {fd=66, events=POLLIN}, {fd=70, events=POLLIN},
>>> {fd=72, events=POLLIN}, {fd=77, events=POLLIN}, {fd=81,
>>> events=POLLIN}, {fd=83, events=POLLIN}, {fd=88, events=POLLIN},
>>> {fd=92, events=POLLIN}, {fd=94, events=POLLIN}, {fd=99,
>>> events=POLLIN}, {fd=103, events=POLLIN}, {fd=105, events=POLLIN},
>>> {fd=0, events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
>>>
>>> and the program keeps printing this poll() call until I stop it!
>>>
>>> The program runs perfectly with my old configuration, which was
>>> OpenMPI 1.3.1 compiled with Gcc-4.4. Actually I see the same problem
>>> when I compile Openmpi-1.3.1 with Gcc 4.4. Is there any conflict
>>> that arises when gcc-4.4 is used?
>>
>>
>> I don't understand this. It runs well with 1.3.1/4.4, but you see
>> the same problem with 1.3.1/4.4? I'm confused: you do or don't see
>> the problem with 1.3.1/4.4? What do you think is the crucial factor
>> here? OMPI rev or GCC rev?
>>
>> I'm not sure I can replicate all of your test system (hardware,
>> etc.), but some sanity tests on what I have so far have turned up
>> clean. I run:
>>
>> #include <stdio.h>
>> #include <mpi.h>
>>
>> #define N 40000
>> #define M 40000
>>
>> int main(int argc, char **argv) {
>>     int np, me, i, top, bottom;
>>     float sbuf[N], rbuf[N];
>>     MPI_Status status;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>>
>>     top = me + 1;    if ( top >= np )    top -= np;
>>     bottom = me - 1; if ( bottom < 0 )   bottom += np;
>>
>>     for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>>     for ( i = 0; i < N; i++ ) rbuf[i] = 0;
>>
>>     MPI_Barrier(MPI_COMM_WORLD);
>>     for ( i = 0; i < M - 1; i++ )
>>         MPI_Sendrecv(sbuf, N, MPI_FLOAT, top,    0,
>>                      rbuf, N, MPI_FLOAT, bottom, 0,
>>                      MPI_COMM_WORLD, &status);
>>     MPI_Barrier(MPI_COMM_WORLD);
>>
>>     MPI_Finalize();
>>     return 0;
>> }
>



  • text/x-csrc attachment: bug.c