Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-05-01 12:21:50


So far, I'm unable to reproduce this problem. I haven't exactly
reproduced your test conditions, but then, I can't: at a minimum, I
don't have exactly the code you ran (and I'm not convinced I want to!). So:

*) Can you reproduce the problem with the stand-alone test case I sent out?
*) Does the problem correlate with OMPI version? (I.e., 1.3.1 versus
1.3.2.)
*) Does the problem occur at lower np? (See the example runs below.)
*) Does the problem correlate with the compiler version? (I.e., GCC 4.4
versus 4.3.3.)
*) What is the failure rate? How many times should I expect to run to
see failures?
*) How large is N?
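
For the lower-np and BTL questions, runs along these lines (with your same
./my_prog and prob_size) would help narrow things down:

  mpirun --mca btl self,sm --np 4 ./my_prog prob_size
  mpirun --mca btl self,sm --np 16 ./my_prog prob_size
  mpirun --mca btl self,tcp --np 32 ./my_prog prob_size

If the hang goes away with self,tcp or at small np, that would point more
at the sm BTL than at the compiler.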

Eugene Loh wrote:

> Simone Pellegrini wrote:
>
>> Dear all,
>> I have successfully compiled and installed openmpi 1.3.2 on an
>> 8-socket quad-core machine from Sun.
>>
>> I have used both Gcc-4.4 and Gcc-4.3.3 during the compilation phase,
>> but when I try to run simple MPI programs, processes hang. This is
>> the kernel of the application I am trying to run:
>>
>> MPI_Barrier(MPI_COMM_WORLD);
>> total = MPI_Wtime();
>> for (i = 0; i < N-1; i++) {
>>     // printf("%d\n", i);
>>     if (i > 0)
>>         MPI_Sendrecv(A[i-1], N, MPI_FLOAT, top, 0, row, N,
>>                      MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>>     for (k = 0; k < N; k++)
>>         A[i][k] = (A[i][k] + A[i+1][k] + row[k]) / 3;
>> }
>> MPI_Barrier(MPI_COMM_WORLD);
>> total = MPI_Wtime() - total;
>
>
> Do you know if this kernel is sufficient to reproduce the problem?
> How large is N? Evidently, it's greater than 1600, but I'm still
> curious how big. What are top and bottom? Are they rank+1 and rank-1?
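> (By rank+1 and rank-1 I mean ring neighbors with wraparound, e.g.
> something like
>
>     top = (me + 1) % np;
>     bottom = (me + np - 1) % np;
>
> using the variable names from the test program at the bottom of this
> message.)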
>
>> Sometimes the program terminates correctly, sometimes it doesn't!
>
>
> Roughly, what fraction of runs hang? 50%? 1%? <0.1%?
>
>> I am running the program using the shared-memory module, since I am
>> using just one multi-core node, with the following command:
>>
>> mpirun --mca btl self,sm --np 32 ./my_prog prob_size
>
>
> Any idea if this fails at lower np?
>
>> If I print the index value during program execution, I can see that
>> the program stops running around index value 1600... but it doesn't
>> actually crash. It just stops! :(
>>
>> I ran the program under strace to see what's going on, and this is
>> the output:
>> [...]
>> futex(0x2b20c02d9790, FUTEX_WAKE, 1) = 1
>> futex(0x2aaaaafcf2b0, FUTEX_WAKE, 1) = 0
>> readv(100,
>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
>> 36}], 1) = 36
>> readv(100,
>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\4\0\0\0jj\0\0\0\1\0\0\0", 28}],
>> 1) = 28
>> futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
>> futex(0x2aaaaafcf5e0, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource
>> temporarily unavailable)
>> futex(0x2aaaaafcf5e0, FUTEX_WAKE, 1) = 0
>> writev(102,
>> [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\4\0\0\0\4\0\0\0\34"...,
>> 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0",
>> 28}], 2) = 64
>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
>> events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN},
>> {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37,
>> events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN},
>> {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55,
>> events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN},
>> {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
>> events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN},
>> {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92,
>> events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN},
>> {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0,
>> events=POLLIN}, {fd=100, events=POLLIN, revents=POLLIN}, ...], 39,
>> 1000) = 1
>> readv(100,
>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0\0\0\0\4\0\0\0\34"...,
>> 36}], 1) = 36
>> readv(100,
>> [{"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0", 28}],
>> 1) = 28
>> futex(0x19e93fd8, FUTEX_WAKE, 1) = 1
>> writev(109,
>> [{"n\267\0\1\0\0\0\0n\267\0\0\0\0\0\0n\267\0\1\0\0\0\7\0\0\0\4\0\0\0\34"...,
>> 36}, {"n\267\0\1\0\0\0\0n\267\0\1\0\0\0\7\0\0\0jj\0\0\0\1\0\0\0",
>> 28}], 2) = 64
>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
>> events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN},
>> {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37,
>> events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN},
>> {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55,
>> events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN},
>> {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
>> events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN},
>> {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92,
>> events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN},
>> {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0,
>> events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
>> poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}, {fd=7,
>> events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=11,
>> events=POLLIN}, {fd=21, events=POLLIN}, {fd=25, events=POLLIN},
>> {fd=27, events=POLLIN}, {fd=33, events=POLLIN}, {fd=37,
>> events=POLLIN}, {fd=39, events=POLLIN}, {fd=44, events=POLLIN},
>> {fd=48, events=POLLIN}, {fd=50, events=POLLIN}, {fd=55,
>> events=POLLIN}, {fd=59, events=POLLIN}, {fd=61, events=POLLIN},
>> {fd=66, events=POLLIN}, {fd=70, events=POLLIN}, {fd=72,
>> events=POLLIN}, {fd=77, events=POLLIN}, {fd=81, events=POLLIN},
>> {fd=83, events=POLLIN}, {fd=88, events=POLLIN}, {fd=92,
>> events=POLLIN}, {fd=94, events=POLLIN}, {fd=99, events=POLLIN},
>> {fd=103, events=POLLIN}, {fd=105, events=POLLIN}, {fd=0,
>> events=POLLIN}, {fd=100, events=POLLIN}, ...], 39, 1000) = 1
>>
>> and the program keeps printing this poll() call until I stop it!
>>
>> The program runs perfectly with my old configuration, which was
>> OpenMPI 1.3.1 compiled with Gcc-4.4. Actually, I see the same problem
>> when I compile Openmpi-1.3.1 with Gcc 4.4. Is there any conflict
>> which arises when gcc-4.4 is used?
>
>
> I don't understand this. It runs well with 1.3.1/4.4, but you see the
> same problem with 1.3.1/4.4? I'm confused: you do or don't see the
> problem with 1.3.1/4.4? What do you think is the crucial factor
> here? OMPI rev or GCC rev?
>
> I'm not sure I can replicate all of your test system (hardware, etc.),
> but some sanity tests on what I have so far have turned up clean. I run:
>
> #include <stdio.h>
> #include <mpi.h>
>
> #define N 40000
> #define M 40000
>
> int main(int argc, char **argv) {
>     int np, me, i, top, bottom;
>     float sbuf[N], rbuf[N];
>     MPI_Status status;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>     MPI_Comm_rank(MPI_COMM_WORLD, &me);
>
>     top = me + 1;    if ( top >= np )    top -= np;
>     bottom = me - 1; if ( bottom < 0 )   bottom += np;
>
>     for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>     for ( i = 0; i < N; i++ ) rbuf[i] = 0;
>
>     MPI_Barrier(MPI_COMM_WORLD);
>     for ( i = 0; i < M - 1; i++ )
>         MPI_Sendrecv(sbuf, N, MPI_FLOAT, top, 0,
>                      rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>     MPI_Barrier(MPI_COMM_WORLD);
>
>     MPI_Finalize();
>     return 0;
> }
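
For reference, that test can be built and run with the usual Open MPI
wrappers, e.g. (the file name is arbitrary):

  mpicc -o sendrecv_test sendrecv_test.c
  mpirun --mca btl self,sm --np 32 ./sendrecv_test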