
Subject: Re: [OMPI users] Deadlock in MPI_File_write_all on Infiniband
From: Dorian Krause (doriankrause_at_[hidden])
Date: 2009-10-13 14:49:08


Hi Edgar,

this sounds reasonable. Looking at the program in a debugger, I can see
that 15 of the 16 processes are waiting in PMPI_Allreduce, while the
remaining one is in PMPI_Wait.

Also, the program works with MVAPICH, and I assume the ADIO source tree
is more or less the same (correct me if I'm wrong).

So I'll stick to MPI_File_write and wait for 1.3.4 ...
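
For reference, a minimal sketch of the pattern Edgar describes below -- not
the attached reproducer; the file name "testfile.out", the element count and
the per-rank offsets are made up for illustration. File_open duplicates the
communicator internally, and the collective write_all communicates right
after it:

#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    const int count = 1024;          /* illustrative element count */
    double *buf;
    MPI_File fh;
    MPI_Offset disp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(count * sizeof(double));
    for (i = 0; i < count; ++i)
        buf[i] = (double)rank;

    /* MPI_File_open performs a Comm_dup of MPI_COMM_WORLD internally. */
    MPI_File_open(MPI_COMM_WORLD, "testfile.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    /* Each rank writes its block at a disjoint byte offset. */
    disp = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_set_view(fh, disp, MPI_DOUBLE, MPI_DOUBLE,
                      "native", MPI_INFO_NULL);

    /* Collective write: communication happens right after the internal
     * Comm_dup, which is where the hang shows up over IB. Replacing this
     * call with the independent MPI_File_write (same arguments) is the
     * workaround mentioned above. */
    MPI_File_write_all(fh, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}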

Thanks,
Dorian

Edgar Gabriel wrote:
> I am wondering whether this is really due to the usage of
> File_write_all. We have had a bug throughout the 1.3 series (which will
> be fixed in 1.3.4) where we lost message segments and thus deadlocked
> in Comm_dup if communication occurred *right after* the Comm_dup.
> File_open executes a comm_dup internally.
>
> If you replace write_all with write, you avoid the communication.
> If you replace IB with TCP, the timing changes entirely and you
> might accidentally not see the deadlock...
>
> Just my $0.02 ...
>
> Thanks
> Edgar
>
> Dorian Krause wrote:
>> Dear list,
>>
>> the attached program deadlocks in MPI_File_write_all when run with 16
>> processes on two 8-core nodes of an InfiniBand cluster. It runs fine
>> when I
>>
>> a) use tcp
>> or
>> b) replace MPI_File_write_all by MPI_File_write
>>
>> I'm using Open MPI 1.3.2 (but I checked that the problem also occurs
>> with version 1.3.3). The OFED version is 1.4 (installed via Rocks).
>> The operating system is CentOS 5.2.
>>
>> I compile with gcc-4.1.2. The openmpi configure flags are
>>
>> ../../configure --prefix=/share/apps/openmpi/1.3.2/gcc-4.1.2/
>> --with-io-romio-flags=--with-file-system=nfs+ufs+pvfs2
>> --with-wrapper-ldflags=-L/share/apps/pvfs2/lib
>> CPPFLAGS=-I/share/apps/pvfs2/include/ LDFLAGS=-L/share/apps/pvfs2/lib
>> LIBS=-lpvfs2 -lpthread
>>
>> The user home directories are mounted via NFS.
>>
>> Is this a problem with the user code, the system, or with Open MPI?
>>
>> Thanks,
>> Dorian
>>
>>
>