
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] strange bug
From: Anton Starikov (ant.starikov_at_[hidden])
Date: 2009-05-12 08:30:49


The hostfile comes from the Torque PBS_NODEFILE (OMPI is compiled with
Torque support).
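
For reference, the PBS_NODEFILE simply lists one hostname per
allocated slot; with two dual-slot nodes (hypothetical names) it
would look like:

  node037
  node037
  node038
  node038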

It happens with or without a rankfile. The job is started with:
mpirun -np 16 ./somecode

MCA parameters:

btl = self,sm,openib
mpi_maffinity_alone = 1
rmaps_base_no_oversubscribe = 1 (setting rmaps_base_no_oversubscribe =
0 doesn't change anything)
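
For completeness, the same settings passed directly on the mpirun
command line (instead of via the MCA parameter file) would look like:

  mpirun --mca btl self,sm,openib --mca mpi_maffinity_alone 1 \
    --mca rmaps_base_no_oversubscribe 1 -np 16 ./somecode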

I tested both "btl=self,sm" on 16-core nodes and "btl=self,sm,openib"
on 8x dual-core nodes; the result is the same.

It looks like it always occurs at exactly the same point in the
execution, not at the beginning; it is not the first MPI_Comm_dup in
the code.

I can't say too much about the particular piece of code where it
happens, because it is in a third-party library (MUMPS). When the
error occurs, MPI_Comm_dup in every task is operating on a
single-task communicator (an MPI_Comm_split of the initial
MPI_COMM_WORLD that divides the 16 processes into 16 groups, 1
process per group). My guess is that before this error, MPI_Comm_dup
is called something like 100 times by the same piece of code on the
same communicators without any problem.
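
In case it helps, here is a minimal sketch of the communicator
pattern I described (hypothetical code, not the actual MUMPS source),
using the F77 bindings:

      program dupsplit
C     Sketch only: split MPI_COMM_WORLD into 16 groups of 1 process
C     each, then repeatedly duplicate the single-task communicator.
      implicit none
      include 'mpif.h'
      integer ierr, rank, splitcomm, dupcomm, i
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
C     color = rank, so every process ends up in its own group
      call MPI_COMM_SPLIT(MPI_COMM_WORLD, rank, 0, splitcomm, ierr)
      do 10 i = 1, 200
         call MPI_COMM_DUP(splitcomm, dupcomm, ierr)
         call MPI_COMM_FREE(dupcomm, ierr)
 10   continue
      call MPI_COMM_FREE(splitcomm, ierr)
      call MPI_FINALIZE(ierr)
      end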

I can say that it used to work correctly with all previous versions
of Open MPI we used (1.2.8 through 1.3.2, and some earlier versions).
It also works correctly on other platforms and MPI implementations.

All environment variables (PATH, LD_LIBRARY_PATH) are correct.
I recompiled the code and the third-party libraries with this version
of OMPI.


--
Anton Starikov.
Computational Material Science,
Faculty of Science and Technology,
University of Twente.
Phone: +31 (0)53 489 2986
Fax: +31 (0)53 489 2910

On May 12, 2009, at 12:35 PM, Jeff Squyres wrote:
> Can you send all the information listed here:
>
>    http://www.open-mpi.org/community/help/
>
>
>
> On May 11, 2009, at 10:03 PM, Anton Starikov wrote:
>
>> By the way, this is Fortran code, which uses the F77 bindings.
>>
>> --
>> Anton Starikov.
>>
>>
>> On May 12, 2009, at 3:06 AM, Anton Starikov wrote:
>>
>> > Due to rankfile fixes I switched to SVN r21208; now my code dies
>> > with this error:
>> >
>> > [node037:20519] *** An error occurred in MPI_Comm_dup
>> > [node037:20519] *** on communicator MPI COMMUNICATOR 32 SPLIT FROM 4
>> > [node037:20519] *** MPI_ERR_INTERN: internal error
>> > [node037:20519] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> >
>> > --
>> > Anton Starikov.
>> >
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> -- 
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users