Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] strange bug
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-05-12 09:57:24


Hey Edgar --

Could this have anything to do with your recent fixes?

On May 12, 2009, at 8:30 AM, Anton Starikov wrote:

> The hostfile comes from Torque's PBS_NODEFILE (OMPI is compiled with
> Torque support).
>
> It happens with or without rankfile.
> Started with
> mpirun -np 16 ./somecode
>
> mca parameters:
>
> btl = self,sm,openib
> mpi_maffinity_alone = 1
> rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
> doesn't change it)
>
> I tested with both "btl=self,sm" on 16-core nodes and
> "btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
>
> It looks like it always occurs at exactly the same point in the
> execution. That point is not at the beginning, and it is not the first
> MPI_Comm_dup in the code.
>
> I can't say much about the particular piece of code where it
> happens, because it is in a 3rd-party library (MUMPS). When the error
> occurs, MPI_Comm_dup in every task deals with a single-task communicator
> (MPI_Comm_split of the initial MPI_COMM_WORLD for 16 processes into 16
> groups, 1 process per group). My guess is that before this error,
> MPI_Comm_dup is called something like 100 times by the same piece
> of code on the same communicators without any problems.
>
> I can say that it used to work correctly with all previous versions of
> openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works
> correctly on other platforms/MPI implementations.
>
> All environment variables (PATH, LD_LIBRARY_PATH) are correct.
> I recompiled the code and the 3rd-party libraries with this version of OMPI.
>
>
>
> <config.log.gz><ompi-info.txt.gz><ATT9775601.txt><ATT9775603.txt>

-- 
Jeff Squyres
Cisco Systems