I would say there is a high probability that it is due to the recent 'fix'.
I will try to create a testcase similar to what you suggested. Could
you give us some hints on which functionality of MUMPS you are
using, or even share the code or a code fragment?
Jeff Squyres wrote:
> Hey Edgar --
> Could this have anything to do with your recent fixes?
> On May 12, 2009, at 8:30 AM, Anton Starikov wrote:
>> hostfile from torque PBS_NODEFILE (OMPI is compiled with torque
>> It happens with or without rankfile.
>> Started with
>> mpirun -np 16 ./somecode
>> mca parameters:
>> btl = self,sm,openib
>> mpi_maffinity_alone = 1
>> rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
>> doesn't change it)
>> I tested with both: "btl=self,sm" on 16-core nodes and
>> "btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
>> It looks like it always occurs at exactly the same point in the
>> execution, not at the beginning; it is not the first MPI_Comm_dup in the
>> I can't say much about the particular piece of code where it is
>> happening, because it is in a 3rd-party library (MUMPS). When the error
>> occurs, MPI_Comm_dup in every task deals with a single-task communicator
>> (MPI_Comm_split of the initial MPI_COMM_WORLD for 16 processes into 16
>> groups, 1 process per group). And I can guess that before this error,
>> MPI_Comm_dup is called something like 100 times by the same piece
>> of code on the same communicators without any problems.
>> I can say that it used to work correctly with all previous versions of
>> openmpi we used (1.2.8-1.3.2 and some earlier versions). It also works
>> correctly on other platforms/MPI implementations.
>> All environment variables (PATH, LD_LIBRARY_PATH) are correct.
>> I recompiled the code and 3rd-party libraries with this version of OMPI.
Parallel Software Technologies Lab http://pstl.cs.uh.edu
Department of Computer Science University of Houston
Philip G. Hoffman Hall, Room 524 Houston, TX-77204, USA
Tel: +1 (713) 743-3857 Fax: +1 (713) 743-3335