Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] strange bug
From: Edgar Gabriel (gabriel_at_[hidden])
Date: 2009-05-12 12:57:38


hm, so I am out of ideas. I created multiple variants of test programs
that did basically what you described, and they all passed without
generating any problems. I also compiled the MUMPS library and ran the
tests in its examples directory, and they all worked.
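
For reference, a rough sketch of one of those test variants, following
the pattern you described (split MPI_COMM_WORLD into single-process
communicators, then dup repeatedly); the loop count and variable names
are just my guesses, not taken from MUMPS:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i;
    MPI_Comm self_comm, dup_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* split MPI_COMM_WORLD into one communicator per rank */
    MPI_Comm_split(MPI_COMM_WORLD, rank, 0, &self_comm);

    /* dup the single-process communicator many times */
    for (i = 0; i < 200; i++) {
        MPI_Comm_dup(self_comm, &dup_comm);
        MPI_Comm_free(&dup_comm);
    }

    MPI_Comm_free(&self_comm);
    MPI_Finalize();
    return 0;
}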

Additionally, I checked the Open MPI source code. In comm_dup there is
only a single location where we raise the error MPI_ERR_INTERN (the
error reported in your email). I am fairly positive that this one
cannot occur, since we would segfault before reaching it (it is a
stupid check, don't ask). Furthermore, the code segment that was
recently modified does not raise MPI_ERR_INTERN anywhere. Of course, it
could be a secondary effect created somewhere else (PML_ADD or
collective module selection), with comm_dup just passing the error code
up.
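
If you want to narrow down on your side where the error class really
comes from, one thing you could try (just a sketch, and "comm" here
stands for whatever communicator MUMPS is duplicating) is to switch the
error handler to MPI_ERRORS_RETURN and print the class and string of
the code MPI_Comm_dup hands back:

#include <stdio.h>   /* for fprintf */

MPI_Comm newcomm;
char msg[MPI_MAX_ERROR_STRING];
int err, eclass, len;

/* have errors returned instead of aborting the job */
MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
err = MPI_Comm_dup(comm, &newcomm);
if (err != MPI_SUCCESS) {
    MPI_Error_class(err, &eclass);
    MPI_Error_string(err, msg, &len);
    fprintf(stderr, "MPI_Comm_dup failed: class %d (%s)\n", eclass, msg);
}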

One way or the other, I need more hints about what the code does. Any
chance of getting a smaller code fragment that replicates the problem?
It can use the MUMPS library; I am fine with that, since I just
compiled and installed it against the current ompi trunk...

Thanks
Edgar

Edgar Gabriel wrote:
> I would say the probability is high that it is due to the recent 'fix'.
> I will try to create a test case similar to what you suggested. Could
> you maybe give us some hints about which MUMPS functionality you are
> using, or even share the code / a code fragment?
>
> Thanks
> Edgar
>
> Jeff Squyres wrote:
>> Hey Edgar --
>>
>> Could this have anything to do with your recent fixes?
>>
>> On May 12, 2009, at 8:30 AM, Anton Starikov wrote:
>>
>>> The hostfile comes from the Torque PBS_NODEFILE (OMPI is compiled with
>>> Torque support).
>>>
>>> It happens with or without rankfile.
>>> Started with
>>> mpirun -np 16 ./somecode
>>>
>>> mca parameters:
>>>
>>> btl = self,sm,openib
>>> mpi_maffinity_alone = 1
>>> rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
>>> doesn't change it)
>>>
>>> I tested with both "btl=self,sm" on 16-core nodes and
>>> "btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
>>>
>>> It looks like it always occurs at exactly the same point in the
>>> execution, not at the beginning; it is not the first MPI_Comm_dup in
>>> the code.
>>>
>>> I can't say too much about the particular piece of code where it is
>>> happening, because it is in a 3rd-party library (MUMPS). When the error
>>> occurs, MPI_Comm_dup in every task deals with a single-task communicator
>>> (an MPI_Comm_split of the initial MPI_COMM_WORLD splits the 16 processes
>>> into 16 groups, 1 process per group). And I can guess that before this
>>> error, MPI_Comm_dup is called something like 100 times by the same piece
>>> of code on the same communicators without any problems.
>>>
>>> I can say that it used to work correctly with all previous versions of
>>> Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also works
>>> correctly on other platforms/MPI implementations.
>>>
>>> All environment variables (PATH, LD_LIBRARY_PATH) are correct.
>>> I recompiled the code and the 3rd-party libraries with this version of
>>> OMPI.
>>>
>>>
>>>
>>> <config.log.gz><ompi-info.txt.gz><ATT9775601.txt><ATT9775603.txt>
>>
>>
>

-- 
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335