Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] strange bug
From: Anton Starikov (ant.starikov_at_[hidden])
Date: 2009-05-12 18:35:58


I will try to prepare a test case.

--
Anton Starikov.
On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote:
> Hm, so I am out of ideas. I created multiple variants of test
> programs that did basically what you described, and they all passed
> without problems. I compiled the MUMPS library and ran the tests in
> its examples directory, and they all worked.
>
> Additionally, I checked the Open MPI source code. In comm_dup there is
> only a single location where we raise the error MPI_ERR_INTERN (which
> was reported in your email). I am fairly positive that this cannot
> occur, or else we would segfault prior to that (it is a stupid check,
> don't ask). Furthermore, the code segment that has been modified does
> not raise MPI_ERR_INTERN anywhere. Of course, it could be a secondary
> effect, created somewhere else (PML_ADD or collective module
> selection), with comm_dup just passing the error code up.
>
> One way or the other, I need more hints on what the code does. Any  
> chance of getting a smaller code fragment which replicates the  
> problem? It could use the MUMPS library, I am fine with that since I  
> just compiled and installed it with the current ompi trunk...
>
> Thanks
> Edgar
>
> Edgar Gabriel wrote:
>> I would say the probability is large that it is due to the recent
>> 'fix'. I will try to create a test case similar to what you
>> suggested. Could you give us some hints on which functionality of
>> MUMPS you are using, or even share the code or a code fragment?
>> Thanks
>> Edgar
>> Jeff Squyres wrote:
>>> Hey Edgar --
>>>
>>> Could this have anything to do with your recent fixes?
>>>
>>> On May 12, 2009, at 8:30 AM, Anton Starikov wrote:
>>>
>>>> The hostfile comes from Torque's PBS_NODEFILE (OMPI is compiled with
>>>> Torque support).
>>>>
>>>> It happens with or without rankfile.
>>>> Started with
>>>> mpirun -np 16 ./somecode
>>>>
>>>> mca parameters:
>>>>
>>>> btl = self,sm,openib
>>>> mpi_maffinity_alone = 1
>>>> rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
>>>> doesn't change it)
>>>>
>>>> I tested with both "btl=self,sm" on 16-core nodes and
>>>> "btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
>>>>
>>>> It looks like it always occurs at exactly the same point in the
>>>> execution, not at the beginning; it is not the first MPI_Comm_dup in
>>>> the code.
>>>>
>>>> I can't say much about the particular piece of code where it happens,
>>>> because it is in a 3rd-party library (MUMPS). When the error occurs,
>>>> MPI_Comm_dup in every task is dealing with a single-task communicator
>>>> (MPI_Comm_split of the initial MPI_COMM_WORLD for 16 processes into 16
>>>> groups, 1 process per group). My guess is that before this error,
>>>> MPI_Comm_dup is called something like 100 times by the same piece of
>>>> code on the same communicators without any problems.
>>>>
>>>> I can say that it used to work correctly with all previous versions
>>>> of Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also
>>>> works correctly on other platforms/MPI implementations.
>>>>
>>>> All environment variables (PATH, LD_LIBRARY_PATH) are correct.
>>>> I recompiled the code and the 3rd-party libraries with this version
>>>> of OMPI.
>>>>
>>>>
>>>>
>>>> <config.log.gz><ompi-info.txt.gz><ATT9775601.txt><ATT9775603.txt>
>>>
>>>
>
> -- 
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users