Open MPI User's Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI users] strange bug
From: Anton Starikov (ant.starikov_at_[hidden])
Date: 2009-05-12 18:35:58


I will try to prepare a test case.

--
Anton Starikov.
On May 12, 2009, at 6:57 PM, Edgar Gabriel wrote:
> Hm, so I am out of ideas. I created multiple variants of test
> programs which did basically what you described, and they all passed
> and did not generate problems. I compiled the MUMPS library and ran
> the tests that they have in the examples directory, and they all
> worked.
>
> Additionally, I checked the source code of Open MPI. In comm_dup
> there is only a single location where we raise the error
> MPI_ERR_INTERN (which was reported in your email). I am fairly
> positive that this cannot occur, else we would segfault prior to
> that (it is a stupid check, don't ask). Furthermore, the code
> segment that has been modified does not raise MPI_ERR_INTERN
> anywhere. Of course, it could be a secondary effect and be
> created somewhere else (PML_ADD or collective module selection) and
> comm_dup just passes the error code up.
>
> One way or the other, I need more hints on what the code does. Any  
> chance of getting a smaller code fragment which replicates the  
> problem? It could use the MUMPS library, I am fine with that since I  
> just compiled and installed it with the current ompi trunk...
>
> Thanks
> Edgar
>
> Edgar Gabriel wrote:
>> I would say the probability is large that it is due to the recent
>> 'fix'. I will try to create a test case similar to what you
>> suggested. Could you maybe give us some hints on which
>> functionality of MUMPS you are using, or even share the code or a
>> code fragment?
>> Thanks
>> Edgar
>> Jeff Squyres wrote:
>>> Hey Edgar --
>>>
>>> Could this have anything to do with your recent fixes?
>>>
>>> On May 12, 2009, at 8:30 AM, Anton Starikov wrote:
>>>
>>>> The hostfile comes from the Torque PBS_NODEFILE (OMPI is compiled
>>>> with Torque support).
>>>>
>>>> It happens with or without a rankfile.
>>>> Started with
>>>> mpirun -np 16 ./somecode
>>>>
>>>> mca parameters:
>>>>
>>>> btl = self,sm,openib
>>>> mpi_maffinity_alone = 1
>>>> rmaps_base_no_oversubscribe = 1 (rmaps_base_no_oversubscribe = 0
>>>> doesn't change it)
>>>>
>>>> I tested with both "btl=self,sm" on 16-core nodes and
>>>> "btl=self,sm,openib" on 8x dual-core nodes; the result is the same.
>>>>
>>>> It looks like it always occurs at exactly the same point in the
>>>> execution, not at the beginning; it is not the first MPI_Comm_dup in
>>>> the code.
>>>>
>>>> I can't say much about the particular piece of code where it is
>>>> happening, because it is in a 3rd-party library (MUMPS). When the
>>>> error occurs, MPI_Comm_dup in every task deals with a single-task
>>>> communicator (MPI_Comm_split of the initial MPI_COMM_WORLD for 16
>>>> processes into 16 groups, 1 process per group). And I can guess that
>>>> before this error, MPI_Comm_dup is called something like 100 times
>>>> by the same piece of code on the same communicators without any
>>>> problems.
>>>>
>>>> I can say that it used to work correctly with all previous versions
>>>> of Open MPI we used (1.2.8-1.3.2 and some earlier versions). It also
>>>> works correctly on other platforms/MPI implementations.
>>>>
>>>> All environment variables (PATH, LD_LIBRARY_PATH) are correct.
>>>> I recompiled the code and 3rd-party libraries with this version of
>>>> OMPI.
>>>>
>>>>
>>>>
>>>> <config.log.gz><ompi-info.txt.gz><ATT9775601.txt><ATT9775603.txt>
>>>
>>>
>
> -- 
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users