
Open MPI Development Mailing List Archives


From: Brian Barrett (bbarrett_at_[hidden])
Date: 2007-08-16 12:16:13


George -

I think that check should be in *BOTH* the MPI layer and the group
code. The MPI layer check should always be enabled, and the group code
should only do the check when OMPI_ENABLE_DEBUG is set (like it does
today). Having that check in debug builds helped a bunch in tracking
down some one-sided issues I used to have...
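
Roughly the pattern I mean, as a toy sketch (plain C, not OMPI code;
the real MPI layer has its own parameter-checking machinery): the
public entry point always validates, and the internal lookup only
complains in debug builds:

/* Toy sketch of the two-level checking pattern -- illustrative only,
 * not OMPI code.  NDEBUG stands in for OMPI_ENABLE_DEBUG here. */
#include <stdio.h>

#define GROUP_SIZE 4
static int procs[GROUP_SIZE] = { 10, 11, 12, 13 };

/* Internal lookup: only sanity-checks in debug builds, the way
 * ompi_group_peer_lookup does under OMPI_ENABLE_DEBUG. */
static int internal_peer_lookup(int peer_id)
{
#ifndef NDEBUG
    if (peer_id < 0 || peer_id >= GROUP_SIZE) {
        fprintf(stderr, "internal_peer_lookup: invalid peer index (%d)\n",
                peer_id);
        return -1;
    }
#endif
    return procs[peer_id];
}

/* Public entry point: always validates its arguments, the way the
 * MPI layer should. */
static int public_lookup(int peer_id)
{
    if (peer_id < 0 || peer_id >= GROUP_SIZE) {
        return -1;                      /* always-on error path */
    }
    return internal_peer_lookup(peer_id);
}

int main(void)
{
    printf("%d\n", public_lookup(2));   /* valid index: prints 12 */
    printf("%d\n", public_lookup(7));   /* invalid: caught at the public layer */
    return 0;
}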

Brian

On Aug 16, 2007, at 10:07 AM, George Bosilca wrote:

> Can this patch be moved out of the ompi internals? If possible it
> should go in the MPI level functions somewhere. I looked into
> mpi/c/group_incl.c and we have a bunch of tests there, but this
> particular one is missing.
>
> Long ago we decided to do most of the checking outside the internals
> in order to have a clean [and fast] execution path once we get inside
> the ompi functions. Of course, this assumes that our internal calls
> never generate any problems (i.e. all the required checks are done
> before calling another function).
>
> george.
>
> On Aug 16, 2007, at 11:53 AM, Tim Prins wrote:
>
>> Mohamad,
>>
>> 2 processes were plenty. Like I said, when running in debug mode, it
>> tends to 'work' since memory is initialized to \0 and we fall through.
>> In an optimized build, looking at the MTT results, it segfaults about
>> 10% of the time.
>>
>> But if you apply the patch I sent, it will tell you when an invalid
>> lookup happens, which should be every time it runs.
>>
>> Tim
>>
>> Mohamad Chaarawi wrote:
>>> Hey Tim,
>>>
>>> I understand what you are talking about.
>>> I'm trying to reproduce the problem. How many processes are you
>>> running with? With the default (4 for the group) it's passing.
>>>
>>> Thanks,
>>> Mohamad
>>>
>>> On Thu, August 16, 2007 7:49 am, Tim Prins wrote:
>>>> Sorry, I pushed the wrong button and sent this before it was
>>>> ready....
>>>>
>>>> Tim Prins wrote:
>>>>> Hi folks,
>>>>>
>>>>> I am running into a problem with the ibm test 'group'. I will
>>>>> try to
>>>>> explain what I think is going on, but I do not really understand
>>>>> the
>>>>> group code so please forgive me if it is wrong...
>>>>>
>>>>> The test creates a group based on MPI_COMM_WORLD (group1), and a
>>>>> group
>>>>> that has half the procs in group1 (newgroup). Next, all the
>>>>> processes
>>>>> do:
>>>>>
>>>>> MPI_Group_intersection(newgroup,group1,&group2)
>>>>>
>>>>> ompi_group_intersection figures out what procs are needed for
>>>>> group2,
>>>>> then calls
>>>>>
>>>>> ompi_group_incl, passing 'newgroup' and '&group2'
>>>>>
>>>>> This then calls (since I am not using sparse groups)
>>>>> ompi_group_incl_plist
>>>>>
>>>>> However, ompi_group_incl_plist assumes that the current process is a
>>>>> member
>>>>> of the passed group ('newgroup'). Thus when it calls
>>>>> ompi_group_peer_lookup on 'newgroup', half of the processes get
>>>>> garbage
>>>>> back since they are not in 'newgroup'. In most cases, memory is
>>>>> initialized to \0 and things fall through, but we get intermittent
>>>>> segfaults in optimized builds.
>>>>>
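>>>> For reference, the call pattern boils down to something like this
>>>> stripped-down sketch (not the actual ibm test; run it with at least
>>>> 2 procs):
>>>>
>>>> /* Stripped-down sketch of the failing call pattern (illustrative,
>>>>  * not the actual ibm 'group' test).  Every rank calls
>>>>  * MPI_Group_intersection with 'newgroup' first, even though only
>>>>  * the first half of the ranks are members of 'newgroup'. */
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main(int argc, char **argv)
>>>> {
>>>>     MPI_Group group1, newgroup, group2;
>>>>     int rank, size, range[1][3];
>>>>
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Comm_group(MPI_COMM_WORLD, &group1);
>>>>
>>>>     if (size < 2) {                 /* need two halves */
>>>>         MPI_Finalize();
>>>>         return 0;
>>>>     }
>>>>
>>>>     /* newgroup = first half of MPI_COMM_WORLD */
>>>>     range[0][0] = 0;
>>>>     range[0][1] = size / 2 - 1;
>>>>     range[0][2] = 1;
>>>>     MPI_Group_range_incl(group1, 1, range, &newgroup);
>>>>
>>>>     /* Ranks in the second half are not members of newgroup, but the
>>>>      * intersection still walks newgroup via ompi_group_incl_plist,
>>>>      * which is where the bogus peer lookup happens. */
>>>>     MPI_Group_intersection(newgroup, group1, &group2);
>>>>
>>>>     printf("rank %d survived the intersection\n", rank);
>>>>
>>>>     MPI_Group_free(&group2);
>>>>     MPI_Group_free(&newgroup);
>>>>     MPI_Group_free(&group1);
>>>>     MPI_Finalize();
>>>>     return 0;
>>>> }
>>>>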
>>>> Here is a patch to an error check which highlights the problem:
>>>> Index: group/group.h
>>>> ===================================================================
>>>> --- group/group.h (revision 15869)
>>>> +++ group/group.h (working copy)
>>>> @@ -308,7 +308,7 @@
>>>> static inline struct ompi_proc_t* ompi_group_peer_lookup (ompi_group_t *group, int peer_id)
>>>> {
>>>> #if OMPI_ENABLE_DEBUG
>>>> -    if (peer_id >= group->grp_proc_count) {
>>>> +    if (peer_id >= group->grp_proc_count || peer_id < 0) {
>>>>          opal_output(0, "ompi_group_lookup_peer: invalid peer index (%d)", peer_id);
>>>>
>>>>> Thanks,
>>>>>
>>>>> Tim
>>>>
>>>>
>>>
>>>
>>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel