Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r25015
From: Ralph Castain (rhc_at_[hidden])
Date: 2011-08-08 19:55:02


I will interject at this point in the thread. IMO, Brian has raised a valid point. We are contorting around with the epoch values, as I commented in a separate thread.

The problem really stems from something Wes mentioned a while back in a different set of communications. When the epoch code was written, it wasn't entirely clear to him where the data required for supporting the epoch-related functionality should go. It doesn't really fit the ess, it didn't really belong in modex, etc. So he wound up sticking things here and there, trying to disturb the existing APIs as little as possible.

What we probably should do is (a) try to better understand what Wes is trying to accomplish, and then (b) design the code base to support it. The current functional APIs and nidmap code weren't designed to support the extended name, and so the fit is sometimes wrong. Getting things designed properly would help ease the problem.

In fairness, remember that Wes and I had already agreed to review a bunch of this (e.g., where epoch info is stored) when we finally get back to dealing with the orte state engine. Hopefully, we can address some or all of these issues then.

Meantime, what's there is certainly "weird" - but if it works for now, then perhaps we can get things correctly integrated in the near future.

HTH
Ralph

On Aug 8, 2011, at 10:20 AM, Barrett, Brian W wrote:

> Ok, that makes some sense. It's a really weird way of doing
> initialization, though. Some of it is probably how Ralph chose to share
> nidpid map code, so perhaps it's unavoidable. But since the name isn't
> use in proc_get_epoch() (or, better not be used in there), I'm not
> entirely sure why that made valgrind happy...
>
> Brian
>
> On 8/8/11 9:53 AM, "Wesley Bland" <wbland_at_[hidden]> wrote:
>
>> It's not just copying the value from one to the other.
>> The value is initialized on the first line. The proc_name is passed into
>> the function where the jobid and vpid are used (not the epoch). The
>> function returns a new (the correct) value for the epoch based on the
>> passed in jobid and vpid which is assigned to the process name.
>>
>> This is the same thing as initializing the value to NULL. We do that all
>> the time to avoid compiler warnings. I just can't do that here because
>> this isn't a pointer. If it would make the code look better I can move
>> the first assignment to the top of the function where the other
>> initializations are.
>>
>> On Mon, Aug 8, 2011 at 11:41 AM, Barrett, Brian W <bwbarre_at_[hidden]>
>> wrote:
>>
>> On 8/8/11 9:34 AM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>
>>> On Aug 8, 2011, at 11:30 AM, Wesley Bland wrote:
>>>
>>>> The reason is because valgrind was complaining about uninitialized
>>>> values that were passed into proc_get_epoch. I saw the same warnings
>>>> from valgrind when I ran it. I added the code to initialize the values
>>>> to what really should be the default value and the warnings went away.
>>>> Since the process_name_t struct isn't an object, it doesn't have an
>>>> initialization function like so many of the other objects in the code.
>>>> This is what we have.
>>>
>>> Ah, I see -- you are passing peer_name into proc_get_epoch(). I missed
>>> that.
>>>
>>> Thanks!
>>>
>>>> peer_name.jobid = ORTE_PROC_MY_NAME->jobid;
>>>> peer_name.vpid = peer_idx;
>>>> + peer_name.epoch = ORTE_EPOCH_INVALID;
>>>> peer_name.epoch = orte_ess.proc_get_epoch(&peer_name);
>>
>>
>> I'm still not convinced this is a rational coding style. It seems to me
>> that if you're just going to copy the epoch from peer_name to peer_name,
>> just assign the epoch to INVALID and be done with it. There's no need for
>> both the assignment added in this patch and the line after it.
>>
>> Now, normally I'd say that this is fixing a bug and having to fix other
>> people's bad code shouldn't be your problem. But since you wrote both
>> lines, fixing it to make sense seems reasonable to me ;).
>>
>> Brian
>>
>> --
>> Brian W. Barrett
>> Dept. 1423: Scalable System Software
>> Sandia National Laboratories
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> --
> Brian W. Barrett
> Dept. 1423: Scalable System Software
> Sandia National Laboratories
>
>
>
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel