
Open MPI Development Mailing List Archives


From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-06-06 11:45:55


On 6/6/07 9:21 AM, "Tim Prins" <tprins_at_[hidden]> wrote:

> Actually, the tests are quite painful to run, since there are things in
> there that aren't real tests (such as spin, no-op, loop-child, etc.) and
> I really don't know what the expected output should be.

Actually, they are tests - you just have to know how to use them. The RTE
needs to test things that are somewhat difficult to automate, and frankly,
nobody has had the time to go back and try to develop more automatic
versions. So that is the best we've got - let's at least use them as best we
can.

After all, people have complained to me more than once about why things in
ORTE keep getting broken (you included ;-) ). This is why - nobody tests a
range of RTE functionality before committing things that have unfortunate
side effects...only to have them finally detected when a user hits a code
path after we do a release.

>
> Anyways, I have made my way through these things, and I could not see
> any failures. This should clear the way for these changesets to be
> brought in.

That's fine - thanks!

>
> George: Do you want to bring this over? If you do, remember to also
> remove test/class/orte_bitmap.c
>
> Thanks,
>
> Tim
>
>
> Ralph H Castain wrote:
>> Sigh...is it really so much to ask that we at least run the tests in
>> orte/test/system and orte/test/mpi using both mpirun and singleton (where
>> appropriate) instead of just relying on "well I ran hello_world"?
>>
>> That is all I have ever asked, yet it seems to be viewed as a huge
>> impediment. Is it really that much to ask for when modifying a core part of
>> the system? :-/
>>
>> If you have done those tests, then my apology - but your note only indicates
>> that you ran "hello_world" and are basing your recommendation *solely* on
>> that test.
>>
>>
>> On 6/6/07 7:51 AM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>
>>
>>> I hate to go back to this, but...
>>>
>>> The original commits also included changes to gpr_replica_dict_fn.c
>>> (r14331 and r14336). This change shows some performance improvement for
>>> me (about 8% on mpi hello, 123 nodes, 4ppn), and cleans up some ugliness
>>> in the gpr. Again, this is an algorithmic change, so as the job scales
>>> the performance improvement should be more noticeable.
>>>
>>> I vote that this be put back in.
>>>
>>> On a related topic, a small memory leak was fixed in r14328, and then
>>> reverted. This change should be put back in.
>>>
>>> Tim
>>>
>>> George Bosilca wrote:
>>>
>>>> Commit r14791 applies this patch to the trunk. Let me know if you
>>>> encounter any kind of trouble.
>>>>
>>>> Thanks,
>>>> george.
>>>>
>>>> On May 29, 2007, at 2:28 PM, Ralph Castain wrote:
>>>>
>>>>
>>>>> After some work off-list with Tim, it appears that something has been
>>>>> broken
>>>>> again on the OMPI trunk with respect to comm_spawn. It was working
>>>>> two weeks
>>>>> ago, but...sigh.
>>>>>
>>>>> Anyway, it doesn't appear to have any bearing either way on George's
>>>>> patch(es), so whoever wants to commit them is welcome to do so.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 5/29/07 11:44 AM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> On 5/29/07 11:02 AM, "Tim Prins" <tprins_at_[hidden]> wrote:
>>>>>>
>>>>>>
>>>>>>> Well, after fixing many of the tests...
>>>>>>>
>>>>>> Interesting - they worked fine for me. Perhaps a difference in
>>>>>> environment.
>>>>>>
>>>>>>
>>>>>>> It passes all the tests
>>>>>>> except the spawn tests. However, the spawn tests are seriously broken
>>>>>>> without this patch as well, and the ibm mpi spawn tests seem to work
>>>>>>> fine.
>>>>>>>
>>>>>> Then something is seriously wrong. The spawn tests were working as
>>>>>> of my
>>>>>> last commit - that is a test I religiously run. If the spawn test here
>>>>>> doesn't work, then it is hard to understand how the mpi spawn can
>>>>>> work since
>>>>>> the call is identical.
>>>>>>
>>>>>> Let me see what's wrong first...
>>>>>>
>>>>>>
>>>>>>> As far as I'm concerned, this should assuage any fear of problems
>>>>>>> with these changes and they should now go in.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>> On May 29, 2007, at 11:34 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>>
>>>>>>>> Well, I'll be the voice of caution again...
>>>>>>>>
>>>>>>>> Tim: did you run all of the orte tests in the orte/test/system
>>>>>>>> directory? If
>>>>>>>> so, and they all run correctly, then I have no issue with doing the
>>>>>>>> commit.
>>>>>>>> If not, then I would ask that we not do the commit until that has
>>>>>>>> been done.
>>>>>>>>
>>>>>>>> In running those tests, you need to run them on a multi-node
>>>>>>>> system, both
>>>>>>>> using mpirun and as singletons (you'll have to look at the tests to
>>>>>>>> see
>>>>>>>> which ones make sense in the latter case). This will ensure that we
>>>>>>>> have at
>>>>>>>> least some degree of coverage.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/29/07 9:23 AM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'd be happy to commit the patch into the trunk. But after what
>>>>>>>>> happened last time, I'm more than cautious. If the community thinks
>>>>>>>>> the patch is worth having, let me know and I'll push it into the
>>>>>>>>> trunk asap.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> george.
>>>>>>>>>
>>>>>>>>> On May 29, 2007, at 10:56 AM, Tim Prins wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I think both patches should be put in immediately. I have done some
>>>>>>>>>> simple testing: with 128 nodes of odin and 1024 processes running
>>>>>>>>>> mpi hello, these decrease our running time from about 14.2 seconds
>>>>>>>>>> to 10.9 seconds. This is a significant decrease, and as the scale
>>>>>>>>>> increases there should be increasing benefit.
>>>>>>>>>>
>>>>>>>>>> I'd be happy to commit these changes if no one objects.
>>>>>>>>>>
>>>>>>>>>> Tim
>>>>>>>>>>
>>>>>>>>>> On May 24, 2007, at 8:39 AM, Ralph H Castain wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Thanks - I'll take a look at this (and the prior ones!) in the
>>>>>>>>>>> next
>>>>>>>>>>> couple
>>>>>>>>>>> of weeks when time permits and get back to you.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 5/23/07 1:11 PM, "George Bosilca" <bosilca_at_[hidden]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> Attached is another patch to the ORTE layer, more specifically
>>>>>>>>>>>> the replica. The idea is to decrease the number of strcmp calls
>>>>>>>>>>>> by using a small hash function before doing the strcmp. The hash
>>>>>>>>>>>> key for each registry entry is computed when it is added to the
>>>>>>>>>>>> registry. When we're doing a query, instead of comparing the two
>>>>>>>>>>>> strings we first check if the hash keys match, and only if they
>>>>>>>>>>>> do match do we compare the two strings, in order to make sure we
>>>>>>>>>>>> eliminate collisions from our answers.
>>>>>>>>>>>>
>>>>>>>>>>>> There is some benefit in terms of performance. It's hardly
>>>>>>>>>>>> visible for a few processes, but it starts showing up as the
>>>>>>>>>>>> number of processes increases. In fact, the number of strcmp
>>>>>>>>>>>> calls in the trace file drastically decreases. The main reason
>>>>>>>>>>>> it works well is that most of the keys start with basically the
>>>>>>>>>>>> same chars (such as orte-blahblah), which turns the strcmp into
>>>>>>>>>>>> a loop over a few chars.
>>>>>>>>>>>>
>>>>>>>>>>>> Ralph, please consider it for inclusion on the ORTE layer.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> george.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> devel mailing list
>>>>>>>>>>>> devel_at_[hidden]
>>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>>>>
>>
>>
>