Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] DDT and spawn issue?
From: George Bosilca (bosilca_at_[hidden])
Date: 2009-07-15 15:57:16


Actually, I don't think this will help. I looked at MTT and there are
no errors related to this (logically, all reductions should have
failed) ... and MTT is supposed to run on several platforms. What
happens inside is really strange, but since we make the same mistake
when we look up the op as when we store it, this works in most cases.
Moreover, even with the op corrected we still see segfaults, and it
looks more and more like a memory overwrite problem... Before the
commit we even tested it on a SiCortex machine (which is clearly a
different architecture from x86_64) and this didn't trigger any
errors either.
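
To illustrate how a mistake shared by the store and the look-up can hide
itself, here is a minimal sketch in C. It is not the code in
ompi/op/op.c: every name in it (the TYPE_* ids, fn_table, buggy_index,
store_buggy, lookup_buggy, lookup_fixed) is hypothetical, and it only
assumes a function table indexed by a datatype id.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical datatype ids; not Open MPI's real ones. */
enum { TYPE_INT = 0, TYPE_FLOAT = 1, TYPE_DOUBLE = 2, TYPE_COUNT = 3 };

typedef double (*reduce_fn)(double, double);

double sum_fn(double a, double b) { return a + b; }

/* One spare slot so the off-by-one index below stays in bounds. */
#define TABLE_SIZE (TYPE_COUNT + 1)
static reduce_fn fn_table[TABLE_SIZE];

/* The shared mistake (assumed for illustration): ids are shifted by one. */
static int buggy_index(int type_id) { return type_id + 1; }

/* Store through the wrong index. */
void store_buggy(int type_id, reduce_fn fn)
{
    int i = buggy_index(type_id);
    assert(i >= 0 && i < TABLE_SIZE);
    fn_table[i] = fn;
}

/* Look up through the same wrong index: the two errors cancel out. */
reduce_fn lookup_buggy(int type_id)
{
    int i = buggy_index(type_id);
    assert(i >= 0 && i < TABLE_SIZE);
    return fn_table[i];
}

/* Look up through the intended index: finds nothing in the slot. */
reduce_fn lookup_fixed(int type_id)
{
    assert(type_id >= 0 && type_id < TABLE_SIZE);
    return fn_table[type_id];
}
```

In this sketch, storing and looking up through the same wrong index
returns the function that was stored, so tests pass; correcting only one
of the two sides is what would expose the mismatch.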

Regarding the latency issue, there is not much to say. The platform
we tested on is clearly older than what other people test on, but
that is all there is to it. The two versions (before and after the
datatype move) have the same latency, so there is no reason to focus
on the latency number.

   george.

On Jul 15, 2009, at 12:18 , Jeff Squyres wrote:

> Perhaps we should add a requirement for testing on 2-3 different
> systems before long-term (or "big change") branches like this come
> to the trunk? I say this because it seems like at least some of
> these problems were based on bad luck -- i.e., the stuff worked on
> the platform that it was being tested and developed on, even though
> there are bugs left. Having fallen victim to this myself many times
> ("worked for me on Cisco machines! I dunno why it's failing for
> you... :-("), I think we all recognize the value of just running the
> same code on someone else's systems -- it has a good tendency to
> turn up issues that don't show up on yours. I'm not trying to say
> that every little trunk commit needs to be validated -- but "big"
> changes like this could certainly benefit from multiple validations.
>
> Cisco is very willing to be a 2nd platform for testing for stuff
> that we can run without too much trouble, especially via MTT (e.g.,
> I already have the right kind of networks to test, etc.).
>
> BTW, is anyone going to comment about the latency issue that I asked
> about?
>
> (in case you can't tell, I'm moderately displeased about how this
> whole branch came to the trunk... :-\ )
>
>
>
> On Jul 15, 2009, at 12:04 PM, Rainer Keller wrote:
>
>> Hi Jeff,
> >> Ralph and Edgar forwarded an email about this.
> >> We (George and myself) are currently looking into this.
> >>
> >> With the changes we have, I can get IBM/spawn to work "sometimes",
> >> i.e., sometimes it segfaults.
>>
>> Thanks,
>> Rainer
>>
>>
>>
>>
>> On Wednesday 15 July 2009 11:50:13 am Jeff Squyres wrote:
> >> > I [very briefly] read about the DDT spawn issues, so I went to
> >> > look at ompi/op/op.c. I notice that there's a new comment above
> >> > the op datatype<-->op map construction area that says:
> >> >
> >> > /* XXX TODO */
> >> >
> >> > svn blame says:
> >> >
> >> > 21641 rusraink /* XXX TODO */
> >> >
> >> > r21641 is the big merge from the past weekend where the DDT split
> >> > came in.
> >> >
> >> > Has this area been looked at and the comment is out of date? Or
> >> > does it need to be updated with new mappings? (I honestly have not
> >> > looked any farther than this -- the new comment caught my eye)
>>
>> --
>> ------------------------------------------------------------------------
>> Rainer Keller, PhD Tel: +1 (865) 241-6293
>> Oak Ridge National Lab Fax: +1 (865) 241-4811
>> PO Box 2008 MS 6164 Email: keller_at_[hidden]
>> Oak Ridge, TN 37831-2008 AIM/Skype: rusraink
>>
>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel