Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] autoconf warnings: openib BTL
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-03-24 11:46:40


All true - and yet, becoming more common in larger clusters :-/

On Mar 24, 2014, at 7:42 AM, Kenneth A. Lloyd <kenneth.lloyd_at_[hidden]> wrote:

> Vasily,
>
> The problem you've identified of differing kernel versions is exacerbated by
> also computing across hybrid, heterogeneous hardware architectures (e.g.,
> SMP & NUMA, different streaming processor architectures, or different shared
> memory architectures).
>
> ==========================
> Kenneth A. Lloyd, Jr.
> CEO - Director, Systems Science
> Watt Systems Technologies Inc.
> Albuquerque, NM USA
> www.wattsys.com
> kenneth.lloyd_at_[hidden]
>
>
>
>
> -----Original Message-----
> From: devel [mailto:devel-bounces_at_[hidden]] On Behalf Of Vasily Filipov
> Sent: Monday, March 24, 2014 7:44 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] autoconf warnings: openib BTL
>
> Actually, I think that if you build your job with one kernel version and run
> it on nodes that have another version, rdmacm will be the least of your
> problems. Anyway, here is the revision that fixes the issue.
>
> ------------------------------------------------------------------------
> r31194 | vasily | 2014-03-24 15:36:04 +0200 (Mon, 24 Mar 2014) | 3 lines
>
> BTL/OPENIB: remove AC_RUN_IFELSE from configure and check AF_IB support by
> lib rdmacm during component_init.
>
>
> ------------------------------------------------------------------------
>
> Thank you,
> Vasily.
>
> On 13-Mar-14 15:44, Ralph Castain wrote:
>> I think the critical point is this one:
>>
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>> Many of our users compile on the frontend node of their cluster, which
>> doesn't even have an IB NIC installed (they only have the libraries present
>> so it can compile). You need to test this at run time to ensure you are on a
>> machine where someone actually is able to run rdmacm.
>>
>>
>> On Mar 13, 2014, at 5:53 AM, Jeff Squyres (jsquyres) <jsquyres_at_[hidden]> wrote:
>>
>>> On Mar 13, 2014, at 4:59 AM, Mike Dubman <miked_at_[hidden]> wrote:
>>>
>>>>>>> Right? If so, I don't see why you need the AC_TRY_RUN -- if RDMACM
>>>>>>> is easily detectable as to which way it is compiled (because it has, for
>>>>>>> example, different fields), then AC_CHECK_DECLS should be enough, right?
>>>> RDMACM API has different implementation requirements for its providers:
>>>> tcp, af_ib (different structs/fields should be used/passed; different
>>>> APIs/hooks should be called for bring-up).
>>> Yes, this was said before. Which is why I don't understand why
>>> AC_CHECK_DECLS isn't enough -- it's a compile-time check, right?
>>>
>>> Let me get this straight:
>>>
>>> 1. AF_IB may or may not be present.
>>> 2. If AF_IB is present, it may or may not work (i.e., support for AF_IB
>>>    may or may not work in the kernel).
>>> 3. If AF_IB is present, you can only compile with the AF_IB fields and
>>>    methods.
>>> 4. If AF_IB is not present, you can only compile with the non-AF_IB
>>>    fields and methods.
>>>
>>> I think #2 is not relevant for configure -- only #1, #3, and #4 are
>>> relevant. So you should have code something like this:
>>>
>>> #if HAVE_DECL_AF_IB
>>>     ret = do_the_stuff_with_af_ib(...);
>>>     if (OMPI_SUCCESS != ret) {
>>>         opal_show_help(...AF_IB doesn't work...);
>>>         return ret;
>>>     }
>>> #else
>>>     ret = do_the_stuff_without_af_ib(...);
>>>     if (OMPI_SUCCESS != ret) {
>>>         opal_show_help(...non-AF_IB doesn't work...);
>>>         return ret;
>>>     }
>>> #endif
>>>
>>> To be clear: whether AF_IB works or not is a determination to make on the
>>> machines on which you *run* -- NOT on the machine on which you *build*.
>>>
>>> This is one of the key reasons that OMPI prefers run-time detection for
>>> run-time characteristics over configure-time detection for run-time
>>> characteristics (because you may run OMPI on different machines than where
>>> you built OMPI).
>>>
>>>> Currently, the RDMACM provider can be selected at compile time only and
>>>> mpirun becomes incompatible with other RDMACM providers.
>>> What does mpirun have to do with this? We're talking about the openib
>>> BTL, right?
>>>
>>>> AC_TRY_RUN is a protection that selected provider will be able to
>>>> run, otherwise no fallback to other provider will be available for user at
>>>> runtime.
>>> I can't parse this statement...?
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/03/14342.php
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/03/14343.php
>>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/03/14381.php
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2014/03/14382.php
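
The approach the thread converges on (let configure decide only which code path can be compiled, then verify at component initialization that the path works on the node where the job runs) could be sketched roughly as follows. This is a minimal, self-contained illustration under stated assumptions, not the actual r31194 change: `sketch_component_init`, the `probe_fn` callbacks, `probe_ok`, `probe_fail`, and the `SKETCH_*` codes are all hypothetical names, and `HAVE_DECL_AF_IB` is defaulted to 0 here only so the file compiles without a configure run. A real probe would have to exercise librdmacm on the running node.

```c
#include <assert.h>
#include <stdio.h>

/* Normally defined by configure (e.g. via AC_CHECK_DECLS); defaulted here
 * only so that this sketch compiles on its own. */
#ifndef HAVE_DECL_AF_IB
#define HAVE_DECL_AF_IB 0
#endif

/* Hypothetical status codes standing in for OMPI_SUCCESS and friends. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_NOT_AVAILABLE = -1 };

/* Hypothetical run-time probe: returns 0 if the path works on this node. */
typedef int (*probe_fn)(void);

/* Stand-in probes for demonstration; they do not touch any real hardware. */
static int probe_ok(void)   { return 0; }
static int probe_fail(void) { return -1; }

/* Hypothetical init hook. The preprocessor picks the path that could be
 * compiled; the probe, called on the node where the job runs, decides
 * whether that path is actually usable. */
static int sketch_component_init(probe_fn probe_af_ib, probe_fn probe_non_af_ib)
{
    assert(probe_af_ib != NULL && probe_non_af_ib != NULL);

#if HAVE_DECL_AF_IB
    (void) probe_non_af_ib;  /* not compiled in for this build */
    if (0 != probe_af_ib()) {
        fprintf(stderr, "AF_IB was compiled in but does not work on this node\n");
        return SKETCH_ERR_NOT_AVAILABLE;
    }
#else
    (void) probe_af_ib;      /* not compiled in for this build */
    if (0 != probe_non_af_ib()) {
        fprintf(stderr, "non-AF_IB path does not work on this node\n");
        return SKETCH_ERR_NOT_AVAILABLE;
    }
#endif
    return SKETCH_SUCCESS;
}
```

With the stand-in probes, `sketch_component_init(probe_ok, probe_ok)` returns `SKETCH_SUCCESS` and `sketch_component_init(probe_fail, probe_fail)` returns `SKETCH_ERR_NOT_AVAILABLE`, whichever path was compiled in. Because nothing is executed at configure time, a build on a frontend node without an IB NIC is unaffected by what the compute nodes support.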