Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r29040 - in trunk: ompi/mca/bml/r2 ompi/mca/btl/base ompi/mca/btl/openib ompi/mca/btl/openib/connect ompi/mca/btl/tcp ompi/mca/btl/udapl ompi/mca/btl/ugni ompi/mca/btl/usnic ompi/mca/common/ofacm ompi/mca/m...
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-08-20 10:35:03


The error messages already output the name of the other proc, so that should be available. Besides, I just spent all yesterday afternoon auditing our MPI layer's memory usage byte-by-byte and getting my ears burned about the need to reduce that footprint - not really thrilled about adding to it.

I think the key here is to only do this reduction when directed to do so. It only benefits really big scale, which is the exception and not the rule. And if someone in that scenario wants the error output, they can just ask for it (assuming their sys admin defaulted it to not include the hostname).
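
For reference, the three-valued setting discussed in the quoted thread (-1 as the default, meaning "send hostnames only when np <= N") could be sketched roughly as follows. This is an illustration only: the enum, the function name, and the cutoff macro are hypothetical and are not Open MPI's actual MCA parameter or API. The mapping of 1 to "always send" and 0 to "never send" is an assumption, and 1024 is just one of the values of N floated in the thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; not Open MPI's real parameter or API. */
enum hostname_mode {
    HOSTNAME_MODE_AUTO   = -1, /* decide from job size (proposed default) */
    HOSTNAME_MODE_NEVER  =  0, /* assumption: 0 means never send */
    HOSTNAME_MODE_ALWAYS =  1  /* assumption: 1 means always send */
};

/* N in the thread; 1024 and 2048 were both suggested. */
#define HOSTNAME_CUTOFF_DEFAULT 1024

/*
 * Return true if hostnames should be sent around for a job of np procs.
 * Any mode other than NEVER or ALWAYS is treated as AUTO.
 */
bool should_send_hostname(int mode, size_t np, size_t cutoff)
{
    if (mode == HOSTNAME_MODE_ALWAYS) {
        return true;
    }
    if (mode == HOSTNAME_MODE_NEVER) {
        return false;
    }
    /* AUTO: small jobs keep the informative error messages. */
    return np <= cutoff;
}
```

With this shape, a site running very large jobs could set the mode to 0 in its platform file, and a user who wants the remote hostname in error output could override it with 1.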

On Aug 20, 2013, at 3:18 AM, George Bosilca <bosilca_at_[hidden]> wrote:

> If we don't want to lose the usefulness of the error messages (and don't care that much about the memory requirements), we can initialize this value with the string of the rank of the process in MPI_COMM_WORLD (instead of NULL). We will at least get an idea where to start looking in case of troubles …
>
> George.
>
> On Aug 20, 2013, at 04:20 , Ralph Castain <rhc_at_[hidden]> wrote:
>
>>
>> On Aug 19, 2013, at 6:07 PM, "Jeff Squyres (jsquyres)" <jsquyres_at_[hidden]> wrote:
>>
>>> On Aug 19, 2013, at 8:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>>> That's how it works now. My concern is with the error message scenario. IIRC, Jeff's issue was that the error message only contains the hostname of the proc that generates it - it doesn't tell you the hostname of the remote proc. Hence, we included that info in the proc_t.
>>>
>>> This is quite important for getting useful error messages.
>>>
>>>> However, IIRC we also provided an option to *not* send that info due to scaling concerns way back when. I wonder if we can resolve this simply by having Nathan set that option in his platform .conf files, and then removing ompi_proc_get_hostname completely. Since the IP-based comm channels will call modex_recv anyway, we'll get the hostname at that time. Otherwise, the errors print "NULL" for proc->hostname.
>>>>
>>>> Yes, that means that users of direct-launched apps on Nathan's systems will get less informative error messages - but they can always override Nathan's default param if they want better info. After all, the vast majority of users aren't running such big jobs as to care about this optimization.
>>>
>>> I'm good with it. It could also be (might already be) a run-time MCA param...?
>>
>> I think it is - I'll check tonight
>>
>>>
>>> We could also default the value to -1 (vs. 0 or 1), meaning: with np <= N procs, send the hostname around, otherwise, don't send it (we can argue over the value of N -- e.g., 1024 or 2048).
>>
>> That makes the most sense to me - for small jobs, the time difference is too tiny to measure.
>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquyres_at_[hidden]
>>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>> _______________________________________________
>>> devel mailing list
>>> devel_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel