On Aug 19, 2013, at 8:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> That's how it works now. My concern is with the error message scenario. IIRC, Jeff's issue was that the error message only contains the hostname of the proc that generates it - it doesn't tell you the hostname of the remote proc. Hence, we included that info in the proc_t.
This is quite important for getting useful error messages.
> However, IIRC we also provided an option to *not* send that info due to scaling concerns way back when. I wonder if we can resolve this simply by having Nathan set that option in his platform .conf files, and then removing ompi_proc_get_hostname completely. Since the IP-based comm channels will call modex_recv anyway, we'll get the hostname at that time. Otherwise, the errors print "NULL" for proc->hostname.
> Yes, that means that users of direct-launched apps on Nathan's systems will get less informative error messages - but they can always override Nathan's default param if they want better info. After all, the vast majority of users aren't running such big jobs as to care about this optimization.
I'm good with it. It could also be (might already be) a run-time MCA param...?
We could also default the value to -1 (vs. 0 or 1), meaning: for jobs with np <= N procs, send the hostname around; otherwise, don't send it (we can argue over the value of N, e.g., 1024 or 2048).
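
As a rough illustration of the -1 default, here is a minimal sketch in C. It assumes that 0 means "never send" and 1 means "always send", which this thread does not spell out. All names in it (should_send_hostname and the HOSTNAME_SEND_* constants) are hypothetical, not existing Open MPI symbols. The cutoff N is passed in as an argument because its value is still open.

```c
#include <stdbool.h>

/* Hypothetical tri-state values for the parameter. The 0/1 meanings
 * are an assumption; only the -1 "auto" meaning comes from the thread. */
enum {
    HOSTNAME_SEND_AUTO   = -1, /* decide from the job size */
    HOSTNAME_SEND_NEVER  =  0, /* assumed: never send the hostname */
    HOSTNAME_SEND_ALWAYS =  1  /* assumed: always send the hostname */
};

/* Return true if the hostname should be sent around.
 *   param  - value of the (hypothetical) parameter: -1, 0, or 1
 *   np     - number of procs in the job
 *   cutoff - the N under discussion, e.g., 1024 or 2048
 */
bool should_send_hostname(int param, int np, int cutoff)
{
    if (param == HOSTNAME_SEND_AUTO) {
        /* Auto mode: small jobs send the hostname, big jobs skip it. */
        return np <= cutoff;
    }
    /* Explicit setting: anything other than "never" sends it. */
    return param != HOSTNAME_SEND_NEVER;
}
```

With this shape, a platform file could still force 0 on very large systems, and a user could still force 1 to get the more informative error messages.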