Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-10 17:59:20


I *believe* oob can now support virtual interfaces, but can't swear to it - only very lightly tested on my box.

I'll mark this for resolution in 1.7.5.
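
For reference, the workaround Paul reports below can be captured as a single command line. This is only a sketch; the example binary and process count are illustrative:

    # Restrict ORTE's OOB/TCP component to the loopback interface so a
    # singleton or front-end run on a firewalled head node can still connect
    # back to its daemon.
    mpirun -mca btl sm,self -mca oob_tcp_if_include lo -np 2 examples/ring_c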

On Jan 10, 2014, at 1:55 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> Since this turned out to be a matter of an unsupported system configuration, it is my opinion that this doesn't need to be addressed for 1.7.4 if it would cause any further delay.
>
> Also, I noticed this system has lo and lo:0.
> I know the TCP BTL doesn't support virtual interfaces (trac ticket 3339).
> So, I mention it here in case oob:tcp has similar issues.
>
> -Paul
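
A quick way to check whether a node carries alias/virtual interfaces such as lo:0 - a sketch assuming standard Linux tools; ifconfig lists aliases as separate entries, while iproute2 shows them as secondary addresses on the parent device:

    # List all interfaces, including aliases like lo:0
    /sbin/ifconfig -a
    # Or, with iproute2, show the loopback device and any secondary addresses
    ip addr show lo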
>
>
> On Fri, Jan 10, 2014 at 1:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On Jan 10, 2014, at 12:59 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> Ralph,
>>
>> This is the front end of a production cluster at NERSC.
>> So, I would not be surprised if there is a fairly restrictive firewall configuration in place.
>> However, I couldn't find a way to query the configuration.
>>
>
> Aha - indeed, that is the problem.
>
>> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
>>
>> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
>> Is there some reason why the loopback interface is not being used automatically for the single-host case?
>> That would seem to be a straight-forward solution to this issue.
>
> Yeah, we should do a better job of that - I'll take a look and see what can be done in the near term.
>
> Thanks!
> Ralph
>
>>
>> -Paul
>>
>>
>> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Bingo - the proc can't send a message to the daemon to tell it "I'm alive and need my nidmap data". I suspect we'll find that your headnode isn't allowing us to open a socket for communication between two processes on it, and we don't yet have a pipe-like mechanism to replace it.
>>
>> You can verify that by putting "-mca oob_base_verbose 10" on the cmd line - you should see the oob indicate that it fails to make the connection back to the daemon.
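
A sketch of the verification suggested here, reusing the test case from elsewhere in this thread (binary and process count are illustrative):

    # Raise the OOB framework's verbosity so failed connection attempts back
    # to the daemon are printed.
    mpirun -mca btl sm,self -mca oob_base_verbose 10 -np 2 examples/ring_c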
>>
>>
>> On Jan 10, 2014, at 12:33 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>>> Ralph,
>>>
>>> Configuring with a proper --with-tm=..., I find that I *can* run a singleton in an allocation ("qsub -I -l nodes=1 ....").
>>> The case of a singleton on the front end is still failing.
>>>
>>> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5" is attached.
>>>
>>> -Paul
>>>
>>>
>>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>>
>>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>>>
>>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> ??? that was it? Was this built with --enable-debug?
>>>>
>>>> Nope, I missed --enable-debug. Will try again.
>>>>
>>>>
>>>> OK, Take-2 below.
>>>> There is an obvious "recipient list is empty!" in the output.
>>>
>>> That one is correct and expected - all it means is that you are running on only one node, so mpirun doesn't need to relay messages to another daemon
>>>
>>>>
>>>> -Paul
>>>>
>>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>>>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>>
>>>
>>> This is very weird - it appears that your procs are looking for hostname data prior to receiving the necessary data. Let's try jacking up the debug, I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5"
>>>
>>> Sorry, that will be rather wordy, but I don't understand the ordering you show above. It's like your procs are skipping a bunch of steps in the startup procedure.
>>>
>>> Out of curiosity, if you do have an allocation and run on it, does it work?
>>>
>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>> <log-fe.bz2>
>>
>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <log-fe-2.bz2>
>
>
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900