
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-10 17:59:20


I *believe* oob can now support virtual interfaces, but can't swear to it - only very lightly tested on my box.

I'll mark this for resolution in 1.7.5.
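
If anyone wants to poke at it in the meantime, a single-node smoke test along these lines would exercise that path (lo:0 is just the alias name from Paul's report, and whether oob_tcp_if_include accepts an alias name at all is part of what needs checking):

  mpirun -np 2 -mca oob_tcp_if_include lo:0 -mca oob_base_verbose 10 examples/ring_c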

On Jan 10, 2014, at 1:55 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> Since this turned out to be a matter of an unsupported system configuration, it is my opinion that this doesn't need to be addressed for 1.7.4 if it would cause any further delay.
>
> Also, I noticed this system has lo and lo:0.
> I know the TCP BTL doesn't support virtual interfaces (trac ticket 3339).
> So, I mention it here in case oob:tcp has similar issues.
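>
> In case it helps with testing, both show up in a plain interface listing, and the oob's interface-selection knobs can be dumped with ompi_info (--level 9 because most MCA params are hidden by default in 1.7):
>
>   /sbin/ifconfig -a                      # lists lo as well as the lo:0 alias
>   ompi_info --param oob tcp --level 9    # shows oob_tcp_if_include / oob_tcp_if_exclude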
>
> -Paul
>
>
> On Fri, Jan 10, 2014 at 1:02 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>
> On Jan 10, 2014, at 12:59 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> Ralph,
>>
>> This is the front end of a production cluster at NERSC.
>> So, I would not be surprised if there is a fairly restrictive firewall configuration in place.
>> However, I couldn't find a way to query the configuration.
>>
>
> Aha - indeed, that is the problem.
>
>> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
>>
>> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
>> Is there some reason why the loopback interface is not being used automatically for the single-host case?
>> That would seem to be a straightforward solution to this issue.
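>>
>> For completeness, the working invocation is just the earlier test command with that one flag added, roughly:
>>
>>   mpirun -mca btl sm,self -np 2 -mca oob_tcp_if_include lo examples/ring_c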
>
> Yeah, we should do a better job of that - I'll take a look and see what can be done in the near term.
>
> Thanks!
> Ralph
>
>>
>> -Paul
>>
>>
>> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>> Bingo - the proc can't send a message to the daemon to tell it "I'm alive and need my nidmap data". I suspect we'll find that your headnode isn't allowing us to open a socket for communication between two processes on it, and we don't yet have a pipe-like mechanism to replace it.
>>
>> You can verify that by putting "-mca oob_base_verbose 10" on the cmd line - you should see the oob indicate that it fails to make the connection back to the daemon.
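>>
>> If you want to test the socket theory without OMPI in the picture, something like this on the headnode should show it (netcat flavors differ - some need "-l -p 5000" - and 5000 is just an arbitrary unused port):
>>
>>   nc -l 5000 &          # listener on the wildcard address
>>   nc $(hostname) 5000   # try to reach it via the node's non-loopback address
>>
>> If that second connect fails, it's consistent with the firewall blocking process-to-process sockets on the external interface; restart the listener and try "nc localhost 5000" to confirm that loopback still works.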
>>
>>
>> On Jan 10, 2014, at 12:33 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>
>>> Ralph,
>>>
>>> Configuring with a proper --with-tm=..., I find that I *can* run a singleton in an allocation ("qsub -I -l nodes=1 ....").
>>> The case of a singleton on the front end is still failing.
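>>>
>>> Concretely, the two cases differ only in where mpirun is launched, roughly:
>>>
>>>   qsub -I -l nodes=1                              # inside this interactive allocation...
>>>   mpirun -mca btl sm,self -np 2 examples/ring_c   # ...this run works; typed on the front end it fails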
>>>
>>> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5" is attached.
>>>
>>> -Paul
>>>
>>>
>>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>
>>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>>
>>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>>>>
>>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <rhc_at_[hidden]> wrote:
>>>> ??? that was it? Was this built with --enable-debug?
>>>>
>>>> Nope, I missed --enable-debug. Will try again.
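>>>>
>>>> (i.e. the same configure line rerun with --enable-debug added, roughly
>>>>
>>>>   ./configure --enable-debug <previous options unchanged> && make && make install
>>>>
>>>> and nothing else touched.)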
>>>>
>>>>
>>>> OK, Take-2 below.
>>>> There is an obvious "recipient list is empty!" message in the output.
>>>
>>> That one is correct and expected - all it means is that you are running on only one node, so mpirun doesn't need to relay messages to another daemon
>>>
>>>>
>>>> -Paul
>>>>
>>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
>>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>>>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>>>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
>>>
>>>
>>> This is very weird - it appears that your procs are looking for hostname data prior to receiving the necessary data. Let's try jacking up the debug, I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5"
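>>>
>>> Tacked onto your earlier command line, that would look something like:
>>>
>>>   mpirun -mca btl sm,self -np 2 \
>>>       -mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5 \
>>>       examples/ring_c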
>>>
>>> Sorry that will be rather wordy, but I don't understand the ordering you show above. It's like your procs are skipping a bunch of steps in the startup procedure.
>>>
>>> Out of curiosity, if you do have an allocation and run on it, does it work?
>>>
>>>>
>>>>
>>>> --
>>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>>> Future Technologies Group
>>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Paul H. Hargrove PHHargrove_at_[hidden]
>>> Future Technologies Group
>>> Computer and Data Sciences Department Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>> <log-fe.bz2>
>>
>>
>>
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> <log-fe-2.bz2>
>
>
>
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900