Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-10 15:12:11


On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> ??? that was it? Was this built with --enable-debug?
>
> Nope, I missed --enable-debug. Will try again.
>
>
> OK, Take-2 below.
> There is an obvious "recipient list is empty!" in the output.

That one is correct and expected - all it means is that you are running on only one node, so mpirun doesn't need to relay messages to another daemon.
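
Roughly speaking, the relay step amounts to something like the sketch below. This is a simplified, hypothetical illustration - the names num_children and forward_to_child are invented for the example and this is not the actual ORTE source - but it shows why the message is benign: with a single node there are no child daemons to relay to.

/* Hypothetical sketch of the daemon relay decision - illustration only,
 * not the real ORTE code; num_children and forward_to_child are made up. */
#include <stdio.h>
#include <stddef.h>

static void forward_to_child(int child, const void *msg, size_t len)
{
    /* placeholder for sending the buffer on to a remote daemon */
    (void)child; (void)msg; (void)len;
}

static void send_relay(int num_children, const int *children,
                       const void *msg, size_t len)
{
    if (num_children == 0) {
        /* single-node run: nothing to relay, just note it and return */
        printf("orte:daemon:send_relay - recipient list is empty!\n");
        return;
    }
    for (int i = 0; i < num_children; i++) {
        forward_to_child(children[i], msg, len);
    }
}

int main(void)
{
    /* one node, so no child daemons */
    send_relay(0, NULL, "xcast payload", 13);
    return 0;
}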

>
> -Paul
>
> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503

This is very weird - it appears that your procs are looking for hostname data before they have received it. Let's try jacking up the debug, I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5"
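
For example, tacking those onto the command you ran above gives something like:

$ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 \
    -mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5 examples/ring_c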

Sorry, that will be rather wordy, but I don't understand the ordering you show above. It's like your procs are skipping a bunch of steps in the startup procedure.

Out of curiosity, if you do have an allocation and run on it, does it work?

>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900