Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-10 15:12:11


On Jan 10, 2014, at 11:04 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain <rhc_at_[hidden]> wrote:
> ??? that was it? Was this built with --enable-debug?
>
> Nope, I missed --enable-debug. Will try again.
>
>
> OK, Take-2 below.
> There is an obvious "recipient list is empty!" in the output.

That one is correct and expected - all it means is that you are running on only one node, so mpirun has no other daemon to relay the message to.

>
> -Paul
>
> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 examples/ring_c
> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is empty!
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503
> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set priority to 10
> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c at line 503

This is very weird - it appears that your procs are looking for hostname data prior to receiving the necessary data. Let's try turning up the debug output, I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5" to the mpirun command line.

Sorry that will be rather wordy, but I don't understand the ordering you show above. It's like your procs are skipping a bunch of steps in the startup procedure.
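
For reference, here is a rough sketch of how the full command line could be put together: the flags from the run quoted above plus the three extra verbosity settings. The variable names and the log file name are only illustrative assumptions, not anything Open MPI requires. Sending the output to a file is just a suggestion, since it will be long.

```shell
#!/bin/sh
# Flags from the original run quoted above.
BASE_FLAGS="-mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10"
# Extra verbosity for the state, plm and odls frameworks.
EXTRA_FLAGS="-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 5"
CMD="mpirun $BASE_FLAGS $EXTRA_FLAGS examples/ring_c"

# Print the assembled command so it can be checked before running.
echo "$CMD"

# To run it and keep stdout and stderr together in one file
# (ring_c.verbose.log is only an example name), uncomment:
# $CMD > ring_c.verbose.log 2>&1
```

Once the output is in a file, searching it for ORTE_ERROR_LOG should show where the failing lines sit relative to the rest of the startup messages.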

Out of curiosity, if you do have an allocation and run on it, does it work?

>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900