Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-09 23:35:59


From your ompi_info output, it looks like this is a Slurm system, yes? It wouldn't really matter anyway, since we run fine on a head node without an allocation, but it's worth clarifying.

What the message indicates is a failure of the modex (the startup exchange of per-process data): we are missing an expected piece of data. I don't see anything obvious as the source of the problem. It works fine for me on all my machines, including on the front end of a Slurm cluster.

The only possibly relevant thing I see is that this was built with PGI. Any chance you could try a gcc-based build? All my tests are done with gcc, so I'm wondering whether PGI is the source of the trouble here.
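
For reference, a gcc-based rebuild could look roughly like the sketch below. Treat it as an illustration rather than a recipe: the install prefix is a made-up location, and the compiler names assume gcc, g++ and gfortran are on your PATH. It is meant to be run from the top of the unpacked source tree and does nothing elsewhere.

```shell
#!/bin/sh
# Sketch only: rebuild the same source tree with gcc instead of PGI,
# then rerun the failing example against the new install.

# Hypothetical install location (an assumption, not taken from the report).
PREFIX="${PREFIX:-$HOME/ompi-gcc-test}"

# Standard autoconf compiler overrides; assumes the GNU compilers are on PATH.
CONFIGURE_ARGS="CC=gcc CXX=g++ FC=gfortran --prefix=$PREFIX"

if [ -x ./configure ]; then
    # Intentionally unquoted so each argument is passed separately
    # (this breaks if PREFIX contains spaces).
    ./configure $CONFIGURE_ARGS && make all install || exit 1

    # Put the gcc-built wrappers and mpirun first on PATH.
    PATH="$PREFIX/bin:$PATH"
    export PATH

    # Rebuild the example with the new mpicc and repeat the original command.
    mpicc examples/ring_c.c -o examples/ring_c &&
        mpirun -mca btl sm,self -np 2 examples/ring_c
else
    echo "run this from the top of the Open MPI source tree" >&2
fi
```

If the gcc build runs cleanly with the same command line, that would point toward the PGI build; if it fails the same way, the compiler is probably not the culprit.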

On Jan 9, 2014, at 6:17 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> I've now seen this same failure mode on another Linux system.
> I forgot to mention before that the job is hung after issuing the error message.
> Singleton runs fail in the same manner.
>
> Both are front-end machines and perhaps that is related to this failure; for instance expecting an allocation because of the batch system detected at configure time. However, I would have expected a more informative error message for that case.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
> Trying to run on the front-end of one of our production Linux systems I see the following:
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c
> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
>
> The "ompi_info --all" output is attached.
>
> Please let me know what MCA param(s) to set to collect any additional info needed to track down the problem.
>
> -Paul
>
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900