Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-01-09 23:56:11


Ralph,

The problem has occurred with two builds (both PGI-based) on head nodes of
two clusters managed by TORQUE, not by SLURM. Somehow configure on the
first picked up SLURM headers and libs but not TM, while on the second it
picked up the TM headers and libs.

I'll try a gcc-based build on one of the systems ASAP.
Is there no way (w/o source mods) to know what datum is missing?
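
To make the question concrete, here is a toy sketch (Python, not Open MPI
code) of a modex-style key/value exchange in which a failed lookup names the
missing key and the peer that never published it. Every name in it
(ModexStore, publish, lookup, the "hostname" and "locality" keys) is
hypothetical and chosen only for illustration; it makes no claim about how
ORTE stores or fetches this data.

```python
# Toy model of a modex-style key/value exchange. NOT Open MPI code:
# ModexStore, publish, lookup and the key strings are all hypothetical.


class MissingKeyError(KeyError):
    """Raised when a peer never published the requested key."""


class ModexStore:
    def __init__(self):
        # Maps (rank, key) -> value published by that rank.
        self._data = {}

    def publish(self, rank, key, value):
        self._data[(rank, key)] = value

    def lookup(self, rank, key):
        try:
            return self._data[(rank, key)]
        except KeyError:
            # Report which key and which peer, not just "not found".
            raise MissingKeyError(
                f"rank {rank} did not publish key {key!r}"
            ) from None


store = ModexStore()
store.publish(0, "hostname", "node-a")
store.publish(1, "hostname", "node-a")
store.publish(0, "locality", "socket0")  # rank 1 never publishes this key

try:
    store.lookup(1, "locality")
except MissingKeyError as err:
    print("modex failure:", err)
```

An error message of this shape would answer the "which datum" question
directly, without requiring any source changes to find out.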

-Paul

On Thu, Jan 9, 2014 at 8:35 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> From your ompi_info output, it looks like this is a slurm system - yes?
> Wouldn't really matter anyway as we run fine on a head node without an
> allocation, but worth clarifying.
>
> What the message is indicating is a failure of the modex - we are missing
> an expected piece of data. I don't see anything obvious as the source of
> the problem - works fine for me on all my machines, including on front end
> of a slurm cluster.
>
> Only possibly relevant thing I see is that this was built with PGI - any
> chance you could try a gcc based build? All my tests are done with gcc, so
> I'm wondering if PGI is the source of the trouble here.
>
>
> On Jan 9, 2014, at 6:17 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> I've now seen this same failure mode on another Linux system.
> I forgot to mention before that the job is hung after issuing the error
> message.
> Singleton runs fail in the same manner.
>
> Both are front-end machines and perhaps that is related to this failure;
> for instance expecting an allocation because of the batch system detected
> at configure time. However, I would have expected a more informative error
> message for that case.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> Trying to run on the front-end of one of our production Linux systems I
>> see the following:
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>> at line 505
>> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>> at line 505
>>
>> The "ompi_info --all" output is attached.
>>
>> Please let me know what MCA param(s) to set to collect any additional
>> info needed to track down the problem.
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900