Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-01-10 00:04:22


It's missing the hostname of the other process. That should have been included in the data passed into each proc at startup, which is why this is so puzzling.
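
To make the failure mode concrete, here is a minimal sketch in Python. It is not ORTE code: `KeyNotFound`, `lookup_peer_datum`, `startup_data`, and `describe` are names made up for illustration, and the real data layout is certainly different. It only shows the general shape of the problem: each proc receives a per-peer table at startup, and a lookup for a key that was never delivered fails with a "not found" error.

```python
# Hypothetical illustration only: none of these names are ORTE/Open MPI APIs.

class KeyNotFound(Exception):
    """Raised when a peer's startup data lacks an expected key."""


def lookup_peer_datum(startup_data, peer, key):
    """Return the value stored for (peer, key), or raise KeyNotFound."""
    peer_data = startup_data.get(peer, {})
    if key not in peer_data:
        raise KeyNotFound(
            f"Data for specified key not found: peer={peer} key={key}"
        )
    return peer_data[key]


# Assumed layout: one dict of key/value pairs per peer rank.
# Rank 0 has a hostname recorded; the entry for rank 1 arrived without one.
startup_data = {
    0: {"hostname": "cvrsvc01"},
    1: {},  # hostname missing, mirroring the failure discussed in this thread
}


def describe(peer):
    """Look up a peer's hostname, returning an error string on failure."""
    try:
        return lookup_peer_datum(startup_data, peer, "hostname")
    except KeyNotFound as err:
        return f"error: {err}"


print(describe(0))
print(describe(1))
```

The second lookup fails because the key was never stored, not because the lookup itself is broken, which is why the question is where the hostname got dropped on the way in.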

On Jan 9, 2014, at 8:56 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:

> Ralph,
>
> The problem has occurred with two builds (both PGI-based) on the head nodes of two clusters managed by TORQUE, not by SLURM. Somehow configure on the first picked up the SLURM headers and libs but not TM, while configure on the second picked up the TM headers and libs.
>
> I'll try a gcc-based build on one of the systems ASAP.
> Is there no way (w/o source mods) to know what datum is missing?
>
> -Paul
>
>
>
> On Thu, Jan 9, 2014 at 8:35 PM, Ralph Castain <rhc_at_[hidden]> wrote:
> From your ompi_info output, it looks like this is a slurm system - yes? Wouldn't really matter anyway as we run fine on a head node without an allocation, but worth clarifying.
>
> What the message is indicating is a failure of the modex - we are missing an expected piece of data. I don't see anything obvious as the source of the problem - it works fine for me on all my machines, including on the front end of a slurm cluster.
>
> The only possibly relevant thing I see is that this was built with PGI - any chance you could try a gcc-based build? All my tests are done with gcc, so I'm wondering if PGI is the source of the trouble here.
>
>
> On Jan 9, 2014, at 6:17 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
>> I've now seen this same failure mode on another Linux system.
>> I forgot to mention before that the job is hung after issuing the error message.
>> Singleton runs fail in the same manner.
>>
>> Both are front-end machines, and perhaps that is related to this failure; for instance, the runtime might be expecting an allocation because a batch system was detected at configure time. However, I would have expected a more informative error message in that case.
>>
>> -Paul
>>
>>
>> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>> Trying to run on the front end of one of our production Linux systems, I see the following:
>>
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
>> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not found in file /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c at line 505
>>
>> The "ompi_info --all" output is attached.
>>
>> Please let me know what MCA param(s) to set to collect any additional info needed to track down the problem.
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove PHHargrove_at_[hidden]
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>