Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] -hostfile ignored in 1.6.1 / SGE integration broken
From: Reuti (reuti_at_[hidden])
Date: 2012-09-03 18:50:16


On 04.09.2012 at 00:07, Ralph Castain wrote:

> I'm leaning towards fixing it - it came about from discussions on how to handle a hostfile when there is an allocation. For now, though, that should work.

Oh, did I miss this on the list? If there is a hostfile given as argument, it should override the default hostfile IMO.

>>
>>
>>>> ==
>>>>
>>>> SGE issue
>>>>
>>>> I usually don't install new versions right away, so I only noticed just now that with 1.4.5 I get a wrong allocation inside SGE (always one process fewer than requested with `qsub -pe orted N ...`). I only tried this because with 1.6.1 I get:
>>>>
>>>> --------------------------------------------------------------------------
>>>> There are no nodes allocated to this job.
>>>> --------------------------------------------------------------------------
>>>>
>>>> all the time.
>>>
>>> Weird - I'm not sure I understand what you are saying. Is this happening with 1.6.1 as well? Or just with 1.4.5?
>>
>> 1.6.1 = no nodes allocated
>> 1.4.5 = one process less than requested
>> 1.4.1 = works as it should
>>
>
> Well that seems strange! Can you run 1.6.1 with the following on the mpirun cmd line:
>
> -mca ras_gridengine_debug 1 -mca ras_gridengine_verbose 10 -mca ras_base_verbose 10

[pc15381:06250] mca: base: components_open: Looking for ras components
[pc15381:06250] mca: base: components_open: opening ras components
[pc15381:06250] mca: base: components_open: found loaded component cm
[pc15381:06250] mca: base: components_open: component cm has no register function
[pc15381:06250] mca: base: components_open: component cm open function successful
[pc15381:06250] mca: base: components_open: found loaded component gridengine
[pc15381:06250] mca: base: components_open: component gridengine has no register function
[pc15381:06250] mca: base: components_open: component gridengine open function successful
[pc15381:06250] mca: base: components_open: found loaded component loadleveler
[pc15381:06250] mca: base: components_open: component loadleveler has no register function
[pc15381:06250] mca: base: components_open: component loadleveler open function successful
[pc15381:06250] mca: base: components_open: found loaded component slurm
[pc15381:06250] mca: base: components_open: component slurm has no register function
[pc15381:06250] mca: base: components_open: component slurm open function successful
[pc15381:06250] mca:base:select: Auto-selecting ras components
[pc15381:06250] mca:base:select:( ras) Querying component [cm]
[pc15381:06250] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [gridengine]
[pc15381:06250] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15381:06250] mca:base:select:( ras) Querying component [loadleveler]
[pc15381:06250] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Querying component [slurm]
[pc15381:06250] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15381:06250] mca:base:select:( ras) Selected component [gridengine]
[pc15381:06250] mca: base: close: unloading component cm
[pc15381:06250] mca: base: close: unloading component loadleveler
[pc15381:06250] mca: base: close: unloading component slurm
[pc15381:06250] ras:gridengine: JOB_ID: 4636
[pc15381:06250] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15381/active_jobs/4636.1/pe_hostfile
[pc15381:06250] ras:gridengine: pc15381: PE_HOSTFILE shows slots=1
[pc15381:06250] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
--------------------------------------------------------------------------
There are no nodes allocated to this job.
--------------------------------------------------------------------------
[pc15381:06250] mca: base: close: component gridengine closed
[pc15381:06250] mca: base: close: unloading component gridengine

The actual hostfile contains:

pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED

and it was submitted with `qsub -pe orted 4 ...`.

Aha, I remember an issue on the list: if a job got slots from several queues, they weren't added up. This was the issue in 1.4.5, OK. Wasn't it fixed later on? But here it's getting no allocation at all.
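
To make the expectation concrete, here is a minimal Python sketch of the aggregation I would expect for the hostfile above: slots are summed per host over all queue entries. The helper name `sum_slots_per_host` is made up for this illustration, and the code only assumes the four-column layout shown above. It is not Open MPI's actual parsing code.

```python
from collections import OrderedDict


def sum_slots_per_host(pe_hostfile_text):
    """Sum slot counts per host over all queue entries (hypothetical helper)."""
    slots = OrderedDict()
    for line in pe_hostfile_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, count = fields[0], int(fields[1])
        # The same host may appear once per queue; add the counts together.
        slots[host] = slots.get(host, 0) + count
    return slots


# Contents of the pe_hostfile from the multi-queue job.
# The third column (queue instance) is ignored by this sketch.
PE_HOSTFILE = """\
pc15381 1 all.q@pc15381 UNDEFINED
pc15370 2 extra.q@pc15370 UNDEFINED
pc15381 1 extra.q@pc15381 UNDEFINED
"""

per_host = sum_slots_per_host(PE_HOSTFILE)
for host, n in per_host.items():
    print(host, n)
print("total", sum(per_host.values()))
```

With this kind of summing, the two entries for pc15381 combine to 2 slots and the total matches the 4 slots requested with `qsub -pe orted 4 ...`.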

==

If I force the job to get its slots from only one queue:

[pc15370:30447] mca: base: components_open: Looking for ras components
[pc15370:30447] mca: base: components_open: opening ras components
[pc15370:30447] mca: base: components_open: found loaded component cm
[pc15370:30447] mca: base: components_open: component cm has no register function
[pc15370:30447] mca: base: components_open: component cm open function successful
[pc15370:30447] mca: base: components_open: found loaded component gridengine
[pc15370:30447] mca: base: components_open: component gridengine has no register function
[pc15370:30447] mca: base: components_open: component gridengine open function successful
[pc15370:30447] mca: base: components_open: found loaded component loadleveler
[pc15370:30447] mca: base: components_open: component loadleveler has no register function
[pc15370:30447] mca: base: components_open: component loadleveler open function successful
[pc15370:30447] mca: base: components_open: found loaded component slurm
[pc15370:30447] mca: base: components_open: component slurm has no register function
[pc15370:30447] mca: base: components_open: component slurm open function successful
[pc15370:30447] mca:base:select: Auto-selecting ras components
[pc15370:30447] mca:base:select:( ras) Querying component [cm]
[pc15370:30447] mca:base:select:( ras) Skipping component [cm]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Querying component [gridengine]
[pc15370:30447] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15370:30447] mca:base:select:( ras) Querying component [loadleveler]
[pc15370:30447] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Querying component [slurm]
[pc15370:30447] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15370:30447] mca:base:select:( ras) Selected component [gridengine]
[pc15370:30447] mca: base: close: unloading component cm
[pc15370:30447] mca: base: close: unloading component loadleveler
[pc15370:30447] mca: base: close: unloading component slurm
[pc15370:30447] ras:gridengine: JOB_ID: 4638
[pc15370:30447] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
[pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
[pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2

But: it starts only 2 processes instead of 4:

Total: 2
Universe: 4
Hello World from Rank 0.
Hello World from Rank 1.

Yes, I can use `mpiexec -np $NSLOTS ..` to get 4, but all of them will run on pc15370; pc15381 is ignored completely.
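
As a cross-check for this single-queue run, the slot counts reported by the gridengine debug lines can be summed and compared with the number of ranks that actually started. The following Python sketch does that for the log excerpt above. The names `SLOT_LINE` and `slots_seen_in_log` are invented for this example, and the regular expression only assumes the "PE_HOSTFILE shows slots=N" wording seen in this output.

```python
import re

# Matches lines such as:
#   [pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
SLOT_LINE = re.compile(r"ras:gridengine: (\S+): PE_HOSTFILE shows slots=(\d+)")


def slots_seen_in_log(log_text):
    """Return (host, slots) pairs reported by the debug output (hypothetical helper)."""
    return [(m.group(1), int(m.group(2))) for m in SLOT_LINE.finditer(log_text)]


LOG = """\
[pc15370:30447] ras:gridengine: JOB_ID: 4638
[pc15370:30447] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4638.1/pe_hostfile
[pc15370:30447] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
[pc15370:30447] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2
"""

seen = slots_seen_in_log(LOG)
slots_read = sum(n for _, n in seen)
ranks_started = 2  # "Total: 2" in the hello-world output of this run

print("slots read from PE_HOSTFILE:", slots_read)
print("ranks started:", ranks_started)
if slots_read != ranks_started:
    print("mismatch: allocation was read, but not all slots were used")
```

Here the log accounts for all 4 slots on both hosts, while only 2 ranks were started. That hints that the hostfile is read completely in this case and the slots get lost at a later step, though this is only an inference from the debug output.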

==

If I go back to 1.4.1:

[pc15370:31052] mca: base: components_open: Looking for ras components
[pc15370:31052] mca: base: components_open: opening ras components
[pc15370:31052] mca: base: components_open: found loaded component gridengine
[pc15370:31052] mca: base: components_open: component gridengine has no register function
[pc15370:31052] mca: base: components_open: component gridengine open function successful
[pc15370:31052] mca: base: components_open: found loaded component slurm
[pc15370:31052] mca: base: components_open: component slurm has no register function
[pc15370:31052] mca: base: components_open: component slurm open function successful
[pc15370:31052] mca:base:select: Auto-selecting ras components
[pc15370:31052] mca:base:select:( ras) Querying component [gridengine]
[pc15370:31052] mca:base:select:( ras) Query of component [gridengine] set priority to 100
[pc15370:31052] mca:base:select:( ras) Querying component [slurm]
[pc15370:31052] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[pc15370:31052] mca:base:select:( ras) Selected component [gridengine]
[pc15370:31052] mca: base: close: unloading component slurm
[pc15370:31052] ras:gridengine: JOB_ID: 4640
[pc15370:31052] ras:gridengine: PE_HOSTFILE: /var/spool/sge/pc15370/active_jobs/4640.1/pe_hostfile
[pc15370:31052] ras:gridengine: pc15370: PE_HOSTFILE shows slots=2
[pc15370:31052] ras:gridengine: pc15381: PE_HOSTFILE shows slots=2

Total: 4
Universe: 4
Hello World from Rank 0.
Hello World from Rank 1.
Hello World from Rank 2.
Hello World from Rank 3.

And no "-np $NSLOTS" in the command, just a plain `mpiexec ./mpihello`.

-- Reuti

> My guess is that something in the pe_hostfile syntax may have changed and we didn't pick up on it.
>
>
>> -- Reuti
>>
>>
>>>
>>>>
>>>> ==
>>>>
>>>> I configured with:
>>>>
>>>> ./configure --prefix=$HOME/local/... --enable-static --disable-shared --with-sge
>>>>
>>>> and adjusted my PATHs accordingly (at least: I hope so).
>>>>
>>>> -- Reuti
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users