
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: Reuti (reuti_at_[hidden])
Date: 2013-03-19 11:52:15


Hi,

On 19.03.2013, at 08:00, tmishima_at_[hidden] wrote:

> I didn't have much time to test this morning, so I checked it again
> now. The trouble seems to depend on the number of nodes used.
>
> This works (nodes < 4):
> mpiexec -bynode -np 4 ./my_program  (with #PBS -l nodes=2:ppn=8,
> OMP_NUM_THREADS=4)
>
> This causes an error (nodes >= 4):
> mpiexec -bynode -np 8 ./my_program  (with #PBS -l nodes=4:ppn=8,
> OMP_NUM_THREADS=4)

We don't use Torque/PBS ourselves, but AFAIK the request "-l nodes=4:ppn=8" can give you 4 distinct nodes with 8 slots each, or it can list some nodes twice or more when enough slots are free and the cluster is set up that way. Whether this is allowed is a global setting in Torque/PBS.

Did you get different nodes or some nodes at least twice?

I don't know whether this is related to the issue, but it seemed at least worth mentioning in this context.
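One quick way to check is to count the entries in the machine file Torque hands to the job. A minimal sketch (the file path and hostnames below are made up for illustration; inside a real job you would point at "$PBS_NODEFILE" instead):

```shell
# Stand-in for the real $PBS_NODEFILE; in a running Torque job,
# use "$PBS_NODEFILE" instead of this example file.
cat > /tmp/pbs_nodefile.example <<'EOF'
node05
node05
node06
node06
EOF

# Count how often each host appears. If a host shows up for more
# than one of the requested "nodes", Torque has packed several
# virtual nodes of the request onto the same physical machine.
sort /tmp/pbs_nodefile.example | uniq -c
```

If the counts show fewer distinct hosts than the "nodes=" value you asked for, that would confirm the allocation behavior described above.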

-- Reuti

> Regards,
> Tetsuya Mishima
>
>> Oy; that's weird.
>>
>> I'm afraid we're going to have to wait for Ralph to answer why that is
>> happening -- sorry!
>>
>>
>> On Mar 18, 2013, at 4:45 PM, <tmishima_at_[hidden]> wrote:
>>
>>>
>>>
>>> Hi Correa and Jeff,
>>>
>>> Thank you for your comments. I quickly checked your suggestion.
>>>
>>> As a result, my simple example case worked well.
>>> export OMP_NUM_THREADS=4
>>> mpiexec -bynode -np 2 ./my_program  (with #PBS -l nodes=2:ppn=4)
>>>
>>> But the practical case, where more than 1 process is allocated to a
>>> node like below, did not work.
>>> export OMP_NUM_THREADS=4
>>> mpiexec -bynode -np 4 ./my_program  (with #PBS -l nodes=2:ppn=8)
>>>
>>> The error message is as follows:
>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>> attempting to be sent to a process whose contact information is
>>> unknown in file rml_oob_send.c at line 316
>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>> [[30666,0],1]
>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>> attempting to be sent to a process whose contact information is
>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>
>>> Here is our openmpi configuration:
>>> ./configure \
>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>> --with-tm \
>>> --with-verbs \
>>> --disable-ipv6 \
>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <gus_at_[hidden]> wrote:
>>>>
>>>>> In your example, have you tried not to modify the node file, to
>>>>> launch two MPI processes with mpiexec, and to request a "-bynode"
>>>>> distribution of processes:
>>>>>
>>>>> mpiexec -bynode -np 2 ./my_program
>>>>
>>>> This should work in 1.7, too (I use these kinds of options with
>>>> SLURM all the time).
>>>>
>>>> However, we should probably verify that the hostfile functionality in
>>>> batch jobs hasn't been broken in 1.7, too, because I'm pretty sure that
>>>> what you described should work. However, Ralph, our run-time guy, is
>>>> on vacation this week. There might be a delay in checking into this.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>>