Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] modified hostfile does not work with openmpi1.7rc8
From: Gus Correa (gus_at_[hidden])
Date: 2013-03-19 12:14:04


Hi Tetsuya Mishima

Mpiexec offers you a number of possibilities that you could try:
--bynode,
--pernode,
--npernode,
--bysocket,
--bycore,
--cpus-per-proc,
--cpus-per-rank,
--rankfile
and more.

Most likely one or more of them will fit your needs.

There are also associated flags to bind processes to cores,
to sockets, etc., to report the bindings, and so on.

Check the mpiexec man page for details.

Nevertheless, I am surprised that modifying the
$PBS_NODEFILE doesn't work for you in OMPI 1.7.
I have done this many times in older versions of OMPI.
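
For reference, the kind of nodefile modification I mean can be sketched
as below. This is only an illustration under assumptions: the function
name thin_nodefile is made up for this example, and it assumes the
nodefile lists each host once per requested core (ppn), as Torque/PBS
normally writes it. It keeps one entry per group of OMP_NUM_THREADS
cores on each host, so that each MPI process has room for its threads.

```shell
#!/bin/sh
# Hypothetical helper (the name is an assumption, not part of Open MPI or PBS).
# Usage: thin_nodefile NODEFILE THREADS
# Prints every THREADS-th occurrence of each hostname (the 1st, the
# (THREADS+1)-th, ...), so a host listed 8 times with THREADS=4 is
# printed twice.
thin_nodefile() {
    nodefile=$1
    threads=$2
    awk -v n="$threads" '
        {
            seen[$1]++                      # occurrences of this host so far
            if ((seen[$1] - 1) % n == 0)    # keep the first of each group of n
                print $1
        }
    ' "$nodefile"
}

# Possible use inside a job script (not executed here):
#   thin_nodefile "$PBS_NODEFILE" "$OMP_NUM_THREADS" > my_hosts
#   mpiexec -hostfile my_hosts -np 4 ./my_program
```

With nodes=2:ppn=8 and OMP_NUM_THREADS=4 this would leave two entries
per node, i.e. four slots in total. Whether 1.7rc8 honors such a file
inside a batch job is exactly what is in question in this thread.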

Would it work for you to go back to the stable OMPI 1.6.X,
or does it lack any special feature that you need?

I hope this helps,
Gus Correa

On 03/19/2013 03:00 AM, tmishima_at_[hidden] wrote:
>
>
> Hi Jeff,
>
> I didn't have much time to test this morning, so I checked it again
> now. The trouble seems to depend on the number of nodes used.
>
> This works (nodes < 4):
> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
> (OMP_NUM_THREADS=4)
>
> This causes an error (nodes >= 4):
> mpiexec -bynode -np 8 ./my_program && #PBS -l nodes=4:ppn=8
> (OMP_NUM_THREADS=4)
>
> Regards,
> Tetsuya Mishima
>
>> Oy; that's weird.
>>
>> I'm afraid we're going to have to wait for Ralph to answer why that is
>> happening -- sorry!
>>
>>
>> On Mar 18, 2013, at 4:45 PM, <tmishima_at_[hidden]> wrote:
>>
>>>
>>>
>>> Hi Correa and Jeff,
>>>
>>> Thank you for your comments. I quickly checked your suggestion.
>>>
>>> As a result, my simple example case worked well.
>>> export OMP_NUM_THREADS=4
>>> mpiexec -bynode -np 2 ./my_program && #PBS -l nodes=2:ppn=4
>>>
>>> But a practical case, in which more than one process is allocated to
>>> a node as shown below, did not work.
>>> export OMP_NUM_THREADS=4
>>> mpiexec -bynode -np 4 ./my_program && #PBS -l nodes=2:ppn=8
>>>
>>> The error message is as follows:
>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>> attempting to be sent to a process whose contact information is
>>> unknown in file rml_oob_send.c at line 316
>>> [node08.cluster:11946] [[30666,0],3] unable to find address for
>>> [[30666,0],1]
>>> [node08.cluster:11946] [[30666,0],3] ORTE_ERROR_LOG: A message is
>>> attempting to be sent to a process whose contact information is
>>> unknown in file base/grpcomm_base_rollup.c at line 123
>>>
>>> Here is our openmpi configuration:
>>> ./configure \
>>> --prefix=/home/mishima/opt/mpi/openmpi-1.7rc8-pgi12.9 \
>>> --with-tm \
>>> --with-verbs \
>>> --disable-ipv6 \
>>> CC=pgcc CFLAGS="-fast -tp k8-64e" \
>>> CXX=pgCC CXXFLAGS="-fast -tp k8-64e" \
>>> F77=pgfortran FFLAGS="-fast -tp k8-64e" \
>>> FC=pgfortran FCFLAGS="-fast -tp k8-64e"
>>>
>>> Regards,
>>> Tetsuya Mishima
>>>
>>>> On Mar 17, 2013, at 10:55 PM, Gustavo Correa <gus_at_[hidden]> wrote:
>>>>
>>>>> In your example, have you tried not to modify the node file,
>>>>> launch two MPI processes with mpiexec, and request a "-bynode"
>>>>> distribution of processes:
>>>>>
>>>>> mpiexec -bynode -np 2 ./my_program
>>>>
>>>> This should work in 1.7, too (I use these kinds of options with SLURM
>>>> all the time).
>>>>
>>>> However, we should probably verify that the hostfile functionality in
>>>> batch jobs hasn't been broken in 1.7, too, because I'm pretty sure that
>>>> what you described should work. However, Ralph, our run-time guy, is on
>>>> vacation this week. There might be a delay in checking into this.
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquyres_at_[hidden]
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> users_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>
>>
>> --
>> Jeff Squyres
>> jsquyres_at_[hidden]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/