Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange behaviour of SGE+OpenMPI
From: PN (poknam_at_[hidden])
Date: 2009-04-01 12:37:57


Thanks.

$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
/opt/openmpi-gcc/bin/mpirun --display-allocation --display-map -v -np $NSLOTS --host node0001,node0002 hostname

$ cat HPL_8cpu_GB.o46

====================== ALLOCATED NODES ======================

 Data for node: Name: node0001 Num slots: 4 Max slots: 0
 Data for node: Name: node0002.v5cluster.com Num slots: 4 Max slots: 0

=================================================================

 ======================== JOB MAP ========================

 Data for node: Name: node0001 Num procs: 8
        Process OMPI jobid: [10982,1] Process rank: 0
        Process OMPI jobid: [10982,1] Process rank: 1
        Process OMPI jobid: [10982,1] Process rank: 2
        Process OMPI jobid: [10982,1] Process rank: 3
        Process OMPI jobid: [10982,1] Process rank: 4
        Process OMPI jobid: [10982,1] Process rank: 5
        Process OMPI jobid: [10982,1] Process rank: 6
        Process OMPI jobid: [10982,1] Process rank: 7

 =============================================================
node0001
node0001
node0001
node0001
node0001
node0001
node0001
node0001

I'm not sure why node0001 is missing the domain name; could this be related?
However, the hostnames look correct when I run "qconf -sel":

$ qconf -sel
node0001.v5cluster.com
node0002.v5cluster.com
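
For reference, a quick way to compare what SGE and the node itself report (assuming ssh access to the nodes; the commands are standard, but the output depends on the DNS/hosts setup):

$ qconf -se node0001          # how SGE has the exec host registered
$ ssh node0001 hostname       # short name reported by the node itself
$ ssh node0001 hostname -f    # fully qualified name, if resolution is set up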

2009/4/1 Ralph Castain <rhc_at_[hidden]>

> Rolf has correctly reminded me that display-allocation occurs prior to host
> filtering, so you will see all of the allocated nodes. You'll see the impact
> of the host specifications in display-map.
>
> Sorry for the confusion - thanks to Rolf for pointing it out.
> Ralph
>
>
> On Apr 1, 2009, at 7:40 AM, Ralph Castain wrote:
>
> As an FYI: you can debug allocation issues more easily by:
>>
>> mpirun --display-allocation --do-not-launch -n 1 foo
>>
>> This will read the allocation, do whatever host filtering you specify with
>> -host and -hostfile options, report out the result, and then terminate
>> without trying to launch anything. I found it most useful for debugging
>> these situations.
>>
>> If you want to know where the procs would have gone, then you can do:
>>
>> mpirun --display-allocation --display-map --do-not-launch -n 8 foo
>>
>> In this case, the #procs you specify needs to be the number you actually
>> want so that the mapper will run properly. However, the executable can be
>> bogus and nothing will actually launch. It's the closest you can come to a
>> dry run of a job.
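>>
>> As a concrete sketch against your job script, only the mpirun line changes
>> (the executable here is just a placeholder, since nothing is launched):
>>
>> /opt/openmpi-gcc/bin/mpirun --display-allocation --display-map --do-not-launch -np $NSLOTS --host node0001,node0002 ./bin/goto-openmpi-gcc/xhpl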
>>
>> HTH
>> Ralph
>>
>>
>> On Apr 1, 2009, at 7:10 AM, Rolf Vandevaart wrote:
>>
>> It turns out that the use of --host and --hostfile acts as a filter for
>>> which nodes to run on when you are running under SGE. So, listing them
>>> several times does not affect where the processes land. However, this still
>>> does not explain why you are seeing what you are seeing. One thing you can
>>> try is to add this to the mpirun command.
>>>
>>> -mca ras_gridengine_verbose 100
>>>
>>> This will provide some additional information as to what Open MPI is
>>> seeing as nodes and slots from SGE. (Is there any chance that node0002
>>> actually has 8 slots?)
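>>>
>>> For example, prepended to the mpirun line from your script it would be
>>> something like:
>>>
>>> /opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl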
>>>
>>> I just retried on my cluster of 2-CPU SPARC Solaris nodes. When I run
>>> with np=2, both MPI processes land on a single node, because
>>> that node has two slots. When I go up to np=4, they spill over to the
>>> other node. The --host acts as a filter on where they should run.
>>>
>>> In terms of using "IB bonding", I do not know what that means
>>> exactly. Open MPI does stripe over multiple IB interfaces, so I think the
>>> answer is yes.
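>>>
>>> If you want to see what knobs the openib BTL exposes for multi-port use
>>> on your install, ompi_info will list its MCA parameters (parameter names
>>> differ between releases, so treat this only as a starting point):
>>>
>>> ompi_info --param btl openib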
>>>
>>> Rolf
>>>
>>> PS: Here is what my np=4 job script looked like. (I just changed it to
>>> np=2 for the other run.)
>>>
>>> burl-ct-280r-0 148 =>more run.sh
>>> #! /bin/bash
>>> #$ -S /bin/bash
>>> #$ -V
>>> #$ -cwd
>>> #$ -N Job1
>>> #$ -pe orte 200
>>> #$ -j y
>>> #$ -l h_rt=00:20:00 # Run time (hh:mm:ss) - 20 min
>>>
>>> echo $NSLOTS
>>> /opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
>>>
>>> Here is the output (somewhat truncated)
>>> burl-ct-280r-0 150 =>more Job1.o199
>>> 200
>>> [burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
>>> [burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE:
>>> /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
>>> [..snip..]
>>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
>>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
>>> [..snip..]
>>> burl-ct-280r-1
>>> burl-ct-280r-1
>>> burl-ct-280r-0
>>> burl-ct-280r-0
>>> burl-ct-280r-0 151 =>
>>>
>>>
>>> On 03/31/09 22:39, PN wrote:
>>>
>>>> Dear Rolf,
>>>> Thanks for your reply.
>>>> I've created another PE and changed the submission script to explicitly
>>>> specify the hostnames with "--host".
>>>> However, the result is the same.
>>>> # qconf -sp orte
>>>> pe_name orte
>>>> slots 8
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args /bin/true
>>>> stop_proc_args /bin/true
>>>> allocation_rule $fill_up
>>>> control_slaves TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> accounting_summary TRUE
>>>> $ cat hpl-8cpu-test.sge
>>>> #!/bin/bash
>>>> #
>>>> #$ -N HPL_8cpu_GB
>>>> #$ -pe orte 8
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -S /bin/bash
>>>> #$ -V
>>>> #
>>>> cd /home/admin/hpl-2.0
>>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>>> # pdsh -a ps ax --width=200|grep hpl
>>>> node0002: 18901 ? S 0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18902 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18903 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18904 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18905 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18906 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18907 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18908 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> node0002: 18909 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>>> Any hints on how to debug this situation?
>>>> Also, if I have 2 IB ports in each node, with IB bonding configured, will
>>>> Open MPI automatically benefit from the double bandwidth?
>>>> Thanks a lot.
>>>> Best Regards,
>>>> PN
>>>> 2009/4/1 Rolf Vandevaart <Rolf.Vandevaart_at_[hidden]>
>>>> On 03/31/09 11:43, PN wrote:
>>>> Dear all,
>>>> I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2.
>>>> I have 2 compute nodes for testing; each node has a single quad-core
>>>> CPU.
>>>> Here is my submission script and PE config:
>>>> $ cat hpl-8cpu.sge
>>>> #!/bin/bash
>>>> #
>>>> #$ -N HPL_8cpu_IB
>>>> #$ -pe mpi-fu 8
>>>> #$ -cwd
>>>> #$ -j y
>>>> #$ -S /bin/bash
>>>> #$ -V
>>>> #
>>>> cd /home/admin/hpl-2.0
>>>> # For IB
>>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile
>>>> $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
>>>> I've tested that the mpirun command runs correctly from the command
>>>> line.
>>>> $ qconf -sp mpi-fu
>>>> pe_name mpi-fu
>>>> slots 8
>>>> user_lists NONE
>>>> xuser_lists NONE
>>>> start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>>> stop_proc_args /opt/sge/mpi/stopmpi.sh
>>>> allocation_rule $fill_up
>>>> control_slaves TRUE
>>>> job_is_first_task FALSE
>>>> urgency_slots min
>>>> accounting_summary TRUE
>>>> I've checked $TMPDIR/machines after submitting; it was correct:
>>>> node0002
>>>> node0002
>>>> node0002
>>>> node0002
>>>> node0001
>>>> node0001
>>>> node0001
>>>> node0001
>>>> However, I found that if I explicitly specify "-machinefile
>>>> $TMPDIR/machines", all 8 MPI processes are spawned on a
>>>> single node, i.e. node0002.
>>>> However, if I omit "-machinefile $TMPDIR/machines" from the mpirun
>>>> line, i.e.
>>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS
>>>> ./bin/goto-openmpi-gcc/xhpl
>>>> then the MPI processes start correctly, with 4 processes on node0001
>>>> and 4 processes on node0002.
>>>> Is this normal behaviour of Open MPI?
>>>> I just tried it both ways and I got the same result both times. The
>>>> processes are split between the nodes. Perhaps to be extra sure,
>>>> you can just run hostname? And for what it is worth, as you have
>>>> seen, you do not need to specify a machines file. Open MPI will use
>>>> the nodes that were allocated by SGE. You can also change your
>>>> parallel environment to not run any scripts. Like this:
>>>> start_proc_args /bin/true
>>>> stop_proc_args /bin/true
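>>>> A minimal way to make that change, assuming admin access on the SGE
>>>> master host:
>>>>
>>>> # qconf -mp mpi-fu
>>>>
>>>> That opens the PE definition in an editor; set start_proc_args and
>>>> stop_proc_args to /bin/true as shown above.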
>>>> Also, I wondered: if I have an IB interface, for example the IB
>>>> hostnames become node0001-clust and node0002-clust, will
>>>> Open MPI automatically use the IB interface?
>>>> Yes, it should use the IB interface.
>>>> What about if I have 2 IB ports in each node, with IB bonding
>>>> configured; will Open MPI automatically benefit from the double
>>>> bandwidth?
>>>> Thanks a lot.
>>>> Best Regards,
>>>> PN
>>>>
>>>
>>>
>>> --
>>>
>>> =========================
>>> rolf.vandevaart_at_[hidden]
>>> 781-442-3043
>>> =========================