Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Strange behaviour of SGE+OpenMPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-04-01 10:04:21


Rolf has correctly reminded me that display-allocation occurs prior to
host filtering, so you will see all of the allocated nodes. You'll see
the impact of the host specifications in display-map.

Sorry for the confusion - thanks to Rolf for pointing it out.
Ralph

On Apr 1, 2009, at 7:40 AM, Ralph Castain wrote:

> As an FYI: you can debug allocation issues more easily by:
>
> mpirun --display-allocation --do-not-launch -n 1 foo
>
> This will read the allocation, do whatever host filtering you
> specify with -host and -hostfile options, report out the result, and
> then terminate without trying to launch anything. I found it most
> useful for debugging these situations.
>
> If you want to know where the procs would have gone, then you can do:
>
> mpirun --display-allocation --display-map --do-not-launch -n 8 foo
>
> In this case, the number of procs you specify needs to be the number
> you actually want so that the mapper runs properly. However, the
> executable can be bogus and nothing will actually launch. It's the
> closest you can come to a dry run of a job.
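>
> As a minimal sketch of such a dry run under SGE (using the orte PE and
> node names discussed further down in this thread; "foo" is a bogus
> executable, as in the example above, and the mpirun path is just a
> placeholder), the job script could look like:
>
> #!/bin/bash
> #$ -N dryrun
> #$ -pe orte 8
> #$ -cwd
> #$ -j y
> #$ -S /bin/bash
> #$ -V
> # Read the SGE allocation, apply any -host/-hostfile filtering you add,
> # print the allocation and the resulting process map, then exit
> # without launching anything.
> /opt/openmpi-gcc/bin/mpirun --display-allocation --display-map \
>     --do-not-launch -np $NSLOTS foo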
>
> HTH
> Ralph
>
>
> On Apr 1, 2009, at 7:10 AM, Rolf Vandevaart wrote:
>
>> It turns out that --host and --hostfile act as a filter for which
>> nodes to run on when you are running under SGE. So, listing hosts
>> several times does not affect where the processes land. However,
>> this still does not explain why you are seeing what you are seeing.
>> One thing you can try is to add this to the mpirun command:
>>
>> -mca ras_gridengine_verbose 100
>>
>> This will provide some additional information as to what Open MPI
>> is seeing as nodes and slots from SGE. (Is there any chance that
>> node0002 actually has 8 slots?)
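>>
>> For example (just a sketch based on the hpl-8cpu-test.sge script
>> quoted further down), the flag goes directly on the mpirun line:
>>
>> # Report the nodes and slots that ras:gridengine reads from the PE_HOSTFILE
>> /opt/openmpi-gcc/bin/mpirun -mca ras_gridengine_verbose 100 -v -np $NSLOTS \
>>     --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 \
>>     ./bin/goto-openmpi-gcc/xhpl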
>>
>> I just retried on my cluster of 2-CPU sparc solaris nodes. When I
>> run with np=2, both MPI processes land on a single node, because
>> that node has two slots. When I go up to np=4, the remaining
>> processes move on to the other node. The --host acts as a filter on
>> where they should run.
>>
>> In terms of "IB bonding", I do not know exactly what that means.
>> Open MPI does stripe over multiple IB interfaces, so I think the
>> answer is yes.
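>>
>> As a rough sketch (assuming your Open MPI 1.3.1 build includes openib
>> support; your script does not set a btl at all, so this is an
>> addition, not something taken from it), you can also request the IB
>> transport explicitly instead of relying on the default selection:
>>
>> # Use the InfiniBand (openib) BTL plus shared memory and self; with
>> # two active ports, large messages should be striped across both.
>> /opt/openmpi-gcc/bin/mpirun -mca btl openib,sm,self -np $NSLOTS \
>>     ./bin/goto-openmpi-gcc/xhpl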
>>
>> Rolf
>>
>> PS: Here is what my np=4 job script looked like. (I just changed it
>> to np=2 for the other run.)
>>
>> burl-ct-280r-0 148 =>more run.sh
>> #! /bin/bash
>> #$ -S /bin/bash
>> #$ -V
>> #$ -cwd
>> #$ -N Job1
>> #$ -pe orte 200
>> #$ -j y
>> #$ -l h_rt=00:20:00 # Run time (hh:mm:ss) - 20 min
>>
>> echo $NSLOTS
>> /opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v -np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname
>>
>> Here is the output (somewhat truncated)
>> burl-ct-280r-0 150 =>more Job1.o199
>> 200
>> [burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
>> [burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: /ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile
>> [..snip..]
>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows slots=2
>> [burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows slots=2
>> [..snip..]
>> burl-ct-280r-1
>> burl-ct-280r-1
>> burl-ct-280r-0
>> burl-ct-280r-0
>> burl-ct-280r-0 151 =>
>>
>>
>> On 03/31/09 22:39, PN wrote:
>>> Dear Rolf,
>>> Thanks for your reply.
>>> I've created another PE and changed the submission script to
>>> explicitly specify the hostnames with "--host".
>>> However, the result is the same.
>>> # qconf -sp orte
>>> pe_name orte
>>> slots 8
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
>>> allocation_rule $fill_up
>>> control_slaves TRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
>>> accounting_summary TRUE
>>> $ cat hpl-8cpu-test.sge
>>> #!/bin/bash
>>> #
>>> #$ -N HPL_8cpu_GB
>>> #$ -pe orte 8
>>> #$ -cwd
>>> #$ -j y
>>> #$ -S /bin/bash
>>> #$ -V
>>> #
>>> cd /home/admin/hpl-2.0
>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>> # pdsh -a ps ax --width=200|grep hpl
>>> node0002: 18901 ? S 0:00 /opt/openmpi-gcc/bin/mpirun -v -np 8 --host node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18902 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18903 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18904 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18905 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18906 ? RLl 0:29 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18907 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18908 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>> node0002: 18909 ? RLl 0:28 ./bin/goto-openmpi-gcc/xhpl
>>> Any hint on how to debug this situation?
>>> Also, if I have 2 IB ports in each node, with IB bonding done,
>>> will Open MPI automatically benefit from the double bandwidth?
>>> Thanks a lot.
>>> Best Regards,
>>> PN
>>> 2009/4/1 Rolf Vandevaart <Rolf.Vandevaart_at_[hidden]>
>>> On 03/31/09 11:43, PN wrote:
>>> Dear all,
>>> I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2
>>> I have 2 compute nodes for testing, each node has a single
>>> quad-core CPU.
>>> Here is my submission script and PE config:
>>> $ cat hpl-8cpu.sge
>>> #!/bin/bash
>>> #
>>> #$ -N HPL_8cpu_IB
>>> #$ -pe mpi-fu 8
>>> #$ -cwd
>>> #$ -j y
>>> #$ -S /bin/bash
>>> #$ -V
>>> #
>>> cd /home/admin/hpl-2.0
>>> # For IB
>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile
>>> $TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl
>>> I've tested that the mpirun command runs correctly on the
>>> command line.
>>> $ qconf -sp mpi-fu
>>> pe_name mpi-fu
>>> slots 8
>>> user_lists NONE
>>> xuser_lists NONE
>>> start_proc_args /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
>>> stop_proc_args /opt/sge/mpi/stopmpi.sh
>>> allocation_rule $fill_up
>>> control_slaves TRUE
>>> job_is_first_task FALSE
>>> urgency_slots min
>>> accounting_summary TRUE
>>> I've checked $TMPDIR/machines after submitting the job; it was
>>> correct.
>>> node0002
>>> node0002
>>> node0002
>>> node0002
>>> node0001
>>> node0001
>>> node0001
>>> node0001
>>> However, I found that if I explicitly specify
>>> "-machinefile $TMPDIR/machines", all 8 MPI processes were spawned
>>> within a single node, i.e. node0002.
>>> However, if I omit "-machinefile $TMPDIR/machines" from the mpirun
>>> line, i.e.
>>> /opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl
>>> the MPI processes start correctly: 4 processes on node0001 and
>>> 4 processes on node0002.
>>> Is this normal behaviour of Open MPI?
>>> I just tried it both ways and I got the same result both times:
>>> the processes are split between the nodes. Perhaps, to be extra
>>> sure, you can just run hostname? And for what it is worth, as you
>>> have seen, you do not need to specify a machines file; Open MPI
>>> will use the nodes that were allocated by SGE. You can also change
>>> your parallel queue to not run any scripts, like this (a fuller
>>> sketch of the modified PE follows below):
>>> start_proc_args /bin/true
>>> stop_proc_args /bin/true
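>>>
>>> A sketch of what the modified PE could then look like (editable
>>> with "qconf -mp mpi-fu"; every field other than start_proc_args
>>> and stop_proc_args is unchanged from the listing above):
>>>
>>> pe_name            mpi-fu
>>> slots              8
>>> user_lists         NONE
>>> xuser_lists        NONE
>>> start_proc_args    /bin/true
>>> stop_proc_args     /bin/true
>>> allocation_rule    $fill_up
>>> control_slaves     TRUE
>>> job_is_first_task  FALSE
>>> urgency_slots      min
>>> accounting_summary TRUE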
>>> Also, I wondered: if I have an IB interface, for example if the IB
>>> hostnames become node0001-clust and node0002-clust, will Open MPI
>>> automatically use the IB interface?
>>> Yes, it should use the IB interface.
>>> How about if I have 2 IB ports in each node, with IB bonding
>>> done, will Open MPI automatically benefit from the double
>>> bandwidth?
>>> Thanks a lot.
>>> Best Regards,
>>> PN
>>>
>>> --
>>> =========================
>>> rolf.vandevaart_at_[hidden]
>>> 781-442-3043
>>> =========================
>>
>>
>> --
>>
>> =========================
>> rolf.vandevaart_at_[hidden]
>> 781-442-3043
>> =========================
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users