Subject: Re: [OMPI users] newbie: Submitting Open MPI jobs to SGE (`qsh -pe orte 4` fails)
From: Pierre Lindenbaum (pierre.lindenbaum_at_[hidden])
Date: 2013-02-11 06:26:17


> This is a good sign, as it tries to use `qrsh -inherit ...` already. Can you confirm the following settings:
>
> $ qconf -sp orte
> ...
> control_slaves TRUE
>
> $ qconf -sq all.q
> ...
> shell_start_mode unix_behavior
>
> -- Reuti

    qconf -sp orte

    pe_name orte
    slots 448
    user_lists NONE
    xuser_lists NONE
    start_proc_args /bin/true
    stop_proc_args /bin/true
    allocation_rule $round_robin
    control_slaves FALSE
    job_is_first_task TRUE
    urgency_slots min
    accounting_summary FALSE

and

    qconf -sq all.q | grep start_
    shell_start_mode posix_compliant

I've edited the parallel environment configuration with `qconf -mp
orte`, changing `control_slaves` to TRUE (so that SGE lets Open MPI
start its remote processes via `qrsh -inherit`):

    # qconf -sp orte
    pe_name orte
    slots 448
    user_lists NONE
    xuser_lists NONE
    start_proc_args /bin/true
    stop_proc_args /bin/true
    allocation_rule $round_robin
    control_slaves TRUE
    job_is_first_task TRUE
    urgency_slots min
    accounting_summary FALSE

and I've changed `shell_start_mode` from `posix_compliant` to
`unix_behavior` using `qconf -mconf`. (However, `shell_start_mode` is
still listed as `posix_compliant`.)
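
If I read the man pages right, `shell_start_mode` is a queue parameter
(queue_conf) rather than part of the global configuration that
`qconf -mconf` edits, so presumably it has to be changed on the queue
itself; a minimal sketch, assuming the queue is `all.q` as above:

    # edit the queue configuration interactively and set:
    #   shell_start_mode  unix_behavior
    qconf -mq all.q

    # verify the change
    qconf -sq all.q | grep shell_start_mode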

Now `qsh -pe orte 4` works:

    qsh -pe orte 4
    Your job 84581 ("INTERACTIVE") has been submitted
    waiting for interactive job to be scheduled ...
    Your interactive job 84581 has been successfully scheduled.

(Should I run that command before running any new mpirun command?)

When invoking either:

    qsub -cwd -pe orte 7 with-a-shell.sh

or

    qrsh -cwd -pe orte 100 /commun/data/packages/openmpi/bin/mpirun \
        /path/to/a.out arg1 arg2 arg3 ....

that works too! Thank you! :-)
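
For reference, a minimal sketch of what a wrapper like
`with-a-shell.sh` can look like (hypothetical; the real script uses
the actual paths and arguments):

    #!/bin/sh
    # Under tight integration (control_slaves TRUE), mpirun reads the
    # SGE allocation from $PE_HOSTFILE itself, so no -np is needed:
    # it starts one process per granted slot.
    /commun/data/packages/openmpi/bin/mpirun /path/to/a.out arg1 arg2 arg3

`qstat -f` shows job 84598 spread over the seven nodes: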

    queuename                      qtype resv/used/tot. load_avg arch          states
    ---------------------------------------------------------------------------------
    all.q@node01                   BIP   0/15/64        2.76     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
    ---------------------------------------------------------------------------------
    all.q@node02                   BIP   0/14/64        3.89     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
    ---------------------------------------------------------------------------------
    all.q@node03                   BIP   0/14/64        3.23     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
    ---------------------------------------------------------------------------------
    all.q@node04                   BIP   0/14/64        3.68     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
    ---------------------------------------------------------------------------------
    all.q@node05                   BIP   0/15/64        2.91     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    15
    ---------------------------------------------------------------------------------
    all.q@node06                   BIP   0/14/64        3.91     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
    ---------------------------------------------------------------------------------
    all.q@node07                   BIP   0/14/64        3.79     lx24-amd64
       84598 0.55500 mpirun     lindenb      r     02/11/2013 12:03:36    14
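
(The 100 requested slots are all accounted for: 15 + 14 + 14 + 14 +
15 + 14 + 14 = 100, spread over the seven nodes as `allocation_rule
$round_robin` asks.)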

OK, my first Open MPI program works. But as far as I can see, it is
much faster when invoked directly on the master node (~3 min 32 s of
wall clock) than when submitted through SGE (~7 min 45 s; note that
`time` there only measures the local `qrsh` client, which is why it
reports almost no CPU):

    time /commun/data/packages/openmpi/bin/mpirun -np 7 \
        /path/to/a.out arg1 arg2 arg3 ....
    670.985u 64.929s 3:32.36 346.5% 0+0k 16322112+6560io 32pf+0w

    time qrsh -cwd -pe orte 7 /commun/data/packages/openmpi/bin/mpirun \
        /path/to/a.out arg1 arg2 arg3 ....
    0.023u 0.036s 7:45.05 0.0% 0+0k 1496+0io 1pf+0w

I'm going to investigate this... :-)
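
A first thing to check might be where the ranks actually land;
assuming this Open MPI build understands `--display-map`, something
like:

    # print the process map that mpirun derives from the SGE allocation
    qrsh -cwd -pe orte 7 /commun/data/packages/openmpi/bin/mpirun \
        --display-map /path/to/a.out arg1 arg2 arg3 ....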

Thank you again

Pierre