Open MPI User's Mailing List Archives

From: Frank (openmpi-user_at_[hidden])
Date: 2006-03-20 02:11:32


Hi Brian,

this is the full -d output I get when running vhone with mpirun on the
Xgrid. The output is cut off because of the reported "hang".

[powerbook:/usr/local/MVH-1] admin% mpirun -d -np 4 ./vhone
[powerbook:03138] procdir: (null)
[powerbook:03138] jobdir: (null)
[powerbook:03138] unidir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe
[powerbook:03138] top: openmpi-sessions-admin_at_powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] connect_uni: contact info read
[powerbook:03138] connect_uni: connection not allowed
[powerbook:03138] [0,0,0] setting up session dir with
[powerbook:03138] tmpdir /tmp
[powerbook:03138] universe default-universe-3138
[powerbook:03138] user admin
[powerbook:03138] host powerbook
[powerbook:03138] jobid 0
[powerbook:03138] procid 0
[powerbook:03138] procdir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3138/0/0
[powerbook:03138] jobdir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3138/0
[powerbook:03138] unidir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3138
[powerbook:03138] top: openmpi-sessions-admin_at_powerbook_0
[powerbook:03138] tmp: /tmp
[powerbook:03138] [0,0,0] contact_file
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3138/universe-setup.txt
[powerbook:03138] [0,0,0] wrote setup file
[powerbook:03138] spawn: in job_state_callback(jobid = 1, state = 0x1)
[ibi:00717] [0,1,2] setting up session dir with
[ibi:00717] universe default-universe
[ibi:00717] user nobody
[ibi:00717] host xgrid-node-2
[ibi:00717] jobid 1
[ibi:00717] procid 2
[ibi:00717] procdir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-2_0/default-universe/1/2
[ibi:00717] jobdir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-2_0/default-universe/1
[ibi:00717] unidir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-2_0/default-universe
[ibi:00717] top: openmpi-sessions-nobody_at_xgrid-node-2_0
[ibi:00717] tmp: /tmp
[powerbook:03147] [0,1,0] setting up session dir with
[powerbook:03147] universe default-universe
[powerbook:03147] user nobody
[powerbook:03147] host xgrid-node-0
[powerbook:03147] jobid 1
[powerbook:03147] procid 0
[powerbook:03147] procdir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-0_0/default-universe/1/0
[powerbook:03147] jobdir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-0_0/default-universe/1
[powerbook:03147] unidir:
/tmp/openmpi-sessions-nobody_at_xgrid-node-0_0/default-universe
[powerbook:03147] top: openmpi-sessions-nobody_at_xgrid-node-0_0
[powerbook:03147] tmp: /tmp
^Z
Suspended
[powerbook:/usr/local/MVH-1] admin%

I waited quite a while before canceling the jobs, so this is not due to
a low priority of the jobs submitted to the Xgrid (the Xgrid is told to
always accept jobs and run them). Comparing this with the output of a
non-Xgrid mpirun (jobs submitted via ssh), the next line of -d output
I was waiting for is another spawn message, followed by the messages
that ompi_mpi_init has completed. While it "hangs", adding or removing
an Xgrid node is still recognized, but the initialization never
finishes.
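
The comparison described above can be done mechanically. The following is
a minimal sketch, not part of Open MPI: the function name and the regular
expressions are hypothetical and based only on the -d lines shown in this
message. It assumes the debug output was saved to a text file, and that
the middle number of a process name such as [0,1,2] is the jobid, with
jobid 0 holding mpirun and the orted daemons (as the logs here suggest).
It lists the application processes that set up a session dir but never
printed "ompi_mpi_init completed".

```python
import re

# Matches e.g. "[ibi:00717] [0,1,2] setting up session dir with"
SETUP_RE = re.compile(r"\[(\d+,\d+,\d+)\] setting up session dir with")
# Matches e.g. "[ibi:00740] [0,1,1] ompi_mpi_init completed"
INIT_RE = re.compile(r"\[(\d+,\d+,\d+)\] ompi_mpi_init completed")


def stalled_processes(log_text):
    """Return process names that started but never finished MPI init.

    Names are returned in the order they first appear in the log.
    Processes with jobid 0 are skipped (assumption: these are mpirun
    and the orted daemons, which never report ompi_mpi_init).
    """
    started = []       # process names in first-seen order
    finished = set()   # process names that reported init completion
    for line in log_text.splitlines():
        match = SETUP_RE.search(line)
        if match:
            if match.group(1) not in started:
                started.append(match.group(1))
            continue
        match = INIT_RE.search(line)
        if match:
            finished.add(match.group(1))
    return [name for name in started
            if name.split(",")[1] != "0" and name not in finished]
```

With the output saved to a file (the file name is up to you), calling
stalled_processes(open(path).read()) gives the list of processes to look
at. Comparing the length of that list with the -np value also shows
whether all requested processes were launched at all.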

For comparison, here is the -d output I get when submitting the same job
via ssh:

[powerbook:/usr/local/MVH-1] admin% mpirun -d -hostfile machinefile -np
4 ./vhone
[powerbook:03270] procdir: (null)
[powerbook:03270] jobdir: (null)
[powerbook:03270] unidir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe
[powerbook:03270] top: openmpi-sessions-admin_at_powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] connect_uni: contact info read
[powerbook:03270] connect_uni: connection not allowed
[powerbook:03270] [0,0,0] setting up session dir with
[powerbook:03270] tmpdir /tmp
[powerbook:03270] universe default-universe-3270
[powerbook:03270] user admin
[powerbook:03270] host powerbook
[powerbook:03270] jobid 0
[powerbook:03270] procid 0
[powerbook:03270] procdir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3270/0/0
[powerbook:03270] jobdir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3270/0
[powerbook:03270] unidir:
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3270
[powerbook:03270] top: openmpi-sessions-admin_at_powerbook_0
[powerbook:03270] tmp: /tmp
[powerbook:03270] [0,0,0] contact_file
/tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-3270/universe-setup.txt
[powerbook:03270] [0,0,0] wrote setup file
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x1)
[powerbook:03270] pls:rsh: local csh: 1, local bash: 0
[powerbook:03270] pls:rsh: assuming same remote shell as local shell
[powerbook:03270] pls:rsh: remote csh: 1, remote bash: 0
[powerbook:03270] pls:rsh: final template argv:
[powerbook:03270] pls:rsh: ssh <template> orted --debug --bootproxy
1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template>
--universe admin_at_powerbook:default-universe-3270 --nsreplica
"0.0.0;tcp://192.168.178.23:50205" --gprreplica
"0.0.0;tcp://192.168.178.23:50205" --mpi-call-yield 0
[powerbook:03270] pls:rsh: launching on node powerbook.local
[powerbook:03270] pls:rsh: oversubscribed -- setting mpi_yield_when_idle
to 1 (1 2)
[powerbook:03270] pls:rsh: powerbook.local is a LOCAL node
[powerbook:03270] pls:rsh: executing: orted --debug --bootproxy 1 --name
0.0.1 --num_procs 3 --vpid_start 0 --nodename powerbook.local --universe
admin_at_powerbook:default-universe-3270 --nsreplica
"0.0.0;tcp://192.168.178.23:50205" --gprreplica
"0.0.0;tcp://192.168.178.23:50205" --mpi-call-yield 1
[powerbook:03271] [0,0,1] setting up session dir with
[powerbook:03271] universe default-universe-3270
[powerbook:03271] user admin
[powerbook:03271] host powerbook.local
[powerbook:03271] jobid 0
[powerbook:03271] procid 1
[powerbook:03271] procdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/0/1
[powerbook:03271] jobdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/0
[powerbook:03271] unidir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270
[powerbook:03271] top: openmpi-sessions-admin_at_powerbook.local_0
[powerbook:03271] tmp: /tmp
[powerbook:03270] pls:rsh: launching on node ibi.local
[powerbook:03270] pls:rsh: oversubscribed -- setting mpi_yield_when_idle
to 1 (1 2)
[powerbook:03270] pls:rsh: ibi.local is a REMOTE node
[powerbook:03270] pls:rsh: executing: ssh ibi.local orted --debug
--bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename
ibi.local --universe admin_at_powerbook:default-universe-3270 --nsreplica
"0.0.0;tcp://192.168.178.23:50205" --gprreplica
"0.0.0;tcp://192.168.178.23:50205" --mpi-call-yield 1
[ibi:00734] [0,0,2] setting up session dir with
[ibi:00734] universe default-universe-3270
[ibi:00734] user admin
[ibi:00734] host ibi.local
[ibi:00734] jobid 0
[ibi:00734] procid 2
[ibi:00734] procdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/0/2
[ibi:00734] jobdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/0
[ibi:00734] unidir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270
[ibi:00734] top: openmpi-sessions-admin_at_ibi.local_0
[ibi:00734] tmp: /tmp
[powerbook:03279] [0,1,0] setting up session dir with
[powerbook:03279] universe default-universe-3270
[powerbook:03279] user admin
[powerbook:03279] host powerbook.local
[powerbook:03279] jobid 1
[powerbook:03279] procid 0
[powerbook:03279] procdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/1/0
[powerbook:03279] jobdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/1
[powerbook:03279] unidir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270
[powerbook:03279] top: openmpi-sessions-admin_at_powerbook.local_0
[powerbook:03279] tmp: /tmp
[powerbook:03276] [0,1,2] setting up session dir with
[powerbook:03276] universe default-universe-3270
[powerbook:03276] user admin
[powerbook:03276] host powerbook.local
[powerbook:03276] jobid 1
[powerbook:03276] procid 2
[powerbook:03276] procdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/1/2
[powerbook:03276] jobdir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270/1
[powerbook:03276] unidir:
/tmp/openmpi-sessions-admin_at_powerbook.local_0/default-universe-3270
[powerbook:03276] top: openmpi-sessions-admin_at_powerbook.local_0
[powerbook:03276] tmp: /tmp
[ibi:00740] [0,1,1] setting up session dir with
[ibi:00740] universe default-universe-3270
[ibi:00740] user admin
[ibi:00740] host ibi.local
[ibi:00740] jobid 1
[ibi:00740] procid 1
[ibi:00740] procdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/1/1
[ibi:00740] jobdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/1
[ibi:00740] unidir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270
[ibi:00740] top: openmpi-sessions-admin_at_ibi.local_0
[ibi:00740] tmp: /tmp
[ibi:00737] [0,1,3] setting up session dir with
[ibi:00737] universe default-universe-3270
[ibi:00737] user admin
[ibi:00737] host ibi.local
[ibi:00737] jobid 1
[ibi:00737] procid 3
[ibi:00737] procdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/1/3
[ibi:00737] jobdir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270/1
[ibi:00737] unidir:
/tmp/openmpi-sessions-admin_at_ibi.local_0/default-universe-3270
[ibi:00737] top: openmpi-sessions-admin_at_ibi.local_0
[ibi:00737] tmp: /tmp
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x3)
[powerbook:03270] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_gate = 0
  MPIR_debug_state = 1
  MPIR_acquired_pre_main = 0
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 4
  MPIR_proctable:
    (i, host, exe, pid) = (0, ibi.local, ./vhone, 737)
    (i, host, exe, pid) = (1, powerbook.local, ./vhone, 3276)
    (i, host, exe, pid) = (2, ibi.local, ./vhone, 740)
    (i, host, exe, pid) = (3, powerbook.local, ./vhone, 3279)
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x4)
[powerbook:03276] [0,1,2] ompi_mpi_init completed
[powerbook:03279] [0,1,0] ompi_mpi_init completed
[ibi:00737] [0,1,3] ompi_mpi_init completed
[ibi:00740] [0,1,1] ompi_mpi_init completed
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x7)
[powerbook:03270] spawn: in job_state_callback(jobid = 1, state = 0x8)
[powerbook:03276] sess_dir_finalize: found proc session dir empty - deleting
[powerbook:03276] sess_dir_finalize: job session dir not empty - leaving
[ibi:00740] sess_dir_finalize: found proc session dir empty - deleting
[ibi:00740] sess_dir_finalize: job session dir not empty - leaving
[powerbook:03271] sess_dir_finalize: proc session dir not empty - leaving
[ibi:00734] sess_dir_finalize: proc session dir not empty - leaving
[powerbook:03271] sess_dir_finalize: proc session dir not empty - leaving
[powerbook:03271] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[powerbook:03271] sess_dir_finalize: found proc session dir empty - deleting
[powerbook:03271] sess_dir_finalize: found job session dir empty - deleting
[powerbook:03271] sess_dir_finalize: found univ session dir empty - deleting
[powerbook:03271] sess_dir_finalize: top session dir not empty - leaving
[ibi:00734] sess_dir_finalize: proc session dir not empty - leaving
[ibi:00734] orted: job_state_callback(jobid = 1, state =
ORTE_PROC_STATE_TERMINATED)
[ibi:00734] sess_dir_finalize: found proc session dir empty - deleting
[ibi:00734] sess_dir_finalize: found job session dir empty - deleting
[ibi:00734] sess_dir_finalize: found univ session dir empty - deleting
[ibi:00734] sess_dir_finalize: found top session dir empty - deleting
[powerbook:/usr/local/MVH-1] admin%
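
In the ssh log above, mpirun's job_state_callback lines for jobid 1 run
through the states 0x1, 0x3, 0x4, 0x7 and 0x8, whereas the Xgrid log
stops after 0x1. A small sketch for pulling that sequence out of a saved
log follows; the function name is hypothetical, the pattern is taken only
from the lines in this message, and no meaning is assigned to the
individual state values.

```python
import re

# Matches e.g. "spawn: in job_state_callback(jobid = 1, state = 0x3)".
# The orted lines that print a symbolic state name are ignored on purpose.
STATE_RE = re.compile(
    r"spawn: in job_state_callback\(jobid = (\d+), state = (0x[0-9a-fA-F]+)\)"
)


def job_states(log_text, jobid=1):
    """Return the state values reported for one job, in log order."""
    return [state for jid, state in STATE_RE.findall(log_text)
            if int(jid) == jobid]
```

Running this on both logs and comparing the two lists shows the last
state the Xgrid run reached before it stopped making progress.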

Thanks,
Frank

Brian Barrett wrote:
> It doesn't look like you included the full output (for example, you
> didn't include the mpirun command itself, and it looks like some of
> the later output was truncated). Can you include this information?
>
> Also, it looks like you are trying to run at least 33 processes - it
> might be easier to initially test with a small number of processes and
> work your way up as problems are fixed. But without seeing the mpirun
> command, I can't know for sure.
>
> Brian
>
>
> On Mar 19, 2006, at 7:10 AM, Frank wrote:
>
>> Hi Brian,
>>
>> that's all I get when submitting the job with the -d option to mpirun:
>>
>> [powerbook:00682] procdir: (null)
>> [powerbook:00682] jobdir: (null)
>> [powerbook:00682] unidir:
>> /tmp/openmpi-sessions-admin_at_powerbook_0/default-universe
>> [powerbook:00682] top: openmpi-sessions-admin_at_powerbook_0
>> [powerbook:00682] tmp: /tmp
>> [powerbook:00682] connect_uni: contact info read
>> [powerbook:00682] connect_uni: connection not allowed
>> [powerbook:00682] [0,0,0] setting up session dir with
>> [powerbook:00682] tmpdir /tmp
>> [powerbook:00682] universe default-universe-682
>> [powerbook:00682] user admin
>> [powerbook:00682] host powerbook
>> [powerbook:00682] jobid 0
>> [powerbook:00682] procid 0
>> [powerbook:00682] procdir:
>> /tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-682/0/0
>> [powerbook:00682] jobdir:
>> /tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-682/0
>> [powerbook:00682] unidir:
>> /tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-682
>> [powerbook:00682] top: openmpi-sessions-admin_at_powerbook_0
>> [powerbook:00682] tmp: /tmp
>> [powerbook:00682] [0,0,0] contact_file
>> /tmp/openmpi-sessions-admin_at_powerbook_0/default-universe-682/universe-setup.txt
>>
>> [powerbook:00682] [0,0,0] wrote setup file
>> [powerbook:00682] spawn: in job_state_callback(jobid = 1, state = 0x1)
>> [g4d003.local:19326] [0,1,26] setting up session dir with
>> [g4d003.local:19327] [0,1,33] setting up session dir with
>> [g4d003.local:19326] universe default-universe
>> [g4d003.local:19327] universe default-universe
>> [powerbook:00690] [0,1,17] setting up session dir with
>> [g4d003.local:19326] user nobody
>> [g4d003.local:19327] user nobody
>> [powerbook:00690] universe default-universe
>> [g4d003.local:19326] host xgrid-node-26
>> [g4d003.local:19327] host xgrid-node-33
>> [powerbook:00690] user nobody
>> [g4d003.local:19326] jobid 1
>> [g4d003.local:19327] jobid 1
>> [powerbook:00690] host xgrid-node-17
>> [ibook-g4:14666] [0,1,7] setting up session dir with
>> [g4d003.local:19326] procid 26
>> [g4d003.local:19327] procid 33
>> [powerbook:00690] jobid 1
>> [ibook-g4:14666] universe default-universe
>> [g4d003.local:19326] procdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-26_0/default-universe/1/26
>> [g4d003.local:19327] procdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-33_0/default-universe/1/33
>> [powerbook:00690] procid 17
>> [ibook-g4:14666] user nobody
>> [g4d003.local:19326] jobdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-26_0/default-universe/1
>> [g4d003.local:19327] jobdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-33_0/default-universe/1
>> [powerbook:00690] procdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-17_0/default-universe/1/17
>> [ibook-g4:14666] host xgrid-node-7
>> [g4d003.local:19326] unidir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-26_0/default-universe
>> [g4d003.local:19327] unidir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-33_0/default-universe
>> [powerbook:00690] jobdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-17_0/default-universe/1
>> [ibook-g4:14666] jobid 1
>> [g4d003.local:19326] top: openmpi-sessions-nobody_at_xgrid-node-26_0
>> [g4d003.local:19327] top: openmpi-sessions-nobody_at_xgrid-node-33_0
>> [powerbook:00690] unidir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-17_0/default-universe
>> [ibook-g4:14666] procid 7
>> [g4d003.local:19326] tmp: /tmp
>> [g4d003.local:19327] tmp: /tmp
>> [powerbook:00690] top: openmpi-sessions-nobody_at_xgrid-node-17_0
>> [ibook-g4:14666] procdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-7_0/default-universe/1/7
>> [powerbook:00690] tmp: /tmp
>> [ibook-g4:14666] jobdir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-7_0/default-universe/1
>> [ibook-g4:14666] unidir:
>> /tmp/openmpi-sessions-nobody_at_xgrid-node-7_0/default-universe
>> [ibook-g4:14666] top: openmpi-sessions-nobody_at_xgrid-node-7_0
>> [ibook-g4:14666] tmp: /tmp
>>
>> Is this of any help to you?
>>
>> Thanks,
>> Frank
>>
>> On Mar 18, 2006, at 5:40 AM, Frank wrote:
>>
>>> XGRID_CONTROLLER_HOSTNAME and XGRID_CONTROLLER_PASSWORD are properly
>>> set up, and Open MPI 1.0.1 is installed on all machines (with the
>>> same configure options). When configured with
>>> --prefix=/usr/local/openmpi, my app is submitted to the Xgrid
>>> controller and I can see that copies of my app are "supplied" to the
>>> other machines, too, but the jobs hang and nothing happens (user
>>> nobody has full access to the folder /usr/local/myapp where my app
>>> is run). /usr/local/openmpi/bin and /usr/local/openmpi/lib are added
>>> to the variables PATH and DYLD_LIBRARY_PATH on every machine, too.
>>> I run into this situation no matter which machine my app is started
>>> from. To those who have Open MPI and Xgrid working correctly: which
>>> configure options did you use? The firewall is told not to block any
>>> internal traffic on the subnet. When not using the Xgrid, my app
>>> performs correctly.
>>>
>>> Has anyone any idea concerning this matter?
>>
>> My first guess was going to be the firewall issue, but if you can run
>> without XGrid, that probably isn't the case. Could you try an XGrid
>> run with the -d option to mpirun? That will enable some debugging
>> output that should help determine what is going wrong.
>>
>> Thanks,
>>
>> Brian