Could you please clarify something? I’m a little confused by your comments about where things are running. I’m assuming that you mean everything works fine if you type the mpirun command on the head node and just let it launch on your compute nodes – that the problems only occur when you specifically tell mpirun you want processes on the head node as well (or exclusively). Is that correct?
There are several possible sources of trouble, if I have understood your situation correctly. Our bproc support is somewhat limited at the moment – you may be encountering one of those limits. We currently have bproc support focused on the configuration here at Los Alamos National Lab as (a) that is where the bproc-related developers are working, and (b) it is the only regular test environment we have to work with for bproc. We don’t normally use bproc in combination with hostfiles, so I’m not sure if there is a problem in that combination. I can investigate that a little later this week.
Similarly, we require that all of the nodes being used be accessible via the same launch environment. It sounds like we may be able to launch processes on your head node via rsh, but not necessarily via bproc. You might check whether the head node allows bproc-based process launch (I know ours don’t – all jobs run solely on the compute nodes, and I believe that is generally the case). We don’t currently support mixed launch environments, and I honestly don’t expect that to change anytime soon.
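As a quick sanity check – and this is only a sketch, assuming the usual bproc numbering where the compute nodes are 0 through N-1 and -1 refers to the master/head node under Scyld – you could try a hostfile that names only the compute nodes and see whether mpirun then terminates cleanly:

```
# Hypothetical hostfile listing only the three compute nodes
# (the node numbers here are assumptions; adjust to whatever
#  bpstat reports on your cluster)
0 slots=2 max_slots=2
1 slots=2 max_slots=2
2 slots=2 max_slots=2
```

If that runs and exits normally, it would point at the head-node launch path rather than at the hostfile handling itself.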
Hope that helps at least a little.
On 6/11/07 1:04 PM, "Kelley, Sean" <Sean.Kelley@solers.com> wrote:
I forgot to add that we are using 'bproc'. Launching processes on the compute nodes using bproc works well, but I'm not sure whether bproc is involved when processes are launched on the local node.
From: firstname.lastname@example.org on behalf of Kelley, Sean
Sent: Mon 6/11/2007 2:07 PM
Subject: [OMPI users] mpirun hanging when processes started on head node
We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a Red Hat EL4 U4 system with Scyld Clusterware 4.1. The hardware configuration consists of a Dell 2950 as the head node and 3 Dell 1950 blades as compute nodes, using Cisco Topspin InfiniBand HCAs and switches for the interconnect.
When we use 'mpirun' from the OFED/Open MPI distribution to start processes on the compute nodes, everything works correctly. However, when we try to start processes on the head node, the processes appear to run correctly but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' file contains detailed information from running the following command:
mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
where 'hostfile1' contains the following:
-1 slots=2 max_slots=2
The 'run.log' is the output of the above line. The 'strace.out.0' is the result of 'strace -f' on the mpirun process (and the 'hostname' child process since mpirun simply forks the local processes). The child process (pid 23415 in this case) runs to completion and exits successfully. The parent process (mpirun) doesn't appear to recognize that the child has completed and hangs until killed (with a ^c).
Additionally, when we run a set of processes that spans the head node and the compute nodes, the processes on the head node complete successfully, but the processes on the compute nodes do not appear to start; mpirun again appears to hang.
Do I have a configuration error, or have I run into a genuine problem? Thank you in advance for your assistance and suggestions.
Sean M. Kelley