[Sean] This is correct.
There are several possible sources of trouble, if I have understood your situation correctly. Our bproc support is somewhat limited at the moment, and you may be running into one of those limitations. We currently focus our bproc support on the configuration here at Los Alamos National Lab because (a) that is where the bproc-related developers work, and (b) it is the only regular bproc test environment we have. We don't normally use bproc in combination with hostfiles, so I'm not sure whether there is a problem in that combination. I can investigate that a little later this week.
[Sean] If it is helpful, running 'export NODES=-1; mpirun -np 1 hostname' exhibits identical behaviour.
Similarly, we require that all the nodes being used be accessible via the same launch environment. It sounds like we may be able to launch processes on your head node via rsh, but not necessarily via bproc. You might check whether the head node allows bproc-based process launch (I know ours don't – all jobs run solely on the compute nodes, and I believe that is generally the case). We don't currently support mixed environments, and I honestly don't expect that to change anytime soon.
[Sean] I'm working through the strace output to follow the progression on the head node. It looks like mpirun consults '/bpfs/self' and determines that the request is to be run on the local machine so it fork/execs 'orted' which then runs 'hostname'. 'mpirun' didn't consult '/bpfs' or utilize 'rsh' after the determination to run on the local machine was made. When the 'hostname' command completes, 'orted' receives the SIGCHLD signal, performs some work and then both 'mpirun' and 'orted' go into what appears to be a poll() waiting for events.
Hope that helps at least a little.
[Sean] I appreciate the help. We are running processes on the head node because the head node is the only node which can access external resources (storage devices).
On 6/11/07 1:04 PM, "Kelley, Sean" <Sean.Kelley@solers.com> wrote:
I forgot to add that we are using 'bproc'. Launching processes on the compute nodes using bproc works well, I'm not sure if bproc is involved when processes are launched on the local node.
From: email@example.com on behalf of Kelley, Sean
Sent: Mon 6/11/2007 2:07 PM
Subject: [OMPI users] mpirun hanging when processes started on head node
We are running the OFED 1.2rc4 distribution containing openmpi-1.2.2 on a Red Hat EL4 U4 system with Scyld Clusterware 4.1. The hardware configuration consists of a Dell 2950 as the head node and 3 Dell 1950 blades as compute nodes, using Cisco Topspin InfiniBand HCAs and switches for the interconnect.
When we use 'mpirun' from the OFED/Open MPI distribution to start processes on the compute nodes, everything works correctly. However, when we try to start processes on the head node, the processes appear to run correctly but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' file contains detailed information from running the following command:
mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
where 'hostfile1' contains the following:
-1 slots=2 max_slots=2
The 'run.log' is the output of the above line. The 'strace.out.0' is the result of 'strace -f' on the mpirun process (and the 'hostname' child process since mpirun simply forks the local processes). The child process (pid 23415 in this case) runs to completion and exits successfully. The parent process (mpirun) doesn't appear to recognize that the child has completed and hangs until killed (with a ^c).
Additionally, when we run a set of processes which span the headnode and the compute nodes, the processes on the head node complete successfully, but the processes on the compute nodes do not appear to start. mpirun again appears to hang.
Do I have a configuration error, or is there a problem that I have encountered? Thank you in advance for your assistance or suggestions.
Sean M. Kelley