Open MPI User's Mailing List Archives

From: Eric Thibodeau (kyron_at_[hidden])
Date: 2006-11-08 12:48:59


Hello everyone,

        I am having a hard time getting OpenMPI (1.1.2) to run in a heterogeneous environment. In short, here is my command line:

orterun --prefix ~/openmpi_x86_64/ -hostfile head -np 2 mandelbrot-mpi_x86_64 10000 400 400 0 : --prefix ~/openmpi_i686/ -hostfile nodes -np `wc -l<nodes ` mandelbrot-mpi_i686 10000 400 400 0

On execution, I get the following error:
bash: /export/home/eric/openmpi_x86_64/bin/orted: cannot execute binary file
bash: /export/home/eric/openmpi_x86_64/bin/orted: cannot execute binary file
[headless:06930] ERROR: A daemon on node thinkbig2 failed to start as expected.
[headless:06930] ERROR: There may be more information available from
[headless:06930] ERROR: the remote shell (see above).
[headless:06930] ERROR: The daemon exited unexpectedly with status 126.
[headless:06930] ERROR: A daemon on node thinkbig12 failed to start as expected.
[headless:06930] ERROR: There may be more information available from
[headless:06930] ERROR: the remote shell (see above).
[headless:06930] ERROR: The daemon exited unexpectedly with status 126.

After this, I have to cancel the execution with CTRL-C.
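
For reference, exit status 126 is what a shell returns when it finds a command but cannot execute it, which is consistent with the x86_64 orted being started on nodes that cannot run it. Below is a minimal sketch for checking what architecture each node in a hostfile reports. It assumes passwordless ssh to every node and that uname is available there; the function names and the REMOTE_SH variable are made up for this illustration and are not part of Open MPI.

```shell
# Print the host names in a hostfile: the first field of every
# non-empty line that does not start with '#'.
hostfile_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Print "<host>: <machine architecture>" for every host in the hostfile.
# REMOTE_SH is a hypothetical override for the remote shell (default: ssh).
check_arch() {
    for h in $(hostfile_hosts "$1"); do
        printf '%s: ' "$h"
        # stdin is redirected so the remote shell cannot swallow input
        ${REMOTE_SH:-ssh} "$h" uname -m </dev/null
    done
}
```

Running `check_arch nodes` and `check_arch head` should show which nodes report i686 and which report x86_64, and therefore which --prefix each one needs.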

I am still investigating this problem, and here is what I have found so far. It would seem that orterun mixes up the executables across the commands. For example, the following command line should essentially return the contents of the host files "head" _and_ "nodes":

First, the contents of the head and nodes files:

eric_at_headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ cat head
headless slots=2
eric_at_headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ cat nodes
thinkbig12
thinkbig2
thinkbig3
thinkbig5
thinkbig6
thinkbig9
thinkbig10
thinkbig11
thinkbig4
thinkbig7

Second, the execution of the command:

orterun --prefix ~/openmpi_x86_64/ -hostfile head -np 2 hostname : --prefix ~/openmpi_i686/ -hostfile nodes -np `wc -l<nodes ` hostname
bash: /export/home/eric/openmpi_x86_64/bin/orted: cannot execute binary file
bash: /export/home/eric/openmpi_x86_64/bin/orted: cannot execute binary file
[headless:07196] ERROR: A daemon on node thinkbig2 failed to start as expected.
[headless:07196] ERROR: There may be more information available from
[headless:07196] ERROR: the remote shell (see above).
[headless:07196] ERROR: The daemon exited unexpectedly with status 126.
[headless:07196] ERROR: A daemon on node thinkbig12 failed to start as expected.
[headless:07196] ERROR: There may be more information available from
[headless:07196] ERROR: the remote shell (see above).
[headless:07196] ERROR: The daemon exited unexpectedly with status 126.
thinkbig10
thinkbig11
thinkbig4
thinkbig7
thinkbig6
thinkbig9
thinkbig5
thinkbig3

Now, if I remove the --prefix for the first part, I get the following:
eric_at_headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ orterun -hostfile head -np 2 hostname : --prefix ~/openmpi_i686/ -hostfile nodes -np `wc -l<nodes ` hostname
thinkbig9
thinkbig2
thinkbig2
thinkbig12
thinkbig12
thinkbig4
thinkbig7
thinkbig10
thinkbig11
thinkbig5
thinkbig3
thinkbig6

Immediately, we notice that "hostname" is never run on the "headless" node but is run twice on thinkbig2 and thinkbig12. This tells me that the first -hostfile is being ignored entirely and we fall back to the round-robin scheme. What am I doing wrong? I would like to read up on this, but the man page and web pages are very superficial on the subject of heterogeneous environments, and I found no documentation on writing an appfile as would be used with --app.

Thanks,

-- 
Eric Thibodeau