Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Torque 2.4.3 fails with OpenMPI 1.3.4; no startup at all
From: Johann Knechtel (s9158897_at_[hidden])
Date: 2009-12-20 07:28:39


Ralph, thank you very much for your input! The parameter "-mca plm rsh"
did it. I am just curious: what is the reason for that behavior?
You can find the complete output of the different commands embedded in
your mail below. The first line of each output reports the successful
load of the OMPI environment; we use the modules package on our cluster.
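
For reference, a minimal sketch of how the workaround could be applied inside a Torque job script. The resource request and the use of the OMPI_MCA_plm environment variable (instead of the command-line flag) are assumptions for illustration and were not tested in this thread:

```shell
#!/bin/sh
# Sketch of a Torque job script using the rsh launcher workaround.
# Assumption: the resource request below is only an example.
#PBS -l nodes=2:ppn=1

# Open MPI reads MCA parameters from OMPI_MCA_<name> environment
# variables, so this should be equivalent to "-mca plm rsh".
export OMPI_MCA_plm=rsh

# Torque starts the script in $HOME; go back to the submission directory.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Same check as in this thread: one process per node, print the orted path.
if command -v mpirun >/dev/null 2>&1; then
    mpirun -pernode which orted || echo "mpirun exited with status $?" >&2
else
    echo "mpirun not found in PATH" >&2
fi
```

This only sidesteps the tm launcher; it does not explain why the tm launch fails.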

Greetings
Johann

Ralph Castain wrote:
> Sorry - hit "send" and then saw the version sitting right there in the subject! Doh...
>
> First, let's try verifying what components are actually getting used. Run this:
>
> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
>
 OpenMPI with PPU-GCC was loaded
[node1:00706] mca: base: components_open: Looking for plm components
[node1:00706] mca: base: components_open: opening plm components
[node1:00706] mca: base: components_open: found loaded component rsh
[node1:00706] mca: base: components_open: component rsh has no register function
[node1:00706] mca: base: components_open: component rsh open function successful
[node1:00706] mca: base: components_open: found loaded component slurm
[node1:00706] mca: base: components_open: component slurm has no register function
[node1:00706] mca: base: components_open: component slurm open function successful
[node1:00706] mca: base: components_open: found loaded component tm
[node1:00706] mca: base: components_open: component tm has no register function
[node1:00706] mca: base: components_open: component tm open function successful
[node1:00706] mca:base:select: Auto-selecting plm components
[node1:00706] mca:base:select:( plm) Querying component [rsh]
[node1:00706] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node1:00706] mca:base:select:( plm) Querying component [slurm]
[node1:00706] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[node1:00706] mca:base:select:( plm) Querying component [tm]
[node1:00706] mca:base:select:( plm) Query of component [tm] set priority to 75
[node1:00706] mca:base:select:( plm) Selected component [tm]
[node1:00706] mca: base: close: component rsh closed
[node1:00706] mca: base: close: unloading component rsh
[node1:00706] mca: base: close: component slurm closed
[node1:00706] mca: base: close: unloading component slurm
[node1:00706] mca: base: components_open: Looking for ras components
[node1:00706] mca: base: components_open: opening ras components
[node1:00706] mca: base: components_open: found loaded component slurm
[node1:00706] mca: base: components_open: component slurm has no register function
[node1:00706] mca: base: components_open: component slurm open function successful
[node1:00706] mca: base: components_open: found loaded component tm
[node1:00706] mca: base: components_open: component tm has no register function
[node1:00706] mca: base: components_open: component tm open function successful
[node1:00706] mca:base:select: Auto-selecting ras components
[node1:00706] mca:base:select:( ras) Querying component [slurm]
[node1:00706] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
[node1:00706] mca:base:select:( ras) Querying component [tm]
[node1:00706] mca:base:select:( ras) Query of component [tm] set priority to 100
[node1:00706] mca:base:select:( ras) Selected component [tm]
[node1:00706] mca: base: close: unloading component slurm
/opt/openmpi_1.3.4_gcc_ppc/bin/orted
[node1:00706] mca: base: close: unloading component tm
[node1:00706] mca: base: close: component tm closed
[node1:00706] mca: base: close: unloading component tm
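
A note on reading the output above: as far as I can tell, each component reports a priority during selection and the highest one is chosen, which would explain why tm (75) is picked over rsh (10) for plm. Below is a small sketch for condensing such verbose output. The name summarize_selection is a hypothetical helper, not an Open MPI tool, and it assumes the log lines have not been wrapped by the mail client:

```shell
#!/bin/sh
# Condense "*_base_verbose" output to: framework, component, and either the
# reported priority or a SELECTED marker.
# summarize_selection is a hypothetical helper name (assumption).
summarize_selection() {
    sed -n \
        -e 's/.*select:( *\([a-z0-9_]*\)) Query of component \[\([^]]*\)] set priority to \([0-9]*\).*/\1 \2 priority=\3/p' \
        -e 's/.*select:( *\([a-z0-9_]*\)) Selected component \[\([^]]*\)].*/\1 \2 SELECTED/p'
}

# Example with (unwrapped) lines taken from the output above.
summarize_selection <<'EOF'
[node1:00706] mca:base:select:( plm) Querying component [rsh]
[node1:00706] mca:base:select:( plm) Query of component [rsh] set priority to 10
[node1:00706] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
[node1:00706] mca:base:select:( plm) Query of component [tm] set priority to 75
[node1:00706] mca:base:select:( plm) Selected component [tm]
EOF
```

In practice the input could come from something like "mpirun -n 1 -mca plm_base_verbose 10 which orted 2>&1 | summarize_selection", assuming the verbose lines go to stderr.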

> Then get an allocation and run
>
> mpirun -pernode which orted
>
 OpenMPI with PPU-GCC was loaded
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        node2 - daemon did not report back when launched
> and
>
> mpirun -pernode -mca plm rsh which orted
>
 OpenMPI with PPU-GCC was loaded
/opt/openmpi_1.3.4_gcc_ppc/bin/orted
/opt/openmpi_1.3.4_gcc_ppc/bin/orted
> and see what happens
>
>
> On Dec 19, 2009, at 5:17 PM, Ralph Castain wrote:
>
>
>> That error has nothing to do with Torque. The cmd line is simply wrong - you are specifying a btl that doesn't exist.
>>
>> It should work just fine with
>>
>> mpirun -n X hellocluster
>>
>> Nothing else is required. When you run
>>
>> mpirun --hostfile nodefile hellocluster
>>
>> OMPI will still use Torque to do the launch - it just gets the list of nodes from your nodefile instead of the PBS_NODEFILE.
>>
>> You may have stated it below, but I can't find it: what version of OMPI are you using? Are there additional versions installed on your system?
>>
>>
>> On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:
>>
>>
>>> Ah, and do I have to take care of the MCA ras plugin on my own?
>>> I tried something like
>>>
>>>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir
>>>> /var/spool/torque/aux/ hellocluster
>>>>
>>> but apart from the fact that it did not work ([node3:22726] mca: base:
>>> components_open: component pml / csum open function failed), it also does
>>> not look very convenient to me...
>>>
>>> Greetings
>>> Johann
>>>
>>>
>>> Johann Knechtel wrote:
>>>
>>>> Hi Ralph and all,
>>>>
>>>> Yes, the OMPI libs and binaries are in the same place on all nodes; I
>>>> packaged OMPI via checkinstall and installed the deb on the nodes via pdsh.
>>>> The LD_LIBRARY_PATH is set; I can run, for example, "mpirun --hostfile
>>>> nodefile hellocluster" without problems. But when started via a Torque
>>>> job, it does not work. I am right in assuming that the LD_LIBRARY_PATH
>>>> will be exported by Torque to the daemonized mpirun processes, am I not?
>>>> The Torque libs are all in the same place; I installed the package shell
>>>> scripts via pdsh.
>>>>
>>>> Greetings,
>>>> Johann
>>>>
>>>>
>>>> Ralph Castain wrote:
>>>>
>>>>
>>>>> Are the OMPI libraries and binaries installed at the same place on all the remote nodes?
>>>>>
>>>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>>>
>>>>> Are the Torque libs available in the same place on the remote nodes? Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>>>
>>>>> These are the most typical problems.
>>>>>
>>>>>
>>>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Your help with the following Torque integration issue would be much
>>>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>>>> node, it simply does not start up on the nodes.
>>>>>> The Torque job fails with the following:
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>> launch so we are aborting.
>>>>>>>
>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>
>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>> that caused that situation.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>>> the "orte-clean" tool for assistance.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> node2 - daemon did not report back when launched
>>>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> I am quite confident about the compilation and installation of torque
>>>>>> and openmpi, since it runs without error on one node:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>>>> OpenMPI with PPU-GCC was loaded
>>>>>>> Process 1 on node1 out of 2
>>>>>>> Process 0 on node1 out of 2
>>>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> The called program is a simple helloworld which runs without errors
>>>>>> when started manually on the nodes; it also runs without errors when
>>>>>> using a hostfile to daemonize on more than one node. I already tried to
>>>>>> compile Open MPI with the default prefix:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> The called helloworld was also compiled both with and without -rpath,
>>>>>> just to rule out any linked-library issue.
>>>>>>
>>>>>> Now, the interesting fact is the following: on one node I compiled a
>>>>>> kernel with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the
>>>>>> pbs, mpi and helloworld daemons. As already mentioned at the
>>>>>> beginning, from this I concluded that the MPI startup within Torque is
>>>>>> not working for me.
>>>>>> Please request any further logs you want to review; I did not
>>>>>> want to make the mail too large at first.
>>>>>> Any ideas?
>>>>>>
>>>>>> Greetings,
>>>>>> Johann
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> users_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>
>>>>>>
>>>>>>