
Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Torque 2.4.3 fails with OpenMPI 1.3.4; no startup at all
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-12-19 19:17:01


That error has nothing to do with Torque. The command line is simply wrong - you are specifying a btl that doesn't exist.
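
(For reference - and only as a sketch, since the transports available depend on how your OMPI was built - a valid btl selection names real components, e.g.:

mpirun --mca btl self,sm,tcp -n X hellocluster

but you should not need to set it at all.)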

It should work just fine with

mpirun -n X hellocluster

Nothing else is required. When you run

mpirun --hostfile nodefile hellocluster

OMPI will still use Torque to do the launch - it just gets the list of nodes from your nodefile instead of the PBS_NODEFILE.
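
In other words, a minimal Torque job script is just the plain mpirun call - the following is only a sketch, with the resource request and program name as placeholders for your setup:

#!/bin/bash
#PBS -l nodes=2:ppn=2
cd $PBS_O_WORKDIR
# OMPI picks the node list up from Torque automatically; add -n if you want a specific count
mpirun hellocluster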

You may have stated it below, but I can't find it: what version of OMPI are you using? Are there additional versions installed on your system?
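
A quick way to check which installation is actually being picked up (assuming its bin directory is on your PATH) is something like:

which mpirun
ompi_info | grep "Open MPI:"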

On Dec 19, 2009, at 3:58 PM, Johann Knechtel wrote:

> Ah, and do I have to take care of the MCA ras plugin on my own?
> I tried something like
>> mpirun --mca ras tm --mca btl ras,plm --mca ras_tm_nodefile_dir
>> /var/spool/torque/aux/ hellocluster
> but besides not working ([node3:22726] mca: base:
> components_open: component pml / csum open function failed), it also does
> not look very convenient to me...
>
> Greetings
> Johann
>
>
> Johann Knechtel wrote:
>> Hi Ralph and all,
>>
>> Yes, the OMPI libs and binaries are in the same place on all nodes; I
>> packaged OMPI via checkinstall and installed the deb on the nodes via pdsh.
>> The LD_LIBRARY_PATH is set; I can run, for example, "mpirun --hostfile
>> nodefile hellocluster" without problems. But when started via a Torque job
>> it does not work. Am I correct in assuming that the LD_LIBRARY_PATH
>> will be exported by Torque to the daemonized mpirun processes?
>> The Torque libs are also all in the same place; I installed them via
>> shell scripts and pdsh.
>>
>> Greetings,
>> Johann
>>
>>
>> Ralph Castain wrote:
>>
>>> Are the OMPI libraries and binaries installed at the same place on all the remote nodes?
>>>
>>> Are you setting the LD_LIBRARY_PATH correctly?
>>>
>>> Are the Torque libs available in the same place on the remote nodes? Remember, Torque runs mpirun on a backend node - not on the frontend.
>>>
>>> These are the most typical problems.
>>>
>>>
>>> On Dec 18, 2009, at 3:58 PM, Johann Knechtel wrote:
>>>
>>>
>>>
>>>> Hi all,
>>>>
>>>> Your help with the following Torque integration issue would be much
>>>> appreciated: whenever I try to start an Open MPI job on more than one
>>>> node, it simply does not start up on the nodes.
>>>> The Torque job fails with the following:
>>>>
>>>>
>>>>
>>>>> Fri Dec 18 22:11:07 CET 2009
>>>>> OpenMPI with PPU-GCC was loaded
>>>>> --------------------------------------------------------------------------
>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>> launch so we are aborting.
>>>>>
>>>>> There may be more information reported by the environment (see above).
>>>>>
>>>>> This may be because the daemon was unable to find all the needed shared
>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>> location of the shared libraries on the remote nodes and this will
>>>>> automatically be forwarded to the remote nodes.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>> that caused that situation.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>> below. Additional manual cleanup may be required - please refer to
>>>>> the "orte-clean" tool for assistance.
>>>>> --------------------------------------------------------------------------
>>>>> node2 - daemon did not report back when launched
>>>>> Fri Dec 18 22:12:47 CET 2009
>>>>>
>>>>>
>>>> I am quite confident about the compilation and installation of Torque
>>>> and Open MPI, since everything runs without errors on one node:
>>>>
>>>>
>>>>> Fri Dec 18 22:14:11 CET 2009
>>>>> OpenMPI with PPU-GCC was loaded
>>>>> Process 1 on node1 out of 2
>>>>> Process 0 on node1 out of 2
>>>>> Fri Dec 18 22:14:12 CET 2009
>>>>>
>>>>>
>>>> The called program is a simple helloworld which runs without errors
>>>> when started manually on the nodes; it also runs without errors when
>>>> using a hostfile to launch it on more than one node. I have already
>>>> tried compiling Open MPI with the default prefix:
>>>>
>>>>
>>>>> $ ./configure CC=ppu-gcc CPP=ppu-cpp CXX=ppu-c++ CFLAGS=-m32
>>>>> CXXFLAGS=-m32 FC=ppu-gfortran43 FCFLAGS=-m32 FFLAGS=-m32
>>>>> CCASFLAGS=-m32 LD=ppu32-ld LDFLAGS=-m32
>>>>> --prefix=/shared/openmpi_gcc_ppc --with-platform=optimized
>>>>> --disable-mpi-profile --with-tm=/usr/local/ --with-wrapper-cflags=-m32
>>>>> --with-wrapper-ldflags=-m32 --with-wrapper-fflags=-m32
>>>>> --with-wrapper-fcflags=-m32 --enable-mpirun-prefix-by-default
>>>>>
>>>>>
>>>> I also compiled the helloworld with and without -rpath, just to rule
>>>> out any linked-library issue.
>>>>
>>>> Now, the interesting part is this: on one node I compiled a kernel
>>>> with CONFIG_BSD_PROCESS_ACCT_V3 to monitor the startup of the PBS,
>>>> MPI, and helloworld daemons. As already mentioned at the beginning,
>>>> this is why I assume that the MPI startup within Torque is not
>>>> working for me.
>>>> Please request any further logs you want to review; I did not want to
>>>> make the mail too large at first.
>>>> Any ideas?
>>>>
>>>> Greetings,
>>>> Johann
>>>>
>>>>
>>>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users