Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1
From: charlie strauss (cems_at_[hidden])
Date: 2010-06-21 15:28:29


Perhaps I was mistaken about 1.5rc1. As for the openMPI that comes
installed on Mac OS X, my 10.5 machine has v1.2.3. When I run it, it
works fine locally, but it never finds the xgrid.

Any MPI job I run runs on localhost, not on the xgrid agents. If I
try to force the issue by specifying -nolocal, it just complains that
there are no nodes.

So how do I use openMPI so that it uses the nodes of an xgrid cluster?

mpirun -nolocal -n 32 /bin/hostname
--------------------------------------------------------------------------
There are no available nodes allocated to this job. This could be because
no nodes were found or all the available nodes were already used.

Note that since the -nolocal option was given no processes can be
launched on the local node.
--------------------------------------------------------------------------
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c at line 168
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c at line 402
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c at line 210
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at line 372

On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:

> Where did you see that 1.5 works with xgrid? That support has been
> broken since the 1.2 series, unfortunately, so it would help to
> ensure we don't have stale docs out there to the contrary.
>
> As for the 1.2 results, you are aware (I imagine) that OSX ships
> with the last 1.2 release already installed? You don't have to do
> anything to use it but run.
>
> If you are getting peer timeouts, that is almost always a firewall
> issue. But I would try the factory-installed version first to be sure.
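
One way to test the firewall theory is to check, from one of the xgrid agents, whether a plain TCP connection back to the submitting machine succeeds at all. The following is only a minimal sketch: `can_connect` is a hypothetical helper, not part of Open MPI, and the host and port have to be supplied by hand, since nothing in this thread says which port the out-of-band channel is using.

```python
import socket
import sys


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for diagnosing firewall problems; it only tests
    basic TCP reachability and knows nothing about Open MPI itself.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (placeholders): python check_port.py <head-node-host> <port>
    target_host, target_port = sys.argv[1], int(sys.argv[2])
    status = "reachable" if can_connect(target_host, target_port) else "NOT reachable"
    print(f"{target_host}:{target_port} is {status}")
```

If a port that is known to be listening on the submitting machine is reported as not reachable from an agent, that would be consistent with a firewall between the two. A successful connection does not by itself rule one out, because other ports may still be filtered.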
>
> On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:
>
>> I'm new to openMPI. I'm trying to set it up for use with xgrid. I have
>> read that v1.3 and v1.4 are broken on OSX 10.5 and 10.6, although I have
>> seen some discussions in the archives of this mailing list saying some
>> people have v1.4 running on 10.6.
>>
>> I have now compiled both openMPI 1.2 and openMPI 1.5rc1, and neither of
>> them is working for me with xgrid. Both of these say they work with xgrid.
>>
>> The failure modes are different.
>>
>> Does anyone know how to get a working install? I am building this on an
>> OSX 10.5.8 machine. The xgrid controller is on an OSX 10.6 server machine.
>> I have tried configuring with and without the --with-xgrid option.
>>
>> Behaviour of openMPI1.2
>> $ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname
>>
>> The job appears in the xgrid queue, and the logs show it is running on a
>> remote machine. However, nothing ever happens, and peeking in the xgrid
>> results I see:
>>
>> $ xgrid -job results -id 8703
>> [brio.llnl.gov:38789] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed: Operation timed out (60) - retrying
>> [brio.llnl.gov:38792] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect: connection failed: Operation timed out (60) - retrying
>>
>> Perhaps a firewall issue?
>>
>> Of course, I'm more interested in getting the new openMPI 1.5 working.
>> When I run this, I again get an entry in the queue, and the job runs on a
>> remote machine, but I get a job-failed message:
>>
>> $ /usr/local/openmpi5/bin/mpirun -n 2 /bin/hostname
>> $ xgrid -job results -id 8702
>> [brio.llnl.gov:38776] Error: unknown option "-mca"
>>
>> ----
>>
>> Note that I have NOT installed openMPI on any of the other computers in
>> the grid. So perhaps that is the problem? If I did install it on the other
>> computers, how would I tell mpirun where to find the install path?
>>
>> ----
>>
>>
>> Finally, in both cases, I don't see any way to pass xgrid-specific
>> arguments on the mpirun command line. An xgrid controller divides the
>> agents into sets of logical grids, and you need to specify which logical
>> grid to submit the job to. In xgrid CLI syntax one writes "xgrid -gid 2"
>> for grid 2. When I use openMPI, all the jobs get sent to the default grid,
>> which is the grid that xgrid uses if no gid is specified.
>>
>

Charlie Strauss
Bioscience Division
cems_at_[hidden]
505 665 4838
Quidquid latine dictum sit, altum sonatur.