Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
From: Song, Kai Song (KSong_at_[hidden])
Date: 2009-07-21 18:44:30


Hi Ralph,

Thanks a lot for the fast response.

Could you tell me which command I should add "--display-allocation" and "--display-map" to? mpirun? ./configure?
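For reference, these are mpirun command-line options rather than ./configure options. The following is a minimal sketch of the three checks suggested in the quoted reply, as they might appear inside the PBS job script. The mpirun path, "-np 16" and ./a.out are taken from the manual run in this message; the DRY_RUN variable and the run_step helper are illustrative assumptions and not part of Open MPI or Torque.

```shell
#!/bin/sh
# Sketch only: DRY_RUN and run_step are hypothetical helpers for illustration.
MPIRUN=/home/software/ompi/1.3.2-pgi/bin/mpirun
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = also execute them

run_step() {
    # Print the command line; execute it only when DRY_RUN=0.
    echo "$@"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Check 1: show the allocation that mpirun believes Torque handed over.
run_step "$MPIRUN" --display-allocation -np 16 ./a.out

# Check 2: additionally show where mpirun plans to place the processes.
run_step "$MPIRUN" --display-allocation --display-map -np 16 ./a.out

# Check 3: select the rsh/ssh launcher instead of the Torque (tm) launcher.
# This assumes ssh between the compute nodes is permitted.
run_step "$MPIRUN" -mca plm rsh -np 16 ./a.out
```

With the default DRY_RUN=1 the script only prints the three command lines, so it can be reviewed before being submitted through the scheduler with DRY_RUN=0.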

Also, we have found that if we put node=1 in our PBS script, the helloworld test works. But when I put node=2 or more, it hangs until it times out, and the error message is something like:
 node0006 - daemon did not report back when launched

However, if we don't go through the scheduler and run MPI manually, everything works fine too:
/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out

What do you think the problem could be? It's not a network issue, because running MPI manually works. That is why we are asking about Torque compatibility.

Thanks again,

Kai

--------------------
Kai Song
<ksong_at_[hidden]> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov

----- Original Message -----
From: Ralph Castain <rhc_at_[hidden]>
Date: Tuesday, July 21, 2009 12:12 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <users_at_[hidden]>

> I'm afraid I have no idea - I've never seen a Torque version that old,
> however, so it is quite possible that we don't work with it. It also looks
> like it may have been modified (given the p2-aspen3 on the end), so I have
> no idea how the system would behave.
>
> First thing you could do is verify that the allocation is being read
> correctly. Add a --display-allocation to the cmd line and see what we think
> Torque gave us. Then add --display-map to see where it plans to place the
> processes.
>
> If all that looks okay, and if you allow ssh, then try -mca plm rsh on the
> cmd line and see if that works.
>
> HTH
> Ralph
>
>
> On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <KSong_at_[hidden]> wrote:
> > Hi All,
> >
> > I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> > myrinet. I compiled it just fine with this configuration:
> > ./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/
> > --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared
> > --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77
> > LDFLAGS=-L/usr/lib64/torque/
> >
> > However, when I submit jobs for 2 or more nodes through the torque
> > scheduler, the jobs just hang. They show the RUN state, but there is no
> > communication between the nodes, and then the jobs die with a timeout.
> >
> > We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> > just fine. We are having a really hard time determining the cause of
> > this problem, so we suspect it's because our torque is too old.
> >
> > What is the minimum torque version required for open-mpi-1.3.2? The
> > README file didn't specify this detail. Does anyone know more about it?
> >
> > Thanks in advance,
> >
> > Kai
> > --------------------
> > Kai Song
> > <ksong_at_[hidden]> 1.510.486.4894
> > High Performance Computing Services (HPCS) Intern
> > Lawrence Berkeley National Laboratory - http://scs.lbl.gov