Thanks a lot for the fast response.
Could you tell me which command the "--display-allocation" and "--display-map" options go with? mpirun? ./configure? ...
Also, we have tested this in our PBS script: if we put node=1, the helloworld program works. But when we put node=2 or more, the job hangs until it times out, and the error message is something like:
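If the options belong on the mpirun line (that is how I read "cmd line" in your reply), I assume our PBS script would end up looking roughly like the sketch below. The resource request, job name and process count are placeholders rather than our real values, and the script only launches anything if mpirun exists at the path we used for the manual run.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=8      # assumed resource request, not our actual one
#PBS -N ompi_diag          # hypothetical job name

# Torque sets PBS_O_WORKDIR inside a job; fall back to "." elsewhere.
cd "${PBS_O_WORKDIR:-.}"

# Path taken from our manual run.
MPIRUN=/home/software/ompi/1.3.2-pgi/bin/mpirun

# Steps 1 and 2 from the reply: show the allocation Open MPI read from
# Torque, and the process map it plans to use.
CMD="$MPIRUN --display-allocation --display-map -np 16 ./a.out"

# Step 3 from the reply: launch the daemons over ssh/rsh instead.
CMD_RSH="$MPIRUN -mca plm rsh -np 16 ./a.out"

echo "Diagnostic command: $CMD"
echo "Fallback command:   $CMD_RSH"

# Only run when mpirun is really present, so the sketch is harmless elsewhere.
if [ -x "$MPIRUN" ]; then
    $CMD
fi
```

If the first run looks sane, the fallback line would be swapped in for the second test.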
node0006 - daemon did not report back when launched
However, if we don't go through the scheduler and run mpi manually, everything works fine too.
/home/software/ompi/1.3.2-pgi/bin/mpirun -machinefile ./nodes -np 16 ./a.out
What do you think the problem could be? It is not a network issue, because running MPI manually works. That is why we suspect a Torque compatibility problem.
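One thing we can check on our side is which hosts Torque actually hands to the job, so that we can compare them with the ./nodes file used in the manual run. A minimal sketch, assuming both files list one host name per line; unique_hosts is a hypothetical helper name, and PBS_NODEFILE is only set inside a running job.

```shell
#!/bin/sh
# Print each distinct line (host name) of a node file, sorted.
unique_hosts() {
    sort -u "$1"
}

# Inside a job, list the hosts Torque assigned; outside a job, skip.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    echo "Hosts assigned by Torque:"
    unique_hosts "$PBS_NODEFILE"
fi

# List the hosts from the manual machinefile, if it is present.
if [ -f ./nodes ]; then
    echo "Hosts in ./nodes:"
    unique_hosts ./nodes
fi
```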
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov
----- Original Message -----
From: Ralph Castain <rhc_at_[hidden]>
Date: Tuesday, July 21, 2009 12:12 pm
Subject: Re: [OMPI users] Open-MPI-1.3.2 compatibility with old torque?
To: Open MPI Users <users_at_[hidden]>
> I'm afraid I have no idea - I've never seen a Torque version that old,
> however, so it is quite possible that we don't work with it. It also looks
> like it may have been modified (given the p2-aspen3 on the end), so I have
> no idea how the system would behave.
> First thing you could do is verify that the allocation is being read
> correctly. Add a --display-allocation to the cmd line and see what we think
> Torque gave us. Then add --display-map to see where it plans to place the
> processes.
> If all that looks okay, and if you allow ssh, then try -mca plm rsh on the
> cmd line and see if that works.
> On Tue, Jul 21, 2009 at 12:57 PM, Song, Kai Song <KSong_at_[hidden]>
> > Hi All,
> > I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and
> > myrinet. I compiled it just fine with this configuration:
> > ./configure --prefix=/home/software/ompi/1.3.2-pgi \
> >     --with-gm=/usr/local/ --with-gm-libdir=/usr/local/lib64/ \
> >     --enable-static --with-tm=/usr/ --without-threads \
> >     CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 LDFLAGS=-L/usr/lib64/torque/
> > However, when I submit jobs for 2 or more nodes through the torque
> > scheduler, the jobs just hang. They show the RUN state, but there is no
> > communication between the nodes, and the jobs then die with a timeout.
> > We have confirmed that the myrinet is working because our lam-mpi-7.1 works
> > just fine. We are having a really hard time determining what is causing
> > this problem. So, we suspect it's because our torque is too old.
> > What is the lowest version requirement of torque for open-mpi-1.3.2? The
> > README file didn't specify this detail. Does anyone know more about it?
> > Thanks in advance,
> > Kai
> > --------------------
> > Kai Song
> > <ksong_at_[hidden]> 1.510.486.4894
> > High Performance Computing Services (HPCS) Intern
> > Lawrence Berkeley National Laboratory - http://scs.lbl.gov
> > _______________________________________________
> > users mailing list
> > users_at_[hidden]
> > http://www.open-mpi.org/mailman/listinfo.cgi/users