I'm afraid I have no idea. I've never seen a Torque version that old, though, so it is quite possible that we don't work with it. It also looks like it may have been modified (given the p2-aspen3 on the end), so I can't say how the system would behave.
The first thing you could do is verify that the allocation is being read correctly. Add --display-allocation to the mpirun command line to see what we think Torque gave us. Then add --display-map to see where it plans to place the processes.
If all that looks okay, and if you allow ssh, then try -mca plm rsh on the command line and see if that works.
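Roughly, those checks might look like the following when run from inside a Torque job. This is only a sketch: the test program ./hello_mpi, the process count, and the MPIRUN, APP, NP and DRY_RUN variables are placeholders made up for illustration, not anything taken from your setup. It prints the command lines by default and only executes them when DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the three checks described above.
# Assumptions (hypothetical, adjust to your site): ./hello_mpi is any small
# MPI test program, NP is the number of processes, and mpirun is on PATH.
MPIRUN=${MPIRUN:-mpirun}
APP=${APP:-./hello_mpi}
NP=${NP:-4}

# Print the three diagnostic command lines, one per line.
diag_cmds() {
    # 1. Show the allocation mpirun believes Torque handed it.
    echo "$MPIRUN --display-allocation -np $NP $APP"
    # 2. Show where mpirun plans to place the processes.
    echo "$MPIRUN --display-map -np $NP $APP"
    # 3. Launch over rsh/ssh instead of the Torque (tm) launcher.
    echo "$MPIRUN -mca plm rsh -np $NP $APP"
}

# Dry run by default: only print the commands.
# Set DRY_RUN=0 inside a real job to execute them in order.
if [ "${DRY_RUN:-1}" = "0" ]; then
    diag_cmds | while IFS= read -r cmd; do
        echo "+ $cmd"
        # Redirect stdin so mpirun cannot swallow the remaining commands.
        sh -c "$cmd" </dev/null
    done
else
    diag_cmds
fi
```

If the first two runs report the nodes and placement you expect but only the third one actually completes, that would point at the Torque launch path rather than at Myrinet.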
HTH
Ralph
Hi All,
I am building open-mpi-1.3.2 on centos-3.4, with torque-1.1.0p2-aspen3 and myrinet. I compiled it just fine with this configuration:
./configure --prefix=/home/software/ompi/1.3.2-pgi --with-gm=/usr/local/ --with-gm-libdir=/usr/local/lib64/ --enable-static --disable-shared --with-tm=/usr/ --without-threads CC=pgcc CXX=pgCC FC=pgf90 F77=pgf77 LDFLAGS=-L/usr/lib64/torque/
However, when I submit jobs for 2 or more nodes through the Torque scheduler, the jobs just hang. They show the RUN state, but there is no communication between the nodes, and the jobs eventually die with a timeout.
We have confirmed that the Myrinet is working, because our lam-mpi-7.1 works just fine. We are having a really hard time determining the cause of this problem, so we suspect it is because our Torque is too old.
What is the minimum Torque version required for open-mpi-1.3.2? The README file doesn't specify this. Does anyone know more about it?
Thanks in advance,
Kai
--------------------
Kai Song
<ksong@lbl.gov> 1.510.486.4894
High Performance Computing Services (HPCS) Intern
Lawrence Berkeley National Laboratory - http://scs.lbl.gov