Dear Jeff,
Thanks for your help.
Unfortunately, after I thoroughly examined the entire cluster, I found a bad node with a busted hard drive. That is why the job hung.
Also, when the job is launched with one bad node in the machinefile, neither Open MPI nor my program gives any error message. That is why I could not find the cause of the hang.
Best regards
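For anyone who hits the same silent hang, one option is to probe every host in the machinefile before launching. The following is only a minimal sketch under several assumptions: the machinefile has one host per line (optionally followed by fields such as "slots=4", with "#" comments), password-less ssh works to every node, and writing a small file under /tmp is a reasonable stand-in for "the disk is healthy". The function names and the /tmp/.node_check path are hypothetical and not part of Open MPI.

```python
import subprocess


def read_machinefile(text):
    """Return the unique host names found in machinefile text, in order.

    Assumes one host per line, optionally followed by fields such as
    "slots=4"; anything after "#" is treated as a comment.
    """
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        host = line.split()[0]
        if host not in hosts:
            hosts.append(host)
    return hosts


def ssh_probe(host, timeout):
    """Hypothetical health probe: log in and write/remove a scratch file.

    Returns True only if ssh connects and the remote command exits 0
    within the time limit. A node with a dead disk will often fail or
    stall here, but that is an assumption, not a guarantee.
    """
    cmd = [
        "ssh", "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout}",
        host,
        "touch /tmp/.node_check && rm /tmp/.node_check",
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def find_bad_nodes(hosts, probe=ssh_probe, timeout=10):
    """Return the hosts for which the probe fails or raises."""
    bad = []
    for host in hosts:
        try:
            ok = probe(host, timeout)
        except Exception:
            ok = False
        if not ok:
            bad.append(host)
    return bad
```

A typical use would be to read the machinefile, call find_bad_nodes(read_machinefile(text)), and only start mpirun if the returned list is empty. The probe is passed in as a parameter so that a different check (for example, one that reads the disk's status) can be swapped in.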
On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:

> I tried to increase the speed of a program with openmpi-1.1.3

Did you mean 1.1.3 or 1.3.1?

I mean 1.1.3.

> by adding the following 4 parameters to the openmpi-mca-params.conf file.
> mpi_leave_pinned=1
> btl_openib_eager_rdma_num=128
> btl_openib_max_eager_rdma=128
> btl_openib_eager_limit=1024

If you meant 1.3.1 above, please see the following message about an important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:

http://www.open-mpi.org/community/lists/announce/2009/03/0029.php

> and then, I ran my program twice (124 processes on 31 nodes), once with "mpi_leave_pinned=1" and once with "mpi_leave_pinned=0".
> All of them were stopped abnormally with "ctrl+c" and "killall -9 <program>".

Why -- did they hang?

> After that, I couldn't start to run that program again.

What exactly was the error?

> I checked every node with "free -m" and found that a huge amount of cached memory was in use on each node.
> Could this situation be caused by those 4 parameters? Is there any way to free it?

Probably not.

Can you send all the information listed here:

http://www.open-mpi.org/community/help/
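
On the cached-memory question above: on Linux, the "cached" figure reported by "free -m" is page cache, which the kernel normally gives back when programs need memory, so a large value is usually not a leak in itself. The sketch below is one way to separate cache from memory actually held by programs. It assumes a Linux node whose /proc/meminfo has the usual MemTotal, MemFree, Buffers, and Cached fields in kB; the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: kB} dict.

    Lines that do not look like "Name:   12345 kB" are skipped.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_summary_mb(info):
    """Split memory into free, cache, and program use, in whole MB.

    "cache" is Buffers + Cached, which is roughly what older versions
    of free report in their buffers and cached columns.
    """
    total = info.get("MemTotal", 0)
    free = info.get("MemFree", 0)
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "total": total // 1024,
        "free": free // 1024,
        "cache": cache // 1024,
        "used_by_programs": (total - free - cache) // 1024,
    }
```

On a node, this could be fed with the contents of /proc/meminfo, for example memory_summary_mb(parse_meminfo(open("/proc/meminfo").read())). If "used_by_programs" is small while "cache" is large, the memory is most likely reclaimable. The cache can also be dropped explicitly as root with "sync; echo 3 > /proc/sys/vm/drop_caches", although that is rarely necessary.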
--
Jeff Squyres
Cisco Systems