Thanks for your help.
Unfortunately, after I thoroughly examined the entire cluster, I found a bad node with a busted hard drive. That is why the job hung.
Also, when the job is submitted with one bad node in the machinefile, neither Open MPI nor my program gives any error message. That is why I couldn't find the reason the job hung.
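Since nothing reported the bad node, a small pre-flight check over the machinefile before launching might have caught it. The following is only a sketch under assumptions: the machinefile is taken to have one host per line, optionally followed by fields such as "slots=N"; passwordless ssh to every node is assumed; and the helper names (parse_machinefile, node_is_healthy, find_bad_nodes) are hypothetical, not part of Open MPI. Creating and deleting a temporary file is a crude health test and will not catch every kind of disk fault.

```python
import subprocess


def parse_machinefile(text):
    """Return the unique host names from machinefile text, in order."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        host = line.split()[0]  # ignore trailing fields such as "slots=4"
        if host not in hosts:
            hosts.append(host)
    return hosts


def node_is_healthy(host, timeout=10):
    """Hypothetical check: log in over ssh and create/remove a temp file."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",
        host,
        'f=$(mktemp) && rm -f "$f"',         # fails if the temp dir is unwritable
    ]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout + 5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def find_bad_nodes(machinefile_text, check=node_is_healthy):
    """Return the hosts for which the check fails."""
    return [h for h in parse_machinefile(machinefile_text) if not check(h)]
```

Running find_bad_nodes on the contents of the machinefile before mpirun would list the nodes to remove; the check argument can be swapped for a stricter site-specific test.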
On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:
Did you mean 1.1.3 or 1.3.1?
I tried to increase the speed of a program with openmpi-1.1.3
I mean 1.1.3.
If you meant 1.3.1 above, please see the following message about an important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:
by adding the following 4 parameters to the openmpi-mca-params.conf file.
http://www.open-mpi.org/community/lists/announce/2009/03/0029.php
Why -- did they hang?
Then I ran my program twice (124 processes on 31 nodes): once with "mpi_leave_pinned=1" and once with "mpi_leave_pinned=0".
Both runs were stopped abnormally with "ctrl+c" and "killall -9 <program>".
What exactly was the error?
After that, I couldn't start the program again.
I checked every node with "free -m" and found that a huge amount of cached memory was in use on each node.
Could this situation be caused by those 4 parameters? Is there any way to free it?
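For context on the "cached" figure: on Linux it is page cache, which the kernel normally reclaims on its own when applications need memory, so a large value by itself does not necessarily point at the MCA parameters. Below is a minimal sketch for reading the same numbers from /proc/meminfo on a node; the function names are hypothetical, and a Linux node is assumed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integers (most are in kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def cache_summary_mb(text):
    """Return free, cached and buffer memory in MB, as "free -m" reports them."""
    info = parse_meminfo(text)
    return {
        "free_mb": info.get("MemFree", 0) // 1024,
        "cached_mb": info.get("Cached", 0) // 1024,
        "buffers_mb": info.get("Buffers", 0) // 1024,
    }


def local_cache_summary_mb(path="/proc/meminfo"):
    """Read the summary for the machine this runs on (Linux only)."""
    with open(path) as f:
        return cache_summary_mb(f.read())

# To drop the page cache by hand, root can run (Linux 2.6.16 or later):
#   sync; echo 3 > /proc/sys/vm/drop_caches
# This is rarely needed, because cached memory is given back on demand.
```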
Can you send all the information listed here: