On Nov 21, 2006, at 1:27 PM, Brock Palen wrote:
> I had sent a message two weeks ago about this problem and talked with
> Jeff at SC06 about how it might not be an OMPI problem. But working
> with Myricom, it now appears that it is a problem in both lam-7.1.2
> and openmpi-1.1.2/1.1.1. Basically, the results from an HPL run are
> wrong; it also causes a large number of packets to be dropped by the
> fabric.
> This problem does not happen when using mpichgm: the number of
> dropped packets does not go up. There is a ticket open with Myricom
> on this. They are a member of the group working on OMPI, but I sent
> this out just to bring the list up to date.
> If you have any questions feel free to ask me. The details are in
> the archive.
> Brock Palen
I am working on this ticket at Myricom.
I am using Linux nodes since we do not have two OS X machines running
10.3 available. Each node has 1 GB of RAM and two Myrinet PCI-X cards:
a single-port D card and a dual-port E card. I have disabled the E
card. I am using GM-2.0.26 and Open MPI 1.2b1.
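For reference, I am launching the job with something along these lines (the host file name and process count below are placeholders, not my exact values; `--mca btl gm,self` is the Open MPI way to restrict the run to the GM transport):

```shell
# Sketch of the launch command; "hosts" and "-np 2" are example values.
mpirun --mca btl gm,self -np 2 --hostfile hosts ./hpcc
```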
I am running HPCC which includes HPL as well as other benchmarks.
Using Brock's HPL.dat values in my hpccinf.txt, I do not see any
failed HPL runs. I do see some runs hang and require a reboot (the
machine becomes unresponsive), but the hang may happen in the HPL
portion of the run or in another benchmark.
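To check for failed runs I am essentially just scanning the output for HPL's residual-check verdicts, something like the sketch below (the default HPCC output file is hpccoutf.txt; the PASSED/FAILED tokens are what HPL prints for each residual test):

```shell
# check_hpl FILE: report HPL residual-check results in an HPCC output file.
# Returns nonzero if any line contains FAILED.
check_hpl() {
  if grep -q 'FAILED' "$1"; then
    echo "residual check FAILED in $1"
    return 1
  fi
  # grep -c counts the lines containing PASSED.
  printf 'passed residual checks: %s\n' "$(grep -c 'PASSED' "$1")"
}
```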
My last few runs all completed successfully without hanging. The job
I am currently running has just hung one node (it responds to ping,
but I cannot ssh into it or use any of the terminals connected to it).
There are no messages in dmesg, and vmstat showed the node was not
swapping before it hung.
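Since the node locks up hard, I am now streaming the monitoring output to another machine so the last samples before a hang survive. The helper below is a small sketch of that idea; in practice its input would be something like `ssh node1 'vmstat 5'` (node name and interval are placeholders):

```shell
# timestamp_lines: prefix each stdin line with the current time, so the
# final vmstat samples before a hang can be tied to wall-clock time.
timestamp_lines() {
  while IFS= read -r line; do
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
  done
}

# Typical use (run on a second machine, names are placeholders):
#   ssh node1 'vmstat 5' | timestamp_lines >> node1-vmstat.log
```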
Any ideas where I should look next?