Meanwhile, much later -- you'll sympathize: Did you have any joy with
> These messages appeared when running IMB compiled with openmpi 1.6.1
> across 256 cores (16 nodes, 16 cores per node). The job ran from
> 09:56:54 to 10:08:46 and failed with no obvious error messages.
I don't know about the messages, but there are successful ×256 runs in
my ~/imb, one with default params, and also ×512. The ×256 ones were
accounted ~330GB, and the default h_vmem is still 1G. That's not the
cause of the failure, is it?
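For what it's worth, the arithmetic behind that question (assuming the accounted ~330GB is summed over all 256 slots, and that h_vmem is enforced per slot, as Grid Engine typically does):

```python
# Back-of-envelope check: does ~330GB over 256 slots fit under a 1G h_vmem?
# (Assumptions: accounting sums memory over all slots; h_vmem is per-slot.)
total_gb = 330.0
slots = 256
per_slot_gb = total_gb / slots
print(f"{per_slot_gb:.2f} GB per slot")  # ~1.29 GB, i.e. over a 1G limit
```

So if h_vmem were actually being enforced, ~1.29GB per slot would trip a 1G limit; the fact that those runs succeeded suggests it wasn't enforced there.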
For the kernel issue, do you actually have the same adaptors under RH5
to compare? lspci says our current QDR ones are the same as yours
(surprisingly), and they're OK with openib params from the default
Mellanox OFED setup. They're on older OFED (not vanilla RH) due to our
horrible hardware mixture.
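In case it helps with the comparison: PCI vendor:device IDs are more reliable than the names lspci prints, since the names can differ between OFED/pciutils versions (Mellanox's vendor ID is 15b3). A sketch, to run on each cluster and diff:

```shell
# Compare HCAs by PCI vendor:device ID rather than marketing name.
# On a live node you'd run:
#   lspci -nn | grep -i 15b3
#   ofed_info -s    # OFED stack version, on Mellanox OFED installs
# Against a sample lspci line (a QDR ConnectX-2, shown here as an example),
# this extracts just the ID pair worth comparing:
sample='03:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX-2 VPI] [15b3:673c]'
echo "$sample" | grep -o '15b3:[0-9a-f]*'
```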
With the new stuff I can't do more than access a head node. That also has
the same adaptors, but I don't know if MPI runs show your symptom. The
modprobe config is different from yours, but the driver is older. If it
might help, I could poke around the node images and send config files, but I
don't know what the various images are.
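If I do get at the images, something like this would pull out just the IB-related module options worth diffing (paths are the typical RH/OFED locations; the image layout may differ):

```shell
# Against a mounted node image you'd run something like:
#   grep -r . /mnt/image/etc/modprobe.d/ | grep -iE 'mlx|ib_|rdma'
# which, for a line like this (mlx4_core MTT sizing, a common tuning knob),
# keeps just the option settings:
echo 'options mlx4_core log_num_mtt=24 log_mtts_per_seg=1' | grep -iE 'mlx|ib_|rdma'
```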
I might have hoped you wouldn't have to sort this yourself, if I was a