Yeah bummers, but something tells me it might not be OpenMPI's fault. Here's why:

1- The tech that takes care of these machines told me that he gets RTC errors on bootup (the cpu borads are apprantly "out of sync" since the clocks aren't set correctly).

2- There is also a possibility that the prior admin did not put in a "stable" firmware version.

So if any Sun guru can help out by telling me which command or point to a quick HOWTO for resolvin these clock issues, it would be greatly appreciated (our analyst is overloaded and he would not be able to justify the 3 days of reading up docs just to satisfy my running parallel code problems ;P)

3- I realised that the OS is not booted in 64 O_o!! (not that this has to do with OpenMPI bombing):

Jun 21 07:45:15 unknown genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 Version Generic_108528-29 32-bit

Jun 21 07:45:15 unknown NOTICE: 64-bit OS installed, but the 32-bit OS is the default

Jun 21 07:45:15 unknown Booting the 32-bit OS ...

4- LAM-MPI 7.1.1 also bombs, but it does so at a much higher processor count (OpenMPI bombs at 5, LAM-MPI bombs around 10, but it vraies).

As for the questions regarding OpenMPI build, I just recently built 1.1 with the same basic configure options with the exact same results (clean cache).

So, I guess this one is on pause untill I have the confirmation that the clocks on the processor boards are set correctly. There is one this that bothers me though, one of the machines has only 1 processor board (4 procs) and I still get the error on that machine if I go over 4 can a board be out of sync with itself??


PS: I am at liberty of providing the source code if anyone wants it.

Le mercredi 28 juin 2006 08:56, Jeff Squyres (jsquyres) a écrit :

> Bummer! :-(


> Just to be sure -- you had a clean config.cache file before you ran configure, right? (e.g., the file didn't exist -- just to be sure it didn't get potentially erroneous values from a previous run of configure) Also, FWIW, it's not necessary to specify --enable-ltdl-convenience; that should be automatic.


> If you had a clean configure, we *suspect* that this might be due to alignment issues on Solaris 64 bit platforms, but thought that we might have had a pretty good handle on it in 1.1. Obviously we didn't solve everything. Bonk.


> Did you get a corefile, perchance? If you could send a stack trace, that would be most helpful.