On Nov 7, 2007, at 4:41 PM, Benjamin, Ted G. wrote:
> Please understand that I'm decent at the engineering side of it. As
> a system administrator, I'm a decent engineer.
> On the previous configurations, this program seems to run with any
> number of processors. I believe these successful users have been
> using LAM/MPI. While I was waiting for a reply, I installed LAM/
> MPI. The results were similar to those from OpenMPI.
This is a good sign; consistent behavior across multiple different
MPIs implies a problem at the application or system level (i.e., not
the MPI level). Again, I'll not promise that any MPI is bug free, but
these signs point to an application/system problem.
> While I can choose LAM/MPI, I'd prefer to port it to OpenMPI since
> that is where all the development and most of the support are.
> I cannot choose the Portland compiler. I must use either GNU or
> Intel compilers on the Itanium2.
> > Have you tried running your code through a memory-checking debugger,
> > and/or examining any corefiles that were generated to see if there is
> > a problem in your code?
> > I will certainly not guarantee that Open MPI is bug free, but problems
> > like this are *usually* application-level issues. One place I would
> > start is running the application in a debugger to see if you can find
> > exactly where the Badness happens. This can be most helpful.
> I have tried to run a debugger, but I am not an expert at it. I
> could not get Intel's idb debugger to give me a prompt, but I could
> get a prompt from gdb. I've looked over the manual, but I'm not
> sure how to put in the breakpoints et al. that you geniuses use to
> evaluate a program at critical junctures. I actually used an
> mpirun -np 2 dbg command to run it on 2 CPUs. I attached the file
> at the prompt. When I did a run, it ran fine with no optimization
> and one processor. With 2 processors, it didn't seem to do
> anything. All I will say here is that I have a lot to learn. I'm
> calling on my friends for help on this.
For such small runs, I typically do the lazy thing:
- mpirun -np 2 ... as normal
- login to the node(s) where the jobs were launched
- use "gdb --pid <pid>" to attach to each of the jobs
- when gdb attaches, use the "continue" command to let the jobs keep running
- eventually, the problem will occur and the process will die
- in several kinds of scenarios, gdb will show you right where it died
Consult the gdb documentation and/or any local resources you have for
more details on using gdb.
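
The attach steps above can be sketched roughly as follows. This is only an illustration, not a recipe: the helper name gdb_attach_cmds and the application name my_app are made up for the example, and it only prints the gdb commands so you can run each one in its own terminal on the node where the processes live.

```shell
#!/bin/sh
# Hypothetical helper: print one gdb attach command per PID given.
# "-ex continue" asks gdb to resume the process right after attaching,
# so it keeps running until the failure occurs.
gdb_attach_cmds() {
    for pid in "$@"; do
        printf 'gdb --pid %s -ex continue\n' "$pid"
    done
}

# Example, run on the node where the job was launched ("my_app" is an
# assumed executable name; substitute your own):
#   gdb_attach_cmds $(pgrep -u "$USER" -x my_app)
```

Once a process dies inside gdb, the "bt" (backtrace) command is usually the first thing to try, since it shows the call stack at the point of failure.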
> > That's fun. Can you tell if it runs the app at all, or if it dies before
> > main() starts? This is probably more of an issue for your
> > Intel support guy than us...
> It's a Fortran program. It starts in the main program. I inserted
> some PRINT*, statements of the PRINT*,'Read the input at line 213'
> variety into the main program to see what would print. It printed
> the first four statements, but it didn't reach the last three. The
> calls that were reached were in the set-up section of the program.
> The section that wasn't reached had a lot of matrix-setting and
> solving subroutine calls.
That's also a good sign; it started to execute and then died later.
So it's not a system-level issue that prevents the app from starting;
that eliminates one whole line of troubleshooting.
> mpirun -np 2 mpi_hello
> mpirun -np 2 non_mpi_hello
> print two Hello, worlds).
So just to be absolutely clear: this is expected behavior. Open MPI's
mpirun can launch non-MPI applications.
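
If you want a quick sanity check of that, something like the following sketch may help. The helper name count_output_lines is made up for this example; it just counts how many lines a command prints, which for a one-line-per-process program tells you how many copies were launched.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print the number of output lines.
count_output_lines() {
    "$@" | wc -l | tr -d ' '
}

# Example, assuming Open MPI's mpirun is in your PATH. "hostname" is a
# plain non-MPI program that prints one line per launched copy:
#   count_output_lines mpirun -np 2 hostname
```

With -np 2, a non-MPI program that prints one line should yield a count of 2, matching the two "Hello, world" lines you saw.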
> This is my mistake. I attached an old version of ompi_info.txt. I
> am now attaching the correct version. I already have 1.2.4 installed.
Gotcha. I would proceed with seeing what the debugger will tell you,
or, failing that, putting more and more printf's in to narrow down
exactly where things fail. I'm an advocate of using tools, though --
so I tend to prefer using debuggers. But sometimes a small number of
printf's are ok. :-)