I have an application that appears to function as I expect
when compiled with openmpi-1.4.2 on FreeBSD 9.0. But with
openmpi-1.4.3, it appears to hang during communication between
nodes. What follows is the long version.
I configure 1.4.2 with
./configure --prefix=/usr/local/openmpi-1.4.2 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static
The Fortran compiler is gfortran 4.5.3. I rebuild my application
and launch the app from node10 with
% /usr/local/openmpi-1.4.2/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
-np 4 sasmp sas.in
where the machine file is
% cat mf1
node10 slots=3
node11 slots=4
Using top(1) on node10 and node11, I see
node10
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
74158 kargl 1 65 0 302M 293M CPU1 1 57:06 99.12% sasmp
74160 kargl 1 65 0 306M 298M CPU0 0 57:06 99.07% sasmp
74159 kargl 1 65 0 306M 298M CPU3 3 57:06 99.02% sasmp
node11
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
13144 kargl 1 48 0 306M 297M CPU3 3 55:55 99.02% sasmp
The above is the expected process information and, more importantly,
the application is producing the right answer.
Now, I repeat everything above for 1.4.3. I configure with
./configure --prefix=/usr/local/openmpi-1.4.3 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static
I rebuild my application and launch the app from node10 with
% /usr/local/openmpi-1.4.3/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
-np 4 sasmp sas.in
This time, top(1) on node10 and node11 shows
node10
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
74460 kargl 1 66 0 302M 291M CPU2 2 3:15 99.03% sasmp
74462 kargl 1 66 0 302M 291M CPU3 3 3:15 99.03% sasmp
74461 kargl 1 66 0 14472K 4616K CPU1 1 3:15 99.03% sasmp
node11
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
13298 kargl 1 49 0 14472K 3336K CPU3 3 3:11 99.03% sasmp
I've left the application running for up to 12 minutes, and the
process on node11 and one of the processes on node10 (PID 74461)
never grow to the ~300 MB SIZE and ~293 MB RES seen with 1.4.2.
Now, if I reduce -np from 4 to 3, then only 3 processes are started,
all on node10, and I get the expected results. So it seems that, as
soon as I try to send something over tcp, the application stalls. Any
idea on how I might debug this problem?
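
For reference, here is a small sketch of why I believe -np 3 stays on
node10 while -np 4 puts one rank on node11. It assumes mpiexec fills the
slots of each machine file entry in order (by-slot mapping) and does not
model oversubscription. The function names parse_machinefile and
place_by_slot are mine for illustration, not part of Open MPI.

```python
def parse_machinefile(text):
    """Return [(host, slots), ...] from lines of the form 'host slots=N'."""
    nodes = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        slots = 1  # assumption: a host without slots= counts as one slot
        for field in fields[1:]:
            if field.startswith("slots="):
                slots = int(field.split("=", 1)[1])
        nodes.append((fields[0], slots))
    return nodes


def place_by_slot(nodes, np):
    """Assign ranks 0..np-1 by filling each node's slots before moving on.

    Assumed by-slot behaviour; returns fewer than np hosts if the
    machine file does not have enough slots.
    """
    placement = []
    for host, slots in nodes:
        for _ in range(slots):
            if len(placement) == np:
                return placement
            placement.append(host)
    return placement


MF1 = "node10 slots=3\nnode11 slots=4\n"  # contents of mf1 above

for np in (3, 4):
    print(np, place_by_slot(parse_machinefile(MF1), np))
```

Under that assumption, -np 3 never leaves node10, while the fourth rank
of -np 4 lands on node11, which matches what top(1) shows in both runs.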
--
Steve