Open MPI User's Mailing List Archives

Subject: [OMPI users] tcp communication problems with 1.4.3 and 1.4.4 rc2 on FreeBSD
From: Steve Kargl (sgk_at_[hidden])
Date: 2011-07-05 16:14:06


I have an application that appears to function as I expect
when compiled with openmpi-1.4.2 on FreeBSD 9.0. But when built
with 1.4.3 or 1.4.4rc2, it appears to hang during communication
between nodes. What follows is the long version.

I configure 1.4.2 with

./configure --prefix=/usr/local/openmpi-1.4.2 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static

The Fortran compiler is gfortran 4.5.3. I rebuild my application
and launch the app from node10 with

% /usr/local/openmpi-1.4.2/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
  -np 4 sasmp sas.in

where the machine file is

% cat mf1
node10 slots=3
node11 slots=4

Using top(1) on node10 and node11, I see

node10
  PID USERNAME THR PRI NICE   SIZE    RES STATE C  TIME   WCPU COMMAND
74158 kargl      1  65    0   302M   293M CPU1  1 57:06 99.12% sasmp
74160 kargl      1  65    0   306M   298M CPU0  0 57:06 99.07% sasmp
74159 kargl      1  65    0   306M   298M CPU3  3 57:06 99.02% sasmp

node11
  PID USERNAME THR PRI NICE   SIZE    RES STATE C  TIME   WCPU COMMAND
13144 kargl      1  48    0   306M   297M CPU3  3 55:55 99.02% sasmp

The above is the expected process information and, more importantly,
the application is producing the right answer.

Now, I repeat everything above for 1.4.3, configuring with

./configure --prefix=/usr/local/openmpi-1.4.3 \
--enable-mpirun-prefix-by-default --disable-shared --enable-static

Rebuild my application and launch the app from node10 with

% /usr/local/openmpi-1.4.3/bin/mpiexec -mca btl tcp,self -machinefile mf1 \
  -np 4 sasmp sas.in

Using top(1) on node10 and node11 again, I see

node10
  PID USERNAME THR PRI NICE   SIZE    RES STATE C  TIME   WCPU COMMAND
74460 kargl      1  66    0   302M   291M CPU2  2  3:15 99.03% sasmp
74462 kargl      1  66    0   302M   291M CPU3  3  3:15 99.03% sasmp
74461 kargl      1  66    0 14472K  4616K CPU1  1  3:15 99.03% sasmp

node11
  PID USERNAME THR PRI NICE   SIZE    RES STATE C  TIME   WCPU COMMAND
13298 kargl      1  49    0 14472K  3336K CPU3  3  3:11 99.03% sasmp

I've left the application running for up to 12 minutes, and the
process on node11 and one of the processes on node10 (PID 74461)
never reach the ~300M SIZE or ~293M RES seen with 1.4.2.

Now, if I reduce -np from 4 to 3, then only 3 processes are started
on node10, and I get the expected results. So, as soon as I try to
send something over tcp, the application stalls. Any idea on how
I might debug this problem?
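
To make the -np 3 versus -np 4 difference concrete, here is a rough
sketch of the slot arithmetic. It assumes ranks are placed by slot,
filling each host in machinefile order before moving to the next,
which matches the top(1) listings above. The function name
place_ranks and the file mf1.example are made up for illustration;
they are not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): print how many of NP ranks
# land on each host, assuming hosts are filled in machinefile order up to
# their slots= count.
place_ranks() {
    # $1 = number of ranks, $2 = machinefile path
    awk -v np="$1" '
        NF == 0 || /^#/ { next }              # skip blank and comment lines
        {
            slots = 1                         # assumed default without slots=
            for (i = 2; i <= NF; i++)
                if ($i ~ /^slots=/) { split($i, kv, "="); slots = kv[2] + 0 }
            n = (np + 0 < slots) ? np + 0 : slots   # ranks placed on this host
            if (n > 0) print $1, n
            np -= n
        }
    ' "$2"
}

# Same contents as mf1 above, written to a scratch file for the example.
printf 'node10 slots=3\nnode11 slots=4\n' > mf1.example

place_ranks 4 mf1.example   # node10 gets 3 ranks, node11 gets 1
place_ranks 3 mf1.example   # all 3 ranks fit on node10
```

Under that assumption only the fourth rank ends up on node11, so
-np 4 is the first case in which any traffic has to leave node10.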

-- 
Steve