You were right about iptables being very complex. It seems that uninstalling it completely did the trick. All my Send/Receive operations now complete as they should. Just one more question: will uninstalling iptables have any undesired effects on my Linux cluster?
 
Thanks!
Adrian

From: Jeff Squyres <jsquyres@cisco.com>
To: adrian sabou <adrian.sabou@yahoo.com>
Sent: Friday, February 3, 2012 12:30 PM
Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking

On Feb 3, 2012, at 5:21 AM, adrian sabou wrote:

> There is no iptables in my /etc/init.d.

It might be different in different OS's -- my RedHat-based system has /etc/init.d/iptables.

Perhaps try uninstalling iptables using your local package manager (rpm, yum, apt, ...whatever).
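
For example (just a sketch -- the exact package name and manager depend on your distribution):

    yum remove iptables        # RedHat / CentOS style systems
    apt-get remove iptables    # Debian / Ubuntu style systems

After removing it, you may want to reboot the node so that no previously loaded rules are left in the kernel.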

> It's most probably a communication issue between the nodes. However, I have no idea what it might be. It's weird, though, that the first Send/Receive pair works and only subsequent pairs fail. Anyway, thank you for taking the time to help me out. I am grateful!

> Adrian
>
> From: Jeff Squyres <jsquyres@cisco.com>
> To: adrian sabou <adrian.sabou@yahoo.com>; Open MPI Users <users@open-mpi.org>
> Sent: Thursday, February 2, 2012 11:19 PM
> Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
>
> When you run without a hostfile, you're likely only running on a single node via shared memory (unless you're running inside a SLURM job, which is unlikely, given the context of your mails). 
>
> When you're running in SLURM, I'm guessing that you're running across multiple nodes.  Are you using TCP as your MPI transport?
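>
> One way to check is to force the TCP (and self) transports explicitly and see whether the behaviour changes, e.g.:
>
>     salloc -N2 mpirun --mca btl tcp,self ./<my_app>
>
> If it hangs the same way, the problem is almost certainly in the TCP path between the nodes (firewall, routing, and so on).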
>
> If so, I would still recommend stopping iptables altogether -- /etc/init.d/iptables stop.  It might not make a difference, but I've found iptables to be sufficiently complex that it's easier to take that variable out entirely by stopping it, to really, really test whether that's the problem.
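>
> For example (assuming a RedHat-style init system -- the service name may differ on your distribution), on every node in the allocation:
>
>     /etc/init.d/iptables stop     # or: service iptables stop
>     chkconfig iptables off        # keep it from coming back on reboot
>     iptables -L -n                # verify that no rules are left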
>
>
>
> On Feb 2, 2012, at 9:48 AM, adrian sabou wrote:
>
> > Hi,
> > 
> > I have disabled iptables on all nodes using:
> > 
> > iptables -F
> > iptables -X
> > iptables -t nat -F
> > iptables -t nat -X
> > iptables -t mangle -F
> > iptables -t mangle -X
> > iptables -P INPUT ACCEPT
> > iptables -P FORWARD ACCEPT
> > iptables -P OUTPUT ACCEPT
> > 
> > My problem is still there. I have re-enabled iptables. The current output of the "iptables --list" command is:
> > 
> > Chain INPUT (policy ACCEPT)
> > target    prot opt source              destination
> > ACCEPT    udp  --  anywhere            anywhere            udp dpt:domain
> > ACCEPT    tcp  --  anywhere            anywhere            tcp dpt:domain
> > ACCEPT    udp  --  anywhere            anywhere            udp dpt:bootps
> > ACCEPT    tcp  --  anywhere            anywhere            tcp dpt:bootps
> > Chain FORWARD (policy ACCEPT)
> > target    prot opt source              destination
> > ACCEPT    all  --  anywhere            192.168.122.0/24    state RELATED,ESTABLISHED
> > ACCEPT    all  --  192.168.122.0/24    anywhere
> > ACCEPT    all  --  anywhere            anywhere
> > REJECT    all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> > REJECT    all  --  anywhere            anywhere            reject-with icmp-port-unreachable
> > Chain OUTPUT (policy ACCEPT)
> > target    prot opt source              destination
> > I don't think this is it. I have tried to run a simple ping-pong program that I found (it keeps bouncing a value between two processes) and I keep getting the same results: the first Send/Receive pairs (p1 sends to p2, p2 receives and sends back to p1, p1 receives) work, and after that the program just blocks. However, like all the other examples, it works if I launch it with "mpirun -np 2 <ping-pong>" and bounces the value 100 times.
> > 
> > Adrian
> > From: Jeff Squyres <jsquyres@cisco.com>
> > To: adrian sabou <adrian.sabou@yahoo.com>; Open MPI Users <users@open-mpi.org>
> > Sent: Thursday, February 2, 2012 3:09 PM
> > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> >
> > Have you disabled iptables (firewalling) on your nodes?
> >
> > Or, if you want to leave iptables enabled, set it such that all nodes in your cluster are allowed to open TCP connections from any port to any other port.
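> >
> > For example (a sketch only -- 10.0.0.0/24 is a placeholder; use the subnet your cluster nodes actually talk over), on each node:
> >
> >     iptables -I INPUT -p tcp -s 10.0.0.0/24 -j ACCEPT
> >
> > The -I puts the rule at the top of the INPUT chain, so it takes effect before any REJECT rules further down.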
> >
> >
> >
> >
> > On Feb 2, 2012, at 4:49 AM, adrian sabou wrote:
> >
> > > Hi,
> > >
> > > The only example that works is hello_c.c. All others that use MPI_Send and MPI_Recv (connectivity_c.c and ring_c.c) block after the first MPI_Send / MPI_Recv (although the first Send/Receive pair works well for all processes, subsequent Send/Receive pairs block). My SLURM version is 2.1.0. It is also worth mentioning that all examples work when not using SLURM (launching with "mpirun -np 5 <example_app>"). Blocking occurs only when I try to run on multiple hosts with SLURM ("salloc -N5 mpirun <example_app>").
> > >
> > > Adrian
> > >
> > > From: Jeff Squyres <jsquyres@cisco.com>
> > > To: adrian sabou <adrian.sabou@yahoo.com>; Open MPI Users <users@open-mpi.org>
> > > Sent: Wednesday, February 1, 2012 10:32 PM
> > > Subject: Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
> > >
> > > On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:
> > >
> > > > Like I said, a very simple program.
> > > > When launching this application with SLURM (using "salloc -N2 mpirun ./<my_app>"), it hangs at the barrier.
> > >
> > > Are you able to run the MPI example programs in examples/ ?
> > >
> > > > However, it passes the barrier if I launch it without SLURM (using "mpirun -np 2 ./<my_app>"). I first noticed this problem when my application hung if I tried to send two successive messages from one process to another. Only the first MPI_Send would work; the second MPI_Send would block indefinitely. I was wondering whether any of you have encountered a similar problem, or may have an idea as to what is causing the Send/Receive pair to block when using SLURM. The exact output in my console is as follows:
> > > > 
> > > >        salloc: Granted job allocation 1138
> > > >        Process 0 - Sending...
> > > >        Process 1 - Receiving...
> > > >        Process 1 - Received.
> > > >        Process 1 - Barrier reached.
> > > >        Process 0 - Sent.
> > > >        Process 0 - Barrier reached.
> > > >        (it just hangs here)
> > > > 
> > > > I am new to MPI programming and to OpenMPI and would greatly appreciate any help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), and my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1).
> > >
> > > I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4.  0.3.3 would be pretty ancient, no?
> > >


--
Jeff Squyres
jsquyres@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/