I have run into the same problem.
Attaching gdb showed that the processes were stuck in an (e)poll loop. After I set the network interface with btl_tcp_if_include in ~/.openmpi/mca-params.conf, all hosts worked fine.
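For reference, the setting described above might look like the fragment below in ~/.openmpi/mca-params.conf. The interface name eth0 is only an assumption for illustration; use whichever interface actually connects the hosts (as listed by ip addr or ifconfig), and make sure the same setting is in place on every host.

```
# ~/.openmpi/mca-params.conf
# Restrict the TCP BTL to one interface.
# "eth0" is a placeholder, not taken from the original report.
btl_tcp_if_include = eth0
```

The same parameter can also be passed for a single run on the command line, e.g. mpirun --mca btl_tcp_if_include eth0 ..., which is handy for testing before changing the file.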
I have only been using OpenMPI for a few days. I am trying to run a simple MPI program, ProcessColors, which I got from CI-Tutor. I have 2 hosts. If I run the program separately on each one, it runs well. However, if I run it on both hosts with the following command, the program hangs: mpirun --host host1,host2 --preload-binary -np 8 ProcessColors
When I check the running processes with ps -A, I find that there are 4 processes running on each host. So I suspect a deadlock in my program, but then why does it run well on a single host?
All of the following commands run without any problem on both machines:
- mpirun -np 8 ProcessColors
- mpirun --host host1 -np 8 ProcessColors
- mpirun --host host2 -np 8 ProcessColors
Later, I found that the problem occurs when the remote host tries to send a message to the host where the root process (process 0) is running, which is the host I run the command from. I don't know why the process blocks while sending.
Any help would be much appreciated.