I have 3 nodes: one is the master node and the others are compute nodes. These nodes are deployed on the Internet (not in a cluster).
When I run NPB (NASA Parallel Benchmarks) on one node (using 2 processes):
mpirun -np 2 exe.
I get a successful result, but when I run on two nodes (for example, on nodes B and C) it fails:
mpirun -nolocal -hostfile hostfile -np 2 exe.
The failure messages are:
B [0,1,0] connectimeout ,connect() fail errno=110
C [0,1,1] connectimeout ,connect() fail errno=110
But the connection between B and C seems fine, because I can ping and ssh from B to C (and from C to B).
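Note that ping and ssh only show that ICMP and the SSH port are reachable, and errno=110 is ETIMEDOUT on Linux, so it may be worth checking other TCP ports between the nodes as well. Below is a minimal sketch of such a probe in Python. The function name `can_connect` is hypothetical, and the port to test is an assumption: the ports the MPI processes actually use would have to be found separately.

```python
import errno
import socket

def can_connect(host, port, timeout=5.0):
    """Try one TCP connect; return (True, None) or (False, reason)."""
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except socket.timeout:
        # No answer within `timeout` seconds (often a firewall dropping packets)
        return False, "timed out"
    except OSError as exc:
        # Map the errno to its symbolic name where possible, e.g. ECONNREFUSED
        return False, errno.errorcode.get(exc.errno, str(exc))

if __name__ == "__main__":
    # Hypothetical example: probe the SSH port on node C from node B
    print(can_connect("C", 22, timeout=5.0))
```

If a probe to a listening non-SSH port times out while port 22 succeeds, that would point to a firewall or NAT between the nodes rather than to a timeout value that is too small.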
I think this problem may be caused by the connectimeout parameter (perhaps it is so small that the connection fails?), because my nodes are deployed on the Internet and so the latency is higher.
Can anyone help me solve this problem, and tell me how to set the connectimeout in Open MPI?