Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Non-root install; hang there running on multiple nodes
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2010-03-25 11:12:22


You probably do want to check with your admin and ensure that either firewalling software is disabled or that a trust relationship is set up between the machines that you want to use. Effectively, Open MPI needs to be able to open random TCP ports between all the hosts that you will be using.
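
As a rough way to see whether TCP connections between two nodes get through at all, a non-root user could probe a port by hand. The sketch below is only an illustration: `can_connect` is a hypothetical helper, it assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command, and it only tells you something if a process is actually listening on the probed port on the target host. A refused connection can mean a firewall, but it can also mean nothing is listening.

```shell
# can_connect is a hypothetical helper: it succeeds (exit status 0) if a
# TCP connection to host $1 on port $2 can be opened within 3 seconds.
# Assumes bash (for /dev/tcp) and the coreutils `timeout` command.
can_connect() {
    # Inside the child bash, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}
```

For example, with a test listener running on intel02 on some agreed port (10000 here is just a placeholder), running `can_connect intel02 10000 && echo reachable` from intel01 would show whether that one port is open between the two hosts.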

There are controls to restrict Open MPI's TCP port selection, but it's generally easier if you can just disable firewalling or set up trust between machines.
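
A minimal sketch of what such a restriction might look like, assuming the TCP BTL parameters `btl_tcp_port_min_v4` and `btl_tcp_port_range_v4` are available in your build (check with `ompi_info --param btl tcp`). The port values are placeholders, and `tcp_port_args` is a hypothetical helper used only to keep the command readable. Open MPI's out-of-band runtime connections have their own separate settings, which this sketch does not cover.

```shell
# Assumed placeholder values: agree on the real range with your admin.
PORT_MIN=10000   # first port the TCP BTL may use
PORT_RANGE=100   # how many ports, counting up from PORT_MIN

# tcp_port_args is a hypothetical helper, not part of Open MPI.
# It prints the mpirun options that confine the TCP BTL to the given range.
tcp_port_args() {
    printf '%s\n' "--mca btl_tcp_port_min_v4 $1 --mca btl_tcp_port_range_v4 $2"
}

# Print the full command for review rather than running it.
echo "mpirun $(tcp_port_args "$PORT_MIN" "$PORT_RANGE") --hostfile hosts hostname"
```

The admin would then only need to open that one range between the nodes, rather than disabling the firewall entirely.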

On Mar 24, 2010, at 4:45 PM, haoanyi wrote:

> I run a program with the following command line, and obtain the error message
> mpirun -x LD_LIBRARY_PATH=/home/haoanyi1/socIntel/goto --prefix /home/haoanyi1/openmpi1.4.1 -np 2 -host intel01,intel02 -rf hosts ./main 62 62 tests/ > newtest_64x64_np2_omp
>
> [btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] connect() to 192.168.122.1 failed: Connection refused (111)
>
> In the hosts file, I use the following to do CPU mapping:
> rank 0=intel01 slot=0
> rank 1=intel02 slot=1
>
> This file is different from the hosts file that I use with mpirun --hostfile hosts hostname, which reads like
> intel01
> intel02
> ......
>
> 2010-03-25 04:33:24, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>
> >Can you mpirun non-MPI applications, like "hostname"? I frequently run this as a first step to debugging a wonky install. For example:
> >
> >shell$ hostname
> >barney
> >shell$ mpirun hostname
> >barney
> >shell$ cat hosts
> >barney
> >rubble
> >shell$ mpirun --hostfile hosts hostname
> >barney
> >rubble
> >shell$
> >
> >
> >On Mar 24, 2010, at 4:28 PM, haoanyi wrote:
> >
> >> Hi,
> >>
> >> I installed OpenMPI1.4.1 as a non-root user on a cluster. It is totally OK when I run with mpirun or mpiexec on one single node for many processes. However, when I launch many processes on multiple nodes, I can observe jobs are distributed to those nodes (by using "top"), but all the jobs just hang there and cannot finish.
> >>
> >> I think the nodes use TCP to communicate with each other. This cluster also provides MPICH2, which was configured by the sys admin., and has no problem to do node communication in MPICH2. Besides, I read from some posts, which says this may be caused by TCP firewall. Since I have no root's right, and I don't know what I shall request the admin. to do to fix this problem. So, can you tell me how to do that either by the admin root or by the non-root user (if possible)?
> >>
> >> Thank you very much.
> >> Hao
> >>
> >>
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> >--
> >Jeff Squyres
> >jsquyres_at_[hidden]
> >For corporate legal information go to:
> >http://www.cisco.com/web/about/doing_business/legal/cri/
> >
> >
> >_______________________________________________
> >users mailing list
> >users_at_[hidden]
> >http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/