Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Problem in remote nodes
From: Robert Collyer (rcollyer_at_[hidden])
Date: 2010-04-07 11:37:02


Jeff,
In my case, it was the firewall. It was restricting communication between
the compute nodes to ssh only. I appreciate the help.
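
For anyone who finds this thread later: the sort of rule that opens TCP back
up between the nodes looks roughly like the following. This is only a sketch;
the 192.168.3.0/24 subnet is an assumption based on the addresses quoted
below, so adjust it (and the save step) to your own cluster.

   # allow all TCP traffic from the other compute nodes (subnet assumed)
   iptables -I INPUT -p tcp -s 192.168.3.0/24 -j ACCEPT
   # persist the rule on RHEL/Fedora-style systems
   service iptables save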

Rob

Jeff Squyres (jsquyres) wrote:
>
> Those are normal ssh messages, I think - an ssh session may try
> multiple auth methods before one succeeds.
>
> You're absolutely sure that there's no firewalling software and
> SELinux is disabled? Open MPI is behaving as if it is trying to
> communicate and failing (e.g., it's hanging while trying to open some
> tcp sockets back).
>
> Can you open random tcp sockets between your nodes? (E.g., in non-mpi
> processes)
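>
> For example, a quick non-MPI sanity check with netcat (assuming nc is
> installed on both nodes; the port number is arbitrary, and the listen
> syntax may be "nc -l 5000" depending on your netcat variant):
>
>    # on itanium2: listen on an unused TCP port
>    nc -l -p 5000
>
>    # on itanium1: connect to it; text typed here should appear on itanium2
>    nc itanium2 5000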
>
> -jms
> Sent from my PDA. No type good.
>
> ----- Original Message -----
> From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
> To: Open MPI Users <users_at_[hidden]>
> Sent: Wed Mar 31 06:25:43 2010
> Subject: Re: [OMPI users] Problem in remote nodes
>
> I've been checking /var/log/messages on the compute node and there is
> nothing new after executing 'mpirun --host itanium2 -np 2
> helloworld.out', but the following messages appear in the
> /var/log/messages file on the remote node; there is nothing about
> unix_chkpwd.
>
> Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure;
> logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1 user=otro
> Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from
> 192.168.3.1 port 40999 ssh2
> Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user
> otro by (uid=500)
> Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for
> user otro
>
> It seems that the authentication fails at first, but in the next message
> it connects with the node...
>
> On Tue, March 30, 2010, 20:02, Robert Collyer wrote:
> > I've been having similar problems using Fedora Core 9. I believe the
> > issue may be with SELinux, but this is just an educated guess. In my
> > setup, shortly after a login via MPI, there is an entry in
> > /var/log/messages on the compute node as follows:
> >
> > Mar 30 12:39:45 <node_name> kernel: type=1400 audit(1269970785.534:588):
> > avc: denied { read } for pid=8047 comm="unix_chkpwd" name="hosts"
> > dev=dm-0 ino=24579
> > scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> > tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
> >
> > which says SELinux denied unix_chkpwd read access to hosts.
> >
> > Are you getting anything like this?
> >
> > In the meantime, I'll check if allowing unix_chkpwd read access to hosts
> > eliminates the problem on my system, and if it works, I'll post the
> > steps involved.
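> >
> > (For reference, the generic recipe for that kind of exception is to build
> > a local policy module from the logged denial, roughly as below. This is a
> > sketch I have not verified yet; the module name is arbitrary.)
> >
> >    # turn the logged AVC denial into a local policy module and load it
> >    grep unix_chkpwd /var/log/messages | audit2allow -M chkpwd_hosts
> >    semodule -i chkpwd_hosts.pp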
> >
> > uriz.49949_at_[hidden] wrote:
> >> I've been investigating and there is no firewall that could stop TCP
> >> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
> >> the following output:
> >>
> >> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2 helloworld.out
> >> [itanium1:08311] mca: base: components_open: Looking for plm components
> >> [itanium1:08311] mca: base: components_open: opening plm components
> >> [itanium1:08311] mca: base: components_open: found loaded component rsh
> >> [itanium1:08311] mca: base: components_open: component rsh has no register function
> >> [itanium1:08311] mca: base: components_open: component rsh open function successful
> >> [itanium1:08311] mca: base: components_open: found loaded component slurm
> >> [itanium1:08311] mca: base: components_open: component slurm has no register function
> >> [itanium1:08311] mca: base: components_open: component slurm open function successful
> >> [itanium1:08311] mca:base:select: Auto-selecting plm components
> >> [itanium1:08311] mca:base:select:( plm) Querying component [rsh]
> >> [itanium1:08311] mca:base:select:( plm) Query of component [rsh] set
> >> priority to 10
> >> [itanium1:08311] mca:base:select:( plm) Querying component [slurm]
> >> [itanium1:08311] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
> >> [itanium1:08311] mca:base:select:( plm) Selected component [rsh]
> >> [itanium1:08311] mca: base: close: component slurm closed
> >> [itanium1:08311] mca: base: close: unloading component slurm
> >>
> >> --Hangs here
> >>
> >> Could it be a slurm problem?
> >>
> >> Thanks for any ideas
> >>
> >> On Fri, March 19, 2010, 17:57, Ralph Castain wrote:
> >>
> >>> Did you configure OMPI with --enable-debug? You should do this so that
> >>> more diagnostic output is available.
> >>>
> >>> You can also add the following to your cmd line to get more info:
> >>>
> >>> --debug --debug-daemons --leave-session-attached
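> >>>
> >>> For example (the install prefix below is only a placeholder, and the
> >>> hosts are the ones from this thread):
> >>>
> >>>    # rebuild Open MPI with debugging support
> >>>    ./configure --enable-debug --prefix=/opt/openmpi
> >>>    make all install
> >>>
> >>>    # relaunch with the daemon-related flags above
> >>>    mpirun --debug-daemons --leave-session-attached --host itanium2 -np 2 helloworld.out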
> >>>
> >>> Something is likely blocking proper launch of the daemons and
> >>> processes, so you aren't getting to the btl's at all.
> >>>
> >>>
> >>> On Mar 19, 2010, at 9:42 AM, uriz.49949_at_[hidden] wrote:
> >>>
> >>>
> >>>> The processes are running on the remote nodes, but they never send a
> >>>> response back to the origin node. I don't know why. With the option
> >>>> --mca btl_base_verbose 30 I have the same problem, and it doesn't show
> >>>> any additional messages.
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres <jsquyres_at_[hidden]>
> >>>>> wrote:
> >>>>>
> >>>>>> On Mar 17, 2010, at 4:39 AM, <uriz.49949_at_[hidden]> wrote:
> >>>>>>
> >>>>>>
> >>>>>>> Hi everyone, I'm a new Open MPI user and I have just installed
> >>>>>>> Open MPI on a 6-node cluster with Scientific Linux. When I run it
> >>>>>>> locally it works perfectly, but when I try to run it on the remote
> >>>>>>> nodes with the --host option it hangs and gives no message. I think
> >>>>>>> that the problem could be with the shared libraries, but I'm not
> >>>>>>> sure. In my opinion the problem is not ssh, because I can access
> >>>>>>> the nodes without a password.
> >>>>>>>
> >>>>>> You might want to check that Open MPI processes are actually
> >>>>>> running on the remote nodes -- check with ps if you see any "orted"
> >>>>>> or other MPI-related processes (e.g., your processes).
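> >>>>>>
> >>>>>> For instance, something like this on each remote node while the job
> >>>>>> is hanging ("helloworld" is just the executable name from this thread):
> >>>>>>
> >>>>>>    ps -ef | egrep 'orted|helloworld' | grep -v grep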
> >>>>>>
> >>>>>> Do you have any TCP firewall software running between the nodes?
> >>>>>> If so, you'll need to disable it (at least for Open MPI jobs).
> >>>>>>
> >>>>> I also recommend running mpirun with the option
> >>>>> --mca btl_base_verbose 30 to troubleshoot tcp issues.
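> >>>>>
> >>>>> For example, with the hosts from this thread:
> >>>>>
> >>>>>    mpirun --mca btl_base_verbose 30 --host itanium2 -np 2 helloworld.out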
> >>>>>
> >>>>> In some environments, you need to explicitly tell mpirun what
> >>>>> network interfaces it can use to reach the hosts. Read the following
> >>>>> FAQ section for more information:
> >>>>>
> >>>>> http://www.open-mpi.org/faq/?category=tcp
> >>>>>
> >>>>> Item 7 of the FAQ might be of special interest.
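> >>>>>
> >>>>> For example, restricting the TCP BTL to a specific interface (eth0 is
> >>>>> only a guess at your cluster-facing interface; substitute your own):
> >>>>>
> >>>>>    mpirun --mca btl_tcp_if_include eth0 --host itanium2 -np 2 helloworld.out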
> >>>>>
> >>>>> Regards,
> >>>>>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users