Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Problem in remote nodes
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2010-03-31 06:35:44


Those are normal ssh messages, I think - an ssh session may try multiple auth methods before one succeeds.

You're absolutely sure that there's no firewalling software and that SELinux is disabled? OMPI is behaving as if it is trying to communicate and failing (e.g., it's hanging while trying to open some TCP sockets back).

Can you open random TCP sockets between your nodes? (E.g., in non-MPI processes)
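
One way to run that check outside of MPI is a small Python sketch like the one below. It is only an illustration, not an Open MPI tool: the helper names (open_listener, echo_once, probe) and the port number 5000 used in the usage note are assumptions made up for this example.

```python
import socket


def open_listener(port, host="0.0.0.0"):
    """Create a TCP socket listening on (host, port); port 0 lets the OS choose."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(1)
    return srv


def echo_once(srv):
    """Accept a single connection, echo back what it sends, then close the listener."""
    with srv:
        conn, _addr = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(data)


def probe(host, port, timeout=5.0):
    """Return True if a TCP round trip to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ping")
            return s.recv(1024) == b"ping"
    except OSError:
        # Refused, timed out, unreachable, or name lookup failed.
        return False
```

For example, run echo_once(open_listener(5000)) in a Python session on itanium2, then probe("itanium2", 5000) on itanium1, and repeat in the other direction with a few different port numbers. If probe returns False while ssh works, something between the nodes is probably filtering TCP traffic on those ports.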

-jms
Sent from my PDA. No type good.

----- Original Message -----
From: users-bounces_at_[hidden] <users-bounces_at_[hidden]>
To: Open MPI Users <users_at_[hidden]>
Sent: Wed Mar 31 06:25:43 2010
Subject: Re: [OMPI users] Problem in remote nodes

I've been checking /var/log/messages on the compute node and there is
nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out',
but the following messages appear in /var/log/messages on the remote node
(nothing about unix_chkpwd):

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1 user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from 192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message
it connects with the node...
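
To confirm that pattern over a longer log, a rough Python sketch such as the following could report which sshd processes logged an authentication failure and then an "Accepted" message under the same PID. The function name and the regular expression are assumptions written for the syslog format shown in the excerpt above; other sshd or PAM versions may log differently.

```python
import re

# Matches e.g. "sshd[15349]: Accepted ..." and
# "sshd(pam_unix)[15349]: authentication failure; ...",
# capturing the PID and the message text after the colon.
SSHD_LINE = re.compile(r"sshd(?:\([^)]*\))?\[(\d+)\]:\s*(.*)")


def failed_then_accepted(lines):
    """Return the set of sshd PIDs that logged a failure and later an 'Accepted' line."""
    failed = set()
    recovered = set()
    for line in lines:
        match = SSHD_LINE.search(line)
        if not match:
            continue  # not an sshd line (or a wrapped continuation line)
        pid, message = match.group(1), match.group(2)
        if message.startswith("authentication failure"):
            failed.add(pid)
        elif message.startswith("Accepted") and pid in failed:
            recovered.add(pid)
    return recovered
```

Called as failed_then_accepted(open("/var/log/messages")), a PID that shows up in the result failed one auth method and then succeeded with another inside a single ssh session, which would not by itself explain a hang.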

On Tue, March 30, 2010, 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora Core 9. I believe the
> issue may be with SELinux, but this is just an educated guess. In my
> setup, shortly after a login via MPI, there is an entry in
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45 <node_name> kernel: type=1400 audit(1269970785.534:588): avc: denied { read } for pid=8047 comm="unix_chkpwd" name="hosts" dev=dm-0 ino=24579 scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check if allowing unix_chkpwd read access to hosts
> eliminates the problem on my system, and if it works, I'll post the
> steps involved.
>
> uriz.49949_at_[hidden] wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2 helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no register function
>> [itanium1:08311] mca: base: components_open: component rsh open function successful
>> [itanium1:08311] mca: base: components_open: found loaded component slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no register function
>> [itanium1:08311] mca: base: components_open: component slurm open function successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:( plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [itanium1:08311] mca:base:select:( plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>> [itanium1:08311] mca:base:select:( plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> Could this be a SLURM problem?
>>
>> Thanks for any ideas
>>
>> On Fri, March 19, 2010, 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and processes,
>>> so you aren't getting to the BTLs at all.
>>>
>>>
>>> On Mar 19, 2010, at 9:42 AM, uriz.49949_at_[hidden] wrote:
>>>
>>>
>>>> The processes are running on the remote nodes but they don't send any
>>>> response back to the origin node. I don't know why.
>>>> With the option --mca btl_base_verbose 30 I have the same problem, and
>>>> it doesn't show any messages.
>>>>
>>>> Thanks
>>>>
>>>>
>>>>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres <jsquyres_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>> On Mar 17, 2010, at 4:39 AM, <uriz.49949_at_[hidden]> wrote:
>>>>>>
>>>>>>
>>>>>>> Hi everyone, I'm a new Open MPI user and I have just installed Open
>>>>>>> MPI on a 6-node cluster running Scientific Linux. When I run it
>>>>>>> locally it works perfectly, but when I try to run it on the remote
>>>>>>> nodes with the --host option it hangs and gives no message. I think
>>>>>>> the problem could be with the shared libraries, but I'm not sure. In
>>>>>>> my opinion the problem is not ssh, because I can access the nodes
>>>>>>> without a password.
>>>>>>>
>>>>>> You might want to check that Open MPI processes are actually running
>>>>>> on the remote nodes -- check with ps if you see any "orted" or other
>>>>>> MPI-related processes (e.g., your processes).
>>>>>>
>>>>>> Do you have any TCP firewall software running between the nodes? If
>>>>>> so, you'll need to disable it (at least for Open MPI jobs).
>>>>>>
>>>>> I also recommend running mpirun with the option
>>>>> --mca btl_base_verbose 30 to troubleshoot TCP issues.
>>>>>
>>>>> In some environments, you need to explicitly tell mpirun what network
>>>>> interfaces it can use to reach the hosts. Read the following FAQ
>>>>> section for more information:
>>>>>
>>>>> http://www.open-mpi.org/faq/?category=tcp
>>>>>
>>>>> Item 7 of the FAQ might be of special interest.
>>>>>
>>>>> Regards,
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users