Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] OpenMPI job launch failures
From: Ralph Castain (rhc_at_[hidden])
Date: 2013-02-14 14:59:06


Rats - sorry.

I seem to recall fixing something in 1.6 that might relate to this - a race condition in the startup. You might try updating to the 1.6.4 release candidate.

On Feb 14, 2013, at 11:04 AM, Bharath Ramesh <bramesh_at_[hidden]> wrote:

> When I set OPAL_OUTPUT_STDERR_FD=0 I receive a whole bunch of
> mca_oob_tcp_message_recv_complete: invalid message type errors,
> and the job just hangs even though all the nodes have fired off
> the MPI application.
>
>
> --
> Bharath
>
> On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
>> I don't think this is documented anywhere, but it is an available trick (not sure if it is in 1.6.1, but might be): if you set OPAL_OUTPUT_STDERR_FD=N in your environment, we will direct all our error outputs to that file descriptor. If it is "0", then it goes to stdout.
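>>
>> For example, something along these lines in a bash shell (the process count and program name are just placeholders):
>>
>>    export OPAL_OUTPUT_STDERR_FD=0
>>    mpirun -np 16 ./mpi_hello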
>>
>> Might be worth a try?
>>
>>
>> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bramesh_at_[hidden]> wrote:
>>
>>> Is there any way to prevent output from more than one node
>>> being written to the same line? I tried setting --output-filename,
>>> which didn't help; for some reason only stdout was written to the
>>> files. That makes an output file close to 6M a little hard to
>>> read.
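>>>
>>> For reference, I mean an invocation roughly like this (the filename and process count are placeholders):
>>>
>>>    mpirun --output-filename joblog -np 16 ./mpi_hello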
>>>
>>> --
>>> Bharath
>>>
>>> On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
>>>> Sounds like the orteds aren't reporting back to mpirun after launch. The MPI_proctable observation just means that the procs didn't launch in those cases where it is absent, which is something you already observed.
>>>>
>>>> Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted report back to mpirun after it launches. If not, then it is likely that something is blocking it.
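>>>>
>>>> For example (the process count and program name are just placeholders):
>>>>
>>>>    mpirun -mca plm_base_verbose 5 -np 16 ./mpi_hello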
>>>>
>>>> You could also try updating to 1.6.3/4 in case there is some race condition in 1.6.1, though we haven't heard of one to date.
>>>>
>>>>
>>>> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bramesh_at_[hidden]> wrote:
>>>>
>>>>> On our cluster we are noticing intermittent job launch failures when using OpenMPI. We are currently using OpenMPI-1.6.1 on our cluster and it is integrated with Torque-4.1.3. It fails even for a simple MPI hello world application. The issue is that orted gets launched on all the nodes, but there are a bunch of nodes that don't launch the actual MPI application. There are no errors reported when the job gets killed because the walltime expires. Enabling --debug-daemons doesn't show any errors either. The only difference is that successful runs have MPI_proctable listed and for failures it is absent. Any help in debugging this issue is greatly appreciated.
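>>>>>
>>>>> For reference, the launch inside the Torque job script looks roughly like this (the process count and program name are placeholders):
>>>>>
>>>>>    mpirun --debug-daemons -np 16 ./mpi_hello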
>>>>>
>>>>> --
>>>>> Bharath
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users