Open MPI User's Mailing List Archives

From: Arnstein Ressem (aressem_at_[hidden])
Date: 2005-10-26 03:58:41


Hi,

I'm getting the same hangs in my environment and will contribute my
findings. The output from ompi_info, the debug output and the application
source are attached.

When running on one machine and two processes like this:
mpirun -np 2 -mca orte_debug 1 mpitest

The application successfully executes and terminates.

When running on two machines and two processes like this:
mpirun -hostlist nodelist -np 2 -mca orte_debug 1 mpitest

The application hangs. Both the mpitest application and orted are in the
process list on both machines, so they have been started. I have also
tried listing only the local host in the nodelist, and that works.
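
For anyone who wants to repeat the comparison without watching the terminal, here is a minimal sketch of a wrapper that runs a command and reports whether it terminated or exceeded a time limit. The two command lines are copied from this message; the helper names (run_with_timeout, report), the CASES table and the 60-second limit are assumptions made for illustration and are not part of the attached files.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return 'exit N' if it terminates within timeout_s
    seconds, 'hang' if it does not, or 'not found' if the executable
    is missing."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising this.
        return "hang"
    except FileNotFoundError:
        return "not found"
    return f"exit {result.returncode}"


# Command lines as given in this message; "nodelist" is the host file
# used there. The labels are hypothetical.
CASES = {
    "one machine": ["mpirun", "-np", "2", "-mca", "orte_debug", "1", "mpitest"],
    "two machines": ["mpirun", "-hostlist", "nodelist", "-np", "2",
                     "-mca", "orte_debug", "1", "mpitest"],
}


def report(cases, timeout_s):
    """Run every case and return a {label: outcome} dictionary."""
    return {label: run_with_timeout(cmd, timeout_s)
            for label, cmd in cases.items()}
```

Calling report(CASES, 60) should return one outcome per case. Only the mpirun process itself is killed on a timeout, so any orted or mpitest processes left behind may still need to be cleaned up by hand.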

-Arnstein

Jeff Squyres wrote:
> Hugh --
>
> We are actually unable to replicate the problem; we've run some
> single-threaded and multi-threaded apps with no problems. This is
> unfortunately probably symptomatic of bugs that are still remaining in
> the code. :-(
>
> Can you try disabling MPI progress threads (I believe that tcp may be
> the only BTL component that has async progress support implemented
> anyway; sm *may*, but I'd have to go back and check)? Leave MPI threads
> enabled (i.e., MPI_THREAD_MULTIPLE) and see if that gets you further.
>
>
>
> Hugh Merz wrote:
>
>>>It's still only lightly tested. I'm surprised that it totally hangs for
>>>you, though -- what is your simple test program doing?
>>
>>
>>It just initializes mpi (tried both mpi_init and mpi_init_thread), prints
>>a string and exits. It works fine without thread support compiled into
>>ompi.
>>
>>It happens with any mpi program I try.
>>
>>Attaching gdb to each thread of the executable gives:
>>
>>(original process)
>>#0 0x420293d5 in sigsuspend () from /lib/i686/libc.so.6
>>#1 0x401e8609 in __pthread_wait_for_restart_signal () from /lib/i686/libpthread.so.0
>>#2 0x401e4eec in pthread_cond_wait () from /lib/i686/libpthread.so.0
>>#3 0x40bda418 in mca_oob_tcp_msg_wait () from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_oob_tcp.so
>>
>>(thread 1)
>>#0 0x420e01a7 in poll () from /lib/i686/libc.so.6
>>#1 0x401e5c30 in __pthread_manager () from /lib/i686/libpthread.so.0
>>
>>(thread 2)
>>#0 0x420e01a7 in poll () from /lib/i686/libc.so.6
>>#1 0x4013268b in poll_dispatch () from /opt/openmpi-1.0rc2_asynch/lib/libopal.so.0
>>Cannot access memory at address 0x3e8
>>
>>(thread 3)
>>#0 0x420dae14 in read () from /lib/i686/libc.so.6
>>#1 0x401f3b18 in __DTOR_END__ () from /lib/i686/libpthread.so.0
>>#2 0x40c8dfe3 in mca_btl_sm_component_event_thread ()
>> from /opt/openmpi-1.0rc2_asynch/lib/openmpi/mca_btl_sm.so
>>
>>And there are also 2 additional threads spawned by each of mpirun and
>>orted.
>>
>>Any clues or hints on how to debug this would be appreciated, but I
>>understand that it is probably not high priority right now.
>>
>>Thanks,
>>
>>Hugh
>>
>>
>>
>>>Hugh Merz wrote:
>>>
>>>
>>>>Howdy,
>>>>
>>>> I tried installing the release candidate with thread support
>>>>enabled ( --enable-mpi-threads and --enable-progress-threads ) using an
>>>>old rh7.3 install and a recent fc4 install (Intel compilers). When I try
>>>>to run a simple test program, the executable, mpirun and orted all sleep
>>>>in what appears to be a deadlock. If I compile ompi without threads
>>>>everything works fine.
>>>>
>>>> The faq states that thread support has only been lightly tested, and
>>>>there was only brief discussion about it in the maillist 8 months ago -
>>>>have there been any developments, and should I expect it to work properly?
>>>>
>>>>Thanks,
>>>>
>>>>Hugh
>>>>_______________________________________________
>>>>users mailing list
>>>>users_at_[hidden]
>>>>http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>>--
>>>{+} Jeff Squyres
>>>{+} The Open MPI Project
>>>{+} http://www.open-mpi.org/
>>>
>>
>
>
>



  • application/x-compressed-tar attachment: files.tgz