Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-09-12 11:39:26


(I am bringing the OMPI users list back into CC.)

I reproduced the problem with OMPI 1.6.1 and found the cause.
mx_finalize() is called before this error occurs, so the error is
expected: calling mx_connect() after mx_finalize() is invalid.
It looks like the MX component changed significantly between OMPI 1.5
and 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions.
Can somebody comment on what was changed in the MX BTL component in late
1.5 versions?
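For readers who don't know the MX API, here is a minimal sketch of the
ordering rule involved. This is not the OMPI BTL code; the endpoint
parameters are placeholders and the argument lists are written from
memory, so treat it only as an illustration of the valid-versus-invalid
call order:

    /* Illustrative only: every mx_connect() must happen between
     * mx_init() and mx_finalize(), while its endpoint is still open. */
    #include <stdint.h>
    #include <myriexpress.h>

    int main(void)
    {
        mx_endpoint_t ep;
        mx_endpoint_addr_t peer_addr;
        uint64_t peer_nic_id = 0;  /* placeholder; a real program would
                                      resolve the peer's NIC id first */

        mx_init();
        mx_open_endpoint(MX_ANY_NIC, MX_ANY_ENDPOINT, 0x12345,
                         NULL, 0, &ep);

        /* Valid: library initialized, endpoint open. */
        mx_connect(ep, peer_nic_id, 0, 0x12345, MX_INFINITE, &peer_addr);

        mx_close_endpoint(ep);
        mx_finalize();

        /* Invalid: the library has been torn down, so a call like the
         * one below is what produces the unexpected driver errors seen
         * in Doug's trace. */
        /* mx_connect(ep, peer_nic_id, 0, 0x12345, MX_INFINITE, &peer_addr); */

        return 0;
    }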

Brice

On 12/09/2012 15:48, Douglas Eadline wrote:
>> Resending this mail through another SMTP server. Please re-add
>> open-mx-devel to CC when you reply.
>>
>> Brice
>>
>>
>> On 07/09/2012 00:10, Brice Goglin wrote:
>>> Hello Doug,
>>>
>>> Did you use the same Open-MX version when it worked fine? Same kernel
>>> too?
>>> Any chance you built OMPI over an old OMX that would not be compatible
>>> with 1.5.2?
> I double checked, and even rebuilt both Open MPI and MPICH2
> with 1.5.2.
>
> Running on a 4 node cluster with Warewulf provisioning. See below:
>
>>> The error below tells us that the OMX driver and library don't speak the
>>> same language. EBADF is almost never returned by the OMX driver. The only
>>> case is when talking to /dev/open-mx-raw, but normal applications don't
>>> do that. That's why I suspect OMPI is using an old library that
>>> cannot talk to a new driver. We have checks to prevent this, but we never
>>> know.
>>>
>>> Do you see anything in dmesg?
> no
>>> Is omx_info OK?
> yes, it shows:
>
> Open-MX version 1.5.2
> build: deadline_at_limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2
> Mon Sep 10 08:44:16 EDT 2012
>
> Found 1 boards (32 max) supporting 32 endpoints each:
> n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
> managed by driver 'r8169'
>
> Peer table is ready, mapper is 00:00:00:00:00:00
> ================================================
> 0) 00:1a:4d:4a:bf:85 n0:0
> 1) 00:1c:c0:9b:66:d0 n1:0
> 2) 00:1a:4d:4a:bf:83 n2:0
> 3) e0:69:95:35:d7:71 limulus:0
>
>
>
>>> Does a basic omx_perf work? (see
>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
> yes, checked host -> each node, and it works. And MPICH2 compiled
> with the same libraries works. What else
> can I check?
>
> --
> Doug
>
>
>>> Brice
>>>
>>>
>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>> I built Open MPI 1.6.1 using the Open-MX libraries.
>>>> This worked previously, and now I get the following
>>>> error. Here is my system:
>>>>
>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>> open-mx: 1.5.2
>>>>
>>>> BTW, Open-MX worked previously with Open MPI, and the current
>>>> version works with MPICH2.
>>>>
>>>>
>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>> Process 0 on limulus
>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>> [limulus:04448] *** Process received signal ***
>>>> [limulus:04448] Signal: Aborted (6)
>>>> [limulus:04448] Signal code: (-6)
>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0) [0x332462bae0]
>>>> [limulus:04448] [ 5] /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197) [0x7fb587418b37]
>>>> [limulus:04448] [ 6] /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55) [0x7fb58741a5d5]
>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a) [0x7fb587419c7a]
>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c) [0x7fb58741a27c]
>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15) [0x7fb587425865]
>>>> [limulus:04448] [10] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e) [0x7fb5876fe40e]
>>>> [limulus:04448] [11] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4) [0x7fb5876fbd94]
>>>> [limulus:04448] [12] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb) [0x7fb58777d6fb]
>>>> [limulus:04448] [13] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb) [0x7fb58777509b]
>>>> [limulus:04448] [14] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b) [0x7fb58770b55b]
>>>> [limulus:04448] [15] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8) [0x7fb58770b8b8]
>>>> [limulus:04448] [16] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc) [0x7fb587702d8c]
>>>> [limulus:04448] [17] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78) [0x7fb587712e88]
>>>> [limulus:04448] [18] /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130) [0x7fb5876ce1b0]
>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd) [0x332461ecdd]
>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>> [limulus:04448] *** End of error message ***
>>>> Process 2 on limulus
>>>> Process 4 on limulus
>>>> Process 6 on limulus
>>>> Process 1 on n0
>>>> Process 7 on n0
>>>> Process 3 on n0
>>>> Process 5 on n0
>>>> --------------------------------------------------------------------------
>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus exited on signal 6 (Aborted).
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>