
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
From: Brice Goglin (Brice.Goglin_at_[hidden])
Date: 2012-09-12 11:39:26


(I am bringing back OMPI users to CC)

I reproduced the problem with OMPI 1.6.1 and found the cause.
mx_finalize() is called before this error occurs, so the error is
expected: calling mx_connect() after mx_finalize() is invalid.
It looks like the MX component changed significantly between OMPI 1.5
and 1.6.1, and I am pretty sure it worked fine with early 1.5.x
versions. Can somebody comment on what was changed in the MX BTL
component in late 1.5 versions?

Brice

On 12/09/2012 15:48, Douglas Eadline wrote:
>> Resending this mail again, with another SMTP. Please re-add
>> open-mx-devel to CC when you reply.
>>
>> Brice
>>
>>
>> On 07/09/2012 00:10, Brice Goglin wrote:
>>> Hello Doug,
>>>
>>> Did you use the same Open-MX version when it worked fine? Same kernel
>>> too?
>>> Any chance you built OMPI over an old OMX that would not be compatible
>>> with 1.5.2?
> I double checked, and even rebuilt both Open MPI and MPICH2
> with 1.5.2.
>
> Running on a 4 node cluster with Warewulf provisioning. See below:
>
>>> The error below means that the OMX driver and library don't speak the
>>> same language. EBADF is almost never returned from the OMX driver. The
>>> only case is when talking to /dev/open-mx-raw, but normal applications
>>> don't do that. That's why I suspect OMPI is using an old library that
>>> cannot talk to a new driver. We have checks to prevent this, but you
>>> never know.
>>>
>>> Do you see anything in dmesg?
> no
>>> Is omx_info OK?
> yes, it shows:
>
> Open-MX version 1.5.2
> build: deadline_at_limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2
> Mon Sep 10 08:44:16 EDT 2012
>
> Found 1 boards (32 max) supporting 32 endpoints each:
> n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
> managed by driver 'r8169'
>
> Peer table is ready, mapper is 00:00:00:00:00:00
> ================================================
> 0) 00:1a:4d:4a:bf:85 n0:0
> 1) 00:1c:c0:9b:66:d0 n1:0
> 2) 00:1a:4d:4a:bf:83 n2:0
> 3) e0:69:95:35:d7:71 limulus:0
>
>
>
>>> Does a basic omx_perf work? (see
>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
> yes, I checked host -> each node and it works. And MPICH2 compiled
> with the same libraries also works. What else can I check?
>
> --
> Doug
>
>
>>> Brice
>>>
>>>
>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>> I built open-mpi 1.6.1 using the open-mx libraries.
>>>> This worked previously and now I get the following
>>>> error. Here is my system:
>>>>
>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>> open-mx: 1.5.2
>>>>
>>>> BTW, open-mx worked previously with open-mpi and the current
>>>> version works with mpich2
>>>>
>>>>
>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>> Process 0 on limulus
>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file
>>>> descriptor
>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion
>>>> `0'
>>>> failed.
>>>> [limulus:04448] *** Process received signal ***
>>>> [limulus:04448] Signal: Aborted (6)
>>>> [limulus:04448] Signal code: (-6)
>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0)
>>>> [0x332462bae0]
>>>> [limulus:04448] [ 5]
>>>> /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197)
>>>> [0x7fb587418b37]
>>>> [limulus:04448] [ 6]
>>>> /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55)
>>>> [0x7fb58741a5d5]
>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a)
>>>> [0x7fb587419c7a]
>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c)
>>>> [0x7fb58741a27c]
>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15)
>>>> [0x7fb587425865]
>>>> [limulus:04448] [10]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e)
>>>> [0x7fb5876fe40e]
>>>> [limulus:04448] [11]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4)
>>>> [0x7fb5876fbd94]
>>>> [limulus:04448] [12]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb)
>>>> [0x7fb58777d6fb]
>>>> [limulus:04448] [13]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb)
>>>> [0x7fb58777509b]
>>>> [limulus:04448] [14]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b)
>>>> [0x7fb58770b55b]
>>>> [limulus:04448] [15]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8)
>>>> [0x7fb58770b8b8]
>>>> [limulus:04448] [16]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)
>>>> [0x7fb587702d8c]
>>>> [limulus:04448] [17]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78)
>>>> [0x7fb587712e88]
>>>> [limulus:04448] [18]
>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130)
>>>> [0x7fb5876ce1b0]
>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd)
>>>> [0x332461ecdd]
>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>> [limulus:04448] *** End of error message ***
>>>> Process 2 on limulus
>>>> Process 4 on limulus
>>>> Process 6 on limulus
>>>> Process 1 on n0
>>>> Process 7 on n0
>>>> Process 3 on n0
>>>> Process 5 on n0
>>>> --------------------------------------------------------------------------
>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus
>>>> exited
>>>> on signal 6 (Aborted).
>>>> --------------------------------------------------------------------------
>>>>
>>>>
>>>>
>>
>