Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] [omx-devel] Open-mx issue with ompi 1.6.1
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-09-12 12:36:17


I don't recall any major modification in the MX BTL for 1.6, with the exception of the rework of the initialization part. Those patches dealt with avoiding the double initialization (BTL and MTL), so we might want to start looking at those.

mx_finalize is called only from ompi_common_mx_finalize, which in turn is called from both the MTL and the BTL modules. Since we manage reference counts in the MX init/finalize, I can only picture two possible scenarios:

1. Somehow there are pending communications and they are activated after MPI_Finalize (very unlikely)

2. Either the MTL or the BTL initialization calls ompi_common_mx_finalize without a corresponding successful call to ompi_common_mx_init. As the BTL is the one segfaulting, I guess the culprit is the MTL.
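
To illustrate the second scenario, here is a minimal sketch of a reference-counted init/finalize pair shared by two components. It is not the Open MPI source: every name in it (fake_lib_init, fake_connect, common_init, component_t, run_scenario, and so on) is a hypothetical stand-in, and the real ompi_common_mx_init/ompi_common_mx_finalize may be structured differently. It only shows how one unmatched finalize can shut the library down while another component still expects to use it, and how a per-component "did I initialize?" flag would avoid that.

```c
#include <assert.h>

/* Hypothetical stand-ins for mx_init(), mx_finalize() and mx_connect(). */
static int lib_active = 0;  /* 1 while the fake library is initialized */
static int refcount   = 0;  /* successful inits not yet matched by a finalize */

static void fake_lib_init(void)     { lib_active = 1; }
static void fake_lib_finalize(void) { lib_active = 0; }
static int  fake_connect(void)      { return lib_active ? 0 : -1; }

/* Shared init: only the first caller initializes the library. */
static int common_init(void)
{
    assert(refcount >= 0);
    if (refcount == 0)
        fake_lib_init();
    refcount++;
    return 0;
}

/* Shared finalize: the caller that drops the count to zero shuts down. */
static void common_finalize(void)
{
    refcount--;
    if (refcount == 0)
        fake_lib_finalize();
}

/* Hypothetical per-component state (think "BTL" and "MTL"). */
typedef struct { int did_init; } component_t;

static int component_open(component_t *c)
{
    if (common_init() != 0)
        return -1;
    c->did_init = 1;
    return 0;
}

/* Buggy close: finalizes even if this component never initialized. */
static void component_close_buggy(component_t *c)
{
    (void)c;
    common_finalize();
}

/* Guarded close: finalizes only after a successful open. */
static void component_close_guarded(component_t *c)
{
    if (c->did_init) {
        common_finalize();
        c->did_init = 0;
    }
}

/* Returns what the "BTL" sees from fake_connect() after the "MTL",
 * which never opened successfully, runs its close path. */
static int run_scenario(int guarded)
{
    component_t btl = {0}, mtl = {0};

    lib_active = 0;
    refcount = 0;
    component_open(&btl);              /* refcount becomes 1 */
    if (guarded)
        component_close_guarded(&mtl); /* no-op: mtl never opened */
    else
        component_close_buggy(&mtl);   /* refcount drops to 0 */
    return fake_connect();             /* BTL connects afterwards */
}
```

In this sketch, run_scenario(0) leaves the fake library finalized before the connect attempt, which is the same ordering Brice observed (mx_connect after mx_finalize), while run_scenario(1) leaves it active.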

I'll take a more detailed look at this.

  george.

On Sep 12, 2012, at 17:39, Brice Goglin <Brice.Goglin_at_[hidden]> wrote:

> (I am bringing back OMPI users to CC)
>
> I reproduced the problem with OMPI 1.6.1 and found the cause.
> mx_finalize() is called before this error occurs, so the error is
> expected: calling mx_connect() after mx_finalize() is invalid.
> It looks like the MX component changed significantly between OMPI 1.5 and
> 1.6.1, and I am pretty sure it worked fine with early 1.5.x versions.
> Can somebody comment on what was changed in the MX BTL component in late
> 1.5 versions?
>
> Brice
>
>
>
>
>
> On 12/09/2012 15:48, Douglas Eadline wrote:
>>> Resending this mail again, with another SMTP. Please re-add
>>> open-mx-devel to CC when you reply.
>>>
>>> Brice
>>>
>>>
>>> On 07/09/2012 00:10, Brice Goglin wrote:
>>>> Hello Doug,
>>>>
>>>> Did you use the same Open-MX version when it worked fine? Same kernel
>>>> too?
>>>> Any chance you built OMPI over an old OMX that would not be compatible
>>>> with 1.5.2?
>> I double checked, and even rebuilt both Open MPI and MPICH2
>> with 1.5.2.
>>
>> Running on a 4 node cluster with Warewulf provisioning. See below:
>>
>>>> The error below says that the OMX driver and lib don't speak the same
>>>> language. EBADF is almost never returned from the OMX driver. The only
>>>> case is when talking to /dev/open-mx-raw, but normal applications don't
>>>> do this. That's why I am thinking about OMPI using an old library that
>>>> cannot talk to a new driver. We have checks to prevent this, but you
>>>> never know.
>>>>
>>>> Do you see anything in dmesg?
>> no
>>>> Is omx_info OK?
>> yes shows:
>>
>> Open-MX version 1.5.2
>> build: deadline_at_limulus:/raid1/home/deadline/rpms-sl6/BUILD/open-mx-1.5.2
>> Mon Sep 10 08:44:16 EDT 2012
>>
>> Found 1 boards (32 max) supporting 32 endpoints each:
>> n0:0 (board #0 name eth0 addr 00:1a:4d:4a:bf:85)
>> managed by driver 'r8169'
>>
>> Peer table is ready, mapper is 00:00:00:00:00:00
>> ================================================
>> 0) 00:1a:4d:4a:bf:85 n0:0
>> 1) 00:1c:c0:9b:66:d0 n1:0
>> 2) 00:1a:4d:4a:bf:83 n2:0
>> 3) e0:69:95:35:d7:71 limulus:0
>>
>>
>>
>>>> Does a basic omx_perf work? (see
>>>> http://open-mx.gforge.inria.fr/FAQ/index-1.5.html#perf-omxperf)
>> yes, I checked from the host to each node and it works. And mpich2
>> compiled with the same libraries works. What else
>> can I check?
>>
>> --
>> Doug
>>
>>
>>>> Brice
>>>>
>>>>
>>>> On 06/09/2012 23:04, Douglas Eadline wrote:
>>>>> I built open-mpi 1.6.1 using the open-mx libraries.
>>>>> This worked previously and now I get the following
>>>>> error. Here is my system:
>>>>>
>>>>> kernel: 2.6.32-279.5.1.el6.x86_64
>>>>> open-mx: 1.5.2
>>>>>
>>>>> BTW, open-mx worked previously with open-mpi and the current
>>>>> version works with mpich2
>>>>>
>>>>>
>>>>> $ mpiexec -np 8 -machinefile machines cpi
>>>>> Process 0 on limulus
>>>>> FatalError: Failed to lookup peer by addr, driver replied Bad file descriptor
>>>>> cpi: ../omx_misc.c:89: omx__ioctl_errno_to_return_checked: Assertion `0' failed.
>>>>> [limulus:04448] *** Process received signal ***
>>>>> [limulus:04448] Signal: Aborted (6)
>>>>> [limulus:04448] Signal code: (-6)
>>>>> [limulus:04448] [ 0] /lib64/libpthread.so.0() [0x3324e0f500]
>>>>> [limulus:04448] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x33246328a5]
>>>>> [limulus:04448] [ 2] /lib64/libc.so.6(abort+0x175) [0x3324634085]
>>>>> [limulus:04448] [ 3] /lib64/libc.so.6() [0x332462ba1e]
>>>>> [limulus:04448] [ 4] /lib64/libc.so.6(__assert_perror_fail+0)
>>>>> [0x332462bae0]
>>>>> [limulus:04448] [ 5]
>>>>> /usr/open-mx/lib/libopen-mx.so.0(omx__ioctl_errno_to_return_checked+0x197)
>>>>> [0x7fb587418b37]
>>>>> [limulus:04448] [ 6]
>>>>> /usr/open-mx/lib/libopen-mx.so.0(omx__peer_addr_to_index+0x55)
>>>>> [0x7fb58741a5d5]
>>>>> [limulus:04448] [ 7] /usr/open-mx/lib/libopen-mx.so.0(+0xdc7a)
>>>>> [0x7fb587419c7a]
>>>>> [limulus:04448] [ 8] /usr/open-mx/lib/libopen-mx.so.0(omx_connect+0x8c)
>>>>> [0x7fb58741a27c]
>>>>> [limulus:04448] [ 9] /usr/open-mx/lib/libopen-mx.so.0(mx_connect+0x15)
>>>>> [0x7fb587425865]
>>>>> [limulus:04448] [10]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_proc_connect+0x5e)
>>>>> [0x7fb5876fe40e]
>>>>> [limulus:04448] [11]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_btl_mx_send+0x2d4)
>>>>> [0x7fb5876fbd94]
>>>>> [limulus:04448] [12]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_send_request_start_prepare+0xcb)
>>>>> [0x7fb58777d6fb]
>>>>> [limulus:04448] [13]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_pml_ob1_isend+0x4cb)
>>>>> [0x7fb58777509b]
>>>>> [limulus:04448] [14]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_generic+0x37b)
>>>>> [0x7fb58770b55b]
>>>>> [limulus:04448] [15]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_binomial+0xd8)
>>>>> [0x7fb58770b8b8]
>>>>> [limulus:04448] [16]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(ompi_coll_tuned_bcast_intra_dec_fixed+0xcc)
>>>>> [0x7fb587702d8c]
>>>>> [limulus:04448] [17]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(mca_coll_sync_bcast+0x78)
>>>>> [0x7fb587712e88]
>>>>> [limulus:04448] [18]
>>>>> /opt/mpi/openmpi-gnu4/lib64/libmpi.so.1(MPI_Bcast+0x130)
>>>>> [0x7fb5876ce1b0]
>>>>> [limulus:04448] [19] cpi(main+0x10b) [0x400cc4]
>>>>> [limulus:04448] [20] /lib64/libc.so.6(__libc_start_main+0xfd)
>>>>> [0x332461ecdd]
>>>>> [limulus:04448] [21] cpi() [0x400ac9]
>>>>> [limulus:04448] *** End of error message ***
>>>>> Process 2 on limulus
>>>>> Process 4 on limulus
>>>>> Process 6 on limulus
>>>>> Process 1 on n0
>>>>> Process 7 on n0
>>>>> Process 3 on n0
>>>>> Process 5 on n0
>>>>> --------------------------------------------------------------------------
>>>>> mpiexec noticed that process rank 0 with PID 4448 on node limulus
>>>>> exited on signal 6 (Aborted).
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>
>>
>