
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] How to debug segv
From: Alex Margolin (alex.margolin_at_[hidden])
Date: 2012-04-25 13:59:57


I guess you are right.

I started looking into the messages passed between the processes, and I
may have found a problem with the way I handle the "reserved" data
requested in prepare_src(). I tried to write pretty much the same thing
as the TCP BTL (the relevant code is around
"if(opal_convertor_need_buffers(convertor))"): when I copy the buffered
data to (frag+1), the program works, but when I try to optimize by
letting the segment point at the original location, I get
MPI_ERR_TRUNCATE. I've printed out the data sent and received while
running osu_latency ("[]" for sent, "<>" for received); the output is
appended below.
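
For context, the buffered path I copied looks roughly like this (a
minimal sketch modeled on the TCP BTL's prepare_src() against the
1.x-era BTL interface; "my_frag_t" and my_frag_alloc() are placeholders
for the mosix BTL's own frag type and allocation, and exact field names
may differ):

    /* Buffered branch of prepare_src() -- sketch only. */
    struct iovec iov;
    uint32_t iov_count = 1;
    size_t max_data = *size;
    my_frag_t *frag = my_frag_alloc(reserve + max_data);   /* placeholder */

    if (opal_convertor_need_buffers(convertor)) {
        /* Pack the user data just past the reserved region; (frag + 1) is
         * the payload area right after the frag bookkeeping structure. */
        iov.iov_base = (char *)(frag + 1) + reserve;
        iov.iov_len  = max_data;
        opal_convertor_pack(convertor, &iov, &iov_count, &max_data);

        /* A single segment covers the reserved header (filled in later by
         * the upper layer) followed by the packed data. */
        frag->segment.seg_addr.pval = frag + 1;
        frag->segment.seg_len       = reserve + max_data;
    }
    *size = max_data;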

My question is: where is the code that is responsible for writing the
reserved data?

Thanks,
Alex

Case 1 - always assume opal_convertor_need_buffers() is true: works
(97 is the application data, preceded by 14 reserved bytes):

...
[65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
<65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
<65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
...

Case 2 - detect when opal_convertor_need_buffers() is false and point at
the original buffer: fails (a TCP-style zero-copy sketch follows the
output below):

...
[65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
[65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
<65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
[65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
1 453.26
[65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
<65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
<65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
[singularity:13509] *** An error occurred in MPI_Barrier
[singularity:13509] *** reported by process [2239889409,140733193388033]
[singularity:13509] *** on communicator MPI_COMM_WORLD
[singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
[singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this
communicator will now abort,
[singularity:13509] *** and potentially your MPI job)
[singularity:13507] 1 more process has sent help message
help-mpi-errors.txt / mpi_errors_are_fatal
[singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to
see all help / error messages
alex_at_singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$
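
For comparison, the zero-copy path in the TCP BTL keeps the reserved
bytes inside the frag and adds a second segment that points at the user
buffer. Roughly (again a sketch only; the segments[]/des_src names
follow the 1.x-era descriptor layout, and the frag is assumed to come
from a placeholder helper like the one above):

    /* Zero-copy branch of prepare_src() -- sketch only. */
    if (!opal_convertor_need_buffers(convertor)) {
        struct iovec iov;
        uint32_t iov_count = 1;
        size_t max_data = *size;

        /* With a NULL iov_base the convertor hands back a pointer into
         * the caller's buffer instead of copying. */
        iov.iov_base = NULL;
        iov.iov_len  = max_data;
        opal_convertor_pack(convertor, &iov, &iov_count, &max_data);

        /* Segment 0 is the reserved header area, still inside the frag,
         * so the upper layer has somewhere to write its header; segment 1
         * is the user data in place.  Both segments have to be counted
         * when the frag goes on the wire. */
        frag->segments[0].seg_addr.pval = frag + 1;
        frag->segments[0].seg_len       = reserve;
        frag->segments[1].seg_addr.pval = iov.iov_base;
        frag->segments[1].seg_len       = max_data;
        frag->base.des_src              = frag->segments;
        frag->base.des_src_cnt          = 2;

        *size = max_data;
    }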

On 04/25/2012 04:35 PM, George Bosilca wrote:
> Alex,
>
> You got the banner of the FT benchmark, so I guess at least the rank 0 successfully completed the MPI_Init call. This is a hint that you should investigate more into the point-to-point logic of your mosix BTL.
>
> george.
>
> On Apr 25, 2012, at 09:30 , Alex Margolin wrote:
>
>> NAS Parallel Benchmarks 3.3 -- FT Benchmark
>>
>> No input file inputft.data. Using compiled defaults
>> Size : 64x 64x 64
>> Iterations : 6
>> Number of processes : 4
>> Processor array : 1x 4
>> Layout type : 1D
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel