
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] How to debug segv
From: George Bosilca (bosilca_at_[hidden])
Date: 2012-04-25 14:21:45


On Apr 25, 2012, at 13:59 , Alex Margolin wrote:

> I guess you are right.
>
> I started looking into the messages passing between the processes, and I may have found a problem with the way I handle the "reserved" data requested at prepare_src(). I tried to write pretty much the same code as the TCP BTL (the relevant part is around "if(opal_convertor_need_buffers(convertor))"): when I copy the buffered data to (frag+1) the program works, but when I try to optimize the code by letting the segment point to the original location, I get MPI_ERR_TRUNCATE. I've printed out the data sent and received ("[]" for sent, "<>" for received, running osu_latency); the output is appended below.
>
> Question is: Where is the code which is responsible for writing the reserved data?

It is the PML headers. Based on the error you reported, OMPI is complaining about truncated data on an MPI_Barrier … that's quite bad, as the barrier is one of the few operations that do not manipulate any data. My guess is that the PML headers are not located at the expected displacement in the fragment, so the PML is using wrong values.
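
To make the displacement issue concrete, here is a small standalone sketch (plain C, not Open MPI code: frag_t, prepare_src_copy(), prepare_src_inplace() and write_pml_header() are made-up stand-ins for the BTL fragment, the two prepare_src() strategies and the PML header write). It only models the layout constraint described above, not necessarily the exact failure in the dumps: the reserve bytes have to be real header space sitting directly in front of the payload in the segment handed back to the PML, otherwise the header ends up on top of (or away from) the data.

/* Standalone sketch, NOT Open MPI code: frag_t, prepare_src_copy(),
 * prepare_src_inplace() and write_pml_header() are hypothetical stand-ins
 * for the BTL fragment, the two prepare_src() strategies and the PML
 * header write.                                                          */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

enum { RESERVE = 14 };                  /* space requested by the PML     */

typedef struct {
    unsigned char *seg_addr;            /* segment the PML gets back      */
    size_t         seg_len;
    unsigned char  storage[128];        /* memory owned by the fragment   */
} frag_t;

/* Copy path (works): payload is packed right behind the reserved bytes,
 * so header space and data form one contiguous segment.                  */
static frag_t *prepare_src_copy(const void *user, size_t len)
{
    frag_t *frag = calloc(1, sizeof(*frag));
    memcpy(frag->storage + RESERVE, user, len);
    frag->seg_addr = frag->storage;
    frag->seg_len  = RESERVE + len;
    return frag;
}

/* "Optimized" path (fails): the segment points straight at the user
 * buffer, so the first RESERVE bytes of the segment are user data, not
 * header space, although seg_len claims otherwise.                       */
static frag_t *prepare_src_inplace(void *user, size_t len)
{
    frag_t *frag = calloc(1, sizeof(*frag));
    frag->seg_addr = user;
    frag->seg_len  = RESERVE + len;
    return frag;
}

/* The PML writes its header into the first RESERVE bytes of whatever
 * descriptor it was handed.                                               */
static void write_pml_header(frag_t *frag)
{
    memset(frag->seg_addr, 0x41, RESERVE);   /* 0x41 == 65, as in the dumps */
}

int main(void)
{
    unsigned char user[32];
    memset(user, 'a', sizeof(user));         /* 'a' == 97, the payload      */

    frag_t *good = prepare_src_copy(user, 4);
    frag_t *bad  = prepare_src_inplace(user, 4);

    write_pml_header(good);   /* payload at offset RESERVE stays intact     */
    write_pml_header(bad);    /* overwrites the user's data in place        */

    printf("copy path:     wire[0]=%d wire[14]=%d (header, then 97)\n",
           good->seg_addr[0], good->seg_addr[RESERVE]);
    printf("in-place path: user[0]=%d (payload clobbered by the header)\n",
           user[0]);
    free(good);
    free(bad);
    return 0;
}

(For the non-copy case the TCP BTL, if I recall correctly, keeps the reserve in a first segment at frag+1 and points a second segment at the user data, so the PML header still has its own space; check the TCP code to be sure.)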

  george.

>
> Thanks,
> Alex
>
>
> Always assuming opal_convertor_need_buffers() (i.e. copying into the fragment) - works (97 is the application data, preceded by the 14 reserved bytes):
>
> ...
> [65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,89,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,]
> <65,0,0,0,1,0,0,0,1,0,0,0,90,-112,97,97,97,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> <65,0,0,0,0,0,0,0,1,0,0,0,91,-112,97,97,97,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,91,-112,97,97,97,97,]
> ...
>
> Detecting when opal_convertor_need_buffers() is false (i.e. pointing the segment at the original data) - fails:
>
> ...
> [65,0,0,0,0,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-15,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-15,85,97,>
> [65,0,0,0,0,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,0,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,1,0,0,0,-14,85,]
> <65,0,0,0,1,0,0,0,1,0,0,0,-14,85,97,>
> [65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,]
> 1 453.26
> [65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,]
> <65,0,0,0,0,0,0,0,-16,-1,-1,-1,-13,85,97,>
> <65,0,0,0,1,0,0,0,-16,-1,-1,-1,-13,85,97,>
> [singularity:13509] *** An error occurred in MPI_Barrier
> [singularity:13509] *** reported by process [2239889409,140733193388033]
> [singularity:13509] *** on communicator MPI_COMM_WORLD
> [singularity:13509] *** MPI_ERR_TRUNCATE: message truncated
> [singularity:13509] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [singularity:13509] *** and potentially your MPI job)
> [singularity:13507] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
> [singularity:13507] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> alex_at_singularity:~/huji/benchmarks/mpi/osu-micro-benchmarks-3.5.2$
>
> On 04/25/2012 04:35 PM, George Bosilca wrote:
>> Alex,
>>
>> You got the banner of the FT benchmark, so I guess at least rank 0 successfully completed the MPI_Init call. This is a hint that you should look more closely into the point-to-point logic of your mosix BTL.
>>
>> george.
>>
>> On Apr 25, 2012, at 09:30 , Alex Margolin wrote:
>>
>>> NAS Parallel Benchmarks 3.3 -- FT Benchmark
>>>
>>> No input file inputft.data. Using compiled defaults
>>> Size : 64x 64x 64
>>> Iterations : 6
>>> Number of processes : 4
>>> Processor array : 1x 4
>>> Layout type : 1D
>>
>