Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] ompi-ps broken or just changed?
From: Ashley Pittman (ashley_at_[hidden])
Date: 2009-05-18 12:06:28


It's certainly helped, and ompi-ps now runs for me. However, if I run
mpirun under Valgrind and then ompi-ps in another window, Valgrind
reports errors and ompi-ps doesn't list the job, so there is clearly
something still amiss. I'm trying to do some more diagnostics now.

==32362== Syscall param writev(vector[...]) points to uninitialised byte(s)
==32362==    at 0x41BF10C: writev (writev.c:46)
==32362==    by 0x4EAAD52: mca_oob_tcp_msg_send_handler (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
==32362==    by 0x4EAC505: mca_oob_tcp_peer_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
==32362==    by 0x4EAEF89: mca_oob_tcp_send_nb (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
==32362==    by 0x4EA20BE: orte_rml_oob_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
==32362==    by 0x4EA2359: orte_rml_oob_send_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
==32362==    by 0x4050738: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x804B15F: orterun (orterun.c:757)
==32362==  Address 0x448507c is 20 bytes inside a block of size 512 alloc'd
==32362==    at 0x402613C: realloc (vg_replace_malloc.c:429)
==32362==    by 0x42556B7: opal_dss_buffer_extend (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4256C4F: opal_dss_pack_int32 (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x403A60D: orte_dt_pack_job (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4256FFB: opal_dss_pack (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x40506F7: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
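
For readers unfamiliar with this class of report: the trace says the
bytes handed to writev came from a realloc'd pack buffer and were never
given a defined value. One way that can happen (an assumption, not a
diagnosis of this trace) is that a struct field is packed on a code path
where nothing assigned it, as with num_nodes in the quoted thread below.
The sketch below is a minimal, hypothetical illustration in C. The names
pack_buf_t, job_info_t, job_info_init, buf_extend, pack_int32 and
pack_job are invented for the example and are not the opal_dss or ORTE
API. It shows the general remedy of initialising every field before any
path can pack it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical growable pack buffer; a stand-in, not the real opal_dss API. */
typedef struct {
    uint8_t *base;   /* heap storage, grown with realloc */
    size_t   used;   /* bytes packed so far */
    size_t   cap;    /* bytes currently allocated */
} pack_buf_t;

/* Hypothetical job record carrying the kind of field discussed in the thread. */
typedef struct {
    int32_t jobid;
    int32_t num_nodes;
} job_info_t;

/* Give every field a defined value up front, so no later branch can
 * leave one unset before it is packed. */
void job_info_init(job_info_t *job, int32_t jobid)
{
    job->jobid = jobid;
    job->num_nodes = 0;
}

/* Ensure `need` more bytes fit. realloc does not define the contents of
 * the newly added bytes, which is why unset data shows up as
 * "uninitialised" only once it reaches a syscall such as writev. */
int buf_extend(pack_buf_t *b, size_t need)
{
    size_t cap = b->cap ? b->cap : 512;
    uint8_t *p;

    if (b->used + need <= b->cap) {
        return 0;
    }
    while (cap < b->used + need) {
        cap *= 2;
    }
    p = realloc(b->base, cap);
    if (p == NULL) {
        return -1;
    }
    b->base = p;
    b->cap = cap;
    return 0;
}

/* Append one int32 in big-endian byte order. */
int pack_int32(pack_buf_t *b, int32_t value)
{
    uint32_t v = (uint32_t)value;

    if (buf_extend(b, 4) != 0) {
        return -1;
    }
    b->base[b->used++] = (uint8_t)(v >> 24);
    b->base[b->used++] = (uint8_t)(v >> 16);
    b->base[b->used++] = (uint8_t)(v >> 8);
    b->base[b->used++] = (uint8_t)v;
    return 0;
}

/* Pack every field of the job record in a fixed order. */
int pack_job(pack_buf_t *b, const job_info_t *job)
{
    assert(b != NULL && job != NULL);
    if (pack_int32(b, job->jobid) != 0) {
        return -1;
    }
    return pack_int32(b, job->num_nodes);
}
```

If job_info_init were skipped, pack_job would copy an indeterminate
num_nodes into the buffer, and Memcheck would typically stay silent
until those bytes were passed to a system call, which matches the shape
of the report above.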

On Mon, 2009-05-18 at 08:22 -0600, Ralph Castain wrote:
> Aha! Thanks for spotting the problem - I had to move that var init to
> cover all cases, but it should be working now with r21249
>
>
>
> On May 18, 2009, at 8:08 AM, Ashley Pittman wrote:
>
> >
> > Ralph,
> >
> > This patch fixed it: num_nodes was being used uninitialised and hence
> > the client was getting a bogus value for the number of nodes.
> >
> > Ashley.
> >
> > On Mon, 2009-05-18 at 10:09 +0100, Ashley Pittman wrote:
> >> No joy I'm afraid, now I get errors when I run it. This is a
> >> single-node job run with the command line "mpirun -n 3 ./a.out".
> >> I've attached the strace output and gzipped /tmp files from the
> >> machine. Valgrind on the ompi-ps process doesn't show anything
> >> interesting.
> >>
> >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/util/comm/comm.c at line 242
> >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/tools/orte-ps/orte-ps.c at line 818
> >>
> >> Ashley.
> >>
> >> On Sat, 2009-05-16 at 08:15 -0600, Ralph Castain wrote:
> >>> This is fixed now, Ashley - sorry for the problem.
> >>>
> >>>
> >>> On May 15, 2009, at 4:47 AM, Ashley Pittman wrote:
> >>>
> >>>> On Thu, 2009-05-14 at 22:49 -0600, Ralph Castain wrote:
> >>>>> It is definitely broken at the moment, Ashley. I have it pretty
> >>>>> well fixed, but need/want to clean up some corner cases that
> >>>>> have plagued us for a long time.
> >>>>>
> >>>>> Should have it for you sometime Friday.
> >>>>
> >>>> Ok, thanks. I might try switching to slurm in the meantime, I
> >>>> know my code works with that.
> >>>>
> >>>> Can you let me know when it's fixed, on or off list, and I'll do
> >>>> an update.
> >>>>
> >>>> Ashley,
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> devel_at_[hidden]
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> > <ompi-ps.patch>