On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:
> I have tried with these flags (I use gcc 4.7 and Open MPI 1.6), but
> the program doesn't crash; a node goes down and the rest of them keep
> waiting for a signal (there is an ALLREDUCE in the code).
> Anyway, yesterday some processes died (without a log) on the node 10,
I suggest that you start adding your own monitoring. *Something* is happening, but apparently it's not being captured in any of the logs that you see. For example:
- run your program through valgrind, or other memory-checking debugger
- ask your admin to increase the syslog levels to get more information
- ensure that syslog output is going to both the local disk and a remote server (in case your machines are getting re-imaged and local disk syslogs get wiped out upon reboot)
- look at dmesg output immediately upon reboot
- look at /var/log/syslog output immediately upon reboot
- when your job launches, continually capture some Linux statistics (e.g., every N seconds -- pick N to meet your needs), such as:
  - top -b -n 9999999 -d N (use the same N value as above)
  - numastat -H
  - cat /proc/meminfo
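
As a rough illustration of the periodic capture described above, here is a minimal Python sketch. It is only one way to do it, not a tested recipe for your cluster. The function names (snapshot, monitor), the log layout ("===" and "---" separator lines), and the choice to read the per-NUMA-node files under /sys/devices/system/node/ directly are all my own assumptions for the example. Each sample is flushed and fsync'ed so that the last few samples have a better chance of surviving if the node goes down.

```python
import glob
import os
import time
from pathlib import Path


def snapshot(paths):
    """Return the contents of every readable file in `paths`, each under a '--- path' header."""
    parts = []
    for p in paths:
        try:
            parts.append(f"--- {p}\n{Path(p).read_text()}")
        except OSError:
            # File missing or unreadable on this machine: skip it.
            pass
    return "".join(parts)


def monitor(logfile, interval_s, samples=None):
    """Append a timestamped memory snapshot to `logfile` every `interval_s` seconds.

    Runs until killed when `samples` is None, otherwise stops after `samples` snapshots.
    """
    # System-wide memory counters plus one meminfo file per NUMA node (if present).
    paths = ["/proc/meminfo"] + sorted(glob.glob("/sys/devices/system/node/node*/meminfo"))
    taken = 0
    with open(logfile, "a") as log:
        while samples is None or taken < samples:
            log.write(f"=== {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
            log.write(snapshot(paths))
            # Push the sample to disk right away; the node may die at any moment.
            log.flush()
            os.fsync(log.fileno())
            taken += 1
            time.sleep(interval_s)
```

You would start something like monitor("meminfo-node10.log", 10) on each node when the job launches (the file name here is just a placeholder), ideally writing to a filesystem that is not wiped when the node reboots.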
When a crash occurs, look at these logs you've made and see if you can find any trends, such as running out of memory on a particular NUMA node (or overall), or a process whose size is growing arbitrarily large.
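
To make the trend-spotting a bit more concrete, here is a small Python sketch that assumes you have a text file containing repeated dumps of /proc/meminfo (for example, the output of cat /proc/meminfo appended every N seconds). The helper names field_series and summarize are hypothetical, made up for this example. It pulls out one system-wide field per sample and reports how it changed over the run; per-node lines of the form "Node 0 MemFree: ..." are deliberately not matched.

```python
import re

# Matches system-wide /proc/meminfo lines such as "MemFree:  123456 kB".
MEMINFO_LINE = re.compile(r"^(\w+):\s+(\d+) kB$", re.MULTILINE)


def field_series(log_text, field="MemFree"):
    """Return the values (in kB) of `field`, one per snapshot, in log order."""
    return [int(kb) for name, kb in MEMINFO_LINE.findall(log_text) if name == field]


def summarize(values):
    """Return first/last/min and the net change of a series, or None if it is empty."""
    if not values:
        return None
    return {
        "first": values[0],
        "last": values[-1],
        "min": min(values),
        "change": values[-1] - values[0],
    }
```

A steadily negative "change" for MemFree (or MemAvailable, on kernels that report it) across the samples leading up to the crash would point toward memory exhaustion rather than, say, a hardware fault.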
Also look for hardware errors. Perhaps you have some bad RAM somewhere. Is it always the same node that crashes? And so on.
> I logged almost immediately in the node and I found the process
> /usr/sbin/hal_lpadmin -x /org/freedesktop/Hal/devices/pci_10de_267
> What is it? I know that hal is a device daemon, but hal_lpadmin?
It has to do with managing printers.
> PS: What is the correct method to reply in this mailing list? I use
> gmail and I usually hit the reply button and replace the subject, but here
> it seems that I am opening a new thread each time I post.
You seem to be replying to the daily digest mail rather than the individual mails in this thread. That's why it creates a new thread in the web mail archives. If you replied to the individual mails, they would thread properly on the web mail archives.