On Dec 10, 2009, at 5:06 PM, Brock Palen wrote:
> I would like to try out the notifier framework. The problem is that I am having trouble finding documentation for it; I am digging around the website and not finding much.
> Currently we have a problem where hosts are throwing up errors like:
> 631:mca_btl_tcp_endpoint_complete_connect] connect() failed:
> Connection timed out (110)
Yoinks. Any idea why this is happening?
> When this happens, we would like to be notified so we can put time
> stamps on events going on on the network. Is this even possible with
> the framework? We don't see any interfaces coming up and down,
> or any errors on interfaces, so we are looking to isolate the problem
> more. Only the MPI library knows when this happens.
It's not well documented. So let's start here...
The first issue is that we currently only have notifier calls down in the openib BTL -- not any of the others. :-( We put them there because there were specific requests to be notified when IB links went down. We then used those as a request for comment from the community, asking "do you like this? do you want more?" We kinda got nothing back, and I'll admit that we kinda forgot about it -- and therefore never added notifier calls elsewhere in the code. :-\
We designed the notifier in OMPI to be trivially easy to use throughout the code base -- it's just adding a single function call where the error occurs. Would you, perchance, be interested in adding any of these in the TCP BTL? I'd be happy to point you in the right direction... :-)
After that, it's just a matter of enabling a notifier:
mpirun --mca notifier syslog ...
Each notifier has some MCA params that are fairly obvious -- use:
ompi_info --param notifier all
to see them. There are three notifier plugins:
- command: execute any arbitrary command. It must run in finite (short) time. You use MCA params to set the command (we can pass some strings down to the command; see the ompi_info help string for more details), and set a timeout such that if the command runs for that many seconds without exiting, we'll kill it.
- syslog: because it was simple to do -- we just output a string to the syslog.
- twitter: because it was fun to do. ;-) Actually, the rationale was that you can tweet to a private feed and then slave an RSS reader to it to see if anything happens. It will need to be able to reach the general internet (i.e., twitter.com); proxies are not supported. Set your twitter username/password via MCA params.
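
Since the goal is to put time stamps on events, here is a minimal sketch of a script that the command notifier could run. It is only an illustration under assumptions: the script, the log_event function, the NOTIFIER_LOG variable, and the default log path are all hypothetical, and it assumes the notifier's message reaches the command as command-line arguments. Check the ompi_info help string for how the strings are actually passed and for the real MCA param names.

```shell
#!/bin/sh
# Hypothetical handler for the "command" notifier: append one
# time-stamped line per event to a log file. It does a single append
# and exits, so it stays well inside any notifier timeout.

log_event() {
    # $1 is the log file; all remaining arguments form the message.
    logfile=$1
    shift
    # UTC time stamp, node name (uname -n), then the message text.
    printf '%s %s %s\n' \
        "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
        "$(uname -n)" \
        "$*" >> "$logfile"
}

# ASSUMPTION: the notifier hands the message over as arguments.
# NOTIFIER_LOG is a made-up environment variable for this sketch.
if [ "$#" -gt 0 ]; then
    log_event "${NOTIFIER_LOG:-/tmp/ompi_notifier_events.log}" "$@"
fi
```

Running it by hand with a made-up message (e.g., ./log_event.sh "connect() failed") should append one line to the log, which you can then line up against your network monitoring.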