Saturday, September 06, 2008

Solution for slow localhost lookups with getaddrinfo on OS X

I noticed that sometimes, running \ foo edit would take several seconds to execute; sometimes as long as a minute. On my laptop, the editors.jedit vocabulary is loaded, meaning the edit hook connects to the running jEdit instance over a socket on localhost, transmitting the file name and line number which defines foo.

After fiddling around for a while with watch and making it print messages when a word was called and returned, I noticed my observing the messages that it was blocking in name resolution. And indeed, resolving localhost seemed to take a while:
( scratchpad ) [ "localhost" 1234  resolve-host drop ] time
==== RUNNING TIME

194 ms

One would expect it to return instantly, but obviously, there's a DNS query happening here, which should not be the case since localhost is defined in /etc/hosts.

Factor uses getaddrinfo() for name resolution. I tried the older, deprecated gethostbyname() function:
( scratchpad ) FUNCTION: void* gethostbyname ( char* name ) ;
( scratchpad ) [ "localhost" gethostbyname drop ] time
==== RUNNING TIME

0 ms

No performance problem there.

The next step was to look at the tcpdump output:
05:50:05.340667 IP 192.168.1.105.62560 > cns.westlandrdc.mi.michigan.comcast.net.domain: 27262+ SRV? _http._tcp.localhost. (38)
05:50:05.374190 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62560: 27262 NXDomain 0/1/0 (113)
05:50:05.374492 IP 192.168.1.105.62561 > cns.westlandrdc.mi.michigan.comcast.net.domain: 518+[|domain]
05:50:05.424012 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62561: 518 NXDomain 0/1/0 (138)
05:50:05.424568 IP 192.168.1.105.62562 > cns.westlandrdc.mi.michigan.comcast.net.domain: 14115+ SRV? _http._tcp.localhost.mn.comcast.net. (53)
05:50:05.474268 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62562: 14115 NXDomain 0/1/0 (105)
05:50:05.474767 IP 192.168.1.105.62563 > cns.westlandrdc.mi.michigan.comcast.net.domain: 56054+ SRV? _http._tcp.localhost.comcast.net. (50)
05:50:05.530655 IP cns.westlandrdc.mi.michigan.comcast.net.domain > 192.168.1.105.62563: 56054 NXDomain 0/1/0 (102)

As you can see, Factor is making some SRV DNS queries here. This made me think of the servname parameter to getaddrinfo(). We always set this to "http", since I found that on FreeBSD, setting it to a string representation of an integer does not work. However, on Mac OS X 10.5, it appears that DNS queries are always made if this field is set, even if the host name is listed in /etc/hosts. Changing the resolver code to pass a NULL pointer as servname seems to fix the problem, and I verified it to work on Windows, FreeBSD, OpenBSD, NetBSD and Linux.

A minor refactoring later, and its as fast as it should be, with no network traffic generated:
( scratchpad ) [ "localhost" 1234  resolve-host drop ] time
==== RUNNING TIME

0 ms

So perhaps this is completely obvious to some people, but I never realized that setting servname to a non-NULL value would cause DNS queries to be performed when resolving localhost. Certainly this isn't documented anywhere, at least not that I can see.

In conclusion, if you are developing a program which communicates with another service running on the same machine via TCP/IP or UDP/IP, and you notice that getaddrinfo() blocks for a long time, try changing the code to pass NULL for servname.

No comments: