OS-2767

build failures: ECONNREFUSED from mget

Status:
Open
Created:
2014-01-09T15:39:44.000-0500
Updated:
2017-11-04T12:08:19.457-0400

Description

Lots of builds are failing because mget is dying on ECONNREFUSED. Example from https://jenkins.joyent.us/job/redis/176/console:

../configure:241: mget -v -q /Joyent_Dev/stor/builds/sapi/master-latest
./configure:241: local latest_dir=
./configure:243: grep ResourceNotFoundError /tmp/mget-output.72185
./configure:248: [[ -z '' ]]
./configure:249: cat /tmp/mget-output.72185
{"name":"mget","hostname":"grell","pid":84125,"component":"MantaClient","path":"/Joyent_Dev/stor/builds/sapi/master-latest","req_id":"2e3f221e-59ff-46fb-b41c-de07ac6f6b0b","level":20,"headers":{"accept":"*/*","date":"Thu, 09 Jan 2014 19:54:03 GMT","x-request-id":"2e3f221e-59ff-46fb-b41c-de07ac6f6b0b"},"id":"2e3f221e-59ff-46fb-b41c-de07ac6f6b0b","query":{},"msg":"get: entered","time":"2014-01-09T19:54:03.850Z","src":{"file":"/root/data/jenkins/workspace/redis/MG/node_modules/manta/lib/client.js","line":619,"func":"get"},"v":0}
{"name":"mget","hostname":"grell","pid":84125,"component":"MantaClient","level":10,"client_req":{"method":"GET","url":"/Joyent_Dev/stor/builds/sapi/master-latest","address":"us-east.manta.joyent.com","headers":{"accept":"*/*","date":"Thu, 09 Jan 2014 19:54:03 GMT","x-request-id":"2e3f221e-59ff-46fb-b41c-de07ac6f6b0b","authorization":"Signature keyId=\"/Joyent_Dev/keys/de:e7:73:9a:b0:91:31:3e:72:8d:9f:62:ca:58:a2:ec\",algorithm=\"rsa-sha1\",signature=\"vEJdaZE2Ra+ncjD9uCGSvvmmW58BgF6GZTKS3RSrll08zvsAUJ8FtWebHbJ1f1q2SiyXmitCO9GA/uxzwyLAIH66Zd3UHSX4fs5LBsLEPZcqPqznHhwb9Lx6eYNKq5nggKLAYmtTsAO9MiwU6aZeihiJA8+AQoNmoxd8IFVbAeMcztrZMUnNGzJM3K2y0HZAgJ56hXS+TB8PTUCuFjNvY/m+NlPoGGGQOadyRNLY+XSj9IMYb7fR4ct1kzq2UvnMfEG/6sSPLM/56EfBqIktYtMEAe5nPDJ5Bko/dCs2yEafl8aRcqfsOECI/vDP1b4ueb6excFLTIYVNnp5CvjfJw==\"","user-agent":"restify/2.6.1 (ia32-solaris; v8/3.11.10.14; OpenSSL/1.0.0f) node/0.8.2","accept-version":"~1.0"}},"msg":"request sent","time":"2014-01-09T19:54:03.913Z","src":{"file":"/root/data/jenkins/workspace/redis/MG/node_modules/manta/node_modules/restify/lib/clients/http_client.js","line":194,"func":"rawRequest"},"v":0}
{"name":"mget","hostname":"grell","pid":84125,"component":"MantaClient","level":10,"err":{"message":"connect ECONNREFUSED","name":"Error","stack":"Error: connect ECONNREFUSED\n    at errnoException (net.js:776:11)\n    at Object.afterConnect [as oncomplete] (net.js:767:19)","code":"ECONNREFUSED"},"msg":"Request failed","time":"2014-01-09T19:54:03.918Z","src":{"file":"/root/data/jenkins/workspace/redis/MG/node_modules/manta/node_modules/restify/lib/clients/http_client.js","line":145,"func":"onError"},"v":0}
{"name":"mget","hostname":"grell","pid":84125,"component":"MantaClient","path":"/Joyent_Dev/stor/builds/sapi/master-latest","req_id":"2e3f221e-59ff-46fb-b41c-de07ac6f6b0b","level":20,"err":{"message":"connect ECONNREFUSED","name":"Error","stack":"Error: connect ECONNREFUSED\n    at errnoException (net.js:776:11)\n    at Object.afterConnect [as oncomplete] (net.js:767:19)","code":"ECONNREFUSED"},"msg":"get: error","time":"2014-01-09T19:54:03.919Z","src":{"file":"/root/data/jenkins/workspace/redis/MG/node_modules/manta/lib/client.js","line":582},"v":0}
mget: Error: connect ECONNREFUSED
./configure:250: fatal 'Could not get latest dir for target sapi'
../configure:49: basename ./configure
./configure:49: echo 'configure: error: Could not get latest dir for target sapi'
configure: error: Could not get latest dir for target sapi
./configure:50: exit 1
./configure:1: errexit 1 1
./configure:54: [[ 1 -ne 0 ]]
./configure:55: fatal 'error exit status 1 at line 1'
../configure:49: basename ./configure
./configure:49: echo 'configure: error: error exit status 1 at line 1'
configure: error: error exit status 1 at line 1
./configure:50: exit 1

Comments (6)

Former user commented on 2014-02-10T14:35:38.000-0500:

Running this repeatedly I was unable to made it happen ad-hoc. However, I can trigger this with:

for i in {1..2000} ; do echo /dev/null ; done | xargs -I {} --max-procs=200 ./bin/mget -q -o {} /Joyent_Dev/stor/builds/sapi/master-latest

But there are two cases where the ECONNREFUSED happens - one is the stack above, which is actually to Manta, the second is to the SSH agent, which happens about 10X as often as far as I can tell. I can't really fix the latter besides adding retries, but we definitely know it throttles and shuts down under any load; for now I'm ignoring and focusing on when it happens to Manta.

Former user commented on 2014-02-10T18:46:39.000-0500 (edited 2014-02-11T12:37:53.000-0500):

Ok. Digging in on this guy, I have three snoops from when this happened, and client side, they all look exactly the same:

1) Client does a request
2) Server responds with Connection: close
3) Server FINs
4) Client FIN/ACKs
5) Server ACKs
6) Client SYN's on the same client port as (4), milliseconds later
7) Server RST/ACK's

I'm pretty sure this is because the connection (server side) is still in TIME_WAIT. At first I thought this would be easy to workaround by disabling SO_REUSEADDR on the client (which node always sets), but even after turning it off I still see this behavior.

I am currently trying to get some D to fire from the LB when this happens, so we can see the stack the server (stud) was in when the RST went out:

dtrace -n 'tcp:::send /args[2]->ip_daddr == "165.225.142.214" && args[4]->tcp_flags & TH_RST/ { trace(args[4]->tcp_dport); stack(); }'

I'm pretty sure this isn't a systemic server-side issue now (phwew), but that this is something related to TCP tuning, either in the OS, node or app, client side.

Former user commented on 2014-02-10T18:51:43.000-0500:

And confirmed that server side the socket was indeed in TIME_WAIT:

12  12430               tcp_send_data:send  52332
              ip`tcp_xmit_ctl+0x28a
              ip`tcp_time_wait_processing+0x300
              ip`tcp_input_data+0x114d
              ip`squeue_enter+0x426
              ip`ip_fanout_v4+0xc7c
              ip`ip_input_local_v4+0x16e
              ip`ire_recv_local_v4+0x132
              ip`ill_input_short_v4+0x4d6
              ip`ip_input_common_v4+0x372
              ip`ip_input+0x2b
              dls`i_dls_link_rx+0x1cd
              mac`mac_rx_deliver+0x37
              mac`mac_rx_soft_ring_process+0x19a
              mac`mac_rx_srs_fanout+0x3b2
              mac`mac_rx_srs_drain+0x256
              mac`mac_rx_srs_process+0x3ce
              mac`mac_rx_classify+0x129
              mac`mac_rx_flow+0x63
              mac`mac_rx_common+0x196
              mac`mac_rx+0xb6

Former user commented on 2014-02-11T12:37:37.000-0500:

Having gone through several snoops, and with all D traces confirming the same stack, it's pretty clear that this is only fixable in the OS, client-side. It's pretty clear from these traces that the client should not be able to reuse a port until it's out of the default TIME_WAIT period, even though it's not in TIME_WAIT client side (but clearly we can assume it to be on the server).

As I have no idea the scope of what this looks like to fix, I will open a small cheezy bug to work-around this in node-manta with a retry.

Former user commented on 2014-02-12T05:19:07.000-0500:

I don't think the client needs to enter TIME_WAIT, unless it's a simultaneous close. Normally, just one end point needs to hold that state: the end point that initiated the close by sending the FIN.

So it's unfortunate that the client would reuse the port so quickly, however, the server needs to deal with that. And the server will be in TIME_WAIT. If conditions for RFC1122 are met, the server can recycle the connection, and the illumos kernel does do this. But it doesn't always work, such as when the sequence number delta becomes negative, which can happen when the 32-bit sequence number cycles, which happens more quickly than you'd think thanks to tcp_iss_incr - a constant that the kernel tacks onto new connections (by default, it's 250,000). For background, see SYSOPS-3479 and http://permalink.gmane.org/gmane.os.illumos.devel/2166, and tcp_time_wait_processing(). If the server RST is related to this, it may be possible to tune it away using tcp_iss_incr and tcp_strong_iss, to reduce sequence number churn.

Perhaps this is exasperated if there are many open connections between the server and the client, such that the client is more likely than one would think to reuse a port so quickly.

Former user commented on 2014-02-12T13:42:38.000-0500:

In this particular case (at least the snoops I have), the sequence number is indeed still greater than the previous one, so it doesn't sound like those tunables would help.

And it probably is related that this happens when I'm "running it hot" – lots of client connections being spawned/closed.