600k concurrent websocket connections on AWS using Node.js

I recently faced the challenge of getting as much power as possible out of an AWS EC2 instance at the lowest possible cost, using concurrent persistent websockets.

To do this I needed an event-driven, non-blocking runtime environment. For this particular purpose Node.js is excellent, with its lightweight and fast Chrome V8 engine.

Technical decisions

Socket.io

I started out using Socket.io for Node.js, which worked nicely as a start, but since we are trying to get as much as possible out of the EC2 instance we needed something a little more lightweight. I also noticed that since Socket.io v1.0 the cluster module doesn’t work, which removes the possibility of using this library in an environment with high load. Therefore I moved on to another websocket library, websockets/ws.

Websockets/ws

Works well and is lightweight. This is probably the fastest websocket library for Node.js. The library has no built-in keepalive functionality, so you have to implement that yourself via the ping/pong methods available in the lib. Make sure that your AWS load balancer’s idle timeout is not set lower than your keepalive interval; if it is, the load balancer will drop your connections.
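
A minimal sketch of such a keepalive, using the ping/pong methods in ws (the port and the 30-second interval are assumptions, not values from the original setup):

    var WebSocketServer = require('ws').Server;
    var wss = new WebSocketServer({ port: 8080 });

    wss.on('connection', function (ws) {
      ws.isAlive = true;
      // A pong from the client marks the connection as live again.
      ws.on('pong', function () { this.isAlive = true; });
    });

    // Ping every 30 seconds -- keep this below the load balancer's idle timeout.
    setInterval(function () {
      wss.clients.forEach(function (ws) {
        if (ws.isAlive === false) return ws.terminate();
        ws.isAlive = false;
        ws.ping();
      });
    }, 30000);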

Sticky-session

Use the sticky-session Node.js module, which enables you to run the application on all CPUs. You have to do this in order to reach a high number of connections on one server: a single CPU can only handle a certain number of connections before the V8 GC goes wild and the CPU stalls at 100%.
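
A minimal sketch following the sticky-session README (the exact API depends on the module version; the port is illustrative):

    var cluster = require('cluster');
    var http = require('http');
    var sticky = require('sticky-session');
    var WebSocketServer = require('ws').Server;

    var server = http.createServer();

    if (!sticky.listen(server, 3000)) {
      // Master process: sticky-session forks one worker per CPU and
      // routes each remote address consistently to the same worker.
      server.once('listening', function () {
        console.log('server started on port 3000');
      });
    } else {
      // Worker process: attach the websocket server here.
      var wss = new WebSocketServer({ server: server });
      wss.on('connection', function (ws) {
        // handle the connection
      });
    }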

M3.xlarge

After a lot of testing, generating users to create persistent websocket connections to the server and calculating the numbers up and down, I finally decided to use an M3.xlarge EC2 instance to reach 620k idle connections. This gives us 4 CPUs and 15 GB of memory.

At this level of live persistent connections the CPU load is constantly at 100% on all CPUs on the server. The reason behind the high CPU load is V8’s (the Node.js engine’s) garbage collection, and this is after optimizing the GC. To have a stable runtime environment I suggest that you cap the connections at 600k, before the CPU load starts to go crazy high; when you reach this number of connections it is definitely time to scale up another instance.

It is possible to reach a higher number of connections on a larger and more expensive EC2 instance that provides more CPU cores and more memory. When experimenting with this I reached 800k idle connections on an M3.2xlarge instance, which gives you 8 CPUs and 30 GB of memory. But when you get over 600k connections, other factors come in to limit the capacity, like money and the Linux network implementation.

These numbers are for idle websocket connections handling only keepalive pings from the server. I’m sure that if you have a high number of requests from the clients, the number of connections the EC2 instance can handle will decrease.


Configuration to reach 600k persistent connections

Node.js flags

Set the following flags when launching your Node.js application:

--nouse-idle-notification

Turns off the idle garbage collection, which otherwise makes the GC run constantly and is devastating for a realtime server environment. If it is not turned off, the system gets a long hiccup of almost a second once every few seconds.

--expose-gc

Use the expose-gc flag to enable manual control of the GC from your code. I recommend calling the GC once every 30 seconds.
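
A minimal sketch of such a manual GC call; global.gc is only available when the flag is set, and the 30-second interval follows the recommendation above:

    // Requires launching node with --expose-gc.
    if (typeof global.gc === 'function') {
      setInterval(function () {
        global.gc(); // trigger a collection on our schedule instead of V8's
      }, 30000);
    }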

--max-old-space-size=8192

Increases the heap memory limit for each V8 Node process to a maximum of 8 GB, instead of the 1.4 GB default on 64-bit machines (512 MB on 32-bit machines).

--max-new-space-size=2048

Specified in KB. Setting this flag optimizes V8 for a stable all-round environment with short GC pauses and decent peak performance.

If this flag is not used, the pauses will be a little longer but the machine will handle peaks a little better. What you need depends on the project you are working on. My pick is an all-round stable server instead of one that only handles peaks well, so I stick with this flag.
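
Putting the flags together, the launch command looks something like this (the script name is illustrative):

    node --nouse-idle-notification --expose-gc --max-old-space-size=8192 --max-new-space-size=2048 server.js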

EC2 configuration

Set the “soft” and “hard” nofile limits to 1000000. Instead of using “ulimit -n” as some people do, I had to specify the “soft” and “hard” limits for both root and all other users; for some reason they had to be specified separately.

/etc/security/limits.d/custom.conf
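
Based on the description above, the file would look something like this (a reconstruction, not the original file):

    root soft nofile 1000000
    root hard nofile 1000000
    *    soft nofile 1000000
    *    hard nofile 1000000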

Now set the maximum number of open file handles and the size of the NAT IP connection-tracking table. The individual keys are explained below, followed by an example file.

/etc/sysctl.conf

“fs.file-max”

The maximum number of file handles that can be allocated.

“fs.nr_open”

The maximum number of file handles a process can open.

“net.ipv4.netfilter.ip_conntrack_max”

Specifies how many connections the NAT can keep track of in the “tracking” table before it starts to drop packets and just break connections, which we absolutely want to avoid. The default value is 65536, so without this setting you won’t be able to get more connections than that.
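
Put together, /etc/sysctl.conf would contain something like the following. The exact values are assumptions; they must be at least as large as the number of connections you are targeting:

    fs.file-max = 1000000
    fs.nr_open = 1000000
    net.ipv4.netfilter.ip_conntrack_max = 1048576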


Comments

  1. This is great!
    We shared it on Twitter (but couldn’t find your Twitter username to mention you!)
    :D

  2. Cristian

    Just out of curiosity, what is the cost of the EC2 machines that you mentioned?

  3. wouter

    What did you use to generate 620k connections?
    And did you send a message over every connection every x seconds or was it just a silent connection?

    • Daniel Kleveros

      I created my own client with the same websocket lib, websockets/ws, which I then deployed on a cluster of M1.large instances.

      They were silent connections, but to keep the connections alive when you are behind an Elastic Load Balancer, you need to check the idle timeout setting on the load balancer and send a ping before that time expires. If not, the load balancer will start to drop your connections.

  4. Doug

    Seriously impressive Node.js tuning tips!

    I typically use NGINX as a reverse proxy with my web apps. In the situation you’ve set up here (massive WS concurrency), is it worth it to use a proxy like NGINX in front of the Node.js instances? Or would it be more trouble than it’s worth? (Obviously you’d get slightly less throughput.)

    • Daniel Kleveros

      I tried using NGINX as a reverse proxy, but Linux has a limit of 65536 open file handles per port for one IP address. When NGINX forwards all the websockets they will come from the same IP and port, which is a huge bottleneck. My recommendation is to not use NGINX as a reverse proxy if you have a high number of connections.

      • You can assign multiple IP addresses to a single EC2 instance; this brings the price up a tiny bit but should allow you to overcome the mentioned 65536 limit.

  5. sandeep

    This is awesome!

    Do you know if a similar thing is possible using Python Flask+Gevent? We are very keen to use Python for high performance APIs.

  6. I thought I’d take the time to test your performance claims; however, it appears one of your parameters is not available?

    STR: node --nouse-idle-notification --expose-gc --max-new-space-size=2048 --max-old-space-size=4096 src/node/server.js
    Error output: node: bad option: --max-new-space-size=2048

    Debug: Ubuntu, node 0.12.2

    Any idea?

    • Daniel Kleveros

      I’m using Node version v0.10.31, but you should be good with just removing the option --max-new-space-size=2048, since they probably tuned this in version 0.12.

  7. Garbage collection is causing the processor to max out? What is there to garbage collect in a websocket connection? I feel like there should be no heap allocations at all. I am going to test this same setup but with a Go app and see how it does.

  8. Troy

    Interesting read! Do you have a git repo of the source you were running?

  9. LG

    Amazing. Thanks!

  10. Gustavo

    Awesome post! Would be nice if you shared the code ;)

  11. Hi Daniel Kleveros,

    I would like to know what message rate (messages/connection/second) you got on each connection when you had 600K connections set up.

    We are working on a stock market project in which we provide horizontal and vertical scaling of websockets. We are also getting good results, but not as good as you have mentioned in this blog for the machine configurations you used.

    In our case we are getting message rates of 20-40 messages/second/connection while providing realtime stock market streaming to the users.

    gurjantsingh73@gmail.com is my email id, we can discuss here.

    -Gurjant

    • Daniel Kleveros

      I will make this more clear in the blog post, but the connections are idle, handling only ping/pong events from the server. If you have 20-40 messages/second on each connection, that will certainly decrease the number of connections the server can handle.

  12. Bunty

    Excellent!

    Can you let us know how many messages did you receive per second per connection?

    • Daniel Kleveros

      I will make this more clear in the blog post, but the connections are idle, handling only ping/pong events from the server.

  13. Gurjant

    Hi Daniel Kleveros,

    Yes, you can set up as many connections as you want just for the ping-pong, but realtime streaming will require good infrastructure. Node.js really does give excellent horizontal and vertical scaling. I am using it in a number of projects; in one project I am storing about 15-20 MB of data every 8-10 minutes from each mobile connection and targeting more than 100K users in the very first phase of the project.

  14. Hi,

    This is a very nice blog – Kudos !

    However, I was wondering.

    Since you can only have 65,535 ports per IP, and a normal EC2 instance can’t have more than 1 IP (at least not a public one),

    how did you manage to create 600K connections over a single public IP?

    (NAT can maybe allow you to use more IPs and ports, but eventually the outgoing IP and port will still limit you to 65535 connections.)

    Thanks,
    Refael.

    • Daniel Kleveros

      Hi Refael,

      Thanks!

      The rule of thumb here is “65536 connections per incoming IP address for one port”. This means that in theory you can have 65536 connections from each client.

      But the server limitation is instead bound to the number of open file handles, which is why we need to make the configuration changes on the server.
      It’s also the reason why NGINX doesn’t work as a reverse proxy in this case: all the clients connect to the server and are forwarded to the Node.js application via one IP and one port, which limits the EC2 instance to 65536 connections.

  15. susan

    Great post!!! Would you mind sharing the code you used, to show how you implemented WS (things such as the ping, etc.)?

    Regards

  16. Rohit Tailor

    Hi,

    We are using two Node.js servers in active-passive mode (when one server fails, traffic is routed to the other server) with a load balancer. The Linux server configuration is as follows:
    4 GB RAM / dual core (1 vCPU) Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz / 100 GB disk / fs.file-max = 6815744

    Using the ulimit command we increased the limit from 1024 to 2048. But during the app’s peak time we see many established connections on both servers; the servers then stop accepting further connections and stop responding on existing ones.

    After restarting the Node.js servers everything works again.

    We need your help to support a minimum of 10k simultaneous connections. Please suggest a configuration.

    Thanks,
    Rohit

  17. efkan

    Hi Daniel,

    This page (https://github.com/primus/primus#can-i-use-cluster) says about sticky-session:
    “it uses the remoteAddress of the connection. For some people this isn’t a problem but when you add this behind a load balancer the remote address will be set to the address of the load balancer that forwards the requests. So all in all it only causes more scalability problems instead of solving them. ”

    Could you tell me if you’ve encountered such a problem?

  18. Jonathan

    Great post. I have two questions.

    1. From your post it seems WebSocket servers running in AWS can be load balanced with ELB. Is that true?

    2. If I start a WebSocket server from a terminal and later close the terminal, how do I still keep the WebSocket server running?

    Thanks.
    Jonathan

  19. Rob

    If you’d like to exceed 600k connections, take a look at G-WAN.
    http://gwan.ch/benchmark/babel.html

  20. douwe

    Thank you very much for this post. One question: I don’t understand how you can simulate multiple connections. How does that work?

  21. Awesome results and a fantastic write up, thanks Daniel!

  22. darin hensley

    What about a cloud server that charges for outgoing bandwidth? Will websockets run up charges in comparison to ajax calls?

  23. Tommi

    Can I ask how you managed multiple instances running concurrently? Did you have to specify a port for each instance plus cluster code in the Node.js file?

    • Daniel Kleveros

      Do you mean for generating client connections or for running the server with multiple instances?

      • Matt Larson

        I would be curious about an answer to both of these: how you generated the clients, and how you configured multiple instances to talk to each other.

        • Daniel Kleveros

          To generate a high number of clients you can create a small snippet with ws that creates and connects clients in a loop; let’s say you limit the loop to 50k. Then you deploy this application on 12 EC2 instances to get 600k clients. See the sketch below.

          In this scenario, there is no need for the instances to talk to each other. But if you are deploying the server in a real scaling environment the question on how the server instances communicate with each other will be a longer explanation and totally depends on your requirements and needs.
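
          A minimal generator along those lines (the URL, batch size, and pacing are illustrative assumptions):

            var WebSocket = require('ws');

            var TARGET = 50000;                // connections per generator instance
            var URL = 'ws://your-server:8080'; // hypothetical endpoint
            var opened = 0;

            // Open connections in paced batches so the generator machine is not
            // overwhelmed; the connections are then left idle, and ws answers the
            // server's pings with pongs automatically.
            var timer = setInterval(function () {
              for (var i = 0; i < 500 && opened < TARGET; i++, opened++) {
                var ws = new WebSocket(URL);
                ws.on('error', function () { /* ignore failed connections */ });
              }
              if (opened >= TARGET) clearInterval(timer);
            }, 100);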

  24. We are using New Relic to monitor our whole app, and it appears we got some latency with Node.js (not related to any database, nor to anything else outside Node). So I started doing some Google searching and found this very nice article.

    We are using PM2 (from Keymetrics.io) to handle more requests and to let all CPUs be used on our m4.xlarge server. Thanks to the comments found here, it seems we will need to remove our NGINX reverse proxy.

    But how do you monitor and count those concurrent websocket connections?

  25. Pedro Diogo

    Thanks for sharing your findings – very valuable to me!

    How would you scale this horizontally? Would HAProxy working as a load balancer with sticky-session support suffice? I would like to scale my solution across multiple cloud providers…
    Thanks in advance.

    Cheers,
    Pedro

