future · January 17, 2022

This article explains how to analyze and design a system for one million concurrent connections

Plenty of articles online claim to tell you how to support millions of concurrent connections, but open them and there is nothing inside, effectively zero. How can that be tolerated? So how do you actually analyze and design a system to support one million concurrent (simultaneously online) users? Today, Brother Xiangzi will take you through the analysis and the design.

First, let's clarify a few concepts and issues:

  • The number of file descriptors a Linux system can open is limited. A file descriptor is abbreviated FD (File Descriptor), and one of Linux's core ideas is that everything is described by an FD, so every network connection actually corresponds to one FD opened on the server.
  • In Linux, FD limits exist at three levels: the system level, the user level, and the process level. Each of the three has a default value, and we can change the configuration. Since we are talking about a million concurrent connections, we can set them like this (see the shell sketch after this list):
System level    User level    Process level
1,000,000       500,000       100,000
  • How many FDs a system can support is related to memory size plus hard disk size. (Why the hard disk? Modern operating systems, both Windows and Linux, have a swap area: a region of the hard disk used as an extension of memory. This is also why an SSD beats a mechanical disk here, since SSD reads and writes are much faster.)
  • A socket is identified by its quadruple (source IP, source port, destination IP, destination port), and the quadruple uniquely identifies a connection. So the claim seen online that a server can open at most 65,535 connections is wrong. As a client you can bind multiple IPs, and each IP can make up to 65,535 connections (this also depends on the port range, port_range; the low, privileged ports are by convention not available to ordinary users). Assume here that every port except port 0 is available to us. Why 65,535? Because the port number is a 16-bit field at the operating-system level: 2^16 − 1 (excluding port 0) = 65,535. If a client binds 2 IPs, then toward one fixed server IP it can open (number of IPs) × 65,535 connections, as the sketch after this list spells out. How many connections the server can open depends on how many FDs its process is configured to use.
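A shell sketch of the limits in the table above (the values are this article's working numbers, and the user name appuser is a placeholder):

```
# System-level ceiling on open files (applies to the whole machine):
sysctl -w fs.file-max=1000000

# User-level limit, set in /etc/security/limits.conf ("appuser" is hypothetical):
#   appuser  soft  nofile  500000
#   appuser  hard  nofile  500000

# Process-level limit for the current shell and anything it launches:
ulimit -n 100000

# Client-side ephemeral port range (relevant to the port math below):
cat /proc/sys/net/ipv4/ip_local_port_range
```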
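And the 65,535 arithmetic itself, as a back-of-the-envelope check (real systems reserve more than just port 0, so treat this as an upper bound):

```
# A port number is a 16-bit field: values 0..65535. Excluding port 0, as the
# article does, leaves 65,535 usable source ports per client IP.
ports_per_ip = 2**16 - 1
client_ips = 2                       # the article's example: a client binding 2 IPs
print(client_ips * ports_per_ip)     # 131,070 connections to one server IP:port
```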

Now let's move on to the system architecture. (Architecture diagram: users reach a VIP fronted by an LVS primary/backup pair, which load-balances across an nginx cluster, behind which sit the zuul gateways and the tomcat business services.)

Generally, companies use LVS + Nginx as the front-end architecture. There are several considerations behind the LVS + nginx design:
  • LVS works at layer 4 of the 7-layer OSI model (the transport layer), while nginx works at layer 7 (the application layer), and layer-4 load balancing performs better. (Newer versions of nginx can do both layer-4 and layer-7 load balancing, but enterprises still mostly use LVS for the layer-4 load.)
  • If nginx runs as a cluster, putting LVS in front of it to spread the load is the most appropriate design. Why must nginx be a cluster? Because a single nginx cannot support a million concurrent (simultaneously online) connections; we analyze below exactly how much concurrency one nginx can support.
  • LVS is the layer-4 load balancer, and its high availability is handled by the Keepalived middleware. Here we use one primary and one backup LVS: if the primary dies, the VIP automatically fails over to the standby machine. What is a VIP (virtual IP), and how does the switchover happen? We will explain that later too. For now, just know that users access the system through the VIP address, and both the primary and the backup LVS bind this VIP to serve the outside world. A minimal keepalived sketch follows this list.
  • This is the most important point: the user accesses the LVS IP address, LVS receives the request and, following a load-balancing algorithm such as round-robin or random, forwards it to an nginx node. nginx does the actual request handling as a reverse proxy, and when it responds, the reply does not need to travel back through the LVS server; nginx answers the client directly. How is that possible? That, too, is a function of the VIP, and the next article will explain these details together 🙂
  • Pitfall: remember that the LVS primary/backup pair and the nginx nodes must be deployed on the same local area network. Why emphasize the LAN? Again, it comes down to how the VIP works.
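To make the VIP concrete, here is a minimal keepalived sketch for the LVS primary, with made-up values throughout: VIP 10.0.0.100, interface eth0, router id 51:

```
vrrp_instance VI_1 {
    state MASTER             # the standby machine uses "state BACKUP"
    interface eth0           # NIC on the shared LAN (see the pitfall above)
    virtual_router_id 51     # must match on primary and standby
    priority 100             # the standby uses a lower value, e.g. 90
    advert_int 1             # VRRP heartbeat interval, in seconds
    virtual_ipaddress {
        10.0.0.100           # the VIP that users actually connect to
    }
}
```

If the standby stops hearing the primary's VRRP heartbeats, it promotes itself and takes over the VIP; VRRP announcements are LAN-scoped, which is exactly why the pitfall above insists on one local network.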

Now let's get into the numbers:

For the real business logic, suppose 1 tomcat service doing the core processing achieves 1,000 TPS (the actual TPS must be obtained by stress testing). Then 1,000,000 simultaneously online users require more than 1,000 tomcat servers (or nodes/pods, a K8s concept, since deploying on a cloud platform is more convenient nowadays; "more than" 1,000 because we must allow for service redundancy). Many people will feel 1,000 servers is a lot, but the scenario we assume here is 1 million concurrent, meaning the TPS to be supported is 1,000,000 requests per second.
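Spelled out as arithmetic (the 1,000 TPS per node is the article's stress-test figure; the 10% redundancy margin is an assumption for illustration):

```
import math

target_tps = 1_000_000        # one million requests per second, per the article
tps_per_tomcat = 1_000        # measured by stress testing (article's figure)
base = math.ceil(target_tps / tps_per_tomcat)   # 1,000 nodes with zero headroom
redundant = math.ceil(base * 1.10)              # ~10% spare capacity (assumed)
print(base, redundant)                          # 1000 1100
```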

Now let's analyze how many nginx servers are needed. Two nginx configuration properties are critical here: the number of workers (worker_processes) and worker_connections. nginx has a master-worker architecture: the master process manages the workers, and the worker processes are what actually accept and handle requests. nginx reportedly performs best at up to around 60,000 concurrent connections, so let's allow an nginx process at most 60,000 FDs. On a 4-core CPU (assume memory is plentiful: 32G, 64G, etc.) we set the number of workers to 4, which makes worker_connections 15,000 each. So how much concurrency can this nginx offer to the outside? You think 60,000? Wrong: because nginx acts as a reverse proxy, one user request actually occupies 2 FDs, one for the user's connection and one for the connection to the backend service. So with this configuration, one nginx supports 30,000 concurrent users. How many nginx nodes does a million-concurrency front end need, then? At least 40. Some will ask: isn't 1,000,000 / 30,000 ≈ 33, so rounding up to 34 is enough? Not so: we also have to allow for fault tolerance in the nginx cluster, so we definitely need more than 34.
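As configuration, the settings above would look roughly like this (a sketch using the article's numbers for a 4-core machine, not official recommendations):

```
worker_processes  4;             # one worker per CPU core

events {
    worker_connections  15000;   # 4 workers x 15,000 = 60,000 FDs in total
}

# As a reverse proxy, each user request holds 2 FDs (client side + upstream
# side), so this instance serves about 60,000 / 2 = 30,000 concurrent users.
```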

So how many machines do we need for LVS? In fact, two LVS clusters are enough, where one LVS cluster = one primary + one backup + keepalived. The reason for two clusters is simply the fear that if one cluster goes down, external users lose access entirely. Since there are two clusters, we need two VIPs. So which VIP does a user access? That is simple: we do not hand users VIP addresses directly. Nowadays a domain name is provided, and the domain name is bound to multiple VIP addresses, so that when users visit, DNS can return the VIPs of both clusters. Many people like to get to the bottom of things: doesn't a domain name bind to only one IP? Not so; look at it from another angle. There are many ISPs in China (that is, network providers: Unicom, Telecom, Mobile, and so on). If we bind one IP per ISP, can't the domain then resolve to three IPs? 🙂 The principle is much the same; just know that one domain name can be bound to multiple IPs. Then again, will two clusters really be enough? As long as the network entrance bandwidth is sufficient and the LVS servers are well provisioned, there is no problem at all, and servers today come with 10-gigabit network cards. LVS here only receives requests and forwards the user's packets to nginx, and then its work is done; it does not maintain the long-lived connection and does not need to respond to the user's requests. This, too, is a feature of the VIP setup. Large Internet companies generally also do multi-region active-active architecture; if anyone is interested, next time we can also talk about how to do multi-region active-active and its problem points :).
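What "one domain, several VIPs" looks like in practice (the domain is a placeholder, the addresses are documentation-range examples, and the output is annotated by hand):

```
$ dig +short www.example.com
203.0.113.10        # VIP of LVS cluster 1
203.0.113.20        # VIP of LVS cluster 2
```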

Finally, let's discuss the gateway, zuul. A gateway's job is mainly operation auditing, authentication and authorization, traffic control, multi-tenant control, and so on; it does not implement concrete business logic. (There are many gateway options now: istio with its sidecar can do it at the platform layer, Spring Cloud Gateway at the application layer, etc.) If a zuul node's stress-tested throughput is 10,000 TPS, then we need to deploy more than 100 zuul gateway servers.
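The same sizing arithmetic for the gateway tier (10,000 TPS per zuul node is the article's stress-test figure):

```
import math

gateways = math.ceil(1_000_000 / 10_000)   # 100 nodes at full load
print(gateways)                            # deploy more than 100 for headroom
```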

Finally, to summarize the architecture for supporting a million concurrent (simultaneously online) users, the numbers come out like this:

FD settings: 1,000,000 at the system level, 500,000 at the user level, 100,000 at the process level
2 LVS clusters (each one primary + one backup + keepalived)
40 nginx servers
At least 100 zuul gateway servers (at 10,000 TPS per zuul gateway)
At least 1,000 tomcat servers (at 1,000 TPS per tomcat for the business logic)
Scared off? Or thinking "our company doesn't have that many servers and still supports millions of users"? Not so fast: most of your scenarios are not 1,000,000 TPS; you are not actually absorbing a million requests every single second. Everyone should be clear-eyed about this.

Let's take a business scenario and analyze it. Our GPS reporting volume is one GPS-information request per device every 10 seconds, and there are 10,000 devices, so must we really sustain 10,000 TPS and then process nothing for the remaining 9 seconds? Instead we can design it this way: use kafka to receive all 10,000 messages sent in each 10-second window. If one server can process 1,000 messages per second, then two business servers can process 2,000 per second; since 10,000 GPS messages arrive every 10 seconds, two business servers finish the batch in 5 seconds and sit idle for the other 5. Right? :) So how many servers are needed has to be designed from the business scenario, which shows once again that technology cannot be separated from the business. Haha, leave any questions below and let's discuss ~
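A minimal sketch of this buffering pattern using the kafka-python client; the topic name gps-reports, the broker address, and the group id are all placeholders:

```
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "gps-reports",                       # hypothetical topic the devices write to
    bootstrap_servers="kafka:9092",      # placeholder broker address
    group_id="gps-workers",              # 2 consumers in this group split the load
)

def process(batch):
    """Placeholder for the real business logic (~1,000 records/s per server)."""
    ...

batch = []
for record in consumer:                  # kafka absorbs the 10,000-msg burst per 10 s
    batch.append(record.value)
    if len(batch) >= 1000:               # drain at the pace one server can sustain
        process(batch)
        batch.clear()
```

With two consumers in the same group, kafka splits the partitions between them, matching the two business servers above; a real worker would also flush partial batches on a timer.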

Please credit the source when reprinting; otherwise reprinting is prohibited: https://wp.me/paCouF-2Q