A revelation! One article is all you need to design a multi-site active-active architecture


01 About Infrastructure

Information technology now permeates nearly every human activity. The problems it must address are diverse and tangled, which has given rise to highly complex software systems built around all kinds of businesses. The core purpose of architecture is to manage this complexity. For large-scale businesses running on distributed Internet systems, the complexity is especially high, and it mainly comes from the following aspects:

  1. High availability. A distributed system has many nodes, so failures are inevitable; the key to high-availability design is limiting the impact of a failure and recovering from it as quickly as possible;
  2. High performance. Large-scale businesses generate massive request volumes, so the system must handle high concurrency, sustain high throughput, and keep response latency low;
  3. High scalability. Feature iteration, the request profile, and the external environment all keep changing; the system must be designed so it can adapt to these changes flexibly;
  4. Low cost. A software system is usually a commercial undertaking, so its construction must weigh the input-output ratio: minimize cost while maximizing business value;
  5. Security. The system must prevent data leaks, protect user privacy, and block illegal access and operations, keeping the system stable and reliable and safeguarding users' interests;
  6. Rich and shifting functionality. Business requirements keep changing, the foresight built into any architecture is ultimately limited, and "unknown unknowns" are hugely destructive to an architecture; this is the root of its complexity.

Architecture is a vast topic: design principles, abstraction methods, business decoupling, domain models, module division, and so on, each deserving an article of its own. Generally speaking, though, no matter how complex a software system is, it can be abstracted as "a set of processing logic over data, exposed to target users through some access path". Access, logic, data: that is the "basic infrastructure" this article discusses, as shown in the figure below. This infrastructure is chiefly concerned with high availability, high performance, and high scalability, and this article looks, from the backend's perspective, at how these requirements shape its design.

[Figure: the basic infrastructure - access, logic, and data layers]

02 About Multi-Site Active-Active

A business that needs to consider multi-site active-active almost certainly treats high availability as a core goal. High availability is about how the software system responds to failures. No component of a distributed Internet system is 100% reliable; failure is always possible, so ensuring availability requires designing for disaster recovery. The essence of disaster recovery is redundancy that removes single points of failure: when one part of the system fails, the redundant part takes over so that the service as a whole is unaffected, or only slightly affected.

At different stages of business development, the volume and scale of the business determine how much disaster it can tolerate, and the kinds of single points that disaster recovery must address differ accordingly:

  1. Single machine: at the very start, the business is small enough to run on one machine. The failures it faces are disk damage, operating-system faults, accidental data deletion, and so on; to survive them and avoid data loss, you back up the data and set up a master-slave pair;
  2. Single machine room: as the business grows it needs a substantial number of machines; with higher expectations, the machines are deployed across several machine rooms so that the failure of any single room is contained;
  3. Single city: when the business grows very large, multiple machine rooms within one city can no longer satisfy its disaster-recovery needs. A city-level disaster such as a typhoon, earthquake, or flood would turn the whole city into a single point of failure for the entire service.

Solving the city-level single point of failure is the "multi-site active-active" in the title. A city-level single point is very different from a single machine or a single machine room. A city becomes a single point under extreme disasters such as typhoons, earthquakes, and floods, and these disasters tend to affect a wide area, dragging down several neighboring cities at once. Cities clustered in one urban circle, such as Guangzhou and Shenzhen in the Pearl River Delta, Beijing and Tianjin in the Beijing-Tianjin-Tangshan circle, or Shanghai and Hangzhou in the Yangtze River Delta, often share much of their infrastructure, so that distance is not enough for disaster recovery. To truly achieve multi-site active-active, services generally have to be deployed a thousand kilometers or more apart, for example Shenzhen plus Shanghai, or Beijing plus Shanghai.

03 Write Latency Is the Key

3.1 The core is the data layer's write operations

In this infrastructure, the logic layer is, generally speaking, responsible for computation; it is stateless, so its traffic can be switched over and taken over seamlessly. What the logic layer ultimately does is read, process, and write data. A data-layer failure, by contrast, involves synchronizing, migrating, and restoring data, and the data must be complete and consistent before the replacement can be put into service. The key to disaster recovery in this infrastructure therefore lies in the data layer.

Data-layer operations are reads and writes. Reads do not change data state, so they scale easily through replicas; writes must replicate data to keep the multiple redundant copies complete and consistent. The crux of the data layer is therefore how write requests are handled.

3.2 Write latency changes qualitatively across cities

Cross-city write latency is a qualitative change because, at every disaster-recovery level below cross-city, virtually every business can tolerate the latency and data can simply be replicated synchronously. Once you go cross-city, whether the business can accept the higher latency needs careful deliberation, and the answer shapes the concrete solution.

Longer distance means longer latency. Round-trip times measured with the ping tool are roughly:

  1. Within the same machine room: under about 0.5 ms;
  2. Across machine rooms in the same city: under about 3 ms;
  3. Across cities a thousand kilometers apart, such as Shenzhen to Shanghai: under about 30 ms; Beijing to Shanghai or Shenzhen to Tianjin takes somewhat longer.

[Figure: measured round-trip latencies at the machine-room, same-city, and cross-city levels]

Once latency reaches the 30 ms level, availability faces a test on another level: can the business accept the cross-city cost of writing data? And it is not just a 30 ms question:

  1. In a cross-city disaster-recovery setup, a write involves both the write itself and replication to the remote copy, so one write request incurs roughly twice the latency, about 60 ms;
  2. Does the business logic issue n write requests serially? If so, the cost is n times 60 ms (see the sketch right after this list);
  3. Some foresight is needed: n times 60 ms may be acceptable today, but will n grow later, will it grow past the acceptable point, and can this constraint be enforced in subsequent designs? All of this deserves attention;
  4. Network conditions across cities can be worse, for example more prone to jitter. This is usually not a big problem: a company capable of building a cross-city network is usually capable of keeping it stable. Our tests also showed very stable cross-city latency: over a test of nearly four and a half hours, the maximum jitter stayed below 8 ms and the maximum latency within 30 ms.
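To make the arithmetic concrete, here is a minimal Go sketch. The 30 ms round trip is the figure measured above; the serial write counts are purely illustrative, not numbers from any particular business.

```go
package main

import "fmt"

func main() {
	const crossCityRTT = 30           // ms, cross-city round trip measured above (e.g. Shenzhen-Shanghai)
	const perWrite = 2 * crossCityRTT // one request = write + synchronous copy to the remote replica, ~60 ms

	// A flow that issues n dependent writes in series pays the cross-city cost n times.
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("n=%d serial writes -> ~%d ms of added latency\n", n, n*perWrite)
	}
}
```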

If the business can accept cross-city write latency, the problem degenerates into same-city disaster recovery and cross-city synchronous replication can be used directly. If the write latency is not acceptable, long-distance synchronous replication is off the table and a second-best solution is needed. Two directions are discussed below.

[Figure]

3.3 Synchronous replication: shorten the distance, lower the goal

One option is to shorten the distance: instead of going a thousand kilometers away, pair cities that are closer together, such as Guangzhou-Shenzhen, Shanghai-Hangzhou, or Beijing-Tianjin. At 100-200 km the latency is about 5-7 ms, so synchronous replication still works. But, as noted above, this does not achieve the real goal of cross-city multi-site active-active.

3.4 Asynchronous replication: shard by proximity, accept some loss

If synchronous replication is out, the first step is to shard the data by geography. When multi-site active-active cannot tolerate the latency, the sharding rules differ by business: an e-commerce platform and a food-delivery platform will certainly not shard the same way, but both basically shard by the user's location, putting each user's data in the city closest to that user. As shown in the figure below, once users are sharded to the nearest city, a write does not need synchronous cross-city replication: after the data lands on the primary write point, success is returned immediately, without waiting for it to reach the remote city.

[Figure: user data sharded to the nearest city, with asynchronous cross-city replication]
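A minimal Go sketch of the idea, assuming a hypothetical region-to-city table and in-memory shard objects; the names and mapping are illustrative, not taken from any real system.

```go
package main

import "fmt"

// Shard is the write point that owns a slice of users, placed in the city
// closest to them; a remote asynchronous replica receives the data later.
type Shard struct {
	City string
}

// Hypothetical mapping from a user's region to the nearest write city.
var regionToCity = map[string]string{
	"guangdong": "shenzhen",
	"shanghai":  "shanghai",
	"zhejiang":  "shanghai",
}

var shards = map[string]*Shard{
	"shenzhen": {City: "shenzhen"},
	"shanghai": {City: "shanghai"},
}

// PickShard routes a user to the shard nearest to their region,
// falling back to a default city for regions not listed.
func PickShard(region string) *Shard {
	city, ok := regionToCity[region]
	if !ok {
		city = "shenzhen"
	}
	return shards[city]
}

// Write commits locally and returns immediately; cross-city replication
// happens asynchronously, so the caller does not pay the ~30 ms penalty.
func (s *Shard) Write(key, value string) {
	fmt.Printf("committed %q on %s; async replication to the peer city is in flight\n", key, s.City)
}

func main() {
	PickShard("guangdong").Write("order:1", "...")
	PickShard("zhejiang").Write("order:2", "...")
}
```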

With asynchronous replication, it is inevitable that when a disaster strikes, some data will not yet have been replicated to the remote city, as shown in the figure below:

[Figure: data not yet replicated to the remote city when the disaster strikes]

For businesses with loose consistency requirements, such as microblogging or video, duplicated data is acceptable and traffic can be switched over freely; combined with some deduplication logic at the business layer, for example deduplicating the data written during the disaster window, this is usually good enough.

For businesses with strict consistency requirements, such as finance and payments, the switchover must limit the damage: based on the characteristics of the business, identify the set of data that may be affected and refuse service for operations by the users who own that data. The data replication architectures below come back to this.

04 Shard When the Write Volume Is Large

When the write volume is so large that a single write point cannot carry it, all writes can no longer funnel into one write point. The data must be sharded: split the full data set into several parts, each with its own independent write point. Write volume alone does not require sharding by proximity, but proximity still brings benefits, such as shaving off the 30 ms cross-city hop, so shards are usually placed near their users anyway. In the case shown in the figure below, sharding is driven by write volume: a write is only considered successful, and returned to the upper layer, after the data has been synchronously replicated to the remote city.

[Figure: sharding for write volume, with synchronous cross-city replication]
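A rough Go sketch of this write path, assuming a hypothetical hash-based shard function and a stand-in replicateSync call that blocks until the remote copy would have acknowledged.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

const shardCount = 4 // each shard has its own independent write point

// shardOf spreads keys across write points so no single point takes all writes.
func shardOf(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % shardCount
}

// replicateSync stands in for synchronous cross-city replication; the write is
// only acknowledged to the caller after the remote copy has confirmed it.
func replicateSync(shard int, key string) {
	time.Sleep(30 * time.Millisecond) // roughly the measured cross-city round trip
}

func write(key, value string) {
	shard := shardOf(key)
	// 1. commit on the shard's local write point (omitted here)
	// 2. block until the data has been copied to the remote city
	replicateSync(shard, key)
	// 3. only now report success to the upper layer
	fmt.Printf("key %q acknowledged on shard %d after cross-city sync\n", key, shard)
}

func main() {
	write("user:1001", "...")
	write("user:2002", "...")
}
```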

In addition, if a write-heavy business produces ever-growing data (for example, order data in e-commerce), the volume keeps accumulating over time. Such data is typically append-like: once written, it is rarely accessed or updated again. Data whose access frequency is very low, or zero, still occupies storage in the online database, wasting hardware and inflating IT costs. Depending on how fast the data grows, splitting databases and tables and archiving old data is enough; it does not create the kind of data sharding that requires instance-level isolation.

05 Shard for Isolation

Isolation is done to limit the impact of failures and anomalies on the business system as a whole; the core idea is "don't put all your eggs in one basket". Its effect on the data layer is the same as sharding for write volume above: the full data set is split into several parts, and a problem in one part does not affect the others. Isolation is actually not tied to cross-city multi-site active-active; it is also a common technique in same-city disaster recovery.

Besides its own data layer, a business system usually has other dependencies: shared base components, lower-level services, operations platforms, and so on. So isolation is usually not limited to the data layer; it pulls the various dependencies, and even the logic layer above, into one overall scheme. This kind of isolation, which chains the upper and lower dependencies together, goes by many names: "unitization", "SET-ization", "striping", and so on. The figure below is a schematic:

  1. Access layer: choose the unit (shard) route based on information carried in the request;
  2. Logic layer: handle only the requests that belong to this unit's shard (a minimal routing sketch follows the figure below);
  3. Data layer: from the isolation point of view alone, cross-city synchronous replication is fine;
  4. Unit isolation is not closely tied to the cross-city multi-site active-active discussed in this article, so it is not expanded further; its impact on routing is covered below.

[Figure: unit-based isolation across the access, logic, and data layers]
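A minimal Go sketch of unit routing, assuming a hypothetical modulo rule on the user ID; real systems usually combine geography with explicit allocation tables.

```go
package main

import "fmt"

// unitOf is the access-layer rule that maps a request to a unit (SET).
// Here it is a hypothetical modulo on the user ID.
func unitOf(userID uint64, units int) int {
	return int(userID % uint64(units))
}

// LogicServer serves only requests that belong to its own unit, so a fault
// inside one unit cannot spill over into the others.
type LogicServer struct {
	Unit  int
	Units int
}

func (s *LogicServer) Handle(userID uint64) error {
	if got := unitOf(userID, s.Units); got != s.Unit {
		return fmt.Errorf("user %d belongs to unit %d, not unit %d: reroute at the access layer",
			userID, got, s.Unit)
	}
	fmt.Printf("unit %d handled user %d\n", s.Unit, userID)
	return nil
}

func main() {
	srv := &LogicServer{Unit: 1, Units: 3}
	_ = srv.Handle(7)          // 7 % 3 == 1: served locally
	fmt.Println(srv.Handle(9)) // 9 % 3 == 0: rejected, must be routed to unit 0
}
```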

06 Other Influencing Factors

Discussions of data models often mention "read-heavy, write-light", "read-write intensive", "read-write separation", and similar patterns, so reads are clearly also an important factor in the data model. Unlike write latency, write volume, and isolation, however, which drive data sharding, reads mainly affect replica management, caching, and connection management. Reads are a secondary, downstream consideration: decide first whether to shard, then reason about reads on top of that sharding.

6.1 Reads can be served nearby

Depending on the business scenario, reads fall into two cases:

  1. Read-your-write: the read must see the latest value just written. This is a strong-consistency requirement and has to be served from the write point, so it effectively belongs to the write path;
  2. Delayed read: reading a slightly stale value is acceptable as long as eventual consistency holds. Such reads can be served from replicas that receive the data after the write, and they are what this section discusses.

Generally speaking, users are stricter about read latency than write latency. When posting a microblog, publishing a video, placing an order, or initiating a transfer, users expect to wait a little; but if reading the feed, scrolling videos, browsing products, or checking an account balance feels sluggish, they will simply leave. Most read scenarios can tolerate a small delay: if the newest content is not visible yet, users just refresh. Once the scenario is clear, the answer to read latency is obvious: provide a replica closer to the user. As shown in the figure below, a user in Shanghai reads from the replica in Shanghai instead of going to Shenzhen for the data.

[Figure: reads served by the replica nearest to the user]
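A minimal Go sketch of nearest-replica selection, with a hypothetical replica registry; the city names follow the example above.

```go
package main

import "fmt"

// replicasByCity is a hypothetical registry of read replicas; in the example
// above the primary sits in Shenzhen and an asynchronous copy in Shanghai.
var replicasByCity = map[string]string{
	"shenzhen": "replica-sz",
	"shanghai": "replica-sh",
}

const primary = "primary-sz"

// pickReadNode serves reads that tolerate slightly stale data from the replica
// in the caller's own city, and falls back to the remote primary only when no
// local copy exists.
func pickReadNode(callerCity string) string {
	if r, ok := replicasByCity[callerCity]; ok {
		return r
	}
	return primary
}

func main() {
	fmt.Println(pickReadNode("shanghai")) // replica-sh: no cross-city hop
	fmt.Println(pickReadNode("beijing"))  // primary-sz: fallback
}
```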

6.2 Add replicas for large read volumes

As with "reads can be served nearby", this still assumes reads that tolerate some delay. Many businesses are read-heavy and write-light, and a large read volume can be absorbed by adding replicas. Note, however, that once the number of replicas grows past a point, replicating data to all of them adds load on the write point; cascading replication solves this. Caches can further raise read throughput, which is not expanded on here. Overall, read volume usually does not force the kind of data sharding, and hence instance isolation, that write latency and write volume do. The figure below is a schematic of cascading replication.

[Figure: cascading replication]

6.3 Put a proxy in front of the connections

The data layer is the foundation of the business. Even after sharding, replicas, and caching have reduced the request volume hitting the DB to an acceptable level, the logic layer, as the data layer's caller, still has to open connections to the DB; the more logic-layer callers there are, the more connections the DB must hold. More connections do increase the DB's concurrency, support more callers, and raise throughput, but database performance does not scale without bound: beyond a threshold, resource contention and thread context switching caused by high concurrency actually degrade overall performance. The common practice is to put a proxy layer in front of the DB so that logic-layer callers no longer connect to it directly and the proxy consolidates the connections to the DB.

[Figure: a proxy layer between the logic layer and the DB]
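A small Go sketch of the caller side, assuming a MySQL-compatible proxy at a placeholder address; the connection caps use the standard database/sql knobs and the exact numbers are illustrative.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver; the proxy speaks the same wire protocol
)

func main() {
	// The logic layer connects to the proxy, not to the DB itself; the DSN and
	// address below are placeholders, not real endpoints.
	db, err := sql.Open("mysql", "app:secret@tcp(db-proxy.internal:3306)/orders")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Cap the connections each caller holds so that, even with many logic-layer
	// instances, the proxy can keep the total fan-in to the DB below its threshold.
	db.SetMaxOpenConns(20)
	db.SetMaxIdleConns(10)
	db.SetConnMaxLifetime(30 * time.Minute)

	if err := db.Ping(); err != nil {
		log.Printf("proxy unreachable: %v", err)
	}
}
```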

07 Data Replication Architectures

The discussion so far has used a simplified one-master-one-standby picture of replication. In reality, one master and one standby are not enough for disaster recovery. Below are several typical replication architectures.

7.1 Three locations, five centers

Achieving real disaster recovery essentially requires a majority (quorum) protocol, and the classic layout is three locations with five centers:

  1. Build 1 master and 4 replicas, 5 instances in total, spread across 5 IDC machine rooms in 3 cities;
  2. A write must reach 3 of the 5 instances: after landing on the master it must be synchronously replicated to 2 more replicas to satisfy the majority (see the quorum sketch after the figure);
  3. Whichever city fails, a complete copy of the data still exists outside that city, which satisfies the disaster-recovery requirement.

[Figure: three locations, five centers]
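A minimal Go sketch of the majority check, assuming replication acknowledgements arrive as simple booleans; it only illustrates the 3-of-5 rule, not a full consensus protocol.

```go
package main

import (
	"errors"
	"fmt"
)

// ack records whether one replica confirmed a write in time; in a real system
// this would come back from the replication RPC to that IDC.
type ack struct {
	idc string
	ok  bool
}

// writeQuorum reports success once the leader plus enough replicas hold the
// data to form a majority of the five instances (3 of 5); that majority is what
// lets any single city fail without losing committed writes.
func writeQuorum(followerAcks []ack) error {
	const total, majority = 5, 3
	got := 1 // the leader already has the data
	for _, a := range followerAcks {
		if a.ok {
			got++
		}
	}
	if got < majority {
		return errors.New("not committed: fewer than 3 of 5 instances have the data")
	}
	fmt.Printf("committed on %d of %d instances\n", got, total)
	return nil
}

func main() {
	// Leader in city A, IDC 1; two followers acknowledged, two were slow.
	err := writeQuorum([]ack{
		{idc: "cityA-idc2", ok: true},
		{idc: "cityB-idc1", ok: true},
		{idc: "cityB-idc2", ok: false},
		{idc: "cityC-idc1", ok: false},
	})
	if err != nil {
		fmt.Println("write failed:", err)
	}
}
```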

7.2 Three locations, three centers

Three locations with three centers forms the smallest possible majority and also satisfies disaster recovery, but it is usually not adopted, for the following reasons:

  1. Cross-city switchover is generally more complicated and more costly;
  2. Machines and machine rooms fail far more often than cities do;
  3. When a DB instance fails, the switchover should stay within the same city whenever possible; with three locations and three centers, every failure forces a cross-city switchover;
  4. Three locations and five centers does waste more machine resources than three locations and three centers. Where capacity would sit idle, instances can be co-located, using resource-isolation mechanisms (such as CGroup) to raise utilization while keeping co-located businesses from interfering with each other.

[Figure: three locations, three centers]

7.3 Three centers in one city

Where cross-city latency is unacceptable, a same-city three-center replication architecture is used.

[Figure: three centers in the same city]

Deploying several peer same-city three-center groups in different cities yields lossy cross-city disaster recovery, as shown in the figure below:

  1. When a city fails, its write requests are redirected to a peer same-city three-center group in another city; that is why, in the figure, every group also holds part of the data of its peer group;
  2. In this mode a city failure can produce duplicate data: a user inserts a record in the blue set1, the city fails before the record has been asynchronously replicated to the remote standby, traffic is switched to the peer green set2, the user cannot see the record just inserted, and so repeats the operation;
  3. When reading, the city's own write-point data can be merged with the asynchronous standby data of the peer group.

[Figure: peer same-city three-center groups deployed in different cities]

7.4 Dual masters with mutual replication

In the dual-master mutual-replication architecture, each instance holds the complete data set, but at any moment each master accepts writes for only its own part of the data, which avoids write conflicts.

[Figure: dual-master mutual replication]

This mode is also lossy for cross-city disaster recovery. The write logic during a failover is decided by recording synchronization time points (a small sketch of the rules follows the figure):

  1. Maintain a unified time-point generator; every write records its time point, and a newly inserted record is stamped Ti (insert);
  2. As data is replicated asynchronously, record how far replication has reached as Ts (sync);
  3. When a failure occurs, fence the failed instance against writes; the fencing time is Tb (ban);
  4. The lossy cases:
    1. Ts < Ti < Tb: a record inserted on the failed master that never reached the standby; the user cannot read it after the switch, retries the write, and duplicate data appears;
    2. Any record with Ti < Tb while Ts < Tb, that is, data that already existed before the write ban, must not be updated after switching to the new write point: some write before Tb may not have been replicated, so updating it directly could cause a write conflict.

[Figure: lossy failover governed by the Ti / Ts / Tb time points]
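A minimal Go sketch of these two rules, using illustrative time-point values; it assumes the single shared time-point sequence described above.

```go
package main

import "fmt"

// Positions on the shared timeline: ti is when a record was written on the old
// master, ts is how far asynchronous replication had reached, and tb is when
// the failed master was fenced (banned from writes).

// mayBeDuplicated captures rule 1: a record written after replication stopped
// but before the ban (ts < ti < tb) never reached the peer, so a retrying user
// can create it again on the new master.
func mayBeDuplicated(ti, ts, tb int64) bool {
	return ts < ti && ti < tb
}

// canUpdateOnNewMaster captures rule 2: once replication lags behind the ban
// point (ts < tb), every record that already existed before the ban (ti < tb)
// must be frozen, because some of its writes may be missing on the new master.
func canUpdateOnNewMaster(ti, ts, tb int64) bool {
	if ti >= tb {
		return true // created on the new master after the switch
	}
	return ts >= tb // only safe if replication had fully caught up before the ban
}

func main() {
	const ts, tb = 100, 120 // replication reached 100 when the master was fenced at 120
	fmt.Println(mayBeDuplicated(110, ts, tb))      // true: written but never replicated
	fmt.Println(canUpdateOnNewMaster(90, ts, tb))  // false: frozen until reconciliation
	fmt.Println(canUpdateOnNewMaster(130, ts, tb)) // true: owned by the new master
}
```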

7.5 The unsynchronized list

Both the peer same-city three-center layout and dual-master mutual replication above can produce duplicate data, and the root cause is not knowing which data has not been synchronized. If you can know exactly which data has not been replicated, you can reject operations on exactly that data.

A business may not accept N cross-city hops costing N times 60 ms, but it can usually accept one 30 ms cross-city hop. The mechanism works like this (a minimal sketch follows the figure):

  1. Before executing a write, make one cross-city call to record the owner of the data being written in an unsynchronized list;
  2. Once that write has been replicated to the cross-city instance, remove the owner from the unsynchronized list;
  3. Before writing, check whether the data's owner is in the unsynchronized list; if so, reject the request;
  4. This mode is still lossy, but it sacrifices only the very small set of users whose data has not been replicated, and data consistency is preserved. Combined with dual-master mutual replication it works very well.

[Figure: the unsynchronized-list mechanism]
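A minimal Go sketch of an in-memory unsynchronized list. It assumes the cross-city registration call and the replication confirmation are delivered to MarkPending and MarkSynced by surrounding code, and that Allow is what the peer city consults when it takes over after a failover; the names are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// UnsyncedList records which data owners still have writes that have not been
// replicated to the remote city. Registering the owner is one cross-city call
// per write, roughly the single ~30 ms round trip the business can accept.
type UnsyncedList struct {
	mu     sync.Mutex
	owners map[string]int // owner -> number of in-flight, unreplicated writes
}

func NewUnsyncedList() *UnsyncedList {
	return &UnsyncedList{owners: make(map[string]int)}
}

// MarkPending is called (via a cross-city RPC in the real system) before the
// local write is executed.
func (u *UnsyncedList) MarkPending(owner string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.owners[owner]++
}

// MarkSynced is called once replication of that write has been confirmed.
func (u *UnsyncedList) MarkSynced(owner string) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.owners[owner]--
	if u.owners[owner] <= 0 {
		delete(u.owners, owner)
	}
}

// Allow rejects operations for owners whose earlier writes are still
// unreplicated, which is what prevents conflicting versions after a failover.
func (u *UnsyncedList) Allow(owner string) bool {
	u.mu.Lock()
	defer u.mu.Unlock()
	return u.owners[owner] == 0
}

func main() {
	list := NewUnsyncedList()
	list.MarkPending("user:42")
	fmt.Println(list.Allow("user:42")) // false: a previous write is not yet replicated
	list.MarkSynced("user:42")
	fmt.Println(list.Allow("user:42")) // true
}
```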

08 How Data Shapes Routing

Considering the factors analyzed above, different businesses end up with different data forms, which fall into three categories:

  1. Cross-city global data: the data is not sharded, and master and replicas replicate synchronously across cities;
  2. Nearest-sharded data: the data is sharded; because the business cannot afford serial cross-city writes, replication is synchronous only within a city and asynchronous across cities;
  3. Cross-city sharded data: the data is sharded, and each shard's master and replicas replicate synchronously across cities.

The following looks at how each of these three shapes affects routing.

8.1 Cross-city global data: route to the nearest copy

The data is not sharded, there is a single write point, and replicas provide nearby reads. Here routing should be nearest along the whole chain, choosing by priority of machine room, then city, then global. If isolation of the logic layer matters, traffic can also be split at the access layer, but this buys little: for global data, nearest access already provides reasonable isolation.

[Figure: nearest routing for cross-city global data]

8.2 Nearest-sharded data: split traffic at the access layer

The point of nearest sharding is to remove the cross-city latency of a request that chains N writes. So the request must be routed to the city (machine room) where its data lives before the business logic starts executing writes; the N chained writes then all stay within one city (one machine room) and the cross-city cost disappears. This requires routing above the logic layer that executes the writes, and, considering logic-layer isolation, the split is usually done at the access layer. Reads can use the same routing policy as writes, so that reads and writes for one shard all land in one place. The same-city three-center replication inside each shard is omitted from the figure below.

[Figure: access-layer routing for nearest-sharded data]
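A minimal Go sketch of the access-layer decision, assuming a hypothetical user-to-home-city map; forwarding is shown as a returned string rather than an actual proxy call.

```go
package main

import "fmt"

// homeCityOf is the hypothetical shard map used by the access layer; it must be
// resolvable before any business logic runs, e.g. from the user ID.
var homeCityOf = map[uint64]string{
	1001: "shenzhen",
	2002: "shanghai",
}

// route decides, at the access layer, whether to hand the request to the local
// logic layer or forward it to the city that owns the user's shard, so that the
// N serial writes inside the business logic all stay within one city.
func route(userID uint64, localCity string) string {
	home, ok := homeCityOf[userID]
	if !ok {
		home = localCity // unknown users: hypothetical fallback, serve locally
	}
	if home == localCity {
		return "dispatch to local logic layer"
	}
	return "forward to access layer in " + home
}

func main() {
	fmt.Println(route(1001, "shanghai")) // forward to access layer in shenzhen
	fmt.Println(route(2002, "shanghai")) // dispatch to local logic layer
}
```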

8.3 Cross-city sharded data: split traffic at the access layer

Cross-city sharded data can tolerate cross-city latency, so writes are replicated synchronously across cities. Since the data is already sharded, proximity and fault isolation usually argue for placing shards near their users anyway: when splitting shards, cluster the users and spread the shards' primary write points across different cities and machine rooms. The routing impact is essentially the same as for nearest sharding. The difference is that cross-city sharding needs a third city for complete data disaster recovery, Tianjin in the figure below; the third city usually exists only to form the majority for data disaster recovery, takes no traffic, and is not considered for nearest shard placement.

[Figure: access-layer routing for cross-city sharded data, with a third city for the quorum]

09 Architecture Selection

When designing a concrete architecture, the steps below can serve as a checklist. The figure covers only the more general key points described in this article and leaves out many factors that may be decisive in a given business, cost being one: running multiple cross-city three-location five-center shards may lower throughput and raise machine cost compared with a single nearest same-city three-center setup, and that alone can decide which model to adopt. In short, architecture design is a very complex process with many and varied factors to weigh; analyze it against the specifics of the business.

[Figure: architecture selection flow]
