other · 2022年2月9日 0

Single database large table table sub-table design

background

Our business system GPS data volume is very large, a day 10,000 devices have about 36 million data, this scale of data self-query with mysql is a problem, so we use clickhouse storage, but CK for batch modification of deletion support are not good, and CK does not support transactions, our scenario is to save after the need to do some data enrichment, so consider the CK stored data back in mysql, Because we are a master-slave mode (there is no multi-master synchronization), so there is a single node to retain all the data, this is the theme of this article (a single database large table for spin-off table design), spin-off table can be extended to multi-node multi-master mode Ha, everyone note 🙂 some people may say that mycat to do sharding table, we have actually demonstrated, it is not high for our flexibility. The UDD engine we use is actually supported.

Design ideas

1. Group management is carried out according to device ID (eid) + tenant ID (tid).

2. The data for each table is 50 million rolling to create a new physical table

The data volume size can be configured in the following table

3. Create a t_md_shard_table table

Tenant ID (tid) Application ID (aid) Device ID (eid) Logical Tables (l_table) Actual Storage Table (r_table) Start Timestamp (Closed Interval) (start_timestamp)
10201 746 E00001 t_gps t_gps_A0 0
10202 747 E00002 t_gps t_gps_B0 0
10203 748 ES0003 t_gps t_gps_C0 0
End Timestamp (Open Interval) (end_timestamp) Whether to finish scrolling (is_finished_rolling)
+∞ N
+∞ N
+∞ N

t_md_shard_table initial stage, grouped according to the device ID (eid), every 1000 device data is stored in a table, for example, device numbers 0 to 999 are stored in t_gps_A0, 1000 ~ 1999 is stored in the t_gps_B0, the start time is 0, and the end time is +∞

4. t_md_statistic_table table

ID(id) Physical Tables (r_table) Table Current Data Row (current_rows)
1 t_gps_A0 45,000,000
2 t_gps_B0 42,000,000
3 t_gps_C0 43,000,000

t_md_statistic_table table to count the current data storage of each row of physical tables (r_table) every hour

5. t_md_shard_confg table

ID(id) Data size per table (table_max_size)
1 50,000,000

When t_md_statistic_table .current_rows is greater than or equal to t_md_shard_confg.table_max_size, the current timestamp is given to all data t_md_shard_table.r_table=t_gps_A0. Then automatically create a new physical table, scroll from the physical table t_gps_A0 to the t_gps_A1, using the following DDL statement: create table t_gps_A1 like t_gps_A0; this allows the old table from the previous one to be copied completely

6. The current state of the data after rolling

  • t_md_statistic_table
ID(id) Physical Tables (r_table) Table Current Data Row (current_rows)
1 t_gps_A0 45,000,000
2 t_gps_B0 42,000,000
3 t_gps_C0 43,000,000
4 t_gps_A1 1,000
  • t_md_shard_table table data
Tenant ID (tid) Application ID (aid) Device ID (eid) Logical Tables (l_table) Actual Storage Table (r_table) Start Timestamp (Closed Interval) (start_timestamp)
10201 746 E00001 t_gps t_gps_A0 0
10202 747 E00002 t_gps t_gps_B0 0
10203 748 ES0003 t_gps t_gps_C0 0
10201 746 E00001 t_gps t_gps_A1 2022-01-20 10:00:00
End Timestamp (Open Interval) (end_timestamp) Whether to finish scrolling (is_finished_rolling)
2022-01-20 10:00:00 Y
+∞ N
+∞ N
+∞ N

7. Process logical rules

UDD intercepts the SQL and then performs separate operations based on CRUD operations:

  1. Insert statement, query the t_md_shard_table table, according to the tenant ID (tid) + application ID (aid) + device ID (eid) + whether the scrolling is complete (is_finished_rolling)[N] to get the actual storage table, insert the data into the t_md_shard_table r_table.

Explain the use of the following timestamp:

In delete and modify operations, only exact timestamps are supported (in other words: only a single data operation is supported, not multiple data operations)

  1. Gets the timestamp T1 of a single piece of data, and then matches the data in the t_md_shard_table table according to T1> =start_timestamp AND T1<end_timestamp 来定位到实际r_table的表,然后把数据定位到具体存储表的ID行来删除

If multiple row deletions are supported, the storage timestamp range is distributed in multiple tables, and it is possible that the data is distributed in one table or multiple tables

  1. Delete statement, query the t_md_shard_table table, according to the tenant ID (tid) + application ID (aid) + device ID (eid) + timestamp range to locate the deletion data row is landed in which actual storage table (r_table).

  2. The Update statement queries the t_md_shard_table table, and locates the actual storage table (r_table) where the modified data rows fall according to the range of tenant ID+ application ID + device ID + timestamp.

  3. Select statement, query the t_md_shard_table table, and handle it according to the tenant ID+ application ID + device ID and the following scenarios:
    Passing is a timestamp range T1-T2[这种查询场景对应的是对应的某一天A点到B点,某一个时刻A点到B点], there may be a return of a r_table, two r_table, N r_table, then split before executing the real SQL:

    For example: SQL001=SELECT * FROM t_gps WHERE gps_start_time > "2022-01-21 10:00:00" AND gps_start_time < "2022-01-21 12:00:00"

    When executing for the business logic that needs to be paged, first import the Bean:DruidEngine class at the service code level, pass SQL001 into the engine.exec(sql), the UDD engine will return all the converted SQL statements according to the incoming SQL statements, if the SQL001 data is stored in the t_gps_A0 and t_gps_A1 the two tables, then the following two SQL will be returned:

    1. Subsql1 = SQL001=SELECT * FROM t_gps_A0 WHERE gps_start_time > "2022-01-21 10:00:00" AND gps_start_time < "2022-01-21 12:00:00"
    2. Subsql2 = SQL001=SELECT * FROM t_gps_A1 WHERE gps_start_time > "2022-01-21 10:00:00" AND gps_start_time < "2022-01-21 12:00:00"

    The application layer itself obtains N SQL statements (N≥1), and then performs parallel multithreaded queries, assembles the data sorting and paging itself

    Note: The larger the range segment, the more r_table tables queried, and the memory that needs to be consumed is very large, so it is recommended that the SQL statement be accurate to the query field and reduce the query time range.