200 likes | 484 Views
Innodb 表空间数据文件格式. xbjiang@corp.netease.com. 内容. 表空间数据文件内部格式介绍 记录格式介绍及 innodb 的 B+ 树应用。 实例演示,通过简单的工具来查看内部的实现。. innodb 数据文件组成. 表空间数据文件可以分为 2 种: 系统表空间: 保存了 innodb 数据库系统数据,也可以保存数据和索引。 ( 在 innodb 中以索引来保存数据 ) 2. 单独表空间: 当设置 innodb_file_per_table 参数时,数据库保存表的数据和索引到单独的文件,即表空间. 系统表空间的介绍.
E N D
Innodb 表空间数据文件格式 xbjiang@corp.netease.com
内容 • 表空间数据文件内部格式介绍 • 记录格式介绍及innodb的B+树应用。 • 实例演示,通过简单的工具来查看内部的实现。
innodb数据文件组成 • 表空间数据文件可以分为2种: • 系统表空间: 保存了innodb数据库系统数据,也可以保存数据和索引。(在innodb中以索引来保存数据) 2. 单独表空间: 当设置innodb_file_per_table参数时,数据库保存表的数据和索引到单独的文件,即表空间
系统表空间的介绍 • 系统表空间可以有多个文件组成,在配置中指定。 • 系统表空间既包括了innodb数据库的表定义元数据信息,又包括了表的数据和索引信息,还包括一些系统信息如:Insert Buffer, rollback segment, log文件信息.
单独表空间的介绍 • 单独表空间只有一个数据文件组成。保存了表的数据和索引。 • 单独表空间在数据库中被描述的元数据被保存在系统表空间中,所以innodb的单独表空间不能像myisam表一样简单的导入到别的数据库下。
表空间文件的物理组成 • 数据文件的最小单位是页,一页的默认大小是16K(8K 16K 32K 64K),可以在编译的时候修改源代码。 /* The universal page size of the database */ #define UNIV_PAGE_SIZE (2 * 8192) /* NOTE! Currently, this has to be a power of 2 */ /* The 2-logarithm of UNIV_PAGE_SIZE: */ #define UNIV_PAGE_SIZE_SHIFT 14
表空间文件的物理组成 • Innodb对64个page组成1个extent来管理。 • extent的使用可以分为2种:full extent, fragment extent. full extent 被分配的时候,64个page一起被分配。 fragment extent 可以从中分配单独的1个page。
表空间文件的逻辑组成 • innodb在其内部用segment为来组织数据的存储。 • 1个segment的物理页面有多个full extent和从fragment extent中申请的单独的page组成。 segment以3个链表来保存申请的full extent. 分别表示完全使用,部分使用和空闲。 在1个32slot的数组来保存从fragment extent申请的page,所以一个segment最多有32个从fragment extent分配的page。
old-style record /* Offsets of the bit-fields in an old-style record. NOTE! In the table the most significant bytes and bits are written below less significant. (1) byte offset (2) bit usage within byte downward from origin -> 1 8 bits pointer to next record 2 8 bits pointer to next record 3 1 bit short flag 7 bits number of fields 4 3 bits number of fields 5 bits heap number 5 8 bits heap number 6 4 bits n_owned 4 bits info bits */
new-style record /* Offsets of the bit-fields in a new-style record. NOTE! In the table the most significant bytes and bits are written below less significant. (1) byte offset (2) bit usage within byte downward from origin -> 1 8 bits relative offset of next record 2 8 bits relative offset of next record the relative offset is an unsigned 16-bit integer: (offset_of_next_record - offset_of_this_record) mod 64Ki, where mod is the modulo as a non-negative number; we can calculate the the offset of the next record with the formula: relative_offset + offset_of_this_record mod UNIV_PAGE_SIZE 3 3 bits status: 000=conventional record 001=node pointer record (inside B-tree) 010=infimum record 011=supremum record 1xx=reserved 5 bits heap number 4 8 bits heap number 5 4 bits n_owned 4 bits info bits */
DATA_TRX_PTR /*Builds a roll pointer dulint. */ UNIV_INLINE dulint trx_undo_build_roll_ptr( /* out: roll pointer */ ibool is_insert, /* in: TRUE if insert undo log */ ulint rseg_id, /* in: rollback segment id */ ulint page_no, /* in: page number */ ulint offset) /* in: offset of the undo entry within page */ { #if DATA_ROLL_PTR_LEN != 7 # error "DATA_ROLL_PTR_LEN != 7" #endif ut_ad(rseg_id < 128); return(ut_dulint_create(is_insert * 128 * 256 * 256 + rseg_id * 256 * 256 + (page_no / 256) / 256, (page_no % (256 * 256)) * 256 * 256 + offset)); }
innodb数据库的元数据 • 被保存在系统表空间中,也是以表的结构方式来保存的。和自己创建的表数据的存储方式一样。 • 有以下几个表: SYS_TABLES SYS_COLUMNS SYS_INDEXES SYS_FIELDS SYS_FOREIGN SYS_FOREIGN_COLS
SYS_TABLES dict_mem_table_create("SYS_TABLES", DICT_HDR_SPACE, 8, 0); dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "N_COLS", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "TYPE", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "MIX_ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "MIX_LEN", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "CLUSTER_NAME", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "SPACE", DATA_INT, 0, 4); dict_mem_index_create("SYS_TABLES", "CLUST_IND", DICT_HDR_SPACE, DICT_UNIQUE | DICT_CLUSTERED, 1); dict_mem_index_add_field(index, "NAME", 0); dict_mem_index_create("SYS_TABLES", "ID_IND", DICT_HDR_SPACE, DICT_UNIQUE, 1); dict_mem_index_add_field(index, "ID", 0);
SYS_COLUMNS dict_mem_table_create("SYS_COLUMNS", DICT_HDR_SPACE, 7, 0); dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "MTYPE", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "PRTYPE", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "LEN", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "PREC", DATA_INT, 0, 4); dict_mem_index_create("SYS_COLUMNS", "CLUST_IND", DICT_HDR_SPACE, DICT_UNIQUE | DICT_CLUSTERED, 2); dict_mem_index_add_field(index, "TABLE_ID", 0); dict_mem_index_add_field(index, "POS", 0);
SYS_INDEXES dict_mem_table_create("SYS_INDEXES", DICT_HDR_SPACE, 7, 0); dict_mem_table_add_col(table, heap, "TABLE_ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "NAME", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "N_FIELDS", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "TYPE", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "SPACE", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "PAGE_NO", DATA_INT, 0, 4); dict_mem_index_create("SYS_INDEXES", "CLUST_IND", DICT_HDR_SPACE, DICT_UNIQUE | DICT_CLUSTERED, 2); dict_mem_index_add_field(index, "TABLE_ID", 0); dict_mem_index_add_field(index, "ID", 0);
SYS_FIELDS dict_mem_table_create("SYS_FIELDS", DICT_HDR_SPACE, 3, 0); dict_mem_table_add_col(table, heap, "INDEX_ID", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "POS", DATA_INT, 0, 4); dict_mem_table_add_col(table, heap, "COL_NAME", DATA_BINARY, 0, 0); dict_mem_index_create("SYS_FIELDS", "CLUST_IND", DICT_HDR_SPACE, DICT_UNIQUE | DICT_CLUSTERED, 2); dict_mem_index_add_field(index, "INDEX_ID", 0); dict_mem_index_add_field(index, "POS", 0);
SYS_FOREIGN & SYS_FOREIGN_COLS error = que_eval_sql(NULL, "PROCEDURE CREATE_FOREIGN_SYS_TABLES_PROC () IS\n" "BEGIN\n" "CREATE TABLE\n" "SYS_FOREIGN(ID CHAR, FOR_NAME CHAR," " REF_NAME CHAR, N_COLS INT);\n" "CREATE UNIQUE CLUSTERED INDEX ID_IND" " ON SYS_FOREIGN (ID);\n" "CREATE INDEX FOR_IND" " ON SYS_FOREIGN (FOR_NAME);\n" "CREATE INDEX REF_IND" " ON SYS_FOREIGN (REF_NAME);\n" "CREATE TABLE\n" "SYS_FOREIGN_COLS(ID CHAR, POS INT," " FOR_COL_NAME CHAR, REF_COL_NAME CHAR);\n" "CREATE UNIQUE CLUSTERED INDEX ID_IND" " ON SYS_FOREIGN_COLS (ID, POS);\n" "COMMIT WORK;\n" "END;\n" , FALSE, trx);
SYS_IBUF_TABLE_0 dict_mem_table_create(“SYS_IBUF_TABLE_0”, space, 2, 0); dict_mem_table_add_col(table, heap, "PAGE_NO", DATA_BINARY, 0, 0); dict_mem_table_add_col(table, heap, "TYPES", DATA_BINARY, 0, 0); table->id = ut_dulint_add(DICT_IBUF_ID_MIN, space); index = dict_mem_index_create( “SYS_IBUF_TABLE_0”, "CLUST_IND", space, DICT_CLUSTERED | DICT_UNIVERSAL | DICT_IBUF, 2); dict_mem_index_add_field(index, "PAGE_NO", 0); dict_mem_index_add_field(index, "TYPES", 0);
B+树介绍 • 在innodb中,B+树的叶子节点和非叶子节点分别存在2个不同的segment中。 • index通过前缀压缩来提高索引存储量。 • 数据都保存在叶子节点。 • innodb的索引和数据的保存和myisam很不一样。处处都是索引,并且secondary index包括primary key,所以要注意primary key的选择。