420 likes | 491 Views
Writing Storage Engines Brian Aker Director of Architecture Montreal PHP Conference March 2005 MySQL AB. Who am I?. Brian Aker Director of Architecture, MySQL AB
E N D
Writing Storage EnginesBrian AkerDirector of ArchitectureMontreal PHP ConferenceMarch 2005 MySQL AB
Who am I? • Brian Aker • Director of Architecture, MySQL AB • Author of mod_layout, the apache streaming services mod_mp3, Slash (Slashdot’s CMS System), and lot of other things on Freshmeat.... • http://mysql.com/ • http://krow.net/
MySQL • 5 million installations • 200 employees, 20+ countries • Most North American Developers live in Seattle • We even have an office in Seattle • No, developers never go there
MySQL Server • High Performance RDBMS • SQL-Based, aiming to be SQL-99 compliant • Stable • Scalable • Embedded in hardware (including JMX) • Extremely high load applications • Master/Slave Replication • Easy to use • Modular • “Storage Engines” • Many features can be disabled at runtime and/or compile time to conserve resources
Client Library Support • Libmysql c-library (think OCI) • JDBC – Type IV JDBC Driver • ODBC • Perl DBD::DBI • PHP (built in) • ADO.Net, OleDB, Ruby, Erlang, Eiffel, Smalltalk, etc, etc. provided by third parties
Goals • Overview of MySQL Architecture • Understanding of Storage Engine Architecture • Knowledge of required methods • Starting points for coding • sql/ is for the kernel • mysys/ is the portable runtime • mysql-test is for you test cases
What does it take? • All code is written in simplified C++ • An example storage engine • Your Ideas
Server’s Kernel Parser Optimizer Storage Engine MyISAM Innodb NDB HEAP Merge
What is a Storage Engine? • “Data formats on Disk” • Examples • Innodb • MyISAM • BDB • Cluster • HEAP • CSV • Your’s!
Example Table • CREATE TABLE foo ( • a int, • b char(4), • c varchar(9), • d blob) • ENGINE = MYISAM;
Rows • Rows are made up of Fields INT CHAR VAR BLOB NULL 4 4 L + 9 L + P NULL
Fields • Rows are made up of fields C H A R NULL
What do I need to do to add one? • Subclass Field in field.h • Implement a few methods: • Storage: store(string), store(long long), store(double) • Retrieve:val_real(), val_int(), val_str() • Other: field_cast_type(), result_type(), cmp(), sort_string(), max_length()
Field Store Example • int Field_ipaddrv4::store(const char *from, uint length, CHARSET_INFO *cs){ int count; count= sscanf(from, "%u.%u%u.%u", ptr, (ptr +1), (ptr +2), (ptr +3)); if (count != 4) { bzero(ptr, 4); return -1; } return 0;}
Field val Example • String *Field_ipaddrv4::val_str(String *val_buffer __attribute__((unused)), String *val_ptr){ int count; count= snprintf(buffer, 15, "%u.%u.%u.%u", ptr[0], ptr[1], ptr[2], ptr[3]); val_ptr->set((const char*) buffer,count, &my_charset_latin1); return val_ptr;}
Break Down of Storage Engine Methods • Table Control • Optimizer • SQL Modifiers • SQL Reads
Table Control • ::create() • ::open() • ::close() • ::delete_table()
ha_example::create() • int ha_example::create(const char *name, TABLE *table_arg, HA_CREATE_INFO *create_info){ DBUG_ENTER("ha_example::create"); /* This is not implemented but we want someone to be able that it works. */ DBUG_RETURN(0);}
ha_example::open() • int ha_example::open(const char *name, int mode, uint test_if_locked){ DBUG_ENTER("ha_example::open"); if (!(share = get_share(name, table))) DBUG_RETURN(1); thr_lock_data_init(&share->lock,&lock,NULL); DBUG_RETURN(0);}
ha_example::close() • int ha_example::close(void){ DBUG_ENTER("ha_example::close"); DBUG_RETURN(free_share(share));}
ha_example::delete_table() • int ha_example::delete_table(const char *name){ DBUG_ENTER("ha_example::delete_table"); /* This is not implemented but we want someone to be able that it works. */ DBUG_RETURN(0);}
Optimizer • ::info() • ::records_in_range()
info() • void ha_heap::info(uint flag) • { records = info.records; deleted = info.deleted; errkey = info.errkey; mean_rec_length=info.reclength; data_file_length=info.data_length; index_file_length=info.index_length; max_data_file_length= info.max_records* info.reclength; delete_length= info.deleted * info.reclength; if (flag & HA_STATUS_AUTO) auto_increment_value= info.auto_increment;}
records_in_range() • ha_rows ha_example::records_in_range(uint inx, key_range *min_key, key_range *max_key){ DBUG_ENTER("ha_example::records_in_range"); DBUG_RETURN(10); // low number to force index usage}
SQL Modifiers • delete_row() • write_row() • update_row()
ha_example::delete_row() • int ha_example::delete_row(const byte * buf){ DBUG_ENTER("ha_example::delete_row"); DBUG_RETURN(HA_ERR_WRONG_COMMAND);}
ha_archive::insert_row() • int ha_archive::write_row(byte * buf){ char *pos; z_off_t written; DBUG_ENTER("ha_archive::write_row"); statistic_increment(ha_write_count,&LOCK_status); if (table->timestamp_default_now) update_timestamp(buf+table->timestamp_default_now-1); written= gzwrite(share->archive_write, buf, table->reclength); DBUG_RETURN(0);}
ha_tina::write_row() • int ha_tina::update_row(const byte * old_data, byte * new_data){ int size; DBUG_ENTER("ha_tina::update_row"); size= encode_quote(new_data); if (chain_append()) DBUG_RETURN(-1); if (my_write(share->data_file, buffer.ptr(), size, MYF(MY_WME | MY_NABP))) DBUG_RETURN(-1); DBUG_RETURN(0);}
SQL Reads • Scan Reads • rnd_init(), rnd_next(), position(), rnd_pos() • Index Reads • index_read(), index_next(), index_prev(), index_first(), index_last()
ha_tina::rnd_init() • int ha_tina::rnd_init(bool scan){ DBUG_ENTER("ha_tina::rnd_init"); current_position= next_position= 0; records= 0; chain_ptr= chain; if (scan) (void)madvise(share->mapped_file,share->file_stat.st_size,MADV_SEQUENTIAL); DBUG_RETURN(0);}
ha_tina::rnd_next() • int ha_tina::rnd_next(byte *buf){ DBUG_ENTER("ha_tina::rnd_next"); current_position= next_position; if (!share->mapped_file) DBUG_RETURN(HA_ERR_END_OF_FILE); if (HA_ERR_END_OF_FILE == find_current_row(buf) ) DBUG_RETURN(HA_ERR_END_OF_FILE); records++; DBUG_RETURN(0);}
ha_tina::position() • void ha_tina::position(const byte *record){ DBUG_ENTER("ha_tina::position"); ha_store_ptr(ref, ref_length, current_position); DBUG_VOID_RETURN;}
ha_tina::rnd_pos() • int ha_tina::rnd_pos(byte * buf, byte *pos){ DBUG_ENTER("ha_tina::rnd_pos"); current_position= ha_get_ptr(pos,ref_length); DBUG_RETURN(find_current_row(buf));}
ha_example::index_read() • int ha_example::index_read(byte * buf, const byte * key, uint key_len __attribute__((unused)), enum ha_rkey_function find_flag __attribute__((unused))){ DBUG_ENTER("ha_example::index_read"); DBUG_RETURN(HA_ERR_WRONG_COMMAND);}
ha_example::index_next() • /* Used to read forward through the index.*/int ha_example::index_next(byte * buf){ DBUG_ENTER("ha_example::index_next"); DBUG_RETURN(HA_ERR_WRONG_COMMAND);}
Table Scan • ha_example::store_lockha_example::external_lockha_example::infoha_example::rnd_initha_example::extra Cash record in HA_rrnd()ha_example::rnd_nextha_example::rnd_nextha_example::rnd_nextha_example::extra End cacheing of records (def)ha_example::external_lockha_example::extra Reset database to after open
That is All? • Transaction methods • Bulk load methods • Defrag methods • Lot more (read handler.h)
Autoconf • Autoconf files in the top-level source directory • acconfig.h • acinclude.m4 • config.in
Additional Files • Basic server files modified under sql/ • sql/Makefile.am • sql/handler.h • sql/mysql_priv.h • sql/handler.cc • sql/mysqld.cc • sql/set_var.cc
Test Cases • Test cases created under mysql-test • mysql-test/include/have_mmap.inc • mysql-test/t/mmap.test • mysql-test/r/mmap.result
Other Thoughts • What are your goals? • Read only? • Durable? • Network?
More Information • sql/ha_example.[h|cc] • Look at Documents on mysql.com • lists.mysql.com • MySQL Support Contracts