160 likes | 183 Views
Learn about data migration, importing/linking, data analysis, and more in database development practices. Hands-on exercises included.
E N D
English Info BSc: State Of The Art Database Systems Practice 2 Dr. Gábor Pauler, Associate Professor, PTE-TTK, Room F104, 6th Ifjusag str. Pecs, Hungary Mobile: 30/9015-488, Skype: gjpauler E-mail:pauler@t-online.hu Facebook and Open FTP sites of the course: http://www.facebook.com/groups/278606362188127/ ftp://gamma.ttk.pte.hu/pub/pauler/StateOfTheArtDataBase/
Content of the Practice • Checking Home Assignment 1: BPD of own project • Practice 2-1: Creating Friends Table • Problems of Data Migration Into Databases • Importing/Linking • From Clipboard • From Text Files • Fixed Column Width • Delimiter Character Based • Import specifications • From MS Excel • From Other Relatinal Databases • Practice 2-2: Importing Stores Table • Table Analysis Wizard • GUI • Decomposition • Cleaning Data • Evaluation • Home Assignment 2: Analyze Sample Table • Literature
Practice 2-1: Creating Friends Table • Create a databasein MS Access calledFriends • Sourcedatawasgivenby a secretary (whodoesnothaveanyexpertiseindatabases) as text (seebelow) withthefollowingfields: Friends LastName, FirstName, City, Address, ZIPCode, Phone, Birthday, Debt, /Whethertheguy has anydebtforus?/ GreetingCard) /Shouldwesende-greetingcardforbirthday?/ • Pleasefind out inhowmanytablesitcan be stored,and whatfields? • Whatshould e thetype, format, lenght of fields? • Fillthedatabasetablesfromthe text below: Missing/Emptycellshould be NULLvalued (notzero!) Samplesolution: 2-1PracticeSolution.ppt Kovács, Jakab, Kishely, Patak str. 77., 7632, +36-32-342-222, 1992.12.11, -2000, Yes Lutor, Mihály, Kishely,Baranyai str. 11., 7544, , 1992.12.12, 1700, Yes Bruncsi, Endre, Pécs, Bruncsistr. 11. 7622, , 1981.11.11., 7000, No Éberling, László, Pécs, Kassa str. 83., 7624, , 1994.04.21., 0, No Vasvári, Tibor, Pécs, Csáktornya str. 33., 7623, +36-72-472-333, 1980.09.01, Yes Bányai, János, Nagykanizsa, Kazinczi str. 23., 2939, +36-63-233-423, , 0, No Agócs, Attila, Kishely, Királyi Pál str. 7., 1322, , 1981.02.03., 0, No Pék, Zoltán, Budapest, Erdész str. 30., 1122, , 1980.11.22., 4000, Yes
Databases in normal transaction processing are usually loaded with data through (Forms) of GUI. But during database development, it is important task of developers to migrate data from paper or (Legacy Systems): earlier word processor text, spreadsheet, binary, database files. We show the options in MS Access Data Import Wizard. Access is a low grade database sofware, but accessible everywhere and has some useful auxiliary utilities. Migration into Access can be done 2 ways: (Importing): Migrating data into Database Table independent from data source Migration is slower as it first completely reads up and checks table, we can read it only after that Database table will be fully editable Database table can be queried fast once the migration finished (Linking): It defines only a table structure of interpreting external data source into a temporary (View Table) record by record, data is still stored in data source It reserves minimal space only for the table structure, Only those records are interpreted from the data source, which are currently under use, so you can see partial results of migration faster. This is useful if you want to skim into a foreign database table, getting a sample of it View table will be read-only, its content cannot be modified Querying the view table is much slower than querying a database table, as records are interpreted from data source on the fly Except if we are linking table from another Access database. Intis case it can be read equally fast as local tables. E.g. As Access database stores all tables in single file, they size together cannot exceed the maximal file size of Windows. But, if we put all tables in separate database files, and link them together, individual table sizes can reach maximal file size of OS. Problems of Data Migration Into Databases: Importing vs. Linking?
There are several possible data sources of Importing/Linking: (ClipBoard): In small businesses, shockingly important legacy data can reside in Word, HTML tables, as „Our secretary could handle only this…” After selecting the table in data source( ), we can copy it to clipboard with Ctrl+C It can be pasted in an Access database table directly, if the table has compatible record structure: same named fields with the same type, in the same sequence Select all field headers of the target table with Shift+Click then press Ctrl+V If the Access table is empty, it will paste as many records as are in the data source If the Access table is not empty, it will paste only the first record of the data source, regardless its record number because of a bug! Migrating through clipboard will work for some 1000s of records at best, above that, it will freeze. Data Migration into Database: Types of Data Sources: HTML Tables in Clipboard Ctrl Ctrl C V Shift +Click
Data Migration into Database: Types of Data Sources: Text Files • (TextFile) (*.dat, *.txt, *.asc) is still the most popular I/O file format, even recently XML format starts to replace them: • Numbers and dates are also stored in text wastes space, slower reading by database software • Rows are separated by CR+LF (ASCII13+ASCII10) characters • They does not contain their (Metadata): complete description of data sturcture, • They can be opened by word processor software and visually checked. • They have standardized coding (Unicode character table) • There are several sub-types of text files: • (Fixed-Width Text File): all columns are pre-determined number of charaters wide (eg. 8, 20, 12,…), no column separator neccessary • Opening at word processor, we can se nice vertically aligned columns (if we see it in fixed width font type, eg. Courier New) • If the data content of a column in a given line does not fully fill it, Space (ASCII32) characters are written after the data • Most of the time, it does not contain column names in 1st line: • Column names often more long than the predetermined width of the column (e.g. a binary (0,1) valued column is 1 char) • If long column name is still inserted in the first line above a short column, we need lot of filler spaces after data wasting space 161916 Kovács János 7636 Pécs Galamb str. 36/b 161917 Légrádi Péter 7310 Kishalom Kossuth str. 11 161918 Sass Lajos 6740 Őrmüge Petőfi str. 1
Data Migration into Database: Types of Data Sources: Text Files: Fix Col Width: Access • In Access, you can launch Text Import Wizard with External Data|Text File: • It asks Path and Name of source file: e.g FixColWidthTextFile.txt and whether to Import or Link • It shows sample from the beginning of the file and you can check Fixed Column Width • Wizard tries to auto-detect column borders, but you can override them pulling the mouse • Then you can name Columns and check out whether their type is correctly auto-detected: • It auto-detects column type from the first 5 lines,so they shouldn’t be empty valued • It recognizes Date, Time and Real values only if they comply Windows|Control panel|International|Date/Number (eg. Hungarian: YYYY.MM.DD HH:MM:SS, 9.999,99)!!! All others are processed as text (you can notice that they will be aligned left while numbers/dates right). How can you fix it during import? • Press Settings button for Import Specification panel. Here you can set: • Character table used • Date- and decimal separators • Column Name, Type • Startingcharacter position and column width • Omitting column from import • Specification can be Saved into Access file and opened next time with Specification button, so you do not have to set it again and again. click click click pull click click click click click click • Then it asks whether there is any (Primary Key): column identify-ing lines uniquely, or to Create new PK • At closing wizard, you can give Table Name
Data Migration into Database:Types of Data Sources:Text Files: Delimited Files • At (Delimited Text File) (*.dat, *.txt, *.asc, *.csv, *.tab) there are special delimiter characters between content of columns • So at all columns/lines, there can be variable lenght of data content, there is no need for filler Spaces, avoiding wasting of storage space. • If we open the file with word processor, columns won’t be nicely aligned vertically, but we will see lot of identical column separator characters! • Column names are usually stored in the first line, as their width is independent from width of data content • Reading the file in database is somewhat slower, as it can be read only sequentially: starting position of lines/columns cannot be computed in advance, because of varying lenghts! • Two consecutive separators are imported by default as empty column with NULL value • Type of column separators can be: • Space (ASCII32) (*.asc): • Not very paractical, as text typed columns also can contain Spaces, can be mixed • To prevent this, content of text columns can be put between (Text Marker) characters (e.g.”John Doe” (ASCII34)) • Space is usually used as additional column separator in fixed column width text files for safety reasons • Comma (ASCII44): (Comma Separated Values) (*.csv): • It was very popular in older hardware communicating through serial port • Recently it is largely replaced with XML format communication • Not very popular in Germany and Hungary, where decimal part separators of real numbers are also comma characters, can be mixed, therefore we use semi-colon ; • Tab (ASCII9): (Tab Delimited File) (*.dat, *.tab): • Tab is a special character separating columns, it cannot be mixed with anything else LoyaltyCard,Name,ZIP,City,Street,HouseNum 21114,Kiss Lajos,6411,Szabadszállás,Kossuth str.,11 22116,Sass Iván,7621,Pécs,Rákóczi str.,72 23104,Katzenberger Gyuláné,,Bakonydér,Teleki Pál str.,44
Data Migration into Database:Types of Data Sources:Text Files:Delimited:Access • In Access, you can launch Text Import Wizard with External Data|Text File: • It asks Path and Name of source file: e.g DelimitedTextFile.txt and whether to Import or Link • It shows sample from the beginning and you can check Delimited Columns • Wizard tries to auto-detect column delimiter character, but we can override it adding alternative column delimiter- and text marker characters also. • We can set whether it reads column names from the first line • The rest is the same as previously • Importing from Excel: you can import from a worksheet, where first line contains column names, and there are same typed values within a column. We can launch wizard with External data| Excel button: • It asks Path and Name of the file • It asks name of Worksheet • It auto-detects last filled line and column on workseet and imports until that: If there is any „garbage” on the worksheet „downstairs”, it will import many emty rows and columns in vain! • Any text cell will be Text(255) column, any number Double (8byte), any Date or Time DateTime (8byte) • Import from other Access file: External data| Access: You can select Tables, Queries, Forms, Macros to import/link in current file, type matching is not a problem here HataroloJelesSzovegFajl.txt click click click click click • Import from other database: External data| ODBC: You can import from MSSQL, Oracle, Dbase, MySQL if they have installed ODBC driver in Windows|Control panel|Data sources, but only Tables and Queries, not Forms!
Content of the Practice • Checking Home Assignment 1: BPD of own project • Practice 2-1: Creating Friends Table • Problems of Data Migration Into Databases • Importing/Linking • From Clipboard • From Text Files • Fixed Column Width • Delimiter Character Based • Import specifications • From MS Excel • From Other Relatinal Databases • Practice 2-2: Importing Stores Table • Table Analysis Wizard • GUI • Decomposition • Cleaning Data • Evaluation • Home Assignment 2: Analyze Sample Table • Literature
Practice 2-2: Importing Stores Table From Excel To Access • Stores.xls contains at its first worksheet data of 160 supermarkets of a USA-based supermarket chain in regularly typed columns • Import data into an Access database table named Stores, assign column StoreID as primary key of table! • Sample Solution 2-2: Stores.mdb • We suspect that data structure of this table is full with nested 1:many relations, but we still do not have enough experience in relational database design to normalize it. Therefore we will use Table Analysis Wizard of MS Access to help us to decompose this un-normalized table into normalized tables and 1:many relations connecting them
Table Analyzer Wizard 1 • AfterselectingtableStoresinStores.accdbclickonDatabaseTools|Analyzetable • SelectStoresasbasetable of analysis • SelectWizarddecidesdecomposingfieldsintotables: itcomputesfrequenciesforallvalue-pairsfrom 2 fieldsinthetable, repea-tingitforallpossiblepairing of 2 fields • Basedonfrequencies, itestimates1:1 and 1:many relations betweenfields, decompo-sesthemintotables, tries to findtheirpri-marykeys, defines relations betweenthem • Youcanrenametableswith( ) basedonit. • Ifyouthinkthatsomefieldswereplacedinwrongtable, youcanselectthemwithShift +Click and Drag&Dropthemintoanothertable, or a newtabledroppingatthecanvas • Ifitcouldnotassignprimarykeyin a tableautomatically, youcanassignmanuallyauniquevaluedfieldwith( ), or add artificialnumericprimarykeywith( ) • Intheresultyoucanseethatthewizarddetects most important relations correctly: 1CitycanhavemanyStore 1Region) canhavemanyCity 1County) canhavemanyCity • However, decomposition is notperfect: • ItdoesnotdetectState:Region=1:manyrelation, becausethere is a State (=„OH”), wherethere is only1Regionbyactualdata • Itdoesnotdecomposeline phoneArea: City=many:manyrelation, asitcannotdetectmany:many relations inonestep • Itputsproblematicfiels (eg. Area) beforetheprimarykey (eg. City) inresulttables click click click click click click click click click
Table Analyzer Wizard 2 • In the next step wizard tries to detect fake relations because of mistypings: eg. now City:Area=many:many, which the wizard cannot handle. But it notices that it could be many:1 as there are only 3 Cities with multiple Area codes. (We know that in the USA City: Area= many:many relation is really correct) • The wizard assumes that manyArea codes at 1City are only mistypings, and offers us to select 1 valid Area code for 1City, replacing mistyped values with it. • If we do it, then City:Area=many:1 and Area wont’ be a problematic field in table Cities1 any more, but a foreign key • After data cleaning, wizard offers to pre-pare a (View Table Query): temp table computed on the fly from decomposed tables simulating the original table. If you compare its content with the original Stores table, you can make sure that no data is lost or mixed during decomposing • Pressing Finish, the wizard writes per-manently the decomposed tables, their relations (and the query) in the database, leaving the original table there. • At Database Tools|Relations, you can open an empty canvas, add all decom-posed tables with( ), and see (Entity Relationship Diagram, ERD) of database • You can save aligned ERD with Ctrl+S click click katt click click click click click click click click click Ctrl S
MS Access: Adatbázis eszközök: Tábla Analizáló Varázsló 4 • If the wizard cannot extract a relation because of insufficient data (eg. State: Region=1:many), you can re-launch the wizard starting from the insufficiently decomposed table (eg. Regions1) and make decomposition manually: • In manual mode, you can select problematic fields (State, StateName) in the table with Shift+Click and Drag&Drop them to the canvas to form a new table and a relation connecting it to the original one. • Then you can rename the table with( ) and assign its primary key with( ) • Sample solution:StoresNormalized.accdb • Evaluation of the table analyzer wizard: • It can make automatic decomposition if it gets enough (Representative) data in the source table: it should contain all practical variations of the data proportionally to their real occourence • It tries to auto-detect natural primary key fields • It auto-detects if possible mis-typings confuse up a relation and proposes a solution for data cleaning • It can detect only 1:many, independent:dependent relations, while its theoretical basis (frequency tables) could handle many:many as well • It connects n decomposed tables with minimal possible number of relations n-1 click pull pull • So it does not care about how to accelerate queries working on decomposed tables later • It is absolutely useless if there is not enough representative data in legacy system to be replaced with new database developement. It is 95% true at paper-based systems. • Minimal number of the representative records neccessary to successfully decompose n tables explodes non-linearly: it requires 16/6×n×(n-1) perfectly representative records (which means 10-20 times more in real, unperfect data): • Eg. For decomposing 100 tables (a medium sized corporate database) we would need at least 26,400 ideally representative theoretic records, and 264,000..528,000 real ones! • We can conclude that the wizard is mainly an aide in learning relational database designe on more simple problems (comparing you solution with the wizard’s solution), but it cannot be used at designing real size applications. • Therefore, manual relational database design cannot be avioded by current standing of Artificial Intelligence, because its high requirement of „common sense”
Home Assignment 2-1: Analyzing Sample Table After creating the BPD of your own project, select an activity from that, which can be served by an Empirical Data Structure, EDS, wher sample data are available from some kind of legacy system (paper, internet, etc.), Eg. with fields FirstName, LastName, Gender, MothersMaidenName, PlaceOfBirth, DateOfBirth, Citizenship, etc. Get at least 40-50 records with very regular, usual data, and some exceptions also (eg. What is the gender of a transgender person? To which was born? To which was transformed?): Import the table into MS Access and run the Table Analyzer Wizard on it, Name the decomposed table with names startin by 3 character group code! Using PrintScreen save the ERD created in Access into an image and upload it to the FB group of the course! (5 points) Sample solution: HomeAssign2Solution.doc
Literature • MS Access On-line Help: http://office.microsoft.com/assistance/offhelp/offxp/access/