PostgreSQL Workshop: Working with Network Data. PGCon 2014 Workshop. James Hanson, 21-May-14
About the presenter … • James Hanson, jamesphanson@yahoo.com, http://www.linkedin.com/in/jamesphanson, http://jamesphanson.wikispaces.com/ • Reformed Oracle RDBMS developer / administrator • I am interested in network data, and I see PostgreSQL as a powerful, underutilized tool for storing and analyzing network data PGCon14 Working with Network Data Types
About the presenter … • James Hanson, Parsons, Inc., james.hanson@parsons.com
What are we doing in this workshop? • Using PostgreSQL to explore the question "What can I find out about the requester IP addresses in my Apache log?" … plus some side trips … • Parsing and loading an Apache access log and associating the requesting IPs with geolocated, known-malicious and other IP lists. NOTE: All workshop data is free and publicly available.
What will we get from the workshop? • Hands-on experience using PG to load, relate and analyze moderate-sized network data sets • Simple and useful techniques for loading network data • Simple and useful techniques for exploring network data • Experience exploring some public data sets • Get jazzed to use the techniques and code samples on your network data • Meet other folks with complementary interests and expertise
What's not in this workshop? • GUI displays or application integration. We focus on the data • PG optimization, tuning or extensions other than PostGIS • RDBMS normalizing, surrogate keys or orthodoxy* • Automated workflow • Handling of exceptions, errors and boundary conditions * We'll revisit this later
Workshop setup and logistics • Using EnterpriseDB PostgreSQL 9.3 + PostGIS with no additional extensions • Hosted on a CentOS 6.5 x64 virtual machine with 3 GB RAM and 2 CPU cores • Using three workshop downloads … • The data and DDL queries are zipped and available on Google Docs • The PowerPoint is on SlideShare • The SQL and Python files are on GitHub • Everything is linked off the PGCon14 site
Workshop setup errata • PGAdmin Query Tool: show NULL values as &lt;NULL&gt; • CentOS yum packages: policycoreutils-gui, setroubleshoot, lzop, samba, dos2unix, system-config-lvm.noarch • PostgreSQL non-default parameters (workshop value / EnterpriseDB default): max_connections 10 / 100; effective_cache_size 262,144 / 16,384; maintenance_work_mem 524,288 / 16,384; max_stack_depth 2,048 / 100; shared_buffers 98,304 / 1,024; temp_buffers 131,072 / 1,024; work_mem 1,048,576 / 1,024; fsync off / on; synchronous_commit off / on; encoding UTF-8
PostgreSQL network data types • An inetrange type is a natural extension that complements subnets: CREATE TYPE inetrange AS RANGE (SUBTYPE = inet);
PostgreSQL network operators
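The operator table from this slide survives only as an image, but the core inet/cidr containment and overlap operators have close analogues in Python's stdlib ipaddress module. A minimal sketch with hypothetical addresses (not from the workshop data):

```python
import ipaddress

# Hypothetical networks and host, chosen only to illustrate the operators.
net = ipaddress.ip_network("192.168.100.0/25")
sub = ipaddress.ip_network("192.168.100.64/26")
host = ipaddress.ip_address("192.168.100.24")

print(sub.subnet_of(net))    # like  sub << net   ("is contained by")  -> True
print(net.supernet_of(sub))  # like  net >> sub   ("contains")         -> True
print(host in net)           # host containment, like  net >> host     -> True
print(net.overlaps(sub))     # like  net && sub   ("overlaps")         -> True
```

The point of the slide stands either way: these are set operations on address space, not string comparisons.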
PostgreSQL ranges and arrays • Ranges • 6 built-in types. Easy to create custom types • Can be inclusive, [, or exclusive, (. We'll use inclusive lower and upper bounds, […] • Use the "contains element", @>, and "element is contained by", <@, operators • Arrays … just the tip of the iceberg • Use for parsing raw data and alternative storage models • Search arrays with the ANY operator
Q: Why are we doing this with PostgreSQL? I was told that "the cloud" was the only place to work with network data!
A1: Because RDBMSs have some advantages • Cloud technologies have a difficult time with semantically similar but syntactically different data such as: • Subnets: 192.168.100.24/25 = 192.168.100.24 255.255.255.128 • Date / time: 21-May-14 = 05/21/14 = 1400638460 • RDBMSs are (still) much faster with GUI responses on small / moderate data sets
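The subnet example on this slide can be checked mechanically: like PostgreSQL's inet type, Python's stdlib ipaddress module treats the prefix-length and dotted-netmask notations as the same subnet. A small sketch (strict=False masks off the host bits, which both notations carry here):

```python
import ipaddress

# Both spellings from the slide describe the same /25; strict=False lets
# ipaddress normalize the host address 192.168.100.24 to its network.
a = ipaddress.ip_network("192.168.100.24/25", strict=False)
b = ipaddress.ip_network("192.168.100.24/255.255.255.128", strict=False)
print(a)       # 192.168.100.0/25
print(a == b)  # True: semantically identical despite different syntax
```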
A2: Because PostgreSQL has some useful and unique capabilities • PostgreSQL network data types and operators enable: • Much faster Layer-3-type operations … because you can think like a router • Data validation (NOTE: custom solutions are nearly impossible for IPv6) • PG arrays can simplify SQL syntax and ERDs • PG regexp_split_to_array simplifies ETL … but COPY still lacks important functionality
Our workshop approach • Our approach: What useful and tough problems can I solve with PostgreSQL's unique capabilities and low cost? • NOT our approach: Given that you are using a relational database management system, these are the reasons why PostgreSQL is the best brand to choose.
Workshop logistics • Preamble / instruction • Brief overview of our datasets • Transform and load the data together -- Intermezzo -- • Query the data together • Questions, follow-up and next steps. NOTE: Code snippets are at the end of the PPT. All files are linked off the PGCon14 site.
Our buffet of data … All free and publicly available … • OurAirports airport GeoLoc: http://ourairports.com • CSC InfoChimps US IPGeo: http://www.infochimps.com/datasets • Dragon Research Group malicious IPs: http://www.dragonresearchgroup.org/ • National Vulnerability Database IAVAs: http://nvd.nist.gov/home.cfm • GeoLite IPv6 Geo: http://geolite.maxmind.com/
… Our buffet of data • MyIP malicious IPs: http://myip.ms • Cowgar Apache log: http://notebook.cowgar.com • The Internet Traffic Archive tracert: ftp://ita.ee.lbl.gov/ • Cooperative Association for Internet Data Analysis (CAIDA) DNS: http://www.caida.org • NIL ARP table: http://wiki.nil.com/
Dive into Airport data … ../020_Airports/ • 4 CSV files: airports, runways, countries and regions • AirportDDL.sql
INTERMEZZO for 10 minutes
Let's query our data … ../Queries/ • Where else have the Apache_log requesting_ip addresses appeared? • USIPGeo_Queries.sql
Revisit normalization, questions and wrap up
*Normalizing and surrogate keys NOTE: This is a revisited topic from the workshop preamble. • Normalization is very useful for handling ACID transactions and transaction-entered data (vs. bulk-loaded). • Surrogate keys are loved by Hibernate, may help in some situations and are part of many ORM methodologies. But that does not generally apply to immutable network data intended for analysis … like we have.
*Normalizing and surrogate keys (2) • And normalization has a steep cost in … • Increased complexity to load and query data • Hard-coded assumptions in a complex ERD • Reduced performance ** • Intellectual distance from the data • The reasons to normalize and use surrogate keys don't apply here, but the costs do, so I don't do it. (Feel free to explore this topic off-line with me.) ** This is not true in all cases. I refer to index-based joins and searches.
Q: Why did you use arrays and custom data types? Doesn't that violate RDBMS principles, plus lock you in to custom PostgreSQL syntax? A: Yes. Here's why I suggest you consider that: Drs. Boyce and Codd did not include arrays in their RDBMS theoretical work, and arrays are not in the ANSI SQL standard … but that work is focused on ACID transactions, which do not apply here. PostgreSQL network types are unique … so you are already committed to custom code. Why not take full advantage of PG's capabilities?
Wrap up and follow-on … • IP and IP/subnet data types enable you to think of network data as a Layer-3 device does • This will always be faster and simpler than treating IPs as strings or numbers. • Ex. "Which of these IPs is contained in that large set of IP subnets and ranges?" • 1-line WHERE clause in PostgreSQL • … still working on it in other environments
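In PostgreSQL the containment question above is a one-line WHERE clause, e.g. `WHERE requesting_ip << ANY (ARRAY[...]::inet[])`. For comparison, a rough procedural equivalent using Python's ipaddress module, with hypothetical IPs and subnets (not from the workshop data):

```python
import ipaddress

# Hypothetical subnets to match against and requester IPs to test.
subnets = [ipaddress.ip_network(s) for s in ("10.0.0.0/8", "192.168.100.0/25")]
ips = [ipaddress.ip_address(s) for s in ("10.1.2.3", "8.8.8.8", "192.168.100.24")]

# In PostgreSQL this whole loop is one WHERE clause:
#   SELECT requesting_ip FROM apache_log
#   WHERE requesting_ip << ANY (ARRAY['10.0.0.0/8','192.168.100.0/25']::inet[]);
matched = [ip for ip in ips if any(ip in n for n in subnets)]
print(matched)  # [IPv4Address('10.1.2.3'), IPv4Address('192.168.100.24')]
```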
Wrap up and follow-on … • An inetrange type is a natural extension that complements subnets: CREATE TYPE inetrange AS RANGE (SUBTYPE = inet); • If you are using PG network data types with read-only data for analysis … consider using custom data types and arrays • You have already committed to custom code, so you may as well reap the (substantial) benefits.
XML and XPath notes • The idealized ETL process for XML (and JSON) is ETL … don't Extract or Transform, just Load • PostgreSQL has good, but not great, XML support • Whatever you do, don't write custom code to shred XML into an RDBMS and then rebuild XML for delivery to applications. Please. • Very good XPath resources: http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths and http://stackoverflow.com/questions/17799790/using-xpath-to-extract-data-from-an-xml-column-in-postgres
PostgreSQL: Working with Network Data • James Hanson, jamesphanson@yahoo.com, http://www.linkedin.com/in/jamesphanson, http://jamesphanson.wikispaces.com/ • Please provide feedback. This presentation is new and I'd like to make it better. (… and please let me know if you see / have any cool data sets too!)
James Hanson / jamesphanson@yahoo.com Questions and Comments …
Code snippets and techniques from AirportsDDL.sql

-- Countries
CREATE TABLE countries (
    id             BIGINT PRIMARY KEY,
    code           CHAR(10) NOT NULL UNIQUE,
    name           VARCHAR(100),
    continent      CHAR(4),
    wikipedia_link VARCHAR(500),
    keywords       VARCHAR(300));

COMMENT ON TABLE countries IS 'Data from http://ourairports.com/data/';

COPY countries
FROM '/Pgcon14/Scripts/020_Airports/countries.csv'
WITH DELIMITER ',' CSV HEADER;
Code snippets and techniques from AirportsDDL.sql

-- Update the GEOMETRY column from the long / lat
UPDATE airports
SET location = ST_GeomFromText('POINT(' || longitude_deg || ' ' || latitude_deg || ')', 4326);
Code snippets and techniques from USIPGeoDDL.sql

-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE us_ipgeo_temp(
    id   SERIAL,
    col1 TEXT);

-- Load temporary table
COPY us_ipgeo_temp(col1)
FROM '/Pgcon14/Scripts/030_USIPGeo/infochimps_dataset_13358_download_16680/ip_blocks_us_geo.tsv'
WITH DELIMITER '|' -- use a character that is not in the dataset
CSV;
Code snippets and techniques from USIPGeoDDL.sql

-- Use CTE (Common Table Expression) syntax and
-- regexp_split_to_array to look at the data in our temporary table.
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    -- LIMIT 10
)
SELECT col1[1] AS col1_1, col1[2] AS col1_2, col1[3] AS col1_3,
       col1[4] AS col1_4, col1[5] AS col1_5, col1[6] AS col1_6,
       col1[7] AS col1_7, col1[8] AS col1_8, col1[9] AS col1_9,
       col1[10] AS col1_10, col1[11] AS col1_11, col1[12] AS col1_12,
       col1[13] AS col1_13, col1[14] AS col1_14
FROM us_ipgeo_array;
Code snippets and techniques from USIPGeoDDL.sql

-- Manipulate and CAST the fields as needed
WITH us_ipgeo_array AS (
    SELECT regexp_split_to_array(col1, '\t') AS col1
    FROM us_ipgeo_temp
    LIMIT 10000 -- check more records too
)
SELECT
    -- col1[1] AS col1_1,
    -- col1[2] AS col1_2,
    -- col1[3] AS col1_3,
    -- col1[4] AS col1_4,
    col1[5]::inet AS ip,
    REPLACE(col1[6], '"', '')::CHAR(2) AS country,
    REPLACE(col1[7], '"', '')::CHAR(2) AS state,
    REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
    REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
    col1[10]::real AS latitude,
    col1[11]::real AS longitude,
    ST_GeomFromText('POINT(' || col1[11]::real || ' ' || col1[10]::real || ')', 4326) AS location,
    NULLIF(col1[12], '')::int AS metro_code, -- a few values are empty strings: NOT NULL, but they do not CAST to INT
    col1[13]::text AS area_code
    -- col1[14] AS col1_14
FROM us_ipgeo_array
ORDER BY metro_code NULLS LAST;
Code snippets and techniques from USIPGeoDDL.sql

-- Create and populate us_ipgeo from us_ipgeo_temp
-- Try it once with a LIMIT statement
SELECT * INTO us_ipgeo FROM (
    WITH us_ipgeo_array AS (
        SELECT regexp_split_to_array(col1, '\t') AS col1
        FROM us_ipgeo_temp
        -- LIMIT 10000 -- check more records too
    )
    SELECT
        -- col1[1] AS col1_1,
        -- col1[2] AS col1_2,
        -- col1[3] AS col1_3,
        -- col1[4] AS col1_4,
        col1[5]::inet AS ip,
        REPLACE(col1[6], '"', '')::CHAR(2) AS country,
        REPLACE(col1[7], '"', '')::CHAR(2) AS state,
        REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
        REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
        col1[10]::real AS latitude,
        col1[11]::real AS longitude,
        ST_GeomFromText('POINT(' || col1[11]::real || ' ' || col1[10]::real || ')', 4326) AS location,
        NULLIF(col1[12], '')::int AS metro_code, -- a few values are empty strings: NOT NULL, but they do not CAST to INT
        col1[13]::text AS area_code
        -- col1[14] AS col1_14
    FROM us_ipgeo_array
) AS tmp_table;
Code snippets and techniques from ApacheAccessLogs_DDL.sql

-- COPY into temporary table for processing
-- Create temporary table
CREATE TABLE apache_log_temp (
    id   SERIAL,
    col1 TEXT);

-- Set the client_encoding for the same reason as before
SET client_encoding TO 'latin1';

-- Load temporary table
COPY apache_log_temp(col1)
FROM '/Pgcon14/Scripts/080_ApacheLogs/access.log'
WITH DELIMITER '^' -- use a character that is not in the dataset
-- CSV
;
Code snippets and techniques from ApacheAccessLogs_DDL.sql

-- Try it once with a LIMIT statement
SELECT * INTO apache_log FROM (
    WITH apache_log_array AS (
        SELECT regexp_split_to_array(col1, ' ') AS col1
        FROM apache_log_temp
        -- LIMIT 1000
    )
    SELECT
        col1[1]::inet AS requesting_ip,
        to_timestamp(RIGHT(col1[4], -1) || LEFT(col1[5], -6),
                     'DD/Mon/YYYY:HH24:MI:SS') AT TIME ZONE INTERVAL '-04:00' AS time,
        RIGHT(col1[6], -1)::VARCHAR(10) AS action,
        col1[7]::VARCHAR(100) AS resource,
        LEFT(col1[8], -1)::VARCHAR(10) AS protocol,
        NULLIF(col1[9], '-')::int AS status_to_client,
        NULLIF(col1[10], '-')::int AS object_size_bytes,
        REPLACE(NULLIF(col1[11], '"-"'), '"', '')::VARCHAR(100) AS referrer,
        REPLACE(COALESCE(col1[12], '') || COALESCE(col1[13], '') || COALESCE(col1[14], '') ||
                COALESCE(col1[15], '') || COALESCE(col1[16], '') || COALESCE(col1[17], '') ||
                COALESCE(col1[18], '') || COALESCE(col1[19], '') || COALESCE(col1[20], ''),
                '"', '')::VARCHAR(200) AS user_agent
    FROM apache_log_array
) AS tmp_table;
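To see what the space-split SQL above is doing field by field, here is the same parse sketched in Python on one hypothetical combined-format log line (the line and field values are illustrative, not from the workshop's access.log; timezone handling, which the SQL does with AT TIME ZONE, is omitted for brevity):

```python
from datetime import datetime

# One hypothetical Apache combined-log line.
line = ('66.249.65.107 - - [21/May/2014:10:05:03 -0400] '
        '"GET /index.html HTTP/1.1" 200 4481 "-" "Mozilla/5.0"')
f = line.split(' ')  # same split as regexp_split_to_array(col1, ' ')

record = {
    'requesting_ip': f[0],                      # col1[1] in the SQL (1-based)
    'time': datetime.strptime(f[3][1:], '%d/%b/%Y:%H:%M:%S'),  # strip leading '['
    'action': f[5][1:],                         # strip leading quote, like RIGHT(col1[6], -1)
    'resource': f[6],
    'protocol': f[7][:-1],                      # strip trailing quote, like LEFT(col1[8], -1)
    'status_to_client': int(f[8]),
    'object_size_bytes': int(f[9]),
    'referrer': None if f[10] == '"-"' else f[10].strip('"'),   # like NULLIF + REPLACE
    'user_agent': ' '.join(f[11:]).strip('"'),  # like the COALESCE(...) concatenation
}
print(record['requesting_ip'], record['action'], record['status_to_client'])
```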
Code snippets and techniques from tracert_DDL.sql

-- Let's play a bit with an array structure for the tracert table
-- DROP TYPE tracert_trip;
CREATE TYPE tracert_trip AS (
    hop     INTEGER,
    from_ip INET,
    to_ip   INET);

SELECT source_file, dest_name, dest_ip,
       array_agg((hop, from_ip, to_ip)::tracert_trip ORDER BY hop) AS tracert_trip
FROM tracert
GROUP BY source_file, dest_name, dest_ip
LIMIT 10000;
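The GROUP BY plus array_agg(... ORDER BY hop) pattern above collapses one row per hop into one row per traceroute. A sketch of the same aggregation in Python, on hypothetical flattened rows (itertools.groupby requires the sort, mirroring GROUP BY and ORDER BY):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flattened rows: (source_file, dest_ip, hop, from_ip, to_ip).
rows = [
    ('t1.txt', '10.0.0.9', 2, '10.0.0.1', '10.0.0.5'),
    ('t1.txt', '10.0.0.9', 1, '10.0.0.2', '10.0.0.1'),
    ('t1.txt', '10.0.0.7', 1, '10.0.0.2', '10.0.0.3'),
]

rows.sort(key=itemgetter(0, 1, 2))  # GROUP BY (file, dest), then ORDER BY hop
trips = {key: [(hop, f, t) for _, _, hop, f, t in grp]
         for key, grp in groupby(rows, key=itemgetter(0, 1))}
print(trips[('t1.txt', '10.0.0.9')])  # [(1, '10.0.0.2', '10.0.0.1'), (2, '10.0.0.1', '10.0.0.5')]
```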
Code snippets and techniques from 03_MyIP_500TopHosts_Array_DDL.sql

-- Want to use a range for the IPs
-- Create custom type
CREATE TYPE inetrange AS RANGE (SUBTYPE = inet);

-- Check out our data
-- SELECT an inclusive, [], range for our inetrange
SELECT website, ww_site_rank, hosting_country, hosting_company, owner_country,
       parent_hosting_company, parent_country,
       inetrange(owner_ip_from, owner_ip_to, '[]') AS owner_range, -- [] means lower and upper bounds inclusive
       parent_ip_range
FROM tophosts
ORDER BY owner_range;
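The inclusive [from, to] semantics of the inetrange above can be mimicked outside the database because IP addresses order numerically: Python's ipaddress objects are comparable, so a chained comparison stands in for the range "contains element" check. A sketch with hypothetical addresses:

```python
import ipaddress

# Hypothetical owner_ip_from / owner_ip_to bounds and an IP to test;
# lo <= ip <= hi mirrors  inetrange(owner_ip_from, owner_ip_to, '[]') @> ip.
lo = ipaddress.ip_address('203.0.113.0')
hi = ipaddress.ip_address('203.0.113.255')
ip = ipaddress.ip_address('203.0.113.24')
print(lo <= ip <= hi)  # True: inside the inclusive range
```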