PostgreSQL Workshop: Working with Network Data

Presentation Transcript


  1. PostgreSQL Workshop: Working with Network Data PGCon 2014 Workshop James Hanson 21-May-14

  2. About the presenter …
   • James Hanson, jamesphanson@yahoo.com, http://www.linkedin.com/in/jamesphanson, http://jamesphanson.wikispaces.com/
   • Reformed Oracle RDBMS developer / administrator
   • I am interested in network data, and I see PostgreSQL as a powerful, underutilized tool for storing and analyzing network data

  3. About the presenter …
   • James Hanson, Parsons, Inc., james.hanson@parsons.com

  4. What are we doing in this workshop?
   • Using PostgreSQL to explore the question “What can I find out about the requester IP addresses in my Apache log?” … plus some side trips …
   • Parsing and loading an Apache access log and associating the requesting IPs with geolocated, known-malicious and other IP lists.
   NOTE: All workshop data is free and publicly available.

  5. What will we get from the workshop?
   • Hands-on experience using PG to load, relate and analyze moderate-sized network data sets
   • Simple and useful techniques for loading network data
   • Simple and useful techniques for exploring network data
   • Experience exploring some public data sets
   • Get jazzed to use the techniques and code samples on your network data
   • Meet other folks with complementary interests and expertise

  6. What’s not in this workshop?
   • GUI displays or application integration. We focus on the data
   • PG optimization, tuning or extensions other than PostGIS
   • RDBMS normalizing, surrogate keys or orthodoxy*
   • Automated workflow
   • Handling of exceptions, errors and boundary conditions
   * We’ll revisit this later

  7. Workshop setup and logistics
   • Using EnterpriseDB PostgreSQL 9.3 + PostGIS with no additional extensions
   • Hosted on a CentOS 6.5 x64 virtual machine with 3 GB RAM and 2 CPU cores
   • Using three workshop downloads …
   • The data and DDL queries are zipped and available on Google Docs
   • The PowerPoint is on SlideShare
   • The SQL and Python files are on GitHub
   • Everything is linked off the PGCon14 site

  8. Workshop setup errata
   • PGAdmin Query Tool: show NULL values as <NULL>
   • CentOS yum packages: policycoreutils-gui, setroubleshoot, lzop, samba, dos2unix, system-config-lvm.noarch
   • PostgreSQL non-default parameters:

     Parameter              Workshop value    EnterpriseDB default
     max_connections        10                100
     effective_cache_size   262,144           16,384
     maintenance_work_mem   524,288           16,384
     max_stack_depth        2,048             100
     shared_buffers         98,304            1,024
     temp_buffers           131,072           1,024
     work_mem               1,048,576         1,024
     fsync                  off               on
     synchronous_commit     off               on
     encoding               UTF-8
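   A quick way to confirm these values on the running instance, as a sketch; pg_settings is the standard catalog view:

   SELECT name, setting, unit
   FROM pg_settings
   WHERE name IN ('work_mem', 'shared_buffers', 'fsync', 'synchronous_commit');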

  9. PostgreSQL network data types
   • An inetrange type is a natural extension that complements subnets:

     CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );
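   A minimal sketch of the type in use, with illustrative addresses (the three-argument constructor function is created automatically along with the range type):

   SELECT inetrange('192.168.100.0'::inet, '192.168.100.127'::inet, '[]')
          @> '192.168.100.24'::inet AS in_range;  -- true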

  10. PostgreSQL network operators
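   A few of the built-in inet/cidr operators and functions, as a sketch with illustrative values:

   SELECT '10.1.2.3'::inet << '10.0.0.0/8'::cidr AS is_contained,        -- true
          '10.0.0.0/8'::cidr >>= '10.0.0.0/8'::cidr AS contains_or_eq,   -- true
          host('192.168.100.24/25'::inet) AS host_part,                  -- 192.168.100.24
          broadcast('192.168.100.0/25'::inet) AS bcast;                  -- 192.168.100.127/25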

  11. PostgreSQL ranges and arrays
   • Ranges
   • 6 built-in types; easy to create custom types
   • Can be inclusive, [, or exclusive, (. We’ll use inclusive lower and upper bounds, […]
   • Use the “contains element”, @>, and “element is contained by”, <@, operators
   • Arrays … just the tip of the iceberg
   • Use for parsing raw data and alternative storage models
   • Search arrays with the ANY operator (see the sketch below)
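   A sketch of both features, with illustrative values:

   SELECT int4range(1, 10, '[]') @> 5 AS range_contains,                  -- true
          5 <@ int4range(1, 10, '[]') AS element_contained,               -- true
          '10.0.0.5'::inet = ANY (ARRAY['10.0.0.5', '10.0.0.9']::inet[])
              AS in_array;                                                -- true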

  12. Q: Why are we doing this with PostgreSQL? I was told that “the cloud” was the only place to work with network data!

  13. A1: Because RDBMSs have some advantages
   • Cloud technologies have a difficult time with semantically similar but syntactically different data such as:
   • Subnets: 192.168.100.24/25 = 192.168.100.24 255.255.255.128
   • Date / time: 21-May-14 = 05/21/14 = 1400638460
   • RDBMSs are (still) much faster with GUI responses on small / moderate data sets
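   For example, inet parses the CIDR form directly, so syntactically different inputs reduce to comparable values (a sketch; dotted-mask notation would need converting before loading, since PostgreSQL does not parse it):

   SELECT network('192.168.100.24/25'::inet) AS net,     -- 192.168.100.0/25
          masklen('192.168.100.24/25'::inet) AS prefix;  -- 25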

  14. A2: Because PostgreSQL has some useful and unique capabilities
   • PostgreSQL network data types and operators enable:
   • Much faster Layer-3-type operations … because you can think like a router
   • Data validation (NOTE: custom solutions are nearly impossible for IPv6; see the sketch below)
   • PG arrays can simplify SQL syntax and ERDs
   • PG regexp_split_to_array simplifies ETL … but COPY still lacks important functionality
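   A sketch of both points: the same containment operators work for IPv6, and the cast itself is the validation (addresses are illustrative):

   SELECT '2001:db8::1'::inet <<= '2001:db8::/32'::cidr AS in_subnet;  -- true
   SELECT '2001:db8::zz'::inet;  -- fails: invalid input syntax for type inet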

  15. Our workshop approach
   • Our approach: What useful and tough problems can I solve with PostgreSQL’s unique capabilities and low cost?
   • NOT our approach: Given that you are using a relational database management system, these are the reasons why PostgreSQL is the best brand to choose.

  16. Workshop logistics
   • Preamble / instruction
   • Brief overview of our datasets
   • Transform and load the data together
   -- Intermezzo --
   • Query the data together
   • Questions, follow-up & next steps
   NOTE: Code snippets are at the end of the PPT. All files are linked off of the PGCon14 site.

  17. Our buffet of data … All free and publicly available …
   • OurAirports Airport GeoLoc: http://ourairports.com
   • CSC InfoChimps US IPGeo: http://www.infochimps.com/datasets
   • Dragon Research Group malicious IPs: http://www.dragonresearchgroup.org/
   • National Vulnerability Database IAVAs: http://nvd.nist.gov/home.cfm
   • GeoLite IPv6 Geo: http://geolite.maxmind.com/

  18. … Our buffet of data
   • MyIP malicious IPs: http://myip.ms
   • Cowgar Apache log: http://notebook.cowgar.com
   • The Internet Traffic Archive tracert: ftp://ita.ee.lbl.gov/
   • Cooperative Association for Internet Data Analysis (CAIDA) DNS: http://www.caida.org
   • NIL ARP table: http://wiki.nil.com/

  19. Dive into airport data … ../020_Airports/
   • 4 CSV files: airports, runways, countries and regions
   • AirportsDDL.sql

  20. INTERMEZZO for 10 minutes

  21. Let’s query our data … ../Queries/
   • Where else have the Apache_log requesting_ip addresses appeared?
   • USIPGeo_Queries.sql

  22. Revisit normalization, questions and wrap up

  23. *Normalizing and surrogate keys
   NOTE: This is a revisited topic from the workshop preamble.
   • Normalization is very useful for handling ACID transactions and transaction-entered (vs. bulk-loaded) data.
   • Surrogate keys are loved by Hibernate, may help in some situations and are part of many ORM methodologies. But that does not generally apply to immutable network data intended for analysis … like we have.

  24. *Normalizing and surrogate keys (2)
   • And normalization has a steep cost in …
   • Increased complexity to load and query data
   • Hard-coded assumptions in a complex ERD
   • Reduced performance **
   • Intellectual distance from the data
   • The reasons to normalize and use surrogate keys don’t apply here but the costs do, so I don’t do it. (Feel free to explore this topic off-line with me.)
   ** This is not true in all cases. I refer to index-based joins and searches.

  25. Q: Why did you use arrays and custom data types? Doesn’t that violate RDBMS principles, plus lock you in to custom PostgreSQL syntax?
   A: Yes. Here’s why I suggest you consider that anyway:
   • Drs. Boyce and Codd did not include arrays in their RDBMS theoretical work, and arrays are not in the ANSI SQL standard … but that work is focused on ACID transactions, which do not apply here.
   • PostgreSQL network types are unique … so you are already committed to custom code. Why not take full advantage of PG’s capabilities?

  26. Wrap up and follow-on …
   • IP and IP/subnet data types enable you to think of network data as a Layer-3 device does
   • This will always be faster and simpler than treating IPs as strings or numbers
   • Ex. “Which of these IPs is contained in that large set of IP subnets and ranges?”
   • A 1-line WHERE clause in PostgreSQL (see the sketch below)
   • … still working on it in other environments
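   A sketch of that one-line WHERE clause, using the apache_log table built in the snippets at the end (the subnet list is illustrative):

   SELECT requesting_ip
   FROM apache_log
   WHERE requesting_ip << ANY (ARRAY['10.0.0.0/8', '192.0.2.0/24']::inet[]);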

  27. Wrap up and follow-on …
   • An inetrange type is a natural extension that complements subnets:

     CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );

   • If you are using PG network data types with read-only data for analysis … consider using custom data types and arrays
   • You have already committed to custom code, so you may as well reap the (substantial) benefits.

  28. XML and XPath notes
   • The idealized “ETL” process for XML (and JSON): don’t Extract or Transform, just Load
   • PostgreSQL has good, but not great, XML support
   • Whatever you do, don’t write custom code to shred XML into an RDBMS and then rebuild XML for delivery to applications. Please.
   • Very good XPath resources:
     http://zvon.org/comp/r/tut-XPath_1.html#Pages~List_of_XPaths
     http://stackoverflow.com/questions/17799790/using-xpath-to-extract-data-from-an-xml-column-in-postgres
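   A minimal sketch of PostgreSQL’s built-in xpath() function on a toy document:

   SELECT xpath('/log/entry/ip/text()',
                '<log><entry><ip>10.1.2.3</ip></entry></log>'::xml);  -- {10.1.2.3}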

  29. PostgreSQL: Working with Network Data
   • James Hanson, jamesphanson@yahoo.com, http://www.linkedin.com/in/jamesphanson, http://jamesphanson.wikispaces.com/
   • Please provide feedback. This presentation is new and I’d like to make it better.
   (… and please let me know if you see / have any cool data sets too!)

  30. James Hanson / jamesphanson@yahoo.com
   Questions and Comments …

  31. Code snippets and techniques from AirportsDDL.sql

   -- Countries
   CREATE TABLE countries (
       id             BIGINT PRIMARY KEY,
       code           CHAR(10) NOT NULL UNIQUE,
       name           VARCHAR(100),
       continent      CHAR(4),
       wikipedia_link VARCHAR(500),
       keywords       VARCHAR(300));

   COMMENT ON TABLE countries IS 'Data from http://ourairports.com/data/';

   COPY countries
   FROM '/Pgcon14/Scripts/020_Airports/countries.csv'
   WITH DELIMITER ',' CSV HEADER;

  32. Code snippets and techniques from AirportsDDL.sql

   -- Update the GEOMETRY column from the long / lat
   UPDATE airports
   SET location = ST_GeomFromText('POINT(' || longitude_deg || ' ' ||
                                  latitude_deg || ')', 4326);
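   The UPDATE assumes airports already has a geometry column; a sketch of adding one with the standard PostGIS helper (SRID 4326 = WGS 84 long/lat):

   SELECT AddGeometryColumn('airports', 'location', 4326, 'POINT', 2);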

  33. Code snippets and techniques from USIPGeoDDL.sql

   -- COPY into temporary table for processing
   -- Create temporary table
   CREATE TABLE us_ipgeo_temp(
       id   SERIAL,
       col1 TEXT);

   -- Load temporary table
   COPY us_ipgeo_temp(col1)
   FROM '/Pgcon14/Scripts/030_USIPGeo/infochimps_dataset_13358_download_16680/ip_blocks_us_geo.tsv'
   WITH DELIMITER '|' -- use a character that is not in the dataset
   CSV;

  34. Code snippets and techniques from USIPGeoDDL.sql

   -- Use CTE (Common Table Expression) syntax and
   -- regexp_split_to_array to look at the data in our temporary table.
   WITH us_ipgeo_array AS (
       SELECT regexp_split_to_array(col1, '\t') AS col1
       FROM us_ipgeo_temp
       -- LIMIT 10
   )
   SELECT col1[1]  AS col1_1,  col1[2]  AS col1_2,  col1[3]  AS col1_3,
          col1[4]  AS col1_4,  col1[5]  AS col1_5,  col1[6]  AS col1_6,
          col1[7]  AS col1_7,  col1[8]  AS col1_8,  col1[9]  AS col1_9,
          col1[10] AS col1_10, col1[11] AS col1_11, col1[12] AS col1_12,
          col1[13] AS col1_13, col1[14] AS col1_14
   FROM us_ipgeo_array;

  35. Code snippets and techniques from USIPGeoDDL.sql

   -- Manipulate and CAST the fields as needed
   WITH us_ipgeo_array AS (
       SELECT regexp_split_to_array(col1, '\t') AS col1
       FROM us_ipgeo_temp
       LIMIT 10000 -- check more records too
   )
   SELECT
       -- col1[1] AS col1_1, -- col1[2] AS col1_2,
       -- col1[3] AS col1_3, -- col1[4] AS col1_4,
       col1[5]::inet AS ip,
       REPLACE(col1[6], '"', '')::CHAR(2) AS country,
       REPLACE(col1[7], '"', '')::CHAR(2) AS state,
       REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
       REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
       col1[10]::real AS latitude,
       col1[11]::real AS longitude,
       ST_GeomFromText('POINT(' || col1[11]::real || ' ' ||
                       col1[10]::real || ')', 4326) AS location,
       NULLIF(col1[12], '')::int AS metro_code, -- a few values were empty strings: NOT NULL, yet they do not CAST to INT
       col1[13]::text AS area_code
       -- col1[14] AS col1_14
   FROM us_ipgeo_array
   ORDER BY metro_code NULLS LAST;

  36. Code snippets and techniques from USIPGeoDDL.sql

   -- Create and populate us_ipgeo from us_ipgeo_temp
   -- Try it once with a LIMIT statement
   SELECT * INTO us_ipgeo FROM (
       WITH us_ipgeo_array AS (
           SELECT regexp_split_to_array(col1, '\t') AS col1
           FROM us_ipgeo_temp
           -- LIMIT 10000 -- check more records too
       )
       SELECT
           -- col1[1] AS col1_1, -- col1[2] AS col1_2,
           -- col1[3] AS col1_3, -- col1[4] AS col1_4,
           col1[5]::inet AS ip,
           REPLACE(col1[6], '"', '')::CHAR(2) AS country,
           REPLACE(col1[7], '"', '')::CHAR(2) AS state,
           REPLACE(col1[8], '"', '')::VARCHAR(80) AS city,
           REPLACE(col1[9], '"', '')::CHAR(5) AS zip_code,
           col1[10]::real AS latitude,
           col1[11]::real AS longitude,
           ST_GeomFromText('POINT(' || col1[11]::real || ' ' ||
                           col1[10]::real || ')', 4326) AS location,
           NULLIF(col1[12], '')::int AS metro_code, -- a few values were empty strings: NOT NULL, yet they do not CAST to INT
           col1[13]::text AS area_code
           -- col1[14] AS col1_14
       FROM us_ipgeo_array
   ) AS tmp_table;

  37. Code snippets and techniques from ApacheAccessLogs_DDL.sql

   -- COPY into temporary table for processing
   -- Create temporary table
   CREATE TABLE apache_log_temp (
       id   SERIAL,
       col1 TEXT);

   -- Set the client_encoding for the same reason as before
   SET client_encoding TO 'latin1';

   -- Load temporary table
   COPY apache_log_temp(col1)
   FROM '/Pgcon14/Scripts/080_ApacheLogs/access.log'
   WITH DELIMITER '^' -- use a character that is not in the dataset
   -- CSV
   ;

  38. Code snippets and techniques from ApacheAccessLogs_DDL.sql

   -- Try it once with a LIMIT statement
   SELECT * INTO apache_log FROM (
       WITH apache_log_array AS (
           SELECT regexp_split_to_array(col1, ' ') AS col1
           FROM apache_log_temp
           -- LIMIT 1000
       )
       SELECT
           col1[1]::inet AS requesting_ip,
           to_timestamp(RIGHT(col1[4], -1) || LEFT(col1[5], -6),
                        'DD/Mon/YYYY:HH24:MI:SS')
               AT TIME ZONE INTERVAL '-04:00' AS time,
           RIGHT(col1[6], -1)::VARCHAR(10) AS action,
           col1[7]::VARCHAR(100) AS resource,
           LEFT(col1[8], -1)::VARCHAR(10) AS protocol,
           NULLIF(col1[9], '-')::int AS status_to_client,
           NULLIF(col1[10], '-')::int AS object_size_bytes,
           REPLACE(NULLIF(col1[11], '"-"'), '"', '')::VARCHAR(100) AS referrer,
           REPLACE(COALESCE(col1[12], '') || COALESCE(col1[13], '') ||
                   COALESCE(col1[14], '') || COALESCE(col1[15], '') ||
                   COALESCE(col1[16], '') || COALESCE(col1[17], '') ||
                   COALESCE(col1[18], '') || COALESCE(col1[19], '') ||
                   COALESCE(col1[20], ''), '"', '')::VARCHAR(200) AS user_agent
       FROM apache_log_array
   ) AS tmp_table;
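   A sketch of a first sanity check on the loaded table: the busiest requesters.

   SELECT requesting_ip, count(*) AS hits
   FROM apache_log
   GROUP BY requesting_ip
   ORDER BY hits DESC
   LIMIT 10;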

  39. Code snippets and techniques from tracert_DDL.sql

   -- Let's play a bit with an array structure for the tracert table
   -- DROP TYPE tracert_trip;
   CREATE TYPE tracert_trip AS (
       hop     INTEGER,
       from_ip INET,
       to_ip   INET);

   SELECT source_file, dest_name, dest_ip,
          array_agg((hop, from_ip, to_ip)::tracert_trip ORDER BY hop) AS tracert_trip
   FROM tracert
   GROUP BY source_file, dest_name, dest_ip
   LIMIT 10000;
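   A sketch of going the other way: unnest() expands the aggregated hops back into rows, and (rec).field reaches into the composite type.

   SELECT dest_ip, (hop_rec).hop, (hop_rec).from_ip, (hop_rec).to_ip
   FROM (SELECT dest_ip,
                unnest(array_agg((hop, from_ip, to_ip)::tracert_trip
                                 ORDER BY hop)) AS hop_rec
         FROM tracert
         GROUP BY source_file, dest_name, dest_ip) AS t;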

  40. Code snippets and techniques from 03_MyIP_500TopHosts_Array_DDL.sql

   -- Want to use a range for the IPs
   -- Create custom type
   CREATE TYPE inetrange AS RANGE ( SUBTYPE = inet );

   -- Check out our data
   -- SELECT the inclusive, [], range for our inetrange
   SELECT website, ww_site_rank, hosting_country, hosting_company,
          owner_country, parent_hosting_company, parent_country,
          inetrange(owner_ip_from, owner_ip_to, '[]') AS owner_range, -- [] means lower and upper inclusive
          parent_ip_range
   FROM tophosts
   ORDER BY owner_range;
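   With the range built, a containment lookup is a single predicate; a sketch against the same tophosts columns (the probe address is illustrative):

   SELECT website, ww_site_rank
   FROM tophosts
   WHERE inetrange(owner_ip_from, owner_ip_to, '[]') @> '8.8.8.8'::inet;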
