Using SAS and Perl for Large Datasets March 21, 2007
The Strong Points of SAS • SAS is designed to handle large datasets. • SAS language is robust • PROC SORT • PROC SUMMARY • PROC SQL • PROC REG
The Strong Points of Perl • Free • Well-documented • CPAN – Comprehensive Perl Archive Network • Efficient at file handling.
The Netflix Data • Download as a tar.gz file. • ReadMe file. • movie_titles.txt… • RMSE Perl script… • Training Set…
Movie_Titles.txt
1,2003,Dinosaur Planet
2,2004,Isle of Man TT 2004 Review
3,1997,Character
4,1994,Paula Abdul's Get Up & Dance
5,2004,The Rise and Fall of ECW
6,1997,Sick
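The rows above have three comma-separated fields: movie number, year, and title. A minimal sketch of reading them in Perl follows. It assumes a title may itself contain commas, which is why split is given a limit of 3. The sub name parse_title_line is hypothetical, and the two sample rows are copied from the slide.

```perl
use strict;
use warnings;

# Split one movie_titles.txt line into (id, year, title).
# The limit of 3 keeps any commas inside the title intact (an assumption
# about the data; the rows shown on the slide contain none).
sub parse_title_line {
    my ($line) = @_;
    chomp $line;
    my ($id, $year, $title) = split /,/, $line, 3;
    return ($id, $year, $title);
}

# Sample rows taken from the slide.
my @sample = (
    "1,2003,Dinosaur Planet",
    "4,1994,Paula Abdul's Get Up & Dance",
);

for my $line (@sample) {
    my ($id, $year, $title) = parse_title_line($line);
    print "$id | $year | $title\n";
}
```

Without the limit, a title containing a comma would be cut off at that comma.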
Movies • Ziggy Stardust and the Spiders From Mars: The Motion Picture • Learning HTML: No Brainers • Godzilla vs. The Sea Monster • Frank Lloyd Wright • Rabbit-Proof Fence
RMSE • Root Mean Square Error • $S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$ where: • $n$ = number of observations • $X_i$ = ith observation out of n • $\bar{X}$ = mean of X (or in our case the predicted X) • RMSE = $\sqrt{S^2}$
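The definition above can be sketched in a few lines of Perl. This is not the RMSE script that ships with the Netflix download. It is an illustration that takes the predicted rating for each observation in place of the mean. The sub name rmse and the sample numbers are made up for the example.

```perl
use strict;
use warnings;

# Root mean square error between predicted and actual ratings.
# Both arguments are array references of equal length.
sub rmse {
    my ($predicted, $actual) = @_;
    my $n = @$actual;
    die "need equal-length, non-empty lists\n"
        if $n == 0 || $n != @$predicted;
    my $sum_sq = 0;
    for my $i (0 .. $n - 1) {
        my $err = $actual->[$i] - $predicted->[$i];
        $sum_sq += $err * $err;      # accumulate squared errors
    }
    return sqrt($sum_sq / $n);       # mean, then square root
}

# Made-up ratings and predictions, for illustration only.
my @actual    = (3, 5, 4, 4);
my @predicted = (3.5, 4.5, 4.0, 3.0);
printf "RMSE = %.4f\n", rmse(\@predicted, \@actual);
```

Here the squared errors are 0.25, 0.25, 0 and 1, so the mean is 0.375 and the RMSE is about 0.6124.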
The Training Set • 17770 files
The Training Set
1:
1488844,3,2005-09-06
822109,5,2005-05-13
885013,4,2005-10-19
30878,4,2005-12-26
823519,3,2004-05-03
893988,3,2005-11-17
124105,4,2004-08-05
1248029,3,2004-04-22

Movie:
Person,Rating,Date
Person,Rating,Date
Person,Rating,Date
. . .
Data Marts • Data marts are subsets of a larger data set. • Contents are determined by the problem at hand. • Contents may change over time or remain static.
Why Use Data Marts • Increases query performance. • Decreases storage costs. • Decreases risk. • Proving-ground for new code or equipment. • Allows for the optimization of effort toward problem solving instead of data management.
Perl for Data Mart Assembly
What we have…
1:
1488844,3,2005-09-06
What we want…
1,1488844,3,2005-09-06
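Perl2.pl
    # Print the name and contents of every .txt file in the current directory.
    while (my $file = <*.txt>) {
        print $file, "\n";
        open(my $in, '<', $file) or die "Cannot open $file: $!";
        while (<$in>) {
            print $_;
        }
        close $in;
    }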
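Perl3.pl
    # Copy every rating row into one file, prefixed with its source file name.
    # The name output.tx does not match *.txt, so the loop below skips it.
    open(my $out, '>', 'output.tx') or die "Cannot open output.tx: $!";
    while (my $file = <*.txt>) {
        print $file, "\n";
        open(my $in, '<', $file) or die "Cannot open $file: $!";
        while (<$in>) {
            next if /^[0-9]+:/;    # skip the "movie:" header line
            print $out "$file,$_";
        }
        close $in;
    }
    close $out;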
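Perl3.pl prefixes each row with the file name, while the "What we want" slide shows the bare movie number. One possible variant, sketched below, remembers the number from the header line and uses that as the prefix. It assumes every file begins with a header such as 1: on its own line. The sub name prefix_movie_id is hypothetical, and the demo runs on an in-memory copy of rows from the slide instead of the real files.

```perl
use strict;
use warnings;

# Copy rating rows from $in to $out, prefixing each with the movie ID
# taken from the most recent header line (for example "1:").
sub prefix_movie_id {
    my ($in, $out) = @_;
    my $movie_id;
    while (my $line = <$in>) {
        if ($line =~ /^(\d+):\s*$/) {
            $movie_id = $1;                  # header line: remember the ID
        }
        elsif (defined $movie_id && $line =~ /\S/) {
            print {$out} "$movie_id,$line";  # rating row: add the prefix
        }
    }
}

# Demo on an in-memory copy of the first rows shown on the slide.
my $sample = "1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n";
open(my $in, '<', \$sample) or die "cannot open sample: $!";
prefix_movie_id($in, \*STDOUT);
close $in;
```

Passing filehandles in, not file names, lets the same sub be called once per training file inside the glob loop, with a single output handle shared across calls.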
SAS Tips and Tricks • Getting started • OPTIONS OBS=0; • WHERE… • Avoid temporary data sets. • Make permanent SAS data sets. • Access data sets with SQL.