CSA405: Unix Trix for Empirical CL
How to use Unix as a toolbox for NLP applications
Acknowledgements
The contents of this lecture are inspired by:
• Gerald Gazdar, University of Sussex
• Ken Church, AT&T
Thanks
Unix Tools
• grep: search for a pattern
• sort: sort a file
• uniq: eliminate duplicates
• tr: translate characters
• wc: count words
• sed: edit strings
• awk: pattern-based programming language
• cut: cut out selected fields of each line of a file
• paste: merge corresponding or subsequent lines of files
• comm: select or reject lines common to two files
• join: relational database operator
• man command: further details of each of these
Text
I was intrigued by the article "Cloning a human being `a long way off`" (December 3). I attended the well-presented lecture by Dr Bruce Campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained. This involves the transfer of the nucleus from an adult human mature cell, such as skin, hair or mucosa, into the denucleated human ovum of a female of the species, which is then allowed to start developing for a few days to the stage where the placental precursor cells separate from the cells destined to become the foetus.
Punctuation 1
• sed -f markpunct.sed
• File contents:
    s/"/ xzzdoublequotezzx /g
    s/'/ xzzquotezzx /g
    s/`/ xzzquotezzx /g
    s/(/ xzzleftparenzzx /g
• Output:
    I was intrigued by the article xzzdoublequotezzx Cloning a human being xzzquotezzx
Punctuation 2
• sed -f angle.sed
• File contents:
    s/xzz/</g
    s/zzx/>/g
• Output:
    I was intrigued by the article <doublequote> Cloning a human being <quote>
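The two punctuation steps can be tried end to end with the sketch below. It recreates only the four rules shown on the slide (the full markpunct.sed presumably has more), writes both scripts to a scratch directory, and runs them on a made-up sample line rather than the lecture text. The variable names `workdir`, `marked` and `tagged` are illustrative.

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
workdir=$(mktemp -d)

# markpunct.sed: the four rules shown on the slide.
cat > "$workdir/markpunct.sed" <<'EOF'
s/"/ xzzdoublequotezzx /g
s/'/ xzzquotezzx /g
s/`/ xzzquotezzx /g
s/(/ xzzleftparenzzx /g
EOF

# angle.sed: turn the temporary markers into angle-bracket tags.
cat > "$workdir/angle.sed" <<'EOF'
s/xzz/</g
s/zzx/>/g
EOF

# Made-up sample input, not the lecture text.
marked=$(printf '%s\n' 'the "big" (old) cat' | sed -f "$workdir/markpunct.sed")
tagged=$(printf '%s\n' "$marked" | sed -f "$workdir/angle.sed")

printf '%s\n' "$marked"
printf '%s\n' "$tagged"
rm -r "$workdir"
```

Note that the intermediate markers consist of letters only, so they would survive a later step that throws away every non-alphabetic character; this may be why the marking is done in two passes, though the slides do not say so.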
Case
• tr 'A-Z' 'a-z'
• Output:
    i was intrigued by the article "cloning a human being `a long way off`" (december 3). i attended the well-presented lecture by dr bruce campbell, wherein the cutting edge of the new cloning technology for the harvesting of human stem cells was explained.
Tokenisation
• tr -sc 'a-zA-Z' '\012'
• Output (one token per line):
    I
    was
    intrigued
    by
    the
    article
    Cloning
    a
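A small sketch of what the two flags do, run on a made-up fragment: `-c` complements the first set, so every character that is not a letter is translated, and `-s` squeezes each run of translated characters into a single newline. The variable name `tokens` is illustrative.

```shell
# -c: complement the set, i.e. match every character that is NOT a letter.
# -s: squeeze each run of resulting newlines into a single newline.
# '\012' is the octal code for the newline character.
tokens=$(printf '%s\n' 'Well-presented lecture (December 3).' |
    tr -sc 'a-zA-Z' '\012')

printf '%s\n' "$tokens"
# Prints four lines: Well, presented, lecture, December
```

Two side effects are visible here: the hyphenated word is split into two tokens, and the digit 3 disappears altogether, because neither hyphens nor digits are in the set 'a-zA-Z'.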
Sorting
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort
• Output (one token per line):
    a
    a
    a
    a
    adult
    allowed
    an
    article
    as
    attended
    become
    being
    bruce
    by
    by
Making a Wordlist
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq
• Output (one word per line):
    a
    adult
    allowed
    an
    article
    as
    attended
    become
    being
    bruce
    by
    campbell
    cell
    cells
    cloning
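Once two texts have been reduced to wordlists, comm from the tools list can compare them. The sketch below is one possible use, not something shown in the lecture: `wordlist` is a hypothetical helper wrapping the pipeline above, the two sample sentences are made up, and LC_ALL=C is set so that sort and comm agree on the same ordering.

```shell
# wordlist: hypothetical helper that wraps the wordlist pipeline.
# LC_ALL=C makes sort use plain byte order, which comm also expects below.
wordlist() {
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | LC_ALL=C sort | uniq
}

workdir=$(mktemp -d)
printf '%s\n' 'The cell divides. The cells separate.' | wordlist > "$workdir/a.words"
printf '%s\n' 'A human cell, the ovum.' | wordlist > "$workdir/b.words"

# comm -12 suppresses columns 1 and 2, leaving only lines present in both files.
shared=$(LC_ALL=C comm -12 "$workdir/a.words" "$workdir/b.words")

printf '%s\n' "$shared"
rm -r "$workdir"
```

With these inputs the shared vocabulary is cell and the. Using `comm -23` instead would list the words that occur only in the first text.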
Counting
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c
• Output:
    4 a
    1 adult
    1 allowed
    1 an
    1 article
    1 as
    1 attended
    1 become
    1 being
    1 bruce
    2 by
    1 campbell
    1 cell
    3 cells
    2 cloning
Sorted Frequency List
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r
• Output:
    13 the
     5 of
     4 human
     4 a
     3 to
     3 cells
     2 was
     2 i
     2 from
     2 for
     2 cloning
     2 by
     1 which
     1 wherein
Sorted Frequency List
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | sort | uniq -c | sort -r | cat -n
• Output (rank, frequency, word):
     1  13 the
     2   5 of
     3   4 human
     4   4 a
     5   3 to
     6   3 cells
     7   2 was
     8   2 i
     9   2 from
    10   2 for
    11   2 cloning
    12   2 by
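A note on `sort -r`: it compares lines as text, so the pipeline above orders counts correctly only because uniq -c typically right-aligns them with leading spaces. When the counts are not padded, a numeric sort with `-n` is the safer choice. The sketch below shows the difference on three made-up, unpadded count lines; the variable names are illustrative.

```shell
# Three unpadded (count, word) lines, deliberately out of order.
counts=$(printf '%s\n' '9 of' '13 the' '2 by')

# Text comparison: "9" sorts above "2", which sorts above "1", so 13 ends up last.
lexical=$(printf '%s\n' "$counts" | LC_ALL=C sort -r)

# Numeric comparison on the leading number gives the intended order.
numeric=$(printf '%s\n' "$counts" | LC_ALL=C sort -rn)

printf 'sort -r:\n%s\n' "$lexical"
printf 'sort -rn:\n%s\n' "$numeric"
```

Here `sort -r` yields 9, 2, 13, while `sort -rn` yields 13, 9, 2.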
Zipf
• Principle of least effort: people act so as to minimise their probable average rate of work.
• Speaker’s effort is conserved by having a small number of very frequent words, whilst hearer’s effort demands a large number of rare words.
• Consequence (according to Zipf): a relationship between word frequency and rank.
• Frequency × Rank = constant
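The product Frequency × Rank can be computed directly from a sorted frequency list with awk. This is a minimal sketch: the five (count, word) pairs are copied from the top of the frequency list above, and the rank is simply the line number, so tied words such as human and a receive different ranks.

```shell
# Top five (count, word) pairs from the sorted frequency list.
freqs=$(printf '%s\n' '13 the' '5 of' '4 human' '4 a' '3 to')

# NR is awk's line counter; on a frequency-sorted list it serves as the rank.
# Columns printed: rank, frequency, word, rank * frequency.
products=$(printf '%s\n' "$freqs" | awk '{ print NR, $1, $2, NR * $1 }')

printf '%s\n' "$products"
```

For these five words the products are 13, 10, 12, 16 and 15. A text of three sentences is far too small to test the claim, so these figures only illustrate the calculation.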
Zipf Curve
[Figure: Zipf curve; axes labelled Frequency and Rank]
paste and tail
• paste: The default operation of paste will concatenate the corresponding lines of the input files. The NEWLINE character of every line except the line from the last input file will be replaced with a TAB character.
• tail: The tail utility copies the named file to the standard output beginning at a designated place.
• These two utilities can be used to work with n-grams.
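The behaviour of the two utilities is easiest to see on a tiny file. In this sketch the file names `words` and `shifted` and the three sample lines are made up; `tail -n +2` is the POSIX spelling of "start at line 2".

```shell
workdir=$(mktemp -d)
printf '%s\n' one two three > "$workdir/words"

# tail -n +2 starts output at line 2, i.e. it drops the first line.
tail -n +2 "$workdir/words" > "$workdir/shifted"

# paste joins line k of each file with a TAB; once the shorter file
# runs out, it contributes an empty field.
pairs=$(paste "$workdir/words" "$workdir/shifted")

printf '%s\n' "$pairs"
rm -r "$workdir"
```

Each output line pairs a word with its successor (one/two, two/three); the last line holds three followed by a TAB and an empty second field, because the shifted file is one line shorter.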
Bigrams
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > foo
• tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' | tail -n +2 > foo1
• paste foo foo1 | sort
• Output (excerpts):
    :
    human being
    human mature
    human ovum
    human stem
    :
    the article
    the cells
    the cutting
    the denucleated
    the foetus
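The bigram list can be turned into a bigram frequency list by reusing the counting steps from the earlier slides. The sketch below assumes a made-up sample sentence; `foo` and `foo1` are the file names from the slide, while `workdir`, `bigram_counts` and `top` are illustrative.

```shell
workdir=$(mktemp -d)

# Made-up sample text; any file with one token per line would do.
printf '%s\n' 'the human cell and the human ovum and the human cell' |
    tr 'A-Z' 'a-z' | tr -sc 'a-zA-Z' '\012' > "$workdir/foo"
tail -n +2 "$workdir/foo" > "$workdir/foo1"

# sed '$d' drops the final pasted line, whose second field is empty
# because foo1 is one line shorter than foo.
bigram_counts=$(paste "$workdir/foo" "$workdir/foo1" | sed '$d' |
    LC_ALL=C sort | uniq -c | LC_ALL=C sort -rn)

printf '%s\n' "$bigram_counts"

# Most frequent bigram, with the count padding and the TAB reduced to spaces.
top=$(printf '%s\n' "$bigram_counts" | head -n 1 | awk '{ print $2, $3 }')
printf '%s\n' "$top"
rm -r "$workdir"
```

In this sample the most frequent bigram is the human, which occurs three times. A trigram list would follow the same pattern, with a third file produced by `tail -n +3` and pasted alongside the other two.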
grep