Lecture 2 26/04/2011 Statistical Natural Language Processing
Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes
Recommended reading • Slides from LING 508, Fall 2010 • http://www.u.arizona.edu/~echan3/508.html • Python tutorial • http://docs.python.org/tutorial/
For Java programmers • Python & Java: A Side-by-Side Comparison • Quick comparison of the two languages • http://pythonconquerstheuniverse.wordpress.com/category/java-and-python/ • Python for Java Programmers • Incomplete tutorial on Python, with Java examples side-by-side • http://python.computersci.org/Main/TableOfContents
Install these • Python 2.6 • http://www.python.org/ • NumPy 1.5.1 • http://numpy.scipy.org/ • Matplotlib 1.0.1 • http://matplotlib.sourceforge.net/ • Contains the pyplot module
Mac OS X 10.6 Snow Leopard • Python and NumPy are already built in, but Matplotlib is not • Matplotlib is incompatible with the built-in versions of Python and NumPy • So you’ll need to download and install Python, NumPy, and Matplotlib
Alternative: install PyLab • PyLab includes NumPy and Matplotlib • http://www.scipy.org/PyLab • So, instead of: >>> import matplotlib • You can do: >>> from pylab import matplotlib
Set Python environment variable • Create a directory mypythoncode for your Python code • Example: C:\Users\Arizona\Desktop\539\mypythoncode\ • Set environment variable so Python knows where to find your code • Windows Vista: • right-click on My Computer • choose "Advanced system properties" • add a new User variable called PYTHONPATH • set the value of the variable to the full path of mypythoncode
Set Python environment variable • OS X, Unix/Linux, etc.: • csh • edit .cshrc • setenv PYTHONPATH /home/me/mypythoncode • bash • edit .bashrc or .bash_profile • export PYTHONPATH=/home/me/mypythoncode
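One way to check that the variable was picked up is to inspect it from inside Python. The sketch below is an illustration, not part of the original slides; it uses only the standard os and sys modules and is written to run under both Python 2.6 and Python 3.

```python
import os
import sys

# PYTHONPATH as seen by this Python process; None if the variable is not set.
pythonpath = os.environ.get('PYTHONPATH')
print('PYTHONPATH = %s' % pythonpath)

# Directories named in PYTHONPATH are added to sys.path when Python starts,
# so your mypythoncode directory should appear somewhere in this list.
for directory in sys.path:
    print(directory)
```

If the directory is missing from the output, the shell or IDLE probably has to be restarted so that it sees the new variable.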
Beware of Windows… • When you repeatedly execute code and cancel execution (with control-C), sometimes the processes continue anyway, and after a while IDLE won’t let you run your code • Solution: • press ctrl-alt-delete • start Task Manager • select lowest pythonw.exe processes • click on End Process • sometimes you might have to restart IDLE
Python in this course • The Marsland book uses Python and numerical Python. • Next lecture: NumPy • You’ll need to learn some Python in order to: • Read the code • Use it and modify it for some assignments • Not all assignments will involve Python, and portions of assignments may be completed in other languages.
Why Python • NLP community uses it • Language features • Datatypes built into language: strings, lists, hash tables • Automatic garbage collection • Dynamic typing • Easy to read: • Forced indentation • Code is concise, not verbose like Java • Gentle learning curve
Hello World in Python and Java • Python 2.X: print 'Hello World!' • Java: class HelloWorld { public static void main(String[] args) { System.out.println("Hello World!"); } }
Comments • Comments are ignored by the Python interpreter but are useful for describing the purpose of a section of code
a = 1    # everything after hash mark is a comment
b = 2
c = 3
# statement below does not execute because
# it is within a comment
# d = 4
Overview of basic data types • Integers: • 3, 8, -2, 100 • Floating-point numbers • 3.14159, 0.0001, -.101010101, 2.34e+18 • Booleans • True, False • Strings • 'hello', "GOODBYE" • Python does not have characters; a single character is a string of length 1 • None • Value is None
Overview of compound data types • Lists • [1,2,3,4,5], ['how', 'are', 'you'] • Elements are indexed: element at index 0 of [1,2,3,4,5] is 1 • Tuples • (1, 2, 3, 'a', 'b') • Sets: these are the same: • set([1,2,3,4,5]) • set([1,1,2,2,3,3,4,4,5,5]) • Elements are not indexed • Dictionaries (hash table) • {'a':1, 'b':2, 'c':3} • Map 'a' to 1, map 'b' to 2, map 'c' to 3 • Example application: represent the frequencies of letters in a text
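The letter-frequency application mentioned for dictionaries could look roughly like the sketch below. The function name letter_frequencies and the sample text are made up for illustration, and the dictionary method get (which returns a default when a key is missing) is not introduced on the slides above.

```python
def letter_frequencies(text):
    """Return a dictionary mapping each letter in text to its count."""
    freq = {}
    for ch in text.lower():
        if ch.isalpha():                     # skip spaces, digits, punctuation
            freq[ch] = freq.get(ch, 0) + 1   # 0 is the default for a new letter
    return freq

freq = letter_frequencies('Hello World')
print(sorted(freq.items()))
# [('d', 1), ('e', 1), ('h', 1), ('l', 3), ('o', 2), ('r', 1), ('w', 1)]
```

The items are sorted before printing because a dictionary, like a set, does not promise any particular display order.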
Python is dynamically typed • Don’t explicitly specify types of variables • Python interpreter keeps track of types
b = 3        # b is an integer
b = False    # b is now a boolean
def myfunc(x, y):    # type not specified
    return x + y
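A small sketch of what dynamic typing means in practice: the same name can hold values of different types, and the same function works for any arguments that support +. The variable names below are illustrative only; the built-in type function is used to ask the interpreter which type it is tracking.

```python
def myfunc(x, y):                 # same function as on the slide
    return x + y

b = 3
first_type = type(b).__name__     # 'int'
b = False
second_type = type(b).__name__    # 'bool'

int_sum = myfunc(1, 2)            # adds integers: 3
joined = myfunc('ab', 'cd')       # concatenates strings: 'abcd'
merged = myfunc([1], [2, 3])      # concatenates lists: [1, 2, 3]

print('%s then %s' % (first_type, second_type))
print(int_sum)
print(joined)
print(merged)
```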
Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes
Creating lists a = [1, 2, 3] # list of integers b = [True, False] # list of booleans c = ['a', 'b', 'cde'] # list of strings d = [7, 'cat', False] # can mix types e = [] # empty list f = [[1,2,3], [4,5,6]] # list of 2 lists g = [[], [[]]] # list of 2 lists
List indices • Positive indices: 0 1 2 3 4 • Elements: 'a' 'b' 'c' 'd' 'e' • Negative indices: -5 -4 -3 -2 -1
Indexing into lists >>> mylist = ['a', 'b', 'c', 'd', 'e'] >>> mylist[0] 'a' >>> mylist[1] 'b' >>> mylist[2] 'c' >>> mylist[3] 'd' >>> mylist[4] 'e'
Negative indexing >>> mylist[-1] 'e' >>> mylist[-2] 'd' >>> mylist[-3] 'c' >>> len(mylist) # built-in function length 5 >>> mylist[len(mylist)-1] 'e'
Creating new lists through slices • Syntax: mylist[start_idx:stop_idx(:step_size)] • start_idx • begin accessing at this index (inclusive) • default: 0 • stop_idx • stop accessing at this index (exclusive) • default: len(list) • step_size: (optional) • number of items to step through • default: 1
Creating new lists through slices >>> L = ['a', 'b', 'c', 'd', 'e'] >>> L[:2] # up to but not including index 2 ['a', 'b'] >>> L[2:] # beginning at index 2 ['c', 'd', 'e'] >>> L[2:5] ['c', 'd', 'e'] >>> L[2:4] ['c', 'd']
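The slice examples so far leave out the optional step size and negative indices. The following sketch fills in those cases with the same five-element list; the variable names are illustrative.

```python
L = ['a', 'b', 'c', 'd', 'e']

every_other = L[::2]       # start and stop take their defaults, step is 2
odd_positions = L[1::2]    # begin at index 1, step is 2
last_two = L[-2:]          # a negative start index counts from the end
copy_of_L = L[:]           # all defaults: a new list with the same elements

print(every_other)         # ['a', 'c', 'e']
print(odd_positions)       # ['b', 'd']
print(last_two)            # ['d', 'e']
print(copy_of_L is L)      # False: a slice is always a new list
```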
Built-in functions: len, sorted >>> L = [4,3,1,5,2] >>> len(L) 5 >>> sorted(L) [1, 2, 3, 4, 5] >>> L # does not modify original [4, 3, 1, 5, 2] >>> L = sorted(L) # rebind L to a new sorted list >>> L [1, 2, 3, 4, 5]
Built-in range function • Returns a list containing an arithmetic progression of integers • Syntax: range([start,] stop[, step]) • Examples: >>> range(5) [0, 1, 2, 3, 4] >>> range(3,6) [3, 4, 5] >>> range(3,8,2) [3, 5, 7]
List methods >>> L = [1, 2, 2] >>> L.append(4) # append a single object >>> L [1, 2, 2, 4] >>> L.extend([3,4]) # extend with a list >>> L [1, 2, 2, 4, 3, 4] >>> [1,2] + [3,4] # like extend, but returns a new list [1, 2, 3, 4]
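The difference between append, extend, and + is easiest to see when a second name refers to the same list. This is a minimal sketch with made-up variable names.

```python
a = [1, 2]
alias = a              # a second name for the same list object
a.extend([3, 4])       # modifies the list in place
print(alias)           # [1, 2, 3, 4]: the alias sees the change

b = [1, 2]
c = b + [3, 4]         # builds a new list and leaves b alone
print(b)               # [1, 2]
print(c)               # [1, 2, 3, 4]

d = [1, 2]
d.append([3, 4])       # append adds its argument as ONE element
print(d)               # [1, 2, [3, 4]]
print(len(d))          # 3
```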
List methods >>> L [2, 2, 5, 3, 4] >>> L.reverse() # reverse list, modify it >>> L [4, 3, 5, 2, 2] >>> L.sort() # sort the list, modify it >>> L [2, 2, 3, 4, 5] >>> L = [3,2,1] >>> sorted(L) # return a sorted list, but don’t modify L [1, 2, 3] >>> L [3, 2, 1]
Sets >>> S = set() # call set constructor >>> S set([]) >>> S = set([1,2,2,3,3]) >>> S set([1, 2, 3]) • Won’t necessarily display in sorted order: >>> set([6,65,4,21,3,4,7,1]) set([65, 3, 4, 6, 7, 1, 21])
Searching in a list vs. a set • Linear time to search for an item in a list >>> L = range(1000000) >>> 999999 in L # takes 0.24 seconds True • Constant time (on average) to search for an item in a set >>> S = set(range(1000000)) >>> 999999 in S # takes 0.0 seconds True
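The timings on this slide can be reproduced approximately with the standard timeit module. This is a rough sketch, not the measurement used for the slide: the collection size and the repetition count are arbitrary choices, and the actual numbers depend on the machine.

```python
import timeit

n = 100000                 # smaller than the slide's 1,000,000 to keep the run short
L = list(range(n))
S = set(L)
target = n - 1             # worst case for the list: the last element

# Time 100 membership tests against each collection.
list_time = timeit.timeit(lambda: target in L, number=100)
set_time = timeit.timeit(lambda: target in S, number=100)

print('list: %.6f seconds for 100 searches' % list_time)
print('set:  %.6f seconds for 100 searches' % set_time)
```

The list search has to compare against every element before reaching the last one, while the set looks the item up by its hash, so the set time should be far smaller.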
Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes
Functions: can have default values for arguments
def f(a, b):
    return a + b

# consequence of default values: only one function definition,
# unlike Java, where you have a function definition
# for each combination of arguments used
def g(a, b=7):
    return a + b

f(3, 4)    # returns 7
g(3, 4)    # returns 7
g(3)       # returns 10
Functions: default return value is None >>> def f(): print 'hello' >>> x = f() hello >>> print x None
For loops
L = [1, 2, 3, 4, 5]
L2 = []
L3 = []
for i in range(len(L)):    # loop over the indices of L
    L2.append(L[i] * 2)
for x in L:                # loop over the elements of L directly
    L3.append(x * 2)
Bubble sort
def bubblesort(L):
    for i in range(len(L)-1):
        swap_made = False
        for j in range(len(L)-1):
            if L[j+1] < L[j]:
                L[j], L[j+1] = L[j+1], L[j]
                swap_made = True
        if not swap_made:    # list is sorted
            break
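A quick way to convince yourself that the bubble sort works is to run it and compare the result with the built-in sorted. The sketch below repeats the function from the slide so that it is self-contained; the test data is arbitrary.

```python
def bubblesort(L):
    """Sort the list L in place (same algorithm as on the slide)."""
    for i in range(len(L) - 1):
        swap_made = False
        for j in range(len(L) - 1):
            if L[j + 1] < L[j]:
                L[j], L[j + 1] = L[j + 1], L[j]   # swap neighbours
                swap_made = True
        if not swap_made:                         # no swaps: list is sorted
            break

data = [4, 3, 1, 5, 2]
expected = sorted(data)        # new sorted list, data is unchanged here
result = bubblesort(data)      # sorts data itself

print(data)                    # [1, 2, 3, 4, 5]
print(result)                  # None: the function has no return statement
```

Like L.sort(), this function modifies its argument and returns None, so writing L = bubblesort(L) would lose the list.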
Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes
Declaring strings • Strings can be enclosed in single quotes or double quotes s1 = 'spam' s2 = "spam"
Indexing and slicing strings, just like lists >>> s = 'python' >>> s[3] 'h' >>> s[:3] 'pyt' >>> s[3:] 'hon' >>> s[2:4] 'th' >>> s[2:-2] 'th'
Strings are immutable >>> s = 'python' >>> s[0] = 'x' Traceback (most recent call last): File "<pyshell#25>", line 1, in <module> s[0] = 'x' TypeError: 'str' object does not support item assignment >>> L = [1,2,3,4] # but lists are mutable >>> L[0] = 5 >>> L [5, 2, 3, 4]
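Since a string cannot be changed in place, the usual workaround is to build a new string. Two possible ways are sketched below, using only slicing, the list constructor, and join; the variable names are illustrative.

```python
s = 'python'

# Option 1: slice and concatenate to build a new string.
t = 'x' + s[1:]
print(t)                 # xython

# Option 2: convert to a list (mutable), change it, and join it back.
chars = list(s)          # ['p', 'y', 't', 'h', 'o', 'n']
chars[0] = 'x'
u = ''.join(chars)
print(u)                 # xython

print(s)                 # python: the original string is untouched
```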
Concatenation >>> s1 = 'python' >>> s2 = 'big ' + s1 >>> s2 'big python' >>> s1 = 'big ' + s1 >>> s3 = s2[:4] + 'ball ' + s2[4:] >>> s3 'big ball python'
Built-in functions >>> len('spam') 4 • Type conversion through type constructors: useful for reading data from files (convert string to numeric types) >>> int('356') 356 >>> float('3.56') 3.5600000000000001 >>> str(356) '356'
Length of a string >>> s = 'hello' >>> len(s) 5
String methods >>> s = 'howXareXyou' >>> s.split('X') ['how', 'are', 'you'] >>> s = 'how are\tyou\n' >>> s.split() # splits on whitespace ['how', 'are', 'you'] >>> s = ' how are you\n' >>> s.strip() # also lstrip and rstrip 'how are you'
String methods >>> ''.join(['how', 'are', 'you']) 'howareyou' >>> 'X'.join(['how', 'are', 'you']) 'howXareXyou'
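Used together, split and join give a simple way to normalize whitespace or change the separator of a line of text. The input line below is made up for illustration.

```python
line = '  how   are\tyou\n'

tokens = line.split()              # split on any run of whitespace
normalized = ' '.join(tokens)      # rejoin with single spaces
comma_separated = ','.join(tokens)

print(tokens)                      # ['how', 'are', 'you']
print(normalized)                  # how are you
print(comma_separated)             # how,are,you
```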
String methods >>> s = 'goodmorning' >>> s.startswith('goo') True >>> s.endswith('ning') True >>> s[:3]=='goo' # slices and equality True >>> s[-4:]=='ning' True
String methods >>> s = 'how are you\n' >>> s.upper() # shows return value 'HOW ARE YOU\n' >>> s # variable was not modified 'how are you\n' >>> s = s.upper() # modify it >>> s 'HOW ARE YOU\n' >>> 'hello'.isupper() False
Input from a file: print each word in a corpus
# open file for reading
f = open('C:/myfile.txt', 'r')
# read each line one at a time
# “line” is a newline-terminated string
for line in f:
    # convert to a list of strings
    tokens = line.split()
    # perform operation on each string
    for tok in tokens:
        print tok
File output
outfilename = '/home/echan3/myfile.txt'
of = open(outfilename, 'w')    # write
# write a string
# should include newline character
of.write('hello Python\n')
of.close()    # need to close output file
# in reading a file, you don’t have to
# call close()
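Writing and reading can be combined into one round trip: write a few lines, read them back, and collect the words. In the sketch below the file name corpus.txt and the temporary directory are hypothetical stand-ins for your own paths, and the standard tempfile module is used only so the example does not touch any real file.

```python
import os
import tempfile

# Hypothetical file location: a fresh temporary directory.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'corpus.txt')

# Write two newline-terminated lines, then close the file.
out = open(path, 'w')
out.write('hello Python\n')
out.write('hello world\n')
out.close()

# Read the file back line by line and collect the tokens.
tokens = []
f = open(path, 'r')
for line in f:
    tokens.extend(line.split())
f.close()

print(tokens)    # ['hello', 'Python', 'hello', 'world']

# Clean up the temporary file and directory.
os.remove(path)
os.rmdir(tmpdir)
```

The input file is closed explicitly here so that it can be deleted afterwards on every platform.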
Outline • Overview of Python • Lists and sets • Functions and loops • Strings and file I/O • Dictionaries and tuples • Modules and classes
Tuples • Quick data structure to group variables together • Group together variables of different types • Multiple return values for functions >>> t = ('a', 1, True) >>> t = (1, (2, ((3, 4), 5))) >>> def x(): return (3, 4) >>> (a, b) = x() >>> e = () # empty tuple >>> e = (1,) # one-element tuple, note the comma
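A typical use of tuples for multiple return values is a function that computes two related results in one pass. The function min_and_max below is a made-up example, not something from the course code.

```python
def min_and_max(numbers):
    """Return the smallest and largest item of a non-empty list as a tuple."""
    smallest = numbers[0]
    largest = numbers[0]
    for x in numbers[1:]:
        if x < smallest:
            smallest = x
        if x > largest:
            largest = x
    return (smallest, largest)

result = min_and_max([4, 3, 1, 5, 2])     # the whole tuple
low, high = min_and_max([4, 3, 1, 5, 2])  # unpack into two variables

print(result)    # (1, 5)
print(low)       # 1
print(high)      # 5
```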