Chapter 3. Built-in Data Structures, Functions, and Files

3.1 Data Structures and Sequences

Tuple

Fixed-length, immutable sequence of Python objects.

tup = 4, 5, 6
tup
(4, 5, 6)
nested_tup = (4, 5, 6), (7, 8)
nested_tup
((4, 5, 6), (7, 8))

Can convert a sequence or iterator to a tuple with the tuple function.

tuple([4, 0, 2])
(4, 0, 2)
tuple('string')
('s', 't', 'r', 'i', 'n', 'g')

Indexing a tuple is standard.

tup[1]
5

While the tuple is not mutable, mutable objects within the tuple can be modified in place.

tup = 'foo', [1, 2], True
tup[1].append(3)
tup
('foo', [1, 2, 3], True)

Tuples can be concatenated using the + operator or repeated with the * operator and an integer. Note that the objects are not copied, just the references to them.

(4, None, 'foo') + (6, 0) + ('bar', )
(4, None, 'foo', 6, 0, 'bar')
('foo', 'bar') * 4
('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')
tup = ([1, 2], 'foo')
tup = tup * 4
tup[0].append(3)
tup
([1, 2, 3], 'foo', [1, 2, 3], 'foo', [1, 2, 3], 'foo', [1, 2, 3], 'foo')

Tuples can be unpacked by position.

tup = (4, 5, 6)
a, b, c = tup
b
5
tup = 4, 5, (6, 7)
a, b, (c, d) = tup
d
7
a, b = 1, 2
b, a = a, b
a
2

A common use of variable unpacking is iterating over sequences of tuples or lists.

seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for a, b, c in seq:
    print(f'a={a}, b={b}, c={c}')
a=1, b=2, c=3
a=4, b=5, c=6
a=7, b=8, c=9

Another common use is returning multiple values from a function (discussed later).

There is special syntax for when you only want the first few values: a starred name (*rest) collects the remaining values into a list.

values = tuple(range(5))
a, b, *rest = values
a
0
b
1
rest
[2, 3, 4]
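
As a small sketch of the same syntax, the starred name does not have to come last. The names first, middle, and last are illustrative and not from the notes above.

```python
values = tuple(range(5))

# The starred target can sit in the middle; it always collects into a list.
first, *middle, last = values

print(first)   # 0
print(middle)  # [1, 2, 3]
print(last)    # 4
```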

If you don't want the other values, the convention is to assign them to a variable called _.

a, b, *_ = values

List

A variable-length sequence whose contents can be modified in place.

a_list = [2, 3, 7, None]
a_list
[2, 3, 7, None]
tup = 'foo', 'bar', 'baz'
b_list = list(tup)
b_list
['foo', 'bar', 'baz']
b_list[1]
'bar'
b_list[1] = 'peekaboo'
b_list
['foo', 'peekaboo', 'baz']

Elements can be added, inserted, removed, etc.

b_list.append('dwarf')
b_list
['foo', 'peekaboo', 'baz', 'dwarf']
b_list.insert(1, 'red')
b_list
['foo', 'red', 'peekaboo', 'baz', 'dwarf']
b_list.pop(2)
'peekaboo'
b_list
['foo', 'red', 'baz', 'dwarf']
b_list.append('foo')
b_list.remove('foo')
b_list
['red', 'baz', 'dwarf', 'foo']

Lists can be concatenated using the + operator. Alternatively, an existing list can be extended using the extend method and passing another list.

[4, None, 'foo'] + [7, 8, (2, 3)]
[4, None, 'foo', 7, 8, (2, 3)]
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x
[4, None, 'foo', 7, 8, (2, 3)]

A list can be sorted in place.

a = [7, 2, 5, 1, 3]
a.sort()
a
[1, 2, 3, 5, 7]

The sort method has a few options. One is key, a function that produces the value used to order each element.

b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b
['He', 'saw', 'six', 'small', 'foxes']

The bisect module implements binary search and insertion into a sorted list. bisect.bisect(list, value) finds the position where an element should be inserted to keep the list sorted, and bisect.insort(list, value) actually inserts the element at that position.

import bisect
c = [1, 2, 2, 2, 2, 3, 4, 7]
bisect.bisect(c, 2)
5
bisect.bisect(c, 5)
7
bisect.insort(c, 6)
c
[1, 2, 2, 2, 2, 3, 4, 6, 7]
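
To make the result of bisect.bisect(c, 2) above easier to read, here is a minimal sketch comparing the left and right variants from the same module. The variable names left and right are illustrative.

```python
import bisect

c = [1, 2, 2, 2, 2, 3, 4, 7]

# bisect.bisect is an alias for bisect_right: the insertion point lands
# after any existing entries equal to the value.
right = bisect.bisect_right(c, 2)

# bisect_left returns the insertion point before any equal entries.
left = bisect.bisect_left(c, 2)

print(left, right)   # 1 5
print(right - left)  # 4, the number of 2s already in the list
```

Note that the bisect functions do not check that the list is sorted. Calling them on an unsorted list does not raise an error, but the result may be meaningless.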

Specific elements of a list can be accessed using slicing.

seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]
[2, 3, 7, 5]
seq[3:4] = [6, 7, 8, 9]
seq
[7, 2, 3, 6, 7, 8, 9, 5, 6, 0, 1]
seq[:5]
[7, 2, 3, 6, 7]
seq[5:]
[8, 9, 5, 6, 0, 1]
seq[-4:]
[5, 6, 0, 1]
seq[-6:-2]
[8, 9, 5, 6]

A step can also be included after another :.

seq[::2]
[7, 3, 7, 9, 6, 1]
seq[::-1]
[1, 0, 6, 5, 9, 8, 7, 6, 3, 2, 7]

Built-in sequence functions

There are a number of useful built-in functions specifically for sequence types.

enumerate builds an iterator of the sequence to return each value and its index.

some_list = ['foo', 'bar', 'baz']
for i, v in enumerate(some_list):
    print(f'{i}. {v}')
0. foo
1. bar
2. baz
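
One common use of enumerate, sketched here with the same list, is building a dictionary that maps each value to its position. The name mapping and the start=1 variant are illustrative additions.

```python
some_list = ['foo', 'bar', 'baz']

# Map each value to its index in the list.
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i

print(mapping)  # {'foo': 0, 'bar': 1, 'baz': 2}

# The optional start argument changes the first index.
numbered = list(enumerate(some_list, start=1))
print(numbered)  # [(1, 'foo'), (2, 'bar'), (3, 'baz')]
```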

sorted returns a new sorted list from the elements of a sequence.

sorted([7, 1, 2, 6, 0, 3, 2])
[0, 1, 2, 2, 3, 6, 7]
sorted('horse race')
[' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']

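zip pairs up the elements of a number of sequences into tuples. It returns an iterator, so wrap it in list to see the pairs. It stops as soon as the shortest sequence is exhausted.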

seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two','three']
zipped = zip(seq1, seq2)
list(zipped)
[('foo', 'one'), ('bar', 'two'), ('baz', 'three')]
seq2 = ['one', 'two']
list(zip(seq1, seq2))
[('foo', 'one'), ('bar', 'two')]
for a, b in zip(seq1, seq2):
    print(f'{a} - {b}')
foo - one
bar - two

A list of tuples can also be "unzipped".

pitchers = [
    ('Nolan', 'Ryan'),
    ('Roger', 'Clemens'),
    ('Curt', 'Schilling')
]
first_names, last_names = zip(*pitchers)
first_names
('Nolan', 'Roger', 'Curt')
last_names
('Ryan', 'Clemens', 'Schilling')

The reversed function iterates over the sequence in reverse order.

list(reversed(range(10)))
[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Dictionaries

A flexibly-sized collection of key-value pairs.

empty_dict = {}
d1 = {'a': 'some value', 'b': [1, 2, 3, 4]}
d1
{'a': 'some value', 'b': [1, 2, 3, 4]}
d1[7] = 'an integer'
d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
d1['b']
[1, 2, 3, 4]

You can check if a key is in a dictionary.

'b' in d1
True

A key-value pair can be deleted using del or the pop method which returns the value.

d1[5] = 'some value'
d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 5: 'some value'}
del d1[5]
d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
ret = d1.pop('a')
d1
{'b': [1, 2, 3, 4], 7: 'an integer'}
ret
'some value'

The keys and values methods return iterable views of the dictionary's keys and values. Both follow the dictionary's insertion order (guaranteed since Python 3.7), so the two line up with each other.

list(d1.keys())
['b', 7]
list(d1.values())
[[1, 2, 3, 4], 'an integer']

A dictionary can be added to another using the update method.

d1.update({'b': 'foo', 'c': 12})
d1
{'b': 'foo', 7: 'an integer', 'c': 12}

We will learn about dictionary comprehensions later, but for now, here is a good way to construct a dictionary from two lists or tuples.

mapping = dict(zip(range(5), reversed(range(5))))
mapping
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

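The get and pop methods of a dictionary can take a default value to return if the key does not exist in the dictionary. A related task is setting a default while building a dictionary of lists, such as grouping words by their first letter.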

words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}
for word in words:
    letter = word[0]
    if letter not in by_letter:
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

by_letter
{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

This can instead be written more concisely with the setdefault method.

by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)

by_letter
{'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
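
The default values for get and pop mentioned above can be sketched as follows, together with collections.defaultdict from the standard library as another way to write the grouping loop. The dictionary d and its keys are made up for illustration.

```python
from collections import defaultdict

d = {'a': 1}

# get returns the default instead of raising KeyError for a missing key.
present = d.get('a', 0)     # 1
missing = d.get('z', 0)     # 0

# pop with a default also avoids KeyError when the key is absent.
popped = d.pop('z', None)   # None, and d is unchanged

# defaultdict creates the empty list automatically on first access.
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)

print(dict(by_letter))
# {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
```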

Set

An unordered collection of unique elements - the name comes from the mathematical term.

set([1, 1, 2, 3, 4, 4, 5, 6, 6])
{1, 2, 3, 4, 5, 6}
{1, 1, 2, 3, 4, 4, 5, 6, 6}
{1, 2, 3, 4, 5, 6}

Sets support standard set operations such as union, intersection, difference, and symmetric difference.

a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}
a.union(b)
{1, 2, 3, 4, 5, 6, 7, 8}
a | b
{1, 2, 3, 4, 5, 6, 7, 8}
a.intersection(b)
{3, 4, 5}
a & b
{3, 4, 5}
a - b
{1, 2}
b - a
{6, 7, 8}
a ^ b
{1, 2, 6, 7, 8}
a.issubset(b)
False
a.issuperset({1, 2, 3})
True

List, set, and dictionary comprehensions

These are features for the easy (and fast) creation of the collection types. The basic format is as follows.

[ expr for val in collection if condition ]

This is equivalent to the following loop.

result = []
for val in collection:
    if condition:
        result.append(expr)

The condition can be omitted if no filter is needed.

strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[x.upper() for x in strings if len(x) > 2]
['BAT', 'CAR', 'DOVE', 'PYTHON']

A set comprehension is identical to a list comprehension except that it uses curly braces instead of square brackets.

A dictionary comprehension is syntactically similar.

{ key-expr : value-expr for value in collection if condition }

{val : index for index, val in enumerate(strings)}
{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}
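
The notes describe set comprehensions without showing one, so here is a minimal sketch using the same strings list. The name unique_lengths is illustrative.

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# Curly braces with a single expression (no key: value pair) build a set,
# so duplicate lengths collapse to one entry.
unique_lengths = {len(x) for x in strings}
print(sorted(unique_lengths))  # [1, 2, 3, 4, 6]

# The same set can be built with map.
same_lengths = set(map(len, strings))
print(unique_lengths == same_lengths)  # True
```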

List comprehensions can be nested. In the following example, a list of two lists is iterated over and only the names with at least two 'e's are kept.

all_data = [
    ['John', 'Emily', 'Michael', 'Mary', 'Steven'],
    ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']
]
[name for names in all_data for name in names if name.count('e') >= 2]
['Steven']

The next example is followed by an equivalent nested for loop. Notice that the order of the for expressions is the same.

some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
[x for tup in some_tuples for x in tup]
[1, 2, 3, 4, 5, 6, 7, 8, 9]
flattened = []
for tup in some_tuples:
    for x in tup:
        flattened.append(x)
flattened
[1, 2, 3, 4, 5, 6, 7, 8, 9]
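
For contrast with the flattening example, a comprehension placed inside another comprehension keeps the nesting instead of removing it. This is a small sketch; the name nested is illustrative.

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# The inner comprehension runs once per tuple, producing a list of lists.
nested = [[x for x in tup] for tup in some_tuples]
print(nested)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```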

3.2 Functions

def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)
my_function(5, 6, z=0.7)
0.06363636363636363

Note that there are various ways to pass arguments to functions that are not covered here with several new ones available in Python 3.8.

Namespaces, scope, and local functions

A function gets its own local namespace when it is called. The namespace is immediately populated with the function's arguments and is destroyed once the function returns. A variable in the global namespace can be assigned from inside a function by declaring it with the global keyword.

a = None
a
def bind_a_variable():
    global a
    a = []

bind_a_variable()
print(a)
[]

Returning multiple values

Functions can only return one object, but by returning a tuple, unpacking can be used to create multiple variables.

def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()
print(f"a={a}, b={b}, c={c}")
a=5, b=6, c=7

Another option is to use a dictionary. This allows for the naming of the returned values.
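
A minimal sketch of the dictionary approach, reusing the values from the tuple example above. The key names are chosen here for illustration.

```python
def f():
    a = 5
    b = 6
    c = 7
    # Each returned value is labeled by its key.
    return {'a': a, 'b': b, 'c': c}

result = f()
print(result['b'])  # 6
```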

Functions are objects

Say we wanted to clean the input from a survey.

states = ['    Alabama', 'Georgia!', 'Georgia', 'georgia', 'flOrIda', 'south   carolina###', 'West virginia?']

We could use one function to implement various string methods and methods from 're' for regular expressions.

import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub('[!#?]', '', value)
        value = value.title()
        result.append(value)
    return result

clean_strings(states)
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

Alternatively, we could make a few functions that each do one step in the processing and apply it to all of the values of a list.

def remove_punctuation(value):
    return re.sub('[!#?]', '', value)

cleaning_operations = [str.strip, remove_punctuation, str.title]

def clean_strings(strings, ops):
    result = []
    for value in strings:
        for function in ops:
            value = function(value)
        result.append(value)
    return result

clean_strings(states, cleaning_operations)
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']

Anonymous (lambda) functions

These are single-line functions that automatically return the value of their single expression. They are defined with the keyword lambda. They are very useful in data analysis for passing a function as an argument to another function.

anon = lambda x: x * 2
anon(3)
6
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]
apply_to_list(ints, lambda x: x * x)
[16, 0, 1, 25, 36]

Another example is where some common methods take functions for an argument to augment their default functionality.

strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
strings.sort(key=lambda x: len(set(x)))
strings
['aaaa', 'foo', 'abab', 'bar', 'card']

Currying: partial argument application

Currying is a CS term that means deriving new functions from existing ones by partial argument application. For example, add_numbers adds its two parameters, x and y. add_five curries it by fixing x to 5.

def add_numbers(x, y):
    return x + y

add_five = lambda y: add_numbers(5, y)
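
The standard library's functools.partial offers another way to write the same thing. This sketch redefines add_numbers so that it runs on its own.

```python
from functools import partial

def add_numbers(x, y):
    return x + y

# partial fixes the first positional argument (x) to 5.
add_five = partial(add_numbers, 5)

print(add_five(10))  # 15
```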

Generators

The iterator protocol is a generic way to make iterable objects. An iterator object can specifically be created from most built-in collection types.

dict_iterator = iter(d1)
dict_iterator
<dict_keyiterator at 0x11c920a10>
list(dict_iterator)
['b', 7, 'c']

The iterator yields the objects when it is used in a for-like context or passed to the common built-in methods that take collection types.
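
What a for loop does with an iterator can be sketched by calling next by hand. The dictionary some_dict and the other names here are illustrative, not from the notes above.

```python
some_dict = {'a': 1, 'b': 2}

it = iter(some_dict)

# Each call to next pulls one key from the iterator.
first = next(it)    # 'a'
second = next(it)   # 'b'

# Once the iterator is exhausted, next raises StopIteration,
# which is the signal a for loop uses to stop.
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

print(first, second, exhausted)  # a b True
```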

A generator is a way to create a new iterable object. Generators are like functions, but return multiple objects lazily, one at a time. A generator is created by using the yield keyword instead of return in a function.

def squares(n=10):
    print(f'Generating squares from 1 to {n}.')
    for i in range(1, n + 1):
        yield i**2

gen = squares() 
gen
<generator object squares at 0x11da31bd0>
for x in gen:
    print(x, end = ' ')
Generating squares from 1 to 10.
1 4 9 16 25 36 49 64 81 100

Generators can also be created using a generator expression, which is similar in kind and syntax to a list comprehension.

gen = (x**2 for x in range(100))
gen
<generator object <genexpr> at 0x11db32ed0>
sum(gen)
328350

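The itertools module from the standard library has a collection of generators for many common data algorithms. Here is an example of groupby, which groups consecutive elements that share the same key; this is why 'A' appears twice in the output.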

import itertools

first_letter = lambda x: x[0]

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))
A ['Alan', 'Adam']
W ['Wes', 'Will']
A ['Albert']
S ['Steven']
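
Because groupby only merges adjacent elements, one possible way to get a single group per letter is to sort by the same key first. This is a sketch; the name grouped is illustrative.

```python
import itertools

names = ['Alan', 'Adam', 'Wes', 'Will', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# Sorting by the grouping key puts equal keys next to each other.
# sorted is stable, so the original order is kept within each letter.
sorted_names = sorted(names, key=first_letter)

grouped = {
    letter: list(group)
    for letter, group in itertools.groupby(sorted_names, first_letter)
}

print(grouped)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes', 'Will']}
```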

Errors and exception handling

Use try-except to fail gracefully.

def attempt_float(x):
    try:
        return float(x)
    except:
        return x

attempt_float('1.23')
1.23
attempt_float('a')
'a'

You can define except for different types of errors. For example, when float() is passed an improper string, it raises a ValueError. If it is passed a tuple, it raises a TypeError.

attempt_float((1, 2))
(1, 2)
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        print('value error')
    except TypeError:
        print('type error')
attempt_float(5)
5.0
attempt_float('a')
value error
attempt_float((2, 3))
type error

A single except can recognize multiple error types.

def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x

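Often, you want some code to execute regardless of whether the try block succeeds or fails; that is what finally is for. An else block runs only if the try block completes without raising an exception. In the example below, the try block returns on success, so the else block is skipped.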

def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        print('error')
        return x
    else:
        print('succeeded')
    finally:
        print('all done')
attempt_float(1)
all done
1.0
attempt_float('a')
error
all done
'a'
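
To see the else block actually run, the return has to move out of the try block. The following is a sketch of that variation. It records the steps in a list (the hypothetical name events) instead of printing them, so the order is easy to inspect.

```python
def attempt_float(x):
    events = []
    try:
        value = float(x)
    except ValueError:
        events.append('error')
        value = x
    else:
        # Runs only when float(x) raised no exception.
        events.append('succeeded')
    finally:
        # Runs in every case.
        events.append('all done')
    return value, events

print(attempt_float('1.5'))  # (1.5, ['succeeded', 'all done'])
print(attempt_float('a'))    # ('a', ['error', 'all done'])
```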

3.3 Files and the operating system

Open a file for reading using the open function.

path  = "assets/segismundo.txt"
f = open(path)

It is opened in a 'read-only' form, by default. Lines can be iterated through.

for line in f:
    print(line)
Sueña el rico en su riqueza,

que más cuidados le ofrece;



sueña el pobre que padece

su miseria y su pobreza;



sueña el que a medrar empieza,

sueña el que afana y pretende,

sueña el que agravia y ofende,



y en el mundo, en conclusión,

todos sueñan lo que son,

aunque ninguno lo entiende.

It is important to close files that are opened.

f.close()

It is often useful to remove end-of-line markers.

lines = [x.rstrip() for x in open(path)]
lines
['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.']

Alternatively, it is often easier to use a with statement that automatically closes the file when the block finishes.

with open(path) as f:
    lines = [x.rstrip() for x in f]

For readable files, a few commonly used methods are:

  • read: returns a certain number of characters from a file
  • seek: changes the file position to the indicated byte
  • tell: gives the current position in the file
f = open(path)
f.read(10)
'Sueña el '
f2 = open(path, 'rb')  # binary mode
f2.read(10)
b'Suen\xcc\x83a el'
f.tell()
11
f2.tell()
10
f.seek(3)
3
f.read(1)
'n'
f.close()
f2.close()
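
The difference between reading characters and reading bytes can be reproduced without the assets file. This sketch writes a short hypothetical scratch file (sample.txt in a temporary directory), using a precomposed 'ñ' that takes two bytes in UTF-8.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write('Sue\u00f1a el rico')

    # Text mode: read(4) returns four characters.
    with open(sample_path, encoding='utf-8') as f:
        text_chunk = f.read(4)

    # Binary mode: read(4) returns four bytes, cutting 'ñ' in half.
    with open(sample_path, 'rb') as f:
        byte_chunk = f.read(4)
        position = f.tell()   # 4 bytes consumed so far
        f.seek(0)             # jump back to the start of the file
        restart = f.read(3)

print(text_chunk)  # Sueñ
print(byte_chunk)  # b'Sue\xc3'
print(position)    # 4
print(restart)     # b'Sue'
```

In text mode, seeking to an arbitrary byte offset can land in the middle of a multi-byte character, so it is safest to seek only to positions previously returned by tell.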

Use write or writelines to write to a file.

with open('assets/tmp.txt', 'w') as handle:
    handle.writelines(x for x in open(path) if len(x) > 1)
with open("assets/tmp.txt") as f:
    lines = f.readlines()

lines
['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']