Sunday, March 25, 2018

ML with Python (numpy, pandas), my tools + code snippets (Part 1)


1. Jupyter notebooks

Jupyter ships with the Anaconda distribution, but I prefer a plain pip install:
$ python3 -m pip install jupyter
Jupyter notebooks run a local web server that you reach from a browser; running servers can also be inspected from a terminal. To find which localhost:port a running server uses:
$ jupyter notebook list
To stop a server:
$ jupyter notebook stop 8888 
To open a notebook file (with .ipynb extension), change to the notebook directory in a terminal and run:
$ jupyter notebook
This starts a server, then opens the notebook home directory in a browser. Click a notebook to open it. Keep the terminal open: the server runs in it as an attached process. To stop the server, press Ctrl-C or just close the terminal.
Press Shift-Return to run the selected cell and advance to the next one.

2. Paths to Python components and packages

>>> import sys  
>>> print('\n'.join(sys.path))
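
To check where one particular package is loaded from, the standard library's importlib can be queried without importing the package. This is a minimal sketch; the module name "json" is only a placeholder, substitute numpy or pandas as needed:

```python
import importlib.util
import sys

# Locate a module without importing it; "json" is just an example name.
spec = importlib.util.find_spec("json")
print(spec.origin)      # path of the file the module would be loaded from

# The interpreter that is running and its installation prefix.
print(sys.executable)
print(sys.prefix)
```

If find_spec returns None, the package is not visible on any of the sys.path entries listed above.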

3. PYZO editor for .py

PYZO works for me. It is lightweight, has debugging options, and is configurable.
It supports cells: blocks of code that can be executed with a simple shortcut (Cmd-Return). Define cells by delimiting them with #%% lines, optionally followed by a comment.
Some shortcuts (original and my own custom):
---------- EDITOR: 
Opt-Tab    - select previous file 
F1         - focus to shell panel 
F2         - focus to file editor 
Cmd-/      - comment selection  
Cmd-Opt-/  - uncomment  
---------- RUN:  
Cmd-R      - run file as a script (in console restarts interpreter)  
Cmd-Return - run cell with cursor (cells are delimited by #%%)   
Opt-Return - run selection  
---------- DEBUGGER:  
Cmd-B      - toggle breakpoint  
F6         - step over  
F7         - step in  
F8         - step out  
Ctrl-Cmd-Y - continue  
Ctrl-Cmd-. - stop debugging 
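
As an illustration of the cell convention, here is a tiny script split into two cells. The variable names are made up for this sketch; the only thing that matters is the #%% delimiter lines:

```python
#%% Prepare data (the text after #%% is an optional cell title)
values = [3, 1, 4, 1, 5]

#%% Summarize
total = sum(values)
print(total)   # 14
```

With the cursor inside the second cell, Cmd-Return should run only that cell, reusing values from the shell session.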

4. Basics of numpy arrays

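There are many ways to initialize arrays. The two main options are shape and dtype. Note that np.ndarray allocates memory without initializing it, so the printed values are whatever happened to be in the buffer:
>>> import numpy as np
>>> x = np.ndarray(shape=(2, 2), dtype=np.int8, order='C')
>>> print(x)
[[1 0]
 [1 1]]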
The internal buffer is linear. The shape can be changed without reallocation as long as the total element count stays the same:
>>> x.shape = (1,4) 
>>> print(x) 
[[1 0 1 1]]
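
A quick way to confirm that reshaping does not copy data is np.shares_memory. A minimal sketch (the array names a and b are illustrative):

```python
import numpy as np

a = np.arange(6)            # 1-D array: [0 1 2 3 4 5]
b = a.reshape(2, 3)         # new shape over the same underlying buffer

print(np.shares_memory(a, b))   # True: no data was copied
b[0, 0] = 100               # writing through the reshaped array ...
print(a)                    # ... changes the original: first element is now 100
```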
Other array constructors:
>>> print(np.array((1, 2, 3))) 
[1 2 3] 
>>> print(np.zeros((2, 3))) 
[[0. 0. 0.] 
 [0. 0. 0.]] 
>>> print(np.empty((2,)))  # uninitialized memory: the values are arbitrary
[7.74860419e-304 7.74860419e-304]
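
A few more constructors that come up often, shown as a small sketch (the variable names are illustrative):

```python
import numpy as np

ones = np.ones((2, 3), dtype=np.int32)   # all ones, with an explicit dtype
sevens = np.full((2, 2), 7)              # constant fill value
like = np.zeros_like(ones)               # zeros with the shape and dtype of `ones`
grid = np.linspace(0.0, 1.0, 5)          # 5 evenly spaced points, endpoints included

print(ones, sevens, like, grid, sep="\n")
```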

Arrays can be constructed from list comprehensions. Note that a dimension size of -1 tells numpy to infer the element count for that dimension:
>>> x = np.array([(x, y) for x in [1,2,3] for y in [3,1,4] if x != y]) 
>>> print(x) 
[[1 3] 
 [1 4] 
 [2 3] 
 [2 1] 
 [2 4] 
 [3 1] 
 [3 4]] 
>>> x.shape = (2, -1) 
>>> print(x) 
[[1 3 1 4 2 3 2] 
 [1 2 4 3 1 3 4]] 
>>> x.shape = (-1) 
>>> print(x) 
[1 3 1 4 2 3 2 1 2 4 3 1 3 4]

Numpy arrays in list comprehension expressions:
>>> y = np.array([e for e in x if not e % 2]) 
>>> print(y) 
[4 2 2 2 4 4]

Range into 2D array:
>>> x = np.arange(15).reshape(5, -1).T 
>>> print(x) 
[[ 0 3 6 9 12] 
 [ 1 4 7 10 13] 
 [ 2 5 8 11 14]]

Numpy array operations run as compiled C loops over the array's contiguous buffer, so they are much faster than Python list comprehensions:
#%% Timing init 
import time 
x1 = np.arange(1000000) 
x2 = np.arange(1000000) 

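#%% Timing native
t0 = time.time()
for _ in range(10):
    x1 **= 2  # in place; the integers overflow after the first round, only the timing matters
t1 = time.time()
print("x1 time = ", t1 - t0)

#%% Timing list comprehension
t0 = time.time()
for _ in range(10):
    x2 = np.array([x ** 2 for x in x1])  # every element goes through the interpreter
t1 = time.time()
print("x2 time = ", t1 - t0)
Output: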
x1 time = 0.015141725540161133 
x2 time = 4.744795083999634
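
The same comparison can be sketched with the standard timeit module, which handles the repetition. This is an illustrative variant, not the measurement above: it uses a smaller array, an explicit int64 dtype, and a non-in-place square so that both versions compute the same result without overflow:

```python
import timeit
import numpy as np

x = np.arange(100_000, dtype=np.int64)

# Vectorized: a single call, the loop runs in compiled code.
t_vec = timeit.timeit(lambda: x ** 2, number=5)

# List comprehension: every element passes through the Python interpreter.
t_list = timeit.timeit(lambda: np.array([v ** 2 for v in x]), number=5)

print(f"vectorized: {t_vec:.4f} s, comprehension: {t_list:.4f} s")
```

Absolute numbers depend on the machine; the point is the ratio between the two.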

5. Numpy.random

To generate numpy arrays filled with random numbers:
np.random.rand(d0,..dn)     - each value drawn uniformly from the range [0, 1). 
np.random.randn(d0,..dn)    - each value drawn from the standard normal (Gaussian) distribution, where mean = 0 and variance (sigma squared) = 1. 
sigma * np.random.randn(d0,..dn) + mean    - a normal distribution with the given mean and standard deviation sigma.
There are lots of other useful functions, e.g. np.random.choice(...).
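
A short sketch that exercises the functions listed above. The seed, the sample sizes, and the mean and sigma values are arbitrary choices for illustration:

```python
import numpy as np

np.random.seed(0)                          # fixed seed so runs are repeatable

u = np.random.rand(2, 3)                   # uniform on [0, 1), shape (2, 3)
z = np.random.randn(10000)                 # standard normal: mean 0, std 1

mean, sigma = 5.0, 2.0                     # illustrative parameters
g = sigma * np.random.randn(10000) + mean  # normal with mean 5 and std 2

pick = np.random.choice([10, 20, 30], size=4)   # sampling with replacement

print(u.shape, round(g.mean(), 1), round(g.std(), 1), pick)
```

With 10000 samples, the empirical mean and standard deviation of g should land close to 5 and 2.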

6. Array operations

Slicing an array returns a view: a lightweight object that describes the subrange and points at the original array's data, so writing to the slice modifies the original. Note that the stop index (e.g. 10 in [3:10]) is exclusive, it points one past the last selected element:
x1 = np.arange(12)
print(x1)
x1_slice = x1[3:10]
print('x1_slice      = ', x1_slice)
print('x1_slice[0:3] = ', x1_slice[0:3])
x1_slice[0:3] = 100
print(x1)
Output:
[ 0 1 2 3 4 5 6 7 8 9 10 11] 
x1_slice      = [3 4 5 6 7 8 9] 
x1_slice[0:3] = [3 4 5] 
[ 0 1 2 100 100 100 6 7 8 9 10 11]
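
When the original must stay untouched, take an explicit copy of the slice. A minimal sketch contrasting the two (names are illustrative):

```python
import numpy as np

x = np.arange(12)
view = x[3:10]            # view: shares memory with x
copy = x[3:10].copy()     # independent copy of the same elements

copy[:] = -1              # does not touch x
print(x)                  # still 0..11

view[0] = 100             # writes through to x[3]
print(x[3])               # 100
print(view.base is x)     # True: the view's base is the original array
```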

Multi-dim array slices:
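x1 = np.arange(12).reshape(4,-1) 
x1_slice = x1[2:4] 
# x1_slice[0, 1:3] is one expression that yields the [7 8] shown last
print(x1, x1_slice, x1_slice[0, 1:3], sep='\n\n')
Output: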
[[  0  1  2] 
 [  3  4  5] 
 [  6  7  8] 
 [  9 10 11]]

[[ 6 7 8] 
 [ 9 10 11]] 

[7 8]

Slicing across multiple dimensions:
print(x1[:2, 1:3]) 
x1[:, 1:2] = 100 
print(x1)
Output:
[[1 2] 
 [4 5]] 

[[  0 100  2]  
 [  3 100  5] 
 [  6 100  8] 
 [  9 100 11]]
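
One detail worth a sketch: an integer index drops the axis, while a one-element slice such as 1:2 keeps it. The names below are illustrative:

```python
import numpy as np

x = np.arange(12).reshape(4, -1)

col_1d = x[:, 1]      # integer index drops the axis  -> shape (4,)
col_2d = x[:, 1:2]    # slice keeps the axis          -> shape (4, 1)

print(col_1d.shape, col_2d.shape)
print(col_1d)         # [ 1  4  7 10]
```

Both results are views of x, so assigning through either one modifies the original array.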

7. Boolean indexing

Taken and modified from the pydata-book.
A comparison applied to an array returns an array of element-wise boolean results. A boolean array used as an index picks the elements where the mask is True.
The boolean array's length must equal the length of the axis it indexes:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Joe']) 
data = np.random.randint(1, 10, (5, 4)) 
print(names, data, sep='\n\n')
mask = (names == 'Bob') | (names == 'Will') 
print(mask, data[mask], sep='\n\n')
data[mask, 1:3] = 100 
print(data)
data[data > 5] = 0 
print(data)
Output:
['Bob' 'Joe' 'Will' 'Bob' 'Joe'] 

[[1 4 3 6] 
 [6 1 3 3] 
 [7 1 6 1] 
 [3 3 9 4] 
 [4 7 8 9]] 

[ True False True True False] 

[[1 4 3 6] 
 [7 1 6 1] 
 [3 3 9 4]] 

[[  1 100 100   6] 
 [  6   1   3   3] 
 [  7 100 100   1] 
 [  3 100 100   4] 
 [  4   7   8   9]] 

[[1 0 0 0] 
 [0 1 3 3] 
 [0 0 0 1] 
 [3 0 0 4] 
 [4 0 0 0]]
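
Two related points, sketched with deterministic data instead of random numbers: selecting with a mask returns a copy (unlike a slice), and ~ inverts a mask. np.where is shown as a non-destructive alternative to in-place assignment; the variable names are illustrative:

```python
import numpy as np

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Joe'])
data = np.arange(20).reshape(5, 4)

mask = names == 'Joe'
picked = data[mask]        # rows 1 and 4, returned as a copy
picked[:] = -1             # modifying the copy ...
print(data[1])             # ... leaves data untouched: [4 5 6 7]

print(data[~mask].shape)   # ~ inverts the mask: the 3 non-Joe rows -> (3, 4)

# np.where builds a new array instead of assigning in place.
clipped = np.where(data > 10, 0, data)
print(clipped[2])          # [ 8  9 10  0]
```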

8. Fancy indexing

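Passing lists of integers as indices selects elements by position. The lists are paired up, so [[1, 4, 2, 2], [0, 3, 1, 2]] picks elements (1, 0), (4, 3), (2, 1) and (2, 2). This allows assembling a new array from any combination of elements of another array:
x1 = np.arange(20).reshape((5, 4)) 
print(x1)
x2 = x1[[1, 4, 2, 2], [0, 3, 1, 2]] 
print(x2)
x1[[1, 4, 2, 2], [0, 3, 1, 2]] = 100 
print(x1)
Output: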
[[ 0  1  2  3] 
 [ 4  5  6  7] 
 [ 8  9 10 11] 
 [12 13 14 15] 
 [16 17 18 19]] 

[ 4 19 9 10] 

[[  0   1   2   3] 
 [100   5   6   7] 
 [  8 100 100  11] 
 [ 12  13  14  15] 
 [ 16  17  18 100]]
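
Paired index lists pick individual elements; to select a rectangular block (every listed row crossed with every listed column), np.ix_ can be used. A minimal sketch that also shows that fancy indexing returns a copy rather than a view:

```python
import numpy as np

x = np.arange(20).reshape(5, 4)

pairs = x[[1, 4], [0, 3]]           # elements (1, 0) and (4, 3) -> [ 4 19]
block = x[np.ix_([1, 4], [0, 3])]   # rows 1, 4 crossed with columns 0, 3
print(pairs)
print(block)                        # rows [4 7] and [16 19]

pairs[0] = -1                       # fancy indexing returned a copy ...
print(x[1, 0])                      # ... so x is unchanged: 4
```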

9. Transposition

A transposed matrix is a view of the same data (no copying takes place), so writing to it changes the original:
x1 = np.arange(15).reshape((3, 5)) 
print(x1)
x2 = x1.T 
print(x2)
x2[2] = 100 
print(x1)
Output:
[[ 0  1  2  3  4] 
 [ 5  6  7  8  9] 
 [10 11 12 13 14]] 

[[ 0 5 10] 
 [ 1 6 11] 
 [ 2 7 12] 
 [ 3 8 13] 
 [ 4 9 14]] 

[[  0  1 100  3  4] 
 [  5  6 100  8  9] 
 [ 10 11 100 13 14]]
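
The no-copy behavior can be checked directly: the transpose only swaps the strides over the same buffer. A small sketch, with np.ascontiguousarray shown as one way to get an independent row-major copy when that is needed:

```python
import numpy as np

x = np.arange(15).reshape(3, 5)
t = x.T

print(np.shares_memory(x, t))    # True: same buffer
print(x.strides, t.strides)      # strides are swapped, data is not moved
print(t.flags['C_CONTIGUOUS'])   # False: t is no longer row-major

c = np.ascontiguousarray(t)      # explicit copy in row-major order
print(np.shares_memory(x, c))    # False
```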



