A Gentle Introduction to Python

Author

Ahmed Osman

Published

January 24, 2024

Introduction

Economists have widely used Stata for the last 30 years to analyse economic data. Whether they are researching school selection, minimum wage, GDP, or stock trends, Stata provides all the statistics, graphics, and data-management tools needed to pursue a broad range of economic questions.

Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale.

https://xkcd.com/353/

The core philosophy of the language is summarized by the document “PEP 20 (The Zen of Python)” as:

  • Beautiful is better than ugly
  • Explicit is better than implicit
  • Simple is better than complex
  • Complex is better than complicated
  • Readability counts

Python is compact and readable. Programs written in Python are typically much shorter than equivalent C, C++, or Java programs. Python is extensible: if you know how to program in C, it is easy to add a new built-in function or module to the interpreter, for example to perform critical operations at maximum speed.

Why should we use Python?

  • Open source!
  • Easy to learn and use, even with no programming experience.
  • Rich ecosystem for modeling and data analysis

Fun fact!

The language is named after the BBC show Monty Python’s Flying Circus (1969-1974) and has nothing to do with reptiles.

Python modules

  • Python has a large standard library, commonly cited as one of Python’s greatest strengths, providing tools suited to many tasks.
  • The standard library is not essential to run Python or embed Python within an application.
  • The Python Package Index (PyPI), the official repository for third-party Python software, contains over 350,000 packages with a wide range of functionality.
  • Notable applications: web scraping, scientific computing, text processing, image processing.

What is this course not about?

This is a short introduction to Python for a data science course for economists. Topics that will not be discussed in this course are:

  • Package development
  • Efficient coding (partly, we will discuss reproducibility)
  • Web development
  • Geospatial analysis

Python from scratch

Installing Python

Python scripts

Let’s start with the most basic code humans ever made:

print("Hello World!")
print("hello \n World!") #save this script as a hello.py file
Hello World!
hello 
 World!

To run a Python file from the command line (CMD), type the following (remember our Git/Shell class):

  $ python hello.py

Bravo, you made your first Python program.

Basics

5+5
10
10-5
5
25/5
5.0
(1+4)*6
30
x = 2024
print(x)
y = "I love Pizza"
print(y)
2024
I love Pizza
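
The division above returned 5.0 rather than 5 because / always produces a float. A short sketch of the other arithmetic operators follows; the variable names are chosen only for this illustration:

```python
# Illustrative names; not part of the original lesson
true_div = 25 / 5      # true division always returns a float
floor_div = 25 // 4    # floor division drops the fractional part
remainder = 25 % 4     # modulo gives the remainder
power = 2 ** 10        # exponentiation

print(true_div, floor_div, remainder, power)
print(type(true_div), type(floor_div))
```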

A comment is made with the # character. Python ignores everything on a line that follows a #. Let’s practice making comments:

i = 26 #Assigned the value 26 to variable i
j = 13
# we can add them
i+j
j-i
i/j
2.0

Strings

Python can also manipulate strings, which can be expressed in several ways. They can be enclosed in single quotes ('...') or double quotes ("...") with the same result.

FirstName = " Ahmed" #leave 1 space to make it fancy
LastName = "Osman"
FullName = LastName + FirstName
print(FirstName)
print(LastName)
print(FullName)
 Ahmed
Osman
Osman Ahmed

len() is a built-in function in Python that returns the length of a string.

print(len(FullName))
11

Two or more string literals (i.e. the ones enclosed between quotes) next to each other are automatically concatenated. This feature is particularly useful when you want to break long strings:

"Ahmed" " Osman"
'Ahmed Osman'
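
As a small sketch of breaking a long string, adjacent literals can be placed on separate lines inside parentheses. The variable name message is only an illustration:

```python
# Adjacent literals inside parentheses are joined into one string
message = (
    "Python is a widely used, "
    "general-purpose, "
    "high-level programming language."
)
print(message)
print(len(message))
```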

Or add letters to form a word like the following:

"A"+"h"+"m"+"e"+"d"
'Ahmed'

Lists

Python includes various compound data types for grouping together other values. Among these, the list stands out as the most flexible. It is represented by square brackets enclosing comma-separated values (items). Lists can comprise items of varying types, although typically, items within a list share the same type.

values = [1,5,7,9,12]
print(values)
print(len(values))
print(values[0]) # Python indexing starts at 0 (remember the [[ ]] case in R)
[1, 5, 7, 9, 12]
5
1

Lists are a mutable type, i.e. it is possible to change their content:

values[2]=100
values
[1, 5, 100, 9, 12]

You can also add new items at the end of the list by using the append() method:

values.append(15)
values
[1, 5, 100, 9, 12, 15]
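
Beyond single indices, lists also support slicing and negative indices, and they may hold items of different types. A minimal sketch, where the names first_two, last_item and mixed are chosen for this example:

```python
values = [1, 5, 100, 9, 12, 15]   # the list as it stands after append(15)

first_two = values[0:2]   # slice: start index included, stop index excluded
last_item = values[-1]    # negative indices count from the end
mixed = [2024, "I love Pizza", 3.5]  # items of different types are allowed

print(first_two)
print(last_item)
print([type(item).__name__ for item in mixed])
```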

More Python Concepts

Functions

The Python interpreter incorporates a variety of built-in functions that are consistently accessible. These functions are presented here in alphabetical order. You can use the help() function, for example help(abs), to access detailed information about a specific function.

'abs() aiter() all() any() anext() ascii() bin() bool() breakpoint() bytearray() bytes() callable() chr() classmethod() compile() complex() delattr() dict() dir() divmod() enumerate() eval() exec() filter() float() format() frozenset() getattr() globals() hasattr() hash() help() hex() id() input() int() isinstance() issubclass() iter() len() list() locals() map() max() memoryview() min() next() object() oct() open() ord() pow() print() property() range() repr() reversed() round() set() setattr() slice() sorted() staticmethod() str() sum() super() tuple() type() vars() zip() __import__()'
  • We can create a function that writes the Fibonacci series to an arbitrary boundary.
  • The function should also include a proper help with the docstring
    • The first line should always be a short, concise summary of the object’s purpose.
    • If there are more lines in the documentation string, the second line should be blank, visually separating the summary from the rest of the description.
    • The following lines should be one or more paragraphs describing the object’s calling conventions, its side effects, etc.
# 0, 1, 1, 2, 3, 5,
def fib(n):    # write Fibonacci series up to n
    """Print a Fibonacci series up to n.
    
    par n: integer, upper bound (exclusive) of the series
    out  : None, the series is printed
    """ # the function help
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        #  better than print(a), why?
        a, b = b, a+b
help(fib)
Help on function fib in module __main__:

fib(n)
    Print a Fibonacci series up to n.

    par n: integer, upper bound (exclusive) of the series
    out  : None, the series is printed
fib(4)
0 1 1 2 3 
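
Because fib prints its result, the values cannot be reused afterwards. One possible variant, here called fib_list (a name chosen for this sketch, not part of the course code), collects the numbers in a list and returns it:

```python
def fib_list(n):
    """Return the Fibonacci numbers smaller than n as a list."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)      # store the value instead of printing it
        a, b = b, a + b
    return result

series = fib_list(4)
print(series)
```

The returned list can then be passed to other functions such as len() or sum().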

Generally, functions are structured like:

def function_name(inputs):
    # step 1
    # step 2
    # ...
    return outputs

def mean(numbers):
    total = sum(numbers)
    N = len(numbers)
    answer = total / N

    return answer

x = [1,2,3,4,5,6]
print(mean(x))
3.5

If we save the function to a file named mycollections.py, we could import the function as follows:

import mycollections
mycollections.fib(200)

from mycollections import fib
fib(2)
0 1 1 2 3 5 8 13 21 34 55 89 144 0 1 1 

Example from economics

Production functions are valuable tools for representing the economic activities of firms generating goods or the total output within an economy. Despite using the term “function” in a mathematical context, we will closely link the conceptualization of mathematical functions with the implementation of Python functions.

\[ Y=F(K,L) \]

Cobb-Douglas production functions can help us understand how to create Python functions and why they are useful.

\[ Y=zK^\alpha L^{1-\alpha} \]

The function is parameterized by:

  • A parameter \(\alpha \in [0,1]\), called the “output elasticity of capital”.
  • A value \(z\) called the Total Factor Productivity (TFP).

Now let’s define the Cobb-Douglas function which computes the output production with parameters \(z\)=1 and \(\alpha\)=0.33 and takes the input \(K\) and \(L\):

def cobb_douglas(K,L):
  #create alpha and z
  z=1
  alpha=0.33
  return z*K**alpha*L**(1-alpha)

Now we can use this function and do the computations as follows:

cobb_douglas(1.0,0.5)
0.6285066872609142
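
The function above fixes \(z\) and \(\alpha\) inside its body. A possible alternative, sketched here under the assumption that we want to vary them, exposes both as keyword arguments with the same default values. The name cobb_douglas_general is hypothetical:

```python
def cobb_douglas_general(K, L, alpha=0.33, z=1):
    """Cobb-Douglas output for capital K and labor L."""
    return z * K**alpha * L**(1 - alpha)

print(cobb_douglas_general(1.0, 0.5))             # same defaults as above
print(cobb_douglas_general(1.0, 0.5, alpha=0.5))  # a different elasticity
print(cobb_douglas_general(1.0, 0.5, z=2))        # doubled productivity
```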

Now, your turn: define a function that compares two Cobb-Douglas outputs, where the second uses inputs scaled by \(\gamma\), and returns the ratio between the two outputs divided by \(\gamma\).

\[ Y_2=F(K_2,L_2)=F(\gamma K_1,\gamma L_1) \]

Hint : Use the same function to calculate all the functions.

Solution:

def ratio_CD (K, L, gamma):
  y1 = cobb_douglas(K,L)
  y2 = cobb_douglas(gamma*K,gamma*L)
  y_ratio = y2/y1
  return y_ratio/gamma

print(ratio_CD(1,0.5,2))
1.0

Iteration

Types of iteration in Python:

  • While loop –> a condition-based iteration
  • For loop –> an iteration technique that traverses iterable objects
  • Recursion –> a programming technique where a function calls itself to solve a problem by breaking it down into smaller instances of the same problem
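
Recursion can be illustrated with a small sketch. The factorial function below is a standard textbook example chosen here for illustration:

```python
def factorial(n):
    """Return n! for a non-negative integer n, computed recursively."""
    if n <= 1:          # base case stops the recursion
        return 1
    return n * factorial(n - 1)   # the function calls itself on a smaller input

print(factorial(5))
```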

Before jumping into iteration, let’s see how if statements are evaluated in Python.

# if condition:
    # code to run when condition is True
# else:
    # code to run if no conditions above are True
    # return or print something

if 1 < 2:
  print(" 1 is less than 2")
 1 is less than 2
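
A fuller sketch with if, elif and else branches might look as follows. The function describe_growth and its thresholds are invented purely for illustration:

```python
def describe_growth(rate):
    """Classify a growth rate (in percent); thresholds are purely illustrative."""
    if rate < 0:
        return "contraction"
    elif rate == 0:
        return "stagnation"
    else:
        return "expansion"

print(describe_growth(-1.2))
print(describe_growth(2.5))
```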

Suppose we wanted to print out the first 10 integers and their squares. We could write each line by hand, as shown here for the first six:

print(f"1**2={1**2}")
print(f"2**2={2**2}")
print(f"3**2={3**2}")
print(f"4**2={4**2}")
print(f"5**2={5**2}")
print(f"6**2={6**2}")
1**2=1
2**2=4
3**2=9
4**2=16
5**2=25
6**2=36

For loops can do the same with less coding effort.

for i in range(1,11):
  print(f"{i}**2={i**2}")
1**2=1
2**2=4
3**2=9
4**2=16
5**2=25
6**2=36
7**2=49
8**2=64
9**2=81
10**2=100
# for item in iterable:
   # operation 1 with item
   # operation 2 with item
   # ...
   # operation N with item
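
The same pattern works for any iterable, not only range(). As a sketch, the loop below walks through the values list as it was first defined in the Lists section and keeps a running total. enumerate() is a built-in that yields each position together with its item:

```python
values = [1, 5, 7, 9, 12]

total = 0
for position, value in enumerate(values):
    total = total + value
    print(f"item {position}: {value}, running total: {total}")
```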

While loops:

i = 0
while i < 3:
    print(i)
    i = i + 1
print("done")
0
1
2
done

Suppose we wanted to know the smallest N such that \(\sum_{i=0}^{N} i > 1000\):

total = 0
i = 0
while total <= 1000:
    i = i + 1
    total = total + i

print("The answer is", i)
The answer is 45
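
One way to double-check the loop's answer is to compare the partial sums directly, using the built-in sum() and range() together with the closed form N(N+1)/2. This is only a verification sketch:

```python
N = 45
below = sum(range(N))        # sum of 0..44
at_n = sum(range(N + 1))     # sum of 0..45
closed_form = N * (N + 1) // 2

print(below, at_n, closed_form)
```

The sum up to 44 stays at or below 1000 while the sum up to 45 exceeds it, which agrees with the while loop.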

This takes us to more technical topics such as data analysis and statistical modelling, using both built-in functions and libraries. Libraries offer a more efficient way of using Python because they execute quickly. Pandas and NumPy are among the most frequently used and best-known Python libraries.

Data Analysis

NumPy Arrays

NumPy is a powerful Python package for manipulating data with multi-dimensional vectors. Its versatility and speed make Python an ideal language for applied and computational mathematics. NumPy’s core contribution is a new data type called an array. An array is similar to a list, but NumPy imposes some additional restrictions on how the data inside is organized.

# to install: pip install numpy 
import numpy as np # np is an alias; this is the standard way of importing NumPy
x_1d = np.array([1, 2, 3])
print(x_1d)
[1 2 3]

This is a one-dimensional array, much like a list of numbers, and it supports the same indexing we saw for lists, as well as slicing.

print(x_1d[0])
print(x_1d[0:2])
print(x_1d[0:3] == x_1d[:])
1
[1 2]
[ True  True  True]
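
Indexing extends naturally to two dimensions, with one index per axis separated by a comma. A brief sketch, where the array x_2d is made up for this example:

```python
import numpy as np

x_2d = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(x_2d[0, 1])    # row 0, column 1
print(x_2d[:, 0])    # every row, first column
print(x_2d[1, 1:])   # row 1, from column 1 onwards
```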

NumPy arrays act like mathematical vectors and matrices: + and * perform component-wise addition or multiplication.

x, y = np.array([1, 2, 3]), np.array([4, 5, 6])
print(x, y)
print(x + 10) # Add 10 to each entry of x.
print(x * 4) # Multiply each entry of x by 4.
x + y
[1 2 3] [4 5 6]
[11 12 13]
[ 4  8 12]
array([5, 7, 9])
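
This behaviour differs from plain Python lists, where + concatenates and * repeats. The following sketch contrasts the two; the names a_list and an_array are illustrative:

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array(a_list)

print(a_list + a_list)      # lists are concatenated
print(a_list * 2)           # lists are repeated
print(an_array + an_array)  # arrays are added entry by entry
print(an_array * 2)         # every entry is multiplied by 2
```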

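Example: Write a function that converts the following matrix to a NumPy array and returns \(-A^3+9A^2-15A\). Note that the solution shown here applies the powers entry by entry, because ** on a NumPy array is element-wise rather than a matrix power.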

A=[
  [3,1,4],
  [1,5,9],
  [-5,3,1]
]
print(A)

# To print the matrix A in a nicer form you can use a for loop (of course, there are many ways to do it with packages)
for row in A:
  print(row)
[[3, 1, 4], [1, 5, 9], [-5, 3, 1]]
[3, 1, 4]
[1, 5, 9]
[-5, 3, 1]
import numpy as np
def function_numpy(A):
  A_2 = np.array(A)  # convert the nested list to a NumPy array
  # ** is element-wise on arrays, so each entry is transformed separately
  return -A_2**3 + 9*A_2**2 - 15*A_2

function_numpy(A)
array([[   9,   -7,   20],
       [  -7,   25, -135],
       [ 425,    9,   -7]])
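
If the expression is instead read as a matrix polynomial, the powers have to be matrix products. A sketch of that reading, using the @ operator and np.linalg.matrix_power (the names A2, A3 and result are chosen for this example):

```python
import numpy as np

A = np.array([[3, 1, 4],
              [1, 5, 9],
              [-5, 3, 1]])

A2 = A @ A                           # matrix product A·A
A3 = np.linalg.matrix_power(A, 3)    # matrix power A·A·A
result = -A3 + 9 * A2 - 15 * A
print(result)
```

For this particular matrix the matrix version evaluates to the zero matrix, unlike the element-wise result shown above.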

NumPy arrays have several attributes, some of which are listed below.

NumPy attributes
Attribute   Description
dtype       The type of the elements in the array.
ndim        The number of axes (dimensions) of the array.
shape       A tuple of integers indicating the size in each dimension.
size        The total number of elements in the array.
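
A quick sketch of reading these attributes from a small, made-up array:

```python
import numpy as np

x_2d = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

print(x_2d.dtype)   # element type
print(x_2d.ndim)    # number of axes
print(x_2d.shape)   # size along each axis
print(x_2d.size)    # total number of elements
```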

Examples

# Create an array of the first seven integers 
np.arange(7)
# Create an array of floats from 1 to 12
np.arange(1.,13.)
# Create an array of values between 0 and 20, stepping by 2
np.arange(0,20,2)
array([ 0,  2,  4,  6,  8, 10, 12, 14, 16, 18])

For plotting NumPy arrays, we can use Matplotlib, the main visualization package for Python.

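import matplotlib.pyplot as plt
%matplotlib inline
# (IPython/Jupyter magic to show plots in the notebook; omit it in a plain .py script)

# Step 1: create a figure and a set of axes
fig, ax = plt.subplots()

# Step 2: generate the data to plot
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

# Step 3: draw the line on the axes
ax.plot(x, y)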

A more advanced plot

N = 50

np.random.seed(42)

x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2  # 0 to 15 point radii

fig, ax = plt.subplots()

ax.scatter(x, y, s=area, c=colors, alpha=0.5)

ax.annotate(
    "Starting point", xy=(x[0], y[0]), xycoords="data",
    xytext=(25, -25), textcoords="offset points",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=0.6")
)
Text(25, -25, 'Starting point')

Data Wrangling with Pandas

Pandas is at the top of the “scientific stack”, because it allows data to be imported, manipulated and exported so easily. In contrast, NumPy supports the bottom of the stack with fundamental infrastructure for array operations, mathematical calculations, and random number generation.

Pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with relational or labeled data both easy and intuitive. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.

Pandas is well suited for:

  • Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
  • Ordered and unordered (not necessarily fixed-frequency) time series data.
  • Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
  • Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

Key features of Pandas

  • Easy handling of missing data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the data can be aligned automatically
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes
  • Robust IO tools for loading data from flat files, Excel files, databases, and HDF5
  • Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
# To install use !pip install pandas -U
import pandas as pd

Pandas Data Structures

Series

The first pandas data structure is a Series. A Series is a one-dimensional array that can hold any datatype, similar to an ndarray. However, a Series has an index that gives a label to each entry. Typically a Series contains information about one feature of the data. For example, the data in a Series might show a class’s grades on a test, and the index would indicate each student in the class. To initialize a Series, the first parameter is the data and the second is the index.

import pandas as pd
import numpy as np
# let's initialize Series of student grades
math = pd.Series(np.random.randint(0,100,4),['Alisha','Monica','Joseph','Eva'])
english = pd.Series(np.random.randint(0,100,5),['Alisha','Monica','Yusuf','Eva', 'Gia'])
print(math, "\n")
english # every time you run this you will get different random numbers because of the randint function
Alisha    23
Monica    74
Joseph    71
Eva       35
dtype: int32 

Alisha    37
Monica    83
Yusuf     98
Eva       88
Gia       98
dtype: int32

We created random Series with labels (here, names), which can be used to look up specific values, as in the following case:

print(math['Eva']) 
english['Alisha']
35
37

Use .index to find the labels of the data

english.index
Index(['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'], dtype='object')

Remember for loops! A compact form of them, the list comprehension, can be used to identify students whose names end with the letter a:

a = [name.endswith('a') for name in english.index] #only shows true and false
print(a) 
english[[name.endswith('a') for name in english.index]] #for extraction use this one
[True, True, False, True, True]
Alisha    37
Monica    83
Eva       88
Gia       98
dtype: int32
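
The same selection can also be written without an explicit loop, using the vectorized string methods available on a pandas index through .str. In this sketch the grades are copied from the output above, since the original values are random:

```python
import pandas as pd

# Grades copied from the printed output above; the original values are random
english = pd.Series([37, 83, 98, 88, 98],
                    index=['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'])

mask = english.index.str.endswith('a')   # boolean array, one entry per label
print(mask)
print(english[mask])
```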

DataFrame

The second key pandas data structure is a DataFrame. A DataFrame is a collection of multiple Series. It can be thought of as a 2-dimensional array, where each row is a separate datapoint and each column is a feature of the data. The rows are labeled with an index (as in a Series) and the columns are labeled in the attribute columns.

grades = pd.DataFrame({"Math": math, "English":english})
grades
        Math  English
Alisha  23.0     37.0
Eva     35.0     88.0
Gia      NaN     98.0
Joseph  71.0      NaN
Monica  74.0     83.0
Yusuf    NaN     98.0
grades.columns
Index(['Math', 'English'], dtype='object')
b = grades.dropna()
b
        Math  English
Alisha  23.0     37.0
Eva     35.0     88.0
Monica  74.0     83.0
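
Dropping rows is not the only way to deal with missing values. As a sketch, the block below rebuilds the grades table with the values printed above (the originals are random), fills the gaps with fillna(), and adds a row-wise average. The column name Average and the choice of 0 as a fill value are assumptions made for illustration:

```python
import pandas as pd

math = pd.Series([23, 74, 71, 35], index=['Alisha', 'Monica', 'Joseph', 'Eva'])
english = pd.Series([37, 83, 98, 88, 98],
                    index=['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'])
grades = pd.DataFrame({"Math": math, "English": english})

filled = grades.fillna(0)   # treat a missing test as a zero
# mean(axis=1) averages across columns and skips NaN by default
grades["Average"] = grades[["Math", "English"]].mean(axis=1)
print(filled)
print(grades)
```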

Here is a non-exhaustive list of Pandas functions for data reading, writing, manipulation and statistical operations.

Pandas common functions
Method Returns
to_csv() Write the index and entries to a CSV file
read_csv() Read a csv and convert into a DataFrame
to_json() Convert the object to a JSON string
to_pickle() Serialize the object and store it in an external file
to_sql() Write the object data to an open SQL database
read_html() Read a table in an html page and convert to a DataFrame
append() Concatenate two or more Series (removed in pandas 2.0; use pd.concat() instead)
drop() Remove the entries with the specified label or labels
drop_duplicates() Remove duplicate values
dropna() Drop null entries
fillna() Replace null entries with a specified value or strategy
reindex() Replace the index
sample() Draw a random entry
shift() Shift the index
unique() Return unique values
abs() Object with absolute values taken (of numerical data)
idxmax() The index label of the maximum value
idxmin() The index label of the minimum value
count() The number of non-null entries
cumprod() The cumulative product over an axis
cumsum() The cumulative sum over an axis
max() The maximum of the entries
mean() The average of the entries
median() The median of the entries
min() The minimum of the entries
mode() The most common element(s)
prod() The product of the elements
sum() The sum of the elements
var() The variance of the elements

Example

titanic = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/course-python-data/main/notebooks/data/titanic.csv")
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Using pandas, we can calculate:

  1. Age distribution
  2. Survival rate by sex and by passenger class
  3. Number of records (rows) in the dataset
  4. Average fare by class
  5. Survival rate of young people (age <= 25)
titanic['Age'].hist()

titanic.groupby('Sex')['Survived'].mean()
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
titanic.groupby('Pclass')['Survived'].mean().plot.bar()

len(titanic)
891
titanic.groupby('Pclass')['Fare'].mean()
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
titanic[titanic['Age']<=25]['Survived'].mean() # Or you can make a new df and then calculate the mean
0.4119601328903654
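
The groupby calls above split the data by one column at a time. Grouping by two columns at once is also possible by passing a list of column names. The sketch below uses a tiny made-up table so that it runs without downloading anything; its rows are NOT taken from the Titanic dataset:

```python
import pandas as pd

# Made-up rows for illustration only; NOT taken from the Titanic dataset
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female"],
    "Pclass":   [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1],
})

rates = toy.groupby(["Sex", "Pclass"])["Survived"].mean()
print(rates)
print(rates.unstack())   # classes as columns, sexes as rows
```

The same call pattern should apply to the titanic DataFrame loaded above, with the real column values.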