Economists have widely used Stata for the last 30 years to analyse economic data. Whether they are researching school selection, minimum wage, GDP, or stock trends, Stata provides all the statistics, graphics, and data-management tools needed to pursue a broad range of economic questions.
Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale.
The core philosophy of the language is summarized by the document “PEP 20 (The Zen of Python)” as:
Python is compact and readable. Programs written in Python are typically much shorter than equivalent C, C++, or Java programs. Python is extensible: if you know how to program in C, it is easy to add a new built-in function or module to the interpreter, for example to perform critical operations at maximum speed.
The language is named after the BBC show Monty Python’s Flying Circus (1969-1974) and has nothing to do with reptiles.
Python has a large standard library, commonly cited as one of Python’s greatest strengths, providing tools suited to many tasks.
Python within an application.
The Python Package Index (PyPI), the official repository for third-party Python software, contains over 350,000 packages with a wide range of functionality.
This is a short introduction to Python for the data science for economists course. Topics that will not be discussed in this course are:
Python data science and machine learning on a single machine.
Let’s start with the most basic code humans ever made:
print("Hello World!")
print("hello \n World!") #save this script as a hello.py file
Hello World!
hello
World!
To run a Python file from the command line (CMD), type the following (remember our Git/Shell class):
$ python hello.py
BravoPython
program.
5+5
10
10-5
5
25/5
5.0
(1+4)*6
30
x = 2024
print(x)
y = "I love Pizza"
print(y)
2024
I love Pizza
A comment is made with the # character. Python ignores everything in a line that follows a #. Let’s practice making comments:
i = 26 # Assigned the value 26 to variable i
j = 13
# we can add them
i+j
j-i
i/j
2.0
Python can also manipulate strings, which can be expressed in several ways. They can be enclosed in single quotes (‘…’) or double quotes (“…”) with the same result.
FirstName = " Ahmed" # leave 1 space to make it fancy
LastName = "Osman"
FullName = LastName + FirstName
print(FirstName)
print(LastName)
print(FullName)
Ahmed
Osman
Osman Ahmed
len() is a built-in function in Python that returns the length of a string.
print(len(FullName))
11
Two or more string literals (i.e. the ones enclosed between quotes) next to each other are automatically concatenated. This feature is particularly useful when you want to break long strings:
"Ahmed" " Osman"
'Ahmed Osman'
Or add letters to form a word like the following:
"A"+"h"+"m"+"e"+"d"
'Ahmed'
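As a small sketch of the string operations above, the snippet below builds the same full name in two other common ways: an f-string and str.join. The variable names are illustrative and not part of the course material.

```python
first_name = "Ahmed"
last_name = "Osman"

# An f-string inserts the variables directly into the text.
full_name = f"{last_name} {first_name}"

# str.join glues a list of strings together with a separator.
joined = " ".join([last_name, first_name])

print(full_name)            # Osman Ahmed
print(len(full_name))       # 11
print(full_name == joined)  # True
```

Both forms avoid having to remember the leading space that plain + concatenation needs.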
Python includes various compound data types for grouping together other values. Among these, the list stands out as the most flexible. It is represented by square brackets enclosing comma-separated values (items). Lists can comprise items of varying types, although typically, items within a list share the same type.
values = [1,5,7,9,12]
print(values)
print(len(values))
print(values[0]) #remember [[]] case of R
[1, 5, 7, 9, 12]
5
1
Lists are a mutable type, i.e. it is possible to change their content:
values[2]=100
values
[1, 5, 100, 9, 12]
You can also add new items at the end of the list, by using the append() method
values.append(15)
values
[1, 5, 100, 9, 12, 15]
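Because lists are mutable, it matters whether two names refer to the same list or to separate copies. The following is a minimal sketch of slicing, aliasing and copying; the names alias and copy_of_values are illustrative only.

```python
values = [1, 5, 7, 9, 12]

# Slicing returns a new list; the end index is excluded.
first_three = values[0:3]
last_item = values[-1]

# A second name for the same list: changes show up under both names.
alias = values
alias[2] = 100

# A copy is independent of the original list.
copy_of_values = values.copy()
copy_of_values.append(15)

print(first_three)      # [1, 5, 7]
print(last_item)        # 12
print(values)           # [1, 5, 100, 9, 12]
print(copy_of_values)   # [1, 5, 100, 9, 12, 15]
```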
The Python interpreter incorporates a variety of built-in functions that are consistently accessible. These functions are presented here in alphabetical order. You can use the help() function, for example help(abs), to access detailed information about a specific function.
'abs() aiter() all() any() anext() ascii() bin() bool() breakpoint() bytearray() bytes() callable() chr() classmethod() compile() complex() delattr() dict() dir() divmod() enumerate() eval() exec() filter() float() format() frozenset() getattr() globals() hasattr() hash() help() hex() id() input() int() isinstance() issubclass() iter() len() list() locals() map() max() memoryview() min() next() object() oct() open() ord() pow() print() property() range() repr() reversed() round() set() setattr() slice() sorted() staticmethod() str() sum() super() tuple() type() vars() zip() __import__()'
docstring
# 0, 1, 1, 2, 3, 5,
def fib(n): # write Fibonacci series up to n
    """Print a Fibonacci series up to n.
    par n: integer
    out : list
    """ # the function help
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        # better than print(a), why?
        a, b = b, a+b
help(fib)
Help on function fib in module __main__:
fib(n)
Print a Fibonacci series up to n.
par n: integer
out : list
fib(4)
0 1 1 2 3
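The fib function above prints its values and returns nothing, so the result cannot be reused later. As a hedged variation (fib_list is a name introduced here for illustration), the same loop can collect the numbers in a list and return it:

```python
def fib_list(n):
    """Return a list with the Fibonacci numbers smaller than n."""
    result = []
    a, b = 0, 1
    while a < n:
        result.append(a)   # store the value instead of printing it
        a, b = b, a + b
    return result

numbers = fib_list(4)
print(numbers)        # [0, 1, 1, 2, 3]
print(sum(numbers))   # 7
```

Returning a value rather than printing it lets other code, such as sum() here, work with the result.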
Generally, functions are structured like this:
def function_name(inputs):
    # step 1
    # step 2
    # ...
    return outputs
def mean(numbers):
    total = sum(numbers)
    N = len(numbers)
    answer = total / N

    return answer
x = [1,2,3,4,5,6]
print(mean(x))
3.5
If we save the function to a file named mycollections.py, we could import the function as follows:
import mycollections

mycollections.fib(200)
from mycollections import fib
fib(2)
0 1 1 2 3 5 8 13 21 34 55 89 144 0 1 1
Production functions are valuable tools for representing the economic activities of firms generating goods or the total output within an economy. Despite using the term “function” in a mathematical context, we will closely link the conceptualization of mathematical functions with the implementation of Python functions. \[ Y=F(K,L) \]
Cobb-Douglas production functions can help us understand how to create Python functions and why they are useful.
\[ Y=zK^\alpha L^{1-\alpha} \]
The function is parameterized by:
Now let’s define the Cobb-Douglas function which computes the output production with parameters \(z\)=1 and \(\alpha\)=0.33 and takes the input \(K\) and \(L\):
def cobb_douglas(K,L):
    # create alpha and z
    z=1
    alpha=0.33
    return z*K**alpha*L**(1-alpha)
Now we can use this function and do the computations as follows:
cobb_douglas(1.0,0.5)
0.6285066872609142
Now, your turn: define a function that compares two Cobb-Douglas functions and returns the ratio between the two production functions.
\[ Y_2=F(K_2,L_2)=F(\gamma K_1,\gamma L_1) \]
Hint : Use the same function to calculate all the functions.
Solution:
def ratio_CD(K, L, gamma):
    y1 = cobb_douglas(K,L)
    y2 = cobb_douglas(gamma*K,gamma*L)
    y_ratio = y2/y1
    return y_ratio/gamma
print(ratio_CD(1,0.5,2))
1.0
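The result 1.0 reflects constant returns to scale: since \(\gamma^\alpha \gamma^{1-\alpha} = \gamma\), scaling both inputs by \(\gamma\) scales output by \(\gamma\), whatever the value of \(\alpha\). The sketch below checks this for several values of \(\alpha\). Passing z and alpha as keyword arguments, and the name returns_to_scale, are assumptions made here for illustration and not part of the original solution.

```python
def cobb_douglas(K, L, z=1.0, alpha=0.33):
    """Output Y = z * K**alpha * L**(1 - alpha)."""
    return z * K**alpha * L**(1 - alpha)

def returns_to_scale(K, L, gamma, **params):
    """Output growth divided by input growth when K and L are both scaled by gamma."""
    y1 = cobb_douglas(K, L, **params)
    y2 = cobb_douglas(gamma * K, gamma * L, **params)
    return (y2 / y1) / gamma

# The ratio stays at 1 (up to floating-point rounding) for every alpha.
for alpha in (0.25, 0.33, 0.5):
    print(alpha, round(returns_to_scale(1.0, 0.5, 2.0, alpha=alpha), 10))
```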
Types of iteration in Python:
Before jumping into iteration, let’s see how if statements are evaluated in Python.
# if condition:
#     code to run when condition is True
# else:
#     code to run if no conditions above are True
# return or print something
if 1 < 2:
    print(" 1 is less than 2")
1 is less than 2
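The example above only has an if branch. A minimal sketch of the full if / elif / else pattern follows; the function name compare is hypothetical and chosen for illustration.

```python
def compare(a, b):
    """Describe how a relates to b."""
    if a < b:
        return f"{a} is less than {b}"
    elif a == b:
        return f"{a} is equal to {b}"
    else:
        # Runs only when none of the conditions above is True.
        return f"{a} is greater than {b}"

print(compare(1, 2))   # 1 is less than 2
print(compare(2, 2))   # 2 is equal to 2
print(compare(3, 2))   # 3 is greater than 2
```

Python checks the conditions from top to bottom and runs only the first branch whose condition is True.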
Suppose we wanted to print out the first 10 integers and their squares. We could start like this, writing one line per integer:
print(f"1**2={1**2}")
print(f"2**2={2**2}")
print(f"3**2={3**2}")
print(f"4**2={4**2}")
print(f"5**2={5**2}")
print(f"6**2={6**2}")
1**2=1
2**2=4
3**2=9
4**2=16
5**2=25
6**2=36
For loops can do the same with less coding effort.
for i in range(1,11):
    print(f"{i}**2={i**2}")
1**2=1
2**2=4
3**2=9
4**2=16
5**2=25
6**2=36
7**2=49
8**2=64
9**2=81
10**2=100
# for item in iterable:
#     operation 1 with item
#     operation 2 with item
#     ...
#     operation N with item
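A loop often builds up a result rather than only printing it. The sketch below, with illustrative variable names, stores the squares in a list, first with an explicit for loop and then with the shorter list-comprehension form.

```python
# Explicit loop: start with an empty list and append one square per step.
squares = []
for i in range(1, 11):
    squares.append(i**2)

# The same result written as a list comprehension.
squares_short = [i**2 for i in range(1, 11)]

print(squares)
print(squares == squares_short)   # True
```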
While loops:
i = 0
while i < 3:
    print(i)
    i = i + 1
print("done")
0
1
2
done
Suppose we wanted to know the smallest \(N\) such that \(\sum_{i=0}^{N} i > 1000\).
total = 0
i = 0
while total <= 1000:
    i = i + 1
    total = total + i

print("The answer is", i)
The answer is 45
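The answer can be cross-checked with the closed form \(\sum_{i=0}^{N} i = N(N+1)/2\): the sum is 990 for \(N=44\) and 1035 for \(N=45\). A minimal sketch of that check, where smallest_n is a name introduced here for illustration:

```python
def smallest_n(threshold):
    """Smallest N with 0 + 1 + ... + N > threshold, using N(N+1)/2 for the sum."""
    n = 0
    while n * (n + 1) // 2 <= threshold:
        n += 1
    return n

n = smallest_n(1000)
print(n)                    # 45
print(sum(range(n)))        # 990, the sum up to N - 1 is still too small
print(sum(range(n + 1)))    # 1035, the sum up to N exceeds 1000
```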
This takes us to more technical topics such as data analysis and statistical modelling, using both built-in functions and libraries. Libraries offer a more efficient way of using Python because they run fast. Pandas and NumPy are among the most frequently used and best-known Python libraries.
NumPy is a powerful Python package for manipulating data with multi-dimensional vectors. Its versatility and speed make Python an ideal language for applied and computational mathematics. NumPy’s core contribution is a new data type called an array. An array is similar to a list, but NumPy imposes some additional restrictions on how the data inside is organized.
# to install: pip install numpy
import numpy as np #np is called alias and is standard way of calling libraries in Python
x_1d = np.array([1, 2, 3])
print(x_1d)
[1 2 3]
This is a one-dimensional array, much like a list of numbers, and you can use the same indexing operations we saw for lists, as well as slicing.
print(x_1d[0])
print(x_1d[0:2])
print(x_1d[0:3] == x_1d[:])
1
[1 2]
[ True True True]
NumPy arrays act like mathematical vectors and matrices: + and * perform component-wise addition and multiplication.
x, y = np.array([1, 2, 3]), np.array([4, 5, 6])
print(x, y)
print(x + 10) # Add 10 to each entry of x.
print(x * 4) # Multiply each entry of x by 4.
x + y
[1 2 3] [4 5 6]
[11 12 13]
[ 4 8 12]
array([5, 7, 9])
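To see why this matters, it helps to compare the same operators on plain lists. The following sketch uses illustrative variable names; with lists, + concatenates and * repeats, while with arrays both act entry by entry.

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [4, 5, 6]
a_arr = np.array(a_list)
b_arr = np.array(b_list)

print(a_list + b_list)   # lists are concatenated: [1, 2, 3, 4, 5, 6]
print(a_arr + b_arr)     # arrays are added entry by entry: [5 7 9]
print(a_list * 2)        # the list is repeated: [1, 2, 3, 1, 2, 3]
print(a_arr * 2)         # each entry is doubled: [2 4 6]
```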
Example: Write a function that converts the following matrix to a NumPy array and returns \(-A^3+9A^2-15A\), with the powers taken element-wise.
A=[
[3,1,4],
[1,5,9],
[-5,3,1]
]
print(A)
# To print the matrix A in a nicer form you can use a for loop (of course there are many ways to do it with packages)
for row in A:
    print(row)
[[3, 1, 4], [1, 5, 9], [-5, 3, 1]]
[3, 1, 4]
[1, 5, 9]
[-5, 3, 1]
import numpy as np
def function_numpy(A):
    A_1 = [row for row in A]
    A_2 = np.array(A_1)
    return(-A_2**3+9*A_2**2-15*A_2)
function_numpy(A)
array([[ 9, -7, 20],
[ -7, 25, -135],
[ 425, 9, -7]])
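Note that ** on a NumPy array raises each entry to the power separately, which is what the output above shows. If the powers are instead meant as matrix products, np.linalg.matrix_power can be used. The sketch below contrasts the two readings; treating the expression as a matrix polynomial is an alternative interpretation added here, not the course's stated solution.

```python
import numpy as np

A = np.array([[3, 1, 4],
              [1, 5, 9],
              [-5, 3, 1]])

# Element-wise: every entry is raised to the power on its own.
elementwise = -A**3 + 9 * A**2 - 15 * A

# Matrix polynomial: powers are repeated matrix products (A @ A @ A).
A2 = np.linalg.matrix_power(A, 2)
A3 = np.linalg.matrix_power(A, 3)
matrix_poly = -A3 + 9 * A2 - 15 * A

print(elementwise)
print(matrix_poly)
```

The two results differ, so it is worth stating which reading an exercise intends.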
NumPy arrays have several attributes, some of which are listed below.
Attribute | Description |
---|---|
dtype | The type of the elements in the array. |
ndim | The number of axes (dimensions) of the array. |
shape | A tuple of integers indicating the size in each dimension. |
size | The total number of elements in the array. |
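As a quick illustration of the attributes in the table, the sketch below inspects a small 2-by-3 array of floats; the array M is made up for this example.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

print(M.dtype)   # float64
print(M.ndim)    # 2
print(M.shape)   # (2, 3)
print(M.size)    # 6
```

Attributes are read without parentheses because they are stored properties of the array, not functions.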
Examples
# Create an array of the first seven integers
np.arange(7)
# Create an array of floats from 1 to 12
np.arange(1.,13.)
# Create an array of values between 0 and 20, stepping by 2
np.arange(0,20,2)
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
For plotting NumPy arrays, we can use Matplotlib, the main visualization package for Python.
import matplotlib.pyplot as plt
%matplotlib inline
# Step 1
fig, ax = plt.subplots()

# Step 2
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

# Step 3
ax.plot(x, y)
A more advanced plot
N = 50

np.random.seed(42)

x = np.random.rand(N)
y = np.random.rand(N)
colors = np.random.rand(N)
area = np.pi * (15 * np.random.rand(N))**2 # 0 to 15 point radii

fig, ax = plt.subplots()

ax.scatter(x, y, s=area, c=colors, alpha=0.5)

ax.annotate("Starting point", xy=(x[0], y[0]), xycoords="data",
            xytext=(25, -25), textcoords="offset points",
            arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=0.6")
)
Text(25, -25, 'Starting point')
Pandas is at the top of the “scientific stack” because it allows data to be imported, manipulated and exported so easily. In contrast, NumPy supports the bottom of the stack with fundamental infrastructure for array operations, mathematical calculations, and random number generation.
Pandas is a Python package providing fast, flexible, and expressive data structures designed to work with both relational and labeled data. It is a fundamental high-level building block for doing practical, real-world data analysis in Python.
Pandas is well suited for:
# To install use !pip install pandas -U
import pandas as pd
The first pandas data structure is a Series. A Series is a one-dimensional array that can hold any datatype, similar to a ndarray. However, a Series has an index that gives a label to each entry. An index generally is used to label the data. Typically a Series contains information about one feature of the data. For example, the data in a Series might show a class’s grades on a test and the Index would indicate each student in the class. To initialize a Series, the first parameter is the data and the second is the index.
import pandas as pd
import numpy as np
# let's initialize Series of student grades
math = pd.Series(np.random.randint(0,100,4),['Alisha','Monica','Joseph','Eva'])
english = pd.Series(np.random.randint(0,100,5),['Alisha','Monica','Yusuf','Eva', 'Gia'])
print(math, "\n")
# Every time you run this you will get different random numbers because of the randint function
english
Alisha 23
Monica 74
Joseph 71
Eva 35
dtype: int32

Alisha 37
Monica 83
Yusuf 98
Eva 88
Gia 98
dtype: int32
We created random Series with labels (here, names), which can be used to look up specific values, as in the following case:
print(math['Eva'])
english['Alisha']
35
37
Use .index to find the labels of the data:
english.index
Index(['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'], dtype='object')
Remember for loops! They can be used to identify students whose names end with the letter a:
a = [name.endswith('a') for name in english.index] # only shows True and False
print(a)
english[[name.endswith('a') for name in english.index]] # use this one for extraction
[True, True, False, True, True]
Alisha 37
Monica 83
Eva 88
Gia 98
dtype: int32
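The same selection can also be written without an explicit loop, using the .str accessor of the index. This is a sketch of an alternative rather than the course's method; the grades are fixed to the values shown in the output above so that the result is reproducible.

```python
import pandas as pd

# Grades fixed to the values printed above (the original draws them at random).
english = pd.Series([37, 83, 98, 88, 98],
                    index=['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'])

# The .str accessor applies a string method to every index label at once.
mask = english.index.str.endswith('a')

print(mask)
print(english[mask])
```

Indexing a Series with a True/False mask keeps only the entries where the mask is True.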
The second key pandas data structure is a DataFrame. A DataFrame is a collection of multiple Series. It can be thought of as a 2-dimensional array, where each row is a separate datapoint and each column is a feature of the data. The rows are labeled with an index (as in a Series) and the columns are labeled in the attribute columns.
grades = pd.DataFrame({"Math": math, "English":english})
grades
 | Math | English |
---|---|---|
Alisha | 23.0 | 37.0 |
Eva | 35.0 | 88.0 |
Gia | NaN | 98.0 |
Joseph | 71.0 | NaN |
Monica | 74.0 | 83.0 |
Yusuf | NaN | 98.0 |
grades.columns
Index(['Math', 'English'], dtype='object')
b = grades.dropna()
b
 | Math | English |
---|---|---|
Alisha | 23.0 | 37.0 |
Eva | 35.0 | 88.0 |
Monica | 74.0 | 83.0 |
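Dropping rows discards the grades a student does have. As a hedged alternative, the sketch below counts the missing entries and fills them with the column mean using fillna(). The grades are fixed to the values shown above, and filling with the mean is only one possible strategy, chosen here for illustration.

```python
import pandas as pd

math = pd.Series([23, 74, 71, 35], index=['Alisha', 'Monica', 'Joseph', 'Eva'])
english = pd.Series([37, 83, 98, 88, 98],
                    index=['Alisha', 'Monica', 'Yusuf', 'Eva', 'Gia'])
grades = pd.DataFrame({"Math": math, "English": english})

# Count the missing entries in each column.
missing = grades.isna().sum()
print(missing)

# Replace each missing grade with the mean of its column.
filled = grades.fillna(grades.mean())
print(filled)
```

All six students remain in filled, whereas dropna() keeps only the three with both grades.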
Here is a non-exhaustive list of Pandas functions for data reading, writing, manipulation and statistical operations.
Method | Returns |
---|---|
to_csv() | Write the index and entries to a CSV file |
read_csv() | Read a csv and convert into a DataFrame |
to_json() | Convert the object to a JSON string |
to_pickle() | Serialize the object and store it in an external file |
to_sql() | Write the object data to an open SQL database |
read_html() | Read a table in an html page and convert to a DataFrame |
append() | Concatenate two or more Series (removed in pandas 2.0; use pd.concat() instead) |
drop() | Remove the entries with the specified label or labels |
drop_duplicates() | Remove duplicate values |
dropna() | Drop null entries |
fillna() | Replace null entries with a specified value or strategy |
reindex() | Replace the index |
sample() | Draw a random entry |
shift() | Shift the index |
unique() | Return unique values |
abs() | Object with absolute values taken (of numerical data) |
idxmax() | The index label of the maximum value |
idxmin() | The index label of the minimum value |
count() | The number of non-null entries |
cumprod() | The cumulative product over an axis |
cumsum() | The cumulative sum over an axis |
max() | The maximum of the entries |
mean() | The average of the entries |
median() | The median of the entries |
min() | The minimum of the entries |
mode() | The most common element(s) |
prod() | The product of the elements |
sum() | The sum of the elements |
var() | The variance of the elements |
Example
titanic = pd.read_csv("https://raw.githubusercontent.com/jorisvandenbossche/course-python-data/main/notebooks/data/titanic.csv")
titanic.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Using pandas, we can calculate:
titanic['Age'].hist()
titanic.groupby('Sex')['Survived'].mean()
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
titanic.groupby('Pclass')['Survived'].mean().plot.bar()
len(titanic)
891
titanic.groupby('Pclass')['Fare'].mean()
Pclass
1 84.154687
2 20.662183
3 13.675550
Name: Fare, dtype: float64
titanic[titanic['Age']<=25]['Survived'].mean() # Or you can make a new df and then calculate the mean
0.4119601328903654
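To make the filtering and groupby steps easier to follow, here is a sketch on a tiny made-up table that reuses some of the Titanic column names. The six rows are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up toy data with the same column names as the Titanic table.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Age": [22.0, 38.0, 26.0, 54.0, 14.0, 20.0],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Step 1: a boolean mask, True for passengers aged 25 or younger.
young = toy["Age"] <= 25

# Step 2: keep only those rows, then average the Survived column.
young_rate = toy[young]["Survived"].mean()
print(young_rate)

# groupby splits the rows by Sex, then averages Survived within each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

Because Survived is coded as 0 or 1, its mean is the share of survivors in each selection.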