Table of Contents
- Pandas Dataframes
- Creating a simple DataFrame
- Presenting results with DataFrames
- Extracting a subset of data
- Importing data from file to DataFrame
- Import example
- Filtering data
- Filtering by multiple conditions
- Exporting a DataFrame to a file
- GroupBy
- Printing with df.head() and df.tail()
- Something to be aware of
- Much more functionality
- When and why to use pandas
- Exercise 1.1
- Exercise 1.2
- Exercise 1.3
- Exercise 1.4
- Exercise 2.1
- Exercise 2.2
- Exercise 2.3
- Exercise 2.4
- Exercise 2.5
- If you are up for more
Pandas Dataframes¶
Python is very good for data analysis. Much of this is thanks to the pandas
library, which contains a wealth of powerful functions to load data and manipulate it.
In the pandas
environment what we normally refer to as a table is called a DataFrame. If the data has only a single column, it is called a Series. These are the core objects in the library.
As with many libraries, there is a convention for renaming when importing. In pandas
the convention is to import as pd
:
import pandas as pd
Creating a simple DataFrame¶
A simple DataFrame can be created with pandas.DataFrame()
:
# Create a simple DataFrame
df = pd.DataFrame({'Column1': [11, 12, 13], 'Column2': [21, 22, 23], 'Column3': [31, 32, 33]})
df
- Note 1: The input argument for creating the DataFrame is a dictionary, i.e. a data structure with key-value pairs.
- Note 2: It automatically creates an index column as the leftmost column.
- Note 3: The displayed DataFrame looks nicer than it would in an editor. This is because it is styled with HTML. In an editor, the printed DataFrame would look like this:
# DataFrame as it would look without HTML-styling
print(df)
Presenting results with DataFrames¶
If we have a dictionary from a previous calculation of some kind, we can quickly turn it into a DataFrame with the same principle as above:
# Define normal force and cross sectional area
normal_force = [85, 56, 120]
area = [314, 314, 314]
# Compute stress in cross section for all normal forces
stress = [n/a for n, a in zip(normal_force, area)]
# Gather calculation results in dictionary
results = {'N [kN]': normal_force, 'A [mm2]': area, 'sigma_n [MPa]': stress}
# Create a DataFrame of the results from the dictionary
df2 = pd.DataFrame(results)
df2
Adjusting the index column¶
The default index (leftmost column) is not really suited for this particular scenario, so we could change its name to "Load Case" and have it start at 1 instead of 0:
# Set the name of the index to "Load Case"
df2.index.name = 'Load Case'
# Add 1 to all indices
df2.index += 1
df2
Extracting a subset of data¶
We can extract specific columns from the DataFrame:
# Extract only the stress column to new DataFrame
df2[['sigma_n [MPa]']]
- Note: The use of two square bracket pairs
[[]]
turns the result into a new DataFrame, with just one column. If there had been only a single square bracket, the result would be a Series object. See below.
# Extract stress column to Series object
df2['sigma_n [MPa]']
Most of the time, we want to keep working with DataFrames, so remember to use double square brackets.
Double square brackets must be used if we want to extract more than one column. Otherwise, a KeyError
will be raised.
# Extract multiple columns to DataFrame
df2[['N [kN]', 'sigma_n [MPa]']]
Importing data from file to DataFrame¶
Data can be imported from various file types. The most common ones are probably standard text files (.txt
), comma separated value files (.csv
) or Excel files (.xlsx
)
Some common scenarios
# Import from .csv (comma separated values)
pd.read_csv('<file_name>.csv')
# Import from .txt with values separated by white space
pd.read_csv('<file_name>.txt', delim_whitespace=True)
# Import from Excel
pd.read_excel('<file_name>.xlsx', sheet_name='<sheet_name>')
The above assumes that the files to import are located in the same directory as the script. Placing them there makes the imports easier.
The functions above have many optional arguments. When importing from an Excel workbook it will often be necessary to specify more parameters than when importing a plain text file, because an Excel file is much more complex. For example, by default the data is assumed to start at cell A1 as the top left and the default sheet is the first sheet occurring in the workbook, but this is not always what is wanted.
See the pandas documentation for details on these functions.
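As a sketch, the whitespace-separated import can be tried without a file on disk by reading from an in-memory buffer (the profile values below are made up; note that in recent pandas versions the delim_whitespace argument is deprecated in favour of sep=r'\s+'):

```python
import io
import pandas as pd

# Simulated contents of a whitespace-separated .txt file (made-up values)
text = """Profile h[mm] Iy[mm4]
HE100A 96 3492000
HE120A 114 6062000
"""

# Read it as if it were a file on disk
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
print(df.shape)            # (2, 3)
print(list(df.columns))    # ['Profile', 'h[mm]', 'Iy[mm4]']
```

The same io.StringIO trick is handy for testing any of the read functions before pointing them at real files.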
Import example¶
# Import data from 'HEA.txt', which has data separated
# by white spaces
df = pd.read_csv('HEA.txt', delim_whitespace=True)
df
Filtering data¶
Data filtering is easy and intuitive. It is done by conditional expressions.
For example, if we want to filter the HEA-DataFrame for profiles with moment of inertia $I_y$ larger than some value:
df[df['Iy[mm4]'] > 30000000]
Understanding the filtering process¶
The inner expression of the filtering
df['Iy[mm4]'] > 30000000
returns the column Iy[mm4]
from the DataFrame converted into a boolean Series. I.e. a Series with True
/False
in each row depending on the condition being fulfilled or not. See the printout below.
# Inner expression returns a boolean Series of Iy[mm4]
df['Iy[mm4]'] > 30000000
This boolean Series is used to filter the original DataFrame, which is done in the outer expression by df[boolean_series]
.
The outer expression picks only the rows from the original DataFrame where the boolean Series is True.
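The two steps can be sketched on a small made-up table:

```python
import pandas as pd

# Small made-up table for illustration
df = pd.DataFrame({'h[mm]': [96, 190, 290],
                   'Iy[mm4]': [3.5e6, 3.7e7, 1.0e8]})

# Inner expression: a boolean Series, one True/False per row
mask = df['Iy[mm4]'] > 30000000
print(mask.tolist())      # [False, True, True]

# Outer expression: keep only the rows where the mask is True
filtered = df[mask]
print(len(filtered))      # 2
```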
Filtering by multiple conditions¶
Filtering based on multiple conditions can be quite powerful. The syntax is only slightly more complicated:
df[(df['Iy[mm4]'] > 30000000) & (df['h[mm]'] < 260 )]
Filtering can also be based on lists of values:
# Valid profiles to choose from
valid_profiles = ['HE180A', 'HE220A', 'HE260A', 'HE280A']
# Filter DataFrame based on Iy and valid profiles
df[(df['Iy[mm4]'] > 30000000) & (df['Profile'].isin(valid_profiles) )]
If we want to rule out some profiles, we could put a ~
in front of the condition to specify that values must not be present in the list:
# Invalid profiles
invalid_profiles = ['HE180A', 'HE220A', 'HE260A', 'HE280A']
# Filter DataFrame based on Iy, excluding invalid profiles
df[(df['Iy[mm4]'] > 30000000) & (~df['Profile'].isin(invalid_profiles) )]
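A minimal sketch of isin and its negation with ~, using made-up profile names:

```python
import pandas as pd

# Made-up miniature profile table
df = pd.DataFrame({'Profile': ['HE100A', 'HE180A', 'HE220A'],
                   'Iy[mm4]': [3.5e6, 2.5e7, 5.4e7]})

valid = ['HE180A', 'HE220A']

# Keep rows whose profile is in the list
kept = df[df['Profile'].isin(valid)]
# Keep rows whose profile is NOT in the list
dropped = df[~df['Profile'].isin(valid)]

print(kept['Profile'].tolist())      # ['HE180A', 'HE220A']
print(dropped['Profile'].tolist())   # ['HE100A']
```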
Exporting a DataFrame to a file¶
Exporting a DataFrame to a new text file could not be easier. Saving to a .txt
:
# Save df to a .txt file in the same folder as the script
df.to_csv('filename.txt')
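As a sketch of what the export produces: when to_csv is called without a file path, it returns the CSV text as a string, which is convenient for inspecting the output format. The index=False argument leaves out the index column:

```python
import pandas as pd

df = pd.DataFrame({'N [kN]': [85, 56], 'A [mm2]': [314, 314]})

# With no path argument, to_csv returns the text instead of writing a file
csv_text = df.to_csv(index=False)
print(csv_text)
```

The first line of the returned text is the header 'N [kN],A [mm2]', followed by one comma-separated line per row.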
GroupBy¶
# Create a dataframe to work with
dff = pd.DataFrame({'Fruit': ['Pear', 'Apple', 'Apple', 'Banana', 'Lemon', 'Banana', 'Banana', 'Pear'],
'Amount_sold': [3, 6, 7, 2, 4, 7, 1, 6]})
dff
The DataFrame.groupby
method itself returns a groupby object, not a DataFrame, so printing it on its own will just show the object.
# The groupby will return a groupby object
dff.groupby('Fruit')
The object contains metadata about how the data is grouped. The powerful operations are visible only after we apply a certain function to the groupby object, like sum()
:
dff.groupby('Fruit').sum()
We could say that we first split the DataFrame into fruit groups, applied a function to each of those groups, and then combined and returned the results. This is often called the split-apply-combine pattern.
Note that by default the column that was grouped by becomes the new index, since these are now unique values.
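The full split-apply-combine round trip can be sketched with the fruit data from above:

```python
import pandas as pd

dff = pd.DataFrame({'Fruit': ['Pear', 'Apple', 'Apple', 'Banana', 'Lemon',
                              'Banana', 'Banana', 'Pear'],
                    'Amount_sold': [3, 6, 7, 2, 4, 7, 1, 6]})

# Split by fruit, sum within each group, combine the results
totals = dff.groupby('Fruit').sum()

# 'Fruit' is now the index, so a single group's total can be looked up by label
print(totals.loc['Banana', 'Amount_sold'])   # 10  (2 + 7 + 1)
print(totals.index.name)                     # Fruit
```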
Documentation¶
Documentation for groupby
: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
Documentation for apply
: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
Printing with df.head()
and df.tail()
¶
When DataFrames become very large, printing all the data to the screen becomes unwieldy. Printing is mostly done just to make sure that some operation worked as we expected it would. In that case, printing just a few rows is sufficient, which the following methods allow for:
# Print the first 5 rows of df
df.head()
# Print the last 5 rows of df
df.tail()
# Print the first x rows of df
df.head(x)
# Print the last y rows of df
df.tail(y)
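A small sketch of the defaults:

```python
import pandas as pd

# A DataFrame with 100 rows of dummy data
df = pd.DataFrame({'x': range(100)})

print(len(df.head()))            # 5 (the default number of rows)
print(len(df.tail(3)))           # 3
print(df.tail(3)['x'].tolist())  # [97, 98, 99]
```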
Something to be aware of¶
A potentially confusing thing about pandas
methods is that it can be hard to know which ones mutate the DataFrame in place and which return a copy that needs to be saved to a new variable. Consider the lines below:
# This line does not rename the column in df
# but returns a copy of df with the column renamed.
df.rename(columns={'Current_name': 'New_name'})
# Thus, it has to be saved to a new variable
df = df.rename(columns={'Current_name': 'New_name'})
# Or, use the argument inplace=True to modify df directly
df.rename(columns={'Current_name': 'New_name'}, inplace=True)
You will most likely stumble across this when working with pandas
.
Note that there is no error when executing the first line shown above, but when df
is eventually printed it will just not be as intended.
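The copy-versus-in-place behaviour of rename can be demonstrated directly:

```python
import pandas as pd

df = pd.DataFrame({'Current_name': [1, 2, 3]})

# Returns a renamed copy; df itself is unchanged
renamed = df.rename(columns={'Current_name': 'New_name'})
print(list(df.columns))       # ['Current_name']
print(list(renamed.columns))  # ['New_name']

# Modify df directly instead
df.rename(columns={'Current_name': 'New_name'}, inplace=True)
print(list(df.columns))       # ['New_name']
```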
Much more functionality¶
There are numerous functions and methods available in pandas
and the above barely scratches the surface.
Practically anything that you would want to do to a dataset can be done. And quite possibly somebody has had the same problem as you before and found a solution, or maybe even contributed to the pandas
library and put in that functionality for everyone to use.
However, some functionality can be much harder to understand and use than the above mentioned.
The pandas
library integrates well with other big libraries like numpy
and matplotlib
and other functionality in the Python language in general. For example, many DataFrame methods can take a custom function def ...()
as input and run it over selected contents of the DataFrame.
Plotting with matplotlib
is directly supported in pandas
via shortcuts so you can do df.plot()
and it will create a plot of the DataFrame of a specified kind even without having to import matplotlib
.
When and why to use pandas
¶
- The manipulations that can be done with
pandas
are quite powerful when datasets become much larger than the ones shown above. It is especially helpful when the dataset reaches a size where all data cannot be viewed and understood well by simply scrolling through and looking at it. If the number of rows goes beyond just a couple of thousand, it is hard to get an overall feel for the data and its trends by inspection alone. This is where typing logical commands to do manipulations becomes a great help.
- Use it when a very specific solution for data manipulation is desired, especially when the solution is not trivially done in, for example, Excel.
- It is a good tool for combining multiple datasets, e.g. from different files.
- Last but not least, it is good for reproducibility and handling changes in data size.
Exercise 1.1¶
All exercises 1.x are working with the same DataFrame.
Create a DataFrame from the dictionary d
below. Save it in a variable called df
.
# Import built-in libraries string and random
import random
import string
# Get upper- and lowercase letters from string library
lower = string.ascii_lowercase
upper = string.ascii_uppercase
# Create a dictionary with dummy data of integers and letters
d = {'Integers': [random.randint(1, 100) for i in range(1, 100)],
'Lowercase': [random.choice(lower) for i in range(1, 100)],
'Uppercase': [random.choice(upper) for i in range(1, 100)]}
Print/display the entire DataFrame to see if it comes out as you expect.
Remember to import pandas as pd
.
Exercise 1.2¶
Print/display the only the first or last rows by using DataFrame.head()
or DataFrame.tail()
. You choose how many rows to print (default is 5).
Use these methods to test print the DataFrames from now on to avoid printing all rows.
Exercise 1.3¶
Filter df
to only contain the rows where the uppercase letter is 'K'
. Save it to a new variable called dfk
.
Print/display it to make sure it is correct.
If you were unlucky and did not have a 'K'
generated in the uppercase column, try re-running the code.
Exercise 1.4¶
When printing the filtered dfk
, notice that the index from the original DataFrame is kept. This is often useful for back reference, but sometimes we want the index to be reset.
Reset the index of dfk
to start from 0 by using DataFrame.reset_index()
.
This method does not modify the DataFrame inplace by default, so remember to either save to a new variable or give the input argument inplace=True
.
By default, the original index will be added as a new column to the DataFrame. If you don't want this, use the input argument drop=True
.
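A sketch of the effect, using a tiny made-up frame instead of the exercise data:

```python
import pandas as pd

# Made-up data: filter keeps rows 0 and 2, so the index has gaps
df = pd.DataFrame({'Uppercase': ['K', 'A', 'K', 'B']})
dfk = df[df['Uppercase'] == 'K']
print(dfk.index.tolist())     # [0, 2]  (original indices are kept)

# Reset to 0, 1, ... and drop the old index instead of keeping it as a column
dfk = dfk.reset_index(drop=True)
print(dfk.index.tolist())     # [0, 1]
```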
Exercise 2.1¶
All exercises 2.x are to be seen as the same problem. It has just been divided into smaller tasks.
Import the file shear_key_forces.csv
to a DataFrame using pandas.read_csv()
. The values in the file are comma separated, which the function also assumes as default.
The file is located in the Session 5 folder and has 104329 rows. Print the head or the tail to see the imported data.
The data has all spring element forces in a bunch of load cases from a Sofistik finite element calculation.
Exercise 2.2¶
The model has many spring elements. Some of them represent shear keys between tunnel parts at movement joints. These are the ones we are going to extract.
The data has a column 'shear_key' which has the name of the shear key if the element in that row is part of a shear key, e.g. 'Shear_key1'. If the element is not part of a shear key, the name is 'Not_a_shear_key'.
Filter out all rows which are not part of a shear key. The resulting DataFrame should have 2874 rows.
Exercise 2.3¶
Since we are not really using the 'Element_no' column, go ahead and remove it from the DataFrame. This can be done by:
# Remove column 'column_name' from 'df'
df = df.drop('column_name', axis=1)
The argument axis=1
specifies that it is a column and not a row that should be removed.
Remember to save to a new variable or use argument inplace=True
. If you save to a variable, you can use the same name to 'overwrite' the old one if it's not needed anymore.
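A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'Element_no': [101, 102], 'P [kN]': [85.0, 56.0]})

# axis=1 -> drop a column (axis=0 would drop a row by index label)
df = df.drop('Element_no', axis=1)
print(list(df.columns))   # ['P [kN]']
```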
Exercise 2.4¶
Each shear key consists of three spring elements. The total force that the shear key should be designed for is the sum of those three spring forces.
Create a DataFrame with the sum of the three values within each shear key for every load case. The resulting DataFrame should have 958 rows.
Hint: Use the methods DataFrame.groupby()
and DataFrame.sum()
like this:
df.groupby(['Shear_key', 'LC', 'LC-title'], as_index=False).sum()
Replace df
with the name of your variable containing the DataFrame.
Here, a list of column labels is passed to the groupby()
method instead of just a single label. The groups are formed by the unique combinations of the listed columns, and the remaining numeric columns are summed within each group. Since 'LC' and 'LC-title' together identify the load case, including them in the list keeps those columns in the result instead of summing them. Non-numeric columns that are not passed in will not appear in the resulting DataFrame.
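A miniature sketch of the same pattern, with made-up spring forces (three springs per key, one load case):

```python
import pandas as pd

# Made-up miniature of the shear key data
df = pd.DataFrame({'Shear_key': ['Shear_key1'] * 3 + ['Shear_key2'] * 3,
                   'LC': [1, 1, 1, 1, 1, 1],
                   'P [kN]': [10.0, 20.0, 30.0, 5.0, 5.0, 5.0]})

# Group by the combination of key and load case; as_index=False keeps
# the grouping columns as ordinary columns instead of moving them to the index
summed = df.groupby(['Shear_key', 'LC'], as_index=False).sum()

print(summed['P [kN]'].tolist())   # [60.0, 15.0]
print('Shear_key' in summed.columns)   # True
```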
Exercise 2.5¶
Filter the DataFrame for a shear key, for example 'Shear_key1' and create a bar plot of it with the DataFrame.plot()
method. The bar plot should have the load cases as $x$-values and the force $P$ [kN] as $y$-values.
# Plot dataframe contents
df.plot(kind='bar', x='column_for_x_values', y='column_for_y_values')
The method has many optional arguments, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html.
Try for example to change the figure size by figsize=(width, height)
, rotate the x-ticks by rot=angle_in_degrees
and change the color of the bars by color='some_color'
.
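A minimal sketch with made-up load case data (assumes matplotlib is installed, which pandas plotting requires; the Agg backend avoids needing a display):

```python
import matplotlib
matplotlib.use('Agg')   # non-interactive backend, e.g. for scripts without a display
import pandas as pd

# Made-up summed shear key forces for three load cases
df = pd.DataFrame({'LC-title': ['LC1', 'LC2', 'LC3'],
                   'P [kN]': [60.0, 45.0, 80.0]})

# Bar plot with load cases on the x-axis and force on the y-axis
ax = df.plot(kind='bar', x='LC-title', y='P [kN]', figsize=(6, 4), rot=45)
print(len(ax.patches))   # 3 bars, one per load case
```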
If you are up for more¶
Create a loop that goes through all shear keys, creates a plot like the one from the previous exercise, and saves each plot to a .png file.