Compare commits

...

14 Commits
master ... v2

Author SHA1 Message Date
Jake VanderPlas ce26f2373b remove contents cells from chapters 00 and 01 2021-10-18 19:37:40 -07:00
Jake VanderPlas 4d709a9126 Update 02.05 2021-10-06 06:40:30 -07:00
Jake VanderPlas e890dafd06 Update 02.03, 02.04 2021-10-06 06:25:30 -07:00
Jake VanderPlas 68ff5a4d78 Update jupytext version 2021-10-05 06:45:57 -07:00
Jake VanderPlas 231dc690f7 Update 02.00, 02.02, 02.02 2021-10-05 06:45:46 -07:00
Jake VanderPlas 7d64b946f5 Update requirements 2021-10-05 05:52:15 -07:00
Jake VanderPlas 175a5dc1f3 update requirements.txt to most recent versions of packages 2021-03-19 06:58:47 -07:00
Jake VanderPlas 4ee6907795 Update 01.07 and 01.08 2021-03-19 06:50:42 -07:00
Jake VanderPlas 153182a6ef Update 01.06 2021-03-11 06:51:32 -08:00
Jake VanderPlas 75160a70d0 update 01.05 2021-03-11 06:37:39 -08:00
Jake VanderPlas 8e6ddff89b Update 01.00-01.04 2021-03-11 06:20:18 -08:00
Jake VanderPlas 795099efd1 Update 00.00-Preface 2021-03-08 07:44:03 -08:00
Jake VanderPlas 3ee9ce82f5 Add pre-commit github action 2021-03-08 06:49:10 -08:00
Jake VanderPlas f8ab0bfd72 Start notebooks_v2 and and sync to md with jupytext 2021-03-08 06:42:23 -08:00
184 changed files with 101433 additions and 10 deletions

.github/workflows/ci-build.yaml vendored 100644

@@ -0,0 +1,22 @@
name: CI

on:
  # Trigger the workflow on push or pull request,
  # but only for the v2 branch
  push:
    branches:
      - v2
  pull_request:
    branches:
      - v2

jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python 3.9
        uses: actions/setup-python@v2
        with:
          python-version: 3.9
      - uses: pre-commit/action@v2.0.0


@@ -0,0 +1,15 @@
# Install the pre-commit hooks below with
# 'pre-commit install'
# Auto-update the version of the hooks with
# 'pre-commit autoupdate'
# Run the hooks on all files with
# 'pre-commit run --all'
repos:
  - repo: https://github.com/mwouts/jupytext
    rev: v1.10.0
    hooks:
      - id: jupytext
        args: [--sync]


@@ -0,0 +1,161 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preface"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What Is Data Science?\n",
"\n",
"This is a book about doing data science with Python, which immediately begs the question: what is *data science*?\n",
"It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become.\n",
"Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn't involve data?) or a simple buzzword that only exists to salt resumes and catch the eye of overzealous tech recruiters.\n",
"\n",
"In my mind, these critiques miss something important.\n",
"Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia.\n",
"This cross-disciplinary piece is key: in my mind, the best extisting definition of data science is illustrated by Drew Conway's Data Science Venn Diagram, first published on his blog in September 2010:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Data Science Venn Diagram](figures/Data_Science_VD.png)\n",
"\n",
"<small>(Source: [Drew Conway](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Used by permission.)</small>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say \"data science\": it is fundamentally an *interdisciplinary* subject.\n",
"Data science comprises three distinct and overlapping areas: the skills of a *statistician* who knows how to model and summarize datasets (which are growing ever larger); the skills of a *computer scientist* who can design and use algorithms to efficiently store, process, and visualize this data; and the *domain expertise*—what we might think of as \"classical\" training in a subject—necessary both to formulate the right questions and to put their answers in context.\n",
"\n",
"With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but a new set of skills that you can apply within your current area of expertise.\n",
"Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Who Is This Book For?\n",
"\n",
"In my teaching both at the University of Washington and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: \"how should I learn Python?\"\n",
"The people asking are generally technically minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools.\n",
"Most of these folks don't want to learn Python *per se*, but want to learn the language with the aim of using it as a tool for data-intensive and computational science.\n",
"While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I've long been frustrated by the lack of a single good answer to this question; that is what inspired this book.\n",
"\n",
"The book is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.\n",
"Instead it is meant to help Python users learn to use Python's data science stacklibraries such as IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related toolsto effectively store, manipulate, and gain insight from data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Why Python?\n",
"\n",
"Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets.\n",
"This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.\n",
"The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: *NumPy* for manipulation of homogeneous array-based data, *Pandas* for manipulation of heterogeneous and labeled data, *SciPy* for common scientific computing tasks, *Matplotlib* for publication-quality visualizations, *IPython* for interactive execution and sharing of code, *Scikit-Learn* for machine learning, and many more tools that will be mentioned in the following pages.\n",
"\n",
"If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, \"[A Whirlwind Tour of the Python Language](https://github.com/jakevdp/WhirlwindTourOfPython)\".\n",
"This short report provides a tour of the essential features of the Python language, aimed at data scientists who already are familiar with one or more other programming languages."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Outline of the Book\n",
"\n",
"Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python Data Sciece story.\n",
"\n",
"1. IPython and Jupyter: these packages provide the computational environment in which many Python-using data scientists work.\n",
"2. NumPy: this library provides the ``ndarray`` for efficient storage and manipulation of dense data arrays in Python.\n",
"3. Pandas: this library provides the ``DataFrame`` for efficient storage and manipulation of labeled/columnar data in Python.\n",
"4. Matplotlib: this library provides capabilities for a flexible range of data visualizations in Python.\n",
"5. Scikit-Learn: this library provides efficient & clean Python implementations of the most important and established machine learning algorithms.\n",
"\n",
"The PyData world is certainly much larger than these five packages, and is growing every day.\n",
"With this in mind, I make every attempt through these pages to provide references to other interesting efforts, projects, and packages that are pushing the boundaries of what can be done in Python.\n",
"Nevertheless, these five are currently fundamental to much of the work being done in the Python data science space, and I expect they will remain important even as the ecosystem continues growing around them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Code Examples\n",
"\n",
"Supplemental material (code examples, figures, etc.) is available for download at http://github.com/jakevdp/PythonDataScienceHandbook/. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless youre reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from OReilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your products documentation does require permission.\n",
"\n",
"We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example:\n",
"\n",
"> *The Python Data Science Handbook* by Jake VanderPlas (OReilly). Copyright 2016 Jake VanderPlas, 978-1-491-91205-8.\n",
"\n",
"If you feel your use of code examples falls outside fair use or the per mission given above, feel free to contact us at permissions@oreilly.com."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation Considerations\n",
"\n",
"Installing Python and the suite of libraries that enable scientific computing is straightforward . This section will outline some of the considerations when setting up your computer.\n",
"\n",
"Though there are various ways to install Python, the one I would suggest for use in data science is the Anaconda distribution, which works similarly whether you use Windows, Linux, or Mac OS X.\n",
"The Anaconda distribution comes in two flavors:\n",
"\n",
"- [Miniconda](http://conda.pydata.org/miniconda.html) gives you the Python interpreter itself, along with a command-line tool called ``conda`` which operates as a cross-platform package manager geared toward Python packages, similar in spirit to the apt or yum tools that Linux users might be familiar with.\n",
"\n",
"- [Anaconda](https://www.continuum.io/downloads) includes both Python and conda, and additionally bundles a suite of other pre-installed packages geared toward scientific computing. Because of the size of this bundle, expect the installation to consume several gigabytes of disk space.\n",
"\n",
"Any of the packages included with Anaconda can also be installed manually on top of Miniconda; for this reason I suggest starting with Miniconda.\n",
"\n",
"To get started, download and install the Miniconda packagemake sure to choose a version with Python 3and then install the core packages used in this book:\n",
"\n",
"```\n",
"[~]$ conda install numpy pandas scikit-learn matplotlib seaborn jupyter\n",
"```\n",
"\n",
"Throughout the text, we will also make use of other more specialized tools in Python's scientific ecosystem; installation is usually as easy as typing **``conda install packagename``**.\n",
"For more information on conda, including information about creating and using conda environments (which I would *highly* recommend), refer to [conda's online documentation](http://conda.pydata.org/docs/)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,109 @@
---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.13.0
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# Preface
## What Is Data Science?
This is a book about doing data science with Python, which immediately begs the question: what is *data science*?
It's a surprisingly hard definition to nail down, especially given how ubiquitous the term has become.
Vocal critics have variously dismissed the term as a superfluous label (after all, what science doesn't involve data?) or a simple buzzword that only exists to salt resumes and catch the eye of overzealous tech recruiters.
In my mind, these critiques miss something important.
Data science, despite its hype-laden veneer, is perhaps the best label we have for the cross-disciplinary set of skills that are becoming increasingly important in many applications across industry and academia.
This cross-disciplinary piece is key: in my mind, the best existing definition of data science is illustrated by Drew Conway's Data Science Venn Diagram, first published on his blog in September 2010:
![Data Science Venn Diagram](figures/Data_Science_VD.png)
<small>(Source: [Drew Conway](http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram). Used by permission.)</small>
While some of the intersection labels are a bit tongue-in-cheek, this diagram captures the essence of what I think people mean when they say "data science": it is fundamentally an *interdisciplinary* subject.
Data science comprises three distinct and overlapping areas: the skills of a *statistician* who knows how to model and summarize datasets (which are growing ever larger); the skills of a *computer scientist* who can design and use algorithms to efficiently store, process, and visualize this data; and the *domain expertise*—what we might think of as "classical" training in a subject—necessary both to formulate the right questions and to put their answers in context.
With this in mind, I would encourage you to think of data science not as a new domain of knowledge to learn, but a new set of skills that you can apply within your current area of expertise.
Whether you are reporting election results, forecasting stock returns, optimizing online ad clicks, identifying microorganisms in microscope photos, seeking new classes of astronomical objects, or working with data in any other field, the goal of this book is to give you the ability to ask and answer new questions about your chosen subject area.
## Who Is This Book For?
In my teaching both at the University of Washington and at various tech-focused conferences and meetups, one of the most common questions I have heard is this: "how should I learn Python?"
The people asking are generally technically minded students, developers, or researchers, often with an already strong background in writing code and using computational and numerical tools.
Most of these folks don't want to learn Python *per se*, but want to learn the language with the aim of using it as a tool for data-intensive and computational science.
While a large patchwork of videos, blog posts, and tutorials for this audience is available online, I've long been frustrated by the lack of a single good answer to this question; that is what inspired this book.
The book is not meant to be an introduction to Python or to programming in general; I assume the reader has familiarity with the Python language, including defining functions, assigning variables, calling methods of objects, controlling the flow of a program, and other basic tasks.
Instead it is meant to help Python users learn to use Python's data science stack (libraries such as IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and related tools) to effectively store, manipulate, and gain insight from data.
## Why Python?
Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets.
This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind.
The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages: *NumPy* for manipulation of homogeneous array-based data, *Pandas* for manipulation of heterogeneous and labeled data, *SciPy* for common scientific computing tasks, *Matplotlib* for publication-quality visualizations, *IPython* for interactive execution and sharing of code, *Scikit-Learn* for machine learning, and many more tools that will be mentioned in the following pages.
If you are looking for a guide to the Python language itself, I would suggest the sister project to this book, "[A Whirlwind Tour of the Python Language](https://github.com/jakevdp/WhirlwindTourOfPython)".
This short report provides a tour of the essential features of the Python language, aimed at data scientists who already are familiar with one or more other programming languages.
## Outline of the Book
Each chapter of this book focuses on a particular package or tool that contributes a fundamental piece of the Python Data Science story.
1. IPython and Jupyter: these packages provide the computational environment in which many Python-using data scientists work.
2. NumPy: this library provides the ``ndarray`` for efficient storage and manipulation of dense data arrays in Python.
3. Pandas: this library provides the ``DataFrame`` for efficient storage and manipulation of labeled/columnar data in Python.
4. Matplotlib: this library provides capabilities for a flexible range of data visualizations in Python.
5. Scikit-Learn: this library provides efficient & clean Python implementations of the most important and established machine learning algorithms.
The PyData world is certainly much larger than these five packages, and is growing every day.
With this in mind, I make every attempt through these pages to provide references to other interesting efforts, projects, and packages that are pushing the boundaries of what can be done in Python.
Nevertheless, these five are currently fundamental to much of the work being done in the Python data science space, and I expect they will remain important even as the ecosystem continues growing around them.
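To make the roles of the first few packages above concrete, here is a minimal sketch (an illustrative addition, not part of the original preface) that builds a NumPy ``ndarray`` and wraps it in a Pandas ``DataFrame``; it assumes the core packages described under Installation Considerations below are installed:
```python
import numpy as np
import pandas as pd

# NumPy's ndarray: a dense, typed array supporting fast elementwise operations
arr = np.arange(5) ** 2          # array([ 0,  1,  4,  9, 16])

# Pandas' DataFrame: labeled, columnar data built on top of NumPy arrays
df = pd.DataFrame({"x": arr, "label": list("abcde")})
print(df.describe())             # summary statistics for the numeric column
```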
## Using Code Examples
Supplemental material (code examples, figures, etc.) is available for download at http://github.com/jakevdp/PythonDataScienceHandbook/. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example:
> *The Python Data Science Handbook* by Jake VanderPlas (O'Reilly). Copyright 2016 Jake VanderPlas, 978-1-491-91205-8.
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
## Installation Considerations
Installing Python and the suite of libraries that enable scientific computing is straightforward. This section will outline some of the considerations when setting up your computer.
Though there are various ways to install Python, the one I would suggest for use in data science is the Anaconda distribution, which works similarly whether you use Windows, Linux, or Mac OS X.
The Anaconda distribution comes in two flavors:
- [Miniconda](http://conda.pydata.org/miniconda.html) gives you the Python interpreter itself, along with a command-line tool called ``conda`` which operates as a cross-platform package manager geared toward Python packages, similar in spirit to the apt or yum tools that Linux users might be familiar with.
- [Anaconda](https://www.continuum.io/downloads) includes both Python and conda, and additionally bundles a suite of other pre-installed packages geared toward scientific computing. Because of the size of this bundle, expect the installation to consume several gigabytes of disk space.
Any of the packages included with Anaconda can also be installed manually on top of Miniconda; for this reason I suggest starting with Miniconda.
To get started, download and install the Miniconda package (make sure to choose a version with Python 3) and then install the core packages used in this book:
```
[~]$ conda install numpy pandas scikit-learn matplotlib seaborn jupyter
```
Throughout the text, we will also make use of other more specialized tools in Python's scientific ecosystem; installation is usually as easy as typing **``conda install packagename``**.
For more information on conda, including information about creating and using conda environments (which I would *highly* recommend), refer to [conda's online documentation](http://conda.pydata.org/docs/).
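As a quick sanity check (a hypothetical snippet, not drawn from the book itself), you can confirm that the core packages installed above import cleanly and report their versions:
```python
# Hypothetical post-install check: import the core packages and print their versions.
import matplotlib
import numpy
import pandas
import seaborn
import sklearn

for mod in (numpy, pandas, sklearn, matplotlib, seaborn):
    print(mod.__name__, mod.__version__)
```
If any of these imports fails, re-run the ``conda install`` command above for the missing package.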


@@ -0,0 +1,122 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPython: Beyond Normal Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many options for development environments for Python, and I'm often asked which one I use in my own work.\n",
"My answer sometimes surprises people: my preferred environment is [IPython](http://ipython.org/) plus a text editor (in my case, Emacs or VSCode depending on my mood).\n",
"IPython (short for *Interactive Python*) was started in 2001 by Fernando Perez as an enhanced Python interpreter, and has since grown into a project aiming to provide, in Perez's words, \"Tools for the entire life cycle of research computing.\"\n",
"If Python is the engine of our data science task, you might think of IPython as the interactive control panel.\n",
"\n",
"As well as being a useful interactive interface to Python, IPython also provides a number of useful syntactic additions to the language; we'll cover the most useful of these additions here.\n",
"In addition, IPython is closely tied with the [Jupyter project](http://jupyter.org), which provides a browser-based notebook that is useful for development, collaboration, sharing, and even publication of data science results.\n",
"The IPython notebook is actually a special case of the broader Jupyter notebook structure, which encompasses notebooks for Julia, R, and other programming languages.\n",
"As an example of the usefulness of the notebook format, look no further than the page you are reading: the entire manuscript for this book was composed as a set of IPython notebooks.\n",
"\n",
"IPython is about using Python effectively for interactive scientific and data-intensive computing.\n",
"This chapter will start by stepping through some of the IPython features that are useful to the practice of data science, focusing especially on the syntax it offers beyond the standard features of Python.\n",
"Next, we will go into a bit more depth on some of the more useful \"magic commands\" that can speed-up common tasks in creating and using data science code.\n",
"Finally, we will touch on some of the features of the notebook that make it useful in understanding data and sharing results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shell or Notebook?\n",
"\n",
"There are two primary means of using IPython that we'll discuss in this chapter: the IPython shell and the IPython notebook.\n",
"The bulk of the material in this chapter is relevant to both, and the examples will switch between them depending on what is most convenient.\n",
"In the few sections that are relevant to just one or the other, we will explicitly state that fact.\n",
"Before we start, some words on how to launch the IPython shell and IPython notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Launching the IPython Shell\n",
"\n",
"This chapter, like most of this book, is not designed to be absorbed passively.\n",
"I recommend that as you read through it, you follow along and experiment with the tools and syntax we cover: the muscle-memory you build through doing this will be far more useful than the simple act of reading about it.\n",
"Start by launching the IPython interpreter by typing **``ipython``** on the command-line; alternatively, if you've installed a distribution like Anaconda or EPD, there may be a launcher specific to your system (we'll discuss this more fully in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)).\n",
"\n",
"Once you do this, you should see a prompt like the following:\n",
"```\n",
"Python 3.9.2 (v3.9.2:1a79785e3e, Feb 19 2021, 09:06:10) \n",
"Type 'copyright', 'credits' or 'license' for more information\n",
"IPython 7.21.0 -- An enhanced Interactive Python. Type '?' for help.\n",
"\n",
"In [1]:\n",
"```\n",
"With that, you're ready to follow along."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Launching the Jupyter Notebook\n",
"\n",
"The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities.\n",
"As well as executing Python/IPython statements, the notebook allows the user to include formatted text, static and dynamic visualizations, mathematical equations, JavaScript widgets, and much more.\n",
"Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems.\n",
"\n",
"Though the IPython notebook is viewed and edited through your web browser window, it must connect to a running Python process in order to execute code.\n",
"This process (known as a \"kernel\") can be started by running the following command in your system shell:\n",
"\n",
"```\n",
"$ jupyter lab\n",
"```\n",
"\n",
"This command will launch a local web server that will be visible to your browser.\n",
"It immediately spits out a log showing what it is doing; that log will look something like this:\n",
"\n",
"```\n",
"$ jupyter lab\n",
"[ServerApp] Serving notebooks from local directory: /Users/jakevdp/PythonDataScienceHandbook\n",
"[ServerApp] Jupyter Server 1.4.1 is running at:\n",
"[ServerApp] http://localhost:8888/lab?token=dd852649\n",
"[ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).\n",
"```\n",
"\n",
"Upon issuing the command, your default browser should automatically open and navigate to the listed local URL;\n",
"the exact address will depend on your system.\n",
"If the browser does not open automatically, you can open a window and manually open this address (*http://localhost:8888/lab/* in this example)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,86 @@
---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.13.0
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# IPython: Beyond Normal Python
There are many options for development environments for Python, and I'm often asked which one I use in my own work.
My answer sometimes surprises people: my preferred environment is [IPython](http://ipython.org/) plus a text editor (in my case, Emacs or VSCode depending on my mood).
IPython (short for *Interactive Python*) was started in 2001 by Fernando Perez as an enhanced Python interpreter, and has since grown into a project aiming to provide, in Perez's words, "Tools for the entire life cycle of research computing."
If Python is the engine of our data science task, you might think of IPython as the interactive control panel.
As well as being a useful interactive interface to Python, IPython also provides a number of useful syntactic additions to the language; we'll cover the most useful of these additions here.
In addition, IPython is closely tied with the [Jupyter project](http://jupyter.org), which provides a browser-based notebook that is useful for development, collaboration, sharing, and even publication of data science results.
The IPython notebook is actually a special case of the broader Jupyter notebook structure, which encompasses notebooks for Julia, R, and other programming languages.
As an example of the usefulness of the notebook format, look no further than the page you are reading: the entire manuscript for this book was composed as a set of IPython notebooks.
IPython is about using Python effectively for interactive scientific and data-intensive computing.
This chapter will start by stepping through some of the IPython features that are useful to the practice of data science, focusing especially on the syntax it offers beyond the standard features of Python.
Next, we will go into a bit more depth on some of the more useful "magic commands" that can speed-up common tasks in creating and using data science code.
Finally, we will touch on some of the features of the notebook that make it useful in understanding data and sharing results.
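As a brief taste of what such a "magic command" looks like (this particular example is an added illustration, not part of the original chapter text), the ``%timeit`` magic measures how long a Python statement takes to run:
```ipython
In [1]: %timeit sum(range(100))
```
IPython responds with timing statistics for the expression; the exact numbers depend on your machine. We'll look at magic commands like this in more detail later in the chapter.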
## Shell or Notebook?
There are two primary means of using IPython that we'll discuss in this chapter: the IPython shell and the IPython notebook.
The bulk of the material in this chapter is relevant to both, and the examples will switch between them depending on what is most convenient.
In the few sections that are relevant to just one or the other, we will explicitly state that fact.
Before we start, some words on how to launch the IPython shell and IPython notebook.
### Launching the IPython Shell
This chapter, like most of this book, is not designed to be absorbed passively.
I recommend that as you read through it, you follow along and experiment with the tools and syntax we cover: the muscle-memory you build through doing this will be far more useful than the simple act of reading about it.
Start by launching the IPython interpreter by typing **``ipython``** on the command-line; alternatively, if you've installed a distribution like Anaconda or EPD, there may be a launcher specific to your system (we'll discuss this more fully in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)).
Once you do this, you should see a prompt like the following:
```
Python 3.9.2 (v3.9.2:1a79785e3e, Feb 19 2021, 09:06:10)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.21.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]:
```
With that, you're ready to follow along.
### Launching the Jupyter Notebook
The Jupyter notebook is a browser-based graphical interface to the IPython shell, and builds on it a rich set of dynamic display capabilities.
As well as executing Python/IPython statements, the notebook allows the user to include formatted text, static and dynamic visualizations, mathematical equations, JavaScript widgets, and much more.
Furthermore, these documents can be saved in a way that lets other people open them and execute the code on their own systems.
Though the IPython notebook is viewed and edited through your web browser window, it must connect to a running Python process in order to execute code.
This process (known as a "kernel") can be started by running the following command in your system shell:
```
$ jupyter lab
```
This command will launch a local web server that will be visible to your browser.
It immediately spits out a log showing what it is doing; that log will look something like this:
```
$ jupyter lab
[ServerApp] Serving notebooks from local directory: /Users/jakevdp/PythonDataScienceHandbook
[ServerApp] Jupyter Server 1.4.1 is running at:
[ServerApp] http://localhost:8888/lab?token=dd852649
[ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
```
Upon issuing the command, your default browser should automatically open and navigate to the listed local URL;
the exact address will depend on your system.
If the browser does not open automatically, you can open a browser window and navigate to this address manually (*http://localhost:8888/lab/* in this example).


@@ -0,0 +1,321 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Help and Documentation in IPython"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you read no other section in this chapter, read this one: I find the tools discussed here to be the most transformative contributions of IPython to my daily workflow.\n",
"\n",
"When a technologically-minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it's less a matter of knowing the answer as much as knowing how to quickly find an unknown answer.\n",
"In data science it's the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you've found yourself searching before.\n",
"Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don't know, whether through a web search engine or another means.\n",
"\n",
"One of the most useful functions of IPython/Jupyter is to shorten the gap between the user and the type of documentation and search that will help them do their work effectively.\n",
"While web searches still play a role in answering complicated questions, an amazing amount of information can be found through IPython alone.\n",
"Some examples of the questions IPython can help answer in a few keystrokes:\n",
"\n",
"- How do I call this function? What arguments and options does it have?\n",
"- What does the source code of this Python object look like?\n",
"- What is in this package I imported? What attributes or methods does this object have?\n",
"\n",
"Here we'll discuss IPython's tools to quickly access this information, namely the ``?`` character to explore documentation, the ``??`` characters to explore source code, and the Tab key for auto-completion."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing Documentation with ``?``\n",
"\n",
"The Python language and its data science ecosystem is built with the user in mind, and one big part of that is access to documentation.\n",
"Every Python object contains the reference to a string, known as a *doc string*, which in most cases will contain a concise summary of the object and how to use it.\n",
"Python has a built-in ``help()`` function that can access this information and prints the results.\n",
"For example, to see the documentation of the built-in ``len`` function, you can do the following:\n",
"\n",
"```ipython\n",
"In [1]: help(len)\n",
"Help on built-in function len in module builtins:\n",
"\n",
"len(obj, /)\n",
" Return the number of items in a container.\n",
"```\n",
"\n",
"Depending on your interpreter, this information may be displayed as inline text, or in some separate pop-up window."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because finding help on an object is so common and useful, IPython introduces the ``?`` character as a shorthand for accessing this documentation and other relevant information:\n",
"\n",
"```ipython\n",
"In [2]: len?\n",
"Signature: len(obj, /)\n",
"Docstring: Return the number of items in a container.\n",
"Type: builtin_function_or_method\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notation works for just about anything, including object methods:\n",
"\n",
"```ipython\n",
"In [3]: L = [1, 2, 3]\n",
"In [4]: L.insert?\n",
"Signature: L.insert(index, object, /)\n",
"Docstring: Insert object before index.\n",
"Type: builtin_function_or_method\n",
"```\n",
"\n",
"or even objects themselves, with the documentation from their type:\n",
"\n",
"```ipython\n",
"In [5]: L?\n",
"Type: list\n",
"String form: [1, 2, 3]\n",
"Length: 3\n",
"Docstring: \n",
"Built-in mutable sequence.\n",
"\n",
"If no argument is given, the constructor creates a new empty list.\n",
"The argument must be an iterable if specified.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Importantly, this will even work for functions or other objects you create yourself!\n",
"Here we'll define a small function with a docstring:\n",
"\n",
"```ipython\n",
"In [6]: def square(a):\n",
" ....: \"\"\"Return the square of a.\"\"\"\n",
" ....: return a ** 2\n",
" ....:\n",
"```\n",
"\n",
"Note that to create a docstring for our function, we simply placed a string literal in the first line.\n",
"Because doc strings are usually multiple lines, by convention we used Python's triple-quote notation for multi-line strings."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we'll use the ``?`` mark to find this doc string:\n",
"\n",
"```ipython\n",
"In [7]: square?\n",
"Signature: square(a)\n",
"Docstring: Return the square of a.\n",
"File: <ipython-input-6>\n",
"Type: function\n",
"```\n",
"\n",
"This quick access to documentation via docstrings is one reason you should get in the habit of always adding such inline documentation to the code you write!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Accessing Source Code with ``??``\n",
"Because the Python language is so easily readable, another level of insight can usually be gained by reading the source code of the object you're curious about.\n",
"IPython provides a shortcut to the source code with the double question mark (``??``):\n",
"\n",
"```ipython\n",
"In [8]: square??\n",
"Signature: square(a)\n",
"Source: \n",
"def square(a):\n",
" \"\"\"Return the square of a.\"\"\"\n",
" return a ** 2\n",
"File: <ipython-input-6>\n",
"Type: function\n",
"```\n",
"\n",
"For simple functions like this, the double question-mark can give quick insight into the under-the-hood details."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you play with this much, you'll notice that sometimes the ``??`` suffix doesn't display any source code: this is generally because the object in question is not implemented in Python, but in C or some other compiled extension language.\n",
"If this is the case, the ``??`` suffix gives the same output as the ``?`` suffix.\n",
"You'll find this particularly with many of Python's built-in objects and types, for example ``len`` from above:\n",
"\n",
"```ipython\n",
"In [9]: len??\n",
"Signature: len(obj, /)\n",
"Docstring: Return the number of items in a container.\n",
"Type: builtin_function_or_method\n",
"```\n",
"\n",
"Using ``?`` and/or ``??`` gives a powerful and quick interface for finding information about what any Python function or module does."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploring Modules with Tab-Completion\n",
"\n",
"IPython's other useful interface is the use of the tab key for auto-completion and exploration of the contents of objects, modules, and name-spaces.\n",
"In the examples that follow, we'll use ``<TAB>`` to indicate when the Tab key should be pressed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tab-completion of object contents\n",
"\n",
"Every Python object has various attributes and methods associated with it.\n",
"Like with the ``help`` function discussed before, Python has a built-in ``dir`` function that returns a list of these, but the tab-completion interface is much easier to use in practice.\n",
"To see a list of all available attributes of an object, you can type the name of the object followed by a period (\"``.``\") character and the Tab key:\n",
"\n",
"```ipython\n",
"In [10]: L.<TAB>\n",
" append() count insert reverse \n",
" clear extend pop sort \n",
" copy index remove \n",
"```\n",
"\n",
"To narrow-down the list, you can type the first character or several characters of the name, and the Tab key will find the matching attributes and methods:\n",
"\n",
"```ipython\n",
"In [10]: L.c<TAB>\n",
" clear() count()\n",
" copy() \n",
"\n",
"In [10]: L.co<TAB>\n",
" copy() count()\n",
"```\n",
"\n",
"If there is only a single option, pressing the Tab key will complete the line for you.\n",
"For example, the following will instantly be replaced with ``L.count``:\n",
"\n",
"```ipython\n",
"In [10]: L.cou<TAB>\n",
"\n",
"```\n",
"\n",
"Though Python has no strictly-enforced distinction between public/external attributes and private/internal attributes, by convention a preceding underscore is used to denote such methods.\n",
"For clarity, these private methods and special methods are omitted from the list by default, but it's possible to list them by explicitly typing the underscore:\n",
"\n",
"```ipython\n",
"In [10]: L._<TAB>\n",
" __add__ __delattr__ __eq__ \n",
" __class__ __delitem__ __format__()\n",
" __class_getitem__() __dir__() __ge__ >\n",
" __contains__ __doc__ __getattribute__ \n",
"```\n",
"\n",
"For brevity, we've only shown the first few columns of the output.\n",
"Most of these are Python's special double-underscore methods (often nicknamed \"dunder\" methods)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tab completion when importing\n",
"\n",
"Tab completion is also useful when importing objects from packages.\n",
"Here we'll use it to find all possible imports in the ``itertools`` package that start with ``co``:\n",
"```\n",
"In [10]: from itertools import co<TAB>\n",
" combinations() compress()\n",
" combinations_with_replacement() count()\n",
"```\n",
"Similarly, you can use tab-completion to see which imports are available on your system (this will change depending on which third-party scripts and modules are visible to your Python session):\n",
"```\n",
"In [10]: import <TAB>\n",
" abc anyio \n",
" activate_this appdirs \n",
" aifc appnope >\n",
" antigravity argon2 \n",
"\n",
"In [10]: import h<TAB>\n",
" hashlib html \n",
" heapq http \n",
" hmac \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Beyond tab completion: wildcard matching\n",
"\n",
"Tab completion is useful if you know the first few characters of the object or attribute you're looking for, but is little help if you'd like to match characters at the middle or end of the word.\n",
"For this use-case, IPython provides a means of wildcard matching for names using the ``*`` character.\n",
"\n",
"For example, we can use this to list every object in the namespace that ends with ``Warning``:\n",
"\n",
"```ipython\n",
"In [10]: *Warning?\n",
"BytesWarning RuntimeWarning\n",
"DeprecationWarning SyntaxWarning\n",
"FutureWarning UnicodeWarning\n",
"ImportWarning UserWarning\n",
"PendingDeprecationWarning Warning\n",
"ResourceWarning\n",
"```\n",
"\n",
"Notice that the ``*`` character matches any string, including the empty string.\n",
"\n",
"Similarly, suppose we are looking for a string method that contains the word ``find`` somewhere in its name.\n",
"We can search for it this way:\n",
"\n",
"```ipython\n",
"In [11]: str.*find*?\n",
"str.find\n",
"str.rfind\n",
"```\n",
"\n",
"I find this type of flexible wildcard search can be useful for finding a particular command when getting to know a new package or reacquainting myself with a familiar one."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,253 @@
---
jupyter:
  jupytext:
    formats: ipynb,md
    text_representation:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.13.0
  kernelspec:
    display_name: Python 3
    language: python
    name: python3
---
# Help and Documentation in IPython
If you read no other section in this chapter, read this one: I find the tools discussed here to be the most transformative contributions of IPython to my daily workflow.
When a technologically minded person is asked to help a friend, family member, or colleague with a computer problem, most of the time it's less a matter of knowing the answer than of knowing how to quickly find an unknown answer.
In data science it's the same: searchable web resources such as online documentation, mailing-list threads, and StackOverflow answers contain a wealth of information, even (especially?) if it is a topic you've found yourself searching before.
Being an effective practitioner of data science is less about memorizing the tool or command you should use for every possible situation, and more about learning to effectively find the information you don't know, whether through a web search engine or another means.
One of the most useful functions of IPython/Jupyter is to shorten the gap between the user and the type of documentation and search that will help them do their work effectively.
While web searches still play a role in answering complicated questions, an amazing amount of information can be found through IPython alone.
Some examples of the questions IPython can help answer in a few keystrokes:
- How do I call this function? What arguments and options does it have?
- What does the source code of this Python object look like?
- What is in this package I imported? What attributes or methods does this object have?
Here we'll discuss IPython's tools to quickly access this information, namely the ``?`` character to explore documentation, the ``??`` characters to explore source code, and the Tab key for auto-completion.
## Accessing Documentation with ``?``
The Python language and its data science ecosystem are built with the user in mind, and one big part of that is access to documentation.
Every Python object contains a reference to a string, known as a *doc string*, which in most cases will contain a concise summary of the object and how to use it.
Python has a built-in ``help()`` function that can access this information and print the results.
For example, to see the documentation of the built-in ``len`` function, you can do the following:
```ipython
In [1]: help(len)
Help on built-in function len in module builtins:
len(obj, /)
Return the number of items in a container.
```
Depending on your interpreter, this information may be displayed as inline text, or in some separate pop-up window.
Because finding help on an object is so common and useful, IPython introduces the ``?`` character as a shorthand for accessing this documentation and other relevant information:
```ipython
In [2]: len?
Signature: len(obj, /)
Docstring: Return the number of items in a container.
Type: builtin_function_or_method
```
This notation works for just about anything, including object methods:
```ipython
In [3]: L = [1, 2, 3]
In [4]: L.insert?
Signature: L.insert(index, object, /)
Docstring: Insert object before index.
Type: builtin_function_or_method
```
or even objects themselves, with the documentation from their type:
```ipython
In [5]: L?
Type: list
String form: [1, 2, 3]
Length: 3
Docstring:
Built-in mutable sequence.
If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.
```
Importantly, this will even work for functions or other objects you create yourself!
Here we'll define a small function with a docstring:
```ipython
In [6]: def square(a):
....: """Return the square of a."""
....: return a ** 2
....:
```
Note that to create a docstring for our function, we simply placed a string literal in the first line.
Because doc strings are usually multiple lines, by convention we used Python's triple-quote notation for multi-line strings.
Now we'll use the ``?`` mark to find this doc string:
```ipython
In [7]: square?
Signature: square(a)
Docstring: Return the square of a.
File: <ipython-input-6>
Type: function
```
This quick access to documentation via docstrings is one reason you should get in the habit of always adding such inline documentation to the code you write!
## Accessing Source Code with ``??``
Because the Python language is so easily readable, another level of insight can usually be gained by reading the source code of the object you're curious about.
IPython provides a shortcut to the source code with the double question mark (``??``):
```ipython
In [8]: square??
Signature: square(a)
Source:
def square(a):
"""Return the square of a."""
return a ** 2
File: <ipython-input-6>
Type: function
```
For simple functions like this, the double question-mark can give quick insight into the under-the-hood details.
If you play with this much, you'll notice that sometimes the ``??`` suffix doesn't display any source code: this is generally because the object in question is not implemented in Python, but in C or some other compiled extension language.
If this is the case, the ``??`` suffix gives the same output as the ``?`` suffix.
You'll find this particularly with many of Python's built-in objects and types, for example ``len`` from above:
```ipython
In [9]: len??
Signature: len(obj, /)
Docstring: Return the number of items in a container.
Type: builtin_function_or_method
```
Using ``?`` and/or ``??`` gives a powerful and quick interface for finding information about what any Python function or module does.
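The same approach works for whole packages (an added illustration, not part of the original text); after importing a module, you can ask IPython about it directly:
```ipython
In [10]: import numpy
In [11]: numpy?
```
IPython reports the object's type (``module``), the file it was loaded from, and its docstring.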
## Exploring Modules with Tab-Completion
IPython's other useful interface is the use of the tab key for auto-completion and exploration of the contents of objects, modules, and name-spaces.
In the examples that follow, we'll use ``<TAB>`` to indicate when the Tab key should be pressed.
### Tab-completion of object contents
Every Python object has various attributes and methods associated with it.
Like with the ``help`` function discussed before, Python has a built-in ``dir`` function that returns a list of these, but the tab-completion interface is much easier to use in practice.
To see a list of all available attributes of an object, you can type the name of the object followed by a period ("``.``") character and the Tab key:
```ipython
In [10]: L.<TAB>
append() count insert reverse
clear extend pop sort
copy index remove
```
To narrow-down the list, you can type the first character or several characters of the name, and the Tab key will find the matching attributes and methods:
```ipython
In [10]: L.c<TAB>
clear() count()
copy()
In [10]: L.co<TAB>
copy() count()
```
If there is only a single option, pressing the Tab key will complete the line for you.
For example, the following will instantly be replaced with ``L.count``:
```ipython
In [10]: L.cou<TAB>
```
Though Python has no strictly-enforced distinction between public/external attributes and private/internal attributes, by convention a preceding underscore is used to denote such methods.
For clarity, these private methods and special methods are omitted from the list by default, but it's possible to list them by explicitly typing the underscore:
```ipython
In [10]: L._<TAB>
__add__ __delattr__ __eq__
__class__ __delitem__ __format__()
__class_getitem__() __dir__() __ge__ >
__contains__ __doc__ __getattribute__
```
For brevity, we've only shown the first few columns of the output.
Most of these are Python's special double-underscore methods (often nicknamed "dunder" methods).
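For comparison with the ``dir`` function mentioned above (an added illustration, not part of the original text), the same non-underscore names offered by tab completion can be listed programmatically:
```python
# List the public attributes of a list object, mirroring what L.<TAB> displays.
L = [1, 2, 3]
print([name for name in dir(L) if not name.startswith("_")])
# ['append', 'clear', 'copy', 'count', 'extend', 'index', 'insert', 'pop',
#  'remove', 'reverse', 'sort']
```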
### Tab completion when importing
Tab completion is also useful when importing objects from packages.
Here we'll use it to find all possible imports in the ``itertools`` package that start with ``co``:
```
In [10]: from itertools import co<TAB>
combinations() compress()
combinations_with_replacement() count()
```
Similarly, you can use tab-completion to see which imports are available on your system (this will change depending on which third-party scripts and modules are visible to your Python session):
```
In [10]: import <TAB>
abc anyio
activate_this appdirs
aifc appnope >
antigravity argon2
In [10]: import h<TAB>
hashlib html
heapq http
hmac
```
### Beyond tab completion: wildcard matching
Tab completion is useful if you know the first few characters of the object or attribute you're looking for, but is little help if you'd like to match characters at the middle or end of the word.
For this use-case, IPython provides a means of wildcard matching for names using the ``*`` character.
For example, we can use this to list every object in the namespace that ends with ``Warning``:
```ipython
In [10]: *Warning?
BytesWarning RuntimeWarning
DeprecationWarning SyntaxWarning
FutureWarning UnicodeWarning
ImportWarning UserWarning
PendingDeprecationWarning Warning
ResourceWarning
```
Notice that the ``*`` character matches any string, including the empty string.
Similarly, suppose we are looking for a string method that contains the word ``find`` somewhere in its name.
We can search for it this way:
```ipython
In [11]: str.*find*?
str.find
str.rfind
```
I find this type of flexible wildcard search can be useful for finding a particular command when getting to know a new package or reacquainting myself with a familiar one.


@@ -0,0 +1,178 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Keyboard Shortcuts in the IPython Shell"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you spend any amount of time on the computer, you've probably found a use for keyboard shortcuts in your workflow.\n",
"Most familiar perhaps are the Cmd-C and Cmd-V (or Ctrl-C and Ctrl-V) for copying and pasting in a wide variety of programs and systems.\n",
"Power-users tend to go even further: popular text editors like Emacs, Vim, and others provide users an incredible range of operations through intricate combinations of keystrokes.\n",
"\n",
"The IPython shell doesn't go this far, but does provide a number of keyboard shortcuts for fast navigation while typing commands.\n",
"These shortcuts are not in fact provided by IPython itself, but through its dependency on the GNU Readline library: as such, some of the following shortcuts may differ depending on your system configuration.\n",
"Also, while some of these shortcuts do work in the browser-based notebook, this section is primarily about shortcuts in the IPython shell.\n",
"\n",
"Once you get accustomed to these, they can be very useful for quickly performing certain commands without moving your hands from the \"home\" keyboard position.\n",
"If you're an Emacs user or if you have experience with Linux-style shells, the following will be very familiar.\n",
"We'll group these shortcuts into a few categories: *navigation shortcuts*, *text entry shortcuts*, *command history shortcuts*, and *miscellaneous shortcuts*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Navigation shortcuts\n",
"\n",
"While the use of the left and right arrow keys to move backward and forward in the line is quite obvious, there are other options that don't require moving your hands from the \"home\" keyboard position:\n",
"\n",
"| Keystroke | Action |\n",
"|-----------------------------------|--------------------------------------------|\n",
"| ``Ctrl-a`` | Move cursor to the beginning of the line |\n",
"| ``Ctrl-e`` | Move cursor to the end of the line |\n",
"| ``Ctrl-b`` or the left arrow key | Move cursor back one character |\n",
"| ``Ctrl-f`` or the right arrow key | Move cursor forward one character |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text Entry Shortcuts\n",
"\n",
"While everyone is familiar with using the Backspace key to delete the previous character, reaching for the key often requires some minor finger gymnastics, and it only deletes a single character at a time.\n",
"In IPython there are several shortcuts for removing some portion of the text you're typing.\n",
"The most immediately useful of these are the commands to delete entire lines of text.\n",
"You'll know these have become second-nature if you find yourself using a combination of Ctrl-b and Ctrl-d instead of reaching for Backspace to delete the previous character!\n",
"\n",
"| Keystroke | Action |\n",
"|-------------------------------|--------------------------------------------------|\n",
"| Backspace key | Delete previous character in line |\n",
"| ``Ctrl-d`` | Delete next character in line |\n",
"| ``Ctrl-k`` | Cut text from cursor to end of line |\n",
"| ``Ctrl-u`` | Cut text from beginning of line to cursor |\n",
"| ``Ctrl-y`` | Yank (i.e. paste) text that was previously cut |\n",
"| ``Ctrl-t`` | Transpose (i.e., switch) previous two characters |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Command History Shortcuts\n",
"\n",
"Perhaps the most impactful shortcuts discussed here are the ones IPython provides for navigating the command history.\n",
"This command history goes beyond your current IPython session: your entire command history is stored in a SQLite database in your IPython profile directory.\n",
"The most straightforward way to access these is with the up and down arrow keys to step through the history, but other options exist as well:\n",
"\n",
"| Keystroke | Action |\n",
"|-------------------------------------|--------------------------------------------|\n",
"| ``Ctrl-p`` (or the up arrow key) | Access previous command in history |\n",
"| ``Ctrl-n`` (or the down arrow key) | Access next command in history |\n",
"| ``Ctrl-r`` | Reverse-search through command history |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reverse-search can be particularly useful.\n",
"Recall that in the previous section we defined a function called ``square``.\n",
"Let's reverse-search our Python history from a new IPython shell and find this definition again.\n",
"When you press Ctrl-r in the IPython terminal, you'll see the following prompt:\n",
"\n",
"```ipython\n",
"In [1]:\n",
"(reverse-i-search)`': \n",
"```\n",
"\n",
"If you start typing characters at this prompt, IPython will auto-fill the most recent command, if any, that matches those characters:\n",
"\n",
"```ipython\n",
"In [1]: \n",
"(reverse-i-search)`sqa': square??\n",
"```\n",
"\n",
"At any point, you can add more characters to refine the search, or press Ctrl-r again to search further for another command that matches the query. If you followed along in the previous section, pressing Ctrl-r twice more gives:\n",
"\n",
"```ipython\n",
"In [1]: \n",
"(reverse-i-search)`sqa': def square(a):\n",
" \"\"\"Return the square of a\"\"\"\n",
" return a ** 2\n",
"```\n",
"\n",
"Once you have found the command you're looking for, press Return and the search will end.\n",
"We can then use the retrieved command, and carry-on with our session:\n",
"\n",
"```ipython\n",
"In [1]: def square(a):\n",
" \"\"\"Return the square of a\"\"\"\n",
" return a ** 2\n",
"\n",
"In [2]: square(2)\n",
"Out[2]: 4\n",
"```\n",
"\n",
"Note that Ctrl-p/Ctrl-n or the up/down arrow keys can also be used to search through history, but only by matching characters at the beginning of the line.\n",
"That is, if you type **``def``** and then press Ctrl-p, it would find the most recent command (if any) in your history that begins with the characters ``def``."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Miscellaneous Shortcuts\n",
"\n",
"Finally, there are a few miscellaneous shortcuts that don't fit into any of the preceding categories, but are nevertheless useful to know:\n",
"\n",
"| Keystroke | Action |\n",
"|-------------------------------|--------------------------------------------|\n",
"| ``Ctrl-l`` | Clear terminal screen |\n",
"| ``Ctrl-c`` | Interrupt current Python command |\n",
"| ``Ctrl-d`` | Exit IPython session |\n",
"\n",
"The Ctrl-c in particular can be useful when you inadvertently start a very long-running job."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While some of the shortcuts discussed here may seem a bit tedious at first, they quickly become automatic with practice.\n",
"Once you develop that muscle memory, I suspect you will even find yourself wishing they were available in other contexts."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,130 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Keyboard Shortcuts in the IPython Shell
If you spend any amount of time on the computer, you've probably found a use for keyboard shortcuts in your workflow.
Most familiar perhaps are the Cmd-C and Cmd-V (or Ctrl-C and Ctrl-V) for copying and pasting in a wide variety of programs and systems.
Power-users tend to go even further: popular text editors like Emacs, Vim, and others provide users an incredible range of operations through intricate combinations of keystrokes.
The IPython shell doesn't go this far, but does provide a number of keyboard shortcuts for fast navigation while typing commands.
These shortcuts are not in fact provided by IPython itself, but through its dependency on the GNU Readline library: as such, some of the following shortcuts may differ depending on your system configuration.
Also, while some of these shortcuts do work in the browser-based notebook, this section is primarily about shortcuts in the IPython shell.
Once you get accustomed to these, they can be very useful for quickly performing certain commands without moving your hands from the "home" keyboard position.
If you're an Emacs user or if you have experience with Linux-style shells, the following will be very familiar.
We'll group these shortcuts into a few categories: *navigation shortcuts*, *text entry shortcuts*, *command history shortcuts*, and *miscellaneous shortcuts*.
## Navigation Shortcuts
While the use of the left and right arrow keys to move backward and forward in the line is quite obvious, there are other options that don't require moving your hands from the "home" keyboard position:
| Keystroke | Action |
|-----------------------------------|--------------------------------------------|
| ``Ctrl-a`` | Move cursor to the beginning of the line |
| ``Ctrl-e`` | Move cursor to the end of the line |
| ``Ctrl-b`` or the left arrow key | Move cursor back one character |
| ``Ctrl-f`` or the right arrow key | Move cursor forward one character |
## Text Entry Shortcuts
While everyone is familiar with using the Backspace key to delete the previous character, reaching for the key often requires some minor finger gymnastics, and it only deletes a single character at a time.
In IPython there are several shortcuts for removing some portion of the text you're typing.
The most immediately useful of these are the commands to delete entire lines of text.
You'll know these have become second nature if you find yourself using a combination of Ctrl-b and Ctrl-d instead of reaching for Backspace to delete the previous character!
| Keystroke | Action |
|-------------------------------|--------------------------------------------------|
| Backspace key | Delete previous character in line |
| ``Ctrl-d`` | Delete next character in line |
| ``Ctrl-k`` | Cut text from cursor to end of line |
| ``Ctrl-u`` | Cut text from beginning of line to cursor |
| ``Ctrl-y`` | Yank (i.e. paste) text that was previously cut |
| ``Ctrl-t`` | Transpose (i.e., switch) previous two characters |
## Command History Shortcuts
Perhaps the most impactful shortcuts discussed here are the ones IPython provides for navigating the command history.
This command history goes beyond your current IPython session: your entire command history is stored in a SQLite database in your IPython profile directory.
The most straightforward way to access these is with the up and down arrow keys to step through the history, but other options exist as well:
| Keystroke | Action |
|-------------------------------------|--------------------------------------------|
| ``Ctrl-p`` (or the up arrow key) | Access previous command in history |
| ``Ctrl-n`` (or the down arrow key) | Access next command in history |
| ``Ctrl-r`` | Reverse-search through command history |
The reverse-search can be particularly useful.
Recall that in the previous section we defined a function called ``square``.
Let's reverse-search our Python history from a new IPython shell and find this definition again.
When you press Ctrl-r in the IPython terminal, you'll see the following prompt:
```ipython
In [1]:
(reverse-i-search)`':
```
If you start typing characters at this prompt, IPython will auto-fill the most recent command, if any, that matches those characters:
```ipython
In [1]:
(reverse-i-search)`sqa': square??
```
At any point, you can add more characters to refine the search, or press Ctrl-r again to search further for another command that matches the query. If you followed along in the previous section, pressing Ctrl-r twice more gives:
```ipython
In [1]:
(reverse-i-search)`sqa': def square(a):
"""Return the square of a"""
return a ** 2
```
Once you have found the command you're looking for, press Return and the search will end.
We can then use the retrieved command, and carry-on with our session:
```ipython
In [1]: def square(a):
"""Return the square of a"""
return a ** 2
In [2]: square(2)
Out[2]: 4
```
Note that Ctrl-p/Ctrl-n or the up/down arrow keys can also be used to search through history, but only by matching characters at the beginning of the line.
That is, if you type **``def``** and then press Ctrl-p, it would find the most recent command (if any) in your history that begins with the characters ``def``.
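Because this history is persisted in a SQLite database in your IPython profile directory, you can also inspect it outside of IPython. The following is a minimal sketch; the default path and the ``history`` table layout it assumes may differ depending on your configuration:
```python
import sqlite3
from pathlib import Path
# Assumed default location of IPython's history database
db_path = Path.home() / ".ipython" / "profile_default" / "history.sqlite"
with sqlite3.connect(db_path) as conn:
    # The history table is assumed to hold (session, line, source, ...) rows
    rows = conn.execute(
        "SELECT session, line, source FROM history "
        "ORDER BY session DESC, line DESC LIMIT 5"
    )
    for session, line, source in rows:
        print(f"[{session}:{line}] {source}")
```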
## Miscellaneous Shortcuts
Finally, there are a few miscellaneous shortcuts that don't fit into any of the preceding categories, but are nevertheless useful to know:
| Keystroke | Action |
|-------------------------------|--------------------------------------------|
| ``Ctrl-l`` | Clear terminal screen |
| ``Ctrl-c`` | Interrupt current Python command |
| ``Ctrl-d`` | Exit IPython session |
The Ctrl-c in particular can be useful when you inadvertently start a very long-running job.
While some of the shortcuts discussed here may seem a bit tedious at first, they quickly become automatic with practice.
Once you develop that muscle memory, I suspect you will even find yourself wishing they were available in other contexts.

View File

@ -0,0 +1,209 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPython Magic Commands"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The previous two sections showed how IPython lets you use and explore Python efficiently and interactively.\n",
"Here we'll begin discussing some of the enhancements that IPython adds on top of the normal Python syntax.\n",
"These are known in IPython as *magic commands*, and are prefixed by the ``%`` character.\n",
"These magic commands are designed to succinctly solve various common problems in standard data analysis.\n",
"Magic commands come in two flavors: *line magics*, which are denoted by a single ``%`` prefix and operate on a single line of input, and *cell magics*, which are denoted by a double ``%%`` prefix and operate on multiple lines of input.\n",
"We'll demonstrate and discuss a few brief examples here, and come back to more focused discussion of several useful magic commands later in the chapter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pasting Code Blocks: ``%paste`` and ``%cpaste``\n",
"\n",
"When working in the IPython interpreter, one common gotcha is that pasting multi-line code blocks can lead to unexpected errors, especially when indentation and interpreter markers are involved.\n",
"A common case is that you find some example code on a website and want to paste it into your interpreter.\n",
"Consider the following simple function:\n",
"\n",
"``` python\n",
">>> def donothing(x):\n",
"... return x\n",
"\n",
"```\n",
"The code is formatted as it would appear in the Python interpreter, and if you copy and paste this directly into older IPython versions, you get an error:\n",
"\n",
"```ipython\n",
"In [2]: >>> def donothing(x):\n",
" ...: ... return x\n",
" ...: \n",
" File \"<ipython-input-20-5a66c8964687>\", line 2\n",
" ... return x\n",
" ^\n",
"SyntaxError: invalid syntax\n",
"```\n",
"\n",
"In the direct paste, the interpreter is confused by the additional prompt characters.\n",
"But never fearIPython's ``%paste`` magic function is designed to handle this exact type of multi-line, marked-up input:\n",
"\n",
"```ipython\n",
"In [3]: %paste\n",
">>> def donothing(x):\n",
"... return x\n",
"\n",
"## -- End pasted text --\n",
"```\n",
"\n",
"The ``%paste`` command both enters and executes the code, so now the function is ready to be used:\n",
"\n",
"```ipython\n",
"In [4]: donothing(10)\n",
"Out[4]: 10\n",
"```\n",
"\n",
"A command with a similar intent is ``%cpaste``, which opens up an interactive multiline prompt in which you can paste one or more chunks of code to be executed in a batch:\n",
"\n",
"```ipython\n",
"In [5]: %cpaste\n",
"Pasting code; enter '--' alone on the line to stop or use Ctrl-D.\n",
":>>> def donothing(x):\n",
":... return x\n",
":--\n",
"```\n",
"\n",
"These magic commands, like others we'll see, make available functionality that would be difficult or impossible in a standard Python interpreter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running External Code: ``%run``\n",
"As you begin developing more extensive code, you will likely find yourself working in both IPython for interactive exploration, as well as a text editor to store code that you want to reuse.\n",
"Rather than running this code in a new window, it can be convenient to run it within your IPython session.\n",
"This can be done with the ``%run`` magic.\n",
"\n",
"For example, imagine you've created a ``myscript.py`` file with the following contents:\n",
"\n",
"```python\n",
"#-------------------------------------\n",
"# file: myscript.py\n",
"\n",
"def square(x):\n",
" \"\"\"square a number\"\"\"\n",
" return x ** 2\n",
"\n",
"for N in range(1, 4):\n",
" print(f\"{N} squared is {square(N)}\")\n",
"```\n",
"\n",
"You can execute this from your IPython session as follows:\n",
"\n",
"```ipython\n",
"In [6]: %run myscript.py\n",
"1 squared is 1\n",
"2 squared is 4\n",
"3 squared is 9\n",
"```\n",
"\n",
"Note also that after you've run this script, any functions defined within it are available for use in your IPython session:\n",
"\n",
"```ipython\n",
"In [7]: square(5)\n",
"Out[7]: 25\n",
"```\n",
"\n",
"There are several options to fine-tune how your code is run; you can see the documentation in the normal way, by typing **``%run?``** in the IPython interpreter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing Code Execution: ``%timeit``\n",
"Another example of a useful magic function is ``%timeit``, which will automatically determine the execution time of the single-line Python statement that follows it.\n",
"For example, we may want to check the performance of a list comprehension:\n",
"\n",
"```ipython\n",
"In [8]: %timeit L = [n ** 2 for n in range(1000)]\n",
"430 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
"```\n",
"\n",
"The benefit of ``%timeit`` is that for short commands it will automatically perform multiple runs in order to attain more robust results.\n",
"For multi line statements, adding a second ``%`` sign will turn this into a cell magic that can handle multiple lines of input.\n",
"For example, here's the equivalent construction with a ``for``-loop:\n",
"\n",
"```ipython\n",
"In [9]: %%timeit\n",
" ...: L = []\n",
" ...: for n in range(1000):\n",
" ...: L.append(n ** 2)\n",
" ...: \n",
"484 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n",
"```\n",
"\n",
"We can immediately see that list comprehensions are about 10% faster than the equivalent ``for``-loop construction in this case.\n",
"We'll explore ``%timeit`` and other approaches to timing and profiling code in [Profiling and Timing Code](01.07-Timing-and-Profiling.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Help on Magic Functions: ``?``, ``%magic``, and ``%lsmagic``\n",
"\n",
"Like normal Python functions, IPython magic functions have docstrings, and this useful\n",
"documentation can be accessed in the standard manner.\n",
"So, for example, to read the documentation of the ``%timeit`` magic simply type this:\n",
"\n",
"```ipython\n",
"In [10]: %timeit?\n",
"```\n",
"\n",
"Documentation for other functions can be accessed similarly.\n",
"To access a general description of available magic functions, including some examples, you can type this:\n",
"\n",
"```ipython\n",
"In [11]: %magic\n",
"```\n",
"\n",
"For a quick and simple list of all available magic functions, type this:\n",
"\n",
"```ipython\n",
"In [12]: %lsmagic\n",
"```\n",
"\n",
"Finally, I'll mention that it is quite straightforward to define your own magic functions if you wish.\n",
"We won't discuss it here, but if you are interested, see the references listed in [More IPython Resources](01.08-More-IPython-Resources.ipynb)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,170 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# IPython Magic Commands
The previous two sections showed how IPython lets you use and explore Python efficiently and interactively.
Here we'll begin discussing some of the enhancements that IPython adds on top of the normal Python syntax.
These are known in IPython as *magic commands*, and are prefixed by the ``%`` character.
These magic commands are designed to succinctly solve various common problems in standard data analysis.
Magic commands come in two flavors: *line magics*, which are denoted by a single ``%`` prefix and operate on a single line of input, and *cell magics*, which are denoted by a double ``%%`` prefix and operate on multiple lines of input.
We'll demonstrate and discuss a few brief examples here, and come back to more focused discussion of several useful magic commands later in the chapter.
<!-- #region -->
## Pasting Code Blocks: ``%paste`` and ``%cpaste``
When working in the IPython interpreter, one common gotcha is that pasting multi-line code blocks can lead to unexpected errors, especially when indentation and interpreter markers are involved.
A common case is that you find some example code on a website and want to paste it into your interpreter.
Consider the following simple function:
``` python
>>> def donothing(x):
... return x
```
The code is formatted as it would appear in the Python interpreter, and if you copy and paste this directly into older IPython versions, you get an error:
```ipython
In [2]: >>> def donothing(x):
...: ... return x
...:
File "<ipython-input-20-5a66c8964687>", line 2
... return x
^
SyntaxError: invalid syntax
```
In the direct paste, the interpreter is confused by the additional prompt characters.
But never fear: IPython's ``%paste`` magic function is designed to handle this exact type of multi-line, marked-up input:
```ipython
In [3]: %paste
>>> def donothing(x):
... return x
## -- End pasted text --
```
The ``%paste`` command both enters and executes the code, so now the function is ready to be used:
```ipython
In [4]: donothing(10)
Out[4]: 10
```
A command with a similar intent is ``%cpaste``, which opens up an interactive multiline prompt in which you can paste one or more chunks of code to be executed in a batch:
```ipython
In [5]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:>>> def donothing(x):
:... return x
:--
```
These magic commands, like others we'll see, make available functionality that would be difficult or impossible in a standard Python interpreter.
<!-- #endregion -->
<!-- #region -->
## Running External Code: ``%run``
As you begin developing more extensive code, you will likely find yourself working in both IPython for interactive exploration, as well as a text editor to store code that you want to reuse.
Rather than running this code in a new window, it can be convenient to run it within your IPython session.
This can be done with the ``%run`` magic.
For example, imagine you've created a ``myscript.py`` file with the following contents:
```python
#-------------------------------------
# file: myscript.py
def square(x):
"""square a number"""
return x ** 2
for N in range(1, 4):
print(f"{N} squared is {square(N)}")
```
You can execute this from your IPython session as follows:
```ipython
In [6]: %run myscript.py
1 squared is 1
2 squared is 4
3 squared is 9
```
Note also that after you've run this script, any functions defined within it are available for use in your IPython session:
```ipython
In [7]: square(5)
Out[7]: 25
```
There are several options to fine-tune how your code is run; you can see the documentation in the normal way, by typing **``%run?``** in the IPython interpreter.
<!-- #endregion -->
## Timing Code Execution: ``%timeit``
Another example of a useful magic function is ``%timeit``, which will automatically determine the execution time of the single-line Python statement that follows it.
For example, we may want to check the performance of a list comprehension:
```ipython
In [8]: %timeit L = [n ** 2 for n in range(1000)]
430 µs ± 3.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
The benefit of ``%timeit`` is that for short commands it will automatically perform multiple runs in order to attain more robust results.
For multi-line statements, adding a second ``%`` sign will turn this into a cell magic that can handle multiple lines of input.
For example, here's the equivalent construction with a ``for``-loop:
```ipython
In [9]: %%timeit
...: L = []
...: for n in range(1000):
...: L.append(n ** 2)
...:
484 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
We can immediately see that list comprehensions are about 10% faster than the equivalent ``for``-loop construction in this case.
We'll explore ``%timeit`` and other approaches to timing and profiling code in [Profiling and Timing Code](01.07-Timing-and-Profiling.ipynb).
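If you need similar measurements in a plain Python script, outside of IPython, the standard library's ``timeit`` module provides comparable functionality. Here is a minimal sketch (the run counts below are arbitrary choices for illustration, not necessarily what ``%timeit`` itself uses):
```python
import timeit
# Run the list comprehension 1,000 times per trial, repeat 7 trials,
# then report the best (lowest) time per loop in microseconds.
trials = timeit.repeat(
    stmt="L = [n ** 2 for n in range(1000)]",
    repeat=7,
    number=1000,
)
print(f"best of 7: {min(trials) / 1000 * 1e6:.1f} µs per loop")
```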
## Help on Magic Functions: ``?``, ``%magic``, and ``%lsmagic``
Like normal Python functions, IPython magic functions have docstrings, and this useful
documentation can be accessed in the standard manner.
So, for example, to read the documentation of the ``%timeit`` magic simply type this:
```ipython
In [10]: %timeit?
```
Documentation for other functions can be accessed similarly.
To access a general description of available magic functions, including some examples, you can type this:
```ipython
In [11]: %magic
```
For a quick and simple list of all available magic functions, type this:
```ipython
In [12]: %lsmagic
```
Finally, I'll mention that it is quite straightforward to define your own magic functions if you wish.
We won't discuss it here, but if you are interested, see the references listed in [More IPython Resources](01.08-More-IPython-Resources.ipynb).
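Though a full treatment is out of scope, the following minimal sketch gives the flavor of how a custom line magic can be registered. It must be run inside an IPython session, and the magic name ``reverse`` is purely illustrative:
```python
from IPython.core.magic import register_line_magic
@register_line_magic
def reverse(line):
    """A toy line magic that returns its argument reversed."""
    return line[::-1]
# After running this cell you can type:  %reverse hello world
```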

View File

@ -0,0 +1,195 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Input and Output History"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Previously we saw that the IPython shell allows you to access previous commands with the up and down arrow keys, or equivalently the Ctrl-p/Ctrl-n shortcuts.\n",
"Additionally, in both the shell and the notebook, IPython exposes several ways to obtain the output of previous commands, as well as string versions of the commands themselves.\n",
"We'll explore those here."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## IPython's ``In`` and ``Out`` Objects\n",
"\n",
"By now I imagine you're becoming familiar with the ``In [1]:``/``Out[1]:`` style prompts used by IPython.\n",
"But it turns out that these are not just pretty decoration: they give a clue as to how you can access previous inputs and outputs in your current session.\n",
"Imagine you start a session that looks like this:\n",
"\n",
"```ipython\n",
"In [1]: import math\n",
"\n",
"In [2]: math.sin(2)\n",
"Out[2]: 0.9092974268256817\n",
"\n",
"In [3]: math.cos(2)\n",
"Out[3]: -0.4161468365471424\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We've imported the built-in ``math`` package, then computed the sine and the cosine of the number 2.\n",
"These inputs and outputs are displayed in the shell with ``In``/``Out`` labels, but there's moreIPython actually creates some Python variables called ``In`` and ``Out`` that are automatically updated to reflect this history:\n",
"\n",
"```ipython\n",
"In [4]: In\n",
"Out[4]: ['', 'import math', 'math.sin(2)', 'math.cos(2)', 'In']\n",
"\n",
"In [5]: Out\n",
"Out[5]:\n",
"{2: 0.9092974268256817,\n",
" 3: -0.4161468365471424,\n",
" 4: ['', 'import math', 'math.sin(2)', 'math.cos(2)', 'In', 'Out']}\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``In`` object is a list, which keeps track of the commands in order (the first item in the list is a place-holder so that ``In[1]`` can refer to the first command):\n",
"\n",
"```ipython\n",
"In [6]: print(In[1])\n",
"import math\n",
"```\n",
"\n",
"The ``Out`` object is not a list but a dictionary mapping input numbers to their outputs (if any):\n",
"\n",
"```ipython\n",
"In [7]: print(Out[2])\n",
"0.9092974268256817\n",
"```\n",
"\n",
"Note that not all operations have outputs: for example, ``import`` statements and ``print`` statements don't affect the output.\n",
"The latter may be surprising, but makes sense if you consider that ``print`` is a function that returns ``None``; for brevity, any command that returns ``None`` is not added to ``Out``.\n",
"\n",
"Where this can be useful is if you want to interact with past results.\n",
"For example, let's check the sum of ``sin(2) ** 2`` and ``cos(2) ** 2`` using the previously-computed results:\n",
"\n",
"```ipython\n",
"In [8]: Out[2] ** 2 + Out[3] ** 2\n",
"Out[8]: 1.0\n",
"```\n",
"\n",
"The result is ``1.0`` as we'd expect from the well-known trigonometric identity.\n",
"In this case, using these previous results probably is not necessary, but it can become very handy if you execute a very expensive computation and want to reuse the result!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Underscore Shortcuts and Previous Outputs\n",
"\n",
"The standard Python shell contains just one simple shortcut for accessing previous output; the variable ``_`` (i.e., a single underscore) is kept updated with the previous output; this works in IPython as well:\n",
"\n",
"```ipython\n",
"In [9]: print(_)\n",
"1.0\n",
"```\n",
"\n",
"But IPython takes this a bit further—you can use a double underscore to access the second-to-last output, and a triple underscore to access the third-to-last output (skipping any commands with no output):\n",
"\n",
"```ipython\n",
"In [10]: print(__)\n",
"-0.4161468365471424\n",
"\n",
"In [11]: print(___)\n",
"0.9092974268256817\n",
"```\n",
"\n",
"IPython stops there: more than three underscores starts to get a bit hard to count, and at that point it's easier to refer to the output by line number.\n",
"\n",
"There is one more shortcut we should mention, howevera shorthand for ``Out[X]`` is ``_X`` (i.e., a single underscore followed by the line number):\n",
"\n",
"```ipython\n",
"In [12]: Out[2]\n",
"Out[12]: 0.9092974268256817\n",
"\n",
"In [13]: _2\n",
"Out[13]: 0.9092974268256817\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Suppressing Output\n",
"Sometimes you might wish to suppress the output of a statement (this is perhaps most common with the plotting commands that we'll explore in [Introduction to Matplotlib](04.00-Introduction-To-Matplotlib.ipynb)).\n",
"Or maybe the command you're executing produces a result that you'd prefer not like to store in your output history, perhaps so that it can be deallocated when other references are removed.\n",
"The easiest way to suppress the output of a command is to add a semicolon to the end of the line:\n",
"\n",
"```ipython\n",
"In [14]: math.sin(2) + math.cos(2);\n",
"```\n",
"\n",
"The result is computed silently, and the output is neither displayed on the screen or stored in the ``Out`` dictionary:\n",
"\n",
"```ipython\n",
"In [15]: 14 in Out\n",
"Out[15]: False\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Related Magic Commands\n",
"For accessing a batch of previous inputs at once, the ``%history`` magic command is very helpful.\n",
"Here is how you can print the first four inputs:\n",
"\n",
"```ipython\n",
"In [16]: %history -n 1-3\n",
" 1: import math\n",
" 2: math.sin(2)\n",
" 3: math.cos(2)\n",
"```\n",
"\n",
"As usual, you can type ``%history?`` for more information and a description of options available.\n",
"Other similar magic commands are ``%rerun`` (which will re-execute some portion of the command history) and ``%save`` (which saves some set of the command history to a file).\n",
"For more information, I suggest exploring these using the ``?`` help functionality discussed in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,147 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Input and Output History
Previously we saw that the IPython shell allows you to access previous commands with the up and down arrow keys, or equivalently the Ctrl-p/Ctrl-n shortcuts.
Additionally, in both the shell and the notebook, IPython exposes several ways to obtain the output of previous commands, as well as string versions of the commands themselves.
We'll explore those here.
## IPython's ``In`` and ``Out`` Objects
By now I imagine you're becoming familiar with the ``In [1]:``/``Out[1]:`` style prompts used by IPython.
But it turns out that these are not just pretty decoration: they give a clue as to how you can access previous inputs and outputs in your current session.
Imagine you start a session that looks like this:
```ipython
In [1]: import math
In [2]: math.sin(2)
Out[2]: 0.9092974268256817
In [3]: math.cos(2)
Out[3]: -0.4161468365471424
```
We've imported the built-in ``math`` package, then computed the sine and the cosine of the number 2.
These inputs and outputs are displayed in the shell with ``In``/``Out`` labels, but there's more: IPython actually creates some Python variables called ``In`` and ``Out`` that are automatically updated to reflect this history:
```ipython
In [4]: In
Out[4]: ['', 'import math', 'math.sin(2)', 'math.cos(2)', 'In']
In [5]: Out
Out[5]:
{2: 0.9092974268256817,
3: -0.4161468365471424,
4: ['', 'import math', 'math.sin(2)', 'math.cos(2)', 'In', 'Out']}
```
The ``In`` object is a list, which keeps track of the commands in order (the first item in the list is a place-holder so that ``In[1]`` can refer to the first command):
```ipython
In [6]: print(In[1])
import math
```
The ``Out`` object is not a list but a dictionary mapping input numbers to their outputs (if any):
```ipython
In [7]: print(Out[2])
0.9092974268256817
```
Note that not all operations have outputs: for example, ``import`` statements and ``print`` statements don't affect the output.
The latter may be surprising, but makes sense if you consider that ``print`` is a function that returns ``None``; for brevity, any command that returns ``None`` is not added to ``Out``.
Where this can be useful is if you want to interact with past results.
For example, let's check the sum of ``sin(2) ** 2`` and ``cos(2) ** 2`` using the previously-computed results:
```ipython
In [8]: Out[2] ** 2 + Out[3] ** 2
Out[8]: 1.0
```
The result is ``1.0`` as we'd expect from the well-known trigonometric identity.
In this case, using these previous results probably is not necessary, but it can become very handy if you execute a very expensive computation and want to reuse the result!
## Underscore Shortcuts and Previous Outputs
The standard Python shell contains just one simple shortcut for accessing previous output: the variable ``_`` (i.e., a single underscore) is kept updated with the previous output. This works in IPython as well:
```ipython
In [9]: print(_)
1.0
```
But IPython takes this a bit further—you can use a double underscore to access the second-to-last output, and a triple underscore to access the third-to-last output (skipping any commands with no output):
```ipython
In [10]: print(__)
-0.4161468365471424
In [11]: print(___)
0.9092974268256817
```
IPython stops there: more than three underscores starts to get a bit hard to count, and at that point it's easier to refer to the output by line number.
There is one more shortcut we should mention, however: a shorthand for ``Out[X]`` is ``_X`` (i.e., a single underscore followed by the line number):
```ipython
In [12]: Out[2]
Out[12]: 0.9092974268256817
In [13]: _2
Out[13]: 0.9092974268256817
```
## Suppressing Output
Sometimes you might wish to suppress the output of a statement (this is perhaps most common with the plotting commands that we'll explore in [Introduction to Matplotlib](04.00-Introduction-To-Matplotlib.ipynb)).
Or maybe the command you're executing produces a result that you'd prefer not to store in your output history, perhaps so that it can be deallocated when other references are removed.
The easiest way to suppress the output of a command is to add a semicolon to the end of the line:
```ipython
In [14]: math.sin(2) + math.cos(2);
```
The result is computed silently, and the output is neither displayed on the screen nor stored in the ``Out`` dictionary:
```ipython
In [15]: 14 in Out
Out[15]: False
```
## Related Magic Commands
For accessing a batch of previous inputs at once, the ``%history`` magic command is very helpful.
Here is how you can print the first three inputs:
```ipython
In [16]: %history -n 1-3
1: import math
2: math.sin(2)
3: math.cos(2)
```
As usual, you can type ``%history?`` for more information and a description of options available.
Other similar magic commands are ``%rerun`` (which will re-execute some portion of the command history) and ``%save`` (which saves some set of the command history to a file).
For more information, I suggest exploring these using the ``?`` help functionality discussed in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb).
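As a quick illustration, ``%save`` takes a filename followed by an input range, and ``%rerun`` takes an input range to re-execute. The following is a minimal sketch, with output omitted and an arbitrary filename (check ``%save?`` and ``%rerun?`` for the exact options):
```ipython
In [17]: %save my_session.py 1-3
In [18]: %rerun 2
```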

View File

@ -0,0 +1,226 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPython and Shell Commands"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When working interactively with the standard Python interpreter, one of the frustrations is the need to switch between multiple windows to access Python tools and system command-line tools.\n",
"IPython bridges this gap, and gives you a syntax for executing shell commands directly from within the IPython terminal.\n",
"The magic happens with the exclamation point: anything appearing after ``!`` on a line will be executed not by the Python kernel, but by the system command-line.\n",
"\n",
"The following assumes you're on a Unix-like system, such as Linux or Mac OSX.\n",
"Some of the examples that follow will fail on Windows, which uses a different type of shell by default, though if you use the *Windows Subsystem for Linux* the examples here should run correctly.\n",
"If you're unfamiliar with shell commands, I'd suggest reviewing the [Shell Tutorial](http://swcarpentry.github.io/shell-novice/) put together by the always excellent Software Carpentry Foundation."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Quick Introduction to the Shell\n",
"\n",
"A full intro to using the shell/terminal/command-line is well beyond the scope of this chapter, but for the uninitiated we will offer a quick introduction here.\n",
"The shell is a way to interact textually with your computer.\n",
"Ever since the mid 1980s, when Microsoft and Apple introduced the first versions of their now ubiquitous graphical operating systems, most computer users have interacted with their operating system through familiar clicking of menus and drag-and-drop movements.\n",
"But operating systems existed long before these graphical user interfaces, and were primarily controlled through sequences of text input: at the prompt, the user would type a command, and the computer would do what the user told it to.\n",
"Those early prompt systems are the precursors of the shells and terminals that most data scientists still use today.\n",
"\n",
"Someone unfamiliar with the shell might ask why you would bother with this, when many results can be accomplished by simply clicking on icons and menus.\n",
"A shell user might reply with another question: why hunt icons and click menus when you can accomplish things much more easily by typing?\n",
"While it might sound like a typical tech preference impasse, when moving beyond basic tasks it quickly becomes clear that the shell offers much more control of advanced tasks, though admittedly the learning curve can be intimidating.\n",
"\n",
"As an example, here is a sample of a Linux/OSX shell session where a user explores, creates, and modifies directories and files on their system (``osx:~ $`` is the prompt, and everything after the ``$`` sign is the typed command; text that is preceded by a ``#`` is meant just as description, rather than something you would actually type in):\n",
"\n",
"```bash\n",
"osx:~ $ echo \"hello world\" # echo is like Python's print function\n",
"hello world\n",
"\n",
"osx:~ $ pwd # pwd = print working directory\n",
"/home/jake # this is the \"path\" that we're sitting in\n",
"\n",
"osx:~ $ ls # ls = list working directory contents\n",
"notebooks projects \n",
"\n",
"osx:~ $ cd projects/ # cd = change directory\n",
"\n",
"osx:projects $ pwd\n",
"/home/jake/projects\n",
"\n",
"osx:projects $ ls\n",
"datasci_book mpld3 myproject.txt\n",
"\n",
"osx:projects $ mkdir myproject # mkdir = make new directory\n",
"\n",
"osx:projects $ cd myproject/\n",
"\n",
"osx:myproject $ mv ../myproject.txt ./ # mv = move file. Here we're moving the\n",
" # file myproject.txt from one directory\n",
" # up (../) to the current directory (./)\n",
"osx:myproject $ ls\n",
"myproject.txt\n",
"```\n",
"\n",
"Notice that all of this is just a compact way to do familiar operations (navigating a directory structure, creating a directory, moving a file, etc.) by typing commands rather than clicking icons and menus.\n",
"With just a few commands (``pwd``, ``ls``, ``cd``, ``mkdir``, and ``cp``) you can do many of the most common file operations.\n",
"It's when you go beyond these basics that the shell approach becomes really powerful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Shell Commands in IPython\n",
"\n",
"Any standard shell command can be used directly in IPython by prefixing it with the ``!`` character.\n",
"For example, the ``ls``, ``pwd``, and ``echo`` commands can be run as follows:\n",
"\n",
"```ipython\n",
"In [1]: !ls\n",
"myproject.txt\n",
"\n",
"In [2]: !pwd\n",
"/home/jake/projects/myproject\n",
"\n",
"In [3]: !echo \"printing from the shell\"\n",
"printing from the shell\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Passing Values to and from the Shell\n",
"\n",
"Shell commands can not only be called from IPython, but can also be made to interact with the IPython namespace.\n",
"For example, you can save the output of any shell command to a Python list using the assignment operator:\n",
"\n",
"```ipython\n",
"In [4]: contents = !ls\n",
"\n",
"In [5]: print(contents)\n",
"['myproject.txt']\n",
"\n",
"In [6]: directory = !pwd\n",
"\n",
"In [7]: print(directory)\n",
"['/Users/jakevdp/notebooks/tmp/myproject']\n",
"```\n",
"\n",
"These results are not returned as lists, but as a special shell return type defined in IPython:\n",
"\n",
"```ipython\n",
"In [8]: type(directory)\n",
"IPython.utils.text.SList\n",
"```\n",
"\n",
"This looks and acts a lot like a Python list, but has additional functionality, such as\n",
"the ``grep`` and ``fields`` methods and the ``s``, ``n``, and ``p`` properties that allow you to search, filter, and display the results in convenient ways.\n",
"For more information on these, you can use IPython's built-in help features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Communication in the other directionpassing Python variables into the shellis possible using the ``{varname}`` syntax:\n",
"\n",
"```ipython\n",
"In [9]: message = \"hello from Python\"\n",
"\n",
"In [10]: !echo {message}\n",
"hello from Python\n",
"```\n",
"\n",
"The curly braces contain the variable name, which is replaced by the variable's contents in the shell command."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Shell-Related Magic Commands\n",
"\n",
"If you play with IPython's shell commands for a while, you might notice that you cannot use ``!cd`` to navigate the filesystem:\n",
"\n",
"```ipython\n",
"In [11]: !pwd\n",
"/home/jake/projects/myproject\n",
"\n",
"In [12]: !cd ..\n",
"\n",
"In [13]: !pwd\n",
"/home/jake/projects/myproject\n",
"```\n",
"\n",
"The reason is that shell commands in the notebook are executed in a temporary subshell that does not maintain state from command to command.\n",
"If you'd like to change the working directory in a more enduring way, you can use the ``%cd`` magic command:\n",
"\n",
"```ipython\n",
"In [14]: %cd ..\n",
"/home/jake/projects\n",
"```\n",
"\n",
"In fact, by default you can even use this without the ``%`` sign:\n",
"\n",
"```ipython\n",
"In [15]: cd myproject\n",
"/home/jake/projects/myproject\n",
"```\n",
"\n",
"This is known as an ``automagic`` function, and the ability to execute such commands without an explicit `%` can be toggled with the ``%automagic`` magic function.\n",
"\n",
"Besides ``%cd``, other available shell-like magic functions are ``%cat``, ``%cp``, ``%env``, ``%ls``, ``%man``, ``%mkdir``, ``%more``, ``%mv``, ``%pwd``, ``%rm``, and ``%rmdir``, any of which can be used without the ``%`` sign if ``automagic`` is on.\n",
"This makes it so that you can almost treat the IPython prompt as if it's a normal shell:\n",
"\n",
"```ipython\n",
"In [16]: mkdir tmp\n",
"\n",
"In [17]: ls\n",
"myproject.txt tmp/\n",
"\n",
"In [18]: cp myproject.txt tmp/\n",
"\n",
"In [19]: ls tmp\n",
"myproject.txt\n",
"\n",
"In [20]: rm -r tmp\n",
"```\n",
"\n",
"This access to the shell from within the same terminal window as your Python session lets you more naturally combine Python and the shell in your workflows with fewer context switches."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,182 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# IPython and Shell Commands
When working interactively with the standard Python interpreter, one of the frustrations is the need to switch between multiple windows to access Python tools and system command-line tools.
IPython bridges this gap, and gives you a syntax for executing shell commands directly from within the IPython terminal.
The magic happens with the exclamation point: anything appearing after ``!`` on a line will be executed not by the Python kernel, but by the system command-line.
The following assumes you're on a Unix-like system, such as Linux or Mac OSX.
Some of the examples that follow will fail on Windows, which uses a different type of shell by default, though if you use the *Windows Subsystem for Linux* the examples here should run correctly.
If you're unfamiliar with shell commands, I'd suggest reviewing the [Shell Tutorial](http://swcarpentry.github.io/shell-novice/) put together by the always excellent Software Carpentry Foundation.
<!-- #region -->
## Quick Introduction to the Shell
A full intro to using the shell/terminal/command-line is well beyond the scope of this chapter, but for the uninitiated we will offer a quick introduction here.
The shell is a way to interact textually with your computer.
Ever since the mid 1980s, when Microsoft and Apple introduced the first versions of their now ubiquitous graphical operating systems, most computer users have interacted with their operating system through familiar clicking of menus and drag-and-drop movements.
But operating systems existed long before these graphical user interfaces, and were primarily controlled through sequences of text input: at the prompt, the user would type a command, and the computer would do what the user told it to.
Those early prompt systems are the precursors of the shells and terminals that most data scientists still use today.
Someone unfamiliar with the shell might ask why you would bother with this, when many results can be accomplished by simply clicking on icons and menus.
A shell user might reply with another question: why hunt icons and click menus when you can accomplish things much more easily by typing?
While it might sound like a typical tech preference impasse, when moving beyond basic tasks it quickly becomes clear that the shell offers much more control of advanced tasks, though admittedly the learning curve can be intimidating.
As an example, here is a sample of a Linux/OSX shell session where a user explores, creates, and modifies directories and files on their system (``osx:~ $`` is the prompt, and everything after the ``$`` sign is the typed command; text that is preceded by a ``#`` is meant just as description, rather than something you would actually type in):
```bash
osx:~ $ echo "hello world" # echo is like Python's print function
hello world
osx:~ $ pwd # pwd = print working directory
/home/jake # this is the "path" that we're sitting in
osx:~ $ ls # ls = list working directory contents
notebooks projects
osx:~ $ cd projects/ # cd = change directory
osx:projects $ pwd
/home/jake/projects
osx:projects $ ls
datasci_book mpld3 myproject.txt
osx:projects $ mkdir myproject # mkdir = make new directory
osx:projects $ cd myproject/
osx:myproject $ mv ../myproject.txt ./ # mv = move file. Here we're moving the
# file myproject.txt from one directory
# up (../) to the current directory (./)
osx:myproject $ ls
myproject.txt
```
Notice that all of this is just a compact way to do familiar operations (navigating a directory structure, creating a directory, moving a file, etc.) by typing commands rather than clicking icons and menus.
With just a few commands (``pwd``, ``ls``, ``cd``, ``mkdir``, ``mv``, and ``cp``) you can do many of the most common file operations.
It's when you go beyond these basics that the shell approach becomes really powerful.
<!-- #endregion -->
## Shell Commands in IPython
Any standard shell command can be used directly in IPython by prefixing it with the ``!`` character.
For example, the ``ls``, ``pwd``, and ``echo`` commands can be run as follows:
```ipython
In [1]: !ls
myproject.txt
In [2]: !pwd
/home/jake/projects/myproject
In [3]: !echo "printing from the shell"
printing from the shell
```
## Passing Values to and from the Shell
Shell commands can not only be called from IPython, but can also be made to interact with the IPython namespace.
For example, you can save the output of any shell command to a Python variable using the assignment operator:
```ipython
In [4]: contents = !ls
In [5]: print(contents)
['myproject.txt']
In [6]: directory = !pwd
In [7]: print(directory)
['/Users/jakevdp/notebooks/tmp/myproject']
```
These results are not returned as lists, but as a special shell return type defined in IPython:
```ipython
In [8]: type(directory)
IPython.utils.text.SList
```
This looks and acts a lot like a Python list, but has additional functionality, such as
the ``grep`` and ``fields`` methods and the ``s``, ``n``, and ``p`` properties that allow you to search, filter, and display the results in convenient ways.
For more information on these, you can use IPython's built-in help features.
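For instance, continuing the session above, you can filter a captured listing with ``grep``, pull out whitespace-separated fields with ``fields``, or join the entries into one shell-ready string with ``s``. This is a sketch with outputs omitted, not a verbatim session:
```ipython
files = !ls
files.grep('.txt')    # keep only entries containing '.txt'
files.fields(0)       # the first whitespace-separated field of each line
files.s               # all entries joined into a single space-separated string
```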
Communication in the other direction, passing Python variables into the shell, is possible using the ``{varname}`` syntax:
```ipython
In [9]: message = "hello from Python"
In [10]: !echo {message}
hello from Python
```
The curly braces contain the variable name, which is replaced by the variable's contents in the shell command.
## Shell-Related Magic Commands
If you play with IPython's shell commands for a while, you might notice that you cannot use ``!cd`` to navigate the filesystem:
```ipython
In [11]: !pwd
/home/jake/projects/myproject
In [12]: !cd ..
In [13]: !pwd
/home/jake/projects/myproject
```
The reason is that shell commands in the notebook are executed in a temporary subshell that does not maintain state from command to command.
If you'd like to change the working directory in a more enduring way, you can use the ``%cd`` magic command:
```ipython
In [14]: %cd ..
/home/jake/projects
```
In fact, by default you can even use this without the ``%`` sign:
```ipython
In [15]: cd myproject
/home/jake/projects/myproject
```
This is known as an ``automagic`` function, and the ability to execute such commands without an explicit `%` can be toggled with the ``%automagic`` magic function.
Besides ``%cd``, other available shell-like magic functions are ``%cat``, ``%cp``, ``%env``, ``%ls``, ``%man``, ``%mkdir``, ``%more``, ``%mv``, ``%pwd``, ``%rm``, and ``%rmdir``, any of which can be used without the ``%`` sign if ``automagic`` is on.
This makes it so that you can almost treat the IPython prompt as if it's a normal shell:
```ipython
In [16]: mkdir tmp
In [17]: ls
myproject.txt tmp/
In [18]: cp myproject.txt tmp/
In [19]: ls tmp
myproject.txt
In [20]: rm -r tmp
```
This access to the shell from within the same terminal window as your Python session lets you more naturally combine Python and the shell in your workflows with fewer context switches.

View File

@ -0,0 +1,424 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Errors and Debugging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Code development and data analysis always require a bit of trial and error, and IPython contains tools to streamline this process.\n",
"This section will briefly cover some options for controlling Python's exception reporting, followed by exploring tools for debugging errors in code."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Controlling Exceptions: ``%xmode``\n",
"\n",
"Most of the time when a Python script fails, it will raise an Exception.\n",
"When the interpreter hits one of these exceptions, information about the cause of the error can be found in the *traceback*, which can be accessed from within Python.\n",
"With the ``%xmode`` magic function, IPython allows you to control the amount of information printed when the exception is raised.\n",
"Consider the following code:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"def func1(a, b):\n",
" return a / b\n",
"\n",
"def func2(x):\n",
" a = x\n",
" b = x - 1\n",
" return func1(a, b)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"ename": "ZeroDivisionError",
"evalue": "division by zero",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mZeroDivisionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-2-b2e110f6fc8f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfunc2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m in \u001b[0;36mfunc2\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m in \u001b[0;36mfunc1\u001b[0;34m(a, b)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mZeroDivisionError\u001b[0m: division by zero"
]
}
],
"source": [
"func2(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Calling ``func2`` results in an error, and reading the printed trace lets us see exactly what happened.\n",
"In the default mode, this trace includes several lines showing the context of each step that led to the error.\n",
"Using the ``%xmode`` magic function (short for *Exception mode*), we can change what information is printed.\n",
"\n",
"``%xmode`` takes a single argument, the mode, and there are three possibilities: ``Plain``, ``Context``, and ``Verbose``.\n",
"The default is ``Context``, and gives output like that just shown before.\n",
"``Plain`` is more compact and gives less information:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exception reporting mode: Plain\n"
]
}
],
"source": [
"%xmode Plain"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"ename": "ZeroDivisionError",
"evalue": "division by zero",
"output_type": "error",
"traceback": [
"Traceback \u001b[0;36m(most recent call last)\u001b[0m:\n",
" File \u001b[1;32m\"<ipython-input-4-b2e110f6fc8f>\"\u001b[0m, line \u001b[1;32m1\u001b[0m, in \u001b[1;35m<module>\u001b[0m\n func2(1)\n",
" File \u001b[1;32m\"<ipython-input-1-d849e34d61fb>\"\u001b[0m, line \u001b[1;32m7\u001b[0m, in \u001b[1;35mfunc2\u001b[0m\n return func1(a, b)\n",
"\u001b[0;36m File \u001b[0;32m\"<ipython-input-1-d849e34d61fb>\"\u001b[0;36m, line \u001b[0;32m2\u001b[0;36m, in \u001b[0;35mfunc1\u001b[0;36m\u001b[0m\n\u001b[0;31m return a / b\u001b[0m\n",
"\u001b[0;31mZeroDivisionError\u001b[0m\u001b[0;31m:\u001b[0m division by zero\n"
]
}
],
"source": [
"func2(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``Verbose`` mode adds some extra information, including the arguments to any functions that are called:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exception reporting mode: Verbose\n"
]
}
],
"source": [
"%xmode Verbose"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"ename": "ZeroDivisionError",
"evalue": "division by zero",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mZeroDivisionError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-6-b2e110f6fc8f>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mfunc2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m \u001b[0;36mglobal\u001b[0m \u001b[0;36mfunc2\u001b[0m \u001b[0;34m= <function func2 at 0x103729320>\u001b[0m\n",
"\u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m in \u001b[0;36mfunc2\u001b[0;34m(x=1)\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 6\u001b[0m \u001b[0mb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 7\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m \u001b[0;36mglobal\u001b[0m \u001b[0;36mfunc1\u001b[0m \u001b[0;34m= <function func1 at 0x1037294d0>\u001b[0m\u001b[0;34m\n \u001b[0m\u001b[0;36ma\u001b[0m \u001b[0;34m= 1\u001b[0m\u001b[0;34m\n \u001b[0m\u001b[0;36mb\u001b[0m \u001b[0;34m= 0\u001b[0m\n",
"\u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m in \u001b[0;36mfunc1\u001b[0;34m(a=1, b=0)\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m \u001b[0;36ma\u001b[0m \u001b[0;34m= 1\u001b[0m\u001b[0;34m\n \u001b[0m\u001b[0;36mb\u001b[0m \u001b[0;34m= 0\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mfunc2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mZeroDivisionError\u001b[0m: division by zero"
]
}
],
"source": [
"func2(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This extra information can help narrow-in on why the exception is being raised.\n",
"So why not use the ``Verbose`` mode all the time?\n",
"As code gets complicated, this kind of traceback can get extremely long.\n",
"Depending on the context, sometimes the brevity of ``Plain`` or ``Context`` mode is easier to work with."
]
},
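  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you have switched modes and want the default behavior back, you can simply set the mode again:\n",
    "\n",
    "```python\n",
    "%xmode Context\n",
    "```"
   ]
  },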
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Debugging: When Reading Tracebacks Is Not Enough\n",
"\n",
"The standard Python tool for interactive debugging is ``pdb``, the Python debugger.\n",
"This debugger lets the user step through the code line by line in order to see what might be causing a more difficult error.\n",
"The IPython-enhanced version of this is ``ipdb``, the IPython debugger.\n",
"\n",
"There are many ways to launch and use both these debuggers; we won't cover them fully here.\n",
"Refer to the online documentation of these two utilities to learn more.\n",
"\n",
"In IPython, perhaps the most convenient interface to debugging is the ``%debug`` magic command.\n",
"If you call it after hitting an exception, it will automatically open an interactive debugging prompt at the point of the exception.\n",
"The ``ipdb`` prompt lets you explore the current state of the stack, explore the available variables, and even run Python commands!\n",
"\n",
"Let's look at the most recent exception, then do some basic tasksprint the values of ``a`` and ``b``, and type ``quit`` to quit the debugging session:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> \u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m(2)\u001b[0;36mfunc1\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m 1 \u001b[0;31m\u001b[0;32mdef\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m----> 2 \u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m 3 \u001b[0;31m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> print(a)\n",
"1\n",
"ipdb> print(b)\n",
"0\n",
"ipdb> quit\n"
]
}
],
"source": [
"%debug"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The interactive debugger allows much more than this, thoughwe can even step up and down through the stack and explore the values of variables there:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"> \u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m(2)\u001b[0;36mfunc1\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m 1 \u001b[0;31m\u001b[0;32mdef\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m----> 2 \u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m 3 \u001b[0;31m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> up\n",
"> \u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m(7)\u001b[0;36mfunc2\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m 5 \u001b[0;31m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m 6 \u001b[0;31m \u001b[0mb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m----> 7 \u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> print(x)\n",
"1\n",
"ipdb> up\n",
"> \u001b[0;32m<ipython-input-6-b2e110f6fc8f>\u001b[0m(1)\u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m----> 1 \u001b[0;31m\u001b[0mfunc2\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> down\n",
"> \u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m(7)\u001b[0;36mfunc2\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m 5 \u001b[0;31m \u001b[0ma\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m 6 \u001b[0;31m \u001b[0mb\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m-\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m----> 7 \u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> quit\n"
]
}
],
"source": [
"%debug"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This allows you to quickly find out not only what caused the error, but what function calls led up to the error.\n",
"\n",
"If you'd like the debugger to launch automatically whenever an exception is raised, you can use the ``%pdb`` magic function to turn on this automatic behavior:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Exception reporting mode: Plain\n",
"Automatic pdb calling has been turned ON\n"
]
},
{
"ename": "ZeroDivisionError",
"evalue": "division by zero",
"output_type": "error",
"traceback": [
"Traceback \u001b[0;36m(most recent call last)\u001b[0m:\n",
" File \u001b[1;32m\"<ipython-input-9-569a67d2d312>\"\u001b[0m, line \u001b[1;32m3\u001b[0m, in \u001b[1;35m<module>\u001b[0m\n func2(1)\n",
" File \u001b[1;32m\"<ipython-input-1-d849e34d61fb>\"\u001b[0m, line \u001b[1;32m7\u001b[0m, in \u001b[1;35mfunc2\u001b[0m\n return func1(a, b)\n",
"\u001b[0;36m File \u001b[0;32m\"<ipython-input-1-d849e34d61fb>\"\u001b[0;36m, line \u001b[0;32m2\u001b[0;36m, in \u001b[0;35mfunc1\u001b[0;36m\u001b[0m\n\u001b[0;31m return a / b\u001b[0m\n",
"\u001b[0;31mZeroDivisionError\u001b[0m\u001b[0;31m:\u001b[0m division by zero\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"> \u001b[0;32m<ipython-input-1-d849e34d61fb>\u001b[0m(2)\u001b[0;36mfunc1\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m 1 \u001b[0;31m\u001b[0;32mdef\u001b[0m \u001b[0mfunc1\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m----> 2 \u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0ma\u001b[0m \u001b[0;34m/\u001b[0m \u001b[0mb\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\u001b[0;32m 3 \u001b[0;31m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0m\n",
"ipdb> print(b)\n",
"0\n",
"ipdb> quit\n"
]
}
],
"source": [
"%xmode Plain\n",
"%pdb on\n",
"func2(1)"
]
},
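  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When you're done, the same magic turns the automatic behavior off again:\n",
    "\n",
    "```python\n",
    "%pdb off\n",
    "```"
   ]
  },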
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, if you have a script that you'd like to run from the beginning in interactive mode, you can run it with the command ``%run -d``, and use the ``next`` command to step through the lines of code interactively."
]
},
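  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For example, assuming a script named ``my_script.py`` (a hypothetical name used here for illustration), the call looks like this:\n",
    "\n",
    "```python\n",
    "%run -d my_script.py\n",
    "```"
   ]
  },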
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Partial list of debugging commands\n",
"\n",
"There are many more available commands for interactive debugging than we've listed here; the following table contains a description of some of the more common and useful ones:\n",
"\n",
"| Command | Description |\n",
"|-----------------|-------------------------------------------------------------|\n",
"| ``l(ist)`` | Show the current location in the file |\n",
"| ``h(elp)`` | Show a list of commands, or find help on a specific command |\n",
"| ``q(uit)`` | Quit the debugger and the program |\n",
"| ``c(ontinue)`` | Quit the debugger, continue in the program |\n",
"| ``n(ext)`` | Go to the next step of the program |\n",
"| ``<enter>`` | Repeat the previous command |\n",
"| ``p(rint)`` | Print variables |\n",
"| ``s(tep)`` | Step into a subroutine |\n",
"| ``r(eturn)`` | Return out of a subroutine |\n",
"\n",
"For more information, use the ``help`` command in the debugger, or take a look at ``ipdb``'s [online documentation](https://github.com/gotcha/ipdb)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,130 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Errors and Debugging
Code development and data analysis always require a bit of trial and error, and IPython contains tools to streamline this process.
This section will briefly cover some options for controlling Python's exception reporting, followed by exploring tools for debugging errors in code.
## Controlling Exceptions: ``%xmode``
Most of the time when a Python script fails, it will raise an Exception.
When the interpreter hits one of these exceptions, information about the cause of the error can be found in the *traceback*, which can be accessed from within Python.
With the ``%xmode`` magic function, IPython allows you to control the amount of information printed when the exception is raised.
Consider the following code:
```python jupyter={"outputs_hidden": false}
def func1(a, b):
return a / b
def func2(x):
a = x
b = x - 1
return func1(a, b)
```
```python jupyter={"outputs_hidden": false}
func2(1)
```
Calling ``func2`` results in an error, and reading the printed trace lets us see exactly what happened.
In the default mode, this trace includes several lines showing the context of each step that led to the error.
Using the ``%xmode`` magic function (short for *Exception mode*), we can change what information is printed.
``%xmode`` takes a single argument, the mode, and there are three possibilities: ``Plain``, ``Context``, and ``Verbose``.
The default is ``Context``, which gives output like that shown previously.
``Plain`` is more compact and gives less information:
```python jupyter={"outputs_hidden": false}
%xmode Plain
```
```python jupyter={"outputs_hidden": false}
func2(1)
```
The ``Verbose`` mode adds some extra information, including the arguments to any functions that are called:
```python jupyter={"outputs_hidden": false}
%xmode Verbose
```
```python jupyter={"outputs_hidden": false}
func2(1)
```
This extra information can help narrow in on why the exception is being raised.
So why not use the ``Verbose`` mode all the time?
As code gets complicated, this kind of traceback can get extremely long.
Depending on the context, sometimes the brevity of ``Plain`` or ``Context`` mode is easier to work with.
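If you have switched modes and want the default behavior back, you can simply set the mode again:
```python
%xmode Context
```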
## Debugging: When Reading Tracebacks Is Not Enough
The standard Python tool for interactive debugging is ``pdb``, the Python debugger.
This debugger lets the user step through the code line by line in order to see what might be causing a more difficult error.
The IPython-enhanced version of this is ``ipdb``, the IPython debugger.
There are many ways to launch and use both these debuggers; we won't cover them fully here.
Refer to the online documentation of these two utilities to learn more.
In IPython, perhaps the most convenient interface to debugging is the ``%debug`` magic command.
If you call it after hitting an exception, it will automatically open an interactive debugging prompt at the point of the exception.
The ``ipdb`` prompt lets you explore the current state of the stack, explore the available variables, and even run Python commands!
Let's look at the most recent exception, then do some basic tasks: print the values of ``a`` and ``b``, and type ``quit`` to quit the debugging session:
```python jupyter={"outputs_hidden": false}
%debug
```
The interactive debugger allows much more than this, though: we can even step up and down through the stack and explore the values of variables there:
```python jupyter={"outputs_hidden": false}
%debug
```
This allows you to quickly find out not only what caused the error, but also which function calls led up to it.
If you'd like the debugger to launch automatically whenever an exception is raised, you can use the ``%pdb`` magic function to turn on this automatic behavior:
```python jupyter={"outputs_hidden": false}
%xmode Plain
%pdb on
func2(1)
```
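When you're done, the same magic turns the automatic behavior off again:
```python
%pdb off
```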
Finally, if you have a script that you'd like to run from the beginning in interactive mode, you can run it with the command ``%run -d``, and use the ``next`` command to step through the lines of code interactively.
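For example, assuming a script named ``my_script.py`` (a hypothetical name used here for illustration), the call looks like this:
```python
%run -d my_script.py
```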
### Partial List of Debugging Commands
There are many more available commands for interactive debugging than we've listed here; the following table contains a description of some of the more common and useful ones:
| Command | Description |
|-----------------|-------------------------------------------------------------|
| ``l(ist)`` | Show the current location in the file |
| ``h(elp)`` | Show a list of commands, or find help on a specific command |
| ``q(uit)`` | Quit the debugger and the program |
| ``c(ontinue)`` | Quit the debugger, continue in the program |
| ``n(ext)`` | Go to the next step of the program |
| ``<enter>`` | Repeat the previous command |
| ``p(rint)`` | Print variables |
| ``s(tep)`` | Step into a subroutine |
| ``r(eturn)`` | Return out of a subroutine |
For more information, use the ``help`` command in the debugger, or take a look at ``ipdb``'s [online documentation](https://github.com/gotcha/ipdb).

View File

@ -0,0 +1,525 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Profiling and Timing Code"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the process of developing code and creating data processing pipelines, there are often trade-offs you can make between various implementations.\n",
"Early in developing your algorithm, it can be counterproductive to worry about such things. As Donald Knuth famously quipped, \"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.\"\n",
"\n",
"But once you have your code working, it can be useful to dig into its efficiency a bit.\n",
"Sometimes it's useful to check the execution time of a given command or set of commands; other times it's useful to examine a multiline process and determine where the bottleneck lies in some complicated series of operations.\n",
"IPython provides access to a wide array of functionality for this kind of timing and profiling of code.\n",
"Here we'll discuss the following IPython magic commands:\n",
"\n",
"- ``%time``: Time the execution of a single statement\n",
"- ``%timeit``: Time repeated execution of a single statement for more accuracy\n",
"- ``%prun``: Run code with the profiler\n",
"- ``%lprun``: Run code with the line-by-line profiler\n",
"- ``%memit``: Measure the memory use of a single statement\n",
"- ``%mprun``: Run code with the line-by-line memory profiler\n",
"\n",
"The last four commands are not bundled with IPythonyou'll need to get the ``line_profiler`` and ``memory_profiler`` extensions, which we will discuss in the following sections."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Timing Code Snippets: ``%timeit`` and ``%time``\n",
"\n",
"We saw the ``%timeit`` line-magic and ``%%timeit`` cell-magic in the introduction to magic functions in [IPython Magic Commands](01.03-Magic-Commands.ipynb); it can be used to time the repeated execution of snippets of code:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.53 µs ± 47.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)\n"
]
}
],
"source": [
"%timeit sum(range(100))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that because this operation is so fast, ``%timeit`` automatically does a large number of repetitions.\n",
"For slower commands, ``%timeit`` will automatically adjust and perform fewer repetitions:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"536 ms ± 15.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%%timeit\n",
"total = 0\n",
"for i in range(1000):\n",
" for j in range(1000):\n",
" total += i * (-1) ** j"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes repeating an operation is not the best option.\n",
"For example, if we have a list that we'd like to sort, we might be misled by a repeated operation.\n",
"Sorting a pre-sorted list is much faster than sorting an unsorted list, so the repetition will skew the result:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.71 ms ± 334 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
]
}
],
"source": [
"import random\n",
"L = [random.random() for i in range(100000)]\n",
"%timeit L.sort()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this, the ``%time`` magic function may be a better choice. It also is a good choice for longer-running commands, when short, system-related delays are unlikely to affect the result.\n",
"Let's time the sorting of an unsorted and a presorted list:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sorting an unsorted list:\n",
"CPU times: user 31.3 ms, sys: 686 µs, total: 32 ms\n",
"Wall time: 33.3 ms\n"
]
}
],
"source": [
"import random\n",
"L = [random.random() for i in range(100000)]\n",
"print(\"sorting an unsorted list:\")\n",
"%time L.sort()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"sorting an already sorted list:\n",
"CPU times: user 5.19 ms, sys: 268 µs, total: 5.46 ms\n",
"Wall time: 14.1 ms\n"
]
}
],
"source": [
"print(\"sorting an already sorted list:\")\n",
"%time L.sort()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how much faster the presorted list is to sort, but notice also how much longer the timing takes with ``%time`` versus ``%timeit``, even for the presorted list!\n",
"This is a result of the fact that ``%timeit`` does some clever things under the hood to prevent system calls from interfering with the timing.\n",
"For example, it prevents cleanup of unused Python objects (known as *garbage collection*) which might otherwise affect the timing.\n",
"For this reason, ``%timeit`` results are usually noticeably faster than ``%time`` results.\n",
"\n",
"For ``%time`` as with ``%timeit``, using the double-percent-sign cell magic syntax allows timing of multiline scripts:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 655 ms, sys: 5.68 ms, total: 661 ms\n",
"Wall time: 710 ms\n"
]
}
],
"source": [
"%%time\n",
"total = 0\n",
"for i in range(1000):\n",
" for j in range(1000):\n",
" total += i * (-1) ** j"
]
},
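  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Incidentally, the standard-library ``timeit`` module behaves like ``%timeit`` in this respect: by default it also disables garbage collection while the statement runs. Here is a minimal sketch using only the standard library:\n",
    "\n",
    "```python\n",
    "import timeit\n",
    "\n",
    "# Time 100,000 runs of the statement; garbage collection is disabled\n",
    "# by default, just as with %timeit.\n",
    "total_seconds = timeit.timeit('sum(range(100))', number=100_000)\n",
    "print(total_seconds / 100_000, 'seconds per loop')\n",
    "```"
   ]
  },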
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information on ``%time`` and ``%timeit``, as well as their available options, use the IPython help functionality (i.e., type ``%time?`` at the IPython prompt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling Full Scripts: ``%prun``\n",
"\n",
"A program is made of many single statements, and sometimes timing these statements in context is more important than timing them on their own.\n",
"Python contains a built-in code profiler (which you can read about in the Python documentation), but IPython offers a much more convenient way to use this profiler, in the form of the magic function ``%prun``.\n",
"\n",
"By way of example, we'll define a simple function that does some calculations:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def sum_of_lists(N):\n",
" total = 0\n",
" for i in range(5):\n",
" L = [j ^ (j >> i) for j in range(N)]\n",
" total += sum(L)\n",
" return total"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can call ``%prun`` with a function call to see the profiled results:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" "
]
},
{
"data": {
"text/plain": [
" 14 function calls in 0.932 seconds\n",
"\n",
" Ordered by: internal time\n",
"\n",
" ncalls tottime percall cumtime percall filename:lineno(function)\n",
" 5 0.808 0.162 0.808 0.162 <ipython-input-7-f105717832a2>:4(<listcomp>)\n",
" 5 0.066 0.013 0.066 0.013 {built-in method builtins.sum}\n",
" 1 0.044 0.044 0.918 0.918 <ipython-input-7-f105717832a2>:1(sum_of_lists)\n",
" 1 0.014 0.014 0.932 0.932 <string>:1(<module>)\n",
" 1 0.000 0.000 0.932 0.932 {built-in method builtins.exec}\n",
" 1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%prun sum_of_lists(1000000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is a table that indicates, in order of total time on each function call, where the execution is spending the most time. In this case, the bulk of execution time is in the list comprehension inside ``sum_of_lists``.\n",
"From here, we could start thinking about what changes we might make to improve the performance in the algorithm.\n",
"\n",
"For more information on ``%prun``, as well as its available options, use the IPython help functionality (i.e., type ``%prun?`` at the IPython prompt)."
]
},
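  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate the kind of change the profile suggests, here is a sketch (not necessarily an improvement) that sums a generator expression instead of building the intermediate list; re-running the profiler tells us whether the change actually helps:\n",
    "\n",
    "```python\n",
    "def sum_of_lists_gen(N):\n",
    "    # Same computation as sum_of_lists, but without materializing the list L\n",
    "    total = 0\n",
    "    for i in range(5):\n",
    "        total += sum(j ^ (j >> i) for j in range(N))\n",
    "    return total\n",
    "\n",
    "%prun sum_of_lists_gen(1000000)\n",
    "```"
   ]
  },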
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Line-By-Line Profiling with ``%lprun``\n",
"\n",
"The function-by-function profiling of ``%prun`` is useful, but sometimes it's more convenient to have a line-by-line profile report.\n",
"This is not built into Python or IPython, but there is a ``line_profiler`` package available for installation that can do this.\n",
"Start by using Python's packaging tool, ``pip``, to install the ``line_profiler`` package:\n",
"\n",
"```\n",
"$ pip install line_profiler\n",
"```\n",
"\n",
"Next, you can use IPython to load the ``line_profiler`` IPython extension, offered as part of this package:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"%load_ext line_profiler"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now the ``%lprun`` command will do a line-by-line profiling of any functionin this case, we need to tell it explicitly which functions we're interested in profiling:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Timer unit: 1e-06 s\n",
"\n",
"Total time: 0.014803 s\n",
"File: <ipython-input-7-f105717832a2>\n",
"Function: sum_of_lists at line 1\n",
"\n",
"Line # Hits Time Per Hit % Time Line Contents\n",
"==============================================================\n",
" 1 def sum_of_lists(N):\n",
" 2 1 6.0 6.0 0.0 total = 0\n",
" 3 6 13.0 2.2 0.1 for i in range(5):\n",
" 4 5 14242.0 2848.4 96.2 L = [j ^ (j >> i) for j in range(N)]\n",
" 5 5 541.0 108.2 3.7 total += sum(L)\n",
" 6 1 1.0 1.0 0.0 return total"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%lprun -f sum_of_lists sum_of_lists(5000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The information at the top gives us the key to reading the results: the time is reported in microseconds and we can see where the program is spending the most time.\n",
"At this point, we may be able to use this information to modify aspects of the script and make it perform better for our desired use case.\n",
"\n",
"For more information on ``%lprun``, as well as its available options, use the IPython help functionality (i.e., type ``%lprun?`` at the IPython prompt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Profiling Memory Use: ``%memit`` and ``%mprun``\n",
"\n",
"Another aspect of profiling is the amount of memory an operation uses.\n",
"This can be evaluated with another IPython extension, the ``memory_profiler``.\n",
"As with the ``line_profiler``, we start by ``pip``-installing the extension:\n",
"\n",
"```\n",
"$ pip install memory_profiler\n",
"```\n",
"\n",
"Then we can use IPython to load the extension:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"%load_ext memory_profiler"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The memory profiler extension contains two useful magic functions: the ``%memit`` magic (which offers a memory-measuring equivalent of ``%timeit``) and the ``%mprun`` function (which offers a memory-measuring equivalent of ``%lprun``).\n",
"The ``%memit`` function can be used rather simply:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"peak memory: 141.70 MiB, increment: 75.65 MiB\n"
]
}
],
"source": [
"%memit sum_of_lists(1000000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that this function uses about 140 MB of memory.\n",
"\n",
"For a line-by-line description of memory use, we can use the ``%mprun`` magic.\n",
"Unfortunately, this magic works only for functions defined in separate modules rather than the notebook itself, so we'll start by using the ``%%file`` magic to create a simple module called ``mprun_demo.py``, which contains our ``sum_of_lists`` function, with one addition that will make our memory profiling results more clear:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting mprun_demo.py\n"
]
}
],
"source": [
"%%file mprun_demo.py\n",
"def sum_of_lists(N):\n",
" total = 0\n",
" for i in range(5):\n",
" L = [j ^ (j >> i) for j in range(N)]\n",
" total += sum(L)\n",
" del L # remove reference to L\n",
" return total"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now import the new version of this function and run the memory line profiler:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/plain": [
"Filename: /Users/jakevdp/github/jakevdp/PythonDataScienceHandbook/notebooks_v2/mprun_demo.py\n",
"\n",
"Line # Mem usage Increment Occurences Line Contents\n",
"============================================================\n",
" 1 66.7 MiB 66.7 MiB 1 def sum_of_lists(N):\n",
" 2 66.7 MiB 0.0 MiB 1 total = 0\n",
" 3 75.1 MiB 8.4 MiB 6 for i in range(5):\n",
" 4 105.9 MiB 30.8 MiB 5000015 L = [j ^ (j >> i) for j in range(N)]\n",
" 5 109.8 MiB 3.8 MiB 5 total += sum(L)\n",
" 6 75.1 MiB -34.6 MiB 5 del L # remove reference to L\n",
" 7 66.9 MiB -8.2 MiB 1 return total"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from mprun_demo import sum_of_lists\n",
"%mprun -f sum_of_lists sum_of_lists(1000000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the ``Increment`` column tells us how much each line affects the total memory budget: observe that when we create and delete the list ``L``, we are adding about 30 MB of memory usage.\n",
"This is on top of the background memory usage from the Python interpreter itself.\n",
"\n",
"For more information on ``%memit`` and ``%mprun``, as well as their available options, use the IPython help functionality (i.e., type ``%memit?`` at the IPython prompt)."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,204 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# Profiling and Timing Code
In the process of developing code and creating data processing pipelines, there are often trade-offs you can make between various implementations.
Early in developing your algorithm, it can be counterproductive to worry about such things. As Donald Knuth famously quipped, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
But once you have your code working, it can be useful to dig into its efficiency a bit.
Sometimes it's useful to check the execution time of a given command or set of commands; other times it's useful to examine a multiline process and determine where the bottleneck lies in some complicated series of operations.
IPython provides access to a wide array of functionality for this kind of timing and profiling of code.
Here we'll discuss the following IPython magic commands:
- ``%time``: Time the execution of a single statement
- ``%timeit``: Time repeated execution of a single statement for more accuracy
- ``%prun``: Run code with the profiler
- ``%lprun``: Run code with the line-by-line profiler
- ``%memit``: Measure the memory use of a single statement
- ``%mprun``: Run code with the line-by-line memory profiler
The last four commands are not bundled with IPython; you'll need to get the ``line_profiler`` and ``memory_profiler`` extensions, which we will discuss in the following sections.
## Timing Code Snippets: ``%timeit`` and ``%time``
We saw the ``%timeit`` line-magic and ``%%timeit`` cell-magic in the introduction to magic functions in [IPython Magic Commands](01.03-Magic-Commands.ipynb); it can be used to time the repeated execution of snippets of code:
```python
%timeit sum(range(100))
```
Note that because this operation is so fast, ``%timeit`` automatically does a large number of repetitions.
For slower commands, ``%timeit`` will automatically adjust and perform fewer repetitions:
```python
%%timeit
total = 0
for i in range(1000):
for j in range(1000):
total += i * (-1) ** j
```
Sometimes repeating an operation is not the best option.
For example, if we have a list that we'd like to sort, we might be misled by a repeated operation.
Sorting a pre-sorted list is much faster than sorting an unsorted list, so the repetition will skew the result:
```python
import random
L = [random.random() for i in range(100000)]
%timeit L.sort()
```
For this, the ``%time`` magic function may be a better choice. It also is a good choice for longer-running commands, when short, system-related delays are unlikely to affect the result.
Let's time the sorting of an unsorted and a presorted list:
```python
import random
L = [random.random() for i in range(100000)]
print("sorting an unsorted list:")
%time L.sort()
```
```python
print("sorting an already sorted list:")
%time L.sort()
```
Notice how much faster the presorted list is to sort, but notice also how much longer the timing takes with ``%time`` versus ``%timeit``, even for the presorted list!
This is because ``%timeit`` does some clever things under the hood to prevent system calls from interfering with the timing.
For example, it prevents cleanup of unused Python objects (known as *garbage collection*), which might otherwise affect the timing.
For this reason, ``%timeit`` results are usually noticeably faster than ``%time`` results.
For ``%time`` as with ``%timeit``, using the double-percent-sign cell magic syntax allows timing of multiline scripts:
```python
%%time
total = 0
for i in range(1000):
for j in range(1000):
total += i * (-1) ** j
```
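Incidentally, the standard-library ``timeit`` module behaves like ``%timeit`` in this respect: by default it also disables garbage collection while the statement runs. Here is a minimal sketch using only the standard library:
```python
import timeit

# Time 100,000 runs of the statement; garbage collection is disabled
# by default, just as with %timeit.
total_seconds = timeit.timeit('sum(range(100))', number=100_000)
print(total_seconds / 100_000, 'seconds per loop')
```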
For more information on ``%time`` and ``%timeit``, as well as their available options, use the IPython help functionality (i.e., type ``%time?`` at the IPython prompt).
## Profiling Full Scripts: ``%prun``
A program is made of many single statements, and sometimes timing these statements in context is more important than timing them on their own.
Python contains a built-in code profiler (which you can read about in the Python documentation), but IPython offers a much more convenient way to use this profiler, in the form of the magic function ``%prun``.
By way of example, we'll define a simple function that does some calculations:
```python
def sum_of_lists(N):
total = 0
for i in range(5):
L = [j ^ (j >> i) for j in range(N)]
total += sum(L)
return total
```
Now we can call ``%prun`` with a function call to see the profiled results:
```python
%prun sum_of_lists(1000000)
```
The result is a table that indicates, in order of total time on each function call, where the execution is spending the most time. In this case, the bulk of execution time is in the list comprehension inside ``sum_of_lists``.
From here, we could start thinking about what changes we might make to improve the performance in the algorithm.
For more information on ``%prun``, as well as its available options, use the IPython help functionality (i.e., type ``%prun?`` at the IPython prompt).
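To illustrate the kind of change the profile suggests, here is a sketch (not necessarily an improvement) that sums a generator expression instead of building the intermediate list; re-running the profiler tells us whether the change actually helps:
```python
def sum_of_lists_gen(N):
    # Same computation as sum_of_lists, but without materializing the list L
    total = 0
    for i in range(5):
        total += sum(j ^ (j >> i) for j in range(N))
    return total

%prun sum_of_lists_gen(1000000)
```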
## Line-By-Line Profiling with ``%lprun``
The function-by-function profiling of ``%prun`` is useful, but sometimes it's more convenient to have a line-by-line profile report.
This is not built into Python or IPython, but there is a ``line_profiler`` package available for installation that can do this.
Start by using Python's packaging tool, ``pip``, to install the ``line_profiler`` package:
```
$ pip install line_profiler
```
Next, you can use IPython to load the ``line_profiler`` IPython extension, offered as part of this package:
```python
%load_ext line_profiler
```
Now the ``%lprun`` command will do a line-by-line profiling of any function; in this case, we need to tell it explicitly which functions we're interested in profiling:
```python
%lprun -f sum_of_lists sum_of_lists(5000)
```
The information at the top gives us the key to reading the results: the time is reported in microseconds and we can see where the program is spending the most time.
At this point, we may be able to use this information to modify aspects of the script and make it perform better for our desired use case.
For more information on ``%lprun``, as well as its available options, use the IPython help functionality (i.e., type ``%lprun?`` at the IPython prompt).
## Profiling Memory Use: ``%memit`` and ``%mprun``
Another aspect of profiling is the amount of memory an operation uses.
This can be evaluated with another IPython extension, the ``memory_profiler``.
As with the ``line_profiler``, we start by ``pip``-installing the extension:
```
$ pip install memory_profiler
```
Then we can use IPython to load the extension:
```python
%load_ext memory_profiler
```
The memory profiler extension contains two useful magic functions: the ``%memit`` magic (which offers a memory-measuring equivalent of ``%timeit``) and the ``%mprun`` function (which offers a memory-measuring equivalent of ``%lprun``).
The ``%memit`` function can be used rather simply:
```python
%memit sum_of_lists(1000000)
```
We see that this function uses about 140 MB of memory.
For a line-by-line description of memory use, we can use the ``%mprun`` magic.
Unfortunately, this magic works only for functions defined in separate modules rather than the notebook itself, so we'll start by using the ``%%file`` magic to create a simple module called ``mprun_demo.py``, which contains our ``sum_of_lists`` function, with one addition that will make our memory profiling results more clear:
```python
%%file mprun_demo.py
def sum_of_lists(N):
total = 0
for i in range(5):
L = [j ^ (j >> i) for j in range(N)]
total += sum(L)
del L # remove reference to L
return total
```
We can now import the new version of this function and run the memory line profiler:
```python
from mprun_demo import sum_of_lists
%mprun -f sum_of_lists sum_of_lists(1000000)
```
Here the ``Increment`` column tells us how much each line affects the total memory budget: observe that when we create and delete the list ``L``, we are adding about 30 MB of memory usage.
This is on top of the background memory usage from the Python interpreter itself.
For more information on ``%memit`` and ``%mprun``, as well as their available options, use the IPython help functionality (i.e., type ``%memit?`` at the IPython prompt).

View File

@ -0,0 +1,70 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# More IPython Resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this chapter, we've just scratched the surface of using IPython to enable data science tasks.\n",
"Much more information is available both in print and on the Web, and here we'll list some other resources that you may find helpful."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web Resources\n",
"\n",
"- [The IPython website](http://ipython.org): The IPython website links to documentation, examples, tutorials, and a variety of other resources.\n",
"- [The nbviewer website](http://nbviewer.jupyter.org/): This site shows static renderings of any IPython notebook available on the internet. The front page features some example notebooks that you can browse to see what other folks are using IPython for!\n",
"- [A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks/): This ever-growing list of notebooks, powered by nbviewer, shows the depth and breadth of numerical analysis you can do with IPython. It includes everything from short examples and tutorials to full-blown courses and books composed in the notebook format!\n",
"- Video Tutorials: searching the Internet, you will find many video-recorded tutorials on IPython. I'd especially recommend seeking tutorials from the PyCon, SciPy, and PyData conferenes by Fernando Perez and Brian Granger, two of the primary creators and maintainers of IPython and Jupyter."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Books\n",
"\n",
"- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Wes McKinney's book includes a chapter that covers using IPython as a data scientist. Although much of the material overlaps what we've discussed here, another perspective is always helpful.\n",
"- [*Learning IPython for Interactive Computing and Data Visualization*](https://www.packtpub.com/big-data-and-business-intelligence/learning-ipython-interactive-computing-and-data-visualization): This short book by Cyrille Rossant offers a good introduction to using IPython for data analysis.\n",
"- [*IPython Interactive Computing and Visualization Cookbook*](https://www.packtpub.com/big-data-and-business-intelligence/ipython-interactive-computing-and-visualization-cookbook): Also by Cyrille Rossant, this book is a longer and more advanced treatment of using IPython for data science. Despite its name, it's not just about IPythonit also goes into some depth on a broad range of data science topics.\n",
"\n",
"Finally, a reminder that you can find help on your own: IPython's ``?``-based help functionality (discussed in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)) can be useful if you use it well and use it often.\n",
"As you go through the examples here and elsewhere, this can be used to familiarize yourself with all the tools that IPython has to offer."
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,38 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
# More IPython Resources
In this chapter, we've just scratched the surface of using IPython to enable data science tasks.
Much more information is available both in print and on the Web, and here we'll list some other resources that you may find helpful.
## Web Resources
- [The IPython website](http://ipython.org): The IPython website links to documentation, examples, tutorials, and a variety of other resources.
- [The nbviewer website](http://nbviewer.jupyter.org/): This site shows static renderings of any IPython notebook available on the internet. The front page features some example notebooks that you can browse to see what other folks are using IPython for!
- [A gallery of interesting Jupyter Notebooks](https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks/): This ever-growing list of notebooks, powered by nbviewer, shows the depth and breadth of numerical analysis you can do with IPython. It includes everything from short examples and tutorials to full-blown courses and books composed in the notebook format!
- Video Tutorials: searching the Internet, you will find many video-recorded tutorials on IPython. I'd especially recommend seeking tutorials from the PyCon, SciPy, and PyData conferences by Fernando Perez and Brian Granger, two of the primary creators and maintainers of IPython and Jupyter.
## Books
- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do): Wes McKinney's book includes a chapter that covers using IPython as a data scientist. Although much of the material overlaps what we've discussed here, another perspective is always helpful.
- [*Learning IPython for Interactive Computing and Data Visualization*](https://www.packtpub.com/big-data-and-business-intelligence/learning-ipython-interactive-computing-and-data-visualization): This short book by Cyrille Rossant offers a good introduction to using IPython for data analysis.
- [*IPython Interactive Computing and Visualization Cookbook*](https://www.packtpub.com/big-data-and-business-intelligence/ipython-interactive-computing-and-visualization-cookbook): Also by Cyrille Rossant, this book is a longer and more advanced treatment of using IPython for data science. Despite its name, it's not just about IPython; it also goes into some depth on a broad range of data science topics.
Finally, a reminder that you can find help on your own: IPython's ``?``-based help functionality (discussed in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)) can be useful if you use it well and use it often.
As you go through the examples here and elsewhere, this can be used to familiarize yourself with all the tools that IPython has to offer.

View File

@ -0,0 +1,200 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# Introduction to NumPy"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"This chapter, along with chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python.\n",
"The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including be collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else.\n",
"Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.\n",
"\n",
"For example, imagesparticularly digital imagescan be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area.\n",
"Sound clips can be thought of as one-dimensional arrays of intensity versus time.\n",
"Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words.\n",
"No matter what the data are, the first step in making it analyzable will be to transform them into arrays of numbers.\n",
"(We will discuss some specific examples of this process later in [Feature Engineering](05.04-Feature-Engineering.ipynb))\n",
"\n",
"For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science.\n",
"We'll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).\n",
"\n",
"This chapter will cover NumPy in detail. NumPy (short for *Numerical Python*) provides an efficient interface to store and operate on dense data buffers.\n",
"In some ways, NumPy arrays are like Python's built-in ``list`` type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.\n",
"NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.\n",
"\n",
"If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go.\n",
"If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there.\n",
"Once you do, you can import NumPy and double-check the version:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"'1.21.2'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy\n",
"numpy.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later.\n",
"By convention, you'll find that most people in the SciPy/PyData world will import NumPy using ``np`` as an alias:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Throughout this chapter, and indeed the rest of the book, you'll find that this is the way we will import and use NumPy."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Reminder about Built In Documentation\n",
"\n",
"As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature), as well as the documentation of various functions (using the ``?`` character Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)).\n",
"\n",
"For example, to display all the contents of the numpy namespace, you can type this:\n",
"\n",
"```ipython\n",
"In [3]: np.<TAB>\n",
"```\n",
"\n",
"And to display NumPy's built-in documentation, you can use this:\n",
"\n",
"```ipython\n",
"In [4]: np?\n",
"```\n",
"\n",
"More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,104 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!-- #region deletable=true editable=true -->
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
# Introduction to NumPy
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
This chapter, along with chapter 3, outlines techniques for effectively loading, storing, and manipulating in-memory data in Python.
The topic is very broad: datasets can come from a wide range of sources and a wide range of formats, including collections of documents, collections of images, collections of sound clips, collections of numerical measurements, or nearly anything else.
Despite this apparent heterogeneity, it will help us to think of all data fundamentally as arrays of numbers.
For example, images (particularly digital images) can be thought of as simply two-dimensional arrays of numbers representing pixel brightness across the area.
Sound clips can be thought of as one-dimensional arrays of intensity versus time.
Text can be converted in various ways into numerical representations, perhaps binary digits representing the frequency of certain words or pairs of words.
No matter what the data are, the first step in making them analyzable will be to transform them into arrays of numbers.
(We will discuss some specific examples of this process later in [Feature Engineering](05.04-Feature-Engineering.ipynb).)
For this reason, efficient storage and manipulation of numerical arrays is absolutely fundamental to the process of doing data science.
We'll now take a look at the specialized tools that Python has for handling such numerical arrays: the NumPy package, and the Pandas package (discussed in Chapter 3).
This chapter will cover NumPy in detail. NumPy (short for *Numerical Python*) provides an efficient interface to store and operate on dense data buffers.
In some ways, NumPy arrays are like Python's built-in ``list`` type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.
NumPy arrays form the core of nearly the entire ecosystem of data science tools in Python, so time spent learning to use NumPy effectively will be valuable no matter what aspect of data science interests you.
If you followed the advice outlined in the Preface and installed the Anaconda stack, you already have NumPy installed and ready to go.
If you're more the do-it-yourself type, you can go to http://www.numpy.org/ and follow the installation instructions found there.
Once you do, you can import NumPy and double-check the version:
<!-- #endregion -->
```python deletable=true editable=true jupyter={"outputs_hidden": false}
import numpy
numpy.__version__
```
<!-- #region deletable=true editable=true -->
For the pieces of the package discussed here, I'd recommend NumPy version 1.8 or later.
By convention, you'll find that most people in the SciPy/PyData world will import NumPy using ``np`` as an alias:
<!-- #endregion -->
```python deletable=true editable=true jupyter={"outputs_hidden": false}
import numpy as np
```
<!-- #region deletable=true editable=true -->
Throughout this chapter, and indeed the rest of the book, you'll find that this is the way we will import and use NumPy.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
## Reminder about Built In Documentation
As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature), as well as the documentation of various functions (using the ``?`` character; refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)).
For example, to display all the contents of the numpy namespace, you can type this:
```ipython
In [3]: np.<TAB>
```
And to display NumPy's built-in documentation, you can use this:
```ipython
In [4]: np?
```
More detailed documentation, along with tutorials and other resources, can be found at http://www.numpy.org.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [More IPython Resources](01.08-More-IPython-Resources.ipynb) | [Contents](Index.ipynb) | [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.00-Introduction-to-NumPy.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->

View File

@ -0,0 +1,896 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Understanding Data Types in Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Effective data-driven science and computation requires understanding how data is stored and manipulated.\n",
"This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.\n",
"Understanding this difference is fundamental to understanding much of the material throughout the rest of the book.\n",
"\n",
"Users of Python are often drawn-in by its ease of use, one piece of which is dynamic typing.\n",
"While a statically-typed language like C or Java requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:\n",
"\n",
"```C\n",
"/* C code */\n",
"int result = 0;\n",
"for(int i=0; i<100; i++){\n",
" result += i;\n",
"}\n",
"```\n",
"\n",
"While in Python the equivalent operation could be written this way:\n",
"\n",
"```python\n",
"# Python code\n",
"result = 0\n",
"for i in range(100):\n",
" result += i\n",
"```\n",
"\n",
"Notice one main difference: in C, the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means, for example, that we can assign any kind of data to any variable:\n",
"\n",
"```python\n",
"# Python code\n",
"x = 4\n",
"x = \"four\"\n",
"```\n",
"\n",
"Here we've switched the contents of ``x`` from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintented consequences:\n",
"\n",
"```C\n",
"/* C code */\n",
"int x = 4;\n",
"x = \"four\"; // FAILS\n",
"```\n",
"\n",
"This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use.\n",
"Understanding *how* this works is an important piece of learning to analyze data efficiently and effectively with Python.\n",
"But what this type-flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We'll explore this more in the sections that follow."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Python Integer Is More Than Just an Integer\n",
"\n",
"The standard Python implementation is written in C.\n",
"This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as ``x = 10000``, ``x`` is not just a \"raw\" integer. It's actually a pointer to a compound C structure, which contains several values.\n",
"Looking through the Python 3.10 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):\n",
"\n",
"```C\n",
"struct _longobject {\n",
" long ob_refcnt;\n",
" PyTypeObject *ob_type;\n",
" size_t ob_size;\n",
" long ob_digit[1];\n",
"};\n",
"```\n",
"\n",
"A single integer in Python 3.10 actually contains four pieces:\n",
"\n",
"- ``ob_refcnt``, a reference count that helps Python silently handle memory allocation and deallocation\n",
"- ``ob_type``, which encodes the type of the variable\n",
"- ``ob_size``, which specifies the size of the following data members\n",
"- ``ob_digit``, which contains the actual integer value that we expect the Python variable to represent.\n",
"\n",
"This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Integer Memory Layout](figures/cint_vs_pyint.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``PyObject_HEAD`` is the part of the structure containing the reference count, type code, and other pieces mentioned before.\n",
"\n",
"Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value.\n",
"A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value.\n",
"This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically.\n",
"All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Python List Is More Than Just a List\n",
"\n",
"Let's consider now what happens when we use a Python data structure that holds many Python objects.\n",
"The standard mutable multi-element container in Python is the list.\n",
"We can create a list of integers as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L = list(range(10))\n",
"L"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"int"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(L[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, similarly, a list of strings:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L2 = [str(c) for c in L]\n",
"L2"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"str"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(L2[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of Python's dynamic typing, we can even create heterogeneous lists:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"[bool, str, float, int]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"L3 = [True, \"2\", 3.0, 4]\n",
"[type(item) for item in L3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other informationthat is, each item is a complete Python object.\n",
"In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.\n",
"The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Array Memory Layout](figures/array_vs_list.png)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At the implementation level, the array essentially contains a single pointer to one contiguous block of data.\n",
"The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier.\n",
"Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type.\n",
"Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fixed-Type Arrays in Python\n",
"\n",
"Python offers several different options for storing data in efficient, fixed-type data buffers.\n",
"The built-in ``array`` module (available since Python 3.3) can be used to create dense arrays of a uniform type:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array('i', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import array\n",
"L = list(range(10))\n",
"A = array.array('i', L)\n",
"A"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``'i'`` is a type code indicating the contents are integers.\n",
"\n",
"Much more useful, however, is the ``ndarray`` object of the NumPy package.\n",
"While Python's ``array`` object provides efficient storage of array-based data, NumPy adds to this efficient *operations* on that data.\n",
"We will explore these operations in later sections; here we'll demonstrate several ways of creating a NumPy array.\n",
"\n",
"We'll start with the standard NumPy import, under the alias ``np``:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Arrays from Python Lists\n",
"\n",
"First, we can use ``np.array`` to create arrays from Python lists:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 4, 2, 5, 3])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# integer array:\n",
"np.array([1, 4, 2, 5, 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type.\n",
"If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([3.14, 4. , 2. , 3. ])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array([3.14, 4, 2, 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to explicitly set the data type of the resulting array, we can use the ``dtype`` keyword:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 2., 3., 4.], dtype=float32)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.array([1, 2, 3, 4], dtype=np.float32)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[2, 3, 4],\n",
" [4, 5, 6],\n",
" [6, 7, 8]])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# nested lists result in multi-dimensional arrays\n",
"np.array([range(i, i + 3) for i in [2, 4, 6]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The inner lists are treated as rows of the resulting two-dimensional array."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Arrays from Scratch\n",
"\n",
"Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.\n",
"Here are several examples:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a length-10 integer array filled with zeros\n",
"np.zeros(10, dtype=int)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 1., 1., 1., 1.],\n",
" [1., 1., 1., 1., 1.],\n",
" [1., 1., 1., 1., 1.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x5 floating-point array filled with ones\n",
"np.ones((3, 5), dtype=float)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[3.14, 3.14, 3.14, 3.14, 3.14],\n",
" [3.14, 3.14, 3.14, 3.14, 3.14],\n",
" [3.14, 3.14, 3.14, 3.14, 3.14]])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x5 array filled with 3.14\n",
"np.full((3, 5), 3.14)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an array filled with a linear sequence\n",
"# Starting at 0, ending at 20, stepping by 2\n",
"# (this is similar to the built-in range() function)\n",
"np.arange(0, 20, 2)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([0. , 0.25, 0.5 , 0.75, 1. ])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an array of five values evenly spaced between 0 and 1\n",
"np.linspace(0, 1, 5)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.09610171, 0.88193001, 0.70548015],\n",
" [0.35885395, 0.91670468, 0.8721031 ],\n",
" [0.73237865, 0.09708562, 0.52506779]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of uniformly distributed\n",
"# pseudo-random values between 0 and 1\n",
"np.random.random((3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[-0.46652655, -0.59158776, -1.05392451],\n",
" [-1.72634268, 0.03194069, -0.51048869],\n",
" [ 1.41240208, 1.77734462, -0.43820037]])"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of normally distributed pseudo-random\n",
"# values with mean 0 and standard deviation 1\n",
"np.random.normal(0, 1, (3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[4, 3, 8],\n",
" [6, 5, 0],\n",
" [1, 1, 4]])"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 array of pseudo-random integers in the interval [0, 10)\n",
"np.random.randint(0, 10, (3, 3))"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1., 0., 0.],\n",
" [0., 1., 0.],\n",
" [0., 0., 1.]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create a 3x3 identity matrix\n",
"np.eye(3)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([1., 1., 1.])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Create an uninitialized array of three integers\n",
"# The values will be whatever happens to already exist at that memory location\n",
"np.empty(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NumPy Standard Data Types\n",
"\n",
"NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.\n",
"Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.\n",
"\n",
"The standard NumPy data types are listed in the following table.\n",
"Note that when constructing an array, they can be specified using a string:\n",
"\n",
"```python\n",
"np.zeros(10, dtype='int16')\n",
"```\n",
"\n",
"Or using the associated NumPy object:\n",
"\n",
"```python\n",
"np.zeros(10, dtype=np.int16)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"| Data type\t | Description |\n",
"|---------------|-------------|\n",
"| ``bool_`` | Boolean (True or False) stored as a byte |\n",
"| ``int_`` | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)| \n",
"| ``intc`` | Identical to C ``int`` (normally ``int32`` or ``int64``)| \n",
"| ``intp`` | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)| \n",
"| ``int8`` | Byte (-128 to 127)| \n",
"| ``int16`` | Integer (-32768 to 32767)|\n",
"| ``int32`` | Integer (-2147483648 to 2147483647)|\n",
"| ``int64`` | Integer (-9223372036854775808 to 9223372036854775807)| \n",
"| ``uint8`` | Unsigned integer (0 to 255)| \n",
"| ``uint16`` | Unsigned integer (0 to 65535)| \n",
"| ``uint32`` | Unsigned integer (0 to 4294967295)| \n",
"| ``uint64`` | Unsigned integer (0 to 18446744073709551615)| \n",
"| ``float_`` | Shorthand for ``float64``.| \n",
"| ``float16`` | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa| \n",
"| ``float32`` | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa| \n",
"| ``float64`` | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa| \n",
"| ``complex_`` | Shorthand for ``complex128``.| \n",
"| ``complex64`` | Complex number, represented by two 32-bit floats| \n",
"| ``complex128``| Complex number, represented by two 64-bit floats| "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).\n",
"NumPy also supports compound data types, which will be covered in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -0,0 +1,329 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Understanding Data Types in Python
<!-- #region -->
Effective data-driven science and computation requires understanding how data is stored and manipulated.
This section outlines and contrasts how arrays of data are handled in the Python language itself, and how NumPy improves on this.
Understanding this difference is fundamental to understanding much of the material throughout the rest of the book.
Users of Python are often drawn-in by its ease of use, one piece of which is dynamic typing.
While a statically-typed language like C or Java requires each variable to be explicitly declared, a dynamically-typed language like Python skips this specification. For example, in C you might specify a particular operation as follows:
```C
/* C code */
int result = 0;
for(int i=0; i<100; i++){
result += i;
}
```
While in Python the equivalent operation could be written this way:
```python
# Python code
result = 0
for i in range(100):
result += i
```
Notice one main difference: in C, the data types of each variable are explicitly declared, while in Python the types are dynamically inferred. This means, for example, that we can assign any kind of data to any variable:
```python
# Python code
x = 4
x = "four"
```
Here we've switched the contents of ``x`` from an integer to a string. The same thing in C would lead (depending on compiler settings) to a compilation error or other unintended consequences:
```C
/* C code */
int x = 4;
x = "four"; // FAILS
```
This sort of flexibility is one piece that makes Python and other dynamically-typed languages convenient and easy to use.
Understanding *how* this works is an important piece of learning to analyze data efficiently and effectively with Python.
But what this type-flexibility also points to is the fact that Python variables are more than just their value; they also contain extra information about the type of the value. We'll explore this more in the sections that follow.
<!-- #endregion -->
## A Python Integer Is More Than Just an Integer
The standard Python implementation is written in C.
This means that every Python object is simply a cleverly-disguised C structure, which contains not only its value, but other information as well. For example, when we define an integer in Python, such as ``x = 10000``, ``x`` is not just a "raw" integer. It's actually a pointer to a compound C structure, which contains several values.
Looking through the Python 3.10 source code, we find that the integer (long) type definition effectively looks like this (once the C macros are expanded):
```C
struct _longobject {
long ob_refcnt;
PyTypeObject *ob_type;
size_t ob_size;
long ob_digit[1];
};
```
A single integer in Python 3.10 actually contains four pieces:
- ``ob_refcnt``, a reference count that helps Python silently handle memory allocation and deallocation
- ``ob_type``, which encodes the type of the variable
- ``ob_size``, which specifies the size of the following data members
- ``ob_digit``, which contains the actual integer value that we expect the Python variable to represent.
This means that there is some overhead in storing an integer in Python as compared to an integer in a compiled language like C, as illustrated in the following figure:
![Integer Memory Layout](figures/cint_vs_pyint.png)
Here ``PyObject_HEAD`` is the part of the structure containing the reference count, type code, and other pieces mentioned before.
Notice the difference here: a C integer is essentially a label for a position in memory whose bytes encode an integer value.
A Python integer is a pointer to a position in memory containing all the Python object information, including the bytes that contain the integer value.
This extra information in the Python integer structure is what allows Python to be coded so freely and dynamically.
All this additional information in Python types comes at a cost, however, which becomes especially apparent in structures that combine many of these objects.
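As a rough illustration of this overhead, the standard library's ``sys.getsizeof`` function reports the total size of a Python object; the numbers in the comments below are typical of a 64-bit CPython build and may differ on your system:
```python
import sys

x = 10000
# Total size of the Python int object: the object header (reference count,
# type pointer, size field) plus the digits storing the value itself.
print(sys.getsizeof(x))  # typically 28 bytes on 64-bit CPython

# A C long on the same platform would occupy just 8 bytes,
# so most of the Python integer's footprint is bookkeeping.
```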
## A Python List Is More Than Just a List
Let's consider now what happens when we use a Python data structure that holds many Python objects.
The standard mutable multi-element container in Python is the list.
We can create a list of integers as follows:
```python jupyter={"outputs_hidden": false}
L = list(range(10))
L
```
```python jupyter={"outputs_hidden": false}
type(L[0])
```
Or, similarly, a list of strings:
```python jupyter={"outputs_hidden": false}
L2 = [str(c) for c in L]
L2
```
```python jupyter={"outputs_hidden": false}
type(L2[0])
```
Because of Python's dynamic typing, we can even create heterogeneous lists:
```python jupyter={"outputs_hidden": false}
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
```
But this flexibility comes at a cost: to allow these flexible types, each item in the list must contain its own type info, reference count, and other information; that is, each item is a complete Python object.
In the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.
The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:
![Array Memory Layout](figures/array_vs_list.png)
At the implementation level, the array essentially contains a single pointer to one contiguous block of data.
The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object like the Python integer we saw earlier.
Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type.
Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
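To get a rough sense of the difference, we can compare the memory footprint of a million integers stored as Python objects in a list with the same values stored in a fixed-type NumPy array; this is only a sketch, and the exact byte counts will vary by platform and Python version:
```python
import sys
import numpy as np

n = 1_000_000
L = list(range(n))
A = np.arange(n)

# The list's buffer of pointers, plus one full Python int object per element.
list_bytes = sys.getsizeof(L) + sum(sys.getsizeof(item) for item in L)

# The NumPy array stores the raw values in a single contiguous block.
array_bytes = A.nbytes

print(f"list:  ~{list_bytes / 1e6:.1f} MB")
print(f"array: ~{array_bytes / 1e6:.1f} MB")
```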
## Fixed-Type Arrays in Python
Python offers several different options for storing data in efficient, fixed-type data buffers.
The built-in ``array`` module (part of the standard library) can be used to create dense arrays of a uniform type:
```python jupyter={"outputs_hidden": false}
import array
L = list(range(10))
A = array.array('i', L)
A
```
Here ``'i'`` is a type code indicating the contents are integers.
Much more useful, however, is the ``ndarray`` object of the NumPy package.
While Python's ``array`` object provides efficient storage of array-based data, NumPy adds to this efficient *operations* on that data.
We will explore these operations in later sections; here we'll demonstrate several ways of creating a NumPy array.
We'll start with the standard NumPy import, under the alias ``np``:
```python jupyter={"outputs_hidden": false}
import numpy as np
```
## Creating Arrays from Python Lists
First, we can use ``np.array`` to create arrays from Python lists:
```python jupyter={"outputs_hidden": false}
# integer array:
np.array([1, 4, 2, 5, 3])
```
Remember that unlike Python lists, NumPy is constrained to arrays that all contain the same type.
If types do not match, NumPy will upcast if possible (here, integers are up-cast to floating point):
```python jupyter={"outputs_hidden": false}
np.array([3.14, 4, 2, 3])
```
If we want to explicitly set the data type of the resulting array, we can use the ``dtype`` keyword:
```python jupyter={"outputs_hidden": false}
np.array([1, 2, 3, 4], dtype=np.float32)
```
Finally, unlike Python lists, NumPy arrays can explicitly be multi-dimensional; here's one way of initializing a multidimensional array using a list of lists:
```python jupyter={"outputs_hidden": false}
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
```
The inner lists are treated as rows of the resulting two-dimensional array.
## Creating Arrays from Scratch
Especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy.
Here are several examples:
```python jupyter={"outputs_hidden": false}
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
```
```python jupyter={"outputs_hidden": false}
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
```
```python jupyter={"outputs_hidden": false}
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)
```
```python jupyter={"outputs_hidden": false}
# Create an array filled with a linear sequence
# Starting at 0, ending at 20, stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)
```
```python jupyter={"outputs_hidden": false}
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
```
```python jupyter={"outputs_hidden": false}
# Create a 3x3 array of uniformly distributed
# pseudo-random values between 0 and 1
np.random.random((3, 3))
```
```python jupyter={"outputs_hidden": false}
# Create a 3x3 array of normally distributed pseudo-random
# values with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
```
```python jupyter={"outputs_hidden": false}
# Create a 3x3 array of pseudo-random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))
```
```python jupyter={"outputs_hidden": false}
# Create a 3x3 identity matrix
np.eye(3)
```
```python jupyter={"outputs_hidden": false}
# Create an uninitialized array of three floating-point values
# The values will be whatever happens to already exist at that memory location
np.empty(3)
```
<!-- #region -->
## NumPy Standard Data Types
NumPy arrays contain values of a single type, so it is important to have detailed knowledge of those types and their limitations.
Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.
The standard NumPy data types are listed in the following table.
Note that when constructing an array, they can be specified using a string:
```python
np.zeros(10, dtype='int16')
```
Or using the associated NumPy object:
```python
np.zeros(10, dtype=np.int16)
```
<!-- #endregion -->
| Data type | Description |
|---------------|-------------|
| ``bool_`` | Boolean (True or False) stored as a byte |
| ``int_`` | Default integer type (same as C ``long``; normally either ``int64`` or ``int32``)|
| ``intc`` | Identical to C ``int`` (normally ``int32`` or ``int64``)|
| ``intp`` | Integer used for indexing (same as C ``ssize_t``; normally either ``int32`` or ``int64``)|
| ``int8`` | Byte (-128 to 127)|
| ``int16`` | Integer (-32768 to 32767)|
| ``int32`` | Integer (-2147483648 to 2147483647)|
| ``int64`` | Integer (-9223372036854775808 to 9223372036854775807)|
| ``uint8`` | Unsigned integer (0 to 255)|
| ``uint16`` | Unsigned integer (0 to 65535)|
| ``uint32`` | Unsigned integer (0 to 4294967295)|
| ``uint64`` | Unsigned integer (0 to 18446744073709551615)|
| ``float_`` | Shorthand for ``float64``.|
| ``float16`` | Half precision float: sign bit, 5 bits exponent, 10 bits mantissa|
| ``float32`` | Single precision float: sign bit, 8 bits exponent, 23 bits mantissa|
| ``float64`` | Double precision float: sign bit, 11 bits exponent, 52 bits mantissa|
| ``complex_`` | Shorthand for ``complex128``.|
| ``complex64`` | Complex number, represented by two 32-bit floats|
| ``complex128``| Complex number, represented by two 64-bit floats|
More advanced type specification is possible, such as specifying big or little endian numbers; for more information, refer to the [NumPy documentation](http://numpy.org/).
NumPy also supports compound data types, which will be covered in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb).
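As a small illustration of these two forms of dtype specification, together with a byte-order prefix of the kind just mentioned (the specific values here are arbitrary):
```python
import numpy as np

# Equivalent ways of requesting 16-bit integers
a = np.zeros(4, dtype='int16')
b = np.zeros(4, dtype=np.int16)
print(a.dtype, b.dtype)           # int16 int16

# A '>' prefix requests big-endian storage ('<' would request little-endian)
c = np.zeros(4, dtype='>i4')
print(c.dtype, c.dtype.itemsize)  # >i4 4
```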
<!--NAVIGATION-->
< [Introduction to NumPy](02.00-Introduction-to-NumPy.ipynb) | [Contents](Index.ipynb) | [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.01-Understanding-Data-Types.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large

View File

@ -0,0 +1,408 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) | [Contents](Index.ipynb) | [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.02-The-Basics-Of-NumPy-Arrays.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# The Basics of NumPy Arrays
Data manipulation in Python is nearly synonymous with NumPy array manipulation: even newer tools like Pandas ([Chapter 3](03.00-Introduction-to-Pandas.ipynb)) are built around the NumPy array.
This section will present several examples of using NumPy array manipulation to access data and subarrays, and to split, reshape, and join the arrays.
While the types of operations shown here may seem a bit dry and pedantic, they comprise the building blocks of many other examples used throughout the book.
Get to know them well!
We'll cover a few categories of basic array manipulations here:
- *Attributes of arrays*: Determining the size, shape, memory consumption, and data types of arrays
- *Indexing of arrays*: Getting and setting the value of individual array elements
- *Slicing of arrays*: Getting and setting smaller subarrays within a larger array
- *Reshaping of arrays*: Changing the shape of a given array
- *Joining and splitting of arrays*: Combining multiple arrays into one, and splitting one array into many
## NumPy Array Attributes
First let's discuss some useful array attributes.
We'll start by defining random arrays of one, two, and three dimensions.
We'll use NumPy's random number generator, which we will *seed* with a set value in order to ensure that the same random arrays are generated each time this code is run:
```python jupyter={"outputs_hidden": false}
import numpy as np
rng = np.random.default_rng(seed=1701) # seed for reproducibility
x1 = rng.integers(10, size=6) # One-dimensional array
x2 = rng.integers(10, size=(3, 4)) # Two-dimensional array
x3 = rng.integers(10, size=(3, 4, 5)) # Three-dimensional array
```
Each array has attributes including ``ndim`` (the number of dimensions), ``shape`` (the size of each dimension), ``size`` (the total size of the array), and ``dtype`` (the type of each element):
```python jupyter={"outputs_hidden": false}
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print("dtype: ", x3.dtype)
```
For more discussion of `dtype`, see [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb).
## Array Indexing: Accessing Single Elements
If you are familiar with Python's standard list indexing, indexing in NumPy will feel quite familiar.
In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:
```python jupyter={"outputs_hidden": false}
x1
```
```python jupyter={"outputs_hidden": false}
x1[0]
```
```python jupyter={"outputs_hidden": false}
x1[4]
```
To index from the end of the array, you can use negative indices:
```python jupyter={"outputs_hidden": false}
x1[-1]
```
```python jupyter={"outputs_hidden": false}
x1[-2]
```
In a multi-dimensional array, items can be accessed using a comma-separated `(row, column)` tuple:
```python jupyter={"outputs_hidden": false}
x2
```
```python jupyter={"outputs_hidden": false}
x2[0, 0]
```
```python jupyter={"outputs_hidden": false}
x2[2, 0]
```
```python jupyter={"outputs_hidden": false}
x2[2, -1]
```
Values can also be modified using any of the above index notation:
```python jupyter={"outputs_hidden": false}
x2[0, 0] = 12
x2
```
Keep in mind that, unlike Python lists, NumPy arrays have a fixed type.
This means, for example, that if you attempt to insert a floating-point value into an integer array, the value will be silently truncated. Don't be caught unaware by this behavior!
```python jupyter={"outputs_hidden": false}
x1[0] = 3.14159 # this will be truncated!
x1
```
## Array Slicing: Accessing Subarrays
<!-- #region -->
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the *slice* notation, marked by the colon (``:``) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array ``x``, use this:
``` python
x[start:stop:step]
```
If any of these are unspecified, they default to the values ``start=0``, ``stop=``*``size of dimension``*, ``step=1``.
We'll take a look at accessing sub-arrays in one dimension and in multiple dimensions.
<!-- #endregion -->
### One-dimensional subarrays
```python jupyter={"outputs_hidden": false}
x1
```
```python jupyter={"outputs_hidden": false}
x1[:3] # first three elements
```
```python jupyter={"outputs_hidden": false}
x1[3:] # elements after index 3
```
```python jupyter={"outputs_hidden": false}
x1[1:4] # middle sub-array
```
```python jupyter={"outputs_hidden": false}
x1[::2] # every other element
```
```python jupyter={"outputs_hidden": false}
x1[1::2] # every other element, starting at index 1
```
A potentially confusing case is when the ``step`` value is negative.
In this case, the defaults for ``start`` and ``stop`` are swapped.
This becomes a convenient way to reverse an array:
```python jupyter={"outputs_hidden": false}
x1[::-1] # all elements, reversed
```
```python jupyter={"outputs_hidden": false}
x1[4::-2] # reversed every other from index 4
```
### Multi-dimensional subarrays
Multi-dimensional slices work in the same way, with multiple slices separated by commas.
For example:
```python jupyter={"outputs_hidden": false}
x2
```
```python jupyter={"outputs_hidden": false}
x2[:2, :3] # first two rows & three columns
```
```python jupyter={"outputs_hidden": false}
x2[:3, ::2] # three rows, every other column
```
```python jupyter={"outputs_hidden": false}
x2[::-1, ::-1] # all rows & columns, reversed
```
#### Accessing array rows and columns
One commonly needed routine is accessing of single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (``:``):
```python jupyter={"outputs_hidden": false}
x2[:, 0] # first column of x2
```
```python jupyter={"outputs_hidden": false}
x2[0, :] # first row of x2
```
In the case of row access, the empty slice can be omitted for a more compact syntax:
```python jupyter={"outputs_hidden": false}
x2[0] # equivalent to x2[0, :]
```
### Subarrays as no-copy views
Unlike Python list slices, NumPy array slices are returned as *views* rather than *copies* of the array data.
Consider our two-dimensional array from before:
```python jupyter={"outputs_hidden": false}
print(x2)
```
Let's extract a $2 \times 2$ subarray from this:
```python jupyter={"outputs_hidden": false}
x2_sub = x2[:2, :2]
print(x2_sub)
```
Now if we modify this subarray, we'll see that the original array is changed! Observe:
```python jupyter={"outputs_hidden": false}
x2_sub[0, 0] = 99
print(x2_sub)
```
```python jupyter={"outputs_hidden": false}
print(x2)
```
Some users may find this surprising, but it can be advantageous: for example, when working with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.
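If you are ever unsure whether two arrays share data, ``np.shares_memory`` gives a quick check; here is a small self-contained sketch (the array values are arbitrary):
```python
import numpy as np

x2 = np.arange(12).reshape(3, 4)
x2_sub = x2[:2, :2]

# The slice is a view onto the same underlying buffer...
print(np.shares_memory(x2, x2_sub))         # True

# ...whereas an explicit copy allocates new memory.
print(np.shares_memory(x2, x2_sub.copy()))  # False
```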
### Creating copies of arrays
Despite the nice features of array views, it is sometimes useful to instead explicitly copy the data within an array or a subarray. This can be most easily done with the ``copy()`` method:
```python jupyter={"outputs_hidden": false}
x2_sub_copy = x2[:2, :2].copy()
print(x2_sub_copy)
```
If we now modify this subarray, the original array is not touched:
```python jupyter={"outputs_hidden": false}
x2_sub_copy[0, 0] = 42
print(x2_sub_copy)
```
```python jupyter={"outputs_hidden": false}
print(x2)
```
## Reshaping of Arrays
Another useful type of operation is reshaping of arrays, which can be done with the ``reshape`` method.
For example, if you want to put the numbers 1 through 9 in a $3 \times 3$ grid, you can do the following:
```python jupyter={"outputs_hidden": false}
grid = np.arange(1, 10).reshape(3, 3)
print(grid)
```
Note that for this to work, the size of the initial array must match the size of the reshaped array, and in most cases the ``reshape`` method will return a no-copy view of the initial array.
A common reshaping operation is converting a one-dimensional array into a two-dimensional row or column matrix:
```python jupyter={"outputs_hidden": false}
x = np.array([1, 2, 3])
x.reshape((1, 3)) # row vector via reshape
```
```python
x.reshape((3, 1)) # column vector via reshape
```
A convenient shorthand for this is to use `np.newaxis` within a slicing syntax:
```python jupyter={"outputs_hidden": false}
x[np.newaxis, :] # row vector via newaxis
```
```python jupyter={"outputs_hidden": false}
x[:, np.newaxis] # column vector via newaxis
```
This is a pattern that we will utilize often through the remainder of the book.
## Array Concatenation and Splitting
All of the preceding routines worked on single arrays. NumPy also provides tools to combine multiple arrays into one, and to conversely split a single array into multiple arrays.
### Concatenation of arrays
Concatenation, or joining of two arrays in NumPy, is primarily accomplished using the routines ``np.concatenate``, ``np.vstack``, and ``np.hstack``.
``np.concatenate`` takes a tuple or list of arrays as its first argument, as we can see here:
```python jupyter={"outputs_hidden": false}
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
np.concatenate([x, y])
```
You can also concatenate more than two arrays at once:
```python jupyter={"outputs_hidden": false}
z = np.array([99, 99, 99])
print(np.concatenate([x, y, z]))
```
It can also be used for two-dimensional arrays:
```python jupyter={"outputs_hidden": false}
grid = np.array([[1, 2, 3],
[4, 5, 6]])
```
```python jupyter={"outputs_hidden": false}
# concatenate along the first axis
np.concatenate([grid, grid])
```
```python jupyter={"outputs_hidden": false}
# concatenate along the second axis (zero-indexed)
np.concatenate([grid, grid], axis=1)
```
For working with arrays of mixed dimensions, it can be clearer to use the ``np.vstack`` (vertical stack) and ``np.hstack`` (horizontal stack) functions:
```python jupyter={"outputs_hidden": false}
# vertically stack the arrays
np.vstack([x, grid])
```
```python jupyter={"outputs_hidden": false}
# horizontally stack the arrays
y = np.array([[99],
[99]])
np.hstack([grid, y])
```
Similarly, for higher-dimensional arrays, ``np.dstack`` will stack arrays along the third axis.
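For example, here is a brief sketch of ``np.dstack`` applied to two small $2 \times 2$ arrays (the values are arbitrary):
```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# Stacking along the third axis produces an array of shape (2, 2, 2)
stacked = np.dstack([a, b])
print(stacked.shape)
```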
### Splitting of arrays
The opposite of concatenation is splitting, which is implemented by the functions ``np.split``, ``np.hsplit``, and ``np.vsplit``. For each of these, we can pass a list of indices giving the split points:
```python jupyter={"outputs_hidden": false}
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print(x1, x2, x3)
```
Notice that *N* split points lead to *N + 1* subarrays.
The related functions ``np.hsplit`` and ``np.vsplit`` are similar:
```python jupyter={"outputs_hidden": false}
grid = np.arange(16).reshape((4, 4))
grid
```
```python jupyter={"outputs_hidden": false}
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
```
```python jupyter={"outputs_hidden": false}
left, right = np.hsplit(grid, [2])
print(left)
print(right)
```
Similarly, for higher-dimensional arrays, ``np.dsplit`` will split arrays along the third axis.
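And a matching sketch of ``np.dsplit``, which undoes that stacking by splitting along the third axis:
```python
import numpy as np

stacked = np.dstack([np.ones((2, 2)), np.zeros((2, 2))])  # shape (2, 2, 2)

# Splitting at index 1 along the third axis gives two (2, 2, 1) arrays
first, second = np.dsplit(stacked, [1])
print(first.shape, second.shape)
```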
<!--NAVIGATION-->
< [Understanding Data Types in Python](02.01-Understanding-Data-Types.ipynb) | [Contents](Index.ipynb) | [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.02-The-Basics-Of-NumPy-Arrays.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large

View File

@ -0,0 +1,392 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) | [Contents](Index.ipynb) | [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Computation on NumPy Arrays: Universal Functions
Up until now, we have been discussing some of the basic nuts and bolts of NumPy; in the next few sections, we will dive into the reasons that NumPy is so important in the Python data science world.
Namely, it provides an easy and flexible interface to optimized computation with arrays of data.
Computation on NumPy arrays can be very fast, or it can be very slow.
The key to making it fast is to use *vectorized* operations, generally implemented through NumPy's *universal functions* (ufuncs).
This section motivates the need for NumPy's ufuncs, which can be used to make repeated calculations on array elements much more efficient.
It then introduces many of the most common and useful arithmetic ufuncs available in the NumPy package.
## The Slowness of Loops
Python's default implementation (known as CPython) does some operations very slowly.
This is in part due to the dynamic, interpreted nature of the language: the fact that types are flexible, so that sequences of operations cannot be compiled down to efficient machine code as in languages like C and Fortran.
Recently there have been various attempts to address this weakness: well-known examples are the [PyPy](http://pypy.org/) project, a just-in-time compiled implementation of Python; the [Cython](http://cython.org) project, which converts Python code to compilable C code; and the [Numba](http://numba.pydata.org/) project, which converts snippets of Python code to fast LLVM bytecode.
Each of these has its strengths and weaknesses, but it is safe to say that none of the three approaches has yet surpassed the reach and popularity of the standard CPython engine.
The relative sluggishness of Python generally manifests itself in situations where many small operations are being repeated; for instance, looping over arrays to operate on each element.
For example, imagine we have an array of values and we'd like to compute the reciprocal of each.
A straightforward approach might look like this:
```python jupyter={"outputs_hidden": false}
import numpy as np
rng = np.random.default_rng(seed=1701)
def compute_reciprocals(values):
output = np.empty(len(values))
for i in range(len(values)):
output[i] = 1.0 / values[i]
return output
values = rng.integers(1, 10, size=5)
compute_reciprocals(values)
```
This implementation probably feels fairly natural to someone from, say, a C or Java background.
But if we measure the execution time of this code for a large input, we see that this operation is very slow, perhaps surprisingly so!
We'll benchmark this with IPython's ``%timeit`` magic (discussed in [Profiling and Timing Code](01.07-Timing-and-Profiling.ipynb)):
```python jupyter={"outputs_hidden": false}
big_array = rng.integers(1, 100, size=1000000)
%timeit compute_reciprocals(big_array)
```
It takes several seconds to compute these million operations and to store the result!
When even cell phones have processing speeds measured in Giga-FLOPS (i.e., billions of numerical operations per second), this seems almost absurdly slow.
It turns out that the bottleneck here is not the operations themselves, but the type-checking and function dispatches that CPython must do at each cycle of the loop.
Each time the reciprocal is computed, Python first examines the object's type and does a dynamic lookup of the correct function to use for that type.
If we were working in compiled code instead, this type specification would be known before the code executes and the result could be computed much more efficiently.
## Introducing UFuncs
For many types of operations, NumPy provides a convenient interface into just this kind of statically typed, compiled routine. This is known as a *vectorized* operation.
For simple operations like the element-wise division here, vectorization is as simple as using Python arithmetic operators directly on the array object.
This vectorized approach is designed to push the loop into the compiled layer that underlies NumPy, leading to much faster execution.
Compare the results of the following two:
```python jupyter={"outputs_hidden": false}
print(compute_reciprocals(values))
print(1.0 / values)
```
Looking at the execution time for our big array, we see that it completes orders of magnitude faster than the Python loop:
```python jupyter={"outputs_hidden": false}
%timeit (1.0 / big_array)
```
Vectorized operations in NumPy are implemented via *ufuncs*, whose main purpose is to quickly execute repeated operations on values in NumPy arrays.
Ufuncs are extremely flexible: earlier we saw an operation between a scalar and an array, but we can also operate between two arrays:
```python jupyter={"outputs_hidden": false}
np.arange(5) / np.arange(1, 6)
```
And ufunc operations are not limited to one-dimensional arrays; they can also act on multidimensional arrays:
```python jupyter={"outputs_hidden": false}
x = np.arange(9).reshape((3, 3))
2 ** x
```
Computations using vectorization through ufuncs are nearly always more efficient than their counterparts implemented using Python loops, especially as the arrays grow in size.
Any time you see such a loop in a NumPy script, you should consider whether it can be replaced with a vectorized expression.
## Exploring NumPy's UFuncs
Ufuncs exist in two flavors: *unary ufuncs*, which operate on a single input, and *binary ufuncs*, which operate on two inputs.
We'll see examples of both these types of functions here.
### Array arithmetic
NumPy's ufuncs feel very natural to use because they make use of Python's native arithmetic operators.
The standard addition, subtraction, multiplication, and division can all be used:
```python jupyter={"outputs_hidden": false}
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
```
There is also a unary ufunc for negation, a ``**`` operator for exponentiation, and a ``%`` operator for modulus:
```python jupyter={"outputs_hidden": false}
print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
```
In addition, these can be strung together however you wish, and the standard order of operations is respected:
```python jupyter={"outputs_hidden": false}
-(0.5*x + 1) ** 2
```
Each of these arithmetic operations is simply a convenient wrapper around a specific ufunc built into NumPy; for example, the ``+`` operator is a wrapper for the ``add`` ufunc:
```python jupyter={"outputs_hidden": false}
np.add(x, 2)
```
The following table lists the arithmetic operators implemented in NumPy:
| Operator | Equivalent ufunc | Description |
|---------------|---------------------|---------------------------------------|
|``+`` |``np.add`` |Addition (e.g., ``1 + 1 = 2``) |
|``-`` |``np.subtract`` |Subtraction (e.g., ``3 - 2 = 1``) |
|``-`` |``np.negative`` |Unary negation (e.g., ``-2``) |
|``*`` |``np.multiply`` |Multiplication (e.g., ``2 * 3 = 6``) |
|``/`` |``np.divide`` |Division (e.g., ``3 / 2 = 1.5``) |
|``//`` |``np.floor_divide`` |Floor division (e.g., ``3 // 2 = 1``) |
|``**`` |``np.power`` |Exponentiation (e.g., ``2 ** 3 = 8``) |
|``%`` |``np.mod`` |Modulus/remainder (e.g., ``9 % 4 = 1``)|
Additionally there are Boolean/bitwise operators; we will explore these in [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb).
### Absolute value
Just as NumPy understands Python's built-in arithmetic operators, it also understands Python's built-in absolute value function:
```python jupyter={"outputs_hidden": false}
x = np.array([-2, -1, 0, 1, 2])
abs(x)
```
The corresponding NumPy ufunc is ``np.absolute``, which is also available under the alias ``np.abs``:
```python jupyter={"outputs_hidden": false}
np.absolute(x)
```
```python jupyter={"outputs_hidden": false}
np.abs(x)
```
This ufunc can also handle complex data, in which case the absolute value returns the magnitude:
```python jupyter={"outputs_hidden": false}
x = np.array([3 - 4j, 4 - 3j, 2 + 0j, 0 + 1j])
np.abs(x)
```
### Trigonometric functions
NumPy provides a large number of useful ufuncs, and some of the most useful for the data scientist are the trigonometric functions.
We'll start by defining an array of angles:
```python jupyter={"outputs_hidden": false}
theta = np.linspace(0, np.pi, 3)
```
Now we can compute some trigonometric functions on these values:
```python jupyter={"outputs_hidden": false}
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
```
The values are computed to within machine precision, which is why values that should be zero do not always hit exactly zero.
Inverse trigonometric functions are also available:
```python jupyter={"outputs_hidden": false}
x = [-1, 0, 1]
print("x = ", x)
print("arcsin(x) = ", np.arcsin(x))
print("arccos(x) = ", np.arccos(x))
print("arctan(x) = ", np.arctan(x))
```
### Exponents and logarithms
Another common type of operation available as a NumPy ufunc is the exponential:
```python jupyter={"outputs_hidden": false}
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3., x))
```
The inverse of the exponentials, the logarithms, are also available.
The basic ``np.log`` gives the natural logarithm; if you prefer to compute the base-2 logarithm or the base-10 logarithm, these are available as well:
```python jupyter={"outputs_hidden": false}
x = [1, 2, 4, 10]
print("x =", x)
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
```
There are also some specialized versions that are useful for maintaining precision with very small input:
```python jupyter={"outputs_hidden": false}
x = [0, 0.001, 0.01, 0.1]
print("exp(x) - 1 =", np.expm1(x))
print("log(1 + x) =", np.log1p(x))
```
When ``x`` is very small, these functions give more precise values than if the raw ``np.log`` or ``np.exp`` were to be used.
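For a concrete (if extreme) illustration, consider a value so small that ``1 + x`` rounds to exactly 1.0 in double precision:

```python jupyter={"outputs_hidden": false}
tiny = 1e-18
print(np.log(1 + tiny))   # 1 + tiny rounds to 1.0, so this gives 0.0
print(np.log1p(tiny))     # retains the answer, approximately 1e-18
```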
### Specialized ufuncs
NumPy has many more ufuncs available, including hyperbolic trig functions, bitwise arithmetic, comparison operators, conversions from radians to degrees, rounding and remainders, and much more.
A look through the NumPy documentation reveals a lot of interesting functionality.
Another excellent source for more specialized and obscure ufuncs is the submodule ``scipy.special``.
If you want to compute some obscure mathematical function on your data, chances are it is implemented in ``scipy.special``.
There are far too many functions to list them all, but the following snippet shows a couple that might come up in a statistics context:
```python jupyter={"outputs_hidden": false}
from scipy import special
```
```python jupyter={"outputs_hidden": false}
# Gamma functions (generalized factorials) and related functions
x = [1, 5, 10]
print("gamma(x) =", special.gamma(x))
print("ln|gamma(x)| =", special.gammaln(x))
print("beta(x, 2) =", special.beta(x, 2))
```
```python jupyter={"outputs_hidden": false}
# Error function (integral of Gaussian)
# its complement, and its inverse
x = np.array([0, 0.3, 0.7, 1.0])
print("erf(x) =", special.erf(x))
print("erfc(x) =", special.erfc(x))
print("erfinv(x) =", special.erfinv(x))
```
There are many, many more ufuncs available in both NumPy and ``scipy.special``.
Because the documentation of these packages is available online, a web search along the lines of "gamma function python" will generally find the relevant information.
## Advanced Ufunc Features
Many NumPy users make use of ufuncs without ever learning their full set of features.
We'll outline a few specialized features of ufuncs here.
### Specifying output
For large calculations, it is sometimes useful to be able to specify the array where the result of the calculation will be stored.
Rather than creating a temporary array, this can be used to write computation results directly to the memory location where you'd like them to be.
For all ufuncs, this can be done using the ``out`` argument of the function:
```python jupyter={"outputs_hidden": false}
x = np.arange(5)
y = np.empty(5)
np.multiply(x, 10, out=y)
print(y)
```
This can even be used with array views. For example, we can write the results of a computation to every other element of a specified array:
```python jupyter={"outputs_hidden": false}
y = np.zeros(10)
np.power(2, x, out=y[::2])
print(y)
```
If we had instead written ``y[::2] = 2 ** x``, this would have resulted in the creation of a temporary array to hold the results of ``2 ** x``, followed by a second operation copying those values into the ``y`` array.
This doesn't make much of a difference for such a small computation, but for very large arrays the memory savings from careful use of the ``out`` argument can be significant.
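For comparison, here is the temporary-array version just described; it produces the same values in ``y``, but only after allocating and copying an intermediate array:

```python jupyter={"outputs_hidden": false}
y = np.zeros(10)
y[::2] = 2 ** x    # 2 ** x is materialized as a temporary array first
print(y)
```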
### Aggregations
For binary ufuncs, there are some interesting aggregations that can be computed directly from the ufunc object.
For example, if we'd like to *reduce* an array with a particular operation, we can use the ``reduce`` method of any ufunc.
A reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
For example, calling ``reduce`` on the ``add`` ufunc returns the sum of all elements in the array:
```python jupyter={"outputs_hidden": false}
x = np.arange(1, 6)
np.add.reduce(x)
```
Similarly, calling ``reduce`` on the ``multiply`` ufunc results in the product of all array elements:
```python jupyter={"outputs_hidden": false}
np.multiply.reduce(x)
```
If we'd like to store all the intermediate results of the computation, we can instead use ``accumulate``:
```python jupyter={"outputs_hidden": false}
np.add.accumulate(x)
```
```python jupyter={"outputs_hidden": false}
np.multiply.accumulate(x)
```
Note that for these particular cases, there are dedicated NumPy functions to compute the results (``np.sum``, ``np.prod``, ``np.cumsum``, ``np.cumprod``), which we'll explore in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb).
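For example, a quick check shows that these dedicated functions agree with the corresponding ufunc methods on the array used above:

```python jupyter={"outputs_hidden": false}
print(np.sum(x), np.add.reduce(x))         # both give 15
print(np.cumsum(x), np.add.accumulate(x))  # both give [ 1  3  6 10 15]
```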
### Outer products
Finally, any ufunc can compute the output for all pairs of elements from two different inputs using the ``outer`` method.
This allows you, in one line, to do things like create a multiplication table:
```python jupyter={"outputs_hidden": false}
x = np.arange(1, 6)
np.multiply.outer(x, x)
```
The ``ufunc.at`` and ``ufunc.reduceat`` methods are useful as well, and we will explore them in [Fancy Indexing](02.07-Fancy-Indexing.ipynb).
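As a brief preview, ``reduceat`` performs a reduction within slices of an array, specified by a list of start indices:

```python jupyter={"outputs_hidden": false}
# Sum the slices [0:4], [4:6], and [6:] of np.arange(8)
np.add.reduceat(np.arange(8), [0, 4, 6])
```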
We will also encounter the ability of ufuncs to operate between arrays of different shapes and sizes, a set of operations known as *broadcasting*.
This subject is important enough that we will devote a whole section to it (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)).
## Ufuncs: Learning More
More information on universal functions (including the full list of available functions) can be found on the [NumPy](http://www.numpy.org) and [SciPy](http://www.scipy.org) documentation websites.
Recall that you can also access information directly from within IPython by importing the packages and using IPython's tab-completion and help (``?``) functionality, as described in [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb).
<!--NAVIGATION-->
< [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb) | [Contents](Index.ipynb) | [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.03-Computation-on-arrays-ufuncs.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,220 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) | [Contents](Index.ipynb) | [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.04-Computation-on-arrays-aggregates.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Aggregations: Min, Max, and Everything In Between
A first step in exploring any dataset is often to compute various summary statistics.
Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregations are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.
## Summing the Values in an Array
As a quick example, consider computing the sum of all values in an array.
Python itself can do this using the built-in ``sum`` function:
```python jupyter={"outputs_hidden": false}
import numpy as np
rng = np.random.default_rng()
```
```python jupyter={"outputs_hidden": false}
L = rng.random(100)
sum(L)
```
The syntax is quite similar to that of NumPy's ``sum`` function, and the result is the same in the simplest case:
```python jupyter={"outputs_hidden": false}
np.sum(L)
```
However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:
```python jupyter={"outputs_hidden": false}
big_array = rng.random(1000000)
%timeit sum(big_array)
%timeit np.sum(big_array)
```
Be careful, though: the ``sum`` function and the ``np.sum`` function are not identical, which can sometimes lead to confusion!
In particular, their optional arguments have different meanings, and ``np.sum`` is aware of multiple array dimensions, as we will see in the following section.
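A minimal illustration of the first point: the second positional argument is a *start value* for the built-in ``sum``, but an *axis* for ``np.sum``:

```python jupyter={"outputs_hidden": false}
print(sum(range(5), -1))     # start value of -1, so the result is 9
print(np.sum(range(5), -1))  # axis=-1, so the result is 10
```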
## Minimum and Maximum
Similarly, Python has built-in ``min`` and ``max`` functions, used to find the minimum value and maximum value of any given array:
```python jupyter={"outputs_hidden": false}
min(big_array), max(big_array)
```
NumPy's corresponding functions have similar syntax, and again operate much more quickly:
```python jupyter={"outputs_hidden": false}
np.min(big_array), np.max(big_array)
```
```python jupyter={"outputs_hidden": false}
%timeit min(big_array)
%timeit np.min(big_array)
```
For ``min``, ``max``, ``sum``, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:
```python jupyter={"outputs_hidden": false}
print(big_array.min(), big_array.max(), big_array.sum())
```
Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!
### Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column.
Say you have some data stored in a two-dimensional array:
```python jupyter={"outputs_hidden": false}
M = rng.integers(0, 10, (3, 4))
print(M)
```
By default, NumPy aggregations apply across all elements of a multidimensional array:
```python jupyter={"outputs_hidden": false}
M.sum()
```
Aggregation functions take an additional argument specifying the *axis* along which the aggregate is computed. For example, we can find the minimum value within each column by specifying ``axis=0``:
```python jupyter={"outputs_hidden": false}
M.min(axis=0)
```
The function returns four values, corresponding to the four columns of numbers.
Similarly, we can find the maximum value within each row:
```python jupyter={"outputs_hidden": false}
M.max(axis=1)
```
The way the axis is specified here can be confusing to users coming from other languages.
The ``axis`` keyword specifies the *dimension of the array that will be collapsed*, rather than the dimension that will be returned.
So specifying ``axis=0`` means that the first axis will be collapsed: for two-dimensional arrays, values within each column will be aggregated.
### Other aggregation functions
NumPy provides several other aggregation functions with a similar API, and additionally most have a ``NaN``-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point ``NaN`` value (see [Handling Missing Data](03.04-Missing-Values.ipynb)).
The following table provides a list of useful aggregation functions available in NumPy:
|Function Name | NaN-safe Version | Description |
|-------------------|---------------------|-----------------------------------------------|
| ``np.sum`` | ``np.nansum`` | Compute sum of elements |
| ``np.prod`` | ``np.nanprod`` | Compute product of elements |
| ``np.mean`` | ``np.nanmean`` | Compute mean of elements |
| ``np.std`` | ``np.nanstd`` | Compute standard deviation |
| ``np.var`` | ``np.nanvar`` | Compute variance |
| ``np.min`` | ``np.nanmin`` | Find minimum value |
| ``np.max`` | ``np.nanmax`` | Find maximum value |
| ``np.argmin`` | ``np.nanargmin`` | Find index of minimum value |
| ``np.argmax`` | ``np.nanargmax`` | Find index of maximum value |
| ``np.median`` | ``np.nanmedian`` | Compute median of elements |
| ``np.percentile`` | ``np.nanpercentile``| Compute rank-based statistics of elements |
| ``np.any`` | N/A | Evaluate whether any elements are true |
| ``np.all`` | N/A | Evaluate whether all elements are true |
We will see these aggregates often throughout the rest of the book.
## Example: What is the Average Height of US Presidents?
Aggregates available in NumPy can act as summary statistics for a set of values.
As a simple example, let's consider the heights of all US presidents.
This data is available in the file *president_heights.csv*, which is a simple comma-separated list of labels and values:
```python jupyter={"outputs_hidden": false}
!head -4 data/president_heights.csv
```
We'll use the Pandas package, which we'll explore more fully in [Chapter 3](03.00-Introduction-to-Pandas.ipynb), to read the file and extract this information (note that the heights are measured in centimeters).
```python jupyter={"outputs_hidden": false}
import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
```
Now that we have this data array, we can compute a variety of summary statistics:
```python jupyter={"outputs_hidden": false}
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
```
Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values.
We may also wish to compute quantiles:
```python jupyter={"outputs_hidden": false}
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
```
We see that the median height of US presidents is 182 cm, or just shy of six feet.
Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib (we'll discuss Matplotlib more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)). For example, this code generates the following chart:
```python jupyter={"outputs_hidden": false}
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
```
```python jupyter={"outputs_hidden": false}
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
```
<!--NAVIGATION-->
< [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) | [Contents](Index.ipynb) | [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.04-Computation-on-arrays-aggregates.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,298 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb) | [Contents](Index.ipynb) | [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.05-Computation-on-arrays-broadcasting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Computation on Arrays: Broadcasting
We saw in a previous section how NumPy's universal functions can be used to *vectorize* operations and thereby remove slow Python loops.
This section discusses *broadcasting*: a set of rules by which NumPy lets you apply binary operations (e.g., addition, subtraction, multiplication, etc.) between arrays of different sizes and shapes.
## Introducing Broadcasting
Recall that for arrays of the same size, binary operations are performed on an element-by-element basis:
```python jupyter={"outputs_hidden": false}
import numpy as np
```
```python jupyter={"outputs_hidden": false}
a = np.array([0, 1, 2])
b = np.array([5, 5, 5])
a + b
```
Broadcasting allows these types of binary operations to be performed on arrays of different sizes; for example, we can just as easily add a scalar (think of it as a zero-dimensional array) to an array:
```python jupyter={"outputs_hidden": false}
a + 5
```
We can think of this as an operation that stretches or duplicates the value ``5`` into the array ``[5, 5, 5]``, and adds the results.
The advantage of NumPy's broadcasting is that this duplication of values does not actually take place, but it is a useful mental model as we think about broadcasting.
We can similarly extend this idea to arrays of higher dimension. Observe the result when we add a one-dimensional array to a two-dimensional array:
```python jupyter={"outputs_hidden": false}
M = np.ones((3, 3))
M
```
```python jupyter={"outputs_hidden": false}
M + a
```
Here the one-dimensional array ``a`` is stretched, or broadcast, across the second dimension in order to match the shape of ``M``.
While these examples are relatively easy to understand, more complicated cases can involve broadcasting of both arrays. Consider the following example:
```python jupyter={"outputs_hidden": false}
a = np.arange(3)
b = np.arange(3)[:, np.newaxis]
print(a)
print(b)
```
```python jupyter={"outputs_hidden": false}
a + b
```
Just as before we stretched or broadcasted one value to match the shape of the other, here we've stretched *both* ``a`` and ``b`` to match a common shape, and the result is a two-dimensional array!
The geometry of these examples is visualized in the following figure (code to produce this plot can be found in the online [appendix](06.00-Figure-Code.ipynb#Broadcasting), and is adapted from a source published in the [astroML](http://astroml.org) documentation; used by permission).
![Broadcasting Visual](figures/02.05-broadcasting.png)
The light boxes represent the broadcasted values: again, this extra memory is not actually allocated in the course of the operation, but it can be useful conceptually to imagine that it is.
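If you'd like to see this for yourself, ``np.broadcast_to`` builds such a broadcast view explicitly; the leading zero in its strides shows that every row points at the same memory rather than at a copy:

```python jupyter={"outputs_hidden": false}
b = np.broadcast_to(np.arange(3), (3, 3))
print(b)
print(b.strides)  # the first stride is 0: rows share memory, nothing is duplicated
```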
## Rules of Broadcasting
Broadcasting in NumPy follows a strict set of rules to determine the interaction between the two arrays:
- Rule 1: If the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is *padded* with ones on its leading (left) side.
- Rule 2: If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- Rule 3: If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
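These rules can also be checked programmatically: ``np.broadcast`` applies them to its arguments and reports the resulting shape (or raises an error for incompatible shapes):

```python jupyter={"outputs_hidden": false}
print(np.broadcast(np.ones((2, 3)), np.arange(3)).shape)  # compatible: (2, 3)
```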
To make these rules clear, let's consider a few examples in detail.
### Broadcasting example 1
Let's look at adding a two-dimensional array to a one-dimensional array:
```python jupyter={"outputs_hidden": false}
M = np.ones((2, 3))
a = np.arange(3)
```
Let's consider an operation on these two arrays. The shapes of the arrays are:
- ``M.shape = (2, 3)``
- ``a.shape = (3,)``
We see by rule 1 that the array ``a`` has fewer dimensions, so we pad it on the left with ones:
- ``M.shape -> (2, 3)``
- ``a.shape -> (1, 3)``
By rule 2, we now see that the first dimension disagrees, so we stretch this dimension to match:
- ``M.shape -> (2, 3)``
- ``a.shape -> (2, 3)``
The shapes match, and we see that the final shape will be ``(2, 3)``:
```python jupyter={"outputs_hidden": false}
M + a
```
### Broadcasting example 2
Let's take a look at an example where both arrays need to be broadcast:
```python jupyter={"outputs_hidden": false}
a = np.arange(3).reshape((3, 1))
b = np.arange(3)
```
Again, we'll start by writing out the shape of the arrays:
- ``a.shape = (3, 1)``
- ``b.shape = (3,)``
Rule 1 says we must pad the shape of ``b`` with ones:
- ``a.shape -> (3, 1)``
- ``b.shape -> (1, 3)``
And rule 2 tells us that we upgrade each of these ones to match the corresponding size of the other array:
- ``a.shape -> (3, 3)``
- ``b.shape -> (3, 3)``
Because the result matches, these shapes are compatible. We can see this here:
```python jupyter={"outputs_hidden": false}
a + b
```
### Broadcasting example 3
Now let's take a look at an example in which the two arrays are not compatible:
```python jupyter={"outputs_hidden": false}
M = np.ones((3, 2))
a = np.arange(3)
```
This is just a slightly different situation than in the first example: the matrix ``M`` is transposed.
How does this affect the calculation? The shapes of the arrays are:
- ``M.shape = (3, 2)``
- ``a.shape = (3,)``
Again, rule 1 tells us that we must pad the shape of ``a`` with ones:
- ``M.shape -> (3, 2)``
- ``a.shape -> (1, 3)``
By rule 2, the first dimension of ``a`` is stretched to match that of ``M``:
- ``M.shape -> (3, 2)``
- ``a.shape -> (3, 3)``
Now we hit rule 3: the final shapes do not match, so these two arrays are incompatible, as we can observe by attempting this operation:
```python jupyter={"outputs_hidden": false}
M + a
```
Note the potential confusion here: you could imagine making ``a`` and ``M`` compatible by, say, padding ``a``'s shape with ones on the right rather than the left.
But this is not how the broadcasting rules work!
That sort of flexibility might be useful in some cases, but it would lead to potential areas of ambiguity.
If right-side padding is what you'd like, you can do this explicitly by reshaping the array (we'll use the ``np.newaxis`` keyword introduced in [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb)):
```python jupyter={"outputs_hidden": false}
a[:, np.newaxis].shape
```
```python jupyter={"outputs_hidden": false}
M + a[:, np.newaxis]
```
Also notice that while we've been focusing on the ``+`` operator here, these broadcasting rules apply to *any* binary ``ufunc``.
For example, here is the ``logaddexp(a, b)`` function, which computes ``log(exp(a) + exp(b))`` with more precision than the naive approach:
```python jupyter={"outputs_hidden": false}
np.logaddexp(M, a[:, np.newaxis])
```
For more information on the many available universal functions, refer to [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb).
## Broadcasting in Practice
Broadcasting operations form the core of many examples we'll see throughout this book.
We'll now take a look at a couple of simple examples of where they can be useful.
### Centering an array
In the previous section, we saw that ufuncs allow a NumPy user to remove the need to explicitly write slow Python loops. Broadcasting extends this ability.
One commonly seen example is centering an array of data.
Imagine you have an array of 10 observations, each of which consists of 3 values.
Using the standard convention (see [Data Representation in Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb#Data-Representation-in-Scikit-Learn)), we'll store this in a $10 \times 3$ array:
```python jupyter={"outputs_hidden": false}
rng = np.random.default_rng(seed=1701)
X = rng.random((10, 3))
```
We can compute the mean of each feature using the ``mean`` aggregate across the first dimension:
```python jupyter={"outputs_hidden": false}
Xmean = X.mean(0)
Xmean
```
And now we can center the ``X`` array by subtracting the mean (this is a broadcasting operation):
```python jupyter={"outputs_hidden": false}
X_centered = X - Xmean
```
To double-check that we've done this correctly, we can check that the centered array has near zero mean:
```python jupyter={"outputs_hidden": false}
X_centered.mean(0)
```
To within machine precision, the mean is now zero.
### Plotting a two-dimensional function
One place that broadcasting comes in handy is in displaying images based on two-dimensional functions.
If we want to define a function $z = f(x, y)$, broadcasting can be used to compute the function across the grid:
```python jupyter={"outputs_hidden": false}
# x and y have 50 steps from 0 to 5
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 50)[:, np.newaxis]
z = np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
```
We'll use Matplotlib to plot this two-dimensional array (these tools will be discussed in full in [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb)):
```python jupyter={"outputs_hidden": false}
%matplotlib inline
import matplotlib.pyplot as plt
```
```python jupyter={"outputs_hidden": false}
plt.imshow(z, origin='lower', extent=[0, 5, 0, 5])
plt.colorbar();
```
The result is a compelling visualization of the two-dimensional function.
<!--NAVIGATION-->
< [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb) | [Contents](Index.ipynb) | [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.05-Computation-on-arrays-broadcasting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,391 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) | [Contents](Index.ipynb) | [Fancy Indexing](02.07-Fancy-Indexing.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays.
Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold.
In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
## Example: Counting Rainy Days
Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city.
For example, here we'll load the daily rainfall statistics for the city of Seattle in 2014, using Pandas (which is covered in more detail in [Chapter 3](03.00-Introduction-to-Pandas.ipynb)):
```python
import numpy as np
import pandas as pd
# use pandas to extract rainfall inches as a NumPy array
rainfall = pd.read_csv('data/Seattle2014.csv')['PRCP'].values
inches = rainfall / 254.0 # 1/10mm -> inches
inches.shape
```
The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.
As a first quick visualization, let's look at the histogram of rainy days, which was generated using Matplotlib (we will explore this tool more fully in [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)):
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # set plot styles
```
```python
plt.hist(inches, 40);
```
This histogram gives us a general idea of what the data looks like: despite its reputation, the vast majority of days in Seattle saw near zero measured rainfall in 2014.
But this doesn't do a good job of conveying some information we'd like to see: for example, how many rainy days were there in the year? What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain?
### Digging into the data
One approach to this would be to answer these questions by hand: loop through the data, incrementing a counter each time we see values in some desired range.
For reasons discussed throughout this chapter, such an approach is very inefficient, both from the standpoint of time writing code and time computing the result.
We saw in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) that NumPy's ufuncs can be used in place of loops to do fast element-wise arithmetic operations on arrays; in the same way, we can use other ufuncs to do element-wise *comparisons* over arrays, and we can then manipulate the results to answer the questions we have.
We'll leave the data aside for right now, and discuss some general tools in NumPy to use *masking* to quickly answer these types of questions.
## Comparison Operators as ufuncs
In [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) we introduced ufuncs, and focused in particular on arithmetic operators. We saw that using ``+``, ``-``, ``*``, ``/``, and others on arrays leads to element-wise operations.
NumPy also implements comparison operators such as ``<`` (less than) and ``>`` (greater than) as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
```python
x = np.array([1, 2, 3, 4, 5])
```
```python
x < 3 # less than
```
```python
x > 3 # greater than
```
```python
x <= 3 # less than or equal
```
```python
x >= 3 # greater than or equal
```
```python
x != 3 # not equal
```
```python
x == 3 # equal
```
It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:
```python
(2 * x) == (x ** 2)
```
As in the case of arithmetic operators, the comparison operators are implemented as ufuncs in NumPy; for example, when you write ``x < 3``, internally NumPy uses ``np.less(x, 3)``.
A summary of the comparison operators and their equivalent ufunc is shown here:
| Operator | Equivalent ufunc || Operator | Equivalent ufunc |
|---------------|---------------------||---------------|---------------------|
|``==`` |``np.equal`` ||``!=`` |``np.not_equal`` |
|``<`` |``np.less`` ||``<=`` |``np.less_equal`` |
|``>`` |``np.greater`` ||``>=`` |``np.greater_equal`` |
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape.
Here is a two-dimensional example:
```python
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
```
```python
x < 6
```
In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
## Working with Boolean Arrays
Given a Boolean array, there are a host of useful operations you can do.
We'll work with ``x``, the two-dimensional array we created earlier.
```python
print(x)
```
### Counting entries
To count the number of ``True`` entries in a Boolean array, ``np.count_nonzero`` is useful:
```python
# how many values less than 6?
np.count_nonzero(x < 6)
```
We see that there are eight array entries that are less than 6.
Another way to get at this information is to use ``np.sum``; in this case, ``False`` is interpreted as ``0``, and ``True`` is interpreted as ``1``:
```python
np.sum(x < 6)
```
The benefit of ``np.sum`` is that, like other NumPy aggregation functions, this summation can be done along rows or columns as well:
```python
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
```
This counts the number of values less than 6 in each row of the matrix.
If we're interested in quickly checking whether any or all the values are true, we can use (you guessed it) ``np.any`` or ``np.all``:
```python
# are there any values greater than 8?
np.any(x > 8)
```
```python
# are there any values less than zero?
np.any(x < 0)
```
```python
# are all values less than 10?
np.all(x < 10)
```
```python
# are all values equal to 6?
np.all(x == 6)
```
``np.all`` and ``np.any`` can be used along particular axes as well. For example:
```python
# are all values in each row less than 8?
np.all(x < 8, axis=1)
```
Here all the elements in the first and third rows are less than 8, while this is not the case for the second row.
Finally, a quick warning: as mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb), Python has built-in ``sum()``, ``any()``, and ``all()`` functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using ``np.sum()``, ``np.any()``, and ``np.all()`` for these examples!
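Here is a minimal illustration of what can go wrong with the two-dimensional ``x`` used above:

```python
print(sum(x < 6))   # built-in sum adds the row arrays: per-column counts, not a total
try:
    all(x < 8)      # built-in all asks for the truth value of each row array
except ValueError as err:
    print("ValueError:", err)
```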
### Boolean operators
We've already seen how we might count, say, all days with rain less than four inches, or all days with rain greater than two inches.
But what if we want to know about all days with rain less than four inches and greater than one inch?
This is accomplished through Python's *bitwise logic operators*, ``&``, ``|``, ``^``, and ``~``.
As with the standard arithmetic operators, NumPy overloads these as ufuncs that work element-wise on (usually Boolean) arrays.
For example, we can address this sort of compound question as follows:
```python
np.sum((inches > 0.5) & (inches < 1))
```
<!-- #region -->
So we see that there are 29 days with rainfall between 0.5 and 1.0 inches.
Note that the parentheses here are importantbecause of operator precedence rules, with parentheses removed this expression would be evaluated as follows, which results in an error:
``` python
inches > (0.5 & inches) < 1
```
Using the equivalence of *A AND B* and *NOT (NOT A OR NOT B)* (which you may remember if you've taken an introductory logic course), we can compute the same result in a different manner:
<!-- #endregion -->
```python
np.sum(~( (inches <= 0.5) | (inches >= 1) ))
```
Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.
The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:
| Operator | Equivalent ufunc || Operator | Equivalent ufunc |
|---------------|---------------------||---------------|---------------------|
|``&`` |``np.bitwise_and`` ||&#124; |``np.bitwise_or`` |
|``^`` |``np.bitwise_xor`` ||``~`` |``np.bitwise_not`` |
Using these tools, we might start to answer the types of questions we have about our weather data.
Here are some examples of results we can compute when combining masking with aggregations:
```python
print("Number days without rain: ", np.sum(inches == 0))
print("Number days with rain: ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches :", np.sum((inches > 0) &
(inches < 0.2)))
```
## Boolean Arrays as Masks
In the preceding section we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our ``x`` array from before, suppose we want an array of all values in the array that are less than, say, 5:
```python
x
```
We can obtain a Boolean array for this condition easily, as we've already seen:
```python
x < 5
```
Now to *select* these values from the array, we can simply index on this Boolean array; this is known as a *masking* operation:
```python
x[x < 5]
```
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is ``True``.
We are then free to operate on these values as we wish.
For example, we can compute some relevant statistics on our Seattle rain data:
```python
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches): ",
np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches): ",
np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
np.median(inches[rainy & ~summer]))
```
By combining Boolean operations, masking operations, and aggregates, we can very quickly answer these sorts of questions for our dataset.
## Aside: Using the Keywords and/or Versus the Operators &/|
One common point of confusion is the difference between the keywords ``and`` and ``or`` on one hand, and the operators ``&`` and ``|`` on the other hand.
When would you use one versus the other?
The difference is this: ``and`` and ``or`` gauge the truth or falsehood of the *entire object*, while ``&`` and ``|`` refer to *bits within each object*.
When you use ``and`` or ``or``, it's equivalent to asking Python to treat the object as a single Boolean entity.
In Python, all nonzero integers will evaluate as True. Thus:
```python
bool(42), bool(0)
```
```python
bool(42 and 0)
```
```python
bool(42 or 0)
```
When you use ``&`` and ``|`` on integers, the expression operates on the bits of the element, applying the *and* or the *or* to the individual bits making up the number:
```python
bin(42)
```
```python
bin(59)
```
```python
bin(42 & 59)
```
```python
bin(42 | 59)
```
Notice that the corresponding bits of the binary representation are compared in order to yield the result.
When you have an array of Boolean values in NumPy, this can be thought of as a string of bits where ``1 = True`` and ``0 = False``, and the result of ``&`` and ``|`` operates similarly to above:
```python
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B
```
Using ``or`` on these arrays will try to evaluate the truth or falsehood of the entire array object, which is not a well-defined value:
```python
A or B
```
Similarly, when doing a Boolean expression on a given array, you should use ``|`` or ``&`` rather than ``or`` or ``and``:
```python
x = np.arange(10)
(x > 4) & (x < 8)
```
Trying to evaluate the truth or falsehood of the entire array will give the same ``ValueError`` we saw previously:
```python
(x > 4) and (x < 8)
```
So remember this: ``and`` and ``or`` perform a single Boolean evaluation on an entire object, while ``&`` and ``|`` perform multiple Boolean evaluations on the content (the individual bits or bytes) of an object.
For Boolean NumPy arrays, the latter is nearly always the desired operation.
<!--NAVIGATION-->
< [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb) | [Contents](Index.ipynb) | [Fancy Indexing](02.07-Fancy-Indexing.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,306 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb) | [Contents](Index.ipynb) | [Sorting Arrays](02.08-Sorting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.07-Fancy-Indexing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Fancy Indexing
In the previous sections, we saw how to access and modify portions of arrays using simple indices (e.g., ``arr[0]``), slices (e.g., ``arr[:5]``), and Boolean masks (e.g., ``arr[arr > 0]``).
In this section, we'll look at another style of array indexing, known as *fancy indexing*.
Fancy indexing is like the simple indexing we've already seen, but we pass arrays of indices in place of single scalars.
This allows us to very quickly access and modify complicated subsets of an array's values.
## Exploring Fancy Indexing
Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once.
For example, consider the following array:
```python
import numpy as np
rand = np.random.RandomState(42)
x = rand.randint(100, size=10)
print(x)
```
Suppose we want to access three different elements. We could do it like this:
```python
[x[3], x[7], x[2]]
```
Alternatively, we can pass a single list or array of indices to obtain the same result:
```python
ind = [3, 7, 2]
x[ind]
```
When using fancy indexing, the shape of the result reflects the shape of the *index arrays* rather than the shape of the *array being indexed*:
```python
ind = np.array([[3, 7],
[4, 5]])
x[ind]
```
Fancy indexing also works in multiple dimensions. Consider the following array:
```python
X = np.arange(12).reshape((3, 4))
X
```
As with standard indexing, the first index refers to the row, and the second to the column:
```python
row = np.array([0, 1, 2])
col = np.array([2, 1, 3])
X[row, col]
```
Notice that the first value in the result is ``X[0, 2]``, the second is ``X[1, 1]``, and the third is ``X[2, 3]``.
The pairing of indices in fancy indexing follows all the broadcasting rules that were mentioned in [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb).
So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result:
```python
X[row[:, np.newaxis], col]
```
Here, each row value is matched with each column vector, exactly as we saw in broadcasting of arithmetic operations.
For example:
```python
row[:, np.newaxis] * col
```
It is always important to remember with fancy indexing that the return value reflects the *broadcasted shape of the indices*, rather than the shape of the array being indexed.
## Combined Indexing
For even more powerful operations, fancy indexing can be combined with the other indexing schemes we've seen:
```python
print(X)
```
We can combine fancy and simple indices:
```python
X[2, [2, 0, 1]]
```
We can also combine fancy indexing with slicing:
```python
X[1:, [2, 0, 1]]
```
And we can combine fancy indexing with masking:
```python
mask = np.array([1, 0, 1, 0], dtype=bool)
X[row[:, np.newaxis], mask]
```
All of these indexing options combined lead to a very flexible set of operations for accessing and modifying array values.
## Example: Selecting Random Points
One common use of fancy indexing is the selection of subsets of rows from a matrix.
For example, we might have an $N$ by $D$ matrix representing $N$ points in $D$ dimensions, such as the following points drawn from a two-dimensional normal distribution:
```python
mean = [0, 0]
cov = [[1, 2],
[2, 5]]
X = rand.multivariate_normal(mean, cov, 100)
X.shape
```
Using the plotting tools we will discuss in [Introduction to Matplotlib](04.00-Introduction-To-Matplotlib.ipynb), we can visualize these points as a scatter-plot:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # for plot styling
plt.scatter(X[:, 0], X[:, 1]);
```
Let's use fancy indexing to select 20 random points. We'll do this by first choosing 20 random indices with no repeats, and use these indices to select a portion of the original array:
```python
indices = np.random.choice(X.shape[0], 20, replace=False)
indices
```
```python
selection = X[indices] # fancy indexing here
selection.shape
```
Now to see which points were selected, let's over-plot large circles at the locations of the selected points:
```python
plt.scatter(X[:, 0], X[:, 1], alpha=0.3)
plt.scatter(selection[:, 0], selection[:, 1],
facecolor='none', s=200);
```
This sort of strategy is often used to quickly partition datasets, as is often needed in train/test splitting for validation of statistical models (see [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb)), and in sampling approaches to answering statistical questions.
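As a rough sketch of that idea, a random train/test split of these 100 points can be written with nothing more than a permutation of the row indices and fancy indexing (the 80/20 split here is arbitrary):

```python
perm = np.random.permutation(X.shape[0])
train_idx, test_idx = perm[:80], perm[80:]   # illustrative 80/20 split
X_train, X_test = X[train_idx], X[test_idx]
print(X_train.shape, X_test.shape)
```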
## Modifying Values with Fancy Indexing
Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array.
For example, imagine we have an array of indices and we'd like to set the corresponding items in an array to some value:
```python
x = np.arange(10)
i = np.array([2, 1, 8, 4])
x[i] = 99
print(x)
```
We can use any assignment-type operator for this. For example:
```python
x[i] -= 10
print(x)
```
Notice, though, that repeated indices with these operations can cause some potentially unexpected results. Consider the following:
```python
x = np.zeros(10)
x[[0, 0]] = [4, 6]
print(x)
```
Where did the 4 go? The result of this operation is to first assign ``x[0] = 4``, followed by ``x[0] = 6``.
The result, of course, is that ``x[0]`` contains the value 6.
Fair enough, but consider this operation:
```python
i = [2, 3, 3, 4, 4, 4]
x[i] += 1
x
```
You might expect that ``x[3]`` would contain the value 2, and ``x[4]`` would contain the value 3, as this is how many times each index is repeated. Why is this not the case?
Conceptually, this is because ``x[i] += 1`` is meant as a shorthand for ``x[i] = x[i] + 1``: ``x[i] + 1`` is evaluated once, and then the result is assigned to the indices in ``x``.
With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results.
So what if you want the other behavior where the operation is repeated? For this, you can use the ``at()`` method of ufuncs (available since NumPy 1.8), and do the following:
```python
x = np.zeros(10)
np.add.at(x, i, 1)
print(x)
```
The ``at()`` method does an in-place application of the given operator at the specified indices (here, ``i``) with the specified value (here, 1).
Another method that is similar in spirit is the ``reduceat()`` method of ufuncs, which you can read about in the NumPy documentation.
## Example: Binning Data
You can use these ideas to efficiently bin data to create a histogram by hand.
For example, imagine we have 100 values and would like to quickly find where they fall within an array of bins.
We could compute it using ``ufunc.at`` like this:
```python
np.random.seed(42)
x = np.random.randn(100)
# compute a histogram by hand
bins = np.linspace(-5, 5, 20)
counts = np.zeros_like(bins)
# find the appropriate bin for each x
i = np.searchsorted(bins, x)
# add 1 to each of these bins
np.add.at(counts, i, 1)
```
The counts now reflect the number of points within each bin; in other words, a histogram:
```python
# plot the results
plt.plot(bins, counts, drawstyle='steps');
```
<!-- #region -->
Of course, it would be silly to have to do this each time you want to plot a histogram.
This is why Matplotlib provides the ``plt.hist()`` routine, which does the same in a single line:
```python
plt.hist(x, bins, histtype='step');
```
This function will create a nearly identical plot to the one seen here.
To compute the binning, ``matplotlib`` uses the ``np.histogram`` function, which does a very similar computation to what we did before. Let's compare the two here:
<!-- #endregion -->
```python
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)
```
Our own one-line algorithm is several times faster than the optimized algorithm in NumPy! How can this be?
If you dig into the ``np.histogram`` source code (you can do this in IPython by typing ``np.histogram??``), you'll see that it's quite a bit more involved than the simple search-and-count that we've done; this is because NumPy's algorithm is more flexible, and in particular is designed for better performance when the number of data points becomes large:
```python
x = np.random.randn(1000000)
print("NumPy routine:")
%timeit counts, edges = np.histogram(x, bins)
print("Custom routine:")
%timeit np.add.at(counts, np.searchsorted(bins, x), 1)
```
What this comparison shows is that algorithmic efficiency is almost never a simple question. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa (see [Big-O Notation](02.08-Sorting.ipynb#Aside:-Big-O-Notation)).
But the advantage of coding this algorithm yourself is that, with an understanding of these basic methods, you could use these building blocks to extend it to some very interesting custom behaviors.
The key to efficiently using Python in data-intensive applications is knowing about general convenience routines like ``np.histogram`` and when they're appropriate, but also knowing how to make use of lower-level functionality when you need more pointed behavior.
<!--NAVIGATION-->
< [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb) | [Contents](Index.ipynb) | [Sorting Arrays](02.08-Sorting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.07-Fancy-Indexing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,282 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Fancy Indexing](02.07-Fancy-Indexing.ipynb) | [Contents](Index.ipynb) | [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.08-Sorting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Sorting Arrays
Up to this point we have been concerned mainly with tools to access and operate on array data with NumPy.
This section covers algorithms related to sorting values in NumPy arrays.
These algorithms are a favorite topic in introductory computer science courses: if you've ever taken one, you probably have had dreams (or, depending on your temperament, nightmares) about *insertion sorts*, *selection sorts*, *merge sorts*, *quick sorts*, *bubble sorts*, and many, many more.
All are means of accomplishing a similar task: sorting the values in a list or array.
For example, a simple *selection sort* repeatedly finds the minimum value from a list, and makes swaps until the list is sorted. We can code this in just a few lines of Python:
```python
import numpy as np
def selection_sort(x):
    for i in range(len(x)):
        swap = i + np.argmin(x[i:])
        (x[i], x[swap]) = (x[swap], x[i])
    return x
```
```python
x = np.array([2, 1, 4, 3, 5])
selection_sort(x)
```
As any first-year computer science major will tell you, the selection sort is useful for its simplicity, but is much too slow to be useful for larger arrays.
For a list of $N$ values, it requires $N$ loops, each of which does on the order of $\sim N$ comparisons to find the swap value.
In terms of the "big-O" notation often used to characterize these algorithms (see [Big-O Notation](#Aside:-Big-O-Notation)), selection sort averages $\mathcal{O}[N^2]$: if you double the number of items in the list, the execution time will go up by about a factor of four.
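As a rough, machine-dependent check of this scaling (just a sketch; exact numbers will vary), you can time the function at two sizes and confirm that doubling $N$ roughly quadruples the runtime:
```python
from timeit import timeit

for n in (500, 1000):
    t = timeit(lambda: selection_sort(np.random.rand(n)), number=5)
    print(n, round(t, 3))
```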
Even selection sort, though, is much better than my all-time favorite sorting algorithm, the *bogosort*:
```python
def bogosort(x):
    while np.any(x[:-1] > x[1:]):
        np.random.shuffle(x)
    return x
```
```python
x = np.array([2, 1, 4, 3, 5])
bogosort(x)
```
This silly sorting method relies on pure chance: it repeatedly applies a random shuffling of the array until the result happens to be sorted.
With an average scaling of $\mathcal{O}[N \times N!]$ (that's *N* times *N* factorial), this should, quite obviously, never be used for any real computation.
Fortunately, Python contains built-in sorting algorithms that are *much* more efficient than either of the simplistic algorithms just shown. We'll start by looking at the Python built-ins, and then take a look at the routines included in NumPy and optimized for NumPy arrays.
## Fast Sorting in NumPy: ``np.sort`` and ``np.argsort``
Although Python has built-in ``sort`` and ``sorted`` functions to work with lists, we won't discuss them here because NumPy's ``np.sort`` function turns out to be much more efficient and useful for our purposes.
By default ``np.sort`` uses an $\mathcal{O}[N\log N]$ *quicksort* algorithm, though *mergesort* and *heapsort* are also available via the ``kind`` keyword. For most applications, the default quicksort is more than sufficient.
To return a sorted version of the array without modifying the input, you can use ``np.sort``:
```python
x = np.array([2, 1, 4, 3, 5])
np.sort(x)
```
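If you do want one of the alternative algorithms mentioned above, it can be selected with the ``kind`` keyword; the sorted result is the same (a quick sketch):
```python
x = np.array([2, 1, 4, 3, 5])
print(np.sort(x, kind='mergesort'))
print(np.sort(x, kind='heapsort'))
```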
If you prefer to sort the array in-place, you can instead use the ``sort`` method of arrays:
```python
x.sort()
print(x)
```
A related function is ``argsort``, which instead returns the *indices* of the sorted elements:
```python
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
print(i)
```
The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on.
These indices can then be used (via fancy indexing) to construct the sorted array if desired:
```python
x[i]
```
### Sorting along rows or columns
A useful feature of NumPy's sorting algorithms is the ability to sort along specific rows or columns of a multidimensional array using the ``axis`` argument. For example:
```python
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
print(X)
```
```python
# sort each column of X
np.sort(X, axis=0)
```
```python
# sort each row of X
np.sort(X, axis=1)
```
Keep in mind that this treats each row or column as an independent array, and any relationships between the row or column values will be lost!
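As a small sketch of what this means (using an illustrative array ``M``), sorting each row independently scrambles the original column pairings:
```python
M = np.array([[2, 9],
              [5, 1]])
# sorting each row independently breaks the vertical pairing:
# the columns were (2, 5) and (9, 1); afterward they are (2, 1) and (9, 5)
print(np.sort(M, axis=1))
```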
## Partial Sorts: Partitioning
Sometimes we're not interested in sorting the entire array, but simply want to find the *k* smallest values in the array. NumPy provides this in the ``np.partition`` function. ``np.partition`` takes an array and a number *K*; the result is a new array with the smallest *K* values to the left of the partition, and the remaining values to the right, in arbitrary order:
```python
x = np.array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3)
```
Note that the first three values in the resulting array are the three smallest in the array, and the remaining array positions contain the remaining values.
Within the two partitions, the elements have arbitrary order.
Similarly to sorting, we can partition along an arbitrary axis of a multidimensional array:
```python
np.partition(X, 2, axis=1)
```
The result is an array where the first two slots in each row contain the smallest values from that row, with the remaining values filling the remaining slots.
Finally, just as there is a ``np.argsort`` that computes indices of the sort, there is a ``np.argpartition`` that computes indices of the partition.
We'll see this in action in the following section.
## Example: k-Nearest Neighbors
Let's quickly see how we might use this ``argsort`` function along multiple axes to find the nearest neighbors of each point in a set.
We'll start by creating a random set of 10 points on a two-dimensional plane.
Using the standard convention, we'll arrange these in a $10\times 2$ array:
```python
X = rand.rand(10, 2)
```
To get an idea of how these points look, let's quickly scatter plot them:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set() # Plot styling
plt.scatter(X[:, 0], X[:, 1], s=100);
```
Now we'll compute the distance between each pair of points.
Recall that the squared-distance between two points is the sum of the squared differences in each dimension;
using the efficient broadcasting ([Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)) and aggregation ([Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb)) routines provided by NumPy, we can compute the matrix of squared distances in a single line of code:
```python
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
```
This operation has a lot packed into it, and it might be a bit confusing if you're unfamiliar with NumPy's broadcasting rules. When you come across code like this, it can be useful to break it down into its component steps:
```python
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape
```
```python
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape
```
```python
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape
```
Just to double-check what we are doing, we should see that the diagonal of this matrix (i.e., the set of distances between each point and itself) is all zero:
```python
dist_sq.diagonal()
```
It checks out!
With the pairwise squared distances computed, we can now use ``np.argsort`` to sort along each row. The leftmost columns will then give the indices of the nearest neighbors:
```python
nearest = np.argsort(dist_sq, axis=1)
print(nearest)
```
Notice that the first column gives the numbers 0 through 9 in order: this is due to the fact that each point's closest neighbor is itself, as we would expect.
By using a full sort here, we've actually done more work than we need to in this case. If we're simply interested in the nearest $k$ neighbors, all we need is to partition each row so that the smallest $k + 1$ squared distances come first, with larger distances filling the remaining positions of the array. We can do this with the ``np.argpartition`` function:
```python
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
```
In order to visualize this network of neighbors, let's quickly plot the points along with lines representing the connections from each point to its two nearest neighbors:
```python
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K = 2
for i in range(X.shape[0]):
    for j in nearest_partition[i, :K+1]:
        # plot a line from X[i] to X[j]
        # use some zip magic to make it happen:
        plt.plot(*zip(X[j], X[i]), color='black')
```
Each point in the plot has lines drawn to its two nearest neighbors.
At first glance, it might seem strange that some of the points have more than two lines coming out of them: this is due to the fact that if point A is one of the two nearest neighbors of point B, this does not necessarily imply that point B is one of the two nearest neighbors of point A.
Although the broadcasting and row-wise sorting of this approach might seem less straightforward than writing a loop, it turns out to be a very efficient way of operating on this data in Python.
You might be tempted to do the same type of operation by manually looping through the data and sorting each set of neighbors individually, but this would almost certainly lead to a slower algorithm than the vectorized version we used. The beauty of this approach is that it's written in a way that's agnostic to the size of the input data: we could just as easily compute the neighbors among 100 or 1,000,000 points in any number of dimensions, and the code would look the same.
Finally, I'll note that when doing very large nearest neighbor searches, there are tree-based and/or approximate algorithms that can scale as $\mathcal{O}[N\log N]$ or better rather than the $\mathcal{O}[N^2]$ of the brute-force algorithm. One example of this is the KD-Tree, [implemented in Scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KDTree.html).
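As a rough sketch of what that interface looks like (assuming Scikit-Learn is installed; consult its documentation for details), applied to the same ``X`` used above:
```python
from sklearn.neighbors import KDTree

tree = KDTree(X)
# indices of the 3 nearest neighbors of each point (the first is the point itself)
dist, ind = tree.query(X, k=3)
print(ind[:3])
```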
## Aside: Big-O Notation
Big-O notation is a means of describing how the number of operations required for an algorithm scales as the input grows in size.
To use it correctly is to dive deeply into the realm of computer science theory, and to carefully distinguish it from the related small-o notation, big-$\Theta$ notation, big-$\Omega$ notation, and probably many mutant hybrids thereof.
While these distinctions add precision to statements about algorithmic scaling, outside computer science theory exams and the remarks of pedantic blog commenters, you'll rarely see such distinctions made in practice.
Far more common in the data science world is a less rigid use of big-O notation: as a general (if imprecise) description of the scaling of an algorithm.
With apologies to theorists and pedants, this is the interpretation we'll use throughout this book.
Big-O notation, in this loose sense, tells you how much time your algorithm will take as you increase the amount of data.
If you have an $\mathcal{O}[N]$ (read "order $N$") algorithm that takes 1 second to operate on a list of length *N*=1,000, then you should expect it to take roughly 5 seconds for a list of length *N*=5,000.
If you have an $\mathcal{O}[N^2]$ (read "order *N* squared") algorithm that takes 1 second for *N*=1000, then you should expect it to take about 25 seconds for *N*=5000.
For our purposes, the *N* will usually indicate some aspect of the size of the dataset (the number of points, the number of dimensions, etc.). When trying to analyze billions or trillions of samples, the difference between $\mathcal{O}[N]$ and $\mathcal{O}[N^2]$ can be far from trivial!
Notice that the big-O notation by itself tells you nothing about the actual wall-clock time of a computation, but only about its scaling as you change *N*.
Generally, for example, an $\mathcal{O}[N]$ algorithm is considered to have better scaling than an $\mathcal{O}[N^2]$ algorithm, and for good reason. But for small datasets in particular, the algorithm with better scaling might not be faster.
For example, in a given problem an $\mathcal{O}[N^2]$ algorithm might take 0.01 seconds, while a "better" $\mathcal{O}[N]$ algorithm might take 1 second.
Scale up *N* by a factor of 1,000, though, and the $\mathcal{O}[N]$ algorithm will win out.
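To put numbers on that hypothetical example: scaling $N$ by a factor of 1,000 takes the $\mathcal{O}[N^2]$ algorithm from 0.01 seconds to roughly $0.01 \times 1000^2 = 10{,}000$ seconds, while the $\mathcal{O}[N]$ algorithm goes from 1 second to roughly $1 \times 1000 = 1{,}000$ seconds.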
Even this loose version of big-O notation can be very useful when comparing the performance of algorithms, and we'll use this notation throughout the book when talking about how algorithms scale.
<!--NAVIGATION-->
< [Fancy Indexing](02.07-Fancy-Indexing.ipynb) | [Contents](Index.ipynb) | [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.08-Sorting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

View File

@ -0,0 +1,606 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.09-Structured-Data-NumPy.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Structured Data: NumPy's Structured Arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy's *structured arrays* and *record arrays*, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas ``Dataframe``s, which we'll explore in [Chapter 3](03.00-Introduction-to-Pandas.ipynb)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program.\n",
"It would be possible to store these in three separate arrays:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"name = ['Alice', 'Bob', 'Cathy', 'Doug']\n",
"age = [25, 45, 37, 19]\n",
"weight = [55.0, 85.5, 68.0, 61.5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But this is a bit clumsy. There's nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data.\n",
"NumPy can handle this through structured arrays, which are arrays with compound data types.\n",
"\n",
"Recall that previously we created a simple array using an expression like this:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = np.zeros(4, dtype=int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can similarly create a structured array using a compound data type specification:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('name', '<U10'), ('age', '<i4'), ('weight', '<f8')]\n"
]
}
],
"source": [
"# Use a compound data type for structured arrays\n",
"data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),\n",
" 'formats':('U10', 'i4', 'f8')})\n",
"print(data.dtype)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here ``'U10'`` translates to \"Unicode string of maximum length 10,\" ``'i4'`` translates to \"4-byte (i.e., 32 bit) integer,\" and ``'f8'`` translates to \"8-byte (i.e., 64 bit) float.\"\n",
"We'll discuss other options for these type codes in the following section.\n",
"\n",
"Now that we've created an empty container array, we can fill the array with our lists of values:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('Alice', 25, 55.0) ('Bob', 45, 85.5) ('Cathy', 37, 68.0)\n",
" ('Doug', 19, 61.5)]\n"
]
}
],
"source": [
"data['name'] = name\n",
"data['age'] = age\n",
"data['weight'] = weight\n",
"print(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we had hoped, the data is now arranged together in one convenient block of memory.\n",
"\n",
"The handy thing with structured arrays is that you can now refer to values either by index or by name:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Alice', 'Bob', 'Cathy', 'Doug'], \n",
" dtype='<U10')"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get all names\n",
"data['name']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"('Alice', 25, 55.0)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get first row of data\n",
"data[0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'Doug'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the name from the last row\n",
"data[-1]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['Alice', 'Doug'], \n",
" dtype='<U10')"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get names where age is under 30\n",
"data[data['age'] < 30]['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that if you'd like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered in the next chapter.\n",
"As we'll see, Pandas provides a ``Dataframe`` object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Structured Arrays\n",
"\n",
"Structured array data types can be specified in a number of ways.\n",
"Earlier, we saw the dictionary method:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype({'names':('name', 'age', 'weight'),\n",
" 'formats':('U10', 'i4', 'f8')})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For clarity, numerical types can be specified using Python types or NumPy ``dtype``s instead:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype({'names':('name', 'age', 'weight'),\n",
" 'formats':((np.str_, 10), int, np.float32)})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A compound type can also be specified as a list of tuples:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"dtype([('f0', 'S10'), ('f1', '<i4'), ('f2', '<f8')])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.dtype('S10,i4,f8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The shortened string format codes may seem confusing, but they are built on simple principles.\n",
"The first (optional) character is ``<`` or ``>``, which means \"little endian\" or \"big endian,\" respectively, and specifies the ordering convention for significant bits.\n",
"The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).\n",
"The last character or characters represents the size of the object in bytes.\n",
"\n",
"| Character | Description | Example |\n",
"| --------- | ----------- | ------- | \n",
"| ``'b'`` | Byte | ``np.dtype('b')`` |\n",
"| ``'i'`` | Signed integer | ``np.dtype('i4') == np.int32`` |\n",
"| ``'u'`` | Unsigned integer | ``np.dtype('u1') == np.uint8`` |\n",
"| ``'f'`` | Floating point | ``np.dtype('f8') == np.int64`` |\n",
"| ``'c'`` | Complex floating point| ``np.dtype('c16') == np.complex128``|\n",
"| ``'S'``, ``'a'`` | String | ``np.dtype('S5')`` |\n",
"| ``'U'`` | Unicode string | ``np.dtype('U') == np.str_`` |\n",
"| ``'V'`` | Raw data (void) | ``np.dtype('V') == np.void`` |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More Advanced Compound Types\n",
"\n",
"It is possible to define even more advanced compound types.\n",
"For example, you can create a type where each element contains an array or matrix of values.\n",
"Here, we'll create a data type with a ``mat`` component consisting of a $3\\times 3$ floating-point matrix:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])\n",
"[[ 0. 0. 0.]\n",
" [ 0. 0. 0.]\n",
" [ 0. 0. 0.]]\n"
]
}
],
"source": [
"tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])\n",
"X = np.zeros(1, dtype=tp)\n",
"print(X[0])\n",
"print(X['mat'][0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now each element in the ``X`` array consists of an ``id`` and a $3\\times 3$ matrix.\n",
"Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?\n",
"The reason is that this NumPy ``dtype`` directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.\n",
"If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## RecordArrays: Structured Arrays with a Twist\n",
"\n",
"NumPy also provides the ``np.recarray`` class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.\n",
"Recall that we previously accessed the ages by writing:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([25, 45, 37, 19], dtype=int32)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data['age']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we view our data as a record array instead, we can access this with slightly fewer keystrokes:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([25, 45, 37, 19], dtype=int32)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_rec = data.view(np.recarray)\n",
"data_rec.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000000 loops, best of 3: 241 ns per loop\n",
"100000 loops, best of 3: 4.61 µs per loop\n",
"100000 loops, best of 3: 7.27 µs per loop\n"
]
}
],
"source": [
"%timeit data['age']\n",
"%timeit data_rec['age']\n",
"%timeit data_rec.age"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whether the more convenient notation is worth the additional overhead will depend on your own application."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## On to Pandas\n",
"\n",
"This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas.\n",
"Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you're using NumPy arrays to map onto binary data formats in C, Fortran, or another language.\n",
"For day-to-day use of structured data, the Pandas package is a much better choice, and we'll dive into a full discussion of it in the chapter that follows."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.09-Structured-Data-NumPy.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,212 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.09-Structured-Data-NumPy.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Structured Data: NumPy's Structured Arrays
While often our data can be well represented by a homogeneous array of values, sometimes this is not the case. This section demonstrates the use of NumPy's *structured arrays* and *record arrays*, which provide efficient storage for compound, heterogeneous data. While the patterns shown here are useful for simple operations, scenarios like this often lend themselves to the use of Pandas ``DataFrame``s, which we'll explore in [Chapter 3](03.00-Introduction-to-Pandas.ipynb).
```python
import numpy as np
```
Imagine that we have several categories of data on a number of people (say, name, age, and weight), and we'd like to store these values for use in a Python program.
It would be possible to store these in three separate arrays:
```python
name = ['Alice', 'Bob', 'Cathy', 'Doug']
age = [25, 45, 37, 19]
weight = [55.0, 85.5, 68.0, 61.5]
```
But this is a bit clumsy. There's nothing here that tells us that the three arrays are related; it would be more natural if we could use a single structure to store all of this data.
NumPy can handle this through structured arrays, which are arrays with compound data types.
Recall that previously we created a simple array using an expression like this:
```python
x = np.zeros(4, dtype=int)
```
We can similarly create a structured array using a compound data type specification:
```python
# Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
                          'formats':('U10', 'i4', 'f8')})
print(data.dtype)
```
Here ``'U10'`` translates to "Unicode string of maximum length 10," ``'i4'`` translates to "4-byte (i.e., 32 bit) integer," and ``'f8'`` translates to "8-byte (i.e., 64 bit) float."
We'll discuss other options for these type codes in the following section.
Now that we've created an empty container array, we can fill the array with our lists of values:
```python
data['name'] = name
data['age'] = age
data['weight'] = weight
print(data)
```
As we had hoped, the data is now arranged together in one convenient block of memory.
The handy thing with structured arrays is that you can now refer to values either by index or by name:
```python
# Get all names
data['name']
```
```python
# Get first row of data
data[0]
```
```python
# Get the name from the last row
data[-1]['name']
```
Using Boolean masking, this even allows you to do some more sophisticated operations such as filtering on age:
```python
# Get names where age is under 30
data[data['age'] < 30]['name']
```
Note that if you'd like to do any operations that are any more complicated than these, you should probably consider the Pandas package, covered in the next chapter.
As we'll see, Pandas provides a ``DataFrame`` object, which is a structure built on NumPy arrays that offers a variety of useful data manipulation functionality similar to what we've shown here, as well as much, much more.
## Creating Structured Arrays
Structured array data types can be specified in a number of ways.
Earlier, we saw the dictionary method:
```python
np.dtype({'names':('name', 'age', 'weight'),
          'formats':('U10', 'i4', 'f8')})
```
For clarity, numerical types can be specified using Python types or NumPy ``dtype``s instead:
```python
np.dtype({'names':('name', 'age', 'weight'),
          'formats':((np.str_, 10), int, np.float32)})
```
A compound type can also be specified as a list of tuples:
```python
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
```
If the names of the types do not matter to you, you can specify the types alone in a comma-separated string:
```python
np.dtype('S10,i4,f8')
```
The shortened string format codes may seem confusing, but they are built on simple principles.
The first (optional) character is ``<`` or ``>``, which means "little endian" or "big endian," respectively, and specifies the byte-ordering convention of the type.
The next character specifies the type of data: characters, bytes, ints, floating points, and so on (see the table below).
The last character or characters represents the size of the object in bytes.
| Character | Description | Example |
| --------- | ----------- | ------- |
| ``'b'`` | Byte | ``np.dtype('b')`` |
| ``'i'`` | Signed integer | ``np.dtype('i4') == np.int32`` |
| ``'u'`` | Unsigned integer | ``np.dtype('u1') == np.uint8`` |
| ``'f'`` | Floating point | ``np.dtype('f8') == np.float64`` |
| ``'c'`` | Complex floating point| ``np.dtype('c16') == np.complex128``|
| ``'S'``, ``'a'`` | String | ``np.dtype('S5')`` |
| ``'U'`` | Unicode string | ``np.dtype('U') == np.str_`` |
| ``'V'`` | Raw data (void) | ``np.dtype('V') == np.void`` |
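For example (a quick sketch), the codes in this table can be passed directly to ``np.dtype``:
```python
print(np.dtype('<f8'))   # little-endian 8-byte float: float64
print(np.dtype('<i4'))   # little-endian 4-byte signed integer: int32
print(np.dtype('c16'))   # 16-byte complex float: complex128
```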
## More Advanced Compound Types
It is possible to define even more advanced compound types.
For example, you can create a type where each element contains an array or matrix of values.
Here, we'll create a data type with a ``mat`` component consisting of a $3\times 3$ floating-point matrix:
```python
tp = np.dtype([('id', 'i8'), ('mat', 'f8', (3, 3))])
X = np.zeros(1, dtype=tp)
print(X[0])
print(X['mat'][0])
```
Now each element in the ``X`` array consists of an ``id`` and a $3\times 3$ matrix.
Why would you use this rather than a simple multidimensional array, or perhaps a Python dictionary?
The reason is that this NumPy ``dtype`` directly maps onto a C structure definition, so the buffer containing the array content can be accessed directly within an appropriately written C program.
If you find yourself writing a Python interface to a legacy C or Fortran library that manipulates structured data, you'll probably find structured arrays quite useful!
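As a rough illustration of that memory mapping (a sketch; the C declaration in the comment is hypothetical), the packed layout of the ``tp`` dtype defined above can be inspected from Python:
```python
# roughly equivalent C struct:  struct { int64_t id; double mat[3][3]; };
print(tp.itemsize)        # 8 + 9 * 8 = 80 bytes per element
print(len(X.tobytes()))   # the raw buffer a C program could read: 80
```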
## RecordArrays: Structured Arrays with a Twist
NumPy also provides the ``np.recarray`` class, which is almost identical to the structured arrays just described, but with one additional feature: fields can be accessed as attributes rather than as dictionary keys.
Recall that we previously accessed the ages by writing:
```python
data['age']
```
If we view our data as a record array instead, we can access this with slightly fewer keystrokes:
```python
data_rec = data.view(np.recarray)
data_rec.age
```
The downside is that for record arrays, there is some extra overhead involved in accessing the fields, even when using the same syntax. We can see this here:
```python
%timeit data['age']
%timeit data_rec['age']
%timeit data_rec.age
```
Whether the more convenient notation is worth the additional overhead will depend on your own application.
## On to Pandas
This section on structured and record arrays is purposely at the end of this chapter, because it leads so well into the next package we will cover: Pandas.
Structured arrays like the ones discussed here are good to know about for certain situations, especially in case you're using NumPy arrays to map onto binary data formats in C, Fortran, or another language.
For day-to-day use of structured data, the Pandas package is a much better choice, and we'll dive into a full discussion of it in the chapter that follows.
<!--NAVIGATION-->
< [Sorting Arrays](02.08-Sorting.ipynb) | [Contents](Index.ipynb) | [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.09-Structured-Data-NumPy.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

View File

@ -0,0 +1,170 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Manipulation with Pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous chapter, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.\n",
"Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library.\n",
"Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``.\n",
"``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.\n",
"As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.\n",
"\n",
"As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.\n",
"While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.\n",
"Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of \"data munging\" tasks that occupy much of a data scientist's time.\n",
"\n",
"In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.\n",
"We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installing and Using Pandas\n",
"\n",
"Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built.\n",
"Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).\n",
"If you followed the advice outlined in the [Preface](00.00-Preface.ipynb) and used the Anaconda stack, you already have Pandas installed.\n",
"\n",
"Once Pandas is installed, you can import it and check the version:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'0.18.1'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas\n",
"pandas.__version__"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This import convention will be used throughout the remainder of this book."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reminder about Built-In Documentation\n",
"\n",
"As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). (Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb) if you need a refresher on this.)\n",
"\n",
"For example, to display all the contents of the pandas namespace, you can type\n",
"\n",
"```ipython\n",
"In [3]: pd.<TAB>\n",
"```\n",
"\n",
"And to display Pandas's built-in documentation, you can use this:\n",
"\n",
"```ipython\n",
"In [4]: pd?\n",
"```\n",
"\n",
"More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<!--NAVIGATION-->\n",
"< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

View File

@ -0,0 +1,93 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Data Manipulation with Pandas
In the previous chapter, we dove into detail on NumPy and its ``ndarray`` object, which provides efficient storage and manipulation of dense typed arrays in Python.
Here we'll build on this knowledge by looking in detail at the data structures provided by the Pandas library.
Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a ``DataFrame``.
``DataFrame``s are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
As we saw, NumPy's ``ndarray`` data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks.
While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us.
Pandas, and in particular its ``Series`` and ``DataFrame`` objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.
In this chapter, we will focus on the mechanics of using ``Series``, ``DataFrame``, and related structures effectively.
We will use examples drawn from real datasets where appropriate, but these examples are not necessarily the focus.
## Installing and Using Pandas
Installation of Pandas on your system requires NumPy to be installed, and if building the library from source, requires the appropriate tools to compile the C and Cython sources on which Pandas is built.
Details on this installation can be found in the [Pandas documentation](http://pandas.pydata.org/).
If you followed the advice outlined in the [Preface](00.00-Preface.ipynb) and used the Anaconda stack, you already have Pandas installed.
Once Pandas is installed, you can import it and check the version:
```python
import pandas
pandas.__version__
```
Just as we generally import NumPy under the alias ``np``, we will import Pandas under the alias ``pd``:
```python
import pandas as pd
```
This import convention will be used throughout the remainder of this book.
## Reminder about Built-In Documentation
As you read through this chapter, don't forget that IPython gives you the ability to quickly explore the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). (Refer back to [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb) if you need a refresher on this.)
For example, to display all the contents of the pandas namespace, you can type
```ipython
In [3]: pd.<TAB>
```
And to display Pandas's built-in documentation, you can use this:
```ipython
In [4]: pd?
```
More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.
<!--NAVIGATION-->
< [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb) | [Contents](Index.ipynb) | [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.00-Introduction-to-Pandas.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,379 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) | [Contents](Index.ipynb) | [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Introducing Pandas Objects
At the very basic level, Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
As we will see during the course of this chapter, Pandas provides a host of useful tools, methods, and functionality on top of the basic data structures, but nearly everything that follows will require an understanding of what these structures are.
Thus, before we go any further, let's introduce these three fundamental Pandas data structures: the ``Series``, ``DataFrame``, and ``Index``.
We will start our code sessions with the standard NumPy and Pandas imports:
```python
import numpy as np
import pandas as pd
```
## The Pandas Series Object
A Pandas ``Series`` is a one-dimensional array of indexed data.
It can be created from a list or array as follows:
```python
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
```
As we see in the output, the ``Series`` wraps both a sequence of values and a sequence of indices, which we can access with the ``values`` and ``index`` attributes.
The ``values`` are simply a familiar NumPy array:
```python
data.values
```
The ``index`` is an array-like object of type ``pd.Index``, which we'll discuss in more detail momentarily.
```python
data.index
```
Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
```python
data[1]
```
```python
data[1:3]
```
As we will see, though, the Pandas ``Series`` is much more general and flexible than the one-dimensional NumPy array that it emulates.
### ``Series`` as generalized NumPy array
From what we've seen so far, it may look like the ``Series`` object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an *implicitly defined* integer index used to access the values, the Pandas ``Series`` has an *explicitly defined* index associated with the values.
This explicit index definition gives the ``Series`` object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type.
For example, if we wish, we can use strings as an index:
```python
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data
```
And the item access works as expected:
```python
data['b']
```
We can even use non-contiguous or non-sequential indices:
```python
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data
```
```python
data[5]
```
### Series as specialized dictionary
In this way, you can think of a Pandas ``Series`` as a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a ``Series`` is a structure which maps typed keys to a set of typed values.
This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.
The ``Series``-as-dictionary analogy can be made even more clear by constructing a ``Series`` object directly from a Python dictionary:
```python
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population
```
By default, a ``Series`` will be created where the index is drawn from the keys.
From here, typical dictionary-style item access can be performed:
```python
population['California']
```
Unlike a dictionary, though, the ``Series`` also supports array-style operations such as slicing:
```python
population['California':'Illinois']
```
We'll discuss some of the quirks of Pandas indexing and slicing in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
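One such quirk, as a quick preview (a minimal sketch): slicing with the explicit index *includes* the final label, while slicing with the implicit integer position *excludes* it:
```python
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data['a':'c'])   # explicit index: 'c' is included
print(data[0:2])       # implicit integer index: position 2 is excluded
```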
<!-- #region -->
### Constructing Series objects
We've already seen a few ways of constructing a Pandas ``Series`` from scratch; all of them are some version of the following:
```python
>>> pd.Series(data, index=index)
```
where ``index`` is an optional argument, and ``data`` can be one of many entities.
For example, ``data`` can be a list or NumPy array, in which case ``index`` defaults to an integer sequence:
<!-- #endregion -->
```python
pd.Series([2, 4, 6])
```
``data`` can be a scalar, which is repeated to fill the specified index:
```python
pd.Series(5, index=[100, 200, 300])
```
``data`` can be a dictionary, in which case ``index`` defaults to the dictionary keys:
```python
pd.Series({2:'a', 1:'b', 3:'c'})
```
In each case, the index can be explicitly set if a different result is preferred:
```python
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])
```
Notice that in this case, the ``Series`` is populated only with the explicitly identified keys.
## The Pandas DataFrame Object
The next fundamental structure in Pandas is the ``DataFrame``.
Like the ``Series`` object discussed in the previous section, the ``DataFrame`` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
We'll now take a look at each of these perspectives.
### DataFrame as a generalized NumPy array
If a ``Series`` is an analog of a one-dimensional array with flexible indices, a ``DataFrame`` is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a ``DataFrame`` as a sequence of aligned ``Series`` objects.
Here, by "aligned" we mean that they share the same index.
To demonstrate this, let's first construct a new ``Series`` listing the area of each of the five states discussed in the previous section:
```python
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
```
Now that we have this along with the ``population`` Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:
```python
states = pd.DataFrame({'population': population,
'area': area})
states
```
Like the ``Series`` object, the ``DataFrame`` has an ``index`` attribute that gives access to the index labels:
```python
states.index
```
Additionally, the ``DataFrame`` has a ``columns`` attribute, which is an ``Index`` object holding the column labels:
```python
states.columns
```
Thus the ``DataFrame`` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
### DataFrame as specialized dictionary
Similarly, we can also think of a ``DataFrame`` as a specialization of a dictionary.
Where a dictionary maps a key to a value, a ``DataFrame`` maps a column name to a ``Series`` of column data.
For example, asking for the ``'area'`` attribute returns the ``Series`` object containing the areas we saw earlier:
```python
states['area']
```
Notice the potential point of confusion here: in a two-dimensional NumPy array, ``data[0]`` will return the first *row*. For a ``DataFrame``, ``data['col0']`` will return the first *column*.
Because of this, it is probably better to think about ``DataFrame``s as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
We'll explore more flexible means of indexing ``DataFrame``s in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb).
### Constructing DataFrame objects
A Pandas ``DataFrame`` can be constructed in a variety of ways.
Here we'll give several examples.
#### From a single Series object
A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:
```python
pd.DataFrame(population, columns=['population'])
```
#### From a list of dicts
Any list of dictionaries can be made into a ``DataFrame``.
We'll use a simple list comprehension to create some data:
```python
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
```
Even if some keys in the dictionary are missing, Pandas will fill them in with ``NaN`` (i.e., "not a number") values:
```python
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
```
#### From a dictionary of Series objects
As we saw before, a ``DataFrame`` can be constructed from a dictionary of ``Series`` objects as well:
```python
pd.DataFrame({'population': population,
'area': area})
```
#### From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:
```python
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
```
#### From a NumPy structured array
We covered structured arrays in [Structured Data: NumPy's Structured Arrays](02.09-Structured-Data-NumPy.ipynb).
A Pandas ``DataFrame`` operates much like a structured array, and can be created directly from one:
```python
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
```
```python
pd.DataFrame(A)
```
## The Pandas Index Object
We have seen here that both the ``Series`` and ``DataFrame`` objects contain an explicit *index* that lets you reference and modify data.
This ``Index`` object is an interesting structure in itself, and it can be thought of either as an *immutable array* or as an *ordered set* (technically a multi-set, as ``Index`` objects may contain repeated values).
Those views have some interesting consequences in the operations available on ``Index`` objects.
As a simple example, let's construct an ``Index`` from a list of integers:
```python
ind = pd.Index([2, 3, 5, 7, 11])
ind
```
### Index as immutable array
The ``Index`` in many ways operates like an array.
For example, we can use standard Python indexing notation to retrieve values or slices:
```python
ind[1]
```
```python
ind[::2]
```
``Index`` objects also have many of the attributes familiar from NumPy arrays:
```python
print(ind.size, ind.shape, ind.ndim, ind.dtype)
```
One difference between ``Index`` objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:
```python
ind[1] = 0
```
This immutability makes it safer to share indices between multiple ``DataFrame``s and arrays, without the potential for side effects from inadvertent index modification.
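For example, here is a minimal sketch (the names ``shared``, ``s1``, and ``s2`` are just for illustration) of two ``Series`` objects safely sharing a single ``Index``:
```python
# Because an Index is immutable, sharing it between objects has no side effects
shared = pd.Index(['a', 'b', 'c'])
s1 = pd.Series([1, 2, 3], index=shared)
s2 = pd.Series([4, 5, 6], index=shared)
s1.index is s2.index  # both Series refer to the same Index object
```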
### Index as ordered set
Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic.
The ``Index`` object follows many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
```python
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
```
```python
indA & indB # intersection
```
```python
indA | indB # union
```
```python
indA ^ indB # symmetric difference
```
These operations may also be accessed via object methods, for example ``indA.intersection(indB)``; in recent versions of Pandas the method forms are recommended, as the operator forms on ``Index`` objects have been deprecated.
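A minimal sketch of those method forms, using the ``indA`` and ``indB`` objects defined above:
```python
indA.intersection(indB)          # elements present in both indices
indA.union(indB)                 # elements present in either index
indA.symmetric_difference(indB)  # elements present in exactly one of the two
```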
<!--NAVIGATION-->
< [Data Manipulation with Pandas](03.00-Introduction-to-Pandas.ipynb) | [Contents](Index.ipynb) | [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) | [Contents](Index.ipynb) | [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.02-Data-Indexing-and-Selection.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Data Indexing and Selection
In [Chapter 2](02.00-Introduction-to-NumPy.ipynb), we looked in detail at methods and tools to access, set, and modify values in NumPy arrays.
These included indexing (e.g., ``arr[2, 1]``), slicing (e.g., ``arr[:, 1:5]``), masking (e.g., ``arr[arr > 0]``), fancy indexing (e.g., ``arr[0, [1, 5]]``), and combinations thereof (e.g., ``arr[:, [1, 5]]``).
Here we'll look at similar means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
We'll start with the simple case of the one-dimensional ``Series`` object, and then move on to the more complicated two-dimensional ``DataFrame`` object.
## Data Selection in Series
As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.
### Series as dictionary
Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:
```python
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data
```
```python
data['b']
```
We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:
```python
'a' in data
```
```python
data.keys()
```
```python
list(data.items())
```
``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:
```python
data['e'] = 1.25
data
```
This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.
### Series as one-dimensional array
A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays; that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:
```python
# slicing by explicit index
data['a':'c']
```
```python
# slicing by implicit integer index
data[0:2]
```
```python
# masking
data[(data > 0.3) & (data < 0.8)]
```
```python
# fancy indexing
data[['a', 'e']]
```
Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.
### Indexers: loc, iloc, and ix
These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.
```python
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
```
```python
# explicit index when indexing
data[1]
```
```python
# implicit index when slicing
data[1:3]
```
Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.
First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:
```python
data.loc[1]
```
```python
data.loc[1:3]
```
The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:
```python
data.iloc[1]
```
```python
data.iloc[1:3]
```
A third indexing attribute, ``ix``, is a hybrid of the two, and for ``Series`` objects is equivalent to standard ``[]``-based indexing. (Note that the ``ix`` indexer has since been deprecated and removed in recent versions of Pandas; ``loc`` and ``iloc`` are the recommended interfaces.)
The purpose of the ``ix`` indexer will become more apparent in the context of ``DataFrame`` objects, which we will discuss in a moment.
One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of the ``loc`` and ``iloc`` indexers makes them very useful for maintaining clean and readable code; especially in the case of integer indexes, I recommend using them both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.
## Data Selection in DataFrame
Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.
### DataFrame as a dictionary
The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's return to our example of areas and populations of states:
```python
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
```
The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:
```python
data['area']
```
Equivalently, we can use attribute-style access with column names that are strings:
```python
data.area
```
This attribute-style column access actually accesses the exact same object as the dictionary-style access:
```python
data.area is data['area']
```
Though this is a useful shorthand, keep in mind that it does not work for all cases!
If the column names are not strings, or if they conflict with methods of the ``DataFrame``, this attribute-style access is not possible.
For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this method rather than to the ``"pop"`` column:
```python
data.pop is data['pop']
```
In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
Like with the ``Series`` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:
```python
data['density'] = data['pop'] / data['area']
data
```
This shows a preview of the straightforward syntax of element-by-element arithmetic between ``Series`` objects; we'll dig into this further in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb).
### DataFrame as two-dimensional array
As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:
```python
data.values
```
With this picture in mind, many familiar array-like operations can be performed on the ``DataFrame`` itself.
For example, we can transpose the full ``DataFrame`` to swap rows and columns:
```python
data.T
```
When it comes to indexing of ``DataFrame`` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array.
In particular, passing a single index to an array accesses a row:
```python
data.values[0]
```
and passing a single "index" to a ``DataFrame`` accesses a column:
```python
data['area']
```
Thus for array-style indexing, we need another convention.
Here Pandas again uses the ``loc``, ``iloc``, and ``ix`` indexers mentioned earlier.
Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:
```python
data.iloc[:3, :2]
```
Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:
```python
data.loc[:'Illinois', :'pop']
```
The ``ix`` indexer allows a hybrid of these two approaches:
```python
data.ix[:3, :'pop']
```
Keep in mind that for integer indices, the ``ix`` indexer is subject to the same potential sources of confusion as discussed for integer-indexed ``Series`` objects.
Any of the familiar NumPy-style data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine masking and fancy indexing as in the following:
```python
data.loc[data.density > 100, ['pop', 'density']]
```
Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
```python
data.iloc[0, 2] = 90
data
```
To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.
### Additional indexing conventions
There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:
```python
data['Florida':'Illinois']
```
Such slices can also refer to rows by number rather than by index:
```python
data[1:3]
```
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
```python
data[data.density > 100]
```
These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
<!--NAVIGATION-->
< [Introducing Pandas Objects](03.01-Introducing-Pandas-Objects.ipynb) | [Contents](Index.ipynb) | [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.02-Data-Indexing-and-Selection.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) | [Contents](Index.ipynb) | [Handling Missing Data](03.04-Missing-Values.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.03-Operations-in-Pandas.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Operating on Data in Pandas
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy, and the ufuncs that we introduced in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) are key to this.
Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.
This means that keeping the context of data and combining data from different sources (both potentially error-prone tasks with raw NumPy arrays) become essentially foolproof with Pandas.
We will additionally see that there are well-defined operations between one-dimensional ``Series`` structures and two-dimensional ``DataFrame`` structures.
## Ufuncs: Index Preservation
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas ``Series`` and ``DataFrame`` objects.
Let's start by defining a simple ``Series`` and ``DataFrame`` on which to demonstrate this:
```python
import pandas as pd
import numpy as np
```
```python
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
```
```python
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df
```
If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object *with the indices preserved:*
```python
np.exp(ser)
```
Or, for a slightly more complex calculation:
```python
np.sin(df * np.pi / 4)
```
Any of the ufuncs discussed in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb) can be used in a similar manner.
## UFuncs: Index Alignment
For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.
### Index alignment in Series
As an example, suppose we are combining two different data sources, and find only the top three US states by *area* and the top three US states by *population*:
```python
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')
```
Let's see what happens when we divide these to compute the population density:
```python
population / area
```
The resulting array contains the *union* of indices of the two input arrays, which could be determined using standard Python set arithmetic on these indices:
```python
area.index | population.index
```
Any item for which one or the other does not have an entry is marked with ``NaN``, or "Not a Number," which is how Pandas marks missing data (see further discussion of missing data in [Handling Missing Data](03.04-Missing-Values.ipynb)).
This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:
```python
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
```
If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators.
For example, calling ``A.add(B)`` is equivalent to calling ``A + B``, but allows optional explicit specification of the fill value for any elements in ``A`` or ``B`` that might be missing:
```python
A.add(B, fill_value=0)
```
### Index alignment in DataFrame
A similar type of alignment takes place for *both* columns and indices when performing operations on ``DataFrame``s:
```python
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
A
```
```python
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
B
```
```python
A + B
```
Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted.
As was the case with ``Series``, we can use the associated object's arithmetic method and pass any desired ``fill_value`` to be used in place of missing entries.
Here we'll fill with the mean of all values in ``A`` (computed by first stacking the rows of ``A``):
```python
fill = A.stack().mean()
A.add(B, fill_value=fill)
```
The following table lists Python operators and their equivalent Pandas object methods:
| Python Operator | Pandas Method(s) |
|-----------------|---------------------------------------|
| ``+`` | ``add()`` |
| ``-`` | ``sub()``, ``subtract()`` |
| ``*`` | ``mul()``, ``multiply()`` |
| ``/`` | ``truediv()``, ``div()``, ``divide()``|
| ``//`` | ``floordiv()`` |
| ``%`` | ``mod()`` |
| ``**`` | ``pow()`` |
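As a brief sketch of these method forms (reusing the ``A`` and ``B`` ``DataFrame``s defined above), each method accepts the same ``fill_value`` option shown earlier:
```python
A.mul(B, fill_value=1)  # elementwise multiplication; missing entries treated as 1
A.sub(B, fill_value=0)  # elementwise subtraction; missing entries treated as 0
```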
## Ufuncs: Operations Between DataFrame and Series
When performing operations between a ``DataFrame`` and a ``Series``, the index and column alignment is similarly maintained.
Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a two-dimensional and one-dimensional NumPy array.
Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:
```python
A = rng.randint(10, size=(3, 4))
A
```
```python
A - A[0]
```
According to NumPy's broadcasting rules (see [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb)), subtraction between a two-dimensional array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
```python
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
```
If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the ``axis`` keyword:
```python
df.subtract(df['R'], axis=0)
```
Note that these ``DataFrame``/``Series`` operations, like the operations discussed above, will automatically align indices between the two elements:
```python
halfrow = df.iloc[0, ::2]
halfrow
```
```python
df - halfrow
```
This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.
<!--NAVIGATION-->
< [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb) | [Contents](Index.ipynb) | [Handling Missing Data](03.04-Missing-Values.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.03-Operations-in-Pandas.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) | [Contents](Index.ipynb) | [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.04-Missing-Values.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Handling Missing Data
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous.
In particular, many interesting datasets will have some amount of data missing.
To make matters even more complicated, different data sources may indicate missing data in different ways.
In this section, we will discuss some general considerations for missing data, discuss how Pandas chooses to represent it, and demonstrate some built-in Pandas tools for handling missing data in Python.
Here and throughout the book, we'll refer to missing data in general as *null*, *NaN*, or *NA* values.
## Trade-Offs in Missing Data Conventions
There are a number of schemes that have been developed to indicate the presence of missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: using a *mask* that globally indicates missing values, or choosing a *sentinel value* that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number), a special value which is part of the IEEE floating-point specification.
None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array, which adds overhead in both storage and computation. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic. Common special values like NaN are not available for all data types.
As in most cases where no universally optimal choice exists, different languages and systems use different conventions.
For example, the R language uses reserved bit patterns within each data type as sentinel values indicating missing data, while the SciDB system uses an extra byte attached to every cell to indicate an NA state.
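As a small illustrative sketch of the two strategies described above (the names ``masked`` and ``sentinel`` are just for illustration), NumPy itself offers both a mask-based representation and a NaN sentinel:
```python
import numpy as np
import numpy.ma as ma

# Mask-based convention: a separate Boolean array marks which entries are missing
masked = ma.masked_array([1, 2, 3, 4], mask=[False, True, False, False])

# Sentinel-based convention: a special value (NaN) stands in for the missing entry
sentinel = np.array([1.0, np.nan, 3.0, 4.0])

masked.sum(), np.nansum(sentinel)
```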
## Missing Data in Pandas
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
Pandas could have followed R's lead in specifying bit patterns for each individual data type to indicate nullness, but this approach turns out to be rather unwieldy.
While R contains four basic data types, NumPy supports *far* more than this: for example, while R has a single integer type, NumPy supports *fourteen* basic integer types once you account for available precisions, signedness, and endianness of the encoding.
Reserving a specific bit pattern in all available NumPy types would lead to an unwieldy amount of overhead in special-casing various operations for various types, likely even requiring a new fork of the NumPy package. Further, for the smaller data types (such as 8-bit integers), sacrificing a bit to use as a mask will significantly reduce the range of values it can represent.
NumPy does have support for masked arrays; that is, arrays that have a separate Boolean mask array attached for marking data as "good" or "bad."
Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.
With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point ``NaN`` value, and the Python ``None`` object.
This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.
### ``None``: Pythonic missing data
The first sentinel value used by Pandas is ``None``, a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, ``None`` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type ``'object'`` (i.e., arrays of Python objects):
```python
import numpy as np
import pandas as pd
```
```python
vals1 = np.array([1, None, 3, 4])
vals1
```
This ``dtype=object`` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects.
While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:
```python
for dtype in ['object', 'int']:
print("dtype =", dtype)
%timeit np.arange(1E6, dtype=dtype).sum()
print()
```
The use of Python objects in an array also means that if you perform aggregations like ``sum()`` or ``min()`` across an array with a ``None`` value, you will generally get an error:
```python
vals1.sum()
```
This reflects the fact that addition between an integer and ``None`` is undefined.
### ``NaN``: Missing numerical data
The other missing data representation, ``NaN`` (acronym for *Not a Number*), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
```python
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
```
Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code.
You should be aware that ``NaN`` is a bit like a data virus: it infects any other object it touches.
Regardless of the operation, the result of arithmetic with ``NaN`` will be another ``NaN``:
```python
1 + np.nan
```
```python
0 * np.nan
```
Note that this means that aggregates over the values are well defined (i.e., they don't result in an error) but not always useful:
```python
vals2.sum(), vals2.min(), vals2.max()
```
NumPy does provide some special aggregations that will ignore these missing values:
```python
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
```
Keep in mind that ``NaN`` is specifically a floating-point value; there is no equivalent NaN value for integers, strings, or other types.
### NaN and None in Pandas
``NaN`` and ``None`` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:
```python
pd.Series([1, np.nan, 2, None])
```
For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present.
For example, if we set a value in an integer array to ``np.nan``, it will automatically be upcast to a floating-point type to accommodate the NA:
```python
x = pd.Series(range(2), dtype=int)
x
```
```python
x[0] = None
x
```
Notice that in addition to casting the integer array to floating point, Pandas automatically converts the ``None`` to a ``NaN`` value.
(Be aware that a nullable integer type has since been added to Pandas as an optional extension dtype; see the short sketch after the following table.)
While this type of magic may feel a bit hackish compared to the more unified approach to NA values in domain-specific languages like R, the Pandas sentinel/casting approach works quite well in practice and in my experience only rarely causes issues.
The following table lists the upcasting conventions in Pandas when NA values are introduced:
|Typeclass | Conversion When Storing NAs | NA Sentinel Value |
|--------------|-----------------------------|------------------------|
| ``floating`` | No change | ``np.nan`` |
| ``object`` | No change | ``None`` or ``np.nan`` |
| ``integer`` | Cast to ``float64`` | ``np.nan`` |
| ``boolean`` | Cast to ``object`` | ``None`` or ``np.nan`` |
Keep in mind that in Pandas, string data is always stored with an ``object`` dtype.
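As a short sketch of the nullable integer type mentioned above (available in newer versions of Pandas), note the capitalized ``Int64`` dtype:
```python
# The nullable "Int64" dtype keeps integer values intact and marks
# missing entries with pd.NA instead of upcasting the array to float
pd.Series([1, None, 3], dtype='Int64')
```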
## Operating on Null Values
As we have seen, Pandas treats ``None`` and ``NaN`` as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:
- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed
We will conclude this section with a brief exploration and demonstration of these routines.
### Detecting null values
Pandas data structures have two useful methods for detecting null data: ``isnull()`` and ``notnull()``.
Either one will return a Boolean mask over the data. For example:
```python
data = pd.Series([1, np.nan, 'hello', None])
```
```python
data.isnull()
```
As mentioned in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb), Boolean masks can be used directly as a ``Series`` or ``DataFrame`` index:
```python
data[data.notnull()]
```
The ``isnull()`` and ``notnull()`` methods produce similar Boolean results for ``DataFrame``s.
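For instance, a brief sketch with a small hypothetical ``DataFrame`` (the name ``df_nulls`` is just for illustration) shows the same elementwise behavior:
```python
df_nulls = pd.DataFrame({'x': [1, np.nan], 'y': [None, 2]})
df_nulls.isnull()
```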
### Dropping null values
In addition to the masking used before, there are the convenience methods ``dropna()``
(which removes NA values) and ``fillna()`` (which fills in NA values). For a ``Series``,
the result is straightforward:
```python
data.dropna()
```
For a ``DataFrame``, there are more options.
Consider the following ``DataFrame``:
```python
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
```
We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.
Depending on the application, you might want one or the other, so ``dropna()`` gives a number of options for a ``DataFrame``.
By default, ``dropna()`` will drop all rows in which *any* null value is present:
```python
df.dropna()
```
Alternatively, you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value:
```python
df.dropna(axis='columns')
```
But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values.
This can be specified through the ``how`` or ``thresh`` parameters, which allow fine control of the number of nulls to allow through.
The default is ``how='any'``, such that any row or column (depending on the ``axis`` keyword) containing a null value will be dropped.
You can also specify ``how='all'``, which will only drop rows/columns that are *all* null values:
```python
df[3] = np.nan
df
```
```python
df.dropna(axis='columns', how='all')
```
For finer-grained control, the ``thresh`` parameter lets you specify a minimum number of non-null values for the row/column to be kept:
```python
df.dropna(axis='rows', thresh=3)
```
Here the first and last row have been dropped, because they contain only two non-null values.
### Filling null values
Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in-place using the ``isnull()`` method as a mask, but because it is such a common operation Pandas provides the ``fillna()`` method, which returns a copy of the array with the null values replaced.
Consider the following ``Series``:
```python
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
```
We can fill NA entries with a single value, such as zero:
```python
data.fillna(0)
```
We can specify a forward-fill to propagate the previous value forward:
```python
# forward-fill
data.fillna(method='ffill')
```
Or we can specify a back-fill to propagate the next values backward:
```python
# back-fill
data.fillna(method='bfill')
```
For ``DataFrame``s, the options are similar, but we can also specify an ``axis`` along which the fills take place:
```python
df
```
```python
df.fillna(method='ffill', axis=1)
```
Notice that if a previous value is not available during a forward fill, the NA value remains.
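Note that in more recent versions of Pandas, the ``method`` argument to ``fillna()`` is deprecated in favor of the dedicated ``ffill()`` and ``bfill()`` methods; a minimal equivalent sketch:
```python
data.ffill()      # equivalent to data.fillna(method='ffill')
df.ffill(axis=1)  # equivalent to df.fillna(method='ffill', axis=1)
```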
<!--NAVIGATION-->
< [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) | [Contents](Index.ipynb) | [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.04-Missing-Values.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!-- #region deletable=true editable=true -->
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [Handling Missing Data](03.04-Missing-Values.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.05-Hierarchical-Indexing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->
# Hierarchical Indexing
<!-- #region deletable=true editable=true -->
Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas ``Series`` and ``DataFrame`` objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data; that is, data indexed by more than one or two keys.
While Pandas does provide ``Panel`` and ``Panel4D`` objects that natively handle three-dimensional and four-dimensional data (see [Aside: Panel Data](#Aside:-Panel-Data)), these objects have since been deprecated and removed in recent versions of Pandas, and a far more common pattern in practice is to make use of *hierarchical indexing* (also known as *multi-indexing*) to incorporate multiple index *levels* within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional ``Series`` and two-dimensional ``DataFrame`` objects.
In this section, we'll explore the direct creation of ``MultiIndex`` objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.
We begin with the standard imports:
<!-- #endregion -->
```python deletable=true editable=true
import pandas as pd
import numpy as np
```
<!-- #region deletable=true editable=true -->
## A Multiply Indexed Series
Let's start by considering how we might represent two-dimensional data within a one-dimensional ``Series``.
For concreteness, we will consider a series of data where each point has a character and numerical key.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### The bad way
Suppose you would like to track data about states from two different years.
Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:
<!-- #endregion -->
```python deletable=true editable=true
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
```
<!-- #region deletable=true editable=true -->
With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:
<!-- #endregion -->
```python deletable=true editable=true
pop[('California', 2010):('Texas', 2000)]
```
<!-- #region deletable=true editable=true -->
But the convenience ends there. For example, if you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:
<!-- #endregion -->
```python deletable=true editable=true
pop[[i for i in pop.index if i[1] == 2010]]
```
<!-- #region deletable=true editable=true -->
This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we've grown to love in Pandas.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### The better way: Pandas MultiIndex
Fortunately, Pandas provides a better way.
Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas ``MultiIndex`` type gives us the type of operations we wish to have.
We can create a multi-index from the tuples as follows:
<!-- #endregion -->
```python deletable=true editable=true
index = pd.MultiIndex.from_tuples(index)
index
```
<!-- #region deletable=true editable=true -->
Notice that the ``MultiIndex`` contains multiple *levels* of indexing; in this case, the state names and the years, as well as multiple *labels* for each data point which encode these levels.
If we re-index our series with this ``MultiIndex``, we see the hierarchical representation of the data:
<!-- #endregion -->
```python deletable=true editable=true
pop = pop.reindex(index)
pop
```
<!-- #region deletable=true editable=true -->
Here the first two columns of the ``Series`` representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
<!-- #endregion -->
```python deletable=true editable=true
pop[:, 2010]
```
<!-- #region deletable=true editable=true -->
The result is a singly indexed array with just the keys we're interested in.
This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with.
We'll now further discuss this sort of indexing operation on hierarchically indexed data.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### MultiIndex as extra dimension
You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:
<!-- #endregion -->
```python deletable=true editable=true
pop_df = pop.unstack()
pop_df
```
<!-- #region deletable=true editable=true -->
Naturally, the ``stack()`` method provides the opposite operation:
<!-- #endregion -->
```python deletable=true editable=true
pop_df.stack()
```
<!-- #region deletable=true editable=true -->
Seeing this, you might wonder why we would bother with hierarchical indexing at all.
The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional ``Series``, we can also use it to represent data of three or more dimensions in a ``Series`` or ``DataFrame``.
Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a ``MultiIndex`` this is as easy as adding another column to the ``DataFrame``:
<!-- #endregion -->
```python deletable=true editable=true
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
```
<!-- #region deletable=true editable=true -->
In addition, all the ufuncs and other functionality discussed in [Operating on Data in Pandas](03.03-Operations-in-Pandas.ipynb) work with hierarchical indices as well.
Here we compute the fraction of people under 18 by year, given the above data:
<!-- #endregion -->
```python deletable=true editable=true
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
```
<!-- #region deletable=true editable=true -->
This allows us to easily and quickly manipulate and explore even high-dimensional data.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
## Methods of MultiIndex Creation
The most straightforward way to construct a multiply indexed ``Series`` or ``DataFrame`` is to simply pass a list of two or more index arrays to the constructor. For example:
<!-- #endregion -->
```python deletable=true editable=true
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
```
<!-- #region deletable=true editable=true -->
The work of creating the ``MultiIndex`` is done in the background.
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a ``MultiIndex`` by default:
<!-- #endregion -->
```python deletable=true editable=true
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
```
<!-- #region deletable=true editable=true -->
Nevertheless, it is sometimes useful to explicitly create a ``MultiIndex``; we'll see a couple of these methods here.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### Explicit MultiIndex constructors
For more flexibility in how the index is constructed, you can instead use the class method constructors available in the ``pd.MultiIndex``.
For example, as we did before, you can construct the ``MultiIndex`` from a simple list of arrays giving the index values within each level:
<!-- #endregion -->
```python deletable=true editable=true
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
```
<!-- #region deletable=true editable=true -->
You can construct it from a list of tuples giving the multiple index values of each point:
<!-- #endregion -->
```python deletable=true editable=true
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])
```
<!-- #region deletable=true editable=true -->
You can even construct it from a Cartesian product of single indices:
<!-- #endregion -->
```python deletable=true editable=true
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
```
<!-- #region deletable=true editable=true -->
Similarly, you can construct the ``MultiIndex`` directly using its internal encoding by passing ``levels`` (a list of lists containing the available index values for each level) and ``labels`` (a list of lists of integer codes that reference these level values; in recent versions of Pandas this argument has been renamed ``codes``):
<!-- #endregion -->
```python deletable=true editable=true
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
```
<!-- #region deletable=true editable=true -->
Any of these objects can be passed as the ``index`` argument when creating a ``Series`` or ``DataFrame``, or be passed to the ``reindex`` method of an existing ``Series`` or ``DataFrame``.
<!-- #endregion -->
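<!-- #region deletable=true editable=true -->
For instance, here is a short sketch of passing an explicitly constructed ``MultiIndex`` as the ``index`` argument (the name ``mi`` is just for illustration):
<!-- #endregion -->

```python deletable=true editable=true
mi = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['char', 'int'])
pd.Series([0.1, 0.2, 0.3, 0.4], index=mi)
```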
<!-- #region deletable=true editable=true -->
### MultiIndex level names
Sometimes it is convenient to name the levels of the ``MultiIndex``.
This can be accomplished by passing the ``names`` argument to any of the above ``MultiIndex`` constructors, or by setting the ``names`` attribute of the index after the fact:
<!-- #endregion -->
```python deletable=true editable=true
pop.index.names = ['state', 'year']
pop
```
<!-- #region deletable=true editable=true -->
With more involved datasets, this can be a useful way to keep track of the meaning of various index values.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### MultiIndex for columns
In a ``DataFrame``, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well.
Consider the following, which is a mock-up of some (somewhat realistic) medical data:
<!-- #endregion -->
```python deletable=true editable=true
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
names=['subject', 'type'])
# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37
# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data
```
<!-- #region deletable=true editable=true -->
Here we see where the multi-indexing for both rows and columns can come in *very* handy.
This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number.
With this in place we can, for example, index the top-level column by the person's name and get a full ``DataFrame`` containing just that person's information:
<!-- #endregion -->
```python deletable=true editable=true
health_data['Guido']
```
<!-- #region deletable=true editable=true -->
For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.) use of hierarchical rows and columns can be extremely convenient!
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
## Indexing and Slicing a MultiIndex
Indexing and slicing on a ``MultiIndex`` is designed to be intuitive, and it helps if you think about the indices as added dimensions.
We'll first look at indexing multiply indexed ``Series``, and then multiply-indexed ``DataFrame``s.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### Multiply indexed Series
Consider the multiply indexed ``Series`` of state populations we saw earlier:
<!-- #endregion -->
```python deletable=true editable=true
pop
```
<!-- #region deletable=true editable=true -->
We can access single elements by indexing with multiple terms:
<!-- #endregion -->
```python deletable=true editable=true
pop['California', 2000]
```
<!-- #region deletable=true editable=true -->
The ``MultiIndex`` also supports *partial indexing*, or indexing just one of the levels in the index.
The result is another ``Series``, with the lower-level indices maintained:
<!-- #endregion -->
```python deletable=true editable=true
pop['California']
```
<!-- #region deletable=true editable=true -->
Partial slicing is available as well, as long as the ``MultiIndex`` is sorted (see discussion in [Sorted and Unsorted Indices](#Sorted-and-unsorted-indices)):
<!-- #endregion -->
```python deletable=true editable=true
pop.loc['California':'New York']
```
<!-- #region deletable=true editable=true -->
With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:
<!-- #endregion -->
```python deletable=true editable=true
pop[:, 2000]
```
<!-- #region deletable=true editable=true -->
Other types of indexing and selection (discussed in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb)) work as well; for example, selection based on Boolean masks:
<!-- #endregion -->
```python deletable=true editable=true
pop[pop > 22000000]
```
<!-- #region deletable=true editable=true -->
Selection based on fancy indexing also works:
<!-- #endregion -->
```python deletable=true editable=true
pop[['California', 'Texas']]
```
<!-- #region deletable=true editable=true -->
### Multiply indexed DataFrames
A multiply indexed ``DataFrame`` behaves in a similar manner.
Consider our toy medical ``DataFrame`` from before:
<!-- #endregion -->
```python deletable=true editable=true
health_data
```
<!-- #region deletable=true editable=true -->
Remember that columns are primary in a ``DataFrame``, and the syntax used for multiply indexed ``Series`` applies to the columns.
For example, we can recover Guido's heart rate data with a simple operation:
<!-- #endregion -->
```python deletable=true editable=true
health_data['Guido', 'HR']
```
<!-- #region deletable=true editable=true -->
Also, as with the single-index case, we can use the ``loc`` and ``iloc`` indexers introduced in [Data Indexing and Selection](03.02-Data-Indexing-and-Selection.ipynb). For example:
<!-- #endregion -->
```python deletable=true editable=true
health_data.iloc[:2, :2]
```
<!-- #region deletable=true editable=true -->
These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in ``loc`` or ``iloc`` can be passed a tuple of multiple indices. For example:
<!-- #endregion -->
```python deletable=true editable=true
health_data.loc[:, ('Bob', 'HR')]
```
<!-- #region deletable=true editable=true -->
Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:
<!-- #endregion -->
```python deletable=true editable=true
# this is not valid Python syntax: a bare slice cannot appear inside a tuple
health_data.loc[(:, 1), (:, 'HR')]
```
<!-- #region deletable=true editable=true -->
You could get around this by building the desired slice explicitly using Python's built-in ``slice()`` function, but a better way in this context is to use an ``IndexSlice`` object, which Pandas provides for precisely this situation.
For example:
<!-- #endregion -->
```python deletable=true editable=true
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
```
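<!-- #region deletable=true editable=true -->
For comparison, the same selection built explicitly with Python's ``slice()`` function looks like the following sketch; the ``IndexSlice`` form above is simply a more readable spelling of it:
<!-- #endregion -->
```python
# equivalent selection using explicit slice objects instead of IndexSlice
health_data.loc[(slice(None), 1), (slice(None), 'HR')]
```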
<!-- #region deletable=true editable=true -->
There are so many ways to interact with data in multiply indexed ``Series`` and ``DataFrame``s, and as with many tools in this book the best way to become familiar with them is to try them out!
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
## Rearranging Multi-Indices
One of the keys to working with multiply indexed data is knowing how to effectively transform the data.
There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations.
We saw a brief example of this in the ``stack()`` and ``unstack()`` methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
### Sorted and unsorted indices
Earlier, we briefly mentioned a caveat, but we should emphasize it more here.
*Many of the ``MultiIndex`` slicing operations will fail if the index is not sorted.*
Let's take a look at this here.
We'll start by creating some simple multiply indexed data where the indices are *not lexicographically sorted*:
<!-- #endregion -->
```python deletable=true editable=true
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
```
<!-- #region deletable=true editable=true -->
If we try to take a partial slice of this index, it will result in an error:
<!-- #endregion -->
```python deletable=true editable=true
try:
data['a':'b']
except KeyError as e:
print(type(e))
print(e)
```
<!-- #region deletable=true editable=true -->
Although it is not entirely clear from the error message, this is the result of the ``MultiIndex`` not being sorted.
For various reasons, partial slices and other similar operations require the levels in the ``MultiIndex`` to be in sorted (i.e., lexicographical) order.
Pandas provides convenience routines to perform this type of sorting; the simplest is the ``sort_index()`` method of ``Series`` and ``DataFrame`` objects, which we'll use here:
<!-- #endregion -->
```python deletable=true editable=true
data = data.sort_index()
data
```
<!-- #region deletable=true editable=true -->
With the index sorted in this way, partial slicing will work as expected:
<!-- #endregion -->
```python deletable=true editable=true
data['a':'b']
```
<!-- #region deletable=true editable=true -->
### Stacking and unstacking indices
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
<!-- #endregion -->
```python deletable=true editable=true
pop.unstack(level=0)
```
```python deletable=true editable=true
pop.unstack(level=1)
```
<!-- #region deletable=true editable=true -->
The opposite of ``unstack()`` is ``stack()``, which here can be used to recover the original series:
<!-- #endregion -->
```python deletable=true editable=true
pop.unstack().stack()
```
<!-- #region deletable=true editable=true -->
### Index setting and resetting
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population ``Series`` will result in a ``DataFrame`` with *state* and *year* columns holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:
<!-- #endregion -->
```python deletable=true editable=true
pop_flat = pop.reset_index(name='population')
pop_flat
```
<!-- #region deletable=true editable=true -->
Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:
<!-- #endregion -->
```python deletable=true editable=true
pop_flat.set_index(['state', 'year'])
```
<!-- #region deletable=true editable=true -->
In practice, I find this type of reindexing to be one of the more useful patterns when encountering real-world datasets.
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
## Data Aggregations on Multi-Indices
We've previously seen that Pandas has built-in data aggregation methods, such as ``mean()``, ``sum()``, and ``max()``.
For hierarchically indexed data, these can be passed a ``level`` parameter that controls which subset of the data the aggregate is computed on.
For example, let's return to our health data:
<!-- #endregion -->
```python deletable=true editable=true
health_data
```
<!-- #region deletable=true editable=true -->
Perhaps we'd like to average out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:
<!-- #endregion -->
```python deletable=true editable=true
data_mean = health_data.mean(level='year')
data_mean
```
<!-- #region deletable=true editable=true -->
By further making use of the ``axis`` keyword, we can take the mean among levels on the columns as well:
<!-- #endregion -->
```python deletable=true editable=true
data_mean.mean(axis=1, level='type')
```
<!-- #region deletable=true editable=true -->
Thus in two lines, we've been able to find the average heart rate and temperature measured among all subjects in all visits each year.
This syntax is actually a shortcut to the ``GroupBy`` functionality, which we will discuss in [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb).
While this is a toy example, many real-world datasets have similar hierarchical structure.
<!-- #endregion -->
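<!-- #region deletable=true editable=true -->
Note that more recent versions of Pandas deprecate the ``level`` argument to aggregations like ``mean()`` in favor of an explicit ``groupby``. If the calls above raise warnings or errors in your environment, the following sketch (using the same ``health_data`` and ``data_mean`` objects) should give equivalent results:
<!-- #endregion -->
```python
# group on the 'year' level of the row index and average the two visits
health_data.groupby(level='year').mean()
```
```python
# group on the 'type' level of the columns; transposing sidesteps the
# deprecated axis=1 form of groupby
data_mean.T.groupby(level='type').mean().T
```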
<!-- #region deletable=true editable=true -->
## Aside: Panel Data
Earlier versions of Pandas included a few other fundamental data structures that we have not discussed, namely the ``pd.Panel`` and ``pd.Panel4D`` objects.
These could be thought of, respectively, as three-dimensional and four-dimensional generalizations of the (one-dimensional) ``Series`` and (two-dimensional) ``DataFrame`` structures, with indexers like ``loc`` and ``iloc`` extending readily to the higher-dimensional case.
We won't cover these panel structures further in this text: they have since been deprecated and removed from Pandas, and in the majority of cases multi-indexing is a more useful and conceptually simpler representation for higher-dimensional data anyway.
Additionally, panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation; as the number of dimensions increases, the dense representation becomes very inefficient for the majority of real-world datasets.
If you'd like to read more about the ``Panel`` and ``Panel4D`` structures, see the references listed in [Further Resources](03.13-Further-Resources.ipynb).
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [Handling Missing Data](03.04-Missing-Values.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.05-Hierarchical-Indexing.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->

File diff suppressed because it is too large

View File

@ -0,0 +1,251 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.06-Concat-And-Append.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Combining Datasets: Concat and Append
Some of the most interesting studies of data come from combining different data sources.
These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.
``Series`` and ``DataFrame``s are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.
Here we'll take a look at simple concatenation of ``Series`` and ``DataFrame``s with the ``pd.concat`` function; later we'll dive into more sophisticated in-memory merges and joins implemented in Pandas.
We begin with the standard imports:
```python
import pandas as pd
import numpy as np
```
For convenience, we'll define this function which creates a ``DataFrame`` of a particular form that will be useful below:
```python
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind]
for c in cols}
return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
```
In addition, we'll create a quick class that allows us to display multiple ``DataFrame``s side by side. The code makes use of the special ``_repr_html_`` method, which IPython uses to implement its rich object display:
```python
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
```
The use of this will become clearer as we continue our discussion in the following section.
## Recall: Concatenation of NumPy Arrays
Concatenation of ``Series`` and ``DataFrame`` objects is very similar to concatenation of NumPy arrays, which can be done via the ``np.concatenate`` function as discussed in [The Basics of NumPy Arrays](02.02-The-Basics-Of-NumPy-Arrays.ipynb).
Recall that with it, you can combine the contents of two or more arrays into a single array:
```python
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
```
The first argument is a list or tuple of arrays to concatenate.
Additionally, it takes an ``axis`` keyword that allows you to specify the axis along which the result will be concatenated:
```python
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
```
## Simple Concatenation with ``pd.concat``
<!-- #region -->
Pandas has a function, ``pd.concat()``, which has a similar syntax to ``np.concatenate`` but contains a number of options that we'll discuss momentarily:
```python
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)
```
``pd.concat()`` can be used for a simple concatenation of ``Series`` or ``DataFrame`` objects, just as ``np.concatenate()`` can be used for simple concatenations of arrays:
<!-- #endregion -->
```python
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
```
It also works to concatenate higher-dimensional objects, such as ``DataFrame``s:
```python
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
```
By default, the concatenation takes place row-wise within the ``DataFrame`` (i.e., ``axis=0``).
Like ``np.concatenate``, ``pd.concat`` allows specification of an axis along which concatenation will take place.
Consider the following example:
```python
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")
```
We could have equivalently specified ``axis=1``; here we've used the more intuitive ``axis='columns'``.
### Duplicate indices
One important difference between ``np.concatenate`` and ``pd.concat`` is that Pandas concatenation *preserves indices*, even if the result will have duplicate indices!
Consider this simple example:
```python
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
```
Notice the repeated indices in the result.
While this is valid within ``DataFrame``s, the outcome is often undesirable.
``pd.concat()`` gives us a few ways to handle it.
#### Catching the repeats as an error
If you'd like to simply verify that the indices in the result of ``pd.concat()`` do not overlap, you can specify the ``verify_integrity`` flag.
With this set to True, the concatenation will raise an exception if there are duplicate indices.
Here is an example, where for clarity we'll catch and print the error message:
```python
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
```
#### Ignoring the index
Sometimes the index itself does not matter, and you would prefer it to simply be ignored.
This option can be specified using the ``ignore_index`` flag.
With this set to ``True``, the concatenation will create a new integer index for the resulting ``DataFrame``:
```python
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
```
#### Adding MultiIndex keys
Another option is to use the ``keys`` option to specify a label for the data sources; the result will be a hierarchically indexed ``DataFrame`` containing the data:
```python
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
```
The result is a multiply indexed ``DataFrame``, and we can use the tools discussed in [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) to transform this data into the representation we're interested in.
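For instance, using the tools from that section, we can pull back out just the rows that came from ``y`` by indexing on the outer level of the new hierarchical index:
```python
# select the rows that originated from 'y' via the outer index level
pd.concat([x, y], keys=['x', 'y']).loc['y']
```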
### Concatenation with joins
In the simple examples we just looked at, we were mainly concatenating ``DataFrame``s with shared column names.
In practice, data from different sources might have different sets of column names, and ``pd.concat`` offers several options in this case.
Consider the concatenation of the following two ``DataFrame``s, which have some (but not all!) columns in common:
```python
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
```
By default, the entries for which no data is available are filled with NA values.
To change this, we can adjust the ``join`` parameter of the concatenation function.
By default, the join is a union of the input columns (``join='outer'``), but we can change this to an intersection of the columns using ``join='inner'``:
```python
display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")
```
Another option is to restrict the result to the columns of one of the inputs. Older versions of Pandas provided a ``join_axes`` argument for this, but it has since been removed; a simple alternative is to reindex the result on the desired columns.
Here we'll specify that the returned columns should be the same as those of the first input:
```python
display('df5', 'df6',
        "pd.concat([df5, df6]).reindex(columns=df5.columns)")
```
The combination of options of the ``pd.concat`` function allows a wide range of possible behaviors when joining two datasets; keep these in mind as you use these tools for your own data.
### The ``append()`` method
Because direct array concatenation is so common, ``Series`` and ``DataFrame`` objects have an ``append`` method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling ``pd.concat([df1, df2])``, you can simply call ``df1.append(df2)``:
```python
display('df1', 'df2', 'df1.append(df2)')
```
Keep in mind that unlike the ``append()`` and ``extend()`` methods of Python lists, the ``append()`` method in Pandas does not modify the original object; instead, it creates a new object with the combined data.
It is also not a very efficient method, because it involves creation of a new index *and* data buffer; indeed, more recent versions of Pandas deprecate ``append()`` in favor of ``pd.concat()``.
Thus, if you plan to do multiple such operations, it is generally better to build a list of ``DataFrame``s and pass them all at once to the ``concat()`` function.
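For example, a pattern like the following sketch (reusing the ``make_df`` helper defined above) collects the pieces in a list and performs the concatenation in a single pass:
```python
# build a list of DataFrames, then concatenate once
pieces = [make_df('AB', range(2)),
          make_df('AB', range(2, 4)),
          make_df('AB', range(4, 6))]
pd.concat(pieces, ignore_index=True)
```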
In the next section, we'll look at another more powerful approach to combining data from multiple sources, the database-style merges/joins implemented in ``pd.merge``.
For more information on ``concat()``, ``append()``, and related functionality, see the ["Merge, Join, and Concatenate" section](http://pandas.pydata.org/pandas-docs/stable/merging.html) of the Pandas documentation.
<!--NAVIGATION-->
< [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb) | [Contents](Index.ipynb) | [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.06-Concat-And-Append.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large

View File

@ -0,0 +1,420 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb) | [Contents](Index.ipynb) | [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.07-Merge-and-Join.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Combining Datasets: Merge and Join
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.
If you have ever worked with databases, you should be familiar with this type of data interaction.
The main interface for this is the ``pd.merge`` function, and we'll see a few examples of how this can work in practice.
For convenience, we will start by redefining the ``display()`` functionality from the previous section:
```python
import pandas as pd
import numpy as np
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
```
## Relational Algebra
The behavior implemented in ``pd.merge()`` is a subset of what is known as *relational algebra*, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases.
The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more complicated operations on any dataset.
With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated composite operations can be performed.
Pandas implements several of these fundamental building blocks in the ``pd.merge()`` function and the related ``join()`` method of ``Series`` and ``DataFrame``s.
As we will see, these let you efficiently link data from different sources.
## Categories of Joins
The ``pd.merge()`` function implements a number of types of joins: the *one-to-one*, *many-to-one*, and *many-to-many* joins.
All three types of joins are accessed via an identical call to the ``pd.merge()`` interface; the type of join performed depends on the form of the input data.
Here we will show simple examples of the three types of merges, and discuss detailed options further below.
### One-to-one joins
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation seen in [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb).
As a concrete example, consider the following two ``DataFrames`` which contain information on several employees in a company:
```python
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
```
To combine this information into a single ``DataFrame``, we can use the ``pd.merge()`` function:
```python
df3 = pd.merge(df1, df2)
df3
```
The ``pd.merge()`` function recognizes that each ``DataFrame`` has an "employee" column, and automatically joins using this column as a key.
The result of the merge is a new ``DataFrame`` that combines the information from the two inputs.
Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the "employee" column differs between ``df1`` and ``df2``, and the ``pd.merge()`` function correctly accounts for this.
Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the ``left_index`` and ``right_index`` keywords, discussed momentarily).
### Many-to-one joins
Many-to-one joins are joins in which one of the two key columns contains duplicate entries.
For the many-to-one case, the resulting ``DataFrame`` will preserve those duplicate entries as appropriate.
Consider the following example of a many-to-one join:
```python
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
```
The resulting ``DataFrame`` has an additional column with the "supervisor" information, where the information is repeated in one or more locations as required by the inputs.
### Many-to-many joins
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined.
If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.
This will be perhaps most clear with a concrete example.
Consider the following, where we have a ``DataFrame`` showing one or more skills associated with a particular group.
By performing a many-to-many join, we can recover the skills associated with any individual person:
```python
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
```
These three types of joins can be used with other Pandas tools to implement a wide array of functionality.
But in practice, datasets are rarely as clean as the one we're working with here.
In the following section we'll consider some of the options provided by ``pd.merge()`` that enable you to tune how the join operations work.
## Specification of the Merge Key
We've already seen the default behavior of ``pd.merge()``: it looks for one or more matching column names between the two inputs, and uses this as the key.
However, often the column names will not match so nicely, and ``pd.merge()`` provides a variety of options for handling this.
### The ``on`` keyword
Most simply, you can explicitly specify the name of the key column using the ``on`` keyword, which takes a column name or a list of column names:
```python
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
```
This option works only if both the left and right ``DataFrame``s have the specified column name.
### The ``left_on`` and ``right_on`` keywords
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as "name" rather than "employee".
In this case, we can use the ``left_on`` and ``right_on`` keywords to specify the two column names:
```python
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
```
The result has a redundant column that we can drop if desired; for example, by using the ``drop()`` method of ``DataFrame``s:
```python
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
```
### The ``left_index`` and ``right_index`` keywords
Sometimes, rather than merging on a column, you would instead like to merge on an index.
For example, your data might look like this:
```python
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
```
You can use the index as the key for merging by specifying the ``left_index`` and/or ``right_index`` flags in ``pd.merge()``:
```python
display('df1a', 'df2a',
"pd.merge(df1a, df2a, left_index=True, right_index=True)")
```
For convenience, ``DataFrame``s implement the ``join()`` method, which performs a merge that defaults to joining on indices:
```python
display('df1a', 'df2a', 'df1a.join(df2a)')
```
If you'd like to mix indices and columns, you can combine ``left_index`` with ``right_on`` or ``left_on`` with ``right_index`` to get the desired behavior:
```python
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
```
All of these options also work with multiple indices and/or multiple columns; the interface for this behavior is very intuitive.
For more information on this, see the ["Merge, Join, and Concatenate" section](http://pandas.pydata.org/pandas-docs/stable/merging.html) of the Pandas documentation.
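As a quick illustration (using small hypothetical tables rather than data defined elsewhere in this chapter), passing a list of column names to ``on`` merges on the combination of those keys:
```python
# hypothetical example: merge on the combination of two key columns
left = pd.DataFrame({'employee': ['Bob', 'Jake', 'Bob'],
                     'year': [2012, 2012, 2013],
                     'hours': [40, 35, 42]})
right = pd.DataFrame({'employee': ['Bob', 'Jake', 'Bob'],
                      'year': [2012, 2012, 2013],
                      'rate': [25, 30, 27]})
pd.merge(left, right, on=['employee', 'year'])
```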
## Specifying Set Arithmetic for Joins
In all the preceding examples we have glossed over one important consideration in performing a join: the type of set arithmetic used in the join.
This comes up when a value appears in one key column but not the other. Consider this example:
```python
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
```
Here we have merged two datasets that have only a single "name" entry in common: Mary.
By default, the result contains the *intersection* of the two sets of inputs; this is what is known as an *inner join*.
We can specify this explicitly using the ``how`` keyword, which defaults to ``"inner"``:
```python
pd.merge(df6, df7, how='inner')
```
Other options for the ``how`` keyword are ``'outer'``, ``'left'``, and ``'right'``.
An *outer join* returns a join over the union of the input columns, and fills in all missing values with NAs:
```python
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")
```
The *left join* and *right join* return joins over the left entries and right entries, respectively.
For example:
```python
display('df6', 'df7', "pd.merge(df6, df7, how='left')")
```
The output rows now correspond to the entries in the left input. Using
``how='right'`` works in a similar manner.
All of these options can be applied straightforwardly to any of the preceding join types.
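For completeness, here is the corresponding right join on the same data:
```python
# right join: output rows correspond to the entries of the right input
display('df6', 'df7', "pd.merge(df6, df7, how='right')")
```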
## Overlapping Column Names: The ``suffixes`` Keyword
Finally, you may end up in a case where your two input ``DataFrame``s have conflicting column names.
Consider this example:
```python
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')
```
Because the output would have two conflicting column names, the merge function automatically appends a suffix ``_x`` or ``_y`` to make the output columns unique.
If these defaults are inappropriate, it is possible to specify a custom suffix using the ``suffixes`` keyword:
```python
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')
```
These suffixes work in any of the possible join patterns, and work also if there are multiple overlapping columns.
For more information on these patterns, see [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb) where we dive a bit deeper into relational algebra.
Also see the [Pandas "Merge, Join and Concatenate" documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html) for further discussion of these topics.
## Example: US States Data
Merge and join operations come up most often when combining data from different sources.
Here we will consider an example of some data about US states and their populations.
The data files can be found at http://github.com/jakevdp/data-USstates/:
```python
# Following are shell commands to download the data
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-abbrevs.csv
```
Let's take a look at the three datasets, using the Pandas ``read_csv()`` function:
```python
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')
display('pop.head()', 'areas.head()', 'abbrevs.head()')
```
Given this information, say we want to compute a relatively straightforward result: rank US states and territories by their 2010 population density.
We clearly have the data here to find this result, but we'll have to combine the datasets to find the result.
We'll start with a many-to-one merge that will give us the full state name within the population ``DataFrame``.
We want to merge based on the ``state/region`` column of ``pop``, and the ``abbreviation`` column of ``abbrevs``.
We'll use ``how='outer'`` to make sure no data is thrown away due to mismatched labels.
```python
merged = pd.merge(pop, abbrevs, how='outer',
left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1)  # drop duplicate info
merged.head()
```
Let's double-check whether there were any mismatches here, which we can do by looking for rows with nulls:
```python
merged.isnull().any()
```
Some of the ``population`` info is null; let's figure out which these are!
```python
merged[merged['population'].isnull()].head()
```
It appears that all the null population values are from Puerto Rico prior to the year 2000; this is likely due to this data not being available from the original source.
More importantly, we see also that some of the new ``state`` entries are also null, which means that there was no corresponding entry in the ``abbrevs`` key!
Let's figure out which regions lack this match:
```python
merged.loc[merged['state'].isnull(), 'state/region'].unique()
```
We can quickly infer the issue: our population data includes entries for Puerto Rico (PR) and the United States as a whole (USA), while these entries do not appear in the state abbreviation key.
We can fix these quickly by filling in appropriate entries:
```python
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()
```
No more nulls in the ``state`` column: we're all set!
Now we can merge the result with the area data using a similar procedure.
Examining our results, we will want to join on the ``state`` column in both:
```python
final = pd.merge(merged, areas, on='state', how='left')
final.head()
```
Again, let's check for nulls to see if there were any mismatches:
```python
final.isnull().any()
```
There are nulls in the ``area`` column; we can take a look to see which regions were ignored here:
```python
final['state'][final['area (sq. mi)'].isnull()].unique()
```
We see that our ``areas`` ``DataFrame`` does not contain the area of the United States as a whole.
We could insert the appropriate value (using the sum of all state areas, for instance), but in this case we'll just drop the null values because the population density of the entire United States is not relevant to our current discussion:
```python
final.dropna(inplace=True)
final.head()
```
Now we have all the data we need. To answer the question of interest, let's first select the portion of the data corresponding to the year 2010, and the total population.
We'll use the ``query()`` function to do this quickly (this requires the ``numexpr`` package to be installed; see [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb)):
```python
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()
```
Now let's compute the population density and display it in order.
We'll start by re-indexing our data on the state, and then compute the result:
```python
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']
```
```python
density.sort_values(ascending=False, inplace=True)
density.head()
```
The result is a ranking of US states plus Washington, DC, and Puerto Rico in order of their 2010 population density, in residents per square mile.
We can see that by far the densest region in this dataset is Washington, DC (i.e., the District of Columbia); among states, the densest is New Jersey.
We can also check the end of the list:
```python
density.tail()
```
We see that the least dense state, by far, is Alaska, averaging slightly over one resident per square mile.
This type of messy data merging is a common task when trying to answer questions using real-world data sources.
I hope that this example has given you an idea of the ways you can combine tools we've covered in order to gain insight from your data!
<!--NAVIGATION-->
< [Combining Datasets: Concat and Append](03.06-Concat-And-Append.ipynb) | [Contents](Index.ipynb) | [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.07-Merge-and-Join.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large

View File

@ -0,0 +1,408 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) | [Contents](Index.ipynb) | [Pivot Tables](03.09-Pivot-Tables.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Aggregation and Grouping
An essential piece of analysis of large data is efficient summarization: computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays, to more sophisticated operations based on the concept of a ``groupby``.
For convenience, we'll use the same ``display`` functionality that we've seen in previous sections:
```python
import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
```
## Planets Data
Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/) (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).
It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets* or *exoplanets* for short). It can be downloaded with a simple Seaborn command:
```python
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
```
```python
planets.head()
```
This has some details on the 1,000+ extrasolar planets discovered up to 2014.
## Simple Aggregation in Pandas
Earlier, we explored some of the data aggregations available for NumPy arrays (["Aggregations: Min, Max, and Everything In Between"](02.04-Computation-on-arrays-aggregates.ipynb)).
As with a one-dimensional NumPy array, for a Pandas ``Series`` the aggregates return a single value:
```python
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
```
```python
ser.sum()
```
```python
ser.mean()
```
For a ``DataFrame``, by default the aggregates return results within each column:
```python
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
```
```python
df.mean()
```
By specifying the ``axis`` argument, you can instead aggregate within each row:
```python
df.mean(axis='columns')
```
Pandas ``Series`` and ``DataFrame``s include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb); in addition, there is a convenience method ``describe()`` that computes several common aggregates for each column and returns the result.
Let's use this on the Planets data, for now dropping rows with missing values:
```python
planets.dropna().describe()
```
This can be a useful way to begin understanding the overall properties of a dataset.
For example, we see in the ``year`` column that although exoplanets were discovered as far back as 1989, half of all known exoplanets were not discovered until 2010 or after.
This is largely thanks to the *Kepler* mission, a space-based telescope specifically designed for finding transiting planets around other stars.
The following table summarizes some other built-in Pandas aggregations:
| Aggregation | Description |
|--------------------------|---------------------------------|
| ``count()`` | Total number of items |
| ``first()``, ``last()`` | First and last item |
| ``mean()``, ``median()`` | Mean and median |
| ``min()``, ``max()`` | Minimum and maximum |
| ``std()``, ``var()`` | Standard deviation and variance |
| ``mad()`` | Mean absolute deviation |
| ``prod()`` | Product of all items |
| ``sum()`` | Sum of all items |
These are all methods of ``DataFrame`` and ``Series`` objects.
To go deeper into the data, however, simple aggregates are often not enough.
The next level of data summarization is the ``groupby`` operation, which allows you to quickly and efficiently compute aggregates on subsets of data.
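For instance, here are a couple of the aggregates from this table applied directly to a column of the Planets data:
```python
# earliest and most recent discovery years in the dataset
planets['year'].min(), planets['year'].max()
```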
## GroupBy: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called ``groupby`` operation.
The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.
### Split, apply, combine
A canonical example of this split-apply-combine operation, where the "apply" is a summation aggregation, is illustrated in this figure:
![](figures/03.08-split-apply-combine.png)
[figure source in Appendix](06.00-Figure-Code.ipynb#Split-Apply-Combine)
This makes clear what the ``groupby`` accomplishes:
- The *split* step involves breaking up and grouping a ``DataFrame`` depending on the value of the specified key.
- The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The *combine* step merges the results of these operations into an output array.
While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the ``GroupBy`` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
The power of the ``GroupBy`` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather thinks about the *operation as a whole*.
As a concrete example, let's take a look at using Pandas for the computation shown in this diagram.
We'll start by creating the input ``DataFrame``:
```python
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
```
The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column:
```python
df.groupby('key')
```
Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.
This object is where the magic is: you can think of it as a special view of the ``DataFrame``, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.
To produce a result, we can apply an aggregate to this ``DataFrameGroupBy`` object, which will perform the appropriate apply/combine steps to produce the desired result:
```python
df.groupby('key').sum()
```
The ``sum()`` method is just one possibility here; you can apply virtually any common Pandas or NumPy aggregation function, as well as virtually any valid ``DataFrame`` operation, as we will see in the following discussion.
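For instance, the group-wise median can be computed in exactly the same way:
```python
# the median of each numeric column within each group
df.groupby('key').median()
```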
### The GroupBy object
The ``GroupBy`` object is a very flexible abstraction.
In many ways, you can simply treat it as if it's a collection of ``DataFrame``s, and it does the difficult things under the hood. Let's see some examples using the Planets data.
Perhaps the most important operations made available by a ``GroupBy`` are *aggregate*, *filter*, *transform*, and *apply*.
We'll discuss each of these more fully in ["Aggregate, Filter, Transform, Apply"](#Aggregate,-Filter,-Transform,-Apply), but before that let's introduce some of the other functionality that can be used with the basic ``GroupBy`` operation.
#### Column indexing
The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.
For example:
```python
planets.groupby('method')
```
```python
planets.groupby('method')['orbital_period']
```
Here we've selected a particular ``Series`` group from the original ``DataFrame`` group by reference to its column name.
As with the ``GroupBy`` object, no computation is done until we call some aggregate on the object:
```python
planets.groupby('method')['orbital_period'].median()
```
This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
#### Iteration over groups
The ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``:
```python
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
```
This can be useful for doing certain things manually, though it is often much faster to use the built-in ``apply`` functionality, which we will discuss momentarily.
#### Dispatch methods
Through some Python class magic, any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.
For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:
```python
planets.groupby('method')['year'].describe()
```
Looking at this table helps us to better understand the data: for example, the vast majority of planets have been discovered by the Radial Velocity and Transit methods, though the latter only became common (due to new, more accurate telescopes) in the last decade.
The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
This is just one example of the utility of dispatch methods.
Notice that they are applied *to each individual group*, and the results are then combined within ``GroupBy`` and returned.
Again, any valid ``DataFrame``/``Series`` method can be used on the corresponding ``GroupBy`` object, which allows for some very flexible and powerful operations!
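As one more quick example in this spirit, the ``Series`` method ``nlargest()`` can be called on a grouped column to pull out the extremes within each group:
```python
# the two longest orbital periods (in days) detected by each method
planets.groupby('method')['orbital_period'].nlargest(2)
```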
### Aggregate, filter, transform, apply
The preceding discussion focused on aggregation for the combine operation, but there are more options available.
In particular, ``GroupBy`` objects have ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods that efficiently implement a variety of useful operations before combining the grouped data.
For the purpose of the following subsections, we'll use this ``DataFrame``:
```python
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
```
#### Aggregation
We're now familiar with ``GroupBy`` aggregations with ``sum()``, ``median()``, and the like, but the ``aggregate()`` method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:
```python
df.groupby('key').aggregate(['min', np.median, max])
```
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
```python
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
```
#### Filtering
A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
```python
def filter_func(x):
return x['data2'].std() > 4
display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")
```
The filter function should return a Boolean value specifying whether the group passes the filtering. Here because group A does not have a standard deviation greater than 4, it is dropped from the result.
#### Transformation
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
For such a transformation, the output is the same shape as the input.
A common example is to center the data by subtracting the group-wise mean:
```python
df.groupby('key').transform(lambda x: x - x.mean())
```
#### The apply() method
The ``apply()`` method lets you apply an arbitrary function to the group results.
The function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar; the combine operation will be tailored to the type of output returned.
For example, here is an ``apply()`` that normalizes the first column by the sum of the second:
```python
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
display('df', "df.groupby('key').apply(norm_by_data2)")
```
``apply()`` within a ``GroupBy`` is quite flexible: the only criterion is that the function takes a ``DataFrame`` and returns a Pandas object or scalar; what you do in the middle is up to you!
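As a second sketch, the function passed to ``apply()`` can just as well return a scalar for each group; here we compute the range of ``data1`` within each group:
```python
# apply() with a function that returns one scalar per group
df.groupby('key').apply(lambda g: g['data1'].max() - g['data1'].min())
```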
### Specifying the split key
In the simple examples presented before, we split the ``DataFrame`` on a single column name.
This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.
#### A list, array, series, or index providing the grouping keys
The key can be any series or list with a length matching that of the ``DataFrame``. For example:
```python
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')
```
Of course, this means there's another, more verbose way of accomplishing the ``df.groupby('key')`` from before:
```python
display('df', "df.groupby(df['key']).sum()")
```
#### A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys:
```python
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
```
#### Any Python function
Similar to mapping, you can pass any Python function that will input the index value and output the group:
```python
display('df2', 'df2.groupby(str.lower).mean()')
```
#### A list of valid keys
Further, any of the preceding key choices can be combined to group on a multi-index:
```python
df2.groupby([str.lower, mapping]).mean()
```
### Grouping example
As an example of this, in a couple lines of Python code we can put all these together and count discovered planets by method and by decade:
```python
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
```
This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets.
We immediately gain a coarse understanding of when and how planets have been discovered over the past several decades!
Here I would suggest digging into these few lines of code, and evaluating the individual steps to make sure you understand exactly what they are doing to the result.
It's certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.
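For example, one intermediate step worth inspecting is the grouped sum before ``unstack()`` and ``fillna()`` reshape it into a grid:
```python
# the multiply indexed Series of counts, before reshaping
planets.groupby(['method', decade])['number'].sum()
```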
<!--NAVIGATION-->
< [Combining Datasets: Merge and Join](03.07-Merge-and-Join.ipynb) | [Contents](Index.ipynb) | [Pivot Tables](03.09-Pivot-Tables.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,290 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb) | [Contents](Index.ipynb) | [Vectorized String Operations](03.10-Working-With-Strings.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Pivot Tables
We have seen how the ``GroupBy`` abstraction lets us explore relationships within a dataset.
A *pivot table* is a similar operation that is commonly seen in spreadsheets and other programs that operate on tabular data.
The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data.
The difference between pivot tables and ``GroupBy`` can sometimes cause confusion; it helps me to think of pivot tables as essentially a *multidimensional* version of ``GroupBy`` aggregation.
That is, you split-apply-combine, but both the split and the combine happen not across a one-dimensional index, but across a two-dimensional grid.
## Motivating Pivot Tables
For the examples in this section, we'll use the database of passengers on the *Titanic*, available through the Seaborn library (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)):
```python
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
```
```python
titanic.head()
```
This contains a wealth of information on each passenger of that ill-fated voyage, including gender, age, class, fare paid, and much more.
## Pivot Tables by Hand
To start learning more about this data, we might begin by grouping according to gender, survival status, or some combination thereof.
If you have read the previous section, you might be tempted to apply a ``GroupBy`` operation; for example, let's look at survival rate by gender:
```python
titanic.groupby('sex')[['survived']].mean()
```
This immediately gives us some insight: overall, three of every four females on board survived, while only one in five males survived!
This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class.
Using the vocabulary of ``GroupBy``, we might proceed using something like this:
we *group by* class and gender, *select* survival, *apply* a mean aggregate, *combine* the resulting groups, and then *unstack* the hierarchical index to reveal the hidden multidimensionality. In code:
```python
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()
```
This gives us a better idea of how both gender and class affected survival, but the code is starting to look a bit garbled.
While each step of this pipeline makes sense in light of the tools we've previously discussed, the long string of code is not particularly easy to read or use.
This two-dimensional ``GroupBy`` is common enough that Pandas includes a convenience routine, ``pivot_table``, which succinctly handles this type of multi-dimensional aggregation.
## Pivot Table Syntax
Here is the equivalent to the preceding operation using the ``pivot_table`` method of ``DataFrame``s:
```python
titanic.pivot_table('survived', index='sex', columns='class')
```
This is eminently more readable than the ``groupby`` approach, and produces the same result.
As you might expect of an early 20th-century transatlantic cruise, the survival gradient favors both women and higher classes.
First-class women survived with near certainty (hi, Rose!), while only one in ten third-class men survived (sorry, Jack!).
### Multi-level pivot tables
Just as in the ``GroupBy``, the grouping in pivot tables can be specified with multiple levels, and via a number of options.
For example, we might be interested in looking at age as a third dimension.
We'll bin the age using the ``pd.cut`` function:
```python
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')
```
We can apply the same strategy when working with the columns as well; let's add info on the fare paid using ``pd.qcut`` to automatically compute quantiles:
```python
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])
```
The result is a four-dimensional aggregation with hierarchical indices (see [Hierarchical Indexing](03.05-Hierarchical-Indexing.ipynb)), shown in a grid demonstrating the relationship between the values.
<!-- #region -->
### Additional pivot table options
The full call signature of the ``pivot_table`` method of ``DataFrame``s is as follows:
```python
# call signature as of Pandas 0.18
DataFrame.pivot_table(values=None, index=None, columns=None,
                      aggfunc='mean', fill_value=None, margins=False,
                      dropna=True, margins_name='All')
```
We've already seen examples of the first three arguments; here we'll take a quick look at the remaining ones.
Two of the options, ``fill_value`` and ``dropna``, have to do with missing data and are fairly straightforward; we will not show examples of them here.
The ``aggfunc`` keyword controls what type of aggregation is applied, which is a mean by default.
As in the GroupBy, the aggregation specification can be a string representing one of several common choices (e.g., ``'sum'``, ``'mean'``, ``'count'``, ``'min'``, ``'max'``, etc.) or a function that implements an aggregation (e.g., ``np.sum()``, ``min()``, ``sum()``, etc.).
Additionally, it can be specified as a dictionary mapping a column to any of the above desired options:
<!-- #endregion -->
```python
titanic.pivot_table(index='sex', columns='class',
aggfunc={'survived':sum, 'fare':'mean'})
```
Notice also here that we've omitted the ``values`` keyword; when specifying a mapping for ``aggfunc``, this is determined automatically.
At times it's useful to compute totals along each grouping.
This can be done via the ``margins`` keyword:
```python
titanic.pivot_table('survived', index='sex', columns='class', margins=True)
```
Here this automatically gives us information about the class-agnostic survival rate by gender, the gender-agnostic survival rate by class, and the overall survival rate of 38%.
The margin label can be specified with the ``margins_name`` keyword, which defaults to ``"All"``.
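For instance, here is a minimal sketch of the same pivot table as before, with the margins relabeled (the label ``'Total'`` is just an illustrative choice):
```python
titanic.pivot_table('survived', index='sex', columns='class',
                    margins=True, margins_name='Total')
```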
## Example: Birthrate Data
As a more interesting example, let's take a look at the freely available data on births in the United States, provided by the Centers for Disease Control (CDC).
This data can be found at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
(this dataset has been analyzed rather extensively by Andrew Gelman and his group; see, for example, [this blog post](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/)):
```python
# shell command to download the data:
# !curl -O https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv
```
```python
births = pd.read_csv('data/births.csv')
```
Taking a look at the data, we see that it's relatively simple; it contains the number of births grouped by date and gender:
```python
births.head()
```
We can start to understand this data a bit more by using a pivot table.
Let's add a decade column, and take a look at male and female births as a function of decade:
```python
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc='sum')
```
We immediately see that male births outnumber female births in every decade.
To see this trend a bit more clearly, we can use the built-in plotting tools in Pandas to visualize the total number of births by year (see [Introduction to Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) for a discussion of plotting with Matplotlib):
```python
%matplotlib inline
import matplotlib.pyplot as plt
sns.set() # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');
```
With a simple pivot table and ``plot()`` method, we can immediately see the annual trend in births by gender. By eye, it appears that over the past 50 years male births have outnumbered female births by around 5%.
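As a quick check of that eyeball estimate, we can compute the ratio directly. This is a sketch, assuming the ``gender`` column uses the ``'M'``/``'F'`` codes seen in the pivot table above:
```python
# rough check of the ~5% estimate (assumes 'M'/'F' codes in the gender column)
by_year = births.pivot_table('births', index='year', columns='gender', aggfunc='sum')
print((by_year['M'] / by_year['F'] - 1).mean())
```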
### Further data exploration
Though this doesn't necessarily relate to the pivot table, there are a few more interesting features we can pull out of this dataset using the Pandas tools covered up to this point.
We must start by cleaning the data a bit, removing outliers caused by mistyped dates (e.g., June 31st) or missing values (e.g., June 99th).
One easy way to remove these all at once is to cut outliers; we'll do this via a robust sigma-clipping operation:
```python
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])
```
This final line is a robust estimate of the sample standard deviation, where the 0.74 comes from the interquartile range of a Gaussian distribution: the interquartile range of a Gaussian spans about 1.35 standard deviations, and 1/1.35 ≈ 0.74. (You can learn more about sigma-clipping operations in a book I coauthored with Željko Ivezić, Andrew J. Connolly, and Alexander Gray: ["Statistics, Data Mining, and Machine Learning in Astronomy"](http://press.princeton.edu/titles/10159.html) (Princeton University Press, 2014).)
With this we can use the ``query()`` method (discussed further in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb)) to filter-out rows with births outside these values:
```python
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
```
Next we set the ``day`` column to integers; previously it had been a string because some columns in the dataset contained the value ``'null'``:
```python
# set 'day' column to integer; it originally was a string due to nulls
births['day'] = births['day'].astype(int)
```
Finally, we can combine the day, month, and year to create a Date index (see [Working with Time Series](03.11-Working-with-Time-Series.ipynb)).
This allows us to quickly compute the weekday corresponding to each row:
```python
# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
100 * births.month +
births.day, format='%Y%m%d')
births['dayofweek'] = births.index.dayofweek
```
Using this we can plot births by weekday for several decades:
```python
import matplotlib.pyplot as plt
import matplotlib as mpl
births.pivot_table('births', index='dayofweek',
columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');
```
Apparently births are slightly less common on weekends than on weekdays! Note that the 1990s and 2000s are missing because the CDC data contains only the month of birth starting in 1989.
Another interesting view is to plot the mean number of births by the day of the *year*.
Let's first group the data by month and day separately:
```python
births_by_date = births.pivot_table('births',
[births.index.month, births.index.day])
births_by_date.head()
```
The result is a multi-index over months and days.
To make this easily plottable, let's turn these months and days into a date by associating them with a dummy year variable (making sure to choose a leap year so February 29th is correctly handled!)
```python
from datetime import datetime
births_by_date.index = [datetime(2012, month, day)
                        for (month, day) in births_by_date.index]
births_by_date.head()
```
Focusing on the month and day only, we now have a time series reflecting the average number of births by date of the year.
From this, we can use the ``plot`` method to plot the data. It reveals some interesting trends:
```python
# Plot the results
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
```
In particular, the striking feature of this graph is the dip in birthrate on US holidays (e.g., Independence Day, Labor Day, Thanksgiving, Christmas, New Year's Day), although this likely reflects trends in scheduled/induced births rather than some deep psychosomatic effect on natural births.
For more discussion on this trend, see the analysis and links in [Andrew Gelman's blog post](http://andrewgelman.com/2012/06/14/cool-ass-signal-processing-using-gaussian-processes/) on the subject.
We'll return to this figure in [Example:-Effect-of-Holidays-on-US-Births](04.09-Text-and-Annotation.ipynb#Example:-Effect-of-Holidays-on-US-Births), where we will use Matplotlib's tools to annotate this plot.
Looking at this short example, you can see that many of the Python and Pandas tools we've seen to this point can be combined and used to gain insight from a variety of datasets.
We will see some more sophisticated applications of these data manipulations in future sections!
<!--NAVIGATION-->
< [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb) | [Contents](Index.ipynb) | [Vectorized String Operations](03.10-Working-With-Strings.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.09-Pivot-Tables.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,375 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Vectorized String Operations
One strength of Python is its relative ease in handling and manipulating string data.
Pandas builds on this and provides a comprehensive set of *vectorized string operations* that become an essential piece of the type of munging required when working with (read: cleaning up) real-world data.
In this section, we'll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the Internet.
## Introducing Pandas String Operations
We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:
```python
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
```
This *vectorization* of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done.
For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:
```python
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
```
This is perhaps sufficient to work with some data, but it will break if there are any missing values.
For example:
```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
```
Pandas includes features to address both this need for vectorized string operations and the need to correctly handle missing data, via the ``str`` attribute of Pandas Series and Index objects containing strings.
So, for example, suppose we create a Pandas Series with this data:
```python
import pandas as pd
names = pd.Series(data)
names
```
We can now call a single method that will capitalize all the entries, while skipping over any missing values:
```python
names.str.capitalize()
```
Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.
## Tables of Pandas String Methods
If you have a good understanding of string manipulation in Python, most of Pandas string syntax is intuitive enough that it's probably sufficient to just list a table of available methods; we will start with that here, before diving deeper into a few of the subtleties.
The examples in this section use the following series of names:
```python
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
'Eric Idle', 'Terry Jones', 'Michael Palin'])
```
### Methods similar to Python string methods
Nearly all of Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:
| | | | |
|-------------|------------------|------------------|------------------|
|``len()`` | ``lower()`` | ``translate()`` | ``islower()`` |
|``ljust()`` | ``upper()`` | ``startswith()`` | ``isupper()`` |
|``rjust()`` | ``find()`` | ``endswith()`` | ``isnumeric()`` |
|``center()`` | ``rfind()`` | ``isalnum()`` | ``isdecimal()`` |
|``zfill()`` | ``index()`` | ``isalpha()`` | ``split()`` |
|``strip()`` | ``rindex()`` | ``isdigit()`` | ``rsplit()`` |
|``rstrip()`` | ``capitalize()`` | ``isspace()`` | ``partition()`` |
|``lstrip()`` | ``swapcase()`` | ``istitle()`` | ``rpartition()`` |
Notice that these have various return values. Some, like ``lower()``, return a series of strings:
```python
monte.str.lower()
```
But some others return numbers:
```python
monte.str.len()
```
Or Boolean values:
```python
monte.str.startswith('T')
```
Still others return lists or other compound values for each element:
```python
monte.str.split()
```
We'll see further manipulations of this kind of series-of-lists object as we continue our discussion.
### Methods using regular expressions
In addition, there are several methods that accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python's built-in ``re`` module:
| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.match()`` on each element, returning matched groups as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()`` | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |
With these, you can do a wide range of interesting operations.
For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element:
```python
monte.str.extract('([A-Za-z]+)', expand=False)
```
Or we can do something more complicated, like finding all names that start and end with a consonant, making use of the start-of-string (``^``) and end-of-string (``$``) regular expression characters:
```python
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
```
The ability to concisely apply regular expressions across ``Series`` or ``DataFrame`` entries opens up many possibilities for analysis and cleaning of data.
### Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations:
| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()`` | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | Extract dummy variables as a ``DataFrame`` |
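To give a flavor of a couple of these that won't appear in the examples below, here is a brief sketch using the ``monte`` series (the padding width and separator are arbitrary choices):
```python
monte.str.pad(20, side='right', fillchar='.')  # pad each name to 20 characters
monte.str.cat(sep=', ')                        # join all names into a single string
```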
#### Vectorized item access and slicing
The ``get()`` and ``slice()`` operations, in particular, enable vectorized element access from each array.
For example, we can get a slice of the first three characters of each array using ``str.slice(0, 3)``.
Note that this behavior is also available through Python's normal indexing syntax; for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:
```python
monte.str[0:3]
```
Indexing via ``df.str.get(i)`` and ``df.str[i]`` is likewise similar.
These ``get()`` and ``slice()`` methods also let you access elements of arrays returned by ``split()``.
For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:
```python
monte.str.split().str.get(-1)
```
#### Indicator variables
Another method that requires a bit of extra explanation is the ``get_dummies()`` method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":
```python
full_monte = pd.DataFrame({'name': monte,
'info': ['B|C|D', 'B|D', 'A|C',
'B|D', 'B|C', 'B|C|D']})
full_monte
```
The ``get_dummies()`` routine lets you quickly split-out these indicator variables into a ``DataFrame``:
```python
full_monte['info'].str.get_dummies('|')
```
With these operations as building blocks, you can construct an endless range of string processing procedures when cleaning your data.
We won't dive further into these methods here, but I encourage you to read through ["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html) in the Pandas online documentation, or to refer to the resources listed in [Further Resources](03.13-Further-Resources.ipynb).
## Example: Recipe Database
These vectorized string operations become most useful in the process of cleaning up messy, real-world data.
Here I'll walk through an example of that, using an open recipe database compiled from various sources on the Web.
Our goal will be to parse the recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.
The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.
As of Spring 2016, this database is about 30 MB, and can be downloaded and unzipped with these commands:
```python
# !curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
# !gunzip recipeitems-latest.json.gz
```
The database is in JSON format, so we will try ``pd.read_json`` to read it:
```python
try:
recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
print("ValueError:", e)
```
Oops! We get a ``ValueError`` mentioning that there is "trailing data."
Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not.
Let's check if this interpretation is true:
```python
with open('recipeitems-latest.json') as f:
line = f.readline()
pd.read_json(line).shape
```
Yes, apparently each line is a valid JSON, so we'll need to string them together.
One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:
```python
# read the entire file into a Python array
with open('recipeitems-latest.json', 'r') as f:
# Extract each line
data = (line.strip() for line in f)
# Reformat so each line is the element of a list
data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)
```
```python
recipes.shape
```
We see there are nearly 200,000 recipes, and 17 columns.
Let's take a look at one row to see what we have:
```python
recipes.iloc[0]
```
There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:
```python
recipes.ingredients.str.len().describe()
```
The ingredient lists average 250 characters long, with a minimum of 0 and a maximum of nearly 10,000 characters!
Just out of curiosity, let's see which recipe has the longest ingredient list:
```python
recipes.name[np.argmax(recipes.ingredients.str.len())]
```
That certainly looks like an involved recipe.
We can do other aggregate explorations; for example, let's see how many of the recipes are for breakfast food:
```python
recipes.description.str.contains('[Bb]reakfast').sum()
```
Or how many of the recipes list cinnamon as an ingredient:
```python
recipes.ingredients.str.contains('[Cc]innamon').sum()
```
We could even look to see whether any recipes misspell the ingredient as "cinamon":
```python
recipes.ingredients.str.contains('[Cc]inamon').sum()
```
This is the type of essential data exploration that is possible with Pandas string tools.
It is data munging like this that Python really excels at.
### A simple recipe recommender
Let's go a bit further, and start working on a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
So we will cheat a bit: we'll start with a list of common ingredients, and simply search to see whether they are in each recipe's ingredient list.
For simplicity, let's just stick with herbs and spices for the time being:
```python
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']
```
We can then build a Boolean ``DataFrame`` consisting of True and False values, indicating whether this ingredient appears in the list:
```python
import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, flags=re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()
```
Now, as an example, let's say we'd like to find a recipe that uses parsley, paprika, and tarragon.
We can compute this very quickly using the ``query()`` method of ``DataFrame``s, discussed in [High-Performance Pandas: ``eval()`` and ``query()``](03.12-Performance-Eval-and-Query.ipynb):
```python
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)
```
We find only 10 recipes with this combination; let's use the index returned by this selection to discover the names of the recipes that have this combination:
```python
recipes.name[selection.index]
```
Now that we have narrowed down our recipe selection by a factor of almost 20,000, we are in a position to make a more informed decision about what we'd like to cook for dinner.
### Going further with recipes
Hopefully this example has given you a bit of a flavor (ba-dum!) for the types of data cleaning operations that are efficiently enabled by Pandas string methods.
Of course, building a very robust recipe recommendation system would require a *lot* more work!
Extracting full ingredient lists from each recipe would be an important piece of the task; unfortunately, the wide variety of formats used makes this a relatively time-consuming process.
This points to the truism that in data science, cleaning and munging of real-world data often comprises the majority of the work, and Pandas provides the tools that can help you do this efficiently.
<!--NAVIGATION-->
< [Pivot Tables](03.09-Pivot-Tables.ipynb) | [Contents](Index.ipynb) | [Working with Time Series](03.11-Working-with-Time-Series.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

@ -0,0 +1,633 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Vectorized String Operations](03.10-Working-With-Strings.ipynb) | [Contents](Index.ipynb) | [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Working with Time Series
Pandas was developed in the context of financial modeling, so as you might expect, it contains a fairly extensive set of tools for working with dates, times, and time-indexed data.
Date and time data comes in a few flavors, which we will discuss here:
- *Time stamps* reference particular moments in time (e.g., July 4th, 2015 at 7:00am).
- *Time intervals* and *periods* reference a length of time between a particular beginning and end point; for example, the year 2015. Periods usually reference a special case of time intervals in which each interval is of uniform length and does not overlap (e.g., 24 hour-long periods comprising days).
- *Time deltas* or *durations* reference an exact length of time (e.g., a duration of 22.56 seconds).
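To make these three flavors concrete, here is a minimal sketch constructing one object of each type (the particular constructor arguments are just illustrative):
```python
import pandas as pd

stamp = pd.Timestamp('2015-07-04 07:00')   # a time stamp
span = pd.Period('2015')                   # a time period: the year 2015
delta = pd.Timedelta(22.56, unit='s')      # a duration of 22.56 seconds
print(stamp, span, delta)
```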
In this section, we will introduce how to work with each of these types of date/time data in Pandas.
This short section is by no means a complete guide to the time series tools available in Python or Pandas, but instead is intended as a broad overview of how you as a user should approach working with time series.
We will start with a brief discussion of tools for dealing with dates and times in Python, before moving more specifically to a discussion of the tools provided by Pandas.
After listing some resources that go into more depth, we will review some short examples of working with time series data in Pandas.
## Dates and Times in Python
The Python world has a number of available representations of dates, times, deltas, and timespans.
While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.
### Native Python dates and times: ``datetime`` and ``dateutil``
Python's basic objects for working with dates and times reside in the built-in ``datetime`` module.
Along with the third-party ``dateutil`` module, you can use it to quickly perform a host of useful operations on dates and times.
For example, you can manually build a date using the ``datetime`` type:
```python
from datetime import datetime
datetime(year=2015, month=7, day=4)
```
Or, using the ``dateutil`` module, you can parse dates from a variety of string formats:
```python
from dateutil import parser
date = parser.parse("4th of July, 2015")
date
```
Once you have a ``datetime`` object, you can do things like printing the day of the week:
```python
date.strftime('%A')
```
In the final line, we've used one of the standard string format codes for printing dates (``"%A"``), which you can read about in the [strftime section](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) of Python's [datetime documentation](https://docs.python.org/3/library/datetime.html).
Documentation of other useful date utilities can be found in [dateutil's online documentation](http://labix.org/python-dateutil).
A related package to be aware of is [``pytz``](http://pytz.sourceforge.net/), which contains tools for working with the most migraine-inducing piece of time series data: time zones.
The power of ``datetime`` and ``dateutil`` lies in their flexibility and easy syntax: you can use these objects and their built-in methods to easily perform nearly any operation you might be interested in.
Where they break down is when you wish to work with large arrays of dates and times:
just as lists of Python numerical variables are suboptimal compared to NumPy-style typed numerical arrays, lists of Python datetime objects are suboptimal compared to typed arrays of encoded dates.
### Typed arrays of times: NumPy's ``datetime64``
The weaknesses of Python's datetime format inspired the NumPy team to add a set of native time series data types to NumPy.
The ``datetime64`` dtype encodes dates as 64-bit integers, and thus allows arrays of dates to be represented very compactly.
The ``datetime64`` requires a very specific input format:
```python
import numpy as np
date = np.array('2015-07-04', dtype=np.datetime64)
date
```
Once we have this date formatted, however, we can quickly do vectorized operations on it:
```python
date + np.arange(12)
```
Because of the uniform type in NumPy ``datetime64`` arrays, this type of operation can be accomplished much more quickly than if we were working directly with Python's ``datetime`` objects, especially as arrays get large
(we introduced this type of vectorization in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb)).
One detail of the ``datetime64`` and ``timedelta64`` objects is that they are built on a *fundamental time unit*.
Because the ``datetime64`` object is limited to 64-bit precision, the range of encodable times is $2^{64}$ times this fundamental unit.
In other words, ``datetime64`` imposes a trade-off between *time resolution* and *maximum time span*.
For example, if you want a time resolution of one nanosecond, you only have enough information to encode a range of $2^{64}$ nanoseconds, or just under 600 years.
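A quick back-of-the-envelope check of that number (a sketch):
```python
span_ns = 2 ** 64                                # representable span in nanoseconds
print(span_ns * 1e-9 / (60 * 60 * 24 * 365.25))  # roughly 585 years
```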
NumPy will infer the desired unit from the input; for example, here is a day-based datetime:
```python
np.datetime64('2015-07-04')
```
Here is a minute-based datetime:
```python
np.datetime64('2015-07-04 12:00')
```
Notice that the time zone is automatically set to the local time on the computer executing the code.
You can force any desired fundamental unit using one of many format codes; for example, here we'll force a nanosecond-based time:
```python
np.datetime64('2015-07-04 12:59:59.50', 'ns')
```
The following table, drawn from the [NumPy datetime64 documentation](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html), lists the available format codes along with the relative and absolute timespans that they can encode:
|Code | Meaning | Time span (relative) | Time span (absolute) |
|--------|-------------|----------------------|------------------------|
| ``Y`` | Year | ± 9.2e18 years | [9.2e18 BC, 9.2e18 AD] |
| ``M`` | Month | ± 7.6e17 years | [7.6e17 BC, 7.6e17 AD] |
| ``W`` | Week | ± 1.7e17 years | [1.7e17 BC, 1.7e17 AD] |
| ``D`` | Day | ± 2.5e16 years | [2.5e16 BC, 2.5e16 AD] |
| ``h`` | Hour | ± 1.0e15 years | [1.0e15 BC, 1.0e15 AD] |
| ``m`` | Minute | ± 1.7e13 years | [1.7e13 BC, 1.7e13 AD] |
| ``s`` | Second | ± 2.9e12 years | [ 2.9e9 BC, 2.9e9 AD] |
| ``ms`` | Millisecond | ± 2.9e9 years | [ 2.9e6 BC, 2.9e6 AD] |
| ``us`` | Microsecond | ± 2.9e6 years | [290301 BC, 294241 AD] |
| ``ns`` | Nanosecond | ± 292 years | [ 1678 AD, 2262 AD] |
| ``ps`` | Picosecond | ± 106 days | [ 1969 AD, 1970 AD] |
| ``fs`` | Femtosecond | ± 2.6 hours | [ 1969 AD, 1970 AD] |
| ``as`` | Attosecond | ± 9.2 seconds | [ 1969 AD, 1970 AD] |
For the types of data we see in the real world, a useful default is ``datetime64[ns]``, as it can encode a useful range of modern dates with a suitably fine precision.
Finally, we will note that while the ``datetime64`` data type addresses some of the deficiencies of the built-in Python ``datetime`` type, it lacks many of the convenient methods and functions provided by ``datetime`` and especially ``dateutil``.
More information can be found in [NumPy's datetime64 documentation](http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html).
### Dates and times in pandas: best of both worlds
Pandas builds upon all the tools just discussed to provide a ``Timestamp`` object, which combines the ease-of-use of ``datetime`` and ``dateutil`` with the efficient storage and vectorized interface of ``numpy.datetime64``.
From a group of these ``Timestamp`` objects, Pandas can construct a ``DatetimeIndex`` that can be used to index data in a ``Series`` or ``DataFrame``; we'll see many examples of this below.
For example, we can use Pandas tools to repeat the demonstration from above.
We can parse a flexibly formatted string date, and use format codes to output the day of the week:
```python
import pandas as pd
date = pd.to_datetime("4th of July, 2015")
date
```
```python
date.strftime('%A')
```
Additionally, we can do NumPy-style vectorized operations directly on this same object:
```python
date + pd.to_timedelta(np.arange(12), 'D')
```
In the next section, we will take a closer look at manipulating time series data with the tools provided by Pandas.
## Pandas Time Series: Indexing by Time
Where the Pandas time series tools really become useful is when you begin to *index data by timestamps*.
For example, we can construct a ``Series`` object that has time indexed data:
```python
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04',
'2015-07-04', '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data
```
Now that we have this data in a ``Series``, we can make use of any of the ``Series`` indexing patterns we discussed in previous sections, passing values that can be coerced into dates:
```python
data['2014-07-04':'2015-07-04']
```
There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:
```python
data['2015']
```
Later, we will see additional examples of the convenience of dates-as-indices.
But first, a closer look at the available time series data structures.
## Pandas Time Series Data Structures
This section will introduce the fundamental Pandas data structures for working with time series data:
- For *time stamps*, Pandas provides the ``Timestamp`` type. As mentioned before, it is essentially a replacement for Python's native ``datetime``, but is based on the more efficient ``numpy.datetime64`` data type. The associated Index structure is ``DatetimeIndex``.
- For *time periods*, Pandas provides the ``Period`` type. This encodes a fixed-frequency interval based on ``numpy.datetime64``. The associated index structure is ``PeriodIndex``.
- For *time deltas* or *durations*, Pandas provides the ``Timedelta`` type. ``Timedelta`` is a more efficient replacement for Python's native ``datetime.timedelta`` type, and is based on ``numpy.timedelta64``. The associated index structure is ``TimedeltaIndex``.
The most fundamental of these date/time objects are the ``Timestamp`` and ``DatetimeIndex`` objects.
While these class objects can be invoked directly, it is more common to use the ``pd.to_datetime()`` function, which can parse a wide variety of formats.
Passing a single date to ``pd.to_datetime()`` yields a ``Timestamp``; passing a series of dates by default yields a ``DatetimeIndex``:
```python
dates = pd.to_datetime([datetime(2015, 7, 3), '4th of July, 2015',
'2015-Jul-6', '07-07-2015', '20150708'])
dates
```
Any ``DatetimeIndex`` can be converted to a ``PeriodIndex`` with the ``to_period()`` function with the addition of a frequency code; here we'll use ``'D'`` to indicate daily frequency:
```python
dates.to_period('D')
```
A ``TimedeltaIndex`` is created, for example, when a date is subtracted from another:
```python
dates - dates[0]
```
### Regular sequences: ``pd.date_range()``
To make the creation of regular date sequences more convenient, Pandas offers a few functions for this purpose: ``pd.date_range()`` for timestamps, ``pd.period_range()`` for periods, and ``pd.timedelta_range()`` for time deltas.
We've seen that Python's ``range()`` and NumPy's ``np.arange()`` turn a startpoint, endpoint, and optional stepsize into a sequence.
Similarly, ``pd.date_range()`` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates.
By default, the frequency is one day:
```python
pd.date_range('2015-07-03', '2015-07-10')
```
Alternatively, the date range can be specified not with a start and endpoint, but with a startpoint and a number of periods:
```python
pd.date_range('2015-07-03', periods=8)
```
The spacing can be modified by altering the ``freq`` argument, which defaults to ``D``.
For example, here we will construct a range of hourly timestamps:
```python
pd.date_range('2015-07-03', periods=8, freq='H')
```
To create regular sequences of ``Period`` or ``Timedelta`` values, the very similar ``pd.period_range()`` and ``pd.timedelta_range()`` functions are useful.
Here are some monthly periods:
```python
pd.period_range('2015-07', periods=8, freq='M')
```
And a sequence of durations increasing by an hour:
```python
pd.timedelta_range(0, periods=10, freq='H')
```
All of these require an understanding of Pandas frequency codes, which we'll summarize in the next section.
## Frequencies and Offsets
Fundamental to these Pandas time series tools is the concept of a frequency or date offset.
Just as we saw the ``D`` (day) and ``H`` (hour) codes above, we can use such codes to specify any desired frequency spacing.
The following table summarizes the main codes available:
| Code | Description | Code | Description |
|--------|---------------------|--------|----------------------|
| ``D`` | Calendar day | ``B`` | Business day |
| ``W`` | Weekly | | |
| ``M`` | Month end | ``BM`` | Business month end |
| ``Q`` | Quarter end | ``BQ`` | Business quarter end |
| ``A`` | Year end | ``BA`` | Business year end |
| ``H`` | Hours | ``BH`` | Business hours |
| ``T`` | Minutes | | |
| ``S`` | Seconds | | |
| ``L`` | Milliseconds | | |
| ``U`` | Microseconds | | |
| ``N`` | Nanoseconds | | |
The monthly, quarterly, and annual frequencies are all marked at the end of the specified period.
By adding an ``S`` suffix to any of these, they instead will be marked at the beginning:
| Code | Description || Code | Description |
|---------|------------------------||---------|------------------------|
| ``MS`` | Month start ||``BMS`` | Business month start |
| ``QS`` | Quarter start ||``BQS`` | Business quarter start |
| ``AS`` | Year start ||``BAS`` | Business year start |
Additionally, you can change the month used to mark any quarterly or annual code by adding a three-letter month code as a suffix:
- ``Q-JAN``, ``BQ-FEB``, ``QS-MAR``, ``BQS-APR``, etc.
- ``A-JAN``, ``BA-FEB``, ``AS-MAR``, ``BAS-APR``, etc.
In the same way, the split-point of the weekly frequency can be modified by adding a three-letter weekday code:
- ``W-SUN``, ``W-MON``, ``W-TUE``, ``W-WED``, etc.
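For example, here is a short sketch of a couple of these anchored codes in action (the particular anchors are arbitrary):
```python
pd.date_range('2015-01-01', periods=4, freq='QS-FEB')  # quarters beginning in February
pd.date_range('2015-01-01', periods=4, freq='W-WED')   # weekly, split on Wednesdays
```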
On top of this, codes can be combined with numbers to specify other frequencies.
For example, for a frequency of 2 hours 30 minutes, we can combine the hour (``H``) and minute (``T``) codes as follows:
```python
pd.timedelta_range(0, periods=9, freq="2H30T")
```
All of these short codes refer to specific instances of Pandas time series offsets, which can be found in the ``pd.tseries.offsets`` module.
For example, we can create a business day offset directly as follows:
```python
from pandas.tseries.offsets import BDay
pd.date_range('2015-07-01', periods=5, freq=BDay())
```
For more discussion of the use of frequencies and offsets, see the ["DateOffset" section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects) of the Pandas documentation.
## Resampling, Shifting, and Windowing
The ability to use dates and times as indices to intuitively organize and access data is an important piece of the Pandas time series tools.
The benefits of indexed data in general (automatic alignment during operations, intuitive data slicing and access, etc.) still apply, and Pandas provides several additional time series-specific operations.
We will take a look at a few of those here, using some stock price data as an example.
Because Pandas was developed largely in a finance context, it includes some very specific tools for financial data.
For example, the accompanying ``pandas-datareader`` package (installable via ``conda install pandas-datareader``) knows how to import financial data from a number of available sources, including Yahoo Finance, Google Finance, and others.
Here we will load Google's closing price history:
```python
from pandas_datareader import data
goog = data.DataReader('GOOG', start='2004', end='2016',
data_source='google')
goog.head()
```
For simplicity, we'll use just the closing price:
```python
goog = goog['Close']
```
We can visualize this using the ``plot()`` method, after the normal Matplotlib setup boilerplate (see [Chapter 4](04.00-Introduction-To-Matplotlib.ipynb)):
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
```
```python
goog.plot();
```
### Resampling and converting frequencies
One common need for time series data is resampling at a higher or lower frequency.
This can be done using the ``resample()`` method, or the much simpler ``asfreq()`` method.
The primary difference between the two is that ``resample()`` is fundamentally a *data aggregation*, while ``asfreq()`` is fundamentally a *data selection*.
Taking a look at the Google closing price, let's compare what the two return when we down-sample the data.
Here we will resample the data at the end of business year:
```python
goog.plot(alpha=0.5, style='-')
goog.resample('BA').mean().plot(style=':')
goog.asfreq('BA').plot(style='--');
plt.legend(['input', 'resample', 'asfreq'],
loc='upper left');
```
Notice the difference: at each point, ``resample`` reports the *average of the previous year*, while ``asfreq`` reports the *value at the end of the year*.
For up-sampling, ``resample()`` and ``asfreq()`` are largely equivalent, though resample has many more options available.
In this case, the default for both methods is to leave the up-sampled points empty, that is, filled with NA values.
Just as with the ``pd.fillna()`` function discussed previously, ``asfreq()`` accepts a ``method`` argument to specify how values are imputed.
Here, we will resample the business day data at a daily frequency (i.e., including weekends):
```python
fig, ax = plt.subplots(2, sharex=True)
data = goog.iloc[:10]
data.asfreq('D').plot(ax=ax[0], marker='o')
data.asfreq('D', method='bfill').plot(ax=ax[1], style='-o')
data.asfreq('D', method='ffill').plot(ax=ax[1], style='--o')
ax[1].legend(["back-fill", "forward-fill"]);
```
The top panel is the default: non-business days are left as NA values and do not appear on the plot.
The bottom panel shows the differences between two strategies for filling the gaps: forward-filling and backward-filling.
### Time-shifts
Another common time series-specific operation is shifting of data in time.
Pandas has two closely related methods for computing this: ``shift()`` and ``tshift()``.
In short, the difference between them is that ``shift()`` *shifts the data*, while ``tshift()`` *shifts the index*.
In both cases, the shift is specified in multiples of the frequency.
Here we will both ``shift()`` and ``tshift()`` by 900 days:
```python
fig, ax = plt.subplots(3, sharey=True)
# apply a frequency to the data
goog = goog.asfreq('D', method='pad')
goog.plot(ax=ax[0])
goog.shift(900).plot(ax=ax[1])
goog.tshift(900).plot(ax=ax[2])
# legends and annotations
local_max = pd.to_datetime('2007-11-05')
offset = pd.Timedelta(900, 'D')
ax[0].legend(['input'], loc=2)
ax[0].get_xticklabels()[2].set(weight='heavy', color='red')
ax[0].axvline(local_max, alpha=0.3, color='red')
ax[1].legend(['shift(900)'], loc=2)
ax[1].get_xticklabels()[2].set(weight='heavy', color='red')
ax[1].axvline(local_max + offset, alpha=0.3, color='red')
ax[2].legend(['tshift(900)'], loc=2)
ax[2].get_xticklabels()[1].set(weight='heavy', color='red')
ax[2].axvline(local_max + offset, alpha=0.3, color='red');
```
We see here that ``shift(900)`` shifts the *data* by 900 days, pushing some of it off the end of the graph (and leaving NA values at the other end), while ``tshift(900)`` shifts the *index values* by 900 days.
A common context for this type of shift is in computing differences over time. For example, we use shifted values to compute the one-year return on investment for Google stock over the course of the dataset:
```python
ROI = 100 * (goog.tshift(-365) / goog - 1)
ROI.plot()
plt.ylabel('% Return on Investment');
```
This helps us to see the overall trend in Google stock: thus far, the most profitable times to invest in Google have been (unsurprisingly, in retrospect) shortly after its IPO, and in the middle of the 2009 recession.
### Rolling windows
Rolling statistics are a third type of time series-specific operation implemented by Pandas.
These can be accomplished via the ``rolling()`` attribute of ``Series`` and ``DataFrame`` objects, which returns a view similar to what we saw with the ``groupby`` operation (see [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb)).
This rolling view makes available a number of aggregation operations by default.
For example, here is the one-year centered rolling mean and standard deviation of the Google stock prices:
```python
rolling = goog.rolling(365, center=True)
data = pd.DataFrame({'input': goog,
'one-year rolling_mean': rolling.mean(),
'one-year rolling_std': rolling.std()})
ax = data.plot(style=['-', '--', ':'])
ax.lines[0].set_alpha(0.3)
```
As with group-by operations, the ``aggregate()`` and ``apply()`` methods can be used for custom rolling computations.
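For instance, a custom rolling statistic might look something like the following sketch (the ``raw=True`` flag, available in recent Pandas versions, passes plain NumPy arrays to the function for speed):
```python
# one-year rolling peak-to-peak range of the closing price (sketch)
rolling_range = goog.rolling(365, center=True).apply(lambda x: x.max() - x.min(), raw=True)
rolling_range.plot()
plt.ylabel('one-year price range');
```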
## Where to Learn More
This section has provided only a brief summary of some of the most essential features of time series tools provided by Pandas; for a more complete discussion, you can refer to the ["Time Series/Date" section](http://pandas.pydata.org/pandas-docs/stable/timeseries.html) of the Pandas online documentation.
Another excellent resource is the textbook [Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do) by Wes McKinney (OReilly, 2012).
Although it is now a few years old, it is an invaluable resource on the use of Pandas.
In particular, this book emphasizes time series tools in the context of business and finance, and focuses much more on particular details of business calendars, time zones, and related topics.
As always, you can also use the IPython help functionality to explore and try further options available to the functions and methods discussed here. I find this often is the best way to learn a new Python tool.
## Example: Visualizing Seattle Bicycle Counts
As a more involved example of working with some time series data, let's take a look at bicycle counts on Seattle's [Fremont Bridge](http://www.openstreetmap.org/#map=17/47.64813/-122.34965).
This data comes from an automated bicycle counter, installed in late 2012, which has inductive sensors on the east and west sidewalks of the bridge.
The hourly bicycle counts can be downloaded from http://data.seattle.gov/; here is the [direct link to the dataset](https://data.seattle.gov/Transportation/Fremont-Bridge-Hourly-Bicycle-Counts-by-Month-Octo/65db-xm6k).
As of summer 2016, the CSV can be downloaded as follows:
```python
# !curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD
```
Once this dataset is downloaded, we can use Pandas to read the CSV output into a ``DataFrame``.
We will specify that we want the Date as an index, and we want these dates to be automatically parsed:
```python
data = pd.read_csv('FremontBridge.csv', index_col='Date', parse_dates=True)
data.head()
```
For convenience, we'll further process this dataset by shortening the column names and adding a "Total" column:
```python
data.columns = ['West', 'East']
data['Total'] = data.eval('West + East')
```
Now let's take a look at the summary statistics for this data:
```python
data.dropna().describe()
```
### Visualizing the data
We can gain some insight into the dataset by visualizing it.
Let's start by plotting the raw data:
```python
%matplotlib inline
import seaborn; seaborn.set()
```
```python
data.plot()
plt.ylabel('Hourly Bicycle Count');
```
The ~25,000 hourly samples are far too dense for us to make much sense of.
We can gain more insight by resampling the data to a coarser grid.
Let's resample by week:
```python
weekly = data.resample('W').sum()
weekly.plot(style=[':', '--', '-'])
plt.ylabel('Weekly bicycle count');
```
This shows us some interesting seasonal trends: as you might expect, people bicycle more in the summer than in the winter, and even within a particular season the bicycle use varies from week to week (likely dependent on weather; see [In Depth: Linear Regression](05.06-Linear-Regression.ipynb) where we explore this further).
Another way that comes in handy for aggregating the data is a rolling mean, computed using the ``rolling()`` method of ``Series`` and ``DataFrame`` objects.
Here we'll do a 30-day rolling mean of our data, making sure to center the window:
```python
daily = data.resample('D').sum()
daily.rolling(30, center=True).mean().plot(style=[':', '--', '-'])
plt.ylabel('mean daily count');
```
The jaggedness of the result is due to the hard cutoff of the window.
We can get a smoother version of a rolling mean using a window function; for example, a Gaussian window.
The following code specifies both the width of the window (we chose 50 days) and the width of the Gaussian within the window (we chose 10 days):
```python
daily.rolling(50, center=True,
win_type='gaussian').sum(std=10).plot(style=[':', '--', '-']);
```
### Digging into the data
While these smoothed data views are useful to get an idea of the general trend in the data, they hide much of the interesting structure.
For example, we might want to look at the average traffic as a function of the time of day.
We can do this using the GroupBy functionality discussed in [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb):
```python
by_time = data.groupby(data.index.time).mean()
hourly_ticks = 4 * 60 * 60 * np.arange(6)
by_time.plot(xticks=hourly_ticks, style=[':', '--', '-']);
```
The hourly traffic is a strongly bimodal distribution, with peaks around 8:00 in the morning and 5:00 in the evening.
This is likely evidence of a strong component of commuter traffic crossing the bridge.
This is further evidenced by the differences between the western sidewalk (generally used going toward downtown Seattle), which peaks more strongly in the morning, and the eastern sidewalk (generally used going away from downtown Seattle), which peaks more strongly in the evening.
We also might be curious about how things change based on the day of the week. Again, we can do this with a simple groupby:
```python
by_weekday = data.groupby(data.index.dayofweek).mean()
by_weekday.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
by_weekday.plot(style=[':', '--', '-']);
```
This shows a strong distinction between weekday and weekend totals, with around twice as many average riders crossing the bridge on Monday through Friday as on Saturday and Sunday.
With this in mind, let's do a compound GroupBy and look at the hourly trend on weekdays versus weekends.
We'll start by grouping by both a flag marking the weekend, and the time of day:
```python
weekend = np.where(data.index.weekday < 5, 'Weekday', 'Weekend')
by_time = data.groupby([weekend, data.index.time]).mean()
```
Now we'll use some of the Matplotlib tools described in [Multiple Subplots](04.08-Multiple-Subplots.ipynb) to plot two panels side by side:
```python
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
by_time.loc['Weekday'].plot(ax=ax[0], title='Weekdays',
                            xticks=hourly_ticks, style=[':', '--', '-'])
by_time.loc['Weekend'].plot(ax=ax[1], title='Weekends',
                            xticks=hourly_ticks, style=[':', '--', '-']);
```
The result is very interesting: we see a bimodal commute pattern during the work week, and a unimodal recreational pattern during the weekends.
It would be interesting to dig through this data in more detail, and examine the effect of weather, temperature, time of year, and other factors on people's commuting patterns; for further discussion, see my blog post ["Is Seattle Really Seeing an Uptick In Cycling?"](https://jakevdp.github.io/blog/2014/06/10/is-seattle-really-seeing-an-uptick-in-cycling/), which uses a subset of this data.
We will also revisit this dataset in the context of modeling in [In Depth: Linear Regression](05.06-Linear-Regression.ipynb).
<!--NAVIGATION-->
< [Vectorized String Operations](03.10-Working-With-Strings.ipynb) | [Contents](Index.ipynb) | [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.11-Working-with-Time-Series.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because it is too large

@ -0,0 +1,317 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Working with Time Series](03.11-Working-with-Time-Series.ipynb) | [Contents](Index.ipynb) | [Further Resources](03.13-Further-Resources.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.12-Performance-Eval-and-Query.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# High-Performance Pandas: eval() and query()
As we've already seen in previous sections, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas.
While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.
As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays.
These are the ``eval()`` and ``query()`` functions, which rely on the [Numexpr](https://github.com/pydata/numexpr) package.
In this notebook we will walk through their use and give some rules-of-thumb about when you might think about using them.
## Motivating ``query()`` and ``eval()``: Compound Expressions
We've seen previously that NumPy and Pandas support fast vectorized operations; for example, when adding the elements of two arrays:
```python
import numpy as np
rng = np.random.RandomState(42)
x = rng.rand(1000000)
y = rng.rand(1000000)
%timeit x + y
```
As discussed in [Computation on NumPy Arrays: Universal Functions](02.03-Computation-on-arrays-ufuncs.ipynb), this is much faster than doing the addition via a Python loop or comprehension:
```python
%timeit np.fromiter((xi + yi for xi, yi in zip(x, y)), dtype=x.dtype, count=len(x))
```
But this abstraction can become less efficient when computing compound expressions.
For example, consider the following expression:
```python
mask = (x > 0.5) & (y < 0.5)
```
Because NumPy evaluates each subexpression, this is roughly equivalent to the following:
```python
tmp1 = (x > 0.5)
tmp2 = (y < 0.5)
mask = tmp1 & tmp2
```
In other words, *every intermediate step is explicitly allocated in memory*. If the ``x`` and ``y`` arrays are very large, this can lead to significant memory and computational overhead.
The Numexpr library gives you the ability to compute this type of compound expression element by element, without the need to allocate full intermediate arrays.
The [Numexpr documentation](https://github.com/pydata/numexpr) has more details, but for the time being it is sufficient to say that the library accepts a *string* giving the NumPy-style expression you'd like to compute:
```python
import numexpr
mask_numexpr = numexpr.evaluate('(x > 0.5) & (y < 0.5)')
np.allclose(mask, mask_numexpr)
```
The benefit here is that Numexpr evaluates the expression in a way that does not use full-sized temporary arrays, and thus can be much more efficient than NumPy, especially for large arrays.
The Pandas ``eval()`` and ``query()`` tools that we will discuss here are conceptually similar, and depend on the Numexpr package.
## ``pandas.eval()`` for Efficient Operations
The ``eval()`` function in Pandas uses string expressions to efficiently compute operations using ``DataFrame``s.
For example, consider the following ``DataFrame``s:
```python
import pandas as pd
nrows, ncols = 100000, 100
rng = np.random.RandomState(42)
df1, df2, df3, df4 = (pd.DataFrame(rng.rand(nrows, ncols))
for i in range(4))
```
To compute the sum of all four ``DataFrame``s using the typical Pandas approach, we can just write the sum:
```python
%timeit df1 + df2 + df3 + df4
```
The same result can be computed via ``pd.eval`` by constructing the expression as a string:
```python
%timeit pd.eval('df1 + df2 + df3 + df4')
```
The ``eval()`` version of this expression is about 50% faster (and uses much less memory), while giving the same result:
```python
np.allclose(df1 + df2 + df3 + df4,
pd.eval('df1 + df2 + df3 + df4'))
```
### Operations supported by ``pd.eval()``
As of Pandas v0.16, ``pd.eval()`` supports a wide range of operations.
To demonstrate these, we'll use the following integer ``DataFrame``s:
```python
df1, df2, df3, df4, df5 = (pd.DataFrame(rng.randint(0, 1000, (100, 3)))
for i in range(5))
```
#### Arithmetic operators
``pd.eval()`` supports all arithmetic operators. For example:
```python
result1 = -df1 * df2 / (df3 + df4) - df5
result2 = pd.eval('-df1 * df2 / (df3 + df4) - df5')
np.allclose(result1, result2)
```
#### Comparison operators
``pd.eval()`` supports all comparison operators, including chained expressions:
```python
result1 = (df1 < df2) & (df2 <= df3) & (df3 != df4)
result2 = pd.eval('df1 < df2 <= df3 != df4')
np.allclose(result1, result2)
```
#### Bitwise operators
``pd.eval()`` supports the ``&`` and ``|`` bitwise operators:
```python
result1 = (df1 < 0.5) & (df2 < 0.5) | (df3 < df4)
result2 = pd.eval('(df1 < 0.5) & (df2 < 0.5) | (df3 < df4)')
np.allclose(result1, result2)
```
In addition, it supports the use of the literal ``and`` and ``or`` in Boolean expressions:
```python
result3 = pd.eval('(df1 < 0.5) and (df2 < 0.5) or (df3 < df4)')
np.allclose(result1, result3)
```
#### Object attributes and indices
``pd.eval()`` supports access to object attributes via the ``obj.attr`` syntax, and indexes via the ``obj[index]`` syntax:
```python
result1 = df2.T[0] + df3.iloc[1]
result2 = pd.eval('df2.T[0] + df3.iloc[1]')
np.allclose(result1, result2)
```
#### Other operations
Other operations such as function calls, conditional statements, loops, and other more involved constructs are currently *not* implemented in ``pd.eval()``.
If you'd like to execute these more complicated types of expressions, you can use the Numexpr library itself.
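For instance, Numexpr's ``evaluate()`` can handle mathematical functions that ``pd.eval()`` cannot; the following is just a sketch of the idea, reusing the ``x`` and ``y`` arrays defined above (the particular expression is an arbitrary illustration):
```python
import numexpr
# Function calls such as sin() and cos() are evaluated element by element,
# again without allocating full temporary arrays.
result = numexpr.evaluate('sin(x) + 3 * cos(y)')
result[:5]
```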
## ``DataFrame.eval()`` for Column-Wise Operations
Just as Pandas has a top-level ``pd.eval()`` function, ``DataFrame``s have an ``eval()`` method that works in similar ways.
The benefit of the ``eval()`` method is that columns can be referred to *by name*.
We'll use this labeled array as an example:
```python
df = pd.DataFrame(rng.rand(1000, 3), columns=['A', 'B', 'C'])
df.head()
```
Using ``pd.eval()`` as above, we can compute expressions with the three columns like this:
```python
result1 = (df['A'] + df['B']) / (df['C'] - 1)
result2 = pd.eval("(df.A + df.B) / (df.C - 1)")
np.allclose(result1, result2)
```
The ``DataFrame.eval()`` method allows much more succinct evaluation of expressions with the columns:
```python
result3 = df.eval('(A + B) / (C - 1)')
np.allclose(result1, result3)
```
Notice here that we treat *column names as variables* within the evaluated expression, and the result is what we would wish.
### Assignment in DataFrame.eval()
In addition to the options just discussed, ``DataFrame.eval()`` also allows assignment to any column.
Let's use the ``DataFrame`` from before, which has columns ``'A'``, ``'B'``, and ``'C'``:
```python
df.head()
```
We can use ``df.eval()`` to create a new column ``'D'`` and assign to it a value computed from the other columns:
```python
df.eval('D = (A + B) / C', inplace=True)
df.head()
```
In the same way, any existing column can be modified:
```python
df.eval('D = (A - B) / C', inplace=True)
df.head()
```
### Local variables in DataFrame.eval()
The ``DataFrame.eval()`` method supports an additional syntax that lets it work with local Python variables.
Consider the following:
```python
column_mean = df.mean(1)
result1 = df['A'] + column_mean
result2 = df.eval('A + @column_mean')
np.allclose(result1, result2)
```
The ``@`` character here marks a *variable name* rather than a *column name*, and lets you efficiently evaluate expressions involving the two "namespaces": the namespace of columns, and the namespace of Python objects.
Notice that this ``@`` character is only supported by the ``DataFrame.eval()`` *method*, not by the ``pandas.eval()`` *function*, because the ``pandas.eval()`` function only has access to the one (Python) namespace.
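With ``pandas.eval()``, by contrast, a local variable can simply be referenced by name, since there are no column names to collide with. A quick sketch of the idea (using the ``df`` and ``column_mean`` objects defined above):
```python
# The local variable is looked up directly in the Python namespace; no '@' is needed.
result3 = pd.eval('df.A + column_mean')
np.allclose(result1, result3)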
## The ``DataFrame.query()`` Method
The ``DataFrame`` has another method based on evaluated strings, called the ``query()`` method.
Consider the following:
```python
result1 = df[(df.A < 0.5) & (df.B < 0.5)]
result2 = pd.eval('df[(df.A < 0.5) & (df.B < 0.5)]')
np.allclose(result1, result2)
```
As with the example used in our discussion of ``DataFrame.eval()``, this is an expression involving columns of the ``DataFrame``.
It cannot be expressed using the ``DataFrame.eval()`` syntax, however!
Instead, for this type of filtering operation, you can use the ``query()`` method:
```python
result2 = df.query('A < 0.5 and B < 0.5')
np.allclose(result1, result2)
```
In addition to being a more efficient computation, this is much easier to read and understand than the masking expression.
Note that the ``query()`` method also accepts the ``@`` flag to mark local variables:
```python
Cmean = df['C'].mean()
result1 = df[(df.A < Cmean) & (df.B < Cmean)]
result2 = df.query('A < @Cmean and B < @Cmean')
np.allclose(result1, result2)
```
## Performance: When to Use These Functions
When considering whether to use these functions, there are two considerations: *computation time* and *memory use*.
Memory use is the most predictable aspect. As already mentioned, every compound expression involving NumPy arrays or Pandas ``DataFrame``s will result in the implicit creation of temporary arrays.
For example, this:
```python
x = df[(df.A < 0.5) & (df.B < 0.5)]
```
is roughly equivalent to this:
```python
tmp1 = df.A < 0.5
tmp2 = df.B < 0.5
tmp3 = tmp1 & tmp2
x = df[tmp3]
```
If the size of the temporary ``DataFrame``s is significant compared to your available system memory (typically several gigabytes) then it's a good idea to use an ``eval()`` or ``query()`` expression.
You can check the approximate size of your array in bytes using this:
```python
df.values.nbytes
```
On the performance side, ``eval()`` can be faster even when you are not maxing-out your system memory.
The issue is how your temporary ``DataFrame``s compare to the size of the L1 or L2 CPU cache on your system (typically a few megabytes in 2016); if they are much bigger, then ``eval()`` can avoid some potentially slow movement of values between the different memory caches.
In practice, I find that the difference in computation time between the traditional methods and the ``eval``/``query`` method is usually not significant; if anything, the traditional method is faster for smaller arrays!
The benefit of ``eval``/``query`` is mainly in the saved memory, and the sometimes cleaner syntax they offer.
We've covered most of the details of ``eval()`` and ``query()`` here; for more information on these, you can refer to the Pandas documentation.
In particular, different parsers and engines can be specified for running these queries; for details on this, see the discussion within the ["Enhancing Performance" section](http://pandas.pydata.org/pandas-docs/dev/enhancingperf.html).
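For example, ``pd.eval()`` accepts ``engine`` and ``parser`` keywords; a minimal sketch comparing the default Numexpr engine with the plain-Python fallback (the two results should agree):
```python
# engine='numexpr' is the default when Numexpr is installed;
# engine='python' falls back to ordinary Python evaluation.
result_numexpr = pd.eval('df1 + df2 + df3 + df4', engine='numexpr')
result_python = pd.eval('df1 + df2 + df3 + df4', engine='python')
np.allclose(result_numexpr, result_python)
```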
<!--NAVIGATION-->
< [Working with Time Series](03.11-Working-with-Time-Series.ipynb) | [Contents](Index.ipynb) | [Further Resources](03.13-Further-Resources.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.12-Performance-Eval-and-Query.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>


@ -0,0 +1,99 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--BOOK_INFORMATION-->\n",
"<img align=\"left\" style=\"padding-right:10px;\" src=\"figures/PDSH-cover-small.png\">\n",
"\n",
"*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*\n",
"\n",
"*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.13-Further-Resources.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Further Resources"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In this chapter, we've covered many of the basics of using Pandas effectively for data analysis.\n",
"Still, much has been omitted from our discussion.\n",
"To learn more about Pandas, I recommend the following resources:\n",
"\n",
"- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.\n",
"\n",
"- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).\n",
"\n",
"- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming acrossmore than likely you'll find your answer on a Stack Overflow page.\n",
"\n",
"- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.\n",
"\n",
"Using these resources, combined with the walk-through given in this chapter, my hope is that you'll be poised to use Pandas to tackle any data analysis problem you come across!"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"<!--NAVIGATION-->\n",
"< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >\n",
"\n",
"<a href=\"https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.13-Further-Resources.ipynb\"><img align=\"left\" src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\" title=\"Open and Execute in Google Colaboratory\"></a>\n"
]
}
],
"metadata": {
"anaconda-cloud": {},
"jupytext": {
"formats": "ipynb,md"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}


@ -0,0 +1,57 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!-- #region deletable=true editable=true -->
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.13-Further-Resources.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->
# Further Resources
<!-- #region deletable=true editable=true -->
In this chapter, we've covered many of the basics of using Pandas effectively for data analysis.
Still, much has been omitted from our discussion.
To learn more about Pandas, I recommend the following resources:
- [Pandas online documentation](http://pandas.pydata.org/): This is the go-to source for complete documentation of the package. While the examples in the documentation tend to be small generated datasets, the description of the options is complete and generally very useful for understanding the use of various functions.
- [*Python for Data Analysis*](http://shop.oreilly.com/product/0636920023784.do) Written by Wes McKinney (the original creator of Pandas), this book contains much more detail on the Pandas package than we had room for in this chapter. In particular, he takes a deep dive into tools for time series, which were his bread and butter as a financial consultant. The book also has many entertaining examples of applying Pandas to gain insight from real-world datasets. Keep in mind, though, that the book is now several years old, and the Pandas package has quite a few new features that this book does not cover (but be on the lookout for a new edition in 2017).
- [Stack Overflow](http://stackoverflow.com/questions/tagged/pandas): Pandas has so many users that any question you have has likely been asked and answered on Stack Overflow. Using Pandas is a case where some Google-Fu is your best friend. Simply go to your favorite search engine and type in the question, problem, or error you're coming across; more than likely you'll find your answer on a Stack Overflow page.
- [Pandas on PyVideo](http://pyvideo.org/search?q=pandas): From PyCon to SciPy to PyData, many conferences have featured tutorials from Pandas developers and power users. The PyCon tutorials in particular tend to be given by very well-vetted presenters.
Using these resources, combined with the walk-through given in this chapter, my hope is that you'll be poised to use Pandas to tackle any data analysis problem you come across!
<!-- #endregion -->
<!-- #region deletable=true editable=true -->
<!--NAVIGATION-->
< [High-Performance Pandas: eval() and query()](03.12-Performance-Eval-and-Query.ipynb) | [Contents](Index.ipynb) | [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.13-Further-Resources.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
<!-- #endregion -->

File diff suppressed because one or more lines are too long


@ -0,0 +1,261 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Further Resources](03.13-Further-Resources.ipynb) | [Contents](Index.ipynb) | [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.00-Introduction-To-Matplotlib.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Visualization with Matplotlib
We'll now take an in-depth look at the Matplotlib package for visualization in Python.
Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack.
It was conceived by John Hunter in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython command line.
IPython's creator, Fernando Perez, was at the time scrambling to finish his PhD, and let John know he wouldn't have time to review the patch for several months.
John took this as a cue to set out on his own, and the Matplotlib package was born, with version 0.1 released in 2003.
It received an early boost when it was adopted as the plotting package of choice of the Space Telescope Science Institute (the folks behind the Hubble Telescope), which financially supported Matplotlib's development and greatly expanded its capabilities.
One of Matplotlib's most important features is its ability to play well with many operating systems and graphics backends.
Matplotlib supports dozens of backends and output types, which means you can count on it to work regardless of which operating system you are using or which output format you wish.
This cross-platform, everything-to-everyone approach has been one of the great strengths of Matplotlib.
It has led to a large user base, which in turn has led to an active developer base and Matplotlib's powerful tools and ubiquity within the scientific Python world.
In recent years, however, the interface and style of Matplotlib have begun to show their age.
Newer tools like ggplot and ggvis in the R language, along with web visualization toolkits based on D3js and HTML5 canvas, often make Matplotlib feel clunky and old-fashioned.
Still, I'm of the opinion that we cannot ignore Matplotlib's strength as a well-tested, cross-platform graphics engine.
Recent Matplotlib versions make it relatively easy to set new global plotting styles (see [Customizing Matplotlib: Configurations and Style Sheets](04.11-Settings-and-Stylesheets.ipynb)), and people have been developing new packages that build on its powerful internals to drive Matplotlib via cleaner, more modern APIs—for example, Seaborn (discussed in [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)), [ggpy](http://yhat.github.io/ggpy/), [HoloViews](http://holoviews.org/), [Altair](http://altair-viz.github.io/), and even Pandas itself can be used as wrappers around Matplotlib's API.
Even with wrappers like these, it is still often useful to dive into Matplotlib's syntax to adjust the final plot output.
For this reason, I believe that Matplotlib itself will remain a vital piece of the data visualization stack, even if new tools mean the community gradually moves away from using the Matplotlib API directly.
## General Matplotlib Tips
Before we dive into the details of creating visualizations with Matplotlib, there are a few useful things you should know about using the package.
### Importing Matplotlib
Just as we use the ``np`` shorthand for NumPy and the ``pd`` shorthand for Pandas, we will use some standard shorthands for Matplotlib imports:
```python
import matplotlib as mpl
import matplotlib.pyplot as plt
```
The ``plt`` interface is what we will use most often, as we shall see throughout this chapter.
### Setting Styles
We will use the ``plt.style`` directive to choose appropriate aesthetic styles for our figures.
Here we will set the ``classic`` style, which ensures that the plots we create use the classic Matplotlib style:
```python
plt.style.use('classic')
```
Throughout this section, we will adjust this style as needed.
Note that the stylesheets used here are supported as of Matplotlib version 1.5; if you are using an earlier version of Matplotlib, only the default style is available.
For more information on stylesheets, see [Customizing Matplotlib: Configurations and Style Sheets](04.11-Settings-and-Stylesheets.ipynb).
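If you're curious which stylesheets your installation provides, the ``plt.style.available`` list enumerates them (the exact contents will vary with your Matplotlib version):
```python
plt.style.available[:5]
```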
### ``show()`` or No ``show()``? How to Display Your Plots
A visualization you can't see won't be of much use, but just how you view your Matplotlib plots depends on the context.
The best use of Matplotlib differs depending on how you are using it; roughly, the three applicable contexts are using Matplotlib in a script, in an IPython terminal, or in an IPython notebook.
<!-- #region -->
#### Plotting from a script
If you are using Matplotlib from within a script, the function ``plt.show()`` is your friend.
``plt.show()`` starts an event loop, looks for all currently active figure objects, and opens one or more interactive windows that display your figure or figures.
So, for example, you may have a file called *myplot.py* containing the following:
```python
# ------- file: myplot.py ------
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
plt.show()
```
You can then run this script from the command-line prompt, which will result in a window opening with your figure displayed:
```
$ python myplot.py
```
The ``plt.show()`` command does a lot under the hood, as it must interact with your system's interactive graphical backend.
The details of this operation can vary greatly from system to system and even installation to installation, but matplotlib does its best to hide all these details from you.
One thing to be aware of: the ``plt.show()`` command should be used *only once* per Python session, and is most often seen at the very end of the script.
Multiple ``show()`` commands can lead to unpredictable backend-dependent behavior, and should mostly be avoided.
<!-- #endregion -->
#### Plotting from an IPython shell
It can be very convenient to use Matplotlib interactively within an IPython shell (see [IPython: Beyond Normal Python](01.00-IPython-Beyond-Normal-Python.ipynb)).
IPython is built to work well with Matplotlib if you specify Matplotlib mode.
To enable this mode, you can use the ``%matplotlib`` magic command after starting ``ipython``:
```ipython
In [1]: %matplotlib
Using matplotlib backend: TkAgg
In [2]: import matplotlib.pyplot as plt
```
At this point, any ``plt`` plot command will cause a figure window to open, and further commands can be run to update the plot.
Some changes (such as modifying properties of lines that are already drawn) will not draw automatically: to force an update, use ``plt.draw()``.
Using ``plt.show()`` in Matplotlib mode is not required.
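As a small sketch of this (assuming the interactive session started above), modifying an already-drawn line and then forcing a redraw might look like the following:
```python
line, = plt.plot([1, 2, 3])   # opens a figure window in Matplotlib mode
line.set_color('red')         # modifying an existing artist does not redraw
plt.draw()                    # explicitly refresh the open figure window
```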
#### Plotting from an IPython notebook
The IPython notebook is a browser-based interactive data analysis tool that can combine narrative, code, graphics, HTML elements, and much more into a single executable document (see [IPython: Beyond Normal Python](01.00-IPython-Beyond-Normal-Python.ipynb)).
Plotting interactively within an IPython notebook can be done with the ``%matplotlib`` command, and works in a similar way to the IPython shell.
In the IPython notebook, you also have the option of embedding graphics directly in the notebook, with two possible options:
- ``%matplotlib notebook`` will lead to *interactive* plots embedded within the notebook
- ``%matplotlib inline`` will lead to *static* images of your plot embedded in the notebook
For this book, we will generally opt for ``%matplotlib inline``:
```python
%matplotlib inline
```
After running this command (it needs to be done only once per kernel/session), any cell within the notebook that creates a plot will embed a PNG image of the resulting graphic:
```python
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');
```
### Saving Figures to File
One nice feature of Matplotlib is the ability to save figures in a wide variety of formats.
Saving a figure can be done using the ``savefig()`` command.
For example, to save the previous figure as a PNG file, you can run this:
```python
fig.savefig('my_figure.png')
```
We now have a file called ``my_figure.png`` in the current working directory:
```python
!ls -lh my_figure.png
```
To confirm that it contains what we think it contains, let's use the IPython ``Image`` object to display the contents of this file:
```python
from IPython.display import Image
Image('my_figure.png')
```
In ``savefig()``, the file format is inferred from the extension of the given filename.
Depending on what backends you have installed, many different file formats are available.
The list of supported file types can be found for your system by using the following method of the figure canvas object:
```python
fig.canvas.get_supported_filetypes()
```
Note that when saving your figure, it's not necessary to use ``plt.show()`` or related commands discussed earlier.
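For instance, changing only the extension is enough to switch formats (the filenames here are just placeholders, and the available formats depend on your installed backends):
```python
fig.savefig('my_figure.pdf')  # vector PDF output
fig.savefig('my_figure.svg')  # scalable SVG output
```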
## Two Interfaces for the Price of One
A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We'll quickly highlight the differences between the two here.
#### MATLAB-style Interface
Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact.
The MATLAB-style tools are contained in the pyplot (``plt``) interface.
For example, the following code will probably look quite familiar to MATLAB users:
```python
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
```
It is important to note that this interface is *stateful*: it keeps track of the "current" figure and axes, which are where all ``plt`` commands are applied.
You can get a reference to these using the ``plt.gcf()`` (get current figure) and ``plt.gca()`` (get current axes) routines.
While this stateful interface is fast and convenient for simple plots, it is easy to run into problems.
For example, once the second panel is created, how can we go back and add something to the first?
This is possible within the MATLAB-style interface, but a bit clunky.
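For completeness, here is a sketch of that clunky approach (my own example, relying on the fact that in recent Matplotlib versions calling ``plt.subplot()`` with the same arguments makes the existing panel current again):
```python
plt.figure()
plt.subplot(2, 1, 1)
plt.plot(x, np.sin(x))
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))
# Going back: re-select the first panel and add another line to it
plt.subplot(2, 1, 1)
plt.plot(x, np.sin(2 * x));
```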
Fortunately, there is a better way.
#### Object-oriented interface
The object-oriented interface is available for these more complicated situations, and for when you want more control over your figure.
Rather than depending on some notion of an "active" figure or axes, in the object-oriented interface the plotting functions are *methods* of explicit ``Figure`` and ``Axes`` objects.
To re-create the previous plot using this style of plotting, you might do the following:
```python
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));
```
For simpler plots, the choice of which style to use is largely a matter of preference, but the object-oriented approach can become a necessity as plots become more complicated.
Throughout this chapter, we will switch between the MATLAB-style and object-oriented interfaces, depending on what is most convenient.
In most cases, the difference is as small as switching ``plt.plot()`` to ``ax.plot()``, but there are a few gotchas that we will highlight as they come up in the following sections.
<!--NAVIGATION-->
< [Further Resources](03.13-Further-Resources.ipynb) | [Contents](Index.ipynb) | [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.00-Introduction-To-Matplotlib.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,239 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) | [Contents](Index.ipynb) | [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.01-Simple-Line-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Simple Line Plots
Perhaps the simplest of all plots is the visualization of a single function $y = f(x)$.
Here we will take a first look at creating a simple plot of this type.
As with all the following sections, we'll start by setting up the notebook for plotting and importing the packages we will use:
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
```
For all Matplotlib plots, we start by creating a figure and an axes.
In their simplest form, a figure and axes can be created as follows:
```python
fig = plt.figure()
ax = plt.axes()
```
In Matplotlib, the *figure* (an instance of the class ``plt.Figure``) can be thought of as a single container that contains all the objects representing axes, graphics, text, and labels.
The *axes* (an instance of the class ``plt.Axes``) is what we see above: a bounding box with ticks and labels, which will eventually contain the plot elements that make up our visualization.
Throughout this book, we'll commonly use the variable name ``fig`` to refer to a figure instance, and ``ax`` to refer to an axes instance or group of axes instances.
Once we have created an axes, we can use the ``ax.plot`` function to plot some data. Let's start with a simple sinusoid:
```python
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 1000)
ax.plot(x, np.sin(x));
```
Alternatively, we can use the pylab interface and let the figure and axes be created for us in the background
(see [Two Interfaces for the Price of One](04.00-Introduction-To-Matplotlib.ipynb#Two-Interfaces-for-the-Price-of-One) for a discussion of these two interfaces):
```python
plt.plot(x, np.sin(x));
```
If we want to create a single figure with multiple lines, we can simply call the ``plot`` function multiple times:
```python
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x));
```
That's all there is to plotting simple functions in Matplotlib!
We'll now dive into some more details about how to control the appearance of the axes and lines.
## Adjusting the Plot: Line Colors and Styles
The first adjustment you might wish to make to a plot is to control the line colors and styles.
The ``plt.plot()`` function takes additional arguments that can be used to specify these.
To adjust the color, you can use the ``color`` keyword, which accepts a string argument representing virtually any imaginable color.
The color can be specified in a variety of ways:
```python
plt.plot(x, np.sin(x - 0), color='blue') # specify color by name
plt.plot(x, np.sin(x - 1), color='g') # short color code (rgbcmyk)
plt.plot(x, np.sin(x - 2), color='0.75') # Grayscale between 0 and 1
plt.plot(x, np.sin(x - 3), color='#FFDD44') # Hex code (RRGGBB from 00 to FF)
plt.plot(x, np.sin(x - 4), color=(1.0,0.2,0.3)) # RGB tuple, values 0 to 1
plt.plot(x, np.sin(x - 5), color='chartreuse'); # all HTML color names supported
```
If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines.
Similarly, the line style can be adjusted using the ``linestyle`` keyword:
```python
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
```
If you would like to be extremely terse, these ``linestyle`` and ``color`` codes can be combined into a single non-keyword argument to the ``plt.plot()`` function:
```python
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r'); # dotted red
```
These single-character color codes reflect the standard abbreviations in the RGB (Red/Green/Blue) and CMYK (Cyan/Magenta/Yellow/blacK) color systems, commonly used for digital color graphics.
There are many other keyword arguments that can be used to fine-tune the appearance of the plot; for more details, I'd suggest viewing the docstring of the ``plt.plot()`` function using IPython's help tools (See [Help and Documentation in IPython](01.01-Help-And-Documentation.ipynb)).
## Adjusting the Plot: Axes Limits
Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it's nice to have finer control.
The most basic way to adjust axis limits is to use the ``plt.xlim()`` and ``plt.ylim()`` methods:
```python
plt.plot(x, np.sin(x))
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5);
```
If for some reason you'd like either axis to be displayed in reverse, you can simply reverse the order of the arguments:
```python
plt.plot(x, np.sin(x))
plt.xlim(10, 0)
plt.ylim(1.2, -1.2);
```
A useful related method is ``plt.axis()`` (note here the potential confusion between *axes* with an *e*, and *axis* with an *i*).
The ``plt.axis()`` method allows you to set the ``x`` and ``y`` limits with a single call, by passing a list which specifies ``[xmin, xmax, ymin, ymax]``:
```python
plt.plot(x, np.sin(x))
plt.axis([-1, 11, -1.5, 1.5]);
```
The ``plt.axis()`` method goes even beyond this, allowing you to do things like automatically tighten the bounds around the current plot:
```python
plt.plot(x, np.sin(x))
plt.axis('tight');
```
It allows even higher-level specifications, such as ensuring an equal aspect ratio so that on your screen, one unit in ``x`` is equal to one unit in ``y``:
```python
plt.plot(x, np.sin(x))
plt.axis('equal');
```
For more information on axis limits and the other capabilities of the ``plt.axis`` method, refer to the ``plt.axis`` docstring.
## Labeling Plots
As the last piece of this section, we'll briefly look at the labeling of plots: titles, axis labels, and simple legends.
Titles and axis labels are the simplest such labels—there are methods that can be used to quickly set them:
```python
plt.plot(x, np.sin(x))
plt.title("A Sine Curve")
plt.xlabel("x")
plt.ylabel("sin(x)");
```
The position, size, and style of these labels can be adjusted using optional arguments to the function.
For more information, see the Matplotlib documentation and the docstrings of each of these functions.
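For example (the particular keyword values here are arbitrary choices of mine, not from the text), the title and labels accept standard text properties:
```python
plt.plot(x, np.sin(x))
plt.title("A Sine Curve", fontsize=14, loc='left')  # place the title at the left
plt.xlabel("x", fontsize=12)
plt.ylabel("sin(x)", color='gray');
```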
When multiple lines are being shown within a single axes, it can be useful to create a plot legend that labels each line type.
Again, Matplotlib has a built-in way of quickly creating such a legend.
It is done via the (you guessed it) ``plt.legend()`` method.
Though there are several valid ways of using this, I find it easiest to specify the label of each line using the ``label`` keyword of the plot function:
```python
plt.plot(x, np.sin(x), '-g', label='sin(x)')
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.axis('equal')
plt.legend();
```
As you can see, the ``plt.legend()`` function keeps track of the line style and color, and matches these with the correct label.
More information on specifying and formatting plot legends can be found in the ``plt.legend`` docstring; additionally, we will cover some more advanced legend options in [Customizing Plot Legends](04.06-Customizing-Legends.ipynb).
## Aside: Matplotlib Gotchas
While most ``plt`` functions translate directly to ``ax`` methods (such as ``plt.plot()`` → ``ax.plot()``, ``plt.legend()`` → ``ax.legend()``, etc.), this is not the case for all commands.
In particular, functions to set limits, labels, and titles are slightly modified.
For transitioning between MATLAB-style functions and object-oriented methods, make the following changes:
- ``plt.xlabel()`` → ``ax.set_xlabel()``
- ``plt.ylabel()`` → ``ax.set_ylabel()``
- ``plt.xlim()`` → ``ax.set_xlim()``
- ``plt.ylim()`` → ``ax.set_ylim()``
- ``plt.title()`` → ``ax.set_title()``
In the object-oriented interface to plotting, rather than calling these functions individually, it is often more convenient to use the ``ax.set()`` method to set all these properties at once:
```python
ax = plt.axes()
ax.plot(x, np.sin(x))
ax.set(xlim=(0, 10), ylim=(-2, 2),
xlabel='x', ylabel='sin(x)',
title='A Simple Plot');
```
<!--NAVIGATION-->
< [Visualization with Matplotlib](04.00-Introduction-To-Matplotlib.ipynb) | [Contents](Index.ipynb) | [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.01-Simple-Line-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,147 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) | [Contents](Index.ipynb) | [Visualizing Errors](04.03-Errorbars.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.02-Simple-Scatter-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Simple Scatter Plots
Another commonly used plot type is the simple scatter plot, a close cousin of the line plot.
Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape.
We'll start by setting up the notebook for plotting and importing the functions we will use:
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
```
## Scatter Plots with ``plt.plot``
In the previous section we looked at ``plt.plot``/``ax.plot`` to produce line plots.
It turns out that this same function can produce scatter plots as well:
```python
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.plot(x, y, 'o', color='black');
```
The third argument in the function call is a character that represents the type of symbol used for the plotting. Just as you can specify options such as ``'-'``, ``'--'`` to control the line style, the marker style has its own set of short string codes. The full list of available symbols can be seen in the documentation of ``plt.plot``, or in Matplotlib's online documentation. Most of the possibilities are fairly intuitive, and we'll show a number of the more common ones here:
```python
rng = np.random.RandomState(0)
for marker in ['o', '.', ',', 'x', '+', 'v', '^', '<', '>', 's', 'd']:
plt.plot(rng.rand(5), rng.rand(5), marker,
label="marker='{0}'".format(marker))
plt.legend(numpoints=1)
plt.xlim(0, 1.8);
```
For even more possibilities, these character codes can be used together with line and color codes to plot points along with a line connecting them:
```python
plt.plot(x, y, '-ok');
```
Additional keyword arguments to ``plt.plot`` specify a wide range of properties of the lines and markers:
```python
plt.plot(x, y, '-p', color='gray',
markersize=15, linewidth=4,
markerfacecolor='white',
markeredgecolor='gray',
markeredgewidth=2)
plt.ylim(-1.2, 1.2);
```
This type of flexibility in the ``plt.plot`` function allows for a wide variety of possible visualization options.
For a full description of the options available, refer to the ``plt.plot`` documentation.
## Scatter Plots with ``plt.scatter``
A second, more powerful method of creating scatter plots is the ``plt.scatter`` function, which can be used very similarly to the ``plt.plot`` function:
```python
plt.scatter(x, y, marker='o');
```
The primary difference of ``plt.scatter`` from ``plt.plot`` is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.
Let's show this by creating a random scatter plot with points of many colors and sizes.
In order to better see the overlapping results, we'll also use the ``alpha`` keyword to adjust the transparency level:
```python
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
cmap='viridis')
plt.colorbar(); # show color scale
```
Notice that the color argument is automatically mapped to a color scale (shown here by the ``colorbar()`` command), and that the size argument is given in points squared.
In this way, the color and size of points can be used to convey information in the visualization, in order to visualize multidimensional data.
For example, we might use the Iris data from Scikit-Learn, where each sample is one of three types of flowers that has had the size of its petals and sepals carefully measured:
```python
from sklearn.datasets import load_iris
iris = load_iris()
features = iris.data.T
plt.scatter(features[0], features[1], alpha=0.2,
s=100*features[3], c=iris.target, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1]);
```
We can see that this scatter plot has given us the ability to simultaneously explore four different dimensions of the data:
the (x, y) location of each point corresponds to the sepal length and width, the size of the point is related to the petal width, and the color is related to the particular species of flower.
Multicolor and multifeature scatter plots like this can be useful for both exploration and presentation of data.
## ``plot`` Versus ``scatter``: A Note on Efficiency
Aside from the different features available in ``plt.plot`` and ``plt.scatter``, why might you choose to use one over the other? While it doesn't matter as much for small amounts of data, as datasets get larger than a few thousand points, ``plt.plot`` can be noticeably more efficient than ``plt.scatter``.
The reason is that ``plt.scatter`` has the capability to render a different size and/or color for each point, so the renderer must do the extra work of constructing each point individually.
In ``plt.plot``, on the other hand, the points are always essentially clones of each other, so the work of determining the appearance of the points is done only once for the entire set of data.
For large datasets, the difference between these two can lead to vastly different performance, and for this reason, ``plt.plot`` should be preferred over ``plt.scatter`` for large datasets.
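If you'd like to see this on your own machine, a rough benchmark along the following lines can be used (this is only a sketch: the cost shows up when the figure is rendered, and the numbers will vary with your system and backend):
```python
import time

rng = np.random.RandomState(0)
x_big, y_big = rng.randn(100000), rng.randn(100000)

for name in ['plot', 'scatter']:
    fig, ax = plt.subplots()
    start = time.time()
    if name == 'plot':
        ax.plot(x_big, y_big, 'o')
    else:
        ax.scatter(x_big, y_big)
    fig.canvas.draw()  # force the actual rendering work
    print(name, round(time.time() - start, 2), 'seconds')
    plt.close(fig)
```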
<!--NAVIGATION-->
< [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) | [Contents](Index.ipynb) | [Visualizing Errors](04.03-Errorbars.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.02-Simple-Scatter-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,132 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb) | [Contents](Index.ipynb) | [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.03-Errorbars.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Visualizing Errors
For any scientific measurement, accurate accounting for errors is nearly as important as, if not more important than, accurate reporting of the number itself.
For example, imagine that I am using some astrophysical observations to estimate the Hubble Constant, the local measurement of the expansion rate of the Universe.
I know that the current literature suggests a value of around 71 (km/s)/Mpc, and I measure a value of 74 (km/s)/Mpc with my method. Are the values consistent? The only correct answer, given this information, is this: there is no way to know.
Suppose I augment this information with reported uncertainties: the current literature suggests a value of around 71 $\pm$ 2.5 (km/s)/Mpc, and my method has measured a value of 74 $\pm$ 5 (km/s)/Mpc. Now are the values consistent? That is a question that can be quantitatively answered.
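For example (a back-of-the-envelope check of my own, assuming independent Gaussian uncertainties that combine in quadrature), the two values differ by far less than their combined uncertainty:
```python
import numpy as np
difference = 74 - 71               # 3 (km/s)/Mpc
combined_sigma = np.hypot(2.5, 5)  # sqrt(2.5**2 + 5**2) ~ 5.6 (km/s)/Mpc
difference / combined_sigma        # ~ 0.54 sigma: the two values are consistent
```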
In visualization of data and results, showing these errors effectively can make a plot convey much more complete information.
## Basic Errorbars
A basic errorbar can be created with a single Matplotlib function call:
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
```
```python
x = np.linspace(0, 10, 50)
dy = 0.8
y = np.sin(x) + dy * np.random.randn(50)
plt.errorbar(x, y, yerr=dy, fmt='.k');
```
Here the ``fmt`` is a format code controlling the appearance of lines and points, and has the same syntax as the shorthand used in ``plt.plot``, outlined in [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) and [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb).
In addition to these basic options, the ``errorbar`` function has many options to fine-tune the outputs.
Using these additional options you can easily customize the aesthetics of your errorbar plot.
I often find it helpful, especially in crowded plots, to make the errorbars lighter than the points themselves:
```python
plt.errorbar(x, y, yerr=dy, fmt='o', color='black',
ecolor='lightgray', elinewidth=3, capsize=0);
```
In addition to these options, you can also specify horizontal errorbars (``xerr``), one-sided errorbars, and many other variants.
For more information on the options available, refer to the docstring of ``plt.errorbar``.
## Continuous Errors
In some situations it is desirable to show errorbars on continuous quantities.
Though Matplotlib does not have a built-in convenience routine for this type of application, it's relatively easy to combine primitives like ``plt.plot`` and ``plt.fill_between`` for a useful result.
Here we'll perform a simple *Gaussian process regression*, using the Scikit-Learn API (see [Introducing Scikit-Learn](05.02-Introducing-Scikit-Learn.ipynb) for details).
This is a method of fitting a very flexible non-parametric function to data with a continuous measure of the uncertainty.
We won't delve into the details of Gaussian process regression at this point, but will focus instead on how you might visualize such a continuous error measurement:
```python
from sklearn.gaussian_process import GaussianProcessRegressor

# define the model and draw some data
model = lambda x: x * np.sin(x)
xdata = np.array([1, 3, 5, 6, 8])
ydata = model(xdata)

# Compute the Gaussian process fit
# (GaussianProcessRegressor replaces the older GaussianProcess class,
#  which has been removed from recent versions of Scikit-Learn)
gp = GaussianProcessRegressor()
gp.fit(xdata[:, np.newaxis], ydata)

xfit = np.linspace(0, 10, 1000)
yfit, std = gp.predict(xfit[:, np.newaxis], return_std=True)
dyfit = 2 * std  # 2*sigma ~ 95% confidence region
```
We now have ``xfit``, ``yfit``, and ``dyfit``, which sample the continuous fit to our data.
We could pass these to the ``plt.errorbar`` function as above, but we don't really want to plot 1,000 points with 1,000 errorbars.
Instead, we can use the ``plt.fill_between`` function with a light color to visualize this continuous error:
```python
# Visualize the result
plt.plot(xdata, ydata, 'or')
plt.plot(xfit, yfit, '-', color='gray')
plt.fill_between(xfit, yfit - dyfit, yfit + dyfit,
color='gray', alpha=0.2)
plt.xlim(0, 10);
```
Note what we've done here with the ``fill_between`` function: we pass an x value, then the lower y-bound, then the upper y-bound, and the result is that the area between these regions is filled.
The resulting figure gives a very intuitive view into what the Gaussian process regression algorithm is doing: in regions near a measured data point, the model is strongly constrained and this is reflected in the small model errors.
In regions far from a measured data point, the model is not strongly constrained, and the model errors increase.
For more information on the options available in ``plt.fill_between()`` (and the closely related ``plt.fill()`` function), see the function docstring or the Matplotlib documentation.
Finally, if this seems a bit too low level for your taste, refer to [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb), where we discuss the Seaborn package, which has a more streamlined API for visualizing this type of continuous errorbar.
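As a rough sketch of that approach (this is not the example from that section; the repeated-measurement data and the choice of ``sns.lineplot`` here are assumptions for illustration), Seaborn can aggregate repeated samples into a mean curve with a shaded uncertainty band in a single call:
```python
import pandas as pd
import seaborn as sns

# hypothetical repeated measurements: 20 noisy samples of sin(x)
# at each of 50 x values
xs = np.repeat(np.linspace(0, 10, 50), 20)
df = pd.DataFrame({'x': xs,
                   'y': np.sin(xs) + 0.8 * np.random.randn(xs.size)})

# lineplot aggregates the repeats into a mean curve plus an error band
sns.lineplot(data=df, x='x', y='y');
```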
<!--NAVIGATION-->
< [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb) | [Contents](Index.ipynb) | [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.03-Errorbars.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,141 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Visualizing Errors](04.03-Errorbars.ipynb) | [Contents](Index.ipynb) | [Histograms, Binnings, and Density](04.05-Histograms-and-Binnings.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.04-Density-and-Contour-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Density and Contour Plots
Sometimes it is useful to display three-dimensional data in two dimensions using contours or color-coded regions.
There are three Matplotlib functions that can be helpful for this task: ``plt.contour`` for contour plots, ``plt.contourf`` for filled contour plots, and ``plt.imshow`` for showing images.
This section looks at several examples of using these. We'll start by setting up the notebook for plotting and importing the functions we will use:
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
```
## Visualizing a Three-Dimensional Function
We'll start by demonstrating a contour plot using a function $z = f(x, y)$, using the following particular choice for $f$ (we've seen this before in [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb), when we used it as a motivating example for array broadcasting):
```python
def f(x, y):
    return np.sin(x) ** 10 + np.cos(10 + y * x) * np.cos(x)
```
A contour plot can be created with the ``plt.contour`` function.
It takes three arguments: a grid of *x* values, a grid of *y* values, and a grid of *z* values.
The *x* and *y* values represent positions on the plot, and the *z* values will be represented by the contour levels.
Perhaps the most straightforward way to prepare such data is to use the ``np.meshgrid`` function, which builds two-dimensional grids from one-dimensional arrays:
```python
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
```
Now let's look at this with a standard line-only contour plot:
```python
plt.contour(X, Y, Z, colors='black');
```
Notice that by default when a single color is used, negative values are represented by dashed lines, and positive values by solid lines.
Alternatively, the lines can be color-coded by specifying a colormap with the ``cmap`` argument.
Here, we'll also specify that we want more lines to be drawn—20 equally spaced intervals within the data range:
```python
plt.contour(X, Y, Z, 20, cmap='RdGy');
```
Here we chose the ``RdGy`` (short for *Red-Gray*) colormap, which is a good choice for centered data.
Matplotlib has a wide range of colormaps available, which you can easily browse in IPython by doing a tab completion on the ``plt.cm`` module:
```
plt.cm.<TAB>
```
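If you prefer to list the colormap names programmatically rather than interactively, the following (a small aside, assuming a reasonably recent Matplotlib) also works:
```python
# plt.colormaps() returns the names of all registered colormaps
names = plt.colormaps()
print(len(names), "colormaps available")
print(sorted(names)[:10])
```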
Our plot is looking nicer, but the spaces between the lines may be a bit distracting.
We can change this by switching to a filled contour plot using the ``plt.contourf()`` function (notice the ``f`` at the end), which uses largely the same syntax as ``plt.contour()``.
Additionally, we'll add a ``plt.colorbar()`` command, which automatically creates an additional axis with labeled color information for the plot:
```python
plt.contourf(X, Y, Z, 20, cmap='RdGy')
plt.colorbar();
```
The colorbar makes it clear that the black regions are "peaks," while the red regions are "valleys."
One potential issue with this plot is that it is a bit "splotchy." That is, the color steps are discrete rather than continuous, which is not always what is desired.
This could be remedied by setting the number of contours to a very high number, but this results in a rather inefficient plot: Matplotlib must render a new polygon for each step in the level.
A better way to handle this is to use the ``plt.imshow()`` function, which interprets a two-dimensional grid of data as an image.
The following code shows this:
```python
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
           cmap='RdGy')
plt.axis('image')
plt.colorbar();
```
There are a few potential gotchas with ``imshow()``, however:
- ``plt.imshow()`` doesn't accept an *x* and *y* grid, so you must manually specify the *extent* [*xmin*, *xmax*, *ymin*, *ymax*] of the image on the plot.
- ``plt.imshow()`` by default follows the standard image array definition where the origin is in the upper left, not in the lower left as in most contour plots. This must be changed when showing gridded data.
- ``plt.imshow()`` will automatically adjust the axis aspect ratio to match the input data; this can be changed with the ``aspect`` keyword of ``plt.imshow()``, or after the fact with, for example, ``plt.axis('image')`` to make the *x* and *y* units match.
Finally, it can sometimes be useful to combine contour plots and image plots.
For example, here we'll use a partially transparent background image (with transparency set via the ``alpha`` parameter) and overplot contours with labels on the contours themselves (using the ``plt.clabel()`` function):
```python
contours = plt.contour(X, Y, Z, 3, colors='black')
plt.clabel(contours, inline=True, fontsize=8)
plt.imshow(Z, extent=[0, 5, 0, 5], origin='lower',
cmap='RdGy', alpha=0.5)
plt.colorbar();
```
The combination of these three functions—``plt.contour``, ``plt.contourf``, and ``plt.imshow``—gives nearly limitless possibilities for displaying this sort of three-dimensional data within a two-dimensional plot.
For more information on the options available in these functions, refer to their docstrings.
If you are interested in three-dimensional visualizations of this type of data, see [Three-dimensional Plotting in Matplotlib](04.12-Three-Dimensional-Plotting.ipynb).
<!--NAVIGATION-->
< [Visualizing Errors](04.03-Errorbars.ipynb) | [Contents](Index.ipynb) | [Histograms, Binnings, and Density](04.05-Histograms-and-Binnings.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.04-Density-and-Contour-Plots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,168 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb) | [Contents](Index.ipynb) | [Customizing Plot Legends](04.06-Customizing-Legends.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.05-Histograms-and-Binnings.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Histograms, Binnings, and Density
A simple histogram can be a great first step in understanding a dataset.
Earlier, we saw a preview of Matplotlib's histogram function (see [Comparisons, Masks, and Boolean Logic](02.06-Boolean-Arrays-and-Masks.ipynb)), which creates a basic histogram in one line, once the normal boiler-plate imports are done:
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
data = np.random.randn(1000)
```
```python
plt.hist(data);
```
The ``hist()`` function has many options to tune both the calculation and the display;
here's an example of a more customized histogram:
```python
# density=True normalizes the histogram (it replaces the removed normed argument)
plt.hist(data, bins=30, density=True, alpha=0.5,
         histtype='stepfilled', color='steelblue',
         edgecolor='none');
```
The ``plt.hist`` docstring has more information on other customization options available.
I find this combination of ``histtype='stepfilled'`` along with some transparency ``alpha`` to be very useful when comparing histograms of several distributions:
```python
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
```
If you would like to simply compute the histogram (that is, count the number of points in a given bin) and not display it, the ``np.histogram()`` function is available:
```python
counts, bin_edges = np.histogram(data, bins=5)
print(counts)
```
## Two-Dimensional Histograms and Binnings
Just as we create histograms in one dimension by dividing the number line into bins, we can also create histograms in two dimensions by dividing points among two-dimensional bins.
We'll take a brief look at several ways to do this here.
We'll start by defining some data—an ``x`` and ``y`` array drawn from a multivariate Gaussian distribution:
```python
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
```
### ``plt.hist2d``: Two-dimensional histogram
One straightforward way to plot a two-dimensional histogram is to use Matplotlib's ``plt.hist2d`` function:
```python
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
```
Just as with ``plt.hist``, ``plt.hist2d`` has a number of extra options to fine-tune the plot and the binning, which are nicely outlined in the function docstring.
Further, just as ``plt.hist`` has a counterpart in ``np.histogram``, ``plt.hist2d`` has a counterpart in ``np.histogram2d``, which can be used as follows:
```python
counts, xedges, yedges = np.histogram2d(x, y, bins=30)
```
For the generalization of this histogram binning in dimensions higher than two, see the ``np.histogramdd`` function.
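For instance, here is a minimal sketch (the random three-dimensional data is invented purely for illustration) of binning points in three dimensions:
```python
# bin 10,000 three-dimensional points, with 5 bins along each axis
sample = np.random.multivariate_normal([0, 0, 0], np.eye(3), 10000)
counts, edges = np.histogramdd(sample, bins=(5, 5, 5))
print(counts.shape)  # (5, 5, 5)
print(len(edges))    # one array of bin edges per dimension
```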
### ``plt.hexbin``: Hexagonal binnings
The two-dimensional histogram creates a tessellation of squares across the axes.
Another natural shape for such a tessellation is the regular hexagon.
For this purpose, Matplotlib provides the ``plt.hexbin`` routine, which represents a two-dimensional dataset binned within a grid of hexagons:
```python
plt.hexbin(x, y, gridsize=30, cmap='Blues')
cb = plt.colorbar(label='count in bin')
```
``plt.hexbin`` has a number of interesting options, including the ability to specify weights for each point, and to change the output in each bin to any NumPy aggregate (mean of weights, standard deviation of weights, etc.).
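For example, here is a hedged sketch (the per-point weights are invented for illustration) using the ``C`` and ``reduce_C_function`` arguments to display the mean weight in each hexagonal bin rather than a simple count:
```python
# an invented per-point weight; any quantity tied to each (x, y) would do
weights = np.cos(3 * x) + np.sin(3 * y)

plt.hexbin(x, y, C=weights, reduce_C_function=np.mean,
           gridsize=30, cmap='Blues')
cb = plt.colorbar(label='mean weight in bin')
```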
### Kernel density estimation
Another common method of evaluating densities in multiple dimensions is *kernel density estimation* (KDE).
This will be discussed more fully in [In-Depth: Kernel Density Estimation](05.13-Kernel-Density-Estimation.ipynb), but for now we'll simply mention that KDE can be thought of as a way to "smear out" the points in space and add up the result to obtain a smooth function.
One extremely quick and simple KDE implementation exists in the ``scipy.stats`` package.
Here is a quick example of using the KDE on this data:
```python
from scipy.stats import gaussian_kde
# fit an array of size [Ndim, Nsamples]
data = np.vstack([x, y])
kde = gaussian_kde(data)
# evaluate on a regular grid
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
# Plot the result as an image
plt.imshow(Z.reshape(Xgrid.shape),
origin='lower', aspect='auto',
extent=[-3.5, 3.5, -6, 6],
cmap='Blues')
cb = plt.colorbar()
cb.set_label("density")
```
KDE has a smoothing length that effectively slides the knob between detail and smoothness (one example of the ubiquitous bias-variance trade-off).
The literature on choosing an appropriate smoothing length is vast: ``gaussian_kde`` uses a rule-of-thumb to attempt to find a nearly optimal smoothing length for the input data.
Other KDE implementations are available within the SciPy ecosystem, each with its own strengths and weaknesses; see, for example, ``sklearn.neighbors.KernelDensity`` and ``statsmodels.nonparametric.kernel_density.KDEMultivariate``.
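To get a feel for this trade-off with ``gaussian_kde`` itself, you can override the rule of thumb by passing a ``bw_method`` value; the particular numbers below are arbitrary, chosen only to show the effect:
```python
# smaller bw_method -> less smoothing (more detail); larger -> more smoothing
for bw in [0.1, 0.5]:
    kde = gaussian_kde(data, bw_method=bw)
    Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
    plt.figure()
    plt.imshow(Z.reshape(Xgrid.shape),
               origin='lower', aspect='auto',
               extent=[-3.5, 3.5, -6, 6], cmap='Blues')
    plt.title('bw_method = {0}'.format(bw))
```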
For visualizations based on KDE, using Matplotlib tends to be overly verbose.
The Seaborn library, discussed in [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb), provides a much more terse API for creating KDE-based visualizations.
<!--NAVIGATION-->
< [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb) | [Contents](Index.ipynb) | [Customizing Plot Legends](04.06-Customizing-Legends.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.05-Histograms-and-Binnings.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,194 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Histograms, Binnings, and Density](04.05-Histograms-and-Binnings.ipynb) | [Contents](Index.ipynb) | [Customizing Colorbars](04.07-Customizing-Colorbars.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.06-Customizing-Legends.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Customizing Plot Legends
Plot legends give meaning to a visualization, assigning labels to the various plot elements.
We previously saw how to create a simple legend; here we'll take a look at customizing the placement and aesthetics of the legend in Matplotlib.
The simplest legend can be created with the ``plt.legend()`` command, which automatically creates a legend for any labeled plot elements:
```python
import matplotlib.pyplot as plt
plt.style.use('classic')
```
```python
%matplotlib inline
import numpy as np
```
```python
x = np.linspace(0, 10, 1000)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), '-b', label='Sine')
ax.plot(x, np.cos(x), '--r', label='Cosine')
ax.axis('equal')
leg = ax.legend();
```
But there are many ways we might want to customize such a legend.
For example, we can specify the location and turn off the frame:
```python
ax.legend(loc='upper left', frameon=False)
fig
```
We can use the ``ncol`` keyword to specify the number of columns in the legend:
```python
ax.legend(frameon=False, loc='lower center', ncol=2)
fig
```
We can use a rounded box (``fancybox``) or add a shadow, change the transparency (alpha value) of the frame, or change the padding around the text:
```python
ax.legend(fancybox=True, framealpha=1, shadow=True, borderpad=1)
fig
```
For more information on available legend options, see the ``plt.legend`` docstring.
## Choosing Elements for the Legend
As we have already seen, the legend includes all labeled elements by default.
If this is not what is desired, we can fine-tune which elements and labels appear in the legend by using the objects returned by plot commands.
The ``plt.plot()`` command is able to create multiple lines at once, and returns a list of created line instances.
Passing any of these to ``plt.legend()`` will tell it which to identify, along with the labels we'd like to specify:
```python
y = np.sin(x[:, np.newaxis] + np.pi * np.arange(0, 2, 0.5))
lines = plt.plot(x, y)
# lines is a list of plt.Line2D instances
plt.legend(lines[:2], ['first', 'second']);
```
I generally find in practice that it is clearer to use the first method, applying labels to the plot elements you'd like to show on the legend:
```python
plt.plot(x, y[:, 0], label='first')
plt.plot(x, y[:, 1], label='second')
plt.plot(x, y[:, 2:])
plt.legend(framealpha=1, frameon=True);
```
Notice that by default, the legend ignores all elements without a ``label`` attribute set.
## Legend for Size of Points
Sometimes the legend defaults are not sufficient for the given visualization.
For example, perhaps you're using the size of points to mark certain features of the data, and want to create a legend reflecting this.
Here is an example where we'll use the size of points to indicate populations of California cities.
We'd like a legend that specifies the scale of the sizes of the points, and we'll accomplish this by plotting some labeled data with no entries:
```python
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat, lon = cities['latd'], cities['longd']
population, area = cities['population_total'], cities['area_total_km2']
# Scatter the points, using size and color but no label
plt.scatter(lon, lat, label=None,
            c=np.log10(population), cmap='viridis',
            s=area, linewidth=0, alpha=0.5)
plt.axis('equal')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.colorbar(label='log$_{10}$(population)')
plt.clim(3, 7)
# Here we create a legend:
# we'll plot empty lists with the desired size and label
for area in [100, 300, 500]:
    plt.scatter([], [], c='k', alpha=0.3, s=area,
                label=str(area) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, title='City Area')
plt.title('California Cities: Area and Population');
```
The legend will always reference some object that is on the plot, so if we'd like to display a particular shape we need to plot it.
In this case, the objects we want (gray circles) are not on the plot, so we fake them by plotting empty lists.
Notice too that the legend only lists plot elements that have a label specified.
By plotting empty lists, we create labeled plot objects which are picked up by the legend, and now our legend tells us some useful information.
This strategy can be useful for creating more sophisticated visualizations.
Finally, note that for geographic data like this, it would be clearer if we could show state boundaries or other map-specific elements.
For this, an excellent choice of tool is Matplotlib's Basemap addon toolkit, which we'll explore in [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb).
## Multiple Legends
Sometimes when designing a plot you'd like to add multiple legends to the same axes.
Unfortunately, Matplotlib does not make this easy: via the standard ``legend`` interface, it is only possible to create a single legend for the entire plot.
If you try to create a second legend using ``plt.legend()`` or ``ax.legend()``, it will simply override the first one.
We can work around this by creating a new legend artist from scratch, and then using the lower-level ``ax.add_artist()`` method to manually add the second artist to the plot:
```python
fig, ax = plt.subplots()
lines = []
styles = ['-', '--', '-.', ':']
x = np.linspace(0, 10, 1000)
for i in range(4):
    lines += ax.plot(x, np.sin(x - i * np.pi / 2),
                     styles[i], color='black')
ax.axis('equal')
# specify the lines and labels of the first legend
ax.legend(lines[:2], ['line A', 'line B'],
loc='upper right', frameon=False)
# Create the second legend and add the artist manually.
from matplotlib.legend import Legend
leg = Legend(ax, lines[2:], ['line C', 'line D'],
loc='lower right', frameon=False)
ax.add_artist(leg);
```
This is a peek into the low-level artist objects that comprise any Matplotlib plot.
If you examine the source code of ``ax.legend()`` (recall that you can do this within the IPython notebook using ``ax.legend??``) you'll see that the function simply consists of some logic to create a suitable ``Legend`` artist, which is then saved in the ``legend_`` attribute and added to the figure when the plot is drawn.
<!--NAVIGATION-->
< [Histograms, Binnings, and Density](04.05-Histograms-and-Binnings.ipynb) | [Contents](Index.ipynb) | [Customizing Colorbars](04.07-Customizing-Colorbars.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.06-Customizing-Legends.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,253 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Customizing Plot Legends](04.06-Customizing-Legends.ipynb) | [Contents](Index.ipynb) | [Multiple Subplots](04.08-Multiple-Subplots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.07-Customizing-Colorbars.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Customizing Colorbars
Plot legends identify discrete labels of discrete points.
For continuous labels based on the color of points, lines, or regions, a labeled colorbar can be a great tool.
In Matplotlib, a colorbar is a separate axes that can provide a key for the meaning of colors in a plot.
Because the book is printed in black-and-white, this section has an accompanying online supplement where you can view the figures in full color (https://github.com/jakevdp/PythonDataScienceHandbook).
We'll start by setting up the notebook for plotting and importing the functions we will use:
```python
import matplotlib.pyplot as plt
plt.style.use('classic')
```
```python
%matplotlib inline
import numpy as np
```
As we have seen several times throughout this section, the simplest colorbar can be created with the ``plt.colorbar`` function:
```python
x = np.linspace(0, 10, 1000)
I = np.sin(x) * np.cos(x[:, np.newaxis])
plt.imshow(I)
plt.colorbar();
```
We'll now discuss a few ideas for customizing these colorbars and using them effectively in various situations.
## Customizing Colorbars
The colormap can be specified using the ``cmap`` argument to the plotting function that is creating the visualization:
```python
plt.imshow(I, cmap='gray');
```
All the available colormaps are in the ``plt.cm`` namespace; using IPython's tab-completion will give you a full list of built-in possibilities:
```
plt.cm.<TAB>
```
But being *able* to choose a colormap is just the first step: more important is how to *decide* among the possibilities!
The choice turns out to be much more subtle than you might initially expect.
### Choosing the Colormap
A full treatment of color choice within visualization is beyond the scope of this book, but for entertaining reading on this subject and others, see the article ["Ten Simple Rules for Better Figures"](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833).
Matplotlib's online documentation also has an [interesting discussion](http://Matplotlib.org/1.4.1/users/colormaps.html) of colormap choice.
Broadly, you should be aware of three different categories of colormaps:
- *Sequential colormaps*: These are made up of one continuous sequence of colors (e.g., ``binary`` or ``viridis``).
- *Divergent colormaps*: These usually contain two distinct colors, which show positive and negative deviations from a mean (e.g., ``RdBu`` or ``PuOr``).
- *Qualitative colormaps*: These mix colors with no particular sequence (e.g., ``rainbow`` or ``jet``).
The ``jet`` colormap, which was the default in Matplotlib prior to version 2.0, is an example of a qualitative colormap.
Its status as the default was quite unfortunate, because qualitative maps are often a poor choice for representing quantitative data.
Among the problems is the fact that qualitative maps usually do not display any uniform progression in brightness as the scale increases.
We can see this by converting the ``jet`` colorbar into black and white:
```python
from matplotlib.colors import LinearSegmentedColormap
def grayscale_cmap(cmap):
    """Return a grayscale version of the given colormap"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    # convert RGBA to perceived grayscale luminance
    # cf. http://alienryderflex.com/hsp.html
    RGB_weight = [0.299, 0.587, 0.114]
    luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))
    colors[:, :3] = luminance[:, np.newaxis]

    return LinearSegmentedColormap.from_list(cmap.name + "_gray", colors, cmap.N)


def view_colormap(cmap):
    """Plot a colormap with its grayscale equivalent"""
    cmap = plt.cm.get_cmap(cmap)
    colors = cmap(np.arange(cmap.N))

    cmap = grayscale_cmap(cmap)
    grayscale = cmap(np.arange(cmap.N))

    fig, ax = plt.subplots(2, figsize=(6, 2),
                           subplot_kw=dict(xticks=[], yticks=[]))
    ax[0].imshow([colors], extent=[0, 10, 0, 1])
    ax[1].imshow([grayscale], extent=[0, 10, 0, 1])
```
```python
view_colormap('jet')
```
Notice the bright stripes in the grayscale image.
Even in full color, this uneven brightness means that the eye will be drawn to certain portions of the color range, which will potentially emphasize unimportant parts of the dataset.
It's better to use a colormap such as ``viridis`` (the default as of Matplotlib 2.0), which is specifically constructed to have an even brightness variation across the range.
Thus it not only plays well with our color perception, but also will translate well to grayscale printing:
```python
view_colormap('viridis')
```
If you favor rainbow schemes, another good option for continuous data is the ``cubehelix`` colormap:
```python
view_colormap('cubehelix')
```
For other situations, such as showing positive and negative deviations from some mean, dual-color colorbars such as ``RdBu`` (*Red-Blue*) can be useful. However, as you can see in the following figure, it's important to note that the positive-negative information will be lost upon translation to grayscale!
```python
view_colormap('RdBu')
```
We'll see examples of using some of these color maps as we continue.
There are a large number of colormaps available in Matplotlib; to see a list of them, you can use IPython to explore the ``plt.cm`` submodule. For a more principled approach to colors in Python, you can refer to the tools and documentation within the Seaborn library (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).
### Color limits and extensions
Matplotlib allows for a large range of colorbar customization.
The colorbar itself is simply an instance of ``plt.Axes``, so all of the axes and tick formatting tricks we've learned are applicable.
The colorbar has some interesting flexibility: for example, we can narrow the color limits and indicate the out-of-bounds values with a triangular arrow at the top and bottom by setting the ``extend`` property.
This might come in handy, for example, if displaying an image that is subject to noise:
```python
# make noise in 1% of the image pixels
speckles = (np.random.random(I.shape) < 0.01)
I[speckles] = np.random.normal(0, 3, np.count_nonzero(speckles))
plt.figure(figsize=(10, 3.5))
plt.subplot(1, 2, 1)
plt.imshow(I, cmap='RdBu')
plt.colorbar()
plt.subplot(1, 2, 2)
plt.imshow(I, cmap='RdBu')
plt.colorbar(extend='both')
plt.clim(-1, 1);
```
Notice that in the left panel, the default color limits respond to the noisy pixels, and the range of the noise completely washes out the pattern we are interested in.
In the right panel, we manually set the color limits, and add extensions to indicate values which are above or below those limits.
The result is a much more useful visualization of our data.
### Discrete Color Bars
Colormaps are by default continuous, but sometimes you'd like to represent discrete values.
The easiest way to do this is to use the ``plt.cm.get_cmap()`` function, and pass the name of a suitable colormap along with the number of desired bins:
```python
plt.imshow(I, cmap=plt.cm.get_cmap('Blues', 6))
plt.colorbar()
plt.clim(-1, 1);
```
The discrete version of a colormap can be used just like any other colormap.
## Example: Handwritten Digits
For an example of where this might be useful, let's look at an interesting visualization of some handwritten digits data.
This data is included in Scikit-Learn, and consists of nearly 2,000 $8 \times 8$ thumbnails showing various handwritten digits.
For now, let's start by downloading the digits data and visualizing several of the example images with ``plt.imshow()``:
```python
# load images of the digits 0 through 5 and visualize several of them
from sklearn.datasets import load_digits
digits = load_digits(n_class=6)
fig, ax = plt.subplots(8, 8, figsize=(6, 6))
for i, axi in enumerate(ax.flat):
    axi.imshow(digits.images[i], cmap='binary')
    axi.set(xticks=[], yticks=[])
```
Because each digit is defined by the hue of its 64 pixels, we can consider each digit to be a point lying in 64-dimensional space: each dimension represents the brightness of one pixel.
But visualizing relationships in such high-dimensional spaces can be extremely difficult.
One way to approach this is to use a *dimensionality reduction* technique such as manifold learning to reduce the dimensionality of the data while maintaining the relationships of interest.
Dimensionality reduction is an example of unsupervised machine learning, and we will discuss it in more detail in [What Is Machine Learning?](05.01-What-Is-Machine-Learning.ipynb).
Deferring the discussion of these details, let's take a look at a two-dimensional manifold learning projection of this digits data (see [In-Depth: Manifold Learning](05.10-Manifold-Learning.ipynb) for details):
```python
# project the digits into 2 dimensions using IsoMap
from sklearn.manifold import Isomap
iso = Isomap(n_components=2)
projection = iso.fit_transform(digits.data)
```
We'll use our discrete colormap to view the results, setting the ``ticks`` and ``clim`` to improve the aesthetics of the resulting colorbar:
```python
# plot the results
plt.scatter(projection[:, 0], projection[:, 1], lw=0.1,
c=digits.target, cmap=plt.cm.get_cmap('cubehelix', 6))
plt.colorbar(ticks=range(6), label='digit value')
plt.clim(-0.5, 5.5)
```
The projection also gives us some interesting insights on the relationships within the dataset: for example, the ranges of 5 and 3 nearly overlap in this projection, indicating that some handwritten fives and threes are difficult to distinguish, and therefore more likely to be confused by an automated classification algorithm.
Other values, like 0 and 1, are more distantly separated, and therefore much less likely to be confused.
This observation agrees with our intuition, because 5 and 3 look much more similar than do 0 and 1.
We'll return to manifold learning and to digit classification in [Chapter 5](05.00-Machine-Learning.ipynb).
<!--NAVIGATION-->
< [Customizing Plot Legends](04.06-Customizing-Legends.ipynb) | [Contents](Index.ipynb) | [Multiple Subplots](04.08-Multiple-Subplots.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.07-Customizing-Colorbars.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,187 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Customizing Colorbars](04.07-Customizing-Colorbars.ipynb) | [Contents](Index.ipynb) | [Text and Annotation](04.09-Text-and-Annotation.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.08-Multiple-Subplots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Multiple Subplots
Sometimes it is helpful to compare different views of data side by side.
To this end, Matplotlib has the concept of *subplots*: groups of smaller axes that can exist together within a single figure.
These subplots might be insets, grids of plots, or other more complicated layouts.
In this section we'll explore four routines for creating subplots in Matplotlib.
```python
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-white')
import numpy as np
```
## ``plt.axes``: Subplots by Hand
The most basic method of creating an axes is to use the ``plt.axes`` function.
As we've seen previously, by default this creates a standard axes object that fills the entire figure.
``plt.axes`` also takes an optional argument that is a list of four numbers in the figure coordinate system.
These numbers represent ``[left, bottom, width, height]`` in the figure coordinate system, which ranges from 0 at the bottom left of the figure to 1 at the top right of the figure.
For example, we might create an inset axes at the top-right corner of another axes by setting the *x* and *y* position to 0.65 (that is, starting at 65% of the width and 65% of the height of the figure) and the *x* and *y* extents to 0.2 (that is, the size of the axes is 20% of the width and 20% of the height of the figure):
```python
ax1 = plt.axes() # standard axes
ax2 = plt.axes([0.65, 0.65, 0.2, 0.2])
```
The equivalent of this command within the object-oriented interface is ``fig.add_axes()``. Let's use this to create two vertically stacked axes:
```python
fig = plt.figure()
ax1 = fig.add_axes([0.1, 0.5, 0.8, 0.4],
xticklabels=[], ylim=(-1.2, 1.2))
ax2 = fig.add_axes([0.1, 0.1, 0.8, 0.4],
ylim=(-1.2, 1.2))
x = np.linspace(0, 10)
ax1.plot(np.sin(x))
ax2.plot(np.cos(x));
```
We now have two axes (the top with no tick labels) that are just touching: the bottom of the upper panel (at position 0.5) matches the top of the lower panel (at position 0.1 + 0.4).
## ``plt.subplot``: Simple Grids of Subplots
Aligned columns or rows of subplots are a common-enough need that Matplotlib has several convenience routines that make them easy to create.
The lowest level of these is ``plt.subplot()``, which creates a single subplot within a grid.
As you can see, this command takes three integer arguments—the number of rows, the number of columns, and the index of the plot to be created in this scheme, which runs from the upper left to the bottom right:
```python
for i in range(1, 7):
    plt.subplot(2, 3, i)
    plt.text(0.5, 0.5, str((2, 3, i)),
             fontsize=18, ha='center')
```
The command ``plt.subplots_adjust`` can be used to adjust the spacing between these plots.
The following code uses the equivalent object-oriented command, ``fig.add_subplot()``:
```python
fig = plt.figure()
fig.subplots_adjust(hspace=0.4, wspace=0.4)
for i in range(1, 7):
    ax = fig.add_subplot(2, 3, i)
    ax.text(0.5, 0.5, str((2, 3, i)),
            fontsize=18, ha='center')
```
We've used the ``hspace`` and ``wspace`` arguments of ``plt.subplots_adjust``, which specify the spacing along the height and width of the figure, in units of the subplot size (in this case, the space is 40% of the subplot width and height).
## ``plt.subplots``: The Whole Grid in One Go
The approach just described can become quite tedious when creating a large grid of subplots, especially if you'd like to hide the x- and y-axis labels on the inner plots.
For this purpose, ``plt.subplots()`` is the easier tool to use (note the ``s`` at the end of ``subplots``). Rather than creating a single subplot, this function creates a full grid of subplots in a single line, returning them in a NumPy array.
The arguments are the number of rows and number of columns, along with optional keywords ``sharex`` and ``sharey``, which allow you to specify the relationships between different axes.
Here we'll create a $2 \times 3$ grid of subplots, where all axes in the same row share their y-axis scale, and all axes in the same column share their x-axis scale:
```python
fig, ax = plt.subplots(2, 3, sharex='col', sharey='row')
```
Note that by specifying ``sharex`` and ``sharey``, we've automatically removed inner labels on the grid to make the plot cleaner.
The resulting grid of axes instances is returned within a NumPy array, allowing for convenient specification of the desired axes using standard array indexing notation:
```python
# axes are in a two-dimensional array, indexed by [row, col]
for i in range(2):
    for j in range(3):
        ax[i, j].text(0.5, 0.5, str((i, j)),
                      fontsize=18, ha='center')
fig
```
In comparison to ``plt.subplot()``, ``plt.subplots()`` is more consistent with Python's conventional 0-based indexing.
## ``plt.GridSpec``: More Complicated Arrangements
To go beyond a regular grid to subplots that span multiple rows and columns, ``plt.GridSpec()`` is the best tool.
The ``plt.GridSpec()`` object does not create a plot by itself; it is simply a convenient interface that is recognized by the ``plt.subplot()`` command.
For example, a gridspec for a grid of two rows and three columns with some specified width and height space looks like this:
```python
grid = plt.GridSpec(2, 3, wspace=0.4, hspace=0.3)
```
From this we can specify subplot locations and extents using the familiar Python slicing syntax:
```python
plt.subplot(grid[0, 0])
plt.subplot(grid[0, 1:])
plt.subplot(grid[1, :2])
plt.subplot(grid[1, 2]);
```
This type of flexible grid alignment has a wide range of uses.
I most often use it when creating multi-axes histogram plots like the ones shown here:
```python
# Create some normally distributed data
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 3000).T
# Set up the axes with gridspec
fig = plt.figure(figsize=(6, 6))
grid = plt.GridSpec(4, 4, hspace=0.2, wspace=0.2)
main_ax = fig.add_subplot(grid[:-1, 1:])
y_hist = fig.add_subplot(grid[:-1, 0], xticklabels=[], sharey=main_ax)
x_hist = fig.add_subplot(grid[-1, 1:], yticklabels=[], sharex=main_ax)
# scatter points on the main axes
main_ax.plot(x, y, 'ok', markersize=3, alpha=0.2)
# histogram on the attached axes
x_hist.hist(x, 40, histtype='stepfilled',
orientation='vertical', color='gray')
x_hist.invert_yaxis()
y_hist.hist(y, 40, histtype='stepfilled',
orientation='horizontal', color='gray')
y_hist.invert_xaxis()
```
This type of distribution plotted alongside its margins is common enough that it has its own plotting API in the Seaborn package; see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb) for more details.
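For reference, a minimal Seaborn sketch of that idea (just one possible call; the ``kind='hex'`` choice here is an assumption, not the only option) looks like this:
```python
import seaborn as sns

# a joint hexbin panel with marginal histograms in a single call
sns.jointplot(x=x, y=y, kind='hex', color='gray');
```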
<!--NAVIGATION-->
< [Customizing Colorbars](04.07-Customizing-Colorbars.ipynb) | [Contents](Index.ipynb) | [Text and Annotation](04.09-Text-and-Annotation.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.08-Multiple-Subplots.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long

View File

@ -0,0 +1,249 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Multiple Subplots](04.08-Multiple-Subplots.ipynb) | [Contents](Index.ipynb) | [Customizing Ticks](04.10-Customizing-Ticks.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.09-Text-and-Annotation.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Text and Annotation
Creating a good visualization involves guiding the reader so that the figure tells a story.
In some cases, this story can be told in an entirely visual manner, without the need for added text, but in others, small textual cues and labels are necessary.
Perhaps the most basic types of annotations you will use are axes labels and titles, but the options go beyond this.
Let's take a look at some data and how we might visualize and annotate it to help convey interesting information. We'll start by setting up the notebook for plotting and importing the functions we will use:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib as mpl
plt.style.use('seaborn-whitegrid')
import numpy as np
import pandas as pd
```
## Example: Effect of Holidays on US Births
Let's return to some data we worked with earlier, in ["Example: Birthrate Data"](03.09-Pivot-Tables.ipynb#Example:-Birthrate-Data), where we generated a plot of average births over the course of the calendar year; as already mentioned, this data can be downloaded at https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv.
We'll start with the same cleaning procedure we used there, and plot the results:
```python
from datetime import datetime

births = pd.read_csv('data/births.csv')

quartiles = np.percentile(births['births'], [25, 50, 75])
mu, sig = quartiles[1], 0.74 * (quartiles[2] - quartiles[0])
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')

births['day'] = births['day'].astype(int)

births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')
births_by_date = births.pivot_table('births',
                                    [births.index.month, births.index.day])

# datetime.datetime replaces pd.datetime, which has been removed from recent pandas
births_by_date.index = [datetime(2012, month, day)
                        for (month, day) in births_by_date.index]
```
```python
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);
```
When we're communicating data like this, it is often useful to annotate certain features of the plot to draw the reader's attention.
This can be done manually with the ``plt.text``/``ax.text`` command, which will place text at a particular x/y value:
```python
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
# Add labels to the plot
style = dict(size=10, color='gray')
ax.text('2012-1-1', 3950, "New Year's Day", **style)
ax.text('2012-7-4', 4250, "Independence Day", ha='center', **style)
ax.text('2012-9-4', 4850, "Labor Day", ha='center', **style)
ax.text('2012-10-31', 4600, "Halloween", ha='right', **style)
ax.text('2012-11-25', 4450, "Thanksgiving", ha='center', **style)
ax.text('2012-12-25', 3850, "Christmas ", ha='right', **style)
# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
ylabel='average daily births')
# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
```
The ``ax.text`` method takes an x position, a y position, a string, and then optional keywords specifying the color, size, style, alignment, and other properties of the text.
Here we used ``ha='right'`` and ``ha='center'``, where ``ha`` is short for *horizontal alignment*.
See the docstring of ``plt.text()`` and of ``mpl.text.Text()`` for more information on available options.
## Transforms and Text Position
In the previous example, we have anchored our text annotations to data locations. Sometimes it's preferable to anchor the text to a position on the axes or figure, independent of the data. In Matplotlib, this is done by modifying the *transform*.
Any graphics display framework needs some scheme for translating between coordinate systems.
For example, a data point at $(x, y) = (1, 1)$ needs to somehow be represented at a certain location on the figure, which in turn needs to be represented in pixels on the screen.
Mathematically, such coordinate transformations are relatively straightforward, and Matplotlib has a well-developed set of tools that it uses internally to perform them (these tools can be explored in the ``matplotlib.transforms`` submodule).
The average user rarely needs to worry about the details of these transforms, but it is helpful knowledge to have when considering the placement of text on a figure. There are three pre-defined transforms that can be useful in this situation:
- ``ax.transData``: Transform associated with data coordinates
- ``ax.transAxes``: Transform associated with the axes (in units of axes dimensions)
- ``fig.transFigure``: Transform associated with the figure (in units of figure dimensions)
Here let's look at an example of drawing text at various locations using these transforms:
```python
fig, ax = plt.subplots(facecolor='lightgray')
ax.axis([0, 10, 0, 10])
# transform=ax.transData is the default, but we'll specify it anyway
ax.text(1, 5, ". Data: (1, 5)", transform=ax.transData)
ax.text(0.5, 0.1, ". Axes: (0.5, 0.1)", transform=ax.transAxes)
ax.text(0.2, 0.2, ". Figure: (0.2, 0.2)", transform=fig.transFigure);
```
Note that by default, the text is aligned above and to the left of the specified coordinates: here the "." at the beginning of each string will approximately mark the given coordinate location.
The ``transData`` coordinates give the usual data coordinates associated with the x- and y-axis labels.
The ``transAxes`` coordinates give the location from the bottom-left corner of the axes (here the white box), as a fraction of the axes size.
The ``transFigure`` coordinates are similar, but specify the position from the bottom-left of the figure (here the gray box), as a fraction of the figure size.
Notice now that if we change the axes limits, it is only the ``transData`` coordinates that will be affected, while the others remain stationary:
```python
ax.set_xlim(0, 2)
ax.set_ylim(-6, 6)
fig
```
This behavior can be seen more clearly by changing the axes limits interactively: if you are executing this code in a notebook, you can make that happen by changing ``%matplotlib inline`` to ``%matplotlib notebook`` and using each plot's menu to interact with the plot.
## Arrows and Annotation
Along with tick marks and text, another useful annotation mark is the simple arrow.
Drawing arrows in Matplotlib is often much harder than you'd bargain for.
While there is a ``plt.arrow()`` function available, I wouldn't suggest using it: the arrows it creates are SVG objects that will be subject to the varying aspect ratio of your plots, and the result is rarely what the user intended.
Instead, I'd suggest using the ``plt.annotate()`` function.
This function creates some text and an arrow, and the arrows can be very flexibly specified.
Here we'll use ``annotate`` with several of its options:
```python
%matplotlib inline
fig, ax = plt.subplots()
x = np.linspace(0, 20, 1000)
ax.plot(x, np.cos(x))
ax.axis('equal')
ax.annotate('local maximum', xy=(6.28, 1), xytext=(10, 4),
arrowprops=dict(facecolor='black', shrink=0.05))
ax.annotate('local minimum', xy=(5 * np.pi, -1), xytext=(2, -6),
arrowprops=dict(arrowstyle="->",
connectionstyle="angle3,angleA=0,angleB=-90"));
```
The arrow style is controlled through the ``arrowprops`` dictionary, which has numerous options available.
These options are fairly well-documented in Matplotlib's online documentation, so rather than repeating them here it is probably more useful to quickly show some of the possibilities.
Let's demonstrate several of the possible options using the birthrate plot from before:
```python
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax)
# Add labels to the plot
ax.annotate("New Year's Day", xy=('2012-1-1', 4100), xycoords='data',
xytext=(50, -30), textcoords='offset points',
arrowprops=dict(arrowstyle="->",
connectionstyle="arc3,rad=-0.2"))
ax.annotate("Independence Day", xy=('2012-7-4', 4250), xycoords='data',
bbox=dict(boxstyle="round", fc="none", ec="gray"),
xytext=(10, -40), textcoords='offset points', ha='center',
arrowprops=dict(arrowstyle="->"))
ax.annotate('Labor Day', xy=('2012-9-4', 4850), xycoords='data', ha='center',
xytext=(0, -20), textcoords='offset points')
ax.annotate('', xy=('2012-9-1', 4850), xytext=('2012-9-7', 4850),
xycoords='data', textcoords='data',
arrowprops={'arrowstyle': '|-|,widthA=0.2,widthB=0.2', })
ax.annotate('Halloween', xy=('2012-10-31', 4600), xycoords='data',
xytext=(-80, -40), textcoords='offset points',
arrowprops=dict(arrowstyle="fancy",
fc="0.6", ec="none",
connectionstyle="angle3,angleA=0,angleB=-90"))
ax.annotate('Thanksgiving', xy=('2012-11-25', 4500), xycoords='data',
xytext=(-120, -60), textcoords='offset points',
bbox=dict(boxstyle="round4,pad=.5", fc="0.9"),
arrowprops=dict(arrowstyle="->",
connectionstyle="angle,angleA=0,angleB=80,rad=20"))
ax.annotate('Christmas', xy=('2012-12-25', 3850), xycoords='data',
xytext=(-30, 0), textcoords='offset points',
size=13, ha='right', va="center",
bbox=dict(boxstyle="round", alpha=0.1),
arrowprops=dict(arrowstyle="wedge,tail_width=0.5", alpha=0.1));
# Label the axes
ax.set(title='USA births by day of year (1969-1988)',
ylabel='average daily births')
# Format the x axis with centered month labels
ax.xaxis.set_major_locator(mpl.dates.MonthLocator())
ax.xaxis.set_minor_locator(mpl.dates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(plt.NullFormatter())
ax.xaxis.set_minor_formatter(mpl.dates.DateFormatter('%h'));
ax.set_ylim(3600, 5400);
```
You'll notice that the specifications of the arrows and text boxes are very detailed: this gives you the power to create nearly any arrow style you wish.
Unfortunately, it also means that these sorts of features often must be manually tweaked, a process that can be very time consuming when producing publication-quality graphics!
Finally, I'll note that the preceding mix of styles is by no means best practice for presenting data, but rather included as a demonstration of some of the available options.
More discussion and examples of available arrow and annotation styles can be found in the Matplotlib gallery, in particular the [Annotation Demo](http://matplotlib.org/examples/pylab_examples/annotation_demo2.html).
<!--NAVIGATION-->
< [Multiple Subplots](04.08-Multiple-Subplots.ipynb) | [Contents](Index.ipynb) | [Customizing Ticks](04.10-Customizing-Ticks.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.09-Text-and-Annotation.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,226 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Text and Annotation](04.09-Text-and-Annotation.ipynb) | [Contents](Index.ipynb) | [Customizing Matplotlib: Configurations and Stylesheets](04.11-Settings-and-Stylesheets.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.10-Customizing-Ticks.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Customizing Ticks
Matplotlib's default tick locators and formatters are designed to be generally sufficient in many common situations, but are in no way optimal for every plot. This section will give several examples of adjusting the tick locations and formatting for the particular plot type you're interested in.
Before we go into examples, it will be best for us to understand further the object hierarchy of Matplotlib plots.
Matplotlib aims to have a Python object representing everything that appears on the plot: for example, recall that the ``figure`` is the bounding box within which plot elements appear.
Each Matplotlib object can also act as a container of sub-objects: for example, each ``figure`` can contain one or more ``axes`` objects, each of which in turn contain other objects representing plot contents.
The tick marks are no exception. Each ``axes`` has attributes ``xaxis`` and ``yaxis``, which in turn have attributes that contain all the properties of the lines, ticks, and labels that make up the axes.
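As a minimal sketch of this hierarchy, we can create an empty figure and walk from the ``figure`` down to an individual axis object and its ticks:
```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
print(type(fig))                # the Figure container
print(type(ax))                 # an Axes object living inside the figure
print(type(ax.xaxis))           # the XAxis object attached to the axes
print(ax.xaxis.get_ticklocs())  # the tick locations currently in use
```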
## Major and Minor Ticks
Within each axis, there is the concept of a *major* tick mark, and a *minor* tick mark. As the names would imply, major ticks are usually bigger or more pronounced, while minor ticks are usually smaller. By default, Matplotlib rarely makes use of minor ticks, but one place you can see them is within logarithmic plots:
```python
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
```
```python
ax = plt.axes(xscale='log', yscale='log')
ax.grid();
```
We see here that each major tick shows a large tickmark and a label, while each minor tick shows a smaller tickmark with no label.
These tick properties—that is, locations and labels—can be customized by setting the ``formatter`` and ``locator`` objects of each axis. Let's examine these for the x axis of the plot just shown:
```python
print(ax.xaxis.get_major_locator())
print(ax.xaxis.get_minor_locator())
```
```python
print(ax.xaxis.get_major_formatter())
print(ax.xaxis.get_minor_formatter())
```
We see that both major and minor tick labels have their locations specified by a ``LogLocator`` (which makes sense for a logarithmic plot). Minor ticks, though, have their labels formatted by a ``NullFormatter``: this says that no labels will be shown.
We'll now show a few examples of setting these locators and formatters for various plots.
## Hiding Ticks or Labels
Perhaps the most common tick/label formatting operation is the act of hiding ticks or labels.
This can be done using ``plt.NullLocator()`` and ``plt.NullFormatter()``, as shown here:
```python
ax = plt.axes()
ax.plot(np.random.rand(50))
ax.yaxis.set_major_locator(plt.NullLocator())
ax.xaxis.set_major_formatter(plt.NullFormatter())
```
Notice that we've removed the labels (but kept the ticks/gridlines) from the x axis, and removed the ticks (and thus the labels as well) from the y axis.
Having no ticks at all can be useful in many situations—for example, when you want to show a grid of images.
For instance, consider the following figure, which includes images of different faces, an example often used in supervised machine learning problems (see, for example, [In-Depth: Support Vector Machines](05.07-Support-Vector-Machines.ipynb)):
```python
fig, ax = plt.subplots(5, 5, figsize=(5, 5))
fig.subplots_adjust(hspace=0, wspace=0)
# Get some face data from scikit-learn
from sklearn.datasets import fetch_olivetti_faces
faces = fetch_olivetti_faces().images
for i in range(5):
for j in range(5):
ax[i, j].xaxis.set_major_locator(plt.NullLocator())
ax[i, j].yaxis.set_major_locator(plt.NullLocator())
ax[i, j].imshow(faces[10 * i + j], cmap="bone")
```
Notice that each image has its own axes, and we've set the locators to null because the tick values (pixel number in this case) do not convey relevant information for this particular visualization.
## Reducing or Increasing the Number of Ticks
One common problem with the default settings is that smaller subplots can end up with crowded labels.
We can see this in the plot grid shown here:
```python
fig, ax = plt.subplots(4, 4, sharex=True, sharey=True)
```
Particularly for the x ticks, the numbers nearly overlap, making them quite difficult to decipher.
We can fix this with the ``plt.MaxNLocator()``, which allows us to specify the maximum number of ticks that will be displayed.
Given this maximum number, Matplotlib will use internal logic to choose the particular tick locations:
```python
# For every axis, set the x and y major locator
for axi in ax.flat:
axi.xaxis.set_major_locator(plt.MaxNLocator(3))
axi.yaxis.set_major_locator(plt.MaxNLocator(3))
fig
```
This makes things much cleaner. If you want even more control over the locations of regularly-spaced ticks, you might also use ``plt.MultipleLocator``, which we'll discuss in the following section.
## Fancy Tick Formats
Matplotlib's default tick formatting can leave a lot to be desired: it works well as a broad default, but sometimes you'd like to do something more.
Consider this plot of a sine and a cosine:
```python
# Plot a sine and cosine curve
fig, ax = plt.subplots()
x = np.linspace(0, 3 * np.pi, 1000)
ax.plot(x, np.sin(x), lw=3, label='Sine')
ax.plot(x, np.cos(x), lw=3, label='Cosine')
# Set up grid, legend, and limits
ax.grid(True)
ax.legend(frameon=False)
ax.axis('equal')
ax.set_xlim(0, 3 * np.pi);
```
There are a couple changes we might like to make. First, it's more natural for this data to space the ticks and grid lines in multiples of $\pi$. We can do this by setting a ``MultipleLocator``, which locates ticks at a multiple of the number you provide. For good measure, we'll add both major and minor ticks in multiples of $\pi/4$:
```python
ax.xaxis.set_major_locator(plt.MultipleLocator(np.pi / 2))
ax.xaxis.set_minor_locator(plt.MultipleLocator(np.pi / 4))
fig
```
But now these tick labels look a little bit silly: we can see that they are multiples of $\pi$, but the decimal representation does not immediately convey this.
To fix this, we can change the tick formatter. There's no built-in formatter for what we want to do, so we'll instead use ``plt.FuncFormatter``, which accepts a user-defined function giving fine-grained control over the tick outputs:
```python
def format_func(value, tick_number):
# find number of multiples of pi/2
N = int(np.round(2 * value / np.pi))
if N == 0:
return "0"
elif N == 1:
return r"$\pi/2$"
elif N == 2:
return r"$\pi$"
elif N % 2 > 0:
return r"${0}\pi/2$".format(N)
else:
return r"${0}\pi$".format(N // 2)
ax.xaxis.set_major_formatter(plt.FuncFormatter(format_func))
fig
```
This is much better! Notice that we've made use of Matplotlib's LaTeX support, specified by enclosing the string within dollar signs. This is very convenient for display of mathematical symbols and formulae: in this case, ``"$\pi$"`` is rendered as the Greek character $\pi$.
The ``plt.FuncFormatter()`` offers extremely fine-grained control over the appearance of your plot ticks, and comes in very handy when preparing plots for presentation or publication.
## Summary of Formatters and Locators
We've mentioned a couple of the available formatters and locators.
We'll conclude this section by briefly listing all the built-in locator and formatter options. For more information on any of these, refer to the docstrings or to the Matplotlib online documentation.
Each of the following is available in the ``plt`` namespace:
Locator class | Description
---------------------|-------------
``NullLocator`` | No ticks
``FixedLocator`` | Tick locations are fixed
``IndexLocator`` | Locator for index plots (e.g., where x = range(len(y)))
``LinearLocator`` | Evenly spaced ticks from min to max
``LogLocator``       | Logarithmically spaced ticks from min to max
``MultipleLocator`` | Ticks and range are a multiple of base
``MaxNLocator`` | Finds up to a max number of ticks at nice locations
``AutoLocator`` | (Default.) MaxNLocator with simple defaults.
``AutoMinorLocator`` | Locator for minor ticks
Formatter Class | Description
----------------------|---------------
``NullFormatter`` | No labels on the ticks
``IndexFormatter`` | Set the strings from a list of labels
``FixedFormatter`` | Set the strings manually for the labels
``FuncFormatter`` | User-defined function sets the labels
``FormatStrFormatter``| Use a format string for each value
``ScalarFormatter`` | (Default.) Formatter for scalar values
``LogFormatter`` | Default formatter for log axes
We'll see further examples of these through the remainder of the book.
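For instance, here is a minimal sketch combining a few of these: ``FixedLocator`` pins the x ticks to explicit positions, ``FixedFormatter`` supplies a label for each of those ticks, and ``FormatStrFormatter`` applies a printf-style format string to the y-tick values:
```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(np.random.rand(10))

# Place x ticks at fixed positions and give each one an explicit label
ax.xaxis.set_major_locator(plt.FixedLocator([0, 3, 6, 9]))
ax.xaxis.set_major_formatter(plt.FixedFormatter(['start', 'early', 'late', 'end']))

# Format the y-tick values with a printf-style string
ax.yaxis.set_major_formatter(plt.FormatStrFormatter('%.2f'));
```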
<!--NAVIGATION-->
< [Text and Annotation](04.09-Text-and-Annotation.ipynb) | [Contents](Index.ipynb) | [Customizing Matplotlib: Configurations and Stylesheets](04.11-Settings-and-Stylesheets.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.10-Customizing-Ticks.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,269 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Customizing Ticks](04.10-Customizing-Ticks.ipynb) | [Contents](Index.ipynb) | [Three-Dimensional Plotting in Matplotlib](04.12-Three-Dimensional-Plotting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.11-Settings-and-Stylesheets.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Customizing Matplotlib: Configurations and Stylesheets
Matplotlib's default plot settings are often the subject of complaint among its users.
While much is slated to change in the 2.0 Matplotlib release in late 2016, the ability to customize default settings helps bring the package in line with your own aesthetic preferences.
Here we'll walk through some of Matplotlib's runtime configuration (rc) options, and take a look at the newer *stylesheets* feature, which contains some nice sets of default configurations.
## Plot Customization by Hand
Through this chapter, we've seen how it is possible to tweak individual plot settings to end up with something that looks a little bit nicer than the default.
It's possible to do these customizations for each individual plot.
For example, here is a fairly drab default histogram:
```python
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
%matplotlib inline
```
```python
x = np.random.randn(1000)
plt.hist(x);
```
We can adjust this by hand to make it a much more visually pleasing plot:
```python
# use a gray background
ax = plt.axes(facecolor='#E6E6E6')
ax.set_axisbelow(True)
# draw solid white grid lines
plt.grid(color='w', linestyle='solid')
# hide axis spines
for spine in ax.spines.values():
spine.set_visible(False)
# hide top and right ticks
ax.xaxis.tick_bottom()
ax.yaxis.tick_left()
# lighten ticks and labels
ax.tick_params(colors='gray', direction='out')
for tick in ax.get_xticklabels():
tick.set_color('gray')
for tick in ax.get_yticklabels():
tick.set_color('gray')
# control face and edge color of histogram
ax.hist(x, edgecolor='#E6E6E6', color='#EE6666');
```
This looks better, and you may recognize the style as inspired by the R language's ggplot visualization package.
But this took a whole lot of effort!
We definitely do not want to have to do all that tweaking each time we create a plot.
Fortunately, there is a way to adjust these defaults once in a way that will work for all plots.
## Changing the Defaults: ``rcParams``
Each time Matplotlib loads, it defines a runtime configuration (rc) containing the default styles for every plot element you create.
This configuration can be adjusted at any time using the ``plt.rc`` convenience routine.
Let's see what it looks like to modify the rc parameters so that our default plot will look similar to what we did before.
We'll start by saving a copy of the current ``rcParams`` dictionary, so we can easily reset these changes in the current session:
```python
IPython_default = plt.rcParams.copy()
```
Now we can use the ``plt.rc`` function to change some of these settings:
```python
from matplotlib import cycler
colors = cycler('color',
['#EE6666', '#3388BB', '#9988DD',
'#EECC55', '#88BB44', '#FFBBBB'])
plt.rc('axes', facecolor='#E6E6E6', edgecolor='none',
axisbelow=True, grid=True, prop_cycle=colors)
plt.rc('grid', color='w', linestyle='solid')
plt.rc('xtick', direction='out', color='gray')
plt.rc('ytick', direction='out', color='gray')
plt.rc('patch', edgecolor='#E6E6E6')
plt.rc('lines', linewidth=2)
```
With these settings defined, we can now create a plot and see our settings in action:
```python
plt.hist(x);
```
Let's see what simple line plots look like with these rc parameters:
```python
for i in range(4):
plt.plot(np.random.rand(10))
```
I find this much more aesthetically pleasing than the default styling.
If you disagree with my aesthetic sense, the good news is that you can adjust the rc parameters to suit your own tastes!
These settings can be saved in a *.matplotlibrc* file, which you can read about in the [Matplotlib documentation](http://Matplotlib.org/users/customizing.html).
That said, I prefer to customize Matplotlib using its stylesheets instead.
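Either way, it can be handy to know where the active *matplotlibrc* file lives; here is a minimal sketch that asks Matplotlib directly (the printed path will vary by installation):
```python
import matplotlib

# Print the path of the matplotlibrc file currently in use;
# settings placed there (e.g. "lines.linewidth : 2") become the
# defaults for every new session.
print(matplotlib.matplotlib_fname())
```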
## Stylesheets
The version 1.4 release of Matplotlib in August 2014 added a very convenient ``style`` module, which includes a number of new default stylesheets, as well as the ability to create and package your own styles. These stylesheets are formatted similarly to the *.matplotlibrc* files mentioned earlier, but must be named with a *.mplstyle* extension.
Even if you don't create your own style, the stylesheets included by default are extremely useful.
The available styles are listed in ``plt.style.available``—here I'll list only the first five for brevity:
```python
plt.style.available[:5]
```
<!-- #region -->
The basic way to switch to a stylesheet is to call
``` python
plt.style.use('stylename')
```
But keep in mind that this will change the style for the rest of the session!
Alternatively, you can use the style context manager, which sets a style temporarily:
``` python
with plt.style.context('stylename'):
make_a_plot()
```
<!-- #endregion -->
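If you do want to package your own style, a minimal sketch looks like this (the file name *my_style.mplstyle* is just an illustrative placeholder): write a small stylesheet to disk and pass its path to ``plt.style.use``:
```python
import matplotlib.pyplot as plt

# Write a tiny stylesheet; each line uses the same key : value
# syntax as the matplotlibrc file
with open('my_style.mplstyle', 'w') as f:
    f.write("axes.grid : True\n")
    f.write("lines.linewidth : 2\n")

# Activate it by path (or pass the path to plt.style.context
# for a temporary change)
plt.style.use('my_style.mplstyle')
```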
Let's create a function that will make two basic types of plot:
```python
def hist_and_lines():
np.random.seed(0)
fig, ax = plt.subplots(1, 2, figsize=(11, 4))
ax[0].hist(np.random.randn(1000))
for i in range(3):
ax[1].plot(np.random.rand(10))
ax[1].legend(['a', 'b', 'c'], loc='lower left')
```
We'll use this to explore how these plots look using the various built-in styles.
### Default style
The default style is what we've been seeing so far throughout the book; we'll start with that.
First, let's reset our runtime configuration to the notebook default:
```python
# reset rcParams
plt.rcParams.update(IPython_default);
```
Now let's see how it looks:
```python
hist_and_lines()
```
### FiveThirtyEight style
The ``fivethirtyeight`` style mimics the graphics found on the popular [FiveThirtyEight website](https://fivethirtyeight.com).
As you can see here, it is typified by bold colors, thick lines, and transparent axes:
```python
with plt.style.context('fivethirtyeight'):
hist_and_lines()
```
### ggplot
The ``ggplot`` package in the R language is a very popular visualization tool.
Matplotlib's ``ggplot`` style mimics the default styles from that package:
```python
with plt.style.context('ggplot'):
hist_and_lines()
```
### Bayesian Methods for Hackers style
There is a very nice short online book called [*Probabilistic Programming and Bayesian Methods for Hackers*](http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/); it features figures created with Matplotlib, and uses a nice set of rc parameters to create a consistent and visually-appealing style throughout the book.
This style is reproduced in the ``bmh`` stylesheet:
```python
with plt.style.context('bmh'):
hist_and_lines()
```
### Dark background
For figures used within presentations, it is often useful to have a dark rather than light background.
The ``dark_background`` style provides this:
```python
with plt.style.context('dark_background'):
hist_and_lines()
```
### Grayscale
Sometimes you might find yourself preparing figures for a print publication that does not accept color figures.
For this, the ``grayscale`` style, shown here, can be very useful:
```python
with plt.style.context('grayscale'):
hist_and_lines()
```
### Seaborn style
Matplotlib also has stylesheets inspired by the Seaborn library (discussed more fully in [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).
As we will see, these styles are loaded automatically when Seaborn is imported into a notebook.
I've found these settings to be very nice, and tend to use them as defaults in my own data exploration.
```python
import seaborn
hist_and_lines()
```
With all of these built-in options for various plot styles, Matplotlib becomes much more useful for both interactive visualization and creation of figures for publication.
Throughout this book, I will generally use one or more of these style conventions when creating plots.
<!--NAVIGATION-->
< [Customizing Ticks](04.10-Customizing-Ticks.ipynb) | [Contents](Index.ipynb) | [Three-Dimensional Plotting in Matplotlib](04.12-Three-Dimensional-Plotting.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.11-Settings-and-Stylesheets.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,251 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Customizing Matplotlib: Configurations and Stylesheets](04.11-Settings-and-Stylesheets.ipynb) | [Contents](Index.ipynb) | [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.12-Three-Dimensional-Plotting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Three-Dimensional Plotting in Matplotlib
Matplotlib was initially designed with only two-dimensional plotting in mind.
Around the time of the 1.0 release, some three-dimensional plotting utilities were built on top of Matplotlib's two-dimensional display, and the result is a convenient (if somewhat limited) set of tools for three-dimensional data visualization.
Three-dimensional plots are enabled by importing the ``mplot3d`` toolkit, included with the main Matplotlib installation:
```python
from mpl_toolkits import mplot3d
```
Once this submodule is imported, a three-dimensional axes can be created by passing the keyword ``projection='3d'`` to any of the normal axes creation routines:
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
```
```python
fig = plt.figure()
ax = plt.axes(projection='3d')
```
With this three-dimensional axes enabled, we can now plot a variety of three-dimensional plot types.
Three-dimensional plotting is one of the functionalities that benefits immensely from viewing figures interactively rather than statically in the notebook; recall that to use interactive figures, you can use ``%matplotlib notebook`` rather than ``%matplotlib inline`` when running this code.
## Three-dimensional Points and Lines
The most basic three-dimensional plot is a line or collection of scatter points created from sets of (x, y, z) triples.
In analogy with the more common two-dimensional plots discussed earlier, these can be created using the ``ax.plot3D`` and ``ax.scatter3D`` functions.
The call signature for these is nearly identical to that of their two-dimensional counterparts, so you can refer to [Simple Line Plots](04.01-Simple-Line-Plots.ipynb) and [Simple Scatter Plots](04.02-Simple-Scatter-Plots.ipynb) for more information on controlling the output.
Here we'll plot a trigonometric spiral, along with some points drawn randomly near the line:
```python
ax = plt.axes(projection='3d')
# Data for a three-dimensional line
zline = np.linspace(0, 15, 1000)
xline = np.sin(zline)
yline = np.cos(zline)
ax.plot3D(xline, yline, zline, 'gray')
# Data for three-dimensional scattered points
zdata = 15 * np.random.random(100)
xdata = np.sin(zdata) + 0.1 * np.random.randn(100)
ydata = np.cos(zdata) + 0.1 * np.random.randn(100)
ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens');
```
Notice that by default, the scatter points have their transparency adjusted to give a sense of depth on the page.
While the three-dimensional effect is sometimes difficult to see within a static image, an interactive view can lead to some nice intuition about the layout of the points.
## Three-dimensional Contour Plots
Analogous to the contour plots we explored in [Density and Contour Plots](04.04-Density-and-Contour-Plots.ipynb), ``mplot3d`` contains tools to create three-dimensional relief plots using the same inputs.
Like two-dimensional ``ax.contour`` plots, ``ax.contour3D`` requires all the input data to be in the form of two-dimensional regular grids, with the Z data evaluated at each point.
Here we'll show a three-dimensional contour diagram of a three-dimensional sinusoidal function:
```python
def f(x, y):
return np.sin(np.sqrt(x ** 2 + y ** 2))
x = np.linspace(-6, 6, 30)
y = np.linspace(-6, 6, 30)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
```
```python
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, Z, 50, cmap='binary')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
```
Sometimes the default viewing angle is not optimal, in which case we can use the ``view_init`` method to set the elevation and azimuthal angles. In the following example, we'll use an elevation of 60 degrees (that is, 60 degrees above the x-y plane) and an azimuth of 35 degrees (that is, rotated 35 degrees counter-clockwise about the z-axis):
```python
ax.view_init(60, 35)
fig
```
Again, note that this type of rotation can be accomplished interactively by clicking and dragging when using one of Matplotlib's interactive backends.
## Wireframes and Surface Plots
Two other types of three-dimensional plots that work on gridded data are wireframes and surface plots.
These take a grid of values and project it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualize.
Here's an example of using a wireframe:
```python
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe');
```
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon.
Adding a colormap to the filled polygons can aid perception of the topology of the surface being visualized:
```python
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none')
ax.set_title('surface');
```
Note that though the grid of values for a surface plot needs to be two-dimensional, it need not be rectilinear.
Here is an example of creating a partial polar grid, which when used with ``plot_surface`` can give us a slice into the function we're visualizing:
```python
r = np.linspace(0, 6, 20)
theta = np.linspace(-0.9 * np.pi, 0.8 * np.pi, 40)
r, theta = np.meshgrid(r, theta)
X = r * np.sin(theta)
Y = r * np.cos(theta)
Z = f(X, Y)
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1,
cmap='viridis', edgecolor='none');
```
## Surface Triangulations
For some applications, the evenly sampled grids required by the above routines are overly restrictive and inconvenient.
In these situations, the triangulation-based plots can be very useful.
What if rather than an even draw from a Cartesian or a polar grid, we instead have a set of random draws?
```python
theta = 2 * np.pi * np.random.random(1000)
r = 6 * np.random.random(1000)
x = np.ravel(r * np.sin(theta))
y = np.ravel(r * np.cos(theta))
z = f(x, y)
```
We could create a scatter plot of the points to get an idea of the surface we're sampling from:
```python
ax = plt.axes(projection='3d')
ax.scatter(x, y, z, c=z, cmap='viridis', linewidth=0.5);
```
This leaves a lot to be desired.
The function that will help us in this case is ``ax.plot_trisurf``, which creates a surface by first finding a set of triangles formed between adjacent points (remember that x, y, and z here are one-dimensional arrays):
```python
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z,
cmap='viridis', edgecolor='none');
```
The result is certainly not as clean as when it is plotted with a grid, but the flexibility of such a triangulation allows for some really interesting three-dimensional plots.
For example, it is actually possible to plot a three-dimensional Möbius strip using this, as we'll see next.
### Example: Visualizing a Möbius strip
A Möbius strip is similar to a strip of paper glued into a loop with a half-twist.
Topologically, it's quite interesting because despite appearances it has only a single side!
Here we will visualize such an object using Matplotlib's three-dimensional tools.
The key to creating the Möbius strip is to think about its parametrization: it's a two-dimensional strip, so we need two intrinsic dimensions. Let's call them $\theta$, which ranges from $0$ to $2\pi$ around the loop, and $w$, which ranges from -1 to 1 across the width of the strip:
```python
theta = np.linspace(0, 2 * np.pi, 30)
w = np.linspace(-0.25, 0.25, 8)
w, theta = np.meshgrid(w, theta)
```
Now from this parametrization, we must determine the *(x, y, z)* positions of the embedded strip.
Thinking about it, we might realize that there are two rotations happening: one is the position of the loop about its center (what we've called $\theta$), while the other is the twisting of the strip about its axis (we'll call this $\phi$). For a Möbius strip, the strip must make half a twist during a full loop, or $\Delta\phi = \Delta\theta/2$.
```python
phi = 0.5 * theta
```
Now we use our recollection of trigonometry to derive the three-dimensional embedding.
We'll define $r$, the distance of each point from the center, and use this to find the embedded $(x, y, z)$ coordinates:
```python
# radius in x-y plane
r = 1 + w * np.cos(phi)
x = np.ravel(r * np.cos(theta))
y = np.ravel(r * np.sin(theta))
z = np.ravel(w * np.sin(phi))
```
Finally, to plot the object, we must make sure the triangulation is correct. The best way to do this is to define the triangulation *within the underlying parametrization*, and then let Matplotlib project this triangulation into the three-dimensional space of the Möbius strip.
This can be accomplished as follows:
```python
# triangulate in the underlying parametrization
from matplotlib.tri import Triangulation
tri = Triangulation(np.ravel(w), np.ravel(theta))
ax = plt.axes(projection='3d')
ax.plot_trisurf(x, y, z, triangles=tri.triangles,
cmap='viridis', linewidths=0.2);
ax.set_xlim(-1, 1); ax.set_ylim(-1, 1); ax.set_zlim(-1, 1);
```
Combining all of these techniques, it is possible to create and display a wide variety of three-dimensional objects and patterns in Matplotlib.
<!--NAVIGATION-->
< [Customizing Matplotlib: Configurations and Stylesheets](04.11-Settings-and-Stylesheets.ipynb) | [Contents](Index.ipynb) | [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.12-Three-Dimensional-Plotting.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,400 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Three-Dimensional Plotting in Matplotlib](04.12-Three-Dimensional-Plotting.ipynb) | [Contents](Index.ipynb) | [Visualization with Seaborn](04.14-Visualization-With-Seaborn.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.13-Geographic-Data-With-Basemap.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Geographic Data with Basemap
One common type of visualization in data science is that of geographic data.
Matplotlib's main tool for this type of visualization is the Basemap toolkit, which is one of several Matplotlib toolkits that live under the ``mpl_toolkits`` namespace.
Admittedly, Basemap feels a bit clunky to use, and often even simple visualizations take much longer to render than you might hope.
More modern solutions such as leaflet or the Google Maps API may be a better choice for more intensive map visualizations.
Still, Basemap is a useful tool for Python users to have in their virtual toolbelts.
In this section, we'll show several examples of the type of map visualization that is possible with this toolkit.
Installation of Basemap is straightforward; if you're using conda you can type this and the package will be downloaded:
```
$ conda install basemap
```
We add just a single new import to our standard boilerplate:
```python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
```
Once you have the Basemap toolkit installed and imported, geographic plots are just a few lines away (the graphics in the following also require the ``PIL`` package in Python 2, or the ``pillow`` package in Python 3):
```python
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-100)
m.bluemarble(scale=0.5);
```
The meaning of the arguments to ``Basemap`` will be discussed momentarily.
The useful thing is that the globe shown here is not a mere image; it is a fully-functioning Matplotlib axes that understands spherical coordinates and which allows us to easily overplot data on the map!
For example, we can use a different map projection, zoom in to North America, and plot the location of Seattle.
We'll use an etopo image (which shows topographical features both on land and under the ocean) as the map background:
```python
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)
# Map (long, lat) to (x, y) for plotting
x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);
```
This gives you a brief glimpse into the sort of geographic visualizations that are possible with just a few lines of Python.
We'll now discuss the features of Basemap in more depth, and provide several examples of visualizing map data.
Using these brief examples as building blocks, you should be able to create nearly any map visualization that you desire.
## Map Projections
The first thing to decide when using maps is what projection to use.
You're probably familiar with the fact that it is impossible to project a spherical map, such as that of the Earth, onto a flat surface without somehow distorting it or breaking its continuity.
Map projections have been developed over the course of human history, and there are a lot of choices!
Depending on the intended use of the map projection, there are certain map features (e.g., direction, area, distance, shape, or other considerations) that are useful to maintain.
The Basemap package implements several dozen such projections, all referenced by a short format code.
Here we'll briefly demonstrate some of the more common ones.
We'll start by defining a convenience routine to draw our world map along with the longitude and latitude lines:
```python
from itertools import chain
def draw_map(m, scale=0.2):
# draw a shaded-relief image
m.shadedrelief(scale=scale)
# lats and longs are returned as a dictionary
lats = m.drawparallels(np.linspace(-90, 90, 13))
lons = m.drawmeridians(np.linspace(-180, 180, 13))
# keys contain the plt.Line2D instances
lat_lines = chain(*(tup[1][0] for tup in lats.items()))
lon_lines = chain(*(tup[1][0] for tup in lons.items()))
all_lines = chain(lat_lines, lon_lines)
# cycle through these lines and set the desired style
for line in all_lines:
line.set(linestyle='-', alpha=0.3, color='w')
```
### Cylindrical projections
The simplest of map projections are cylindrical projections, in which lines of constant latitude and longitude are mapped to horizontal and vertical lines, respectively.
This type of mapping represents equatorial regions quite well, but results in extreme distortions near the poles.
The spacing of latitude lines varies between different cylindrical projections, leading to different conservation properties, and different distortion near the poles.
In the following figure we show an example of the *equidistant cylindrical projection*, which chooses a latitude scaling that preserves distances along meridians.
Other cylindrical projections are the Mercator (``projection='merc'``) and the cylindrical equal area (``projection='cea'``) projections.
```python
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='cyl', resolution=None,
llcrnrlat=-90, urcrnrlat=90,
llcrnrlon=-180, urcrnrlon=180, )
draw_map(m)
```
The additional arguments to Basemap for this view specify the latitude (``lat``) and longitude (``lon``) of the lower-left corner (``llcrnr``) and upper-right corner (``urcrnr``) for the desired map, in units of degrees.
### Pseudo-cylindrical projections
Pseudo-cylindrical projections relax the requirement that meridians (lines of constant longitude) remain vertical; this can give better properties near the poles of the projection.
The Mollweide projection (``projection='moll'``) is one common example of this, in which all meridians are elliptical arcs.
It is constructed so as to preserve area across the map: though there are distortions near the poles, the area of small patches reflects the true area.
Other pseudo-cylindrical projections are the sinusoidal (``projection='sinu'``) and Robinson (``projection='robin'``) projections.
```python
fig = plt.figure(figsize=(8, 6), edgecolor='w')
m = Basemap(projection='moll', resolution=None,
lat_0=0, lon_0=0)
draw_map(m)
```
The extra arguments to Basemap here refer to the central latitude (``lat_0``) and longitude (``lon_0``) for the desired map.
### Perspective projections
Perspective projections are constructed using a particular choice of perspective point, similar to photographing the Earth from a particular point in space (a point which, for some projections, technically lies within the Earth!).
One common example is the orthographic projection (``projection='ortho'``), which shows one side of the globe as seen from a viewer at a very long distance. As such, it can show only half the globe at a time.
Other perspective-based projections include the gnomonic projection (``projection='gnom'``) and stereographic projection (``projection='stere'``).
These are often the most useful for showing small portions of the map.
Here is an example of the orthographic projection:
```python
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None,
lat_0=50, lon_0=0)
draw_map(m);
```
### Conic projections
A conic projection projects the map onto a single cone, which is then unrolled.
This can lead to very good local properties, but regions far from the focus point of the cone may become very distorted.
One example of this is the Lambert Conformal Conic projection (``projection='lcc'``), which we saw earlier in the map of North America.
It projects the map onto a cone arranged in such a way that two standard parallels (specified in Basemap by ``lat_1`` and ``lat_2``) have well-represented distances, with scale decreasing between them and increasing outside of them.
Other useful conic projections are the equidistant conic projection (``projection='eqdc'``) and the Albers equal-area projection (``projection='aea'``).
Conic projections, like perspective projections, tend to be good choices for representing small to medium patches of the globe.
```python
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution=None,
lon_0=0, lat_0=50, lat_1=45, lat_2=55,
width=1.6E7, height=1.2E7)
draw_map(m)
```
### Other projections
If you're going to do much with map-based visualizations, I encourage you to read up on other available projections, along with their properties, advantages, and disadvantages.
Most likely, they are available in the [Basemap package](http://matplotlib.org/basemap/users/mapsetup.html).
If you dig deep enough into this topic, you'll find an incredible subculture of geo-viz geeks who will be ready to argue fervently in support of their favorite projection for any given application!
## Drawing a Map Background
Earlier we saw the ``bluemarble()`` and ``shadedrelief()`` methods for projecting global images on the map, as well as the ``drawparallels()`` and ``drawmeridians()`` methods for drawing lines of constant latitude and longitude.
The Basemap package contains a range of useful functions for drawing borders of physical features like continents, oceans, lakes, and rivers, as well as political boundaries such as countries and US states and counties.
The following are some of the available drawing functions that you may wish to explore using IPython's help features:
- **Physical boundaries and bodies of water**
- ``drawcoastlines()``: Draw continental coast lines
- ``drawlsmask()``: Draw a mask between the land and sea, for use with projecting images on one or the other
- ``drawmapboundary()``: Draw the map boundary, including the fill color for oceans.
- ``drawrivers()``: Draw rivers on the map
- ``fillcontinents()``: Fill the continents with a given color; optionally fill lakes with another color
- **Political boundaries**
- ``drawcountries()``: Draw country boundaries
- ``drawstates()``: Draw US state boundaries
- ``drawcounties()``: Draw US county boundaries
- **Map features**
- ``drawgreatcircle()``: Draw a great circle between two points
- ``drawparallels()``: Draw lines of constant latitude
- ``drawmeridians()``: Draw lines of constant longitude
- ``drawmapscale()``: Draw a linear scale on the map
- **Whole-globe images**
- ``bluemarble()``: Project NASA's blue marble image onto the map
- ``shadedrelief()``: Project a shaded relief image onto the map
- ``etopo()``: Draw an etopo relief image onto the map
- ``warpimage()``: Project a user-provided image onto the map
For the boundary-based features, you must set the desired resolution when creating a Basemap image.
The ``resolution`` argument of the ``Basemap`` class sets the level of detail in boundaries, either ``'c'`` (crude), ``'l'`` (low), ``'i'`` (intermediate), ``'h'`` (high), ``'f'`` (full), or ``None`` if no boundaries will be used.
This choice is important: setting high-resolution boundaries on a global map, for example, can be *very* slow.
Here's an example of drawing land/sea boundaries, and the effect of the resolution parameter.
We'll create both a low- and high-resolution map of Scotland's beautiful Isle of Skye.
It's located at 57.3°N, 6.2°W, and a map of 90,000 × 120,000 meters (90 km × 120 km, matching the ``width`` and ``height`` arguments below) shows it well:
```python
fig, ax = plt.subplots(1, 2, figsize=(12, 8))
for i, res in enumerate(['l', 'h']):
m = Basemap(projection='gnom', lat_0=57.3, lon_0=-6.2,
width=90000, height=120000, resolution=res, ax=ax[i])
m.fillcontinents(color="#FFDDCC", lake_color='#DDEEFF')
m.drawmapboundary(fill_color="#DDEEFF")
m.drawcoastlines()
ax[i].set_title("resolution='{0}'".format(res));
```
Notice that the low-resolution coastlines are not suitable for this level of zoom, while high-resolution works just fine.
The low level would work just fine for a global view, however, and would be *much* faster than loading the high-resolution border data for the entire globe!
It might require some experimentation to find the correct resolution parameter for a given view: the best route is to start with a fast, low-resolution plot and increase the resolution as needed.
## Plotting Data on Maps
Perhaps the most useful piece of the Basemap toolkit is the ability to over-plot a variety of data onto a map background.
For simple plotting and text, any ``plt`` function works on the map; you can use the ``Basemap`` instance to project latitude and longitude coordinates to ``(x, y)`` coordinates for plotting with ``plt``, as we saw earlier in the Seattle example.
In addition to this, there are many map-specific functions available as methods of the ``Basemap`` instance.
These work very similarly to their standard Matplotlib counterparts, but have an additional Boolean argument ``latlon``, which if set to ``True`` allows you to pass raw latitudes and longitudes to the method, rather than projected ``(x, y)`` coordinates.
Some of these map-specific methods are:
- ``contour()``/``contourf()`` : Draw contour lines or filled contours
- ``imshow()``: Draw an image
- ``pcolor()``/``pcolormesh()`` : Draw a pseudocolor plot for irregular/regular meshes
- ``plot()``: Draw lines and/or markers.
- ``scatter()``: Draw points with markers.
- ``quiver()``: Draw vectors.
- ``barbs()``: Draw wind barbs.
- ``drawgreatcircle()``: Draw a great circle.
We'll see some examples of a few of these as we continue.
For more information on these functions, including several example plots, see the [online Basemap documentation](http://matplotlib.org/basemap/).
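As one brief illustration (a minimal sketch, assuming Basemap is installed and imported as above), ``drawgreatcircle`` takes a pair of longitude/latitude endpoints and draws the great-circle route between them:
```python
# Draw the great-circle route between Seattle and London
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', resolution=None, lat_0=50, lon_0=-60)
m.bluemarble(scale=0.5)
m.drawgreatcircle(-122.3, 47.6, -0.1, 51.5,  # lon1, lat1, lon2, lat2
                  linewidth=2, color='yellow');
```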
## Example: California Cities
Recall that in [Customizing Plot Legends](04.06-Customizing-Legends.ipynb), we demonstrated the use of size and color in a scatter plot to convey information about the location, size, and population of California cities.
Here, we'll create this plot again, but using Basemap to put the data in context.
We start with loading the data, as we did before:
```python
import pandas as pd
cities = pd.read_csv('data/california_cities.csv')
# Extract the data we're interested in
lat = cities['latd'].values
lon = cities['longd'].values
population = cities['population_total'].values
area = cities['area_total_km2'].values
```
Next, we set up the map projection, scatter the data, and then create a colorbar and legend:
```python
# 1. Draw the map background
fig = plt.figure(figsize=(8, 8))
m = Basemap(projection='lcc', resolution='h',
lat_0=37.5, lon_0=-119,
width=1E6, height=1.2E6)
m.shadedrelief()
m.drawcoastlines(color='gray')
m.drawcountries(color='gray')
m.drawstates(color='gray')
# 2. scatter city data, with color reflecting population
# and size reflecting area
m.scatter(lon, lat, latlon=True,
c=np.log10(population), s=area,
cmap='Reds', alpha=0.5)
# 3. create colorbar and legend
plt.colorbar(label=r'$\log_{10}({\rm population})$')
plt.clim(3, 7)
# make legend with dummy points
for a in [100, 300, 500]:
plt.scatter([], [], c='k', alpha=0.5, s=a,
label=str(a) + ' km$^2$')
plt.legend(scatterpoints=1, frameon=False,
labelspacing=1, loc='lower left');
```
This shows us roughly where larger populations of people have settled in California: they are clustered near the coast in the Los Angeles and San Francisco areas, stretched along the highways in the flat central valley, and avoiding almost completely the mountainous regions along the borders of the state.
## Example: Surface Temperature Data
As an example of visualizing some more continuous geographic data, let's consider the "polar vortex" that hit the eastern half of the United States in January of 2014.
A great source for any sort of climatic data is [NASA's Goddard Institute for Space Studies](http://data.giss.nasa.gov/).
Here we'll use the GISTEMP 250 temperature data, which we can download using shell commands (these commands may have to be modified on Windows machines).
The data used here was downloaded on 6/12/2016, and the file size is approximately 9MB:
```python
# !curl -O http://data.giss.nasa.gov/pub/gistemp/gistemp250.nc.gz
# !gunzip gistemp250.nc.gz
```
The data comes in NetCDF format, which can be read in Python by the ``netCDF4`` library.
You can install this library as shown here
```
$ conda install netcdf4
```
We read the data as follows:
```python
from netCDF4 import Dataset
data = Dataset('gistemp250.nc')
```
The file contains many global temperature readings on a variety of dates; we need to select the index of the date we're interested in—in this case, January 15, 2014:
```python
from netCDF4 import date2index
from datetime import datetime
timeindex = date2index(datetime(2014, 1, 15),
data.variables['time'])
```
Now we can load the latitude and longitude data, as well as the temperature anomaly for this index:
```python
lat = data.variables['lat'][:]
lon = data.variables['lon'][:]
lon, lat = np.meshgrid(lon, lat)
temp_anomaly = data.variables['tempanomaly'][timeindex]
```
Finally, we'll use the ``pcolormesh()`` method to draw a color mesh of the data.
We'll look at North America, and use a shaded relief map in the background.
Note that for this data we specifically chose a divergent colormap, which has a neutral color at zero and two contrasting colors at negative and positive values.
We'll also lightly draw the coastlines over the colors for reference:
```python
fig = plt.figure(figsize=(10, 8))
m = Basemap(projection='lcc', resolution='c',
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.shadedrelief(scale=0.5)
m.pcolormesh(lon, lat, temp_anomaly,
latlon=True, cmap='RdBu_r')
plt.clim(-8, 8)
m.drawcoastlines(color='lightgray')
plt.title('January 2014 Temperature Anomaly')
plt.colorbar(label='temperature anomaly (°C)');
```
The data paints a picture of the localized, extreme temperature anomalies that happened during that month.
The eastern half of the United States was much colder than normal, while the western half and Alaska were much warmer.
Regions with no recorded temperature show the map background.
<!--NAVIGATION-->
< [Three-Dimensional Plotting in Matplotlib](04.12-Three-Dimensional-Plotting.ipynb) | [Contents](Index.ipynb) | [Visualization with Seaborn](04.14-Visualization-With-Seaborn.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.13-Geographic-Data-With-Basemap.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

File diff suppressed because one or more lines are too long


@ -0,0 +1,398 @@
---
jupyter:
jupytext:
formats: ipynb,md
text_representation:
extension: .md
format_name: markdown
format_version: '1.3'
jupytext_version: 1.13.0
kernelspec:
display_name: Python 3
language: python
name: python3
---
<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="figures/PDSH-cover-small.png">
*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*
<!--NAVIGATION-->
< [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb) | [Contents](Index.ipynb) | [Further Resources](04.15-Further-Resources.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>
# Visualization with Seaborn
Matplotlib has proven to be an incredibly useful and popular visualization tool, but even avid users will admit it often leaves much to be desired.
There are several valid complaints about Matplotlib that often come up:
- Prior to version 2.0, Matplotlib's defaults were not exactly the best choices. They were based on MATLAB circa 1999, and this often shows.
- Matplotlib's API is relatively low level. Doing sophisticated statistical visualization is possible, but often requires a *lot* of boilerplate code.
- Matplotlib predated Pandas by more than a decade, and thus is not designed for use with Pandas ``DataFrame``s. In order to visualize data from a Pandas ``DataFrame``, you must extract each ``Series`` and often concatenate them together into the right format. It would be nicer to have a plotting library that can intelligently use the ``DataFrame`` labels in a plot.
An answer to these problems is [Seaborn](http://seaborn.pydata.org/). Seaborn provides an API on top of Matplotlib that offers sane choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas ``DataFrame``s.
To be fair, the Matplotlib team is addressing this: it has recently added the ``plt.style`` tools discussed in [Customizing Matplotlib: Configurations and Style Sheets](04.11-Settings-and-Stylesheets.ipynb), and is starting to handle Pandas data more seamlessly.
The 2.0 release of the library will include a new default stylesheet that will improve on the current status quo.
But for all the reasons just discussed, Seaborn remains an extremely useful addon.
## Seaborn Versus Matplotlib
Here is an example of a simple random-walk plot in Matplotlib, using its classic plot formatting and colors.
We start with the typical imports:
```python
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline
import numpy as np
import pandas as pd
```
Now we create some random walk data:
```python
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)
```
And do a simple plot:
```python
# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
```
Although the result contains all the information we'd like it to convey, it does so in a way that is not all that aesthetically pleasing, and even looks a bit old-fashioned in the context of 21st-century data visualization.
Now let's take a look at how it works with Seaborn.
As we will see, Seaborn has many of its own high-level plotting routines, but it can also overwrite Matplotlib's default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output.
We can set the style by calling Seaborn's ``set()`` method.
By convention, Seaborn is imported as ``sns``:
```python
import seaborn as sns
sns.set()
```
Now let's rerun the same two lines as before:
```python
# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');
```
Ah, much better!
## Exploring Seaborn Plots
The main idea of Seaborn is that it provides high-level commands to create a variety of plot types useful for statistical data exploration, and even some statistical model fitting.
Let's take a look at a few of the datasets and plot types available in Seaborn. Note that all of the following *could* be done using raw Matplotlib commands (this is, in fact, what Seaborn does under the hood) but the Seaborn API is much more convenient.
### Histograms, KDE, and densities
Often in statistical data visualization, all you want is to plot histograms and joint distributions of variables.
We have seen that this is relatively straightforward in Matplotlib:
```python
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    plt.hist(data[col], density=True, alpha=0.5)
```
Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation, which Seaborn does with ``sns.kdeplot``:
```python
for col in 'xy':
sns.kdeplot(data[col], shade=True)
```
Histograms and KDE can be combined using ``distplot``:
```python
sns.distplot(data['x'])
sns.distplot(data['y']);
```
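Note that in recent Seaborn releases (0.11 and later) ``distplot`` is deprecated; a roughly equivalent figure can be produced with ``histplot`` (or the figure-level ``displot``). A sketch of the newer call, under that assumption:

```python
# Newer API: histogram on a density scale with an overlaid KDE
sns.histplot(data['x'], kde=True, stat='density')
sns.histplot(data['y'], kde=True, stat='density');
```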
If we pass the full two-dimensional dataset to ``kdeplot``, we will get a two-dimensional visualization of the data:
```python
sns.kdeplot(data);
```
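(Be aware that in recent Seaborn versions, passing a wide-form ``DataFrame`` to ``kdeplot`` draws a separate one-dimensional curve per column; to request the bivariate density explicitly you name the two variables, roughly as follows:)

```python
# Bivariate KDE using the newer keyword-based interface (Seaborn >= 0.11)
sns.kdeplot(data=data, x='x', y='y');
```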
We can see the joint distribution and the marginal distributions together using ``sns.jointplot``.
For this plot, we'll set the style to a white background:
```python
with sns.axes_style('white'):
sns.jointplot("x", "y", data, kind='kde');
```
There are other parameters that can be passed to ``jointplot``—for example, we can use a hexagonally based histogram instead:
```python
with sns.axes_style('white'):
sns.jointplot("x", "y", data, kind='hex')
```
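If you are running a recent Seaborn version (0.12 or later), note that ``jointplot`` and the other functions used in this section no longer accept the data and variable names positionally; a sketch of the keyword-based equivalent:

```python
# Keyword-only form required by Seaborn >= 0.12
with sns.axes_style('white'):
    sns.jointplot(data=data, x="x", y="y", kind='hex')
```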
### Pair plots
When you generalize joint plots to datasets of larger dimensions, you end up with *pair plots*. This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other.
We'll demo this with the well-known Iris dataset, which lists measurements of petals and sepals of three iris species:
```python
iris = sns.load_dataset("iris")
iris.head()
```
Visualizing the multidimensional relationships among the samples is as easy as calling ``sns.pairplot``:
```python
sns.pairplot(iris, hue='species', size=2.5);
```
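In newer Seaborn versions the ``size`` keyword of ``pairplot`` has been renamed ``height``; assuming Seaborn 0.9 or later, the equivalent call is:

```python
# `height` (the size of each facet in inches) replaces the old `size` keyword
sns.pairplot(iris, hue='species', height=2.5);
```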
### Faceted histograms
Sometimes the best way to view data is via histograms of subsets. Seaborn's ``FacetGrid`` makes this extremely simple.
We'll take a look at some data that shows the amount that restaurant staff receive in tips based on various indicator data:
```python
tips = sns.load_dataset('tips')
tips.head()
```
```python
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
```
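In recent Seaborn versions, the same faceted histogram can also be drawn with the figure-level ``displot`` function, which manages the grid for you; a rough sketch, assuming Seaborn 0.11 or later:

```python
# Figure-level alternative to the manual FacetGrid + plt.hist recipe
sns.displot(data=tips, x="tip_pct", row="sex", col="time",
            bins=np.linspace(0, 40, 15));
```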
### Factor plots
Factor plots can be useful for this kind of visualization as well. They allow you to view the distribution of a parameter within bins defined by any other parameter:
```python
with sns.axes_style(style='ticks'):
g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box")
g.set_axis_labels("Day", "Total Bill");
```
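In newer Seaborn versions ``factorplot`` has been renamed ``catplot``; assuming Seaborn 0.9 or later, a roughly equivalent call is:

```python
# `catplot` is the newer name for `factorplot`; the plot kind is passed the same way
with sns.axes_style(style='ticks'):
    g = sns.catplot(x="day", y="total_bill", hue="sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");
```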
### Joint distributions
Similar to the pairplot we saw earlier, we can use ``sns.jointplot`` to show the joint distribution between different datasets, along with the associated marginal distributions:
```python
with sns.axes_style('white'):
sns.jointplot("total_bill", "tip", data=tips, kind='hex')
```
The joint plot can even do some automatic kernel density estimation and regression:
```python
sns.jointplot("total_bill", "tip", data=tips, kind='reg');
```
### Bar plots
Time series can be plotted using ``sns.factorplot``. In the following example, we'll use the Planets data that we first saw in [Aggregation and Grouping](03.08-Aggregation-and-Grouping.ipynb):
```python
planets = sns.load_dataset('planets')
planets.head()
```
```python
with sns.axes_style('white'):
g = sns.factorplot("year", data=planets, aspect=2,
kind="count", color='steelblue')
g.set_xticklabels(step=5)
```
We can learn more by looking at the *method* of discovery of each of these planets:
```python
with sns.axes_style('white'):
g = sns.factorplot("year", data=planets, aspect=4.0, kind='count',
hue='method', order=range(2001, 2015))
g.set_ylabels('Number of Planets Discovered')
```
For more information on plotting with Seaborn, see the [Seaborn documentation](http://seaborn.pydata.org/), the [tutorial](http://seaborn.pydata.org/tutorial.html), and the [Seaborn gallery](http://seaborn.pydata.org/examples/index.html).
## Example: Exploring Marathon Finishing Times
Here we'll look at using Seaborn to help visualize and understand finishing results from a marathon.
I've scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded
(if you are interested in using Python for web scraping, I would recommend [*Web Scraping with Python*](http://shop.oreilly.com/product/0636920034391.do) by Ryan Mitchell).
We will start by downloading the data from the Web and loading it into Pandas:
```python
# !curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
```
```python
data = pd.read_csv('marathon-data.csv')
data.head()
```
By default, Pandas loaded the time columns as Python strings (type ``object``); we can see this by looking at the ``dtypes`` attribute of the DataFrame:
```python
data.dtypes
```
Let's fix this by providing a converter for the times:
```python
import datetime
def convert_time(s):
h, m, s = map(int, s.split(':'))
return datetime.timedelta(hours=h, minutes=m, seconds=s)
data = pd.read_csv('marathon-data.csv',
converters={'split':convert_time, 'final':convert_time})
data.head()
```
```python
data.dtypes
```
That looks much better. For the purpose of our Seaborn plotting utilities, let's next add columns that give the times in seconds:
```python
data['split_sec'] = data['split'].astype(int) / 1E9
data['final_sec'] = data['final'].astype(int) / 1E9
data.head()
```
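The integer cast above relies on the fact that Pandas stores these timedeltas internally as nanoseconds; a perhaps more self-explanatory equivalent uses the ``.dt.total_seconds()`` accessor:

```python
# Equivalent conversion via the timedelta accessor, avoiding the nanosecond cast
data['split_sec'] = data['split'].dt.total_seconds()
data['final_sec'] = data['final'].dt.total_seconds()
```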
To get an idea of what the data looks like, we can plot a ``jointplot`` over the data:
```python
with sns.axes_style('white'):
g = sns.jointplot("split_sec", "final_sec", data, kind='hex')
g.ax_joint.plot(np.linspace(4000, 16000),
np.linspace(8000, 32000), ':k')
```
The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.
Let's create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
```python
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head()
```
Where this split difference is less than zero, the person negative-split the race by that fraction.
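To make the sign convention concrete, here is a small sketch with made-up times (in seconds): even pacing gives a split fraction of zero, slowing down in the second half gives a positive value, and a negative split gives a negative value:

```python
# Illustrative, made-up split times (seconds)
def split_frac(split_sec, final_sec):
    return 1 - 2 * split_sec / final_sec

print(split_frac(5400, 10800))   # even pacing: 0.0
print(split_frac(5200, 10800))   # slower second half (positive split): ~ +0.037
print(split_frac(5600, 10800))   # faster second half (negative split): ~ -0.037
```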
Let's do a distribution plot of this split fraction:
```python
sns.distplot(data['split_frac'], kde=False);
plt.axvline(0, color="k", linestyle="--");
```
```python
sum(data.split_frac < 0)
```
Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.
Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a ``PairGrid``, which draws plots of all these correlations:
```python
g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();
```
It looks like the split fraction does not correlate particularly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time.
(We see here that Seaborn is no panacea for Matplotlib's ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in [Customizing Ticks](04.10-Customizing-Ticks.ipynb) can be used to adjust such things if desired.)
The difference between men and women here is interesting. Let's look at the histogram of split fractions for these two groups:
```python
sns.kdeplot(data.split_frac[data.gender=='M'], label='men', shade=True)
sns.kdeplot(data.split_frac[data.gender=='W'], label='women', shade=True)
plt.xlabel('split_frac');
```
The interesting thing here is that there are many more men than women who are running close to an even split!
This almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions as a function of age.
A nice way to compare distributions is to use a *violin plot*:
```python
sns.violinplot("gender", "split_frac", data=data,
palette=["lightblue", "lightpink"]);
```
This is yet another way to compare the distributions between men and women.
Let's look a little deeper, and compare these violin plots as a function of age. We'll start by creating a new column in the DataFrame that specifies the decade of age that each person is in:
```python
data['age_dec'] = data.age.map(lambda age: 10 * (age // 10))
data.head()
```
```python
men = (data.gender == 'M')
women = (data.gender == 'W')
with sns.axes_style(style=None):
sns.violinplot("age_dec", "split_frac", hue="gender", data=data,
split=True, inner="quartile",
palette=["lightblue", "lightpink"]);
```
Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
Also surprisingly, the 80-year-old women seem to outperform *everyone* in terms of their split time. This is probably due to the fact that we're estimating the distribution from small numbers, as there are only a handful of runners in that range:
```python
(data.age > 80).sum()
```
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use ``lmplot``, which will automatically fit a linear regression to the data:
```python
g = sns.lmplot('final_sec', 'split_frac', col='gender', data=data,
markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");
```
Apparently the people with fast splits are the elite runners who are finishing within ~15,000 seconds, or about 4 hours. People slower than that are much less likely to have a fast second split.
<!--NAVIGATION-->
< [Geographic Data with Basemap](04.13-Geographic-Data-With-Basemap.ipynb) | [Contents](Index.ipynb) | [Further Resources](04.15-Further-Resources.ipynb) >
<a href="https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/04.14-Visualization-With-Seaborn.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>