Data Carpentry Tutorials

Data workflows and automation

Sept 21, 2020

Alison Fowler

In [4]:
%cd /Users/alisonfowler/Docs/Jupityr-notebooks/
/Users/alisonfowler/Docs/Jupityr-notebooks
In [5]:
import pandas as pd
In [6]:
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
In [7]:
print(animals)
['lion', 'tiger', 'crocodile', 'vulture', 'hippo']
In [8]:
for creature in animals: 
    print (creature)
lion
tiger
crocodile
vulture
hippo
In [ ]:
# In this example, creature is the loop variable that takes the value of the next entry in animals every time the loop goes 
# around. We can call the loop variable anything we like. After the loop finishes, the loop variable will still exist and 
# will have the value of the last entry in the collection:
In [9]:
for creature in animals:
    pass
In [10]:
print('The loop variable is now: ' + creature)
The loop variable is now: hippo
In [ ]:
# We are not asking Python to print the value of the loop variable anymore, but the for loop still runs and 
# the value of creature changes on each pass through the loop. The statement pass in the body of the loop means “do nothing”.
In [ ]:
# Challenge - What happens if we don't include the 'pass' statement?
In [11]:
for creature in animals: 
    print('The loop variable is now: ' + creature)
The loop variable is now: lion
The loop variable is now: tiger
The loop variable is now: crocodile
The loop variable is now: vulture
The loop variable is now: hippo
In [ ]:
# Rewrite the loop so that the animals are separated by commas, not new lines 
# (Hint: You can concatenate strings using a plus sign. 
# For example, print(string1 + string2) outputs ‘string1string2’).
In [19]:
for creature in animals: 
    print(creature + ",")
lion,
tiger,
crocodile,
vulture,
hippo,
In [ ]:
# not sure how to get them all on the same line
In [ ]:
# Automating data processing using For Loops
In [14]:
import os
In [ ]:
# make a new directory inside the folder data to store all these files using the module "os"
In [18]:
os.mkdir("data/yearly_files")
In [16]:
os.listdir('data')
Out[16]:
['surveys2001.csv',
 'surveys2002.csv',
 'speciesSubset.csv',
 'bouldercreek_09_2013.txt',
 'species.csv',
 '.DS_Store',
 'yearly_files',
 'output',
 'surveys.csv',
 'portal_mammals.sqlite',
 'README.txt',
 'surveys_complete.csv',
 'plots.csv',
 'surveys 2.csv']
In [17]:
# load the data into a DataFrame
surveys_df = pd.read_csv("data/surveys.csv")
In [19]:
# select only data for the year 2002
surveys2002 = surveys_df[surveys_df.year == 2002]
In [20]:
# write the new DataFrame to a CSV file 
surveys2002.to_csv('data/yearly_files/surveys2002.csv')
In [21]:
# let's write a loop that does this for all the years! 
surveys_df['year']
Out[21]:
0        1977
1        1977
2        1977
3        1977
4        1977
         ... 
35544    2002
35545    2002
35546    2002
35547    2002
35548    2002
Name: year, Length: 35549, dtype: int64
In [ ]:
# we want only unique years 
In [23]:
surveys_df['year'].unique()
Out[23]:
array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])
In [24]:
for year in surveys_df['year'].unique():
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    print(filename)
data/yearly_files/surveys1977.csv
data/yearly_files/surveys1978.csv
data/yearly_files/surveys1979.csv
data/yearly_files/surveys1980.csv
data/yearly_files/surveys1981.csv
data/yearly_files/surveys1982.csv
data/yearly_files/surveys1983.csv
data/yearly_files/surveys1984.csv
data/yearly_files/surveys1985.csv
data/yearly_files/surveys1986.csv
data/yearly_files/surveys1987.csv
data/yearly_files/surveys1988.csv
data/yearly_files/surveys1989.csv
data/yearly_files/surveys1990.csv
data/yearly_files/surveys1991.csv
data/yearly_files/surveys1992.csv
data/yearly_files/surveys1993.csv
data/yearly_files/surveys1994.csv
data/yearly_files/surveys1995.csv
data/yearly_files/surveys1996.csv
data/yearly_files/surveys1997.csv
data/yearly_files/surveys1998.csv
data/yearly_files/surveys1999.csv
data/yearly_files/surveys2000.csv
data/yearly_files/surveys2001.csv
data/yearly_files/surveys2002.csv
In [ ]:
# we can now add the rest of the steps we need to create separate text files
In [25]:
# load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')
In [26]:
for year in surveys_df['year'].unique(): 
    
    # select data for the year 
    surveys_year = surveys_df[surveys_df.year == year]
    
    # write the new DataFrame to a csv file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
In [ ]:
# Writing unique file names 
In [27]:
filename = 'data/yearly_files/surveys' + str(year) + '.csv'
In [ ]:
# Let’s break down the parts of this name:

# The first part is some text that specifies the directory to store our data file in (data/yearly_files/) and the 
# first part of the file name (surveys): 'data/yearly_files/surveys'

# We can concatenate this with the value of a variable, in this case year by using the plus + sign and the variable we
# want to add to the file name: + str(year)

# Then we add the file extension as another text string: + '.csv'

# Notice that we use sinlge quotes to add text strings. 
# The variable is not surrounded by quotes. This code produces the string data/yearly_files/surveys2002.csv which contrains 
# the path to the new filename AND the file name itself.
In [ ]:
# Challenge - Modifying loops 
# some of the surveys you saved are missing data (they have null values that show 
# up as NaN - Not a number - in the dataframes and do not show up in the text files). 
# Modify the for loop so that the entries with null values are not included in the yearly files. 
In [39]:
for year in surveys_df['year'].unique(): 
    
    # select data for the year 
    surveys_year = surveys_df[surveys_df.year == year]
    
    # remove NaNs 
    surveys_year = surveys_year.dropna(axis=0)
    
    # write the new DataFrame to a csv file
    filename = 'data/yearly_files/surveys' + str(year) + '.csv'
    surveys_year.to_csv(filename)
In [ ]:
# Let's say you only want to look at data from a given multiple of years. 
# How would you modify your loop in order to generate a data file for only every 
# 5th year, starting from 1977? 
In [ ]:
# Instead of splitting out the data by years, a colleague wants to do analyses for 
# each species separately. How would you write a unique CSV file for each species? 
In [ ]:
 
In [ ]: