Exploring Weather Data

Joao Martins

Outline

To compare annual average temperature records from Portugal against corresponding world annual average temperature records between 1750 and 2013, we take the following steps:

  1. Download regional and global temperature trends from http://berkeleyearth.lbl.gov using the command line (see Bash script in annex);

  2. Load, prepare, and visualize the trends using Python (see script in annex):
    2.1. Open and clean the CSV tables with pandas;
    2.2. Produce a line chart showing a moving average of the temperatures with plotnine. plotnine is based on Matplotlib and implements ggplot2’s grammar of graphics, which makes it well suited to exploratory analyses. Its moving-average smoothing method is based on the pandas rolling() method.

  3. Display the results in R Markdown, following the tufte layout.
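
The smoothing in step 2.2 can also be reproduced outside plotnine. The following is a minimal sketch, assuming a trailing 10-year window computed with the pandas `Series.rolling()` method; the series values are made up for illustration, and the window alignment plotnine uses internally may differ.

```python
import pandas as pd

# Hypothetical annual temperatures (made-up values, not Berkeley Earth data)
annual = pd.Series(
    [8.0, 9.0, 7.0, 8.0, 10.0, 9.0, 8.0, 7.0, 9.0, 10.0, 11.0, 9.0],
    index=range(1750, 1762),
    name='absolute',
)

# Trailing moving average: each value is the mean of the current year
# and the 9 years before it; years without a full window stay NaN
smoothed = annual.rolling(window=10).mean()

print(smoothed.dropna())
```

With a window of 10, the first nine years have no smoothed value, which is why a moving-average line starts later than the dots it summarizes.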

Observations

Average temperatures (°C) between 1750 and 2013. Each dot represents the average temperature over 1 year. Lines show a moving average temperature over a time period of 10 years. Temperatures in Portugal are shown in red, world temperatures in blue.

Figure 1 suggests three observations:

  1. Over the roughly 260 years covered (1750 to 2013), Portugal has been consistently warmer than the world average by about 7°C.

  2. Year-over-year temperatures appear to oscillate more at the regional level: both the ruggedness of the moving average and the dispersion of the data points seem higher for Portugal. By contrast, world average temperatures appear more consistent from one year to the next, possibly because regional fluctuations average out at the global scale.

  3. The 20th century appears to show a regional and global warming trend.
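
These observations are read off the chart, but they can also be checked numerically. The sketch below assumes a data frame shaped like the `temperatures` table built in the annex (columns `year`, `absolute`, `location`). The helper names and the toy values are hypothetical and only illustrate the calculations; they are not results from the Berkeley Earth data.

```python
import numpy as np
import pandas as pd

def mean_gap(df, a='Portugal', b='World'):
    """Average difference (a minus b) over the years both locations share."""
    wide = df.pivot(index='year', columns='location', values='absolute').dropna()
    return (wide[a] - wide[b]).mean()

def year_over_year_spread(df):
    """Standard deviation of year-to-year changes, per location."""
    ordered = df.sort_values('year')
    return ordered.groupby('location')['absolute'].apply(lambda s: s.diff().std())

def linear_trend(df, start, end):
    """Least-squares slope in degrees per year, per location, for start..end."""
    period = df[(df['year'] >= start) & (df['year'] <= end)]
    return pd.Series({
        loc: np.polyfit(g['year'], g['absolute'], 1)[0]
        for loc, g in period.groupby('location')
    })

# Toy data (made-up values) in the same layout as the annex's table
toy = pd.DataFrame({
    'year': [1900, 1901, 1902, 1903] * 2,
    'absolute': [15.0, 16.0, 15.0, 16.0, 8.0, 8.1, 8.2, 8.3],
    'location': ['Portugal'] * 4 + ['World'] * 4,
})

gap = mean_gap(toy)
spread = year_over_year_spread(toy)
trend = linear_trend(toy, 1900, 1903)
print(gap, spread, trend, sep='\n')
```

A straight-line fit is only one way to summarize a trend; it is used here because it yields a single number per location that is easy to compare.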

Annex

Bash

#!/bin/bash
echo 'parsing monthly average temperatures for Portugal...';
curl -s http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/portugal-TAVG-Trend.txt \
  | grep -E -A 2 "^% Estimated Jan.*monthly" \
  | tail -n 1 | tr -d "%" | tr -s '[:blank:]' \
  | cut -c 2- \
  | tr ' ' '\n' \
  > portugal_monthly_avg.csv;
echo 'parsing monthly historic temperature anomalies for Portugal...';
curl -s http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/portugal-TAVG-Trend.txt \
  | grep -Ev "^%|^ ?$" \
  | tr -s '[:blank:]' \
  | cut -c 2- \
  | cut -d ' ' -f 1,2,3 \
  | tr ' ' ',' \
  > portugal_monthly_anom.csv;
echo 'parsing monthly average temperatures worldwide...';
curl -s http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/global-land-TAVG-Trend.txt \
  | grep -E -A 2 "^% Estimated Jan.*monthly" \
  | tail -n 1 | tr -d "%" | tr -s '[:blank:]' \
  | cut -c 2- \
  | tr ' ' '\n' \
  > world_monthly_avg.csv;
echo 'parsing monthly historic temperature anomalies worldwide...';
curl -s http://berkeleyearth.lbl.gov/auto/Regional/TAVG/Text/global-land-TAVG-Trend.txt \
  | grep -Ev "^%|^ ?$" \
  | tr -s '[:blank:]' \
  | cut -c 2- \
  | cut -d ' ' -f 1,2,3 \
  | tr ' ' ',' \
  > world_monthly_anom.csv;
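
The anomaly extraction above can also be sketched in Python, for readers who prefer to skip the intermediate CSV files. This is a minimal sketch under stated assumptions: the sample text is a made-up excerpt that only mimics the layout the pipeline relies on ('%'-prefixed comment lines followed by whitespace-separated columns), and the variable names are illustrative.

```python
import io
import pandas as pd

# Made-up excerpt: comment lines start with '%', data rows hold
# year, month, monthly anomaly, and further columns ignored here
sample = """\
% Hypothetical header line
%
  1750     1    -0.50  1.20
  1750     2     0.30  1.10
  1750     3     NaN   NaN
"""

raw = pd.read_csv(
    io.StringIO(sample),
    comment='%',   # skip the header block, like grep -v "^%"
    sep=r'\s+',    # treat runs of blanks as one separator, like tr -s
    header=None,
)

# Keep the first three columns, like cut -f 1,2,3
anomalies = raw.iloc[:, :3].copy()
anomalies.columns = ['year', 'month', 'delta']

print(anomalies)
```

Missing months arrive as NaN in the `delta` column, so any later completeness check needs to count non-missing values rather than rows.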

Python

import numpy as np
import pandas as pd
from plotnine import (ggplot, aes, geom_point, geom_smooth, labs,
                      theme_minimal, theme)

# load data files
reference_pt = pd.read_csv('../data/portugal_monthly_avg.csv',
                           header = None,
                           names = ['reference'])
temperature_pt = pd.read_csv('../data/portugal_monthly_anom.csv',
                             header = None,
                             names = ['year', 'month', 'delta'])
reference_world = pd.read_csv('../data/world_monthly_avg.csv',
                              header = None,
                              names = ['reference'])
temperature_world = pd.read_csv('../data/world_monthly_anom.csv',
                                header = None,
                                names = ['year', 'month', 'delta'])

reference_pt['month'] = range(1, 13)
reference_world['month'] = range(1, 13)

# --- calculate temperatures from reference and delta temperature files

def calculate_temperatures(ref_temp, delta_temp, loc):
    """Return annual mean absolute temperatures for one location."""
    assert 'delta' in delta_temp.columns, 'delta_temp lacks a delta column'
    assert 'reference' in ref_temp.columns, 'ref_temp lacks a reference column'
    # absolute temperature = monthly reference + monthly anomaly
    temp = ref_temp.merge(delta_temp, on = 'month')
    temp['absolute'] = temp.reference + temp.delta
    # remove years with missing measurements: count non-missing monthly
    # values per year, so that months recorded as NaN are not counted
    temp['count'] = temp.groupby('year')['absolute'].transform('count')
    temp = temp.query('count == 12')
    # average annual temperature
    temp = temp.groupby('year')[['absolute']].mean().reset_index()
    assert temp.year.duplicated().sum() == 0, 'remove duplicate years'
    temp['location'] = loc
    return temp

temperature_pt = calculate_temperatures(reference_pt, temperature_pt, 'Portugal')
temperature_world = calculate_temperatures(reference_world, temperature_world, 'World')

# --- combine data frames

temperatures = pd.concat([temperature_pt, temperature_world], ignore_index = True)
temperatures = temperatures.dropna()

# ------- line charts with moving averages

p = ggplot(temperatures, aes(x = 'year', y = 'absolute', color = 'location')) \
  + geom_point(alpha = .4, size = 2) \
  + geom_smooth(method = 'mavg', method_args = {'window': 10},  se = False) \
  + labs(x = 'Year', y = 'Average Temperature') \
  + theme_minimal() \
  + theme(legend_position = (0.75, 0.25))
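
The reconstruction of absolute temperatures (monthly reference plus monthly anomaly, followed by an annual mean over complete years) can be checked on toy data. The following sketch is separate from the script above; all values are made up and the variable names are illustrative.

```python
import pandas as pd

# Hypothetical monthly climatology: one reference value per calendar month
reference = pd.DataFrame({'month': range(1, 13), 'reference': [10.0] * 12})

# Hypothetical anomalies: 1750 is complete, 1751 lacks December
anomalies = pd.DataFrame({
    'year': [1750] * 12 + [1751] * 11,
    'month': list(range(1, 13)) + list(range(1, 12)),
    'delta': [0.5] * 12 + [1.0] * 11,
})

# Absolute temperature = monthly reference + monthly anomaly
merged = reference.merge(anomalies, on='month')
merged['absolute'] = merged['reference'] + merged['delta']

# Keep only years with 12 non-missing monthly values, then average per year
months_per_year = merged.groupby('year')['absolute'].transform('count')
annual = merged[months_per_year == 12].groupby('year')['absolute'].mean()

print(annual)
```

Only 1750 survives the completeness filter here. Dropping incomplete years avoids annual means biased toward whichever seasons happen to be recorded.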