Data Preparation and Advanced Visualization Using The Washington Post's "2ºC: Beyond the Limit" Data

Climate change and global warming are real, observable, and measurable. This project conducts time series analysis using the data behind The Washington Post's series, "2ºC: Beyond the Limit," to quantify the links between climate variation and different drivers of change.

Motivation and Summary

I got really interested in finding and analyzing this data set after reading the Washington Post's article. Here's the link to WP's github repository and some more details about the data set:

  • The data supporting the Washington Post's series, 2°C: Beyond the limit, which was awarded the Pulitzer Prize for explanatory reporting (link to data). The dataset includes:
      fips: the five-digit FIPS code for the county
      CTYNAME: the name of the county
      STNAME: the name of the state
      Annual: estimate of the annual average temperature change in Celsius for the county, 1895-2019
      Fall: temperature change in September, October, and November
      Spring: temperature change in March, April, and May
      Summer: temperature change in June, July, and August
      Winter: temperature change in December and the following January and February
      max_warming_season: the season in which temperatures are increasing fastest

The data set includes county-level annual temperatures for US counties and can be used, as is and without much cleaning, for analysis and visualization. However, after going through the raw files in the repository, I realized there's another file that contains the data above in more detail: the average temperature for each county is recorded monthly over the same period (1895-2019). This is great because it gives me a better opportunity to analyze not only the long-term behavior of the temperature time series (drift) but also the seasonal fluctuations over time.

I decided to break down this project into two parts: in part 1, I go through the data cleaning, data imputation, and data visualization using Plotly. In part 2, I use the data from part 1 to design a 'Drift' + 'Seasonal' + 'Noise' time series model for temperature forecasting.

Data Exploration

Here's the first few lines of the raw file:

head -5 climdiv-tmpccy-v1.0.0-20200106
01001021895  44.00  38.20  55.50  64.10  70.60  78.30  80.40  80.40  79.00  61.40  54.40  45.30   
01001021896  44.30  49.00  54.00  69.30  76.80  78.00  81.70  83.10  77.90  64.70  58.00  47.30   
01001021897  43.70  52.30  61.30  63.00  70.00  82.40  82.40  79.60  76.60  67.40  54.90  48.20   
01001021898  50.10  46.80  60.10  59.60  75.00  81.50  80.80  79.20  76.20  62.10  50.20  44.20   
01001021899  44.60  41.50  56.60  62.30  76.70  81.00  81.00  81.50  74.30  66.60  55.70  45.30 

As you can see, the file doesn't have a header, so we need to create a list of column names. The first entry appears to be the county FIPS code (5 digits) + "02" + the year, with the next 12 entries denoting the monthly average temperatures.

I start by creating a list and a dictionary of month names:

from calendar import month_name

import numpy as np

months = list(month_name)[1:]  # month_name[0] is an empty string
print(months)
months_dict = dict(zip(months, np.arange(1,13)))
print(months_dict)
    ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}

With the column names ready, we read the whitespace-delimited file into a DataFrame:

import pandas as pd

WP_df = pd.read_csv('climdiv-tmpccy-v1.0.0-20200106',
                    delimiter=r"\s+",
                    header=None,
                    names=['FipsYear']+months)
WP_df.head()
FipsYear January February March April May June July August September October November December
0 1001021895 44.000 38.200 55.500 64.100 70.600 78.300 80.400 80.400 79.000 61.400 54.400 45.300
1 1001021896 44.300 49.000 54.000 69.300 76.800 78.000 81.700 83.100 77.900 64.700 58.000 47.300
2 1001021897 43.700 52.300 61.300 63.000 70.000 82.400 82.400 79.600 76.600 67.400 54.900 48.200
3 1001021898 50.100 46.800 60.100 59.600 75.000 81.500 80.800 79.200 76.200 62.100 50.200 44.200
4 1001021899 44.600 41.500 56.600 62.300 76.700 81.000 81.000 81.500 74.300 66.600 55.700 45.300
WP_df.dtypes
    FipsYear       int64
    January      float64
    February     float64
    March        float64
    April        float64
    May          float64
    June         float64
    July         float64
    August       float64
    September    float64
    October      float64
    November     float64
    December     float64
    dtype: object

Note that pandas infers the type of the first column as int, which might cause issues later because the leading zero of FIPS codes starting with 0 has been mistakenly eliminated. We can resolve this issue in multiple ways, but here we use the apply() method of pandas to make the transformation. We also split the contents of the first column into two new columns: fips and year.

WP_df['year']=WP_df['FipsYear'].apply(lambda x: str(x)[-4:])
print(WP_df['year'].head())
    0    1895
    1    1896
    2    1897
    3    1898
    4    1899
    Name: year, dtype: object
WP_df['fips']=WP_df['FipsYear']. \
                           apply(lambda x: '0'+str(x)[:4] if len(str(x))<11 else str(x)[:5])
print(WP_df['fips'].head())
    0    01001
    1    01001
    2    01001
    3    01001
    4    01001
    Name: fips, dtype: object
WP_df.drop(['FipsYear'], axis=1, inplace=True)
WP_df.head()
January February March April May June July August September October November December year fips
0 44.000 38.200 55.500 64.100 70.600 78.300 80.400 80.400 79.000 61.400 54.400 45.300 1895 01001
1 44.300 49.000 54.000 69.300 76.800 78.000 81.700 83.100 77.900 64.700 58.000 47.300 1896 01001
2 43.700 52.300 61.300 63.000 70.000 82.400 82.400 79.600 76.600 67.400 54.900 48.200 1897 01001
3 50.100 46.800 60.100 59.600 75.000 81.500 80.800 79.200 76.200 62.100 50.200 44.200 1898 01001
4 44.600 41.500 56.600 62.300 76.700 81.000 81.000 81.500 74.300 66.600 55.700 45.300 1899 01001
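As an aside, the same transformation can be written with pandas' vectorized string methods instead of apply(); a sketch that would run before dropping the FipsYear column above:

# zero-pad the integer ID back to its original 11 characters
# (5-digit fips + '02' + 4-digit year), then slice out the pieces
fips_year = WP_df['FipsYear'].astype(str).str.zfill(11)
WP_df['fips'] = fips_year.str[:5]
WP_df['year'] = fips_year.str[-4:]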

Transforming our data into a more convenient format

Time series data are most useful when indexed by a timestamp, i.e., a column that records the temperature for each county in a year-month format. The new DataFrame has five columns: timestamp, temperature (in $^{\circ}\text{F}$), fips code, county name, and state name.

climate_df = pd.DataFrame(columns=['timestamp', 'tempf', 'fips', 'county', 'state'])
climate_df.head()
timestamp tempf fips county state

Adding fips information to Washington Post data

The Washington Post data, covering temperature variation over the period 1895-2019, is county-level data in which counties are represented by their Federal Information Processing Standards (FIPS) codes. As we will see later, including the county and state names will be helpful for visualization purposes.

We need a table that maps these codes to county/state names before we can join the names into the climate DataFrame.

## obtaining the fips table from
## https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697

The NRCS page has a table of FIPS codes for every county in the US. We use a relatively straightforward way to convert this table into a useful DataFrame in Python. First, we need to get the contents of the HTML page.

import requests

response = requests.get('https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697')

The most important components of a response to check before moving forward are the status code and the content.

response.status_code
    200

We get a status code of 200, meaning that our request was successful!

response.content[:100]
    b'\r\n\r\n\r\n\r\n \r\n  \r\n\r\n  \r\n\r\n \r\n   \r\n\r\n \r\n             \r\n    \r\n      \r\n    \r\n   \r\n \r\n \r\n \r\n'

We can check the response text using BeautifulSoup, which allows us to find our tag of interest by specifying one of its attributes. By inspecting the page, we see that the table tag of interest has an attribute class="data". Furthermore, the SoupStrainer class from the bs4 library can be used to limit our search to table tags only.

from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup(response.text, 'html.parser', parse_only=SoupStrainer('table'))
table = soup.find("table", {"class": "data"})
print(table.text[:50])
FIPS
Name
State
               
01001
Autauga
AL

01003
Bald

Looks OK! Next, we want to encapsulate this information in a list or DataFrame.

def get_row_data(tr, column_tag='td'):
    return [td.get_text(strip=True) for td in tr.find_all(column_tag)]

rows = []
trs = table.find_all('tr')
header_row = get_row_data(trs[0], 'th')
for tr in trs:  # for every table row
    rows.append(get_row_data(tr))  # collect the row's cell texts
print(header_row)
    ['FIPS', 'Name', 'State']
rows[1:5]
    [['01001', 'Autauga', 'AL'],
     ['01003', 'Baldwin', 'AL'],
     ['01005', 'Barbour', 'AL'],
     ['01007', 'Bibb', 'AL']]
fips_df_nrcs = pd.DataFrame(rows[1:],columns=['FIPS', 'county', 'state'])
fips_df_nrcs.head()
FIPS county state
0 01001 Autauga AL
1 01003 Baldwin AL
2 01005 Barbour AL
3 01007 Bibb AL
4 01009 Blount AL
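As an aside, pandas can parse HTML tables directly, which would replace the scraping steps above; a sketch (it requires lxml or html5lib, and the converter keeps the leading zeros):

# read_html returns a list of DataFrames, one per matching table
fips_df_alt = pd.read_html(response.text, attrs={'class': 'data'},
                           converters={'FIPS': str})[0] \
                .rename(columns={'Name': 'county', 'State': 'state'})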

We save the dataframe for later use.

#fips_df_nrcs.to_pickle('fips.pkl')
fips_df_nrcs=pd.read_pickle('fips.pkl')

Creating a DataFrame suited for Time Series Analysis

Now that I have the fips table ready, here's my plan for populating the DataFrame:

  • First, I filter the DataFrame rows by fips.
  • Then, I reshape the monthly temperature columns into a single, temporally ordered column vector
  • Next, I add a timestamp column that I have generated before. The timestamp column temporally matches the temperatures in the previous column.
  • Finally, I create an array of ($12\times\text{number of years}$) rows and 3 columns for the fips code, the county name, and the state name.
  • I repeat the same procedure for the rest of the fips codes.
WP_df.head()
January February March April May June July August September October November December year fips
0 44.000 38.200 55.500 64.100 70.600 78.300 80.400 80.400 79.000 61.400 54.400 45.300 1895 01001
1 44.300 49.000 54.000 69.300 76.800 78.000 81.700 83.100 77.900 64.700 58.000 47.300 1896 01001
2 43.700 52.300 61.300 63.000 70.000 82.400 82.400 79.600 76.600 67.400 54.900 48.200 1897 01001
3 50.100 46.800 60.100 59.600 75.000 81.500 80.800 79.200 76.200 62.100 50.200 44.200 1898 01001
4 44.600 41.500 56.600 62.300 76.700 81.000 81.000 81.500 74.300 66.600 55.700 45.300 1899 01001
Creating the timestamp column
from itertools import product

all_years = WP_df.year.unique()
all_months = months
# A little test before going full on
for year_month in product(all_years[:2], all_months):
    print(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))
    1895-01-01 00:00:00
    1895-02-01 00:00:00
    1895-03-01 00:00:00
    1895-04-01 00:00:00
    1895-05-01 00:00:00
    1895-06-01 00:00:00
    1895-07-01 00:00:00
    1895-08-01 00:00:00
    1895-09-01 00:00:00
    1895-10-01 00:00:00
    1895-11-01 00:00:00
    1895-12-01 00:00:00
    1896-01-01 00:00:00
    1896-02-01 00:00:00
    1896-03-01 00:00:00
    1896-04-01 00:00:00
    1896-05-01 00:00:00
    1896-06-01 00:00:00
    1896-07-01 00:00:00
    1896-08-01 00:00:00
    1896-09-01 00:00:00
    1896-10-01 00:00:00
    1896-11-01 00:00:00
    1896-12-01 00:00:00

Looks fine! Moving on!

timestamps_list = []
for year_month in product(all_years, all_months):
    timestamps_list.append(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))

Sanity check!

assert((2019-1895+1)*12==len(timestamps_list))
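As an aside, pandas can generate the same monthly index in one line; a sketch, assuming the full 1895-2019 coverage:

# 'MS' is the month-start frequency: 125 years x 12 months = 1500 timestamps
alt_timestamps = pd.date_range('1895-01-01', '2019-12-01', freq='MS')
assert list(alt_timestamps) == timestamps_list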

Again, a little test before iterating over all the rows:

climate_df = pd.DataFrame(columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
climate_df.head()
timestamp tempf fips county state
for i, fips in enumerate(WP_df.fips.unique()[:2]):
    temperature_array = WP_df[WP_df['fips']==fips] \
                                          .iloc[:,:12] \
                                          .values \
                                          .reshape(-1,1)
    fips_array = np.asarray([fips] * len(temperature_array)).reshape(-1,1)
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[fips].to_list()
    county_and_state_array = np.asarray(county_and_state * len(temperature_array)).reshape(-1,2)
    data_array = np.hstack([np.asarray(timestamps_list).reshape(-1,1), 
                            temperature_array, 
                            fips_array, 
                            county_and_state_array])
    tmp_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
    climate_df = pd.concat([climate_df, tmp_df])
climate_df.head()
timestamp tempf fips county state
0 1895-01-01 44.000 01001 Autauga AL
1 1895-02-01 38.200 01001 Autauga AL
2 1895-03-01 55.500 01001 Autauga AL
3 1895-04-01 64.100 01001 Autauga AL
4 1895-05-01 70.600 01001 Autauga AL

Perfect! Now we can go ahead and create the DataFrame:

for i, fips in enumerate(WP_df.fips.unique()[2:]):
    temperature_array = WP_df[WP_df['fips']==fips] \
                                          .iloc[:,:12] \
                                          .values \
                                          .reshape(-1,1)
    fips_array = np.asarray([fips] * len(temperature_array)).reshape(-1,1)
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[fips].to_list()
    county_and_state_array = np.asarray(county_and_state * len(temperature_array)).reshape(-1,2)
    data_array = np.hstack([np.asarray(timestamps_list).reshape(-1,1), 
                            temperature_array, 
                            fips_array, 
                            county_and_state_array])
    tmp_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
    climate_df = pd.concat([climate_df, tmp_df])
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    ...
KeyError: '02001'

Even though the web scraping works fine, the fips data obtained from the above method lacks a few fips numbers, e.g., '02001'; therefore, I use a public dataset from Plotly to obtain more complete fips data:

fips_df_plotly = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv", dtype={"fips": str})
fips_df_plotly.head()
fips unemp
0 01001 5.300
1 01003 5.400
2 01005 8.600
3 01007 6.600
4 01009 5.500

Comparing it with the initial fips data:

len(fips_df_nrcs), len(fips_df_plotly)
    (3231, 3219)

Interestingly, the initial fips data has more entries! Let's compare the WP fips codes with the union of the two fips datasets we have obtained so far:

len(set(WP_df.fips).difference(set(fips_df_plotly.fips).union(set(fips_df_nrcs.FIPS))))
    1538

OK, something's definitely wrong! Next, I print the number of unique fips codes for each state (the first two digits of a fips code denote the state) from the three sources to figure out what causes this discrepancy. My guess is that one dataset has shifted state codes, causing the mismatch. State codes range from '01' ('AL', Alabama) to '78' ('VI', Virgin Islands); looping over '01' through '72' is enough to cover the WP and Plotly data at hand.

state_codes = [f'{x:02d}' for x in range(1, 73)]
print(state_codes)
    ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18', 
    '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', 
    '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', 
    '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']
for state_code in state_codes:
    WP_unique_fips = len(WP_df[WP_df['fips'].str.startswith(state_code)]['fips'].unique())
    unique_fips_df_nrcs = len(fips_df_nrcs[fips_df_nrcs['FIPS'].str.startswith(state_code)]['FIPS'].unique())
    unique_fips_df_plotly = len(fips_df_plotly[fips_df_plotly['fips'].str.startswith(state_code)]['fips'].unique())
    print(state_code, WP_unique_fips, unique_fips_df_nrcs, unique_fips_df_plotly)
    01 67 67 67
    02 15 27 29
    03 75 0 0
    04 58 15 15
    05 64 75 75
    06 8 58 58
    07 3 0 0
    08 67 63 64
    09 159 8 8
    10 44 3 3
    11 102 1 1
    12 92 67 67
    13 99 159 159
    14 105 0 0
    15 120 4 4
    16 64 44 44
    17 16 102 102
    18 25 92 92
    19 14 99 99
    20 83 105 105
    21 87 120 120
    22 82 64 64
    23 115 16 16
    24 56 24 24
    25 93 14 14
    26 17 83 83
    27 10 87 87
    28 21 82 82
    29 33 115 115
    30 62 57 56
    31 100 93 93
    32 53 17 17
    33 88 10 10
    34 77 21 21
    35 36 33 33
    36 67 62 62
    37 5 100 100
    38 46 53 53
    39 66 88 88
    40 95 77 77
    41 254 36 36
    42 29 67 67
    43 14 0 0
    44 132 5 5
    45 39 46 46
    46 55 66 66
    47 72 95 95
    48 23 254 254
    49 0 29 29
    50 29 14 14
    51 0 136 133
    52 0 0 0
    53 0 39 39
    54 0 55 55
    55 0 72 72
    56 0 23 23
    57 0 0 0
    58 0 0 0
    59 0 0 0
    60 0 3 0
    61 0 0 0
    62 0 0 0
    63 0 0 0
    64 0 0 0
    65 0 0 0
    66 0 1 0
    67 0 0 0
    68 0 0 0
    69 0 3 0
    70 0 0 0
    71 0 0 0
    72 0 76 78

Looks like our suspicion was correct! By visually inspecting the table, we can see that, except for the first row, the counts in the first column match the counts in the second and third columns a few rows apart. We can fix the problem by defining a dictionary that takes the wrong state_codes as keys and returns the correct state_codes as values. We also note that the second fips dataset, obtained from Plotly, matches the WP fips codes better.

state_correction_dict = {}
fips_count_per_state = []
avail_states = []
for state_code in state_codes:
    unique_fips_df_plotly = len(fips_df_plotly[fips_df_plotly['fips'].str.startswith(state_code)]['fips'].unique())
    if unique_fips_df_plotly!=0:
        avail_states.append(state_code)
        fips_count_per_state.append(unique_fips_df_plotly)
fips_dict=dict(zip(avail_states,fips_count_per_state))
for i, state_code in enumerate(state_codes):
    print(state_code+":")
    WP_unique_fips = len(WP_df[WP_df['fips'].str.startswith(state_code)]['fips'].unique())
    if WP_unique_fips!=0:
        if fips_count_per_state.count(WP_unique_fips)==1:
            index = fips_count_per_state.index(WP_unique_fips)
            state_correction_dict[state_code]=avail_states[index]
            print(f'found a matching state! the state code {avail_states[index]} is mistakenly denoted by {state_code}')
        elif fips_count_per_state.count(WP_unique_fips)>1:
            indices = [i for i, e in enumerate(fips_count_per_state) if e == WP_unique_fips]
            print('found multiple matching states!')
            print([avail_states[i] for i in indices])
            state_correction_dict[state_code]=input('which code is the correct one?')
        else:
            state_correction_dict[state_code]=input('manually assign the correct code?')
    01:
    found multiple matching states!
    ['01', '12', '42']
    which code is the correct one?01
    02:
    found a matching state! the state code 04 is mistakenly denoted by 02
    03:
    found a matching state! the state code 05 is mistakenly denoted by 03
    04:
    found a matching state! the state code 06 is mistakenly denoted by 04
    05:
    found multiple matching states!
    ['08', '22']
    which code is the correct one?08
    06:
    found a matching state! the state code 09 is mistakenly denoted by 06
    07:
    found a matching state! the state code 10 is mistakenly denoted by 07
    08:
    found multiple matching states!
    ['01', '12', '42']
    which code is the correct one?12
    09:
    found a matching state! the state code 13 is mistakenly denoted by 09
    10:
    found a matching state! the state code 16 is mistakenly denoted by 10
    11:
    found a matching state! the state code 17 is mistakenly denoted by 11
    12:
    found a matching state! the state code 18 is mistakenly denoted by 12
    13:
    found a matching state! the state code 19 is mistakenly denoted by 13
    14:
    found a matching state! the state code 20 is mistakenly denoted by 14
    15:
    found a matching state! the state code 21 is mistakenly denoted by 15
    16:
    found multiple matching states!
    ['08', '22']
    which code is the correct one?22
    17:
    found a matching state! the state code 23 is mistakenly denoted by 17
    18:
    manually assign the correct code?24
    19:
    found multiple matching states!
    ['25', '50']
    which code is the correct one?25
    20:
    found a matching state! the state code 26 is mistakenly denoted by 20
    21:
    found a matching state! the state code 27 is mistakenly denoted by 21
    22:
    found a matching state! the state code 28 is mistakenly denoted by 22
    23:
    found a matching state! the state code 29 is mistakenly denoted by 23
    24:
    found a matching state! the state code 30 is mistakenly denoted by 24
    25:
    found a matching state! the state code 31 is mistakenly denoted by 25
    26:
    found a matching state! the state code 32 is mistakenly denoted by 26
    27:
    found a matching state! the state code 33 is mistakenly denoted by 27
    28:
    found a matching state! the state code 34 is mistakenly denoted by 28
    29:
    found a matching state! the state code 35 is mistakenly denoted by 29
    30:
    found a matching state! the state code 36 is mistakenly denoted by 30
    31:
    found a matching state! the state code 37 is mistakenly denoted by 31
    32:
    found a matching state! the state code 38 is mistakenly denoted by 32
    33:
    found a matching state! the state code 39 is mistakenly denoted by 33
    34:
    found a matching state! the state code 40 is mistakenly denoted by 34
    35:
    found a matching state! the state code 41 is mistakenly denoted by 35
    36:
    found multiple matching states!
    ['01', '12', '42']
    which code is the correct one?42
    37:
    found a matching state! the state code 44 is mistakenly denoted by 37
    38:
    found a matching state! the state code 45 is mistakenly denoted by 38
    39:
    found a matching state! the state code 46 is mistakenly denoted by 39
    40:
    found a matching state! the state code 47 is mistakenly denoted by 40
    41:
    found a matching state! the state code 48 is mistakenly denoted by 41
    42:
    found multiple matching states!
    ['02', '49']
    which code is the correct one?49
    43:
    found multiple matching states!
    ['25', '50']
    which code is the correct one?50
    44:
    manually assign the correct code?51
    45:
    found a matching state! the state code 53 is mistakenly denoted by 45
    46:
    found a matching state! the state code 54 is mistakenly denoted by 46
    47:
    found a matching state! the state code 55 is mistakenly denoted by 47
    48:
    found a matching state! the state code 56 is mistakenly denoted by 48
    49:
    50:
    found multiple matching states!
    ['02', '49']
    which code is the correct one?02
    51:
    52:
    53:
    54:
    55:
    56:
    57:
    58:
    59:
    60:
    61:
    62:
    63:
    64:
    65:
    66:
    67:
    68:
    69:
    70:
    71:
    72:
print(state_correction_dict)
    {'01': '01', '02': '04', '03': '05', '04': '06', '05': '08', '06': '09', '07': '10', 
     '08': '12', '09': '13', '10': '16', '11': '17', '12': '18', '13': '19', '14': '20', 
     '15': '21', '16': '22', '17': '23', '18': '24', '19': '25', '20': '26', '21': '27', 
     '22': '28', '23': '29', '24': '30', '25': '31', '26': '32', '27': '33', '28': '34', 
     '29': '35', '30': '36', '31': '37', '32': '38', '33': '39', '34': '40', '35': '41', 
     '36': '42', '37': '44', '38': '45', '39': '46', '40': '47', '41': '48', '42': '49', 
     '43': '50', '44': '51', '45': '53', '46': '54', '47': '55', '48': '56', '50': '02'}

By inspection, I figured out that even though state_code 18, with 25 unique fips codes, doesn't match any entry in the fips data, it follows the order of the neighboring rows, so I assigned it to state code 24, which has 24 unique fips entries. This means there is likely one extra fips code that we will have to deal with.

WP_df[WP_df['fips'].str.startswith('18')]['fips'].unique()
    array(['18001', '18003', '18005', '18009', '18011', '18013', '18015', '18017', '18019', 
     '18021', '18023', '18025', '18027', '18029', '18031', '18033', '18035', '18037', 
     '18039', '18041', '18043', '18045', '18047', '18510', '18511'], dtype=object)
fips_df_plotly[fips_df_plotly['fips'].str.startswith('24')]['fips'].unique()
    array(['24001', '24003', '24005', '24009', '24011', '24013', '24015', '24017', '24019', 
'24021', '24023', '24025', '24027', '24029', '24031', '24033', '24035', '24037', '24039', 
'24041', '24043', '24045', '24047', '24510'], dtype=object)

Comparing the two outputs, we notice that fips='24511', the corrected code for the observations with fips='18511', is not among the known fips codes. We either need to drop those observations from the climate data or add fips='24511' to the fips DataFrame; we go with the second option. Before adding it, we first update the NRCS fips data, fips_df_nrcs, with the fips codes that don't exist in fips_df_nrcs but are tabulated in fips_df_plotly.

First, the fips codes that exist in the Plotly data but not on the NRCS page:

new_fips_entries = set(fips_df_plotly.fips).difference(set(fips_df_nrcs.FIPS))
new_fips_entries
    {'02105',
     '02158',
     '02195',
     '02198',
     '02230',
     '02275',
     '08014',
     '46102',
     '72039',
     '72069'}

Next, we add '24511' (the code that exists in the WP data but in neither the NRCS nor the Plotly data) and append all the missing codes to fips_df_nrcs:

new_fips_entries.add('24511')
for fips_code in new_fips_entries:
    # infer the state from any existing entry sharing the two-digit state prefix
    state = fips_df_nrcs[fips_df_nrcs['FIPS'].str.startswith(fips_code[:2])]['state'].iloc[0]
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat there
    fips_df_nrcs = fips_df_nrcs.append({'FIPS': fips_code, 'county':'UNKNOWN', 'state':state}, ignore_index=True)
fips_df_nrcs.tail(15)
FIPS county state
3228 78010 St. Croix VI
3229 78020 St. John VI
3230 78030 St. Thomas VI
3231 01001 UNKNOWN AL
3232 24511 UNKNOWN MD
3233 46102 UNKNOWN SD
3234 02158 UNKNOWN AK
3235 02195 UNKNOWN AK
3236 02198 UNKNOWN AK
3237 02230 UNKNOWN AK
3238 02105 UNKNOWN AK
3239 08014 UNKNOWN CO
3240 02275 UNKNOWN AK
3241 72039 UNKNOWN PR
3242 72069 UNKNOWN PR

Adding the missing FIPS county names

Finally, I decided to look up the missing counties and add their names to the fips DataFrame. The only fips code for which I was unable to find a corresponding county was 24511. Here's what I found:

missing_counties_dict = {
'02105':['Hoonah Angoon','AK'],
'02158':['Kusilvak','AK'],
'02195':['Petersburg Borough','AK'],
'02198':['Prince of Wales-Hyder Census Area','AK'],
'02230':['Skagway Municipality','AK'],
'02275':['Wrangell City and Borough' ,'AK'],
'08014':['Broomfield County','CO'],
'72039':['Ciales Municipio','PR'],
'72069':['Humacao Municipio','PR'],
'46102':['Oglala Lakota','SD']
}
assert(len(fips_df_nrcs[fips_df_nrcs['county']=='UNKNOWN'])==len(new_fips_entries))
fips_df_nrcs.to_pickle('fips_df_new.pkl') # save
fips_df_nrcs.head()
FIPS county state
0 01003 Baldwin AL
1 01005 Barbour AL
2 01007 Bibb AL
3 01009 Blount AL
4 01011 Bullock AL

Now we repeat the previous loop to construct the DataFrame. However, instead of growing the DataFrame by concatenating a new DataFrame to it at every iteration, we append the results to predefined lists, since the first approach becomes inefficient for large DataFrames (each concat copies all the accumulated data).

# a preview of the reshaped temperature column for a single county
# (wrong_fips holds one code from WP_df.fips.unique(), as in the loop below)
WP_df[WP_df['fips']==wrong_fips] \
    .iloc[:,:12].values.reshape(-1,1)
    array([[28.2],
           [26.4],
           [46. ],
           ...,
           [58.1],
           [41.4],
           [42. ]])
from datetime import datetime

climate_df = pd.DataFrame(columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
num_fips = len(WP_df.fips.unique())
num_years = (2019-1895+1)
num_months = 12
step = 500 # number of iterations between timing reports
start_time = datetime.now()
now = start_time
fips_list = []
temps_list = []
county_and_states_list = []
timestamps = []
for i, wrong_fips in enumerate(WP_df.fips.unique()):
    # wrong_fips is ok for looking up the temperatures
    temps =  WP_df[WP_df['fips']==wrong_fips] \
                                          .iloc[:,:12] \
                                          .values \
                                          .reshape(-1,1)
    correct_fips = state_correction_dict[wrong_fips[:2]]+wrong_fips[2:]
    temps_list.extend(temps)
    fips_list.extend([correct_fips] * len(temps))
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[correct_fips].to_list()
    county_and_states_list.extend(county_and_state * len(temps))
    # Temperature data for a few counties does not cover the entire 1895-2019 range
    if len(temps)!=1500:
        timestamps_=[]
        years_ = WP_df[WP_df['fips']==wrong_fips].year.unique()
        for year_month in product(years_, all_months):  # avoid shadowing the outer loop's i
            timestamps_.append(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))
        timestamps.extend(timestamps_)
    else:
        timestamps.extend(timestamps_list)
    if i%step==0:
        print(f"processing the past {step} fips codes took {(datetime.now()-now).total_seconds():.2f} seconds")
        print(f"total time spent so far is {(datetime.now()-start_time).total_seconds():.2f} seconds")
        print(f'{num_fips-i} fips_code remain to be processed!')
        now = datetime.now()
    processing the past 500 fips codes took 0.04 seconds
    total time spent so far is 0.05 seconds
    3136 fips_code remain to be processed!
    processing the past 500 fips codes took 11.80 seconds
    total time spent so far is 11.85 seconds
    2636 fips_code remain to be processed!
    processing the past 500 fips codes took 11.16 seconds
    total time spent so far is 23.00 seconds
    2136 fips_code remain to be processed!
    processing the past 500 fips codes took 11.90 seconds
    total time spent so far is 34.90 seconds
    1636 fips_code remain to be processed!
    processing the past 500 fips codes took 28.75 seconds
    total time spent so far is 63.65 seconds
    1136 fips_code remain to be processed!
    processing the past 500 fips codes took 11.99 seconds
    total time spent so far is 75.64 seconds
    636 fips_code remain to be processed!
    processing the past 500 fips codes took 10.69 seconds
    total time spent so far is 86.33 seconds
    136 fips_code remain to be processed!
data_array = np.hstack([np.reshape(timestamps,(-1,1)),  
                        np.reshape(temps_list,(-1,1)), 
                        np.reshape(fips_list,(-1,1)), 
                        np.reshape(county_and_states_list,(-1,2))])
climate_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])

Since there were not that many counties with missing names, and for the sake of completeness, I decided to manually fill in the UNKNOWN county names using the missing_counties_dict defined earlier. Some county names are missing because they have either been modified or were added to the FIPS data only recently (after 2015).

for k,v in missing_counties_dict.items():
    row_index = climate_df[climate_df['fips']==k].index
    climate_df.loc[row_index,'county'] = v[0]

climate_df.head()
timestamp tempf fips county state
0 1895-01-01 44.000 01001 Autauga AL
1 1895-02-01 38.200 01001 Autauga AL
2 1895-03-01 55.500 01001 Autauga AL
3 1895-04-01 64.100 01001 Autauga AL
4 1895-05-01 70.600 01001 Autauga AL

Finally, before saving the data, let's convert the columns to the correct data types.

climate_df.dtypes
    timestamp    datetime64[ns]
    tempf                object
    fips                 object
    county               object
    state                object
    dtype: object
climate_df['tempf'] = climate_df['tempf'].astype(float)
climate_df['fips'] = climate_df['fips'].astype(str)
climate_df['county'] = climate_df['county'].astype(str)
climate_df['state'] = climate_df['state'].astype(str)

# save
climate_df.to_pickle('WP_climate_data.pkl')

Great! Now that we are done with data cleaning, we can start Time Series forecasting!

Time Series Analysis

Time series forecasting presents a different problem compared to other ML problems in that it introduces a time dimension. All the features in a given observation row of a time series have one thing in common: they were observed at the same time. We can think of time as the index of our DataFrame, as it orders the observations temporally.
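As a quick illustration (a sketch, using the climate_df saved in part 1):

# with a DatetimeIndex, pandas supports time-aware, partial-string selection
autauga = climate_df[climate_df['fips']=='01001'].set_index('timestamp')
autauga.loc['1950':'1959', 'tempf'].head()  # one county's monthly series for a decade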

The goal of time series prediction is to make predictions for some time in the future. Due to the temporal and periodic structure of the data, time series models usually incorporate different methods to capture different time scales. For instance, long-term "trends" (AKA "drift"), periodic patterns (AKA "seasonality"), and the "residual noise" are each modeled using the methods and assumptions that best fit that aspect of the time component.

Making Predictions

When predicting a variable's future values, we need a good understanding of the temporal nature of the target variable. Some of the questions we might want to ask ourselves before attempting to design a model are:

  • What is the resolution of the data that we have? Is it yearly, monthly, weekly, etc.?
  • How much data do we have? 1 year? 1 decade? ...
  • What is the time window that we would like the model to predict? Is it the next day? Short term? Long term? ...
  • Will we update the data that we have, or is it going to be used just one time?
Answering these questions really helps in deciding which model to use and in understanding its advantages and disadvantages.

Model Validation

As one might intuitively expect, model validation is slightly different from the common validation techniques used in ML: all the observations in the validation data have to correspond to times after the most recent observation in the training data. It is also important to take into account potential relationships between observations, e.g., seasonal, monthly, or hourly correlations, to make the most accurate model.
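For example, scikit-learn's TimeSeriesSplit produces folds in which every validation index comes after every training index; a minimal sketch on toy data:

from sklearn.model_selection import TimeSeriesSplit

import numpy as np

X = np.arange(20).reshape(-1, 1)  # 20 time-ordered observations
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    # each validation block starts right after the training block ends
    print(f'train: 0-{train_idx.max()}, validate: {val_idx.min()}-{val_idx.max()}')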

Applications

Time Series models can be applied to a lot of problems in different areas, for example:

  • Forecasting the Bitcoin price
  • Forecasting the weather
  • Forecasting traffic
  • Forecasting housing prices
  • Forecasting precipitation
  • ...

Modeling Different components of a Time Series

A time series $X_t$ can be expressed as the sum of drift, seasonal, and noise components: $$ X_t = X_{d,t} + X_{s,t} + X_{n,t}, $$ where $t$ denotes the time index. We aim to develop a mathematical model for each of these constituents so that we can predict the series at future times.
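Before building our own model in part 2, we can get a feel for these components with statsmodels' off-the-shelf decomposition; a minimal sketch on one county's monthly series (assuming the climate_df from part 1):

from statsmodels.tsa.seasonal import seasonal_decompose

# additive decomposition with a 12-month season: the trend, seasonal and resid
# attributes correspond to X_{d,t}, X_{s,t} and X_{n,t}, respectively
series = climate_df[climate_df['fips']=='01001'].set_index('timestamp')['tempf']
decomposition = seasonal_decompose(series, model='additive', period=12)
decomposition.plot();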

Modeling the general trend, AKA "drift"

The general, long-term behavior of a time series is referred to as its trend. The best way to check for such behavior is to look at the series over a long period of time. For instance, we can take a look at the average temperature of Pennsylvania from 1895 to 2019. We use annual averages, seasonal averages, and monthly data to plot the temperature variation at three different time scales, and use that data to calculate the best linear fit.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(3, 1, figsize=(12,12), dpi=80)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.25)
state = 'PA'
# Filter to one state and set the timestamp as index
state_df = climate_df[climate_df['state']==state].set_index('timestamp')
year_min = state_df.index.year.min()
year_max = state_df.index.year.max()
# We define a lookup array that, given a month index (1, 2, ...), returns the season
month_to_season_arr = np.array([
    None,
    'Winter', 'Winter',
    'Spring', 'Spring', 'Spring',
    'Summer', 'Summer', 'Summer',
    'Fall', 'Fall', 'Fall',
    'Winter'
])
season_groupby = month_to_season_arr[state_df.index.month]
groupby_list = [
    [(state_df.index.year)], 
    [(state_df.index.year), season_groupby], 
    [(state_df.index.year),(state_df.index.month)]
]
titles = ['Annual', 'Seasonal', 'Monthly']
for i,ax_ in enumerate(ax):
    state_df.groupby(groupby_list[i]).mean().plot(ax=ax_)
    ax_.set_xlabel('')
    ax_.legend([]) # remove the legends
    ax_.set_title(f'{titles[i]} average temperature ($^\circ$F) in {state} from {year_min} to {year_max}', 
                  fontsize=18)

Notice the linear upward trend in temperatures from 1895 to 2019 in Figure 1. Also notice that higher resolution, e.g., plotting monthly instead of yearly, makes visualizing the general trend more difficult. This makes sense because annual temperature averages fluctuate far less than monthly data. This is not quite true when we compare the seasonal and monthly plots, because monthly temperatures do not vary much within a season; therefore, seasonal averages remain close to the monthly temperatures of the months belonging to that season.
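We can quantify this claim with a quick check of the spreads (a sketch, using the PA state_df indexed by timestamp above):

# annual means average out the seasonal cycle, so they spread far less
monthly_std = state_df['tempf'].std()
annual_std = state_df['tempf'].groupby(state_df.index.year).mean().std()
print(f'monthly std: {monthly_std:.2f} F, annual std: {annual_std:.2f} F')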

Had we plotted the series over a shorter time period, we might not have been able to notice this trend. For instance, looking at the last 30 years instead of the entire time period shows:

fig, ax = plt.subplots(1, 1, figsize=(12, 4), dpi=300)
last_many_years = 30
state_df.groupby(state_df.index.year).mean()[-last_many_years:].plot(ax=ax, marker='o')
ax.set_xlabel('')
ax.legend([])
ax.set_title(f'{titles[0]} average temperature ($^\circ$F) in {state} from {year_max-last_many_years} to {year_max}',
             fontsize=14);

Therefore, revealing the long-term trend in time series data requires observations that reach significantly far into the past. Next, we go through common ways to model the trend.

Modeling trend using linear regression

For the sake of convenience, I use scikit-learn's LinearRegression to find the best linear fit to the time series data. We use the same data as in the previous part (annual temperature averages of PA).

from sklearn.linear_model import LinearRegression

# note: the normalize argument was removed in scikit-learn 1.2
linear_model_y = LinearRegression(normalize=True, fit_intercept=True)
x_data_y = state_df.index.year.unique().values.reshape(-1,1)
y_data_y = state_df.groupby(state_df.index.year).mean()
linear_model_y.fit(x_data_y, y_data_y) 
print(f'linear coefficient={linear_model_y.coef_[0][0]:.4f} and intercept={linear_model_y.intercept_[0]:.2f} ')
    linear coefficient=0.0150 and intercept=19.24 

Similarly, we can use the monthly data for linear regression:

linear_model_m = LinearRegression(normalize=True, fit_intercept=True) 
y_data_m = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
x_data_m = np.linspace(1895, 2019, len(y_data_m)).reshape(-1,1)
linear_model_m.fit(x_data_m, y_data_m) 
print(f'linear coefficient={linear_model_m.coef_[0][0]:.4f} and intercept={linear_model_m.intercept_[0]:.2f}')
    linear coefficient=0.0161 and intercept=17.05

Notice the slight difference between the two fits; a slope of about 0.015 °F/year corresponds to roughly 1.5 °F of warming per century. We can visualize the overlay of the linear model prediction and the actual data. The linear model represents the drift constituent of the time series model, $X_{d,t}$.

import seaborn as sns

sns.set(rc={'axes.facecolor':'whitesmoke', 'figure.facecolor':'gainsboro', 'legend.facecolor':'white'})
fig, ax = plt.subplots(2, 1, figsize=(12, 8), dpi=300)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.4)
a_y, b_y = linear_model_y.coef_[0][0], linear_model_y.intercept_[0] # f(x) = a.x + b (year)
a_m, b_m = linear_model_m.coef_[0], linear_model_m.intercept_ # f(x) = a.x + b (month)
y_pred_y = a_y*x_data_y + b_y
y_pred_m = a_m*x_data_m + b_m
y_data = [y_data_y, y_data_m]
x_data = [x_data_y, x_data_m]
y_pred = [y_pred_y, y_pred_m]
titles = ['Annual', 'Monthly']
for i, ax_ in enumerate(ax):
    ax_.plot(x_data[i], y_data[i], color='deepskyblue', linestyle='-', label='data')
    ax_.plot(x_data[i], y_pred[i], color='maroon', linestyle='--', label='linear fit')
    ax_.set_xlabel('year', fontsize=18, labelpad=10)
    ax_.set_ylabel('temperature ($^\circ$F)', fontsize=18, labelpad=10)
    if i==0:
        ax_.legend(bbox_to_anchor=(1,1,0,0), fontsize=14, loc='upper left');
    ax_.set_title(f'Linear fit of time series (using {titles[i]} data)', fontsize=22);
    ax_.set_xlim([1894, 2020])
    ax_.grid(False)
    plt.setp(ax_.get_xticklabels(), fontsize=14)
    plt.setp(ax_.get_yticklabels(), fontsize=14)

Now that we know how to model long time behavior of the temperature, we can start visualizing the temperature data to see if we can gain any insights using this extremely simple model.

Visualizing the temperature variations in different US regions

Let's take a look at the temperature variations in New England (sorry!) states, i.e., Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont.

fig, ax = plt.subplots(1,1,figsize=(12,6),dpi=100)
states_list = ['MA', 'RI', 'CT', 'ME', 'NH', 'VT']
temp_unit = 'F' # 'C' for Celsius or 'F' for Fahrenheit

for state in states_list:
    state_df = climate_df[climate_df['state']==state].set_index("timestamp")
    means = state_df.groupby(state_df.index.year) \
                    .agg({'temp'+temp_unit.lower(): 'mean'})
    years = means.index.tolist()
    ax.plot(years, means, label=state)
ax.set_xlabel('year', labelpad=10, fontsize=18)
ax.set_ylabel(f'Temperature ($^\circ${temp_unit})', labelpad=15, fontsize=18)
states = ', '.join([state for state in states_list])
ax.set_title(f'Temperature variation over the period {min(years)}-{max(years)} for \n{states}', 
             fontsize=22, y=1.02)
plt.setp(ax.get_xticklabels(), fontsize=14)
plt.setp(ax.get_yticklabels(), fontsize=14)
ax.legend(bbox_to_anchor=(1, 0, 0.1, 1), loc='upper left', fontsize=14);

Notice that we see two distinct groups of states in Figure 5: the Northern ('NH', 'ME', and 'VT') and Southern ('CT', 'MA', and 'RI') New England states.

I wrapped the code above in a function to display the temperature variations for a state or group of states, where the states are color coded by their mean temperature.

def plot_ts_temp(df, states_list, ax=None, 
                 figsize=(15,10), dpi=300, cmap='coolwarm', 
                 temp_unit = 'F', lbfs=18, lbpdx=10, lbpdy=15,
                 ttlfs=22, tckfs=14, title='', xlab='year'):
    colors = col_to_hex(cmap, n=len(states_list))  # col_to_hex: helper (defined elsewhere) mapping a colormap to n hex colors
    # mean temperature per state; we use this to sort states_list by temperature
    means = df.query("state in @states_list").groupby('state')['temp'+temp_unit.lower()].mean()
    sorted_by_temp_states = means.sort_values().index.tolist()
    if ax is None:
        fig=plt.figure(figsize=figsize, dpi=dpi)
        ax = fig.add_subplot(111)
    print(sorted_by_temp_states)
    for i, state in enumerate(sorted_by_temp_states):
        state_df = df[df['state']==state].set_index('timestamp')
        means = state_df.groupby(state_df.index.year) \
                        .agg({'temp'+temp_unit.lower(): 'mean'})
        years = means.index.tolist()
        ax.plot(years, means, label=state, color=colors[i])
    ax.set_xlabel(xlab, labelpad=lbpdx, fontsize=lbfs)
    ax.set_ylabel(f'Temperature ($^\circ${temp_unit})', labelpad=lbpdy, fontsize=lbfs)
    states = ', '.join([state for state in states_list])
    if title is not None:
        ax.set_title(f'Temperature variation over the period {min(years)}-{max(years)} for \n{states}', 
                     fontsize=ttlfs, y=1.02)
    plt.setp(ax.get_xticklabels(), fontsize=tckfs)
    plt.setp(ax.get_yticklabels(), fontsize=tckfs)
    ax.legend(bbox_to_anchor=(1, 0, 0.1, 1), loc='upper left', fontsize=tckfs)

This time, let's take a look at East North Central states, i.e., Illinois, Indiana, Michigan, Ohio, and Wisconsin.

plot_ts_temp(climate_df, ['IL', 'IN', 'MI', 'OH', 'WI'])

Figure 6 shows something similar to what we saw in Figure 5, where the temperature variations for East North Central states are further divided into two groups of states, the Northern (WI and MI) and Southern states (IL, IN, and OH) of the region.

Finally, we plot the two groups of states, New England and East North Central, together.

fig, ax = plt.subplots(1,1,figsize=(8,8),sharex=True)
plot_ts_temp(climate_df, ['IL', 'IN', 'MI', 'OH', 'WI', 'MA', 'RI', 'CT', 'ME', 'NH', 'VT'], 
             ax=ax, title=None, cmap='coolwarm', xlab='year')

As one might have expected, the Northern and Southern states of the two regions now appear as adjacent curves in Figure 7, where we can see a distinct gap between the Northern ('NH', 'ME', 'VT', 'WI', and 'MI') and Southern ('CT', 'MA', 'RI', 'IL', 'IN', and 'OH') states of the union of the two regions.

Visualizing temperature heatmaps using Plotly

Plotly's choropleth maps are an efficient, beautiful way of representing and analyzing the spatial variation of a quantity. In addition, if we split our time series data into separate groups, e.g., monthly aggregates, Plotly's slider element can be used to examine the variation of temperature across months, seasons, etc.

In this part of the project, I will first go through visualizing the mean temperatures using Plotly, and then I'll demonstrate how the slider element can be used to explore temperatures and temperature variation rates at different scales, e.g., state level, county level, etc.
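As a quick preview of the slider idea, plotly express renders its animation_frame argument as a slider (plus a play button) under the map; a sketch, assuming the climate_df built above:

import plotly.express as px

# yearly state means, one frame per year; 125 frames can be slow to render,
# so decade averages are a lighter alternative
yearly = (climate_df.assign(year=climate_df['timestamp'].dt.year)
          .groupby(['year', 'state'], as_index=False)['tempf'].mean())
fig = px.choropleth(yearly, locations='state', locationmode='USA-states',
                    color='tempf', animation_frame='year', scope='usa',
                    color_continuous_scale='RdBu_r',
                    labels={'tempf': 'Temperature (°F)'})
fig.show()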

State-level temperatures in 2019

# mean state temperature in 2019
state_avg_temp_2019 = climate_df[climate_df['timestamp'].dt.year==2019]. \
                      groupby('state')['tempf'].agg('mean')
state_avg_temp_2019 = state_avg_temp_2019.reset_index()
state_avg_temp_2019.head()
state tempf
0 AK 36.110
1 AL 65.165
2 AR 61.176
3 AZ 61.758
4 CA 57.551

Choropleth maps use

  • a data dictionary that contains the variable to be plotted as well as some geometric attributes of the locations to which the data belong
  • a layout dictionary that controls the appearance of the map onto which data is plotted.

The following piece of code plots the average state temperatures in 2019.
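A minimal sketch using plotly.graph_objects (the styling of the original figure may differ):

import plotly.graph_objects as go

fig = go.Figure(data=go.Choropleth(
    locations=state_avg_temp_2019['state'],  # two-letter state abbreviations
    z=state_avg_temp_2019['tempf'],          # mean 2019 temperature
    locationmode='USA-states',               # interpret locations as US states
    colorscale='RdBu_r',
    colorbar_title='Temperature (°F)',
))
fig.update_layout(title_text='Average state temperatures in 2019',
                  geo_scope='usa')  # limit the map to the US
fig.show()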

County-level temperatures in 2019

We can reproduce Figure 8 in greater detail by calculating the mean temperatures for every county instead of every state.

county_avg_temp_2019 = climate_df[climate_df['timestamp'].dt.year==2019]. \
                       groupby(['fips','county','state'])['tempf'].agg('mean')
county_avg_temp_2019 = county_avg_temp_2019.reset_index()
county_avg_temp_2019.head()
fips county state tempf
0 01001 Autauga AL 66.608
1 01003 Baldwin AL 68.792
2 01005 Barbour AL 67.133
3 01007 Bibb AL 64.746
4 01009 Blount AL 63.583
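Plotting at the county level requires county boundary geometries (GeoJSON) keyed by fips code; a sketch following Plotly's documented county-choropleth example:

import json
from urllib.request import urlopen

import plotly.express as px

# county boundaries from Plotly's datasets repo, with feature ids = fips codes
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as f:
    counties = json.load(f)

fig = px.choropleth(county_avg_temp_2019, geojson=counties, locations='fips',
                    color='tempf', color_continuous_scale='RdBu_r', scope='usa',
                    labels={'tempf': 'Temperature (°F)'})
fig.update_layout(margin={'r': 0, 't': 0, 'l': 0, 'b': 0})
fig.show()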

County-level temperature variation rate in US in (°F per 100 years)