Data Preparation and Advanced Visualization Using The Washington Post's "2°C: Beyond the Limit" Data
Climate change and global warming are real, observable, and measurable. This project conducts time series analysis on the data supporting The Washington Post's series, "2°C: Beyond the Limit," to quantify the links between climate variation and different drivers of change.
Motivation and Summary
I got really interested in finding and analyzing this data set after reading The Washington Post's article. Here's the link to WP's GitHub repository and some more details about the data set:
- The data supporting The Washington Post's series, 2°C: Beyond the limit, which was awarded the Pulitzer Prize for explanatory reporting (link to data). The dataset includes:
  - fips: a five-digit FIPS code for the county
  - CTYNAME: the name of the county
  - STNAME: the name of the state
  - Annual: estimate of annual average temperature change in Celsius for the county, 1895-2019
  - Fall: temperature change in September, October, and November
  - Spring: temperature change in March, April, and May
  - Summer: temperature change in June, July, and August
  - Winter: temperature change in December and the following January and February
  - max_warming_season: the season in which temperatures are increasing fastest
The data set includes county-level annual temperatures for US counties. It can be used, as is and without much cleaning, for analysis and visualization purposes. However, after going through the raw files in the repository, I realized there's another file that contains the data above in more detail, where the average temperature for each county is recorded monthly over the same period (1895-2019). This is great because it gives me a better opportunity to analyze not only the long-term behavior of the temperature time series (the drift) but also the seasonal fluctuations over time.
I decided to break down this project into two parts: in part 1, I go through data cleaning, data imputation, and data visualization using Plotly. In part 2, I use the data from part 1 to design a 'Drift' + 'Seasonal' + 'Noise' time series model for temperature forecasting.
Data Exploration
Here are the first few lines of the raw file:
head -5 climdiv-tmpccy-v1.0.0-20200106
01001021895 44.00 38.20 55.50 64.10 70.60 78.30 80.40 80.40 79.00 61.40 54.40 45.30
01001021896 44.30 49.00 54.00 69.30 76.80 78.00 81.70 83.10 77.90 64.70 58.00 47.30
01001021897 43.70 52.30 61.30 63.00 70.00 82.40 82.40 79.60 76.60 67.40 54.90 48.20
01001021898 50.10 46.80 60.10 59.60 75.00 81.50 80.80 79.20 76.20 62.10 50.20 44.20
01001021899 44.60 41.50 56.60 62.30 76.70 81.00 81.00 81.50 74.30 66.60 55.70 45.30
As you can see, the file has no header, so we need to create a list of column names. The first field appears to be the county FIPS code (5 digits) + 02 + the year, with the next 12 entries denoting the monthly average temperatures. A small sketch of how one such key splits apart is shown below.
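Here, the meaning of the '02' element code is my assumption based on NOAA's climdiv file layout; the FIPS and year slices follow directly from the record above:

record = '01001021895'   # first field of the first raw line
fips = record[:5]        # '01001' -> county FIPS code
element = record[5:7]    # '02'    -> element code (assumed: monthly average temperature)
year = record[7:]        # '1895'
print(fips, element, year)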
I start by creating a list and a dictionary for the month names (with the imports used throughout this part):

import numpy as np
import pandas as pd
from calendar import month_name

months = [month for month in month_name][1:]
print(months)
months_dict = dict(zip(months,np.arange(1,13)))
print(months_dict)
['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
{'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6, 'July': 7, 'August': 8, 'September': 9, 'October': 10, 'November': 11, 'December': 12}
As noted, the file has no header, so we pass our own column names:
WP_df = pd.read_csv('climdiv-tmpccy-v1.0.0-20200106',
delimiter=r"\s+",
header=None,
names = ['FipsYear']+months)
WP_df.head()
| | FipsYear | January | February | March | April | May | June | July | August | September | October | November | December |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001021895 | 44.000 | 38.200 | 55.500 | 64.100 | 70.600 | 78.300 | 80.400 | 80.400 | 79.000 | 61.400 | 54.400 | 45.300 |
| 1 | 1001021896 | 44.300 | 49.000 | 54.000 | 69.300 | 76.800 | 78.000 | 81.700 | 83.100 | 77.900 | 64.700 | 58.000 | 47.300 |
| 2 | 1001021897 | 43.700 | 52.300 | 61.300 | 63.000 | 70.000 | 82.400 | 82.400 | 79.600 | 76.600 | 67.400 | 54.900 | 48.200 |
| 3 | 1001021898 | 50.100 | 46.800 | 60.100 | 59.600 | 75.000 | 81.500 | 80.800 | 79.200 | 76.200 | 62.100 | 50.200 | 44.200 |
| 4 | 1001021899 | 44.600 | 41.500 | 56.600 | 62.300 | 76.700 | 81.000 | 81.000 | 81.500 | 74.300 | 66.600 | 55.700 | 45.300 |
WP_df.dtypes
FipsYear int64
January float64
February float64
March float64
April float64
May float64
June float64
July float64
August float64
September float64
October float64
November float64
December float64
dtype: object
Note that pandas infers the type of the first column as int, which might cause issues later because the leading zero of FIPS codes starting with 0 has been mistakenly eliminated. We can resolve this issue in multiple ways; here, we use the apply() method of pandas to make the transformation. We also split the contents of the first column into two new columns: fips and year.
WP_df['year']=WP_df['FipsYear'].apply(lambda x: str(x)[-4:])
print(WP_df['year'].head())
0 1895
1 1896
2 1897
3 1898
4 1899
Name: year, dtype: object
WP_df['fips']=WP_df['FipsYear']. \
apply(lambda x: '0'+str(x)[:4] if len(str(x))<11 else str(x)[:5])
print(WP_df['fips'].head())
0 01001
1 01001
2 01001
3 01001
4 01001
Name: fips, dtype: object
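As an aside, an equivalent one-liner (a sketch using pandas string methods rather than apply()) pads the integer back to its full 11 characters and slices out the code:

# zero-pad to 11 chars (fips + '02' + 4-digit year), then keep the first five
WP_df['fips'] = WP_df['FipsYear'].astype(str).str.zfill(11).str[:5]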
WP_df.drop(['FipsYear'], axis=1, inplace=True)
WP_df.head()
| | January | February | March | April | May | June | July | August | September | October | November | December | year | fips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44.000 | 38.200 | 55.500 | 64.100 | 70.600 | 78.300 | 80.400 | 80.400 | 79.000 | 61.400 | 54.400 | 45.300 | 1895 | 01001 |
| 1 | 44.300 | 49.000 | 54.000 | 69.300 | 76.800 | 78.000 | 81.700 | 83.100 | 77.900 | 64.700 | 58.000 | 47.300 | 1896 | 01001 |
| 2 | 43.700 | 52.300 | 61.300 | 63.000 | 70.000 | 82.400 | 82.400 | 79.600 | 76.600 | 67.400 | 54.900 | 48.200 | 1897 | 01001 |
| 3 | 50.100 | 46.800 | 60.100 | 59.600 | 75.000 | 81.500 | 80.800 | 79.200 | 76.200 | 62.100 | 50.200 | 44.200 | 1898 | 01001 |
| 4 | 44.600 | 41.500 | 56.600 | 62.300 | 76.700 | 81.000 | 81.000 | 81.500 | 74.300 | 66.600 | 55.700 | 45.300 | 1899 | 01001 |
Transforming our data into a more convenient format
Time series data is most useful when represented with a timestamp, i.e., a column that locates each county's temperature on a year-month axis. The new DataFrame has five columns: timestamp, temperature (in $^{\circ}\text{F}$), fips code, county name, and state name.
climate_df = pd.DataFrame(columns=['timestamp', 'tempf', 'fips', 'county', 'state'])
climate_df.head()
| timestamp | tempf | fips | county | state |
|---|---|---|---|---|
Adding fips information to Washington Post data
The Washington Post data covering the variation of temperatures over the period 1895-2019 is county-level data, where counties are represented by their Federal Information Processing Standards (FIPS) codes. As we will see later, including the county and state names will be helpful for visualization purposes.
We need to find a table that converts these codes to the county/state names before being able to join the two climate dataframes.
## obtaining the fips table from
## https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697
The NRCS page has a table including FIPS codes for every county in the US. We use a relatively straightforward way to convert this table into a useful lookup in Python. First, we need to get the contents of the HTML page.

import requests

response = requests.get('https://www.nrcs.usda.gov/wps/portal/nrcs/detail/national/home/?cid=nrcs143_013697')
The most important components of a response to check before moving forward are the status code and the content.
response.status_code
200
We get a response code of 200, meaning that our request was successful!
response.content[:100]
b'\r\n\r\n\r\n\r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n'
We can check the response text using BeautifulSoup. BeautifulSoup allows us to find our tag of interest by specifying one of its attributes. By inspecting the table tag, in this case we realize that it has an attribute class="data". Furthermore, the SoupStrainer class of the bs4 library can be used to limit our search to table tags only.
from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup(response.text,'html.parser',parse_only=SoupStrainer('table'))
table = soup.find("table", { "class" : "data" })
print(table.text[:50])
FIPS
Name
State
01001
Autauga
AL
01003
Bald
Looks OK! Next, we want to encapsulate this information into a list or DataFrame.
def get_row_data(tr, column_tag='td'):
    return [td.get_text(strip=True) for td in tr.find_all(column_tag)]

rows = []
trs = table.find_all('tr')
header_row = get_row_data(trs[0], 'th')  # header cells use <th> tags
for tr in trs:                           # for every table row
    rows.append(get_row_data(tr))        # data row (empty for the header row)
print(header_row)
['FIPS', 'Name', 'State']
rows[1:5]
[['01001', 'Autauga', 'AL'],
['01003', 'Baldwin', 'AL'],
['01005', 'Barbour', 'AL'],
['01007', 'Bibb', 'AL']]
fips_df_nrcs = pd.DataFrame(rows[1:],columns=['FIPS', 'county', 'state'])
fips_df_nrcs.head()
| | FIPS | county | state |
|---|---|---|---|
| 0 | 01001 | Autauga | AL |
| 1 | 01003 | Baldwin | AL |
| 2 | 01005 | Barbour | AL |
| 3 | 01007 | Bibb | AL |
| 4 | 01009 | Blount | AL |
We save the dataframe for later use.
#fips_df_nrcs.to_pickle('fips.pkl')
fips_df_nrcs=pd.read_pickle('fips.pkl')
Creating a DataFrame suited for Time Series Analysis
Now that I have the fips table ready, I have the following plan to populate the DataFrame:
- First, I filter the DataFrame rows by fips.
- Then, I reshape the monthly temperature columns into a single, temporally ordered column vector.
- Next, I add the timestamp column generated earlier; it temporally matches the temperatures in the previous column.
- Finally, I create an array of ($12\times\text{number of years}$) rows and 3 columns for the fips code, the county name, and the state name.
- I repeat the same procedure for the rest of the fips codes.
WP_df.head()
| | January | February | March | April | May | June | July | August | September | October | November | December | year | fips |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44.000 | 38.200 | 55.500 | 64.100 | 70.600 | 78.300 | 80.400 | 80.400 | 79.000 | 61.400 | 54.400 | 45.300 | 1895 | 01001 |
| 1 | 44.300 | 49.000 | 54.000 | 69.300 | 76.800 | 78.000 | 81.700 | 83.100 | 77.900 | 64.700 | 58.000 | 47.300 | 1896 | 01001 |
| 2 | 43.700 | 52.300 | 61.300 | 63.000 | 70.000 | 82.400 | 82.400 | 79.600 | 76.600 | 67.400 | 54.900 | 48.200 | 1897 | 01001 |
| 3 | 50.100 | 46.800 | 60.100 | 59.600 | 75.000 | 81.500 | 80.800 | 79.200 | 76.200 | 62.100 | 50.200 | 44.200 | 1898 | 01001 |
| 4 | 44.600 | 41.500 | 56.600 | 62.300 | 76.700 | 81.000 | 81.000 | 81.500 | 74.300 | 66.600 | 55.700 | 45.300 | 1899 | 01001 |
Creating the timestamp column
from itertools import product

all_years = WP_df.year.unique()
all_months = months
# A little test before going full on
for i, year_month in enumerate(product(all_years[:2], all_months)):
    print(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))
1895-01-01 00:00:00
1895-02-01 00:00:00
1895-03-01 00:00:00
1895-04-01 00:00:00
1895-05-01 00:00:00
1895-06-01 00:00:00
1895-07-01 00:00:00
1895-08-01 00:00:00
1895-09-01 00:00:00
1895-10-01 00:00:00
1895-11-01 00:00:00
1895-12-01 00:00:00
1896-01-01 00:00:00
1896-02-01 00:00:00
1896-03-01 00:00:00
1896-04-01 00:00:00
1896-05-01 00:00:00
1896-06-01 00:00:00
1896-07-01 00:00:00
1896-08-01 00:00:00
1896-09-01 00:00:00
1896-10-01 00:00:00
1896-11-01 00:00:00
1896-12-01 00:00:00
Looks fine! Moving on!
timestamps_list = []
for i, year_month in enumerate(product(all_years, all_months)):
    timestamps_list.append(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))
Sanity check!
assert((2019-1895+1)*12==len(timestamps_list))
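As a side note, pandas can generate the same month-start timestamps directly (a sketch equivalent to the loop above):

# 'MS' = month-start frequency, spanning January 1895 through December 2019
timestamps_index = pd.date_range(start='1895-01-01', end='2019-12-01', freq='MS')
assert len(timestamps_index) == len(timestamps_list)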
Again, a little test before iterating through the entire rows:
climate_df = pd.DataFrame(columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
climate_df.head()
| timestamp | tempf | fips | county | state |
|---|---|---|---|---|
for i, fips in enumerate(WP_df.fips.unique()[:2]):
    temperature_array = WP_df[WP_df['fips']==fips] \
                            .iloc[:,:12] \
                            .values \
                            .reshape(-1,1)
    fips_array = np.asarray([fips] * len(temperature_array)).reshape(-1,1)
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[fips].to_list()
    county_and_state_array = np.asarray(county_and_state * len(temperature_array)).reshape(-1,2)
    data_array = np.hstack([np.asarray(timestamps_list).reshape(-1,1),
                            temperature_array,
                            fips_array,
                            county_and_state_array])
    tmp_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
    climate_df = pd.concat([climate_df, tmp_df])
climate_df.head()
| | timestamp | tempf | fips | county | state |
|---|---|---|---|---|---|
| 0 | 1895-01-01 | 44.000 | 01001 | Autauga | AL |
| 1 | 1895-02-01 | 38.200 | 01001 | Autauga | AL |
| 2 | 1895-03-01 | 55.500 | 01001 | Autauga | AL |
| 3 | 1895-04-01 | 64.100 | 01001 | Autauga | AL |
| 4 | 1895-05-01 | 70.600 | 01001 | Autauga | AL |
Perfect! Now we can go ahead and create the DataFrame:
for i, fips in enumerate(WP_df.fips.unique()[2:]):
    temperature_array = WP_df[WP_df['fips']==fips] \
                            .iloc[:,:12] \
                            .values \
                            .reshape(-1,1)
    fips_array = np.asarray([fips] * len(temperature_array)).reshape(-1,1)
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[fips].to_list()
    county_and_state_array = np.asarray(county_and_state * len(temperature_array)).reshape(-1,2)
    data_array = np.hstack([np.asarray(timestamps_list).reshape(-1,1),
                            temperature_array,
                            fips_array,
                            county_and_state_array])
    tmp_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
    climate_df = pd.concat([climate_df, tmp_df])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
...
KeyError: '02001'
Even though the web scraping works fine, it seems the fips data obtained with the above method lacks a few fips codes, e.g., '02001'; therefore, I use a public dataset from Plotly to obtain more complete fips data:
fips_df_plotly = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/fips-unemp-16.csv", dtype={"fips": str})
fips_df_plotly.head()
| | fips | unemp |
|---|---|---|
| 0 | 01001 | 5.300 |
| 1 | 01003 | 5.400 |
| 2 | 01005 | 8.600 |
| 3 | 01007 | 6.600 |
| 4 | 01009 | 5.500 |
Comparing it with the initial fips data:
len(fips_df_nrcs), len(fips_df_plotly)
(3231, 3219)
Interestingly, the initial fips data has more entries! Let's compare the WP fips codes with the union of the two fips datasets that we have obtained so far:
len(set(WP_df.fips).difference(set(fips_df_plotly.fips).union(set(fips_df_nrcs.FIPS))))
1538
OK! Something's definitely wrong! What I do next is print the number of unique fips codes per state (the first two digits of a fips code denote the state) from the three sources to figure out what causes this discrepancy. My guess is that one dataset has shifted state codes, causing the mismatch. The minimum and maximum values of the state code are '01' ('AL' or 'Alabama') and '78' ('VI' or 'Virgin Islands').
state_codes = [f'{x:02d}' for x in range(1,73)]
print(state_codes)
['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14', '15', '16', '17', '18',
'19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36',
'37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54',
'55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72']
for state_code in state_codes:
    WP_unique_fips = len(WP_df[WP_df['fips'].str.startswith(state_code)]['fips'].unique())
    unique_fips_df_nrcs = len(fips_df_nrcs[fips_df_nrcs['FIPS'].str.startswith(state_code)]['FIPS'].unique())
    unique_fips_df_plotly = len(fips_df_plotly[fips_df_plotly['fips'].str.startswith(state_code)]['fips'].unique())
    print(state_code, WP_unique_fips, unique_fips_df_nrcs, unique_fips_df_plotly)
01 67 67 67
02 15 27 29
03 75 0 0
04 58 15 15
05 64 75 75
06 8 58 58
07 3 0 0
08 67 63 64
09 159 8 8
10 44 3 3
11 102 1 1
12 92 67 67
13 99 159 159
14 105 0 0
15 120 4 4
16 64 44 44
17 16 102 102
18 25 92 92
19 14 99 99
20 83 105 105
21 87 120 120
22 82 64 64
23 115 16 16
24 56 24 24
25 93 14 14
26 17 83 83
27 10 87 87
28 21 82 82
29 33 115 115
30 62 57 56
31 100 93 93
32 53 17 17
33 88 10 10
34 77 21 21
35 36 33 33
36 67 62 62
37 5 100 100
38 46 53 53
39 66 88 88
40 95 77 77
41 254 36 36
42 29 67 67
43 14 0 0
44 132 5 5
45 39 46 46
46 55 66 66
47 72 95 95
48 23 254 254
49 0 29 29
50 29 14 14
51 0 136 133
52 0 0 0
53 0 39 39
54 0 55 55
55 0 72 72
56 0 23 23
57 0 0 0
58 0 0 0
59 0 0 0
60 0 3 0
61 0 0 0
62 0 0 0
63 0 0 0
64 0 0 0
65 0 0 0
66 0 1 0
67 0 0 0
68 0 0 0
69 0 3 0
70 0 0 0
71 0 0 0
72 0 76 78
Looks like our suspicion was correct! By visually inspecting the table, we can see that, except for the first row, the numbers in the first column match the numbers in the second and third columns a few rows apart. We can fix the problem by defining a dictionary that takes the wrong state_codes as keys and returns the correct state_codes as values. We also note that the second fips dataset, obtained from Plotly, better matches the WP fips codes.
state_correction_dict = {}
fips_count_per_state = []
avail_states = []
for state_code in state_codes:
    unique_fips_df_plotly = len(fips_df_plotly[fips_df_plotly['fips'].str.startswith(state_code)]['fips'].unique())
    if unique_fips_df_plotly!=0:
        avail_states.append(state_code)
        fips_count_per_state.append(unique_fips_df_plotly)
fips_dict = dict(zip(avail_states,fips_count_per_state))
for i, state_code in enumerate(state_codes):
    print(state_code+":")
    WP_unique_fips = len(WP_df[WP_df['fips'].str.startswith(state_code)]['fips'].unique())
    if WP_unique_fips!=0:
        if fips_count_per_state.count(WP_unique_fips)==1:
            # exactly one state has this many counties: unambiguous match
            index = fips_count_per_state.index(WP_unique_fips)
            state_correction_dict[state_code] = avail_states[index]
            print(f'found a matching state! the state code {avail_states[index]} is mistakenly denoted by {state_code}')
        elif fips_count_per_state.count(WP_unique_fips)>1:
            # several candidate states share this count: resolve interactively
            indices = [i for i, e in enumerate(fips_count_per_state) if e == WP_unique_fips]
            print('found multiple matching states!')
            print([avail_states[i] for i in indices])
            state_correction_dict[state_code] = input('which code is the correct one?')
        else:
            state_correction_dict[state_code] = input('manually assign the correct code?')
01:
found multiple matching states!
['01', '12', '42']
which code is the correct one?01
02:
found a matching state! the state code 04 is mistakenly denoted by 02
03:
found a matching state! the state code 05 is mistakenly denoted by 03
04:
found a matching state! the state code 06 is mistakenly denoted by 04
05:
found multiple matching states!
['08', '22']
which code is the correct one?08
06:
found a matching state! the state code 09 is mistakenly denoted by 06
07:
found a matching state! the state code 10 is mistakenly denoted by 07
08:
found multiple matching states!
['01', '12', '42']
which code is the correct one?12
09:
found a matching state! the state code 13 is mistakenly denoted by 09
10:
found a matching state! the state code 16 is mistakenly denoted by 10
11:
found a matching state! the state code 17 is mistakenly denoted by 11
12:
found a matching state! the state code 18 is mistakenly denoted by 12
13:
found a matching state! the state code 19 is mistakenly denoted by 13
14:
found a matching state! the state code 20 is mistakenly denoted by 14
15:
found a matching state! the state code 21 is mistakenly denoted by 15
16:
found multiple matching states!
['08', '22']
which code is the correct one?22
17:
found a matching state! the state code 23 is mistakenly denoted by 17
18:
manually assign the correct code?24
19:
found multiple matching states!
['25', '50']
which code is the correct one?25
20:
found a matching state! the state code 26 is mistakenly denoted by 20
21:
found a matching state! the state code 27 is mistakenly denoted by 21
22:
found a matching state! the state code 28 is mistakenly denoted by 22
23:
found a matching state! the state code 29 is mistakenly denoted by 23
24:
found a matching state! the state code 30 is mistakenly denoted by 24
25:
found a matching state! the state code 31 is mistakenly denoted by 25
26:
found a matching state! the state code 32 is mistakenly denoted by 26
27:
found a matching state! the state code 33 is mistakenly denoted by 27
28:
found a matching state! the state code 34 is mistakenly denoted by 28
29:
found a matching state! the state code 35 is mistakenly denoted by 29
30:
found a matching state! the state code 36 is mistakenly denoted by 30
31:
found a matching state! the state code 37 is mistakenly denoted by 31
32:
found a matching state! the state code 38 is mistakenly denoted by 32
33:
found a matching state! the state code 39 is mistakenly denoted by 33
34:
found a matching state! the state code 40 is mistakenly denoted by 34
35:
found a matching state! the state code 41 is mistakenly denoted by 35
36:
found multiple matching states!
['01', '12', '42']
which code is the correct one?42
37:
found a matching state! the state code 44 is mistakenly denoted by 37
38:
found a matching state! the state code 45 is mistakenly denoted by 38
39:
found a matching state! the state code 46 is mistakenly denoted by 39
40:
found a matching state! the state code 47 is mistakenly denoted by 40
41:
found a matching state! the state code 48 is mistakenly denoted by 41
42:
found multiple matching states!
['02', '49']
which code is the correct one?49
43:
found multiple matching states!
['25', '50']
which code is the correct one?50
44:
manually assign the correct code?51
45:
found a matching state! the state code 53 is mistakenly denoted by 45
46:
found a matching state! the state code 54 is mistakenly denoted by 46
47:
found a matching state! the state code 55 is mistakenly denoted by 47
48:
found a matching state! the state code 56 is mistakenly denoted by 48
49:
50:
found multiple matching states!
['02', '49']
which code is the correct one?02
51:
52:
53:
54:
55:
56:
57:
58:
59:
60:
61:
62:
63:
64:
65:
66:
67:
68:
69:
70:
71:
72:
print(state_correction_dict)
{'01': '01', '02': '04', '03': '05', '04': '06', '05': '08', '06': '09', '07': '10',
'08': '12', '09': '13', '10': '16', '11': '17', '12': '18', '13': '19', '14': '20',
'15': '21', '16': '22', '17': '23', '18': '24', '19': '25', '20': '26', '21': '27',
'22': '28', '23': '29', '24': '30', '25': '31', '26': '32', '27': '33', '28': '34',
'29': '35', '30': '36', '31': '37', '32': '38', '33': '39', '34': '40', '35': '41',
'36': '42', '37': '44', '38': '45', '39': '46', '40': '47', '41': '48', '42': '49',
'43': '50', '44': '51', '45': '53', '46': '54', '47': '55', '48': '56', '50': '02'}
By inspection, I figured out that the state_code 18, with 25 unique fips codes, doesn't match any entry in the fips data; but because it follows the order of the neighboring rows, I assigned it to the state code 24, which has 24 unique fips entries. This means there's likely a fips code that we will have to remove.
WP_df[WP_df['fips'].str.startswith('18')]['fips'].unique()
array(['18001', '18003', '18005', '18009', '18011', '18013', '18015', '18017', '18019',
'18021', '18023', '18025', '18027', '18029', '18031', '18033', '18035', '18037',
'18039', '18041', '18043', '18045', '18047', '18510', '18511'], dtype=object)
fips_df_plotly[fips_df_plotly['fips'].str.startswith('24')]['fips'].unique()
array(['24001', '24003', '24005', '24009', '24011', '24013', '24015', '24017', '24019', '24021', '24023', '24025', '24027', '24029', '24031', '24033', '24035', '24037', '24039', '24041', '24043', '24045', '24047', '24510'], dtype=object)
Comparing the two outputs, we notice that fips='24511', the correct code for the observations with fips='18511', is not among the known fips codes. We either need to drop those observations from the climate data or add fips='24511' to the fips DataFrame. We go with the second option. Before adding it to our fips data, we first update the NRCS fips data, fips_df_nrcs, with the fips codes that don't exist in fips_df_nrcs but are tabulated in fips_df_plotly.
The fips codes that exist in the Plotly data but not on the NRCS page:
new_fips_entries = set(fips_df_plotly.fips).difference(set(fips_df_nrcs.FIPS))
new_fips_entries
{'02105',
'02158',
'02195',
'02198',
'02230',
'02275',
'08014',
'46102',
'72039',
'72069'}
And the fips code that exists in the WP data but in neither the NRCS nor the Plotly data:
new_fips_entries.add('24511')
for i, fips_code in enumerate(new_fips_entries):
    state = fips_df_nrcs[fips_df_nrcs['FIPS'].str.startswith(fips_code[:2])]['state'].iloc[0]
    # note: DataFrame.append was deprecated and later removed; with pandas >= 2.0 use pd.concat instead
    fips_df_nrcs = fips_df_nrcs.append({'FIPS': fips_code, 'county':'UNKNOWN', 'state':state}, ignore_index=True)
fips_df_nrcs.tail(15)
| | FIPS | county | state |
|---|---|---|---|
| 3228 | 78010 | St. Croix | VI |
| 3229 | 78020 | St. John | VI |
| 3230 | 78030 | St. Thomas | VI |
| 3231 | 01001 | UNKNOWN | AL |
| 3232 | 24511 | UNKNOWN | MD |
| 3233 | 46102 | UNKNOWN | SD |
| 3234 | 02158 | UNKNOWN | AK |
| 3235 | 02195 | UNKNOWN | AK |
| 3236 | 02198 | UNKNOWN | AK |
| 3237 | 02230 | UNKNOWN | AK |
| 3238 | 02105 | UNKNOWN | AK |
| 3239 | 08014 | UNKNOWN | CO |
| 3240 | 02275 | UNKNOWN | AK |
| 3241 | 72039 | UNKNOWN | PR |
| 3242 | 72069 | UNKNOWN | PR |
Adding the missing FIPS county names
Finally, I decided to find the missing counties and add their names to the fips DataFrame. The only fips code for which I was unable to find a corresponding county was 24511.
missing_counties_dict = {
'02105':['Hoonah Angoon','AK'],
'02158':['Kusilvak','AK'],
'02195':['Petersburg Borough','AK'],
'02198':['Prince of Wales-Hyder Census Area','AK'],
'02230':['Skagway Municipality','AK'],
'02275':['Wrangell City and Borough' ,'AK'],
'08014':['Broomfield County','CO'],
'72039':['Ciales Municipio','PR'],
'72069':['Humacao Municipio','PR'],
'46102':['Oglala Lakota','SD']
}
assert(len(fips_df_nrcs[fips_df_nrcs['county']=='UNKNOWN'])==len(new_fips_entries))
fips_df_nrcs.to_pickle('fips_df_new.pkl') # save
fips_df_nrcs.head()
| | FIPS | county | state |
|---|---|---|---|
| 0 | 01003 | Baldwin | AL |
| 1 | 01005 | Barbour | AL |
| 2 | 01007 | Bibb | AL |
| 3 | 01009 | Blount | AL |
| 4 | 01011 | Bullock | AL |
Now we repeat the previous loop to construct the DataFrame; however, instead of updating the DataFrame by concatenating each new DataFrame to it, we append the results to predefined lists, as the first approach becomes inefficient for large DataFrames.
# preview for a single fips code (wrong_fips): the reshaped monthly temperatures
WP_df[WP_df['fips']==wrong_fips] \
    .iloc[:,:12].values.reshape(-1,1)
array([[28.2],
[26.4],
[46. ],
...,
[58.1],
[41.4],
[42. ]])
from datetime import datetime

climate_df = pd.DataFrame(columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
num_fips = len(WP_df.fips.unique())
num_years = (2019-1895+1)
num_months = 12
step = 500 # num iterations per timing report
start_time = datetime.now()
now = start_time
fips_list = []
temps_list = []
county_and_states_list = []
timestamps = []
for i, wrong_fips in enumerate(WP_df.fips.unique()):
    # wrong_fips is ok for looking up the temperatures
    temps = WP_df[WP_df['fips']==wrong_fips] \
                .iloc[:,:12] \
                .values \
                .reshape(-1,1)
    correct_fips = state_correction_dict[wrong_fips[:2]]+wrong_fips[2:]
    temps_list.extend(temps)
    fips_list.extend([correct_fips] * len(temps))
    county_and_state = fips_df_nrcs.set_index('FIPS').loc[correct_fips].to_list()
    county_and_states_list.extend(county_and_state * len(temps))
    # Temperature data for a few counties does not cover the entire 1895-2019 range
    if len(temps)!=1500:
        timestamps_ = []
        years_ = WP_df[WP_df['fips']==wrong_fips].year.unique()
        # use j here so we don't shadow the outer loop counter i
        for j, year_month in enumerate(product(years_, all_months)):
            timestamps_.append(pd.to_datetime(year_month[0]+year_month[1], format='%Y%B'))
        timestamps.extend(timestamps_)
    else:
        timestamps.extend(timestamps_list)
    if i%step==0:
        print(f"processing the past {step} fips codes took {(datetime.now()-now).total_seconds():.2f} seconds")
        print(f"total time spent so far is {(datetime.now()-start_time).total_seconds():.2f} seconds")
        print(f'{num_fips-i} fips_code remain to be processed!')
        now = datetime.now()
processing the past 500 fips codes took 0.04 seconds
total time spent so far is 0.05 seconds
3136 fips_code remain to be processed!
processing the past 500 fips codes took 11.80 seconds
total time spent so far is 11.85 seconds
2636 fips_code remain to be processed!
processing the past 500 fips codes took 11.16 seconds
total time spent so far is 23.00 seconds
2136 fips_code remain to be processed!
processing the past 500 fips codes took 11.90 seconds
total time spent so far is 34.90 seconds
1636 fips_code remain to be processed!
processing the past 500 fips codes took 28.75 seconds
total time spent so far is 63.65 seconds
1136 fips_code remain to be processed!
processing the past 500 fips codes took 11.99 seconds
total time spent so far is 75.64 seconds
636 fips_code remain to be processed!
processing the past 500 fips codes took 10.69 seconds
total time spent so far is 86.33 seconds
136 fips_code remain to be processed!
data_array = np.hstack([np.reshape(timestamps,(-1,1)),
np.reshape(temps_list,(-1,1)),
np.reshape(fips_list,(-1,1)),
np.reshape(county_and_states_list,(-1,2))])
climate_df = pd.DataFrame(data_array, columns = ['timestamp', 'tempf', 'fips', 'county', 'state'])
Since the counties with missing names were not that many, and for the sake of completeness, I decided to manually add the UNKNOWN county names. I realized some county names are missing because they have either been modified or recently (after 2015) been added to the FIPS data. Applying the missing_counties_dict defined earlier:
for k, v in missing_counties_dict.items():
    row_index = climate_df[climate_df['fips']==k].index
    climate_df.loc[row_index,'county'] = v[0]
climate_df.head()
| | timestamp | tempf | fips | county | state |
|---|---|---|---|---|---|
| 0 | 1895-01-01 | 44.000 | 01001 | Autauga | AL |
| 1 | 1895-02-01 | 38.200 | 01001 | Autauga | AL |
| 2 | 1895-03-01 | 55.500 | 01001 | Autauga | AL |
| 3 | 1895-04-01 | 64.100 | 01001 | Autauga | AL |
| 4 | 1895-05-01 | 70.600 | 01001 | Autauga | AL |
Finally, before saving the data, let's cast the columns to the correct data types.
climate_df.dtypes
timestamp datetime64[ns]
tempf object
fips object
county object
state object
dtype: object
climate_df['tempf'] = climate_df['tempf'].astype(float)
climate_df['fips'] = climate_df['fips'].astype(str)
climate_df['county'] = climate_df['county'].astype(str)
climate_df['state'] = climate_df['state'].astype(str)
# save
climate_df.to_pickle('WP_climate_data.pkl')
Great! Now that we are done with data cleaning, we can start Time Series forecasting!
Time Series Analysis
Time series forecasting, or prediction, presents a different problem compared to other ML problems in that it introduces a time dimension. All the features in an observation row of time series data have one thing in common: they were observed at the same time. We can think of time as the index of our DataFrame, as it orders the observations temporally.
The goal of time series prediction is to make predictions for some time in the future. Due to the temporal and periodic nature of time, time series models usually incorporate different methods to capture different time scales. For instance, long-term "trends" (AKA "drift"), periodic trends (AKA "seasonality"), and the "residual noise" are modeled using different methods and assumptions that best fit that aspect of the time component.
Making Predictions
When predicting a variable in the future, we need a good understanding of the temporal nature of the target variable. Some of the questions we might want to ask ourselves before attempting to design a model are:
- What is the resolution of the data that we have? Is it yearly, monthly, weekly, etc.?
- How much data do we have? 1 year? 1 decade? ...
- What is the time window that we would like the model to predict? Is it the next day? Short term? Long term? ...
- Will we update the data, or is it going to be used just once?
Model Validation
As one might intuitively expect, model validation is slightly different from the common validation techniques used in ML, in that every observation in the validation data has to correspond to a time after the most recent observation in the training data. It is also important to take into account potential relationships between observations, e.g., seasonal, monthly, or hourly correlations, to build the most accurate model. A minimal sketch of such a time-ordered split follows.
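For instance, scikit-learn's TimeSeriesSplit yields folds in which every validation window comes strictly after the training window (a toy sketch, not the validation scheme used in part 2):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(120).reshape(-1, 1)  # e.g., ten years of monthly observations
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    # every test window starts after the last training observation
    print(f"train: [{train_idx[0]}..{train_idx[-1]}], test: [{test_idx[0]}..{test_idx[-1]}]")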
Applications
Time series models can be applied to many problems in different areas, for example:
- Forecasting the Bitcoin price
- Forecasting the weather
- Forecasting traffic
- Forecasting housing prices
- Forecasting precipitation
- ...
Modeling the different components of a time series
A time series $X_t$ can be expressed as the sum of drift, seasonal, and noise components: $$ X_t = X_{d,t} + X_{s,t} + X_{n,t}, $$ where $t$ denotes the time index. We aim to develop a mathematical model for each of these constituents so that we can predict the series at future times.
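As a quick illustration of this decomposition (a sketch using the off-the-shelf seasonal_decompose from statsmodels, not the custom model developed in part 2), we can split one county's monthly series into the three components:

from statsmodels.tsa.seasonal import seasonal_decompose

# monthly temperature series for one county (Autauga, AL), indexed by timestamp
series = (climate_df[climate_df['fips'] == '01001']
          .set_index('timestamp')['tempf']
          .asfreq('MS'))

# additive decomposition with a 12-month period: X_t = X_{d,t} + X_{s,t} + X_{n,t}
decomposition = seasonal_decompose(series, model='additive', period=12)
drift, seasonal, noise = decomposition.trend, decomposition.seasonal, decomposition.resid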
Modeling the general trend, AKA "drift"
The general, long-term time series behavior is referred to as the trend. The best way to check for the presence of such behavior is to look at the series over a long period of time. For instance, we can take a look at the average temperature of Pennsylvania from 1895 to 2019. We use annual averages, seasonal averages, and monthly data to plot the temperature variation at three different time scales, and use that data to calculate the best linear fit.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(3,1,figsize=(12,12),dpi=80)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.25)
state = 'PA'
state_df = climate_df[climate_df['state']==state].copy()
# Set the timestamp as index
state_df.set_index('timestamp', inplace=True)
year_min = state_df.index.year.min()
year_max = state_df.index.year.max()
# We define a lookup array that, given a month index (1, 2, ...), returns the season
month_to_season_arr = np.array([
None,
'Winter', 'Winter',
'Spring', 'Spring', 'Spring',
'Summer', 'Summer', 'Summer',
'Fall', 'Fall', 'Fall',
'Winter'
])
season_groupby = month_to_season_arr[state_df.index.month]
groupby_list = [
[(state_df.index.year)],
[(state_df.index.year), season_groupby],
[(state_df.index.year),(state_df.index.month)]
]
titles = ['Annual', 'Seasonal', 'Monthly']
for i, ax_ in enumerate(ax):
    state_df.groupby(groupby_list[i]).mean().plot(ax=ax_)
    ax_.set_xlabel('')
    ax_.legend([]) # remove the legends
    ax_.set_title(f'{titles[i]} average temperature ($^\circ$F) in {state} from {year_min} to {year_max}',
                  fontsize=18)
Notice the linear upward trend in temperatures from 1895 to 2019 in Figure 1. Also notice that higher resolution, e.g., plotting monthly instead of yearly averages, makes visualizing the general trend more difficult. This makes sense, because annual temperature averages fluctuate far less than monthly data. The effect is weaker between the seasonal and monthly plots, because monthly temperatures do not vary much within a season; seasonal averages therefore remain close to the monthly temperatures of the months belonging to that season.
Had we plotted the series over a shorter time period, we might not have noticed this trend. For instance, looking at the last 30 years instead of the entire time period shows:
fig, ax = plt.subplots(1,1,figsize=(12, 4),dpi=300)
last_many_years = 30
state_df.groupby(state_df.index.year).mean()[-last_many_years:].plot(ax=ax, marker='o')
ax.set_xlabel('')
ax.legend([])
ax.set_title(f'{titles[0]} average temperature ($^\circ$F) in {state} from {year_max-last_many_years} to {year_max}',
fontsize=14);
Therefore, revealing the long-term trend in time series data requires observations reaching significantly far into the past. In the next part, we go through common ways to model the trend.
Modeling trend using linear regression
For the sake of convenience, I use the LinearRegression class from scikit-learn to find the best linear fit to the time series data. We use the same data as in the previous part (annual temperature averages of PA).
from sklearn.linear_model import LinearRegression

# note: the normalize argument was removed in scikit-learn 1.2; drop it on newer versions
linear_model_y = LinearRegression(normalize=True, fit_intercept=True)
x_data_y = state_df.index.year.unique().values.reshape(-1,1)
y_data_y = state_df.groupby(state_df.index.year).mean()
linear_model_y.fit(x_data_y, y_data_y)
print(f'linear coefficient={linear_model_y.coef_[0][0]:.4f} and intercept={linear_model_y.intercept_[0]:.2f} ')
linear coefficient=0.0150 and intercept=19.24
Similarly, we can use the monthly data for linear regression:
linear_model_m = LinearRegression(normalize=True, fit_intercept=True)
y_data_m = state_df.groupby([(state_df.index.year),(state_df.index.month)]).mean()
x_data_m = np.linspace(1895, 2019, len(y_data_m)).reshape(-1,1)
linear_model_m.fit(x_data_m, y_data_m)
print(f'linear coefficient={linear_model_m.coef_[0]:.4f} and intercept={linear_model_m.intercept_:.2f}')
linear coefficient=0.0161 and intercept=17.05
Notice the slight difference between the two fits. We can visualize the overlay of the linear model's prediction and the actual data. The linear model represents the drift constituent of the time series model, $X_{d,t}$.
import seaborn as sns

sns.set(rc={'axes.facecolor':'whitesmoke', 'figure.facecolor':'gainsboro', 'legend.facecolor':'white'})
fig, ax = plt.subplots(2, 1, figsize=(12, 8), dpi=300)
plt.subplots_adjust(left=0, bottom=0, right=1, top=1, wspace=0, hspace=0.4)
a_y, b_y = linear_model_y.coef_[0][0], linear_model_y.intercept_[0] # f(x) = a.x + b (year)
a_m, b_m = linear_model_m.coef_[0], linear_model_m.intercept_ # f(x) = a.x + b (month)
y_pred_y = a_y*x_data_y + b_y
y_pred_m = a_m*x_data_m + b_m
y_data = [y_data_y, y_data_m]
x_data = [x_data_y, x_data_m]
y_pred = [y_pred_y, y_pred_m]
titles = ['Annual', 'Monthly']
for i, ax_ in enumerate(ax):
    ax_.plot(x_data[i], y_data[i], color='deepskyblue', linestyle='-', label='data')
    ax_.plot(x_data[i], y_pred[i], color='maroon', linestyle='--', label='linear fit')
    ax_.set_xlabel('year', fontsize=18, labelpad=10)
    ax_.set_ylabel('temperature ($^\circ$F)', fontsize=18, labelpad=10)
    if i==0:
        ax_.legend(bbox_to_anchor=(1,1,0,0), fontsize=14, loc='upper left')
    ax_.set_title(f'Linear fit of time series (using {titles[i]} data)', fontsize=22)
    ax_.set_xlim([1894, 2020])
    ax_.grid(False)
    plt.setp(ax_.get_xticklabels(), fontsize=14)
    plt.setp(ax_.get_yticklabels(), fontsize=14)
Now that we know how to model the long-term behavior of the temperature, we can start visualizing the temperature data to see if we can gain any insights using this extremely simple model.
Visualizing the temperature variations in different US regions
Let's take a look at the temperature variations in the New England (sorry!) states, i.e., Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont.
fig, ax = plt.subplots(1,1,figsize=(12,6),dpi=100)
states_list = ['MA', 'RI', 'CT', 'ME', 'NH', 'VT']
temp_unit = 'F' # 'C' for Celsius or 'F' for Fahrenheit
for state in states_list:
    state_df = climate_df[climate_df['state']==state].set_index("timestamp")
    means = state_df.groupby(state_df.index.year) \
                    .agg({'temp'+temp_unit.lower(): 'mean'})
    years = means.index.tolist()
    ax.plot(years, means, label=state)
ax.set_xlabel('year', labelpad=10, fontsize=18)
ax.set_ylabel(f'Temperature ($^\circ${temp_unit})', labelpad=15, fontsize=18)
states = ', '.join([state for state in states_list])
ax.set_title(f'Temperature variation over the period {min(years)}-{max(years)} for \n{states}',
fontsize=22, y=1.02)
plt.setp(ax.get_xticklabels(), fontsize=14)
plt.setp(ax.get_yticklabels(), fontsize=14)
ax.legend(bbox_to_anchor=(1, 0, 0.1, 1), loc='upper left', fontsize=14);
Notice that we see two distinct groups of states in Figure 5: the Northern ('NH', 'ME', and 'VT') and Southern ('CT', 'MA', and 'RI') New England states.
I wrapped the code above in a function that displays the temperature variations for a state or group of states, where the states are color-coded by their mean temperature.
def plot_ts_temp(df, states_list, ax=None,
                 figsize=(15,10), dpi=300, cmap='coolwarm',
                 temp_unit='F', lbfs=18, lbpdx=10, lbpdy=15,
                 ttlfs=22, tckfs=14, title='', xlab='year'):
    # col_to_hex is a small helper (defined elsewhere) returning n hex colors from cmap
    colors = col_to_hex(cmap, n=len(states_list))
    # mean temperature per state; we use this to sort states_list by temperature
    means_df = df.query("state in @states_list").groupby('state')[['tempf']].mean()
    temps = means_df['tempf'].values
    sorted_by_temp_states = means_df.index[np.argsort(temps)]
    if ax is None:
        fig = plt.figure(figsize=figsize, dpi=dpi)
        ax = fig.add_subplot(111)
    print(sorted_by_temp_states)
    for i, state in enumerate(sorted_by_temp_states):
        state_df = df[df['state']==state].set_index('timestamp')
        means = state_df.groupby(state_df.index.year) \
                        .agg({'temp'+temp_unit.lower(): 'mean'})
        years = means.index.tolist()
        ax.plot(years, means, label=state, color=colors[i])
    ax.set_xlabel(xlab, labelpad=lbpdx, fontsize=lbfs)
    ax.set_ylabel(f'Temperature ($^\circ${temp_unit})', labelpad=lbpdy, fontsize=lbfs)
    states = ', '.join([state for state in states_list])
    if title is not None:
        ax.set_title(f'Temperature variation over the period {min(years)}-{max(years)} for \n{states}',
                     fontsize=ttlfs, y=1.02)
    plt.setp(ax.get_xticklabels(), fontsize=tckfs)
    plt.setp(ax.get_yticklabels(), fontsize=tckfs)
    ax.legend(bbox_to_anchor=(1, 0, 0.1, 1), loc='upper left', fontsize=tckfs)
This time, let's take a look at the East North Central states, i.e., Illinois, Indiana, Michigan, Ohio, and Wisconsin.
plot_ts_temp(climate_df, ['IL', 'IN', 'MI', 'OH', 'WI'])
Figure 6 shows something similar to what we saw in Figure 5: the temperature variations for the East North Central states likewise split into two groups, the Northern (WI and MI) and Southern (IL, IN, and OH) states of the region.
Finally, we plot the two groups of states, New England and East North Central, together.
fig, ax = plt.subplots(1,1,figsize=(8,8),sharex=True)
plot_ts_temp(climate_df, ['IL', 'IN', 'MI', 'OH', 'WI', 'MA', 'RI', 'CT', 'ME', 'NH', 'VT'],
ax=ax, title=None, cmap='coolwarm', xlab='year')
As one might have expected, the Northern and Southern states of the two regions are now adjacent in Figure 7, where we can see a distinct gap between the Northern ('NH', 'ME', 'VT', 'WI', and 'MI') and Southern ('CT', 'MA', 'RI', 'IL', 'IN', and 'OH') states of the union of the two regions.
Visualizing temperature heatmaps using Plotly
Plotly's choropleth maps are an efficient, beautiful way of representing and analyzing the spatial variation of a quantity. In addition, if we split our time series data into separate groups, e.g., monthly aggregates, Plotly's slider element can be used to look into the variation of temperature across months, seasons, etc.
In this part of the project, I will first go through visualizing the mean temperatures using Plotly, and then I'll demonstrate how the slider element can be used to explore the temperatures and temperature variation rates at different scales, e.g., state-level, county-level, etc. A quick preview of the slider idea is sketched below.
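As a minimal preview of the slider idea (a sketch using plotly.express rather than the dict-style API used below; it assumes the counties GeoJSON loaded later in this section, and keeps only a few years so the figure stays light):

import plotly.express as px

# annual county means with a year slider
annual = (climate_df.assign(year=climate_df['timestamp'].dt.year)
                    .groupby(['fips', 'year'], as_index=False)['tempf'].mean())
recent = annual[annual['year'] >= 2015]

fig = px.choropleth(recent, geojson=counties, locations='fips', color='tempf',
                    animation_frame='year', scope='usa',
                    range_color=(recent['tempf'].min(), recent['tempf'].max()))
fig.show()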
state-level temperatures in 2019
# mean state temperature in 2019
state_avg_temp_2019 = climate_df[climate_df['timestamp'].dt.year==2019]. \
groupby('state')['tempf'].agg('mean')
state_avg_temp_2019 = state_avg_temp_2019.reset_index()
state_avg_temp_2019.head()
| | state | tempf |
|---|---|---|
| 0 | AK | 36.110 |
| 1 | AL | 65.165 |
| 2 | AR | 61.176 |
| 3 | AZ | 61.758 |
| 4 | CA | 57.551 |
Choropleth maps use:

- a data dictionary that contains the variable to be plotted, as well as some geometrical attributes of the locations to which the data belong
- a layout dictionary that controls the appearance of the map onto which the data are plotted.
The following piece of code plots the average state temperatures in 2019.
from urllib.request import urlopen
import json
import plotly.graph_objects as go
import plotly.io as pio
from plotly.offline import iplot

with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

# color range for the state-level map
min_temp, max_temp = state_avg_temp_2019['tempf'].min(), state_avg_temp_2019['tempf'].max()
data = dict(
type='choropleth', # type of map-plot
colorbar = dict(
x=0.9, #x-location of the colorbar wrt to (0,0)
title=dict(text="T (°F)",
font=dict(size=20)),
tickformat=".0f",
tickfont=dict(size=20)
), # colorbar title and tick format
geojson=counties, # GeoJSON data
locationmode="USA-states", # for state-level graphs
name='', # gets rid of the word 'trace' appearing when hover
locations = state_avg_temp_2019['state'], # the state column
z = np.round(state_avg_temp_2019['tempf'], 2), # the heatmap variable
zmax=max_temp, # max temperature
zmin=min_temp, # min temperature
autocolorscale = False
)
layout = dict(
title=dict(
text = "State-level temperature heatmap in 2019",
x=0.5,
font=dict(size=30),
), # title and location adjustment
geo = dict(
scope='usa',
projection=dict(type='albers usa')
)
)
fig=go.Figure(data=data,layout=layout)
fig.update_traces(marker_line_width=0)
# overlay a scatter plot showing the state names!
fig.add_scattergeo(locations=state_avg_temp_2019['state'],
locationmode='USA-states',
text=state_avg_temp_2019['state'],
mode='text',
showlegend = False,
hoverinfo='none',
textfont={'color':'white',
'size':20})
# pdf format generates the highest quality!
pio.write_image(fig, 'state_temps_2019.pdf', width=2000, height=1000)
iplot(fig)
county-level temperatures in 2019
We can reproduce Figure 8 in greater detail by calculating the mean temperatures for every county instead of every state.
county_avg_temp_2019 = climate_df[climate_df['timestamp'].dt.year==2019]. \
groupby(['fips','county','state'])['tempf'].agg('mean')
county_avg_temp_2019 = county_avg_temp_2019.reset_index()
county_avg_temp_2019.head()
| | fips | county | state | tempf |
|---|---|---|---|---|
| 0 | 01001 | Autauga | AL | 66.608 |
| 1 | 01003 | Baldwin | AL | 68.792 |
| 2 | 01005 | Barbour | AL | 67.133 |
| 3 | 01007 | Bibb | AL | 64.746 |
| 4 | 01009 | Blount | AL | 63.583 |
min_temp, max_temp = county_avg_temp_2019.describe()['tempf'][['min', 'max']].values
data = dict(
type='choropleth', # type of map-plot
colorbar = dict(
x=0.9, #x-location of the colorbar wrt to (0,0)
title=dict(text="T (°F)",
font=dict(size=20)),
tickformat=".0f",
tickfont=dict(size=20)
), # colorbar title and tick format
geojson=counties, # GeoJSON data
name='', # gets rid of the word 'trace' appearing when hover
locations = county_avg_temp_2019['fips'], # fips
z = np.round(county_avg_temp_2019['tempf'], 2), # the heatmap variable
zmax=max_temp, # max temperature
zmin=min_temp, # min temperature
text=county_avg_temp_2019['county']+' ('+county_avg_temp_2019['state']+')',
autocolorscale = False
)
layout = dict(
title=dict(
text = "County-level temperature heatmap in 2019",
x=0.5,
font=dict(size=30),
), # title and location adjustment
geo = dict(
scope='usa',
projection=dict(type='albers usa')
)
)
fig=go.Figure(data=data,layout=layout)
fig.update_traces(marker_line_width=0)
pio.write_image(fig, 'county_temps_2019.pdf', width=2000, height=1000)
iplot(fig)
County-level temperature variation rate in the US (°F per 100 years)
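As a starting point for this map, here is a minimal sketch (reusing the linear-drift idea from above; the variable names are mine) of how a per-county warming rate could be computed. The resulting Series can then be passed to the county-level choropleth recipe above via locations and z:

from sklearn.linear_model import LinearRegression

# slope of a linear fit to each county's annual means, scaled to °F per century
rates = {}
for fips_code, group in climate_df.set_index('timestamp').groupby('fips'):
    annual = group.groupby(group.index.year)['tempf'].mean()
    X = annual.index.values.reshape(-1, 1)   # years
    fit = LinearRegression().fit(X, annual.values)
    rates[fips_code] = fit.coef_[0] * 100    # °F per 100 years

county_rates = pd.Series(rates, name='rate_f_per_century')
print(county_rates.describe())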