2) Tutorial - Advanced integration of a data set¶
General workfow steps¶
Convert data and manage quality
Convertion to numerical format
define units of entities
Recognizable identifiers for regions (ISO + predefined regions)
Mapping of meta data
Mapping to required meta data: entity, category, scenario, model, source
Storing additional meta data as required
Tracking of who processed the data and when
Origin of data ( URL for download or website)
Transparent documentation of read-in (ToDo)
Data download¶
Human development index from https://hdr.undp.org/data-center
Old file name: “HDR21-22_Composite_indices_complete_time_series.csv”
Understanding the data structure¶
What is the naming convention?
Is the data structured in columns or rows?
What is the bests approach to read the required data?
How to approach the conversion to the naming convention?
Excercises: Data Read-in¶
Steps:
Download meta data
Find mapping for column names to extract the indices for the individual datatables
Extract data for
Inequality-adjusted Human Development index
Gender Inequality Index
Life Expectancy at Birth (years)
Gross National Income Per Capita (2017 PPP$)
Create data source and commit table
----- Solution below ----
import os
import datatoolbox as dt
import pandas as pd
# config
year = 2024
source = f'HDI_{year}'
raw_folder = "/Users/andreasgeiges/Downloads"
# reading raw data
data_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_complete_time_series.csv')
meta_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_metadata.xlsx')
# mapping of entity naming
entity_mapping = {'hdi' : 'Human_development_index',
'gii' : 'Gender_inequality_index',
'le' : 'Life_expectandy_birth', #Life Expectancy at Birth (years)
'gnipc' : 'Gross_national_income_per_capity', #Gross National Income Per Capita
}
# mapping of units
unit_mapping = {'hdi' : '',
'gii' : '',
'le' : 'years',
'gnipc' : 'USD_ppp_2017'
}
# convert index to regional ISO codes
data = pd.read_csv(data_file, index_col=list(range(4)),encoding = "ISO-8859-1")
index_to_read = data.index.unique('iso3').intersection(dt.mapping.getValidSpatialIDs())
# extract individual entities
tables_to_commit = list()
# loop over individual entitiess
for code, code_name in entity_mapping.items():
# select columns of entity
columns_to_read = [col for col in data.columns if col[:-5] == code]
# subselect data
df = data.loc[index_to_read,columns_to_read]
# prepare columns as integer yearss
df.columns = [int(x.split('_')[1]) for x in df.columns]
# prepare index
df = df.idx.project('iso3')
meta ={'entity' : code_name,
'scenario' : 'Historic',
'source' : source,
'unit' : unit_mapping[code]}
#create and add table
table = dt.Datatable(df, meta = meta)
tables_to_commit.append(table)
# create source meta
sourceMeta = {
'SOURCE_ID':source,
'collected_by': 'AG',
'date': dt.core.get_date_string(),
'source_url': 'http://hdr.undp.org/en/data',
'licence': 'open source',
}
# commit final tables
dt.commitTables(tables_to_commit, sourceMetaDict = sourceMeta, message='HDI_2024 data')
```