2) Tutorial - Advanced integration of a data set¶

General workfow steps¶

Convert data and manage quality

Convertion to numerical format
define units of entities
Recognizable identifiers for regions (ISO + predefined regions)

Mapping of meta data

Mapping to required meta data: entity, category, scenario, model, source
Storing additional meta data as required
Tracking of who processed the data and when
Origin of data ( URL for download or website)

Transparent documentation of read-in (ToDo)

Data download¶

Human development index from https://hdr.undp.org/data-center

Old file name: “HDR21-22_Composite_indices_complete_time_series.csv”

Understanding the data structure¶

What is the naming convention?
Is the data structured in columns or rows?
What is the bests approach to read the required data?
How to approach the conversion to the naming convention?

Excercises: Data Read-in¶

Steps:

Download meta data
Find mapping for column names to extract the indices for the individual datatables
Extract data for
- Inequality-adjusted Human Development index
- Gender Inequality Index
- Life Expectancy at Birth (years)
- Gross National Income Per Capita (2017 PPP$)
Create data source and commit table

----- Solution below ----

import os
import datatoolbox as dt
import pandas as pd

# config
year = 2024
source = f'HDI_{year}'
raw_folder = "/Users/andreasgeiges/Downloads"

# reading raw data
data_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_complete_time_series.csv')
meta_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_metadata.xlsx')

# mapping of entity naming
entity_mapping  = {'hdi' : 'Human_development_index',
                   'gii' : 'Gender_inequality_index',
                  'le' : 'Life_expectandy_birth', #Life Expectancy at Birth (years)
                  'gnipc' : 'Gross_national_income_per_capity', #Gross National Income Per Capita 
                }
                
# mapping of units
unit_mapping  = {'hdi' : '',
                   'gii' : '',
                  'le' : 'years',
                  'gnipc' : 'USD_ppp_2017'
                }
              
# convert  index to regional ISO codes

data = pd.read_csv(data_file, index_col=list(range(4)),encoding = "ISO-8859-1")
index_to_read = data.index.unique('iso3').intersection(dt.mapping.getValidSpatialIDs())

# extract individual entities

tables_to_commit = list()

# loop over individual entitiess
for code, code_name in entity_mapping.items():

    # select columns of entity
    columns_to_read = [col for col in data.columns if col[:-5] == code]
    
    # subselect data
    df = data.loc[index_to_read,columns_to_read]
    
    # prepare columns as integer yearss
    df.columns = [int(x.split('_')[1]) for x in df.columns]
    
    # prepare index

    df = df.idx.project('iso3')
    
    meta ={'entity' : code_name,
           'scenario' : 'Historic',
           'source' : source,
           'unit' : unit_mapping[code]} 
    
    #create and add table

    table = dt.Datatable(df, meta = meta)
    
    tables_to_commit.append(table)       
    
# create source meta

sourceMeta = {
    'SOURCE_ID':source,
    'collected_by': 'AG',
    'date': dt.core.get_date_string(),
    'source_url': 'http://hdr.undp.org/en/data',
    'licence': 'open source',
}

# commit final tables
dt.commitTables(tables_to_commit, sourceMetaDict = sourceMeta, message='HDI_2024 data')
    
```

2) Tutorial - Advanced integration of a data set¶

General workfow steps¶

Data download¶

Understanding the data structure¶

Excercises: Data Read-in¶

datatoolbox

Navigation

Related Topics