# 2) Tutorial -  Advanced integration of a data set

## General workfow steps

Convert data and manage quality

- Convertion to numerical format
- define units of entities
- Recognizable identifiers for regions (ISO + predefined regions)

Mapping of meta data

- Mapping to required meta data: entity, category, scenario, model, source
- Storing additional meta data as required
- Tracking of who processed the data and when
- Origin of data ( URL for download or website)

Transparent documentation of read-in (ToDo)

## Data download
Human development index from https://hdr.undp.org/data-center

Old file name: "HDR21-22_Composite_indices_complete_time_series.csv"


## Understanding the data structure
- What is the naming convention?
- Is the data structured in columns or rows?
- What is the bests approach to read the required data?
- How to approach the conversion to the naming convention?


## Excercises: Data Read-in

Steps:

- Download meta data
- Find mapping for column names to extract the indices for the individual 
datatables
- Extract data for
   - Inequality-adjusted Human Development index
   - Gender Inequality Index
   - Life Expectancy at Birth (years)
   - Gross National Income Per Capita (2017 PPP$)
- Create data source and commit table   

````


----- Solution below ----


import os
import datatoolbox as dt
import pandas as pd


# config
year = 2024
source = f'HDI_{year}'
raw_folder = "/Users/andreasgeiges/Downloads"


# reading raw data
data_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_complete_time_series.csv')
meta_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_metadata.xlsx')


# mapping of entity naming
entity_mapping  = {'hdi' : 'Human_development_index',
                   'gii' : 'Gender_inequality_index',
                  'le' : 'Life_expectandy_birth', #Life Expectancy at Birth (years)
                  'gnipc' : 'Gross_national_income_per_capity', #Gross National Income Per Capita 
                }
                
# mapping of units
unit_mapping  = {'hdi' : '',
                   'gii' : '',
                  'le' : 'years',
                  'gnipc' : 'USD_ppp_2017'
                }
              
              
# convert  index to regional ISO codes

data = pd.read_csv(data_file, index_col=list(range(4)),encoding = "ISO-8859-1")
index_to_read = data.index.unique('iso3').intersection(dt.mapping.getValidSpatialIDs())

# extract individual entities

tables_to_commit = list()

# loop over individual entitiess
for code, code_name in entity_mapping.items():

    # select columns of entity
    columns_to_read = [col for col in data.columns if col[:-5] == code]
    
    # subselect data
    df = data.loc[index_to_read,columns_to_read]
    
    # prepare columns as integer yearss
    df.columns = [int(x.split('_')[1]) for x in df.columns]
    
    # prepare index

    df = df.idx.project('iso3')
    
    meta ={'entity' : code_name,
           'scenario' : 'Historic',
           'source' : source,
           'unit' : unit_mapping[code]} 
    

    #create and add table

    table = dt.Datatable(df, meta = meta)
    
    tables_to_commit.append(table)       
    
    
# create source meta


sourceMeta = {
    'SOURCE_ID':source,
    'collected_by': 'AG',
    'date': dt.core.get_date_string(),
    'source_url': 'http://hdr.undp.org/en/data',
    'licence': 'open source',
}

# commit final tables
dt.commitTables(tables_to_commit, sourceMetaDict = sourceMeta, message='HDI_2024 data')
    
```