# 2) Tutorial - Advanced integration of a data set ## General workfow steps Convert data and manage quality - Convertion to numerical format - define units of entities - Recognizable identifiers for regions (ISO + predefined regions) Mapping of meta data - Mapping to required meta data: entity, category, scenario, model, source - Storing additional meta data as required - Tracking of who processed the data and when - Origin of data ( URL for download or website) Transparent documentation of read-in (ToDo) ## Data download Human development index from https://hdr.undp.org/data-center Old file name: "HDR21-22_Composite_indices_complete_time_series.csv" ## Understanding the data structure - What is the naming convention? - Is the data structured in columns or rows? - What is the bests approach to read the required data? - How to approach the conversion to the naming convention? ## Excercises: Data Read-in Steps: - Download meta data - Find mapping for column names to extract the indices for the individual datatables - Extract data for - Inequality-adjusted Human Development index - Gender Inequality Index - Life Expectancy at Birth (years) - Gross National Income Per Capita (2017 PPP$) - Create data source and commit table ```` ----- Solution below ---- import os import datatoolbox as dt import pandas as pd # config year = 2024 source = f'HDI_{year}' raw_folder = "/Users/andreasgeiges/Downloads" # reading raw data data_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_complete_time_series.csv') meta_file = os.path.join(raw_folder, 'HDR23-24_Composite_indices_metadata.xlsx') # mapping of entity naming entity_mapping = {'hdi' : 'Human_development_index', 'gii' : 'Gender_inequality_index', 'le' : 'Life_expectandy_birth', #Life Expectancy at Birth (years) 'gnipc' : 'Gross_national_income_per_capity', #Gross National Income Per Capita } # mapping of units unit_mapping = {'hdi' : '', 'gii' : '', 'le' : 'years', 'gnipc' : 'USD_ppp_2017' } # convert index to regional ISO codes data = pd.read_csv(data_file, index_col=list(range(4)),encoding = "ISO-8859-1") index_to_read = data.index.unique('iso3').intersection(dt.mapping.getValidSpatialIDs()) # extract individual entities tables_to_commit = list() # loop over individual entitiess for code, code_name in entity_mapping.items(): # select columns of entity columns_to_read = [col for col in data.columns if col[:-5] == code] # subselect data df = data.loc[index_to_read,columns_to_read] # prepare columns as integer yearss df.columns = [int(x.split('_')[1]) for x in df.columns] # prepare index df = df.idx.project('iso3') meta ={'entity' : code_name, 'scenario' : 'Historic', 'source' : source, 'unit' : unit_mapping[code]} #create and add table table = dt.Datatable(df, meta = meta) tables_to_commit.append(table) # create source meta sourceMeta = { 'SOURCE_ID':source, 'collected_by': 'AG', 'date': dt.core.get_date_string(), 'source_url': 'http://hdr.undp.org/en/data', 'licence': 'open source', } # commit final tables dt.commitTables(tables_to_commit, sourceMetaDict = sourceMeta, message='HDI_2024 data') ```