1) Tutorial - Basic integration of raw data

Install additional packages

pip install panda_datareader

Downloading raw data using data reader

As first step we download an example dataset from the World Bank.

from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')

data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)


matches2 = wb.search('emission')
emission_data = wb.download(indicator='CC.CO2.EMSE.EN', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
population_data = wb.download(indicator='SP.POP.TOTL', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)

population_data = wb.download(indicator=['SP.POP.TOTL','CC.CO2.EMSE.EN'], country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)

Data transformation

In order to integrate the raw data into datatoolbox, the raw structure must be adapted to the following requirements:

  • ISO3 code for country as index

  • years or timestampe as columns

  • additional meta data describing the variable, scenario and source of the data

dt.Datatable takes as arguments a pandas dataframe and meta the meta data

dt.Datatable(data=pandas_df, meta  = meta_dict)

Exercise: convert the data to required format


















--- scroll below for solution  --- 

Solution

import datatoolbox as dt
from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')

data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)



#assuming the correct meta data
meta = {'variable' : 'GDP_per_capita',
        'scenario' : 'Historic',
        'source' : 'WORLDBANK_2024',
        'unit' : 'USD/count'}
wide_df = data_df.reset_index().pivot(index='country', columns='year', values='NY.GDP.PCAP.KD')
  
        
        
gdp_table = dt.Datatable(data=wide_df,
                     meta=meta)
gdp_table.index = dt.idx.convert_idx_string_to_iso(gdp_table.index)

final_table = gdp_table                     

Commit data as datatoolbox source

Excercise - Create and commit data as new sources

  1. create the meta data dictionary for the new source

  2. use the function add_table or dt.add_tables to database

  3. verfiy that table is in database

# switch to testing database
dt.admin.switch_database_to_testing(force_new_DB=True)

# meta data for new source
source_dict =  {
    'SOURCE_ID': '',
    'collected_by': '',
    'date': dt.core.get_date_string(),
    'source_url': '',
    'licence': '',
}

# add your code here





Solution

dt.admin.switch_database_to_testing(force_new_DB=True)

# meta data for new source
source_dict =  {
    'SOURCE_ID': 'WORLDBANK_2024',
    'collected_by': 'MM',
    'date': dt.core.get_date_string(),
    'source_url': 'https://datatopics.worldbank.org/world-development-indicators/',
    'licence': 'open_source',
}
dt.commitTable(final_table, message = 'adding first table', sourceMetaDict = source_dict)


#verify that table is in inventory
dt.findp(source='WORLDBANK*')

What happend in the back ground

From the meta data, datatoolbox create a table ID that conains the variable, scenario and source.
This ID should be unique within your database and is used to directly access the data, e.g. via dt.getTable

print(final_table.meta['ID'])

table_reloaded =  dt.getTable('GDP_per_capita__Historic__WORLDBANK_2024')