1) Tutorial - Basic integration of raw data¶

Install additional packages¶

pip install panda_datareader

Downloading raw data using data reader¶

As first step we download an example dataset from the World Bank.

from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')

data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)


matches2 = wb.search('emission')
emission_data = wb.download(indicator='CC.CO2.EMSE.EN', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
population_data = wb.download(indicator='SP.POP.TOTL', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)

population_data = wb.download(indicator=['SP.POP.TOTL','CC.CO2.EMSE.EN'], country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)

Data transformation¶

In order to integrate the raw data into datatoolbox, the raw structure must be adapted to the following requirements:

ISO3 code for country as index
years or timestampe as columns
additional meta data describing the variable, scenario and source of the data

dt.Datatable takes as arguments a pandas dataframe and meta the meta data

dt.Datatable(data=pandas_df, meta  = meta_dict)

Exercise: convert the data to required format

--- scroll below for solution  --- 

Solution¶

import datatoolbox as dt
from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')

data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)



#assuming the correct meta data
meta = {'variable' : 'GDP_per_capita',
        'scenario' : 'Historic',
        'source' : 'WORLDBANK_2024',
        'unit' : 'USD/count'}
wide_df = data_df.reset_index().pivot(index='country', columns='year', values='NY.GDP.PCAP.KD')
  
        
        
gdp_table = dt.Datatable(data=wide_df,
                     meta=meta)
gdp_table.index = dt.idx.convert_idx_string_to_iso(gdp_table.index)

final_table = gdp_table                     

Commit data as datatoolbox source¶

Excercise - Create and commit data as new sources

create the meta data dictionary for the new source
use the function add_table or dt.add_tables to database
verfiy that table is in database

# switch to testing database
dt.admin.switch_database_to_testing(force_new_DB=True)

# meta data for new source
source_dict =  {
    'SOURCE_ID': '',
    'collected_by': '',
    'date': dt.core.get_date_string(),
    'source_url': '',
    'licence': '',
}

# add your code here

Solution¶

dt.admin.switch_database_to_testing(force_new_DB=True)

# meta data for new source
source_dict =  {
    'SOURCE_ID': 'WORLDBANK_2024',
    'collected_by': 'MM',
    'date': dt.core.get_date_string(),
    'source_url': 'https://datatopics.worldbank.org/world-development-indicators/',
    'licence': 'open_source',
}
dt.commitTable(final_table, message = 'adding first table', sourceMetaDict = source_dict)


#verify that table is in inventory
dt.findp(source='WORLDBANK*')

What happend in the back ground¶

From the meta data, datatoolbox create a table ID that conains the variable, scenario and source.
This ID should be unique within your database and is used to directly access the data, e.g. via dt.getTable

print(final_table.meta['ID'])

table_reloaded =  dt.getTable('GDP_per_capita__Historic__WORLDBANK_2024')

1) Tutorial - Basic integration of raw data¶

Install additional packages¶

Downloading raw data using data reader¶

Data transformation¶

Solution¶

Commit data as datatoolbox source¶

Solution¶

What happend in the back ground¶

datatoolbox

Navigation

Related Topics