1) Tutorial - Basic integration of raw data¶
Install additional packages¶
pip install panda_datareader
Downloading raw data using data reader¶
As first step we download an example dataset from the World Bank.
from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')
data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
matches2 = wb.search('emission')
emission_data = wb.download(indicator='CC.CO2.EMSE.EN', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
population_data = wb.download(indicator='SP.POP.TOTL', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
population_data = wb.download(indicator=['SP.POP.TOTL','CC.CO2.EMSE.EN'], country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
Data transformation¶
In order to integrate the raw data into datatoolbox, the raw structure must be adapted to the following requirements:
ISO3 code for country as index
years or timestampe as columns
additional meta data describing the variable, scenario and source of the data
dt.Datatable takes as arguments a pandas dataframe and meta the meta data
dt.Datatable(data=pandas_df, meta = meta_dict)
Exercise: convert the data to required format
--- scroll below for solution ---
Solution¶
import datatoolbox as dt
from pandas_datareader import wb
matches = wb.search('gdp.*capita.*const')
data_df = wb.download(indicator='NY.GDP.PCAP.KD', country=['US', 'CA', 'MX', 'DE'], start=2005, end=2024)
#assuming the correct meta data
meta = {'variable' : 'GDP_per_capita',
'scenario' : 'Historic',
'source' : 'WORLDBANK_2024',
'unit' : 'USD/count'}
wide_df = data_df.reset_index().pivot(index='country', columns='year', values='NY.GDP.PCAP.KD')
gdp_table = dt.Datatable(data=wide_df,
meta=meta)
gdp_table.index = dt.idx.convert_idx_string_to_iso(gdp_table.index)
final_table = gdp_table
Commit data as datatoolbox source¶
Excercise - Create and commit data as new sources
create the meta data dictionary for the new source
use the function add_table or dt.add_tables to database
verfiy that table is in database
# switch to testing database
dt.admin.switch_database_to_testing(force_new_DB=True)
# meta data for new source
source_dict = {
'SOURCE_ID': '',
'collected_by': '',
'date': dt.core.get_date_string(),
'source_url': '',
'licence': '',
}
# add your code here
Solution¶
dt.admin.switch_database_to_testing(force_new_DB=True)
# meta data for new source
source_dict = {
'SOURCE_ID': 'WORLDBANK_2024',
'collected_by': 'MM',
'date': dt.core.get_date_string(),
'source_url': 'https://datatopics.worldbank.org/world-development-indicators/',
'licence': 'open_source',
}
dt.commitTable(final_table, message = 'adding first table', sourceMetaDict = source_dict)
#verify that table is in inventory
dt.findp(source='WORLDBANK*')
What happend in the back ground¶
From the meta data, datatoolbox create a table ID that conains the variable, scenario and source.
This ID should be unique within your database and is used to directly access the data, e.g. via
dt.getTable
print(final_table.meta['ID'])
table_reloaded = dt.getTable('GDP_per_capita__Historic__WORLDBANK_2024')