CSV-based Database

class datatoolbox.database.Database[source]

CSV-based database that uses git as a distributed version control system. Each table is saved locally as a csv file and identified by a unique ID. The csv files are organized by source in individual folders. Each source comes with its own git repository and can be shared with others.

add_to_inventory(datatable)[source]

Method to add a table to the global inventory file. Input: datatable

checkout_source_version(sourceID: str, tag: str = 'latest')[source]

Method to integrate a specific version of a source. Possible tags are v1.0 and higher or “latest” for the most recent version.

Parameters

sourceID : str

ID of the source whose version is checked out.

tag : str, optional

Version tag to check out. The default is “latest”.

Returns

None.
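To illustrate how version tags such as v1.0 and “latest” relate to each other, the following minimal sketch resolves a requested tag against a list of available tags. It is not the library's implementation: the helper `resolve_tag` and the example tag list are hypothetical, and the `v<major>.<minor>` tag format is an assumption based on the description above.

```python
import re

# Assumed tag format "v<major>.<minor>", e.g. "v1.0".
# Hypothetical helper for illustration, not part of datatoolbox.
_TAG_PATTERN = re.compile(r"^v(\d+)\.(\d+)$")


def resolve_tag(available_tags, tag="latest"):
    """Return the tag to check out, mapping 'latest' to the highest version."""
    versions = {}
    for candidate in available_tags:
        match = _TAG_PATTERN.match(candidate)
        if match:
            versions[candidate] = (int(match.group(1)), int(match.group(2)))
    if not versions:
        raise ValueError("no version tags available")
    if tag == "latest":
        # Compare numerically so that v1.10 sorts after v1.2.
        return max(versions, key=versions.get)
    if tag not in versions:
        raise ValueError(f"unknown version tag: {tag!r}")
    return tag


tags = ["v1.0", "v1.2", "v1.10", "v2.0"]
print(resolve_tag(tags))          # highest version in the list
print(resolve_tag(tags, "v1.2"))  # an explicit tag is returned unchanged
```

Comparing the parsed numbers rather than the raw strings avoids ordering v1.10 before v1.2.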

clearLogTables()[source]

Clears the list of logged tables. This also happens automatically whenever the package is freshly loaded.

commitTable(dataTable, message, sourceMetaDict=None)[source]

Adds a table permanently to the underlying database. For the first table of a new source, the meta data for the source needs to be provided as well.

Input

dataTable : Datatable

message : str

sourceMetaDict : dict, optional

commitTables(dataTables, message, sourceMetaDict=None, append_data=False, update=False, overwrite=False, cleanTables=True)[source]

Adds multiple tables permanently to the underlying database. For the first table of a new source, the meta data for the source needs to be provided as well.

Input

dataTables : list of Datatable

message : str

sourceMetaDict : dict, optional

append_data : bool, optional

Choose whether new data is added to the existing table (new data does not overwrite old data).

update : bool, optional

Choose whether the existing data is updated.

overwrite : bool, optional

Choose whether data is overwritten (new data overwrites old data).

cleanTables : bool, optional

Choose whether tables are cleaned before the commit. The default is True.

TODO: Check flags
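The difference between the documented append_data and overwrite behaviour can be pictured with plain dictionaries. The sketch below is only an illustration of the two merge directions described above, not the code used by commitTables; the helper `merge_values` and the `{year: value}` layout are hypothetical, and the update flag is not covered because its semantics are not spelled out here.

```python
def merge_values(existing, new, overwrite=False):
    """Combine two {year: value} mappings.

    overwrite=False mimics the documented append_data behaviour (old values win),
    overwrite=True mimics the documented overwrite behaviour (new values win).
    """
    if overwrite:
        merged = {**existing, **new}
    else:
        merged = {**new, **existing}
    # Sort by year so the result is easy to read.
    return dict(sorted(merged.items()))


existing = {2000: 1.0, 2001: 2.0}
new = {2001: 20.0, 2002: 30.0}

appended = merge_values(existing, new)
overwritten = merge_values(existing, new, overwrite=True)
print(appended)     # value for 2001 is kept from the existing data
print(overwritten)  # value for 2001 is taken from the new data
```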

create_empty_datashelf(pathToDataself)[source]

Method to create the required files for an empty csv-based database. (Equivalent to the functions in admin.py)

export_new_source_to_remote(sourceID)[source]

This function exports a new local dataset to the defined remote database.

Input is a local sourceID as a str.

findc(**kwargs)[source]

Method to search through the inventory. kwargs can be all inventory entries (see config.INVENTORY_FIELDS).

finde(**kwargs)[source]

Finds an exact match for the given filter criteria.

findp(level=None, regex=False, **filters)[source]

Future default find method that allows for a more sophisticated filter syntax.

Usage:

filters : Union[str, Iterable[str]]

One or multiple patterns, which are OR’d together.

regex : bool, optional

Accept plain regex syntax instead of shell-style. The default is False.

Returns

matches : pd.Series

Mask for selecting matched rows.
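The combination of shell-style patterns, the regex switch and OR'd filters can be sketched with the standard library alone. This is a simplified illustration, not the implementation of findp: the helper `match_mask` is hypothetical, it returns a plain list of booleans instead of a pd.Series, and requiring a match of the full string is an assumption.

```python
import fnmatch
import re


def match_mask(values, patterns, regex=False):
    """Return a list of booleans marking the values that match any pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    if not regex:
        # Translate shell-style wildcards (*, ?) into regular expressions.
        patterns = [fnmatch.translate(p) for p in patterns]
    compiled = [re.compile(p) for p in patterns]
    # Patterns are OR'd together: a single matching pattern is enough.
    return [any(c.fullmatch(v) for c in compiled) for v in values]


values = ["Emissions|CO2", "Emissions|CH4", "Population"]

print(match_mask(values, "Emissions|*"))                    # shell-style wildcard
print(match_mask(values, ["*CO2", "Pop*"]))                 # two patterns, OR'd
print(match_mask(values, r"Emissions\|C.*", regex=True))    # plain regex syntax
```

In shell-style mode the `|` character is matched literally, whereas in regex mode it has to be escaped because it otherwise means alternation.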

getTable(ID, native_regions=False)[source]

Method to load the datatable for the given tableID.

Input

ID : str

ID of the table to load.

native_regions : bool, optional

Load native region definitions if available. The default is False.

Returns

table : Datatable

getTables(iterIDs, native_regions=False, disable_progress=None)[source]

Method to return multiple datatables at once as a dictionary-like set of tables.

Input

iterIDs : list of str

List of IDs to load.

native_regions : bool, optional

Load native region definitions if available. The default is False.

disable_progress : bool, optional

Disable displaying of the progress bar. The default None hides the progress bar on non-tty outputs.

Returns

tables : TableSet

getTablesAvailable()[source]

Returns a locally stored pandas DataFrame of the tables available on the datashelf.

get_inventory()[source]

Returns a copy of the database inventory.

get_path_of_source(sourceID)[source]

Returns the hard disk path of a given source.

import_new_source_from_remote(remoteName)[source]

This function imports (via git clone) a remote dataset and creates a local copy of it.

Input is an existing sourceID.

info()[source]

Shows the most important information about the status of the database.

isConsistentTable(datatable)[source]

Checks whether the table meets the following requirements:

- data is numeric
- spatial identifiers are known to the database
- columns are proper years
- index is not duplicated
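The four requirements can be expressed as simple checks on a table's index, columns and values. The sketch below is a standalone illustration under stated assumptions, not the code behind isConsistentTable: the helper `find_inconsistencies`, the row-list table layout, the set of known regions and the accepted year range are all hypothetical.

```python
from numbers import Real


def find_inconsistencies(index, columns, rows, known_regions):
    """Return a list of problems; the list is empty if the table is consistent."""
    problems = []
    # 1. All values must be numeric (booleans are excluded on purpose).
    values = [v for row in rows for v in row]
    if not all(isinstance(v, Real) and not isinstance(v, bool) for v in values):
        problems.append("non-numeric data")
    # 2. Spatial identifiers must be known.
    unknown = [region for region in index if region not in known_regions]
    if unknown:
        problems.append(f"unknown spatial identifiers: {unknown}")
    # 3. Columns must be proper years (assumed: integers in a plausible range).
    if not all(isinstance(c, int) and 1000 <= c <= 3000 for c in columns):
        problems.append("columns are not proper years")
    # 4. Index entries must be unique.
    if len(set(index)) != len(index):
        problems.append("duplicated index")
    return problems


known = {"DEU", "FRA", "World"}
good = find_inconsistencies(["DEU", "FRA"], [2000, 2010], [[1.0, 2.0], [3.0, 4.0]], known)
bad = find_inconsistencies(["DEU", "DEU"], [2000, "20x0"], [[1.0, "n/a"], [3.0, 4.0]], known)
print(good)  # no problems found
print(bad)   # one entry per violated requirement
```

Returning a list of problems instead of a single boolean makes it easier to see which requirement failed; a boolean result is then simply `not problems`.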

isSource(sourceID)[source]

Checks whether the source is in the database.

Input

sourceID : str

list_source_versions(source_ID: str)[source]

Returns all available versions of a given source.

Parameters

source_ID : str

ID of the source whose versions are listed.

Returns

The available versions of the given source.

pull_source_from_remote(repoName)[source]

Updates the local data repository to the newest version on the remote repository.

Parameters

repoName : str

Source ID string to identify which source repository should be updated.

Returns

None.

remote_sourceInfo()[source]

Returns a list of available sources and meta data

removeTable(tableID)[source]

Method to remove a table from the database.

Input

tableID : str

removeTables(IDList)[source]

Method to remove multiple tables from the database.

Input

IDList : list of str

remove_from_inventory(tableID)[source]

Method to remove a table from the global inventory. Input: tableID

remove_source(sourceID)[source]

Function to remove an entire source from the database.

saveTablesToDisk(folder, IDList)[source]

Function to save a list of tables to disk as csv files.

save_logged_tables(folder='data')[source]

Creates a local data directory that can be used to run the logged analysis independently.

Parameters

folder : str, optional

Directory in which the logged tables are saved. The default is ‘data’.

Returns

None.

sourceExists(source)[source]

Function to check if a source is properly registered in the database.

Input: SourceID

sourceInfo(source_ID=None, show_number_of_table=False)[source]

Returns a list of available sources and meta data

startLogTables()[source]

Starts the logging of loaded datatables. This is useful to collect all tables required for a given analysis and to create a data package for offline use.

stopLogTables()[source]

Stops the logging of datatables and returns the list of loaded table IDs for further processing.
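The interplay of starting, stopping and clearing the table log can be pictured with a small stand-in class. The `TableLog` class below is hypothetical and only mirrors the workflow described for startLogTables, stopLogTables and clearLogTables; the table IDs are made up, and details such as ignoring repeated IDs or keeping the list after stopping are assumptions.

```python
class TableLog:
    """Hypothetical stand-in that mirrors the start/stop/clear logging workflow."""

    def __init__(self):
        self._active = False
        self._table_ids = []

    def start(self):
        # From now on, every loaded table ID is recorded.
        self._active = True

    def record(self, table_id):
        # Called whenever a table is loaded; ignored while logging is inactive.
        if self._active and table_id not in self._table_ids:
            self._table_ids.append(table_id)

    def stop(self):
        # Stop recording and hand back the collected IDs.
        self._active = False
        return list(self._table_ids)

    def clear(self):
        self._table_ids = []


log = TableLog()
log.record("table_a")   # not recorded, logging has not started yet
log.start()
log.record("table_b")
log.record("table_c")
log.record("table_b")   # repeated load, recorded only once
logged_ids = log.stop()
print(logged_ids)
```

The returned list of IDs is what a later export step, such as saving the logged tables to a folder, would operate on.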

updateTable(oldTableID, newDataTable, message)[source]

Specific method to update the data of an existing table

Input

oldTableID : str

newDataTable : Datatable

message : str

Commit message to describe the added data.

updateTables(oldTableIDs, newDataTables, message)[source]

Equivalent method to updateTable, but for multiple tables at once

Input

oldTableIDs : list of str

newDataTables : list of Datatable

message : str

Commit message to describe the added data.

update_mapping_file(sourceID, mapping_file_path, sourceMetaDict=None)[source]

Adds a mapping file to the database.

Parameters

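sourceID : str

ID of the source the mapping file belongs to.

mapping_file_path

Path to the mapping file.

sourceMetaDict : dict, optional

Meta data of the source. The default is None.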

validate_ID(ID, print_statement=True)[source]

Method to check the validity of a table ID and the state of the data.