excel_source
Module containing ExcelSource class.
ExcelSource class handles loading of Excel data.
Classes
ExcelSource
class ExcelSource( path: Union[os.PathLike, pydantic.networks.AnyUrl, str], sheet_name: Union[str, Sequence[str], ForwardRef(None)] = None, column_names: Optional[List[str]] = None, dtype: Optional[Dict[str, Union[ForwardRef('ExtensionDtype'), str, numpy.dtype, Type[Union[str, complex, bool, object]]]]] = None, read_excel_kwargs: Optional[Dict[str, Any]] = None, data_splitter: Optional[DatasetSplitter] = None, seed: Optional[int] = None, modifiers: Optional[Dict[str, DataPathModifiers]] = None, ignore_cols: Optional[Union[str, Sequence[str]]] = None,):
Data source for loading Excel files.
To use this data source you must install a backend library that can read Excel files. Currently supported engines are “xlrd”, “openpyxl”, “odf” and “pyxlsb”.
By default, the first row is used as the column names unless column_names or the header keyword argument is provided.
Arguments
**read_excel_kwargs
: Additional arguments to be passed to pandas.read_excel.
column_names
: The names of the columns if not using the first row of the sheet. Can only be used for single-sheet Excel files.
data_splitter
: Approach used for splitting the data into training, test, validation. Defaults to None.
dtype
: The dtypes of the columns.
ignore_cols
: Column/list of columns to be ignored from the data. Defaults to None.
modifiers
: Dictionary used for modifying paths/extensions in the dataframe. Defaults to None.
path
: The path or URL to the Excel file.
seed
: Random number seed. Used for setting random seed for all libraries. Defaults to None.
sheet_name
: The name(s) of the sheet(s) to load. If not provided, all sheets will be loaded.
Attributes
data
: A Dataframe-type object which contains the data.
data_splitter
: Approach used for splitting the data into training, test, validation.
seed
: Random number seed. Used for setting random seed for all libraries.
Raises
TypeError
: If the path does not have the correct extension denoting an Excel file.
ValueError
: If multiple sheet names are provided and column names are also provided.
ValueError
: If sheets are referenced which do not exist in the Excel file.
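The three error conditions above can be illustrated with a small standalone sketch. This is not the library's implementation: the function name `check_excel_args`, the `available_sheets` parameter and the set of accepted extensions are all assumptions made for illustration, and the real checks may differ in order and detail.

```python
from pathlib import PurePath
from typing import List, Optional, Sequence, Union

# Assumed set of Excel-like extensions; the real source may accept a different set.
EXCEL_EXTENSIONS = {".xls", ".xlsx", ".xlsm", ".xlsb", ".ods"}


def check_excel_args(
    path: str,
    available_sheets: Sequence[str],
    sheet_name: Optional[Union[str, Sequence[str]]] = None,
    column_names: Optional[Sequence[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the sheets to load or raise the documented errors."""
    if PurePath(path).suffix.lower() not in EXCEL_EXTENSIONS:
        raise TypeError(f"{path!r} does not look like an Excel file.")

    # Normalise sheet_name: None means "all sheets", a str means one sheet.
    if sheet_name is None:
        requested = list(available_sheets)
    elif isinstance(sheet_name, str):
        requested = [sheet_name]
    else:
        requested = list(sheet_name)

    # column_names only makes sense when exactly one sheet is loaded.
    if column_names is not None and len(requested) > 1:
        raise ValueError("column_names can only be used with a single sheet.")

    missing = [name for name in requested if name not in available_sheets]
    if missing:
        raise ValueError(f"Sheets not found in the file: {missing}")
    return requested
```

In this sketch, leaving sheet_name as None on a multi-sheet file while passing column_names also raises ValueError, which is one reading of "can only be used for single-sheet Excel files".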
Ancestors
Variables
data : pandas.core.frame.DataFrame
- A property containing the underlying dataframe if the data has been loaded.
Raises: DataNotLoadedError: If the data has not been loaded yet.
hash : str
- The hash associated with this BaseSource. This is the hash of the static information regarding the underlying DataFrame, primarily column names and content types but NOT anything content-related itself. It should be consistent across invocations, even if additional data is added, as long as the DataFrame is still compatible in its format.
Returns: The hexdigest of the DataFrame hash.
is_initialised : bool
- Checks if BaseSource was initialised.
is_task_running : bool
- Returns True if a task is running.
iterable : bool
- This returns False if the DataSource does not subclass IterableSource. However, this property must be re-implemented in IterableSource, therefore it is not necessarily True if the DataSource inherits from IterableSource.
multi_table : bool
- Attribute to specify whether the datasource is multi table.
table_names : List[str]
- Excel sheet names in datasource.
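The hash property described above depends on the format of the data (column names and content types) and not on the cell values. A minimal sketch of that idea follows. The function `schema_hash`, the choice of SHA-256 and the sorting of columns are assumptions for illustration, not the library's actual algorithm.

```python
import hashlib
import json
from typing import Mapping


def schema_hash(dtypes: Mapping[str, str]) -> str:
    """Hypothetical helper: hash column names and dtype names, never cell values."""
    # Sort the (column, dtype) pairs so column order does not change the digest.
    # This ordering rule is an assumption of the sketch.
    canonical = json.dumps(sorted(dtypes.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because only the schema is hashed, two tables with the same columns and dtypes but different rows produce the same digest, while changing a dtype changes it. With a pandas DataFrame `df`, the input mapping could be built as `{col: str(dtype) for col, dtype in df.dtypes.items()}`.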
Methods
get_column
def get_column( self: BaseSource, col_name: str, *args: Any, **kwargs: Any,) ‑> Union[numpy.ndarray, pandas.core.series.Series]:
Inherited from:
Implement this method to get a single column from the dataset.
get_column_names
def get_column_names( self, table_name: Optional[str] = None, **kwargs: Any,) ‑> Iterable[str]:
Get column names in the Excel dataset.
Arguments
table_name
: The name of the table from which the column names should be loaded. Defaults to None.
Returns
The list of column names from the requested table or the single table if not a multi-table instance.
Raises
ValueError
: If the table name provided does not exist.
ValueError
: If the data is multi-table but no table name provided.
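The table-resolution rules above can be sketched without any Excel dependency. Here a workbook is modelled as a plain dictionary from sheet name to columns. The `Tables` alias and the function `column_names_for` are hypothetical, and ignoring table_name for single-sheet data is an assumption of this sketch.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in: sheet name -> {column name -> list of cell values}.
Tables = Dict[str, Dict[str, list]]


def column_names_for(tables: Tables, table_name: Optional[str] = None) -> List[str]:
    """Resolve which table to read and return its column names."""
    multi_table = len(tables) > 1
    if not multi_table:
        # Assumption: with a single sheet, table_name is not needed and is ignored.
        (only_table,) = tables.values()
        return list(only_table)
    if table_name is None:
        raise ValueError("table_name is required for multi-table data.")
    if table_name not in tables:
        raise ValueError(f"Unknown table: {table_name!r}")
    return list(tables[table_name])
```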
get_data
def get_data( self, table_name: Optional[str] = None, **kwargs: Any,) ‑> Optional[pandas.core.frame.DataFrame]:
Loads and returns data from Excel dataset.
Arguments
table_name
: Table name for multi table data sources. This comes from the DataStructure and is ignored if sql_query has been provided.
Returns
A DataFrame-type object which contains the data.
Raises
ValueError
: If the table name provided does not exist.
get_dtypes
def get_dtypes(self: BaseSource, *args: Any, **kwargs: Any) ‑> _Dtypes:
Inherited from:
Implement this method to get the columns and column types from the dataset.
get_values
def get_values( self, col_names: List[str], table_name: Optional[str] = None, **kwargs: Any,) ‑> Dict[str, Iterable[Any]]:
Get distinct values from columns in Excel dataset.
Arguments
col_names
: The list of the columns whose distinct values should be returned.
table_name
: The name of the table from which the columns should be loaded. Defaults to None.
Returns
The distinct values of the requested columns as a mapping from column name to a series of distinct values.
Raises
ValueError
: If the table name provided does not exist.
ValueError
: If the data is multi-table but no table name provided.
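The shape of the return value, a mapping from column name to that column's distinct values, can be sketched on a single in-memory table. The function `distinct_values` and the dict-of-lists table are hypothetical stand-ins. With pandas, `Series.unique()` would play a similar role, though this page does not say how the library computes the values.

```python
from typing import Any, Dict, List


def distinct_values(table: Dict[str, List[Any]], col_names: List[str]) -> Dict[str, List[Any]]:
    """Hypothetical helper: map each requested column to its distinct values."""
    # dict.fromkeys keeps the first occurrence of each value, in original order.
    # Values must be hashable for this to work.
    return {col: list(dict.fromkeys(table[col])) for col in col_names}
```

Only the requested columns appear in the result, so calling it with `["colour"]` on a table that also has a `size` column returns a one-key mapping.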
load_data
def load_data(self, **kwargs: Any) ‑> None:
Inherited from:
Load the data for the datasource.
Raises
TypeError
: If data format is not supported.