is the process of collecting data from all required data sources. Data sources come in many shapes and sizes: from relational databases to file-sharing APIs, from public to private, and from paid to free.
Data sources can:
- contain personally identifiable information or intellectual property of the company
- be unorganised and unstructured, or structured and well described
- generate data at varying frequencies, or produce it continuously as data streams
- support pull or push mechanisms, operating synchronously or asynchronously
This means that the extraction part of the ETL tool must be extremely flexible, resilient and malleable to support the diversity of data sources and the variations in data extraction procedures and protocols.
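One way to picture this flexibility is a common interface that hides how each source delivers its data. The following is a minimal sketch, not a definitive design: the names `Extractor`, `CsvPullExtractor`, `PushBufferExtractor` and `run` are hypothetical, and real connectors would also need authentication, retries and schema handling.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List

Record = Dict[str, str]


class Extractor(ABC):
    """Hypothetical common interface: every source yields plain dict records."""

    @abstractmethod
    def extract(self) -> Iterator[Record]:
        ...


class CsvPullExtractor(Extractor):
    """Pull-style source: the pipeline decides when to read the data."""

    def __init__(self, text: str) -> None:
        self._text = text

    def extract(self) -> Iterator[Record]:
        # DictReader uses the first row as the field names.
        for row in csv.DictReader(io.StringIO(self._text)):
            yield dict(row)


class PushBufferExtractor(Extractor):
    """Push-style source: the source calls on_message(); records are
    buffered until the pipeline drains them with extract()."""

    def __init__(self) -> None:
        self._buffer: List[Record] = []

    def on_message(self, record: Record) -> None:
        self._buffer.append(record)

    def extract(self) -> Iterator[Record]:
        while self._buffer:
            yield self._buffer.pop(0)


def run(extractors: List[Extractor]) -> List[Record]:
    """Collect records from every extractor through the same interface."""
    return [record for ex in extractors for record in ex.extract()]
```

With this shape, supporting a new kind of source means adding one more `Extractor` subclass, while the downstream code that calls `run` stays unchanged.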
Data architectures must be able to connect to multiple data sources in parallel and extract data for further processing, without one extraction process affecting the ability of the others to retrieve their data.
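As an illustration of parallel extraction with isolated failures, the sketch below runs each source in its own worker thread and records errors per source instead of letting one failure abort the rest. The function `extract_all` and the three example sources are hypothetical stand-ins; a thread pool is only one possible choice, assumed here because extraction is usually I/O-bound.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple

Records = List[dict]


def extract_all(
    sources: Dict[str, Callable[[], Records]],
    max_workers: int = 4,
) -> Tuple[Dict[str, Records], Dict[str, Exception]]:
    """Run every source concurrently; return (results, errors) keyed by name."""
    results: Dict[str, Records] = {}
    errors: Dict[str, Exception] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the name of the source it belongs to.
        futures = {pool.submit(fn): name for name, fn in sources.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # A failing source is recorded, not propagated,
                # so the other extractions still complete.
                errors[name] = exc
    return results, errors


# Hypothetical sources standing in for real connectors.
def orders() -> Records:
    return [{"order_id": 1}]


def customers() -> Records:
    return [{"customer_id": 2}]


def broken() -> Records:
    raise ConnectionError("source unavailable")
```

Calling `extract_all({"orders": orders, "customers": customers, "broken": broken})` returns the records of the two healthy sources in the first dictionary and the `ConnectionError` for `broken` in the second, so the caller can decide whether to retry or skip the failed source.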