Source files contain the data extracted from the source system before it is transformed to the Common Data Format. They typically contain the data in its raw form. The data can be divided into any number and type of files, reflecting the way it is stored in the source system. For instance, source files can be organized in a star schema, with some files containing business facts and others containing business dimensions.
Types of Source Files
Source files can have different types and formats, such as:
- CSV / TSV
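As an illustration, delimited text such as CSV or TSV can be read with Python's standard `csv` module. This is a minimal sketch, not a prescribed loader: the `DELIMITERS` mapping and the `read_delimited` helper are hypothetical names, and the sample contents are made up for the example.

```python
import csv
import io

# Hypothetical mapping from file extension to field delimiter (an assumption
# for this sketch; real source files may need other delimiters or dialects).
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def read_delimited(text, extension):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITERS[extension])
    return list(reader)


# Made-up sample contents; every value is read as a string.
csv_rows = read_delimited("id,name\n1,Ada\n2,Grace\n", ".csv")
tsv_rows = read_delimited("id\tname\n1,5\tAda\n", ".tsv")
```

Note that the comma inside the TSV value `1,5` is kept as data, because only the tab acts as a separator for that file type.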
Processing Source Files
In a typical Data Standardization project, source files are identified and picked from one or more source systems. This can be a manual or automated process that repeats at different intervals, depending on how often new data is created. Before any transformation can take place, the files must be read and their overall data organization method must be understood.
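One simple way to start understanding how a file is organized is to profile a sample of its records, listing which keys appear and which value types they carry. The sketch below assumes the records have already been read into Python dictionaries; `profile_records` and the sample data are hypothetical.

```python
from collections import defaultdict


def profile_records(records):
    """Map each key to the set of Python type names observed for its values."""
    profile = defaultdict(set)
    for record in records:
        for key, value in record.items():
            profile[key].add(type(value).__name__)
    return dict(profile)


# Made-up sample: the second record has a missing amount and an extra key.
sample = [
    {"order_id": 1, "amount": 9.5},
    {"order_id": 2, "amount": None, "coupon": "SPRING"},
]
observed = profile_records(sample)
```

A key that maps to more than one type name, or that appears in only some records, is a hint that the file's organization needs a closer look before transformation.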
Several problems can arise with data organization in source files:
- Files may lack an easily understood organization method, i.e., a clear data schema.
- The organization method in a given file may change over time: new keys or columns may be introduced, removed, or renamed.
- Content in source files may change over time: a set of expected data values can increase or decrease without notice.
- The organization of data can be very simple, such as a relational or tabular view, or very complex, including several levels of data nesting, data arrays, and complex data objects.
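Two of the problems above can be sketched in code: detecting when columns or values drift from what is expected, and flattening nested data into a tabular shape. The helpers `set_drift` and `flatten`, the dotted-key naming scheme, and the sample data are all assumptions made for this illustration, not part of any defined format.

```python
def set_drift(expected, actual):
    """Report items that appeared or disappeared relative to the expected set."""
    expected, actual = set(expected), set(actual)
    return {
        "added": sorted(actual - expected),
        "removed": sorted(expected - actual),
    }


def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict with dotted keys.

    Empty dicts and lists produce no keys in this sketch.
    """
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            items.update(flatten(child, f"{prefix}{key}."))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}{index}."))
    else:
        # Drop the single trailing dot added by the caller.
        items[prefix[:-1]] = value
    return items


# A renamed column shows up as one removal plus one addition.
column_drift = set_drift(["id", "name", "email"], ["id", "full_name", "email", "phone"])

# The same comparison works for a set of expected data values.
value_drift = set_drift({"open", "closed"}, {"open", "closed", "on_hold"})

flat = flatten({"id": 7, "customer": {"name": "Ada", "tags": ["a", "b"]}})
```

Because a rename is reported as an unrelated addition and removal, deciding whether `name` became `full_name` still requires a human or an additional matching rule.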