Data cleaning is a crucial process in many fields to ensure accurate and reliable analysis results. Python offers a
powerful toolset for this purpose, providing scalability, automation, and adaptability. With libraries such as Pandas,
NumPy, and Matplotlib, professionals across various disciplines can efficiently clean, organize, and analyze their data.
To effectively use Python for data cleaning, it is essential to grasp basic programming concepts and become familiar with
these relevant libraries, while adhering to best practices. By utilizing Python, researchers and analysts can enhance the
effectiveness and efficiency of their work, leading to meaningful and insightful outcomes.
This project is only intended to provide an overview of common techniques and workflows. It is important to note that
each data set is a new challenge, and one should not be tempted to simply run a quickly assembled script over it.
Especially in a scientific context, a great deal of analysis and planning is required; writing code, running it, and
having a dataset that happens to fit are only a small part of the work.
Python offers advantages over Excel and GUI-based tools in the data cleaning space, including automation, scalability,
flexibility, integration, repeatability, traceability, version control, and improved error handling. This makes Python
particularly suitable for large, complex data sets and demanding tasks.
Some disadvantages of Python compared to GUI-based tools include a steeper learning curve, greater time investment,
lack of visual tools, lack of immediate feedback, more complicated installation and configuration, and less suitability
for small or simple data sets. These drawbacks stem mainly from the programming expertise required and the additional
effort required for scripting and execution. However, Python remains a powerful and flexible option for data cleaning
for larger, more complex data sets and advanced requirements.
When analyzing data, five data quality criteria can be applied to quantify the data’s quality. These criteria include
completeness, uniqueness, accuracy and correctness, consistency and uniformity, and freedom from redundancy. Depending
on the specific research interest and applicability, a selection is made from these criteria, starting with the initial
definition of the most obvious criteria. Over time, additional criteria may be added in an iterative process. Therefore,
measuring and evaluating data quality and deriving targeted improvement measures requires a definition of appropriate
data quality criteria in advance.
Completeness: This involves identifying missing data and deciding how to handle it - for instance, through imputation, exclusion of incomplete data, or acceptance of the incompleteness.
Uniqueness: Duplicate entries should be identified and removed at this stage.
Accuracy and Correctness: Data cleaning may also involve correcting obvious errors, such as implausible values or obviously incorrectly assigned categories.
Consistency and Uniformity: It’s important to clean up inconsistencies and ensure the data is in a consistent format. This may include converting units of measurement or standardizing text entries.
Freedom from Redundancy: This step also addresses redundancy by removing duplicate or redundant data.
Data Integrity: Check for completeness, correctness, and formatting of data. This includes identifying missing or erroneous data and issues with data formatting.
Data Consistency and Duplicates: Ensure data across all columns and rows are consistent and free from duplicates.
Outlier and Missing Value Analysis: Assess whether outliers provide relevant information or indicate errors. Develop strategies for handling missing values.
Measurement and Analytical Accuracy: Review the precision and accuracy of the analytical methods used, including compliance with analytical detection limits.
Variability and Pattern Analysis: Investigate seasonal, periodic, or systematic variations, as well as patterns and trends in the data.
Correlation and Contextual Analysis: Examine correlations between variables and analyze data in the context of a specific area of study.
Readiness for Statistical Analyses: Confirm that the data are sufficient and appropriate for the intended statistical analyses, and outline necessary steps for data preparation.
import pandas as pd
# adjust display options to show all columns
pd.set_option("display.max_columns", None)
data = pd.read_csv("path/to/Data.csv")
print(data.head(10),"\n\n", data.dtypes)
If the dataset is not displayed correctly due to incorrect column names, the skiprows parameter should be used:
df = pd.read_csv('path/to/Data.csv', skiprows=2)
If necessary, the usecols parameter can be used to read in only the columns that are actually needed.
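A minimal sketch combining both parameters (the column names are placeholders for the columns actually needed):

import pandas as pd

# skip the first two rows and read only the required columns
df = pd.read_csv('path/to/Data.csv', skiprows=2, usecols=['SampleID', 'AL2O3', 'BA'])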
import missingno as msno
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("path/to/Data.csv")
# visualize missing values as a matrix
msno.matrix(data)
plt.show()
To get a better overview of the completeness of the data, the rows and columns can additionally be sorted by their number of null values.
import missingno as msno
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("path/to/Data.csv")
# sort rows so that the most complete rows come first
data_sorted_rows = data.loc[data.isnull().sum(axis=1).sort_values(ascending=True).index]
data_sorted_rows = data_sorted_rows.reset_index(drop=True)

# sort columns by their number of missing values
null_counts = data_sorted_rows.isnull().sum()
sorted_columns = null_counts.sort_values(ascending=True).index
data_sorted_rows = data_sorted_rows[sorted_columns]
msno.matrix(data_sorted_rows)
plt.show()
List the percentage of missing values for each column:
import numpy as np
# Iterate through each column in the dataset
for column in data.columns:
    # Calculate the percentage of missing values in the column
    missing_value_pct = np.mean(data[column].isnull())
    # Print column name and percentage of missing values
    print('Column: {}, Missing Values: {:.2f}%'.format(column, missing_value_pct * 100))
Before performing data cleansing, it is advisable to create a copy of the original and erroneous data to ensure the
traceability of the cleansing and to guarantee an audit-proof procedure. Simply deleting the original data after the
cleanup is not recommended, as this would make it impossible to verify possible sources of error. An alternative method,
especially in the case of multiple cleanup runs, is to store the corrected value in an additional column or row.
If there are a large number of columns and rows to be corrected, it may also make sense to create a separate table.
The decision of which method to choose depends on various factors, including the available storage space.
Ensure that all data are in the same units of measurement (e.g., ppm, ppb, mg/kg, etc.). This may require conversion of
units to provide a consistent basis for comparison.
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# Convert mg/kg to ppm (1 mg/kg = 1 ppm)
data['Element_in_mg/kg'] = data['Element_in_mg/kg'] * 1
data.rename(columns={'Element_in_mg/kg': 'Element_in_ppm'}, inplace=True)
Coordinate Systems:
Geochemical data are often tied to geographic locations, so it is important to use a uniform coordinate system
(e.g., WGS84, UTM, etc.). This may require the conversion of coordinates between different systems.
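As a sketch, coordinates can be converted with the pyproj library; the EPSG codes and the 'Easting'/'Northing' column names below are assumptions that must be adapted to the actual source and target systems:

import pandas as pd
from pyproj import Transformer

data = pd.read_csv('path/to/Data.csv')

# assumed: coordinates are stored as UTM zone 32N (EPSG:32632) and should become WGS84 (EPSG:4326)
transformer = Transformer.from_crs("EPSG:32632", "EPSG:4326", always_xy=True)
data['Longitude'], data['Latitude'] = transformer.transform(data['Easting'].values, data['Northing'].values)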
Ensure that all samples are labeled with consistent names to avoid confusion and potential errors in data analysis.
This may include replacing abbreviations or standardizing naming schemes.
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# Replace abbreviations in the "Item_Group" column
data['Item_Group'] = data['Item_Group'].replace({'mj': 'Major', 'ree': 'Rare Earth Elements', 'te': 'Trace Elements'})
Unification of analytical methods:
Geochemical data can be obtained by a variety of analytical methods. To ensure comparability, it is helpful to
standardize the data based on a common analytical method or at least to clearly document the methods used.
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# Document the analysis method for each column
analysis_methods = {
    'AL2O3': 'XRF',
    'BA': 'ICP-MS',
    # ...
}
# keep the mapping in a separate Series (one entry per data column),
# since it describes columns rather than individual samples
analysis_method_per_column = pd.Series(analysis_methods, name='Analysis_Method')
Unification of detection limits:
Different analyses have different detection limits. To make the data more comparable, it may be useful to mark all
values below a certain detection limit as “undetectable” or “below detection limit”.
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# Set values below the detection limit to "not detectable" (numeric columns only)
detection_limit = 0.01
numeric_cols = data.select_dtypes(include='number').columns
data[numeric_cols] = data[numeric_cols].mask(data[numeric_cols] < detection_limit, 'not detectable')
Quality Control and Assurance:
Establishing standard procedures for quality control and assurance can help improve the accuracy and reliability of data.
This includes reviewing sampling, analytical procedures, and data entry.
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# Check if all necessary columns are present
required_columns = ['SampleID', 'AL2O3', 'BA', 'Longitude', 'Latitude']
missing_columns = set(required_columns) - set(data.columns)
if missing_columns:
    print(f'Missing columns: {missing_columns}')

# Check for duplicates in the "SampleID" column
duplicate_sample_ids = data[data.duplicated(subset='SampleID', keep=False)]
if not duplicate_sample_ids.empty:
    print(f'Duplicate in SampleID: {duplicate_sample_ids["SampleID"].tolist()}')
There may be additional strings concatenated to some numeric values that we need to remove. For this we define a
function that removes these specific strings:
def remove_string(x):
    # only modify string values; leave all other values untouched
    if isinstance(x, str):
        return x.replace('string', '')
    else:
        return x
Here, 'string' in the replace function can be substituted with the actual string that should be removed.
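The function can then be applied element-wise to an affected column, for example (the column name 'BA' is only a placeholder):

# apply the cleaning function to every value of the affected column
df['BA'] = df['BA'].apply(remove_string)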
Depending on the dataset it can be very helpful to get the datatypes of each column:
df = pd.read_csv("/path/to/csv")
df.dtypes
Output:
SampleID int64
AL2O3 float64
BA float64
CAO float64
CE float64
CO float64
CR float64
CR2O3 float64
CS float64
DY float64
ER float64
EU float64
...
Certain columns may be read in with dtype object even though we need numeric values. These can be converted:
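A minimal sketch using pandas' to_numeric (the column name 'BA' is a placeholder for any affected column):

# convert an object column that should be numeric; entries that cannot be parsed become NaN
df['BA'] = pd.to_numeric(df['BA'], errors='coerce')
df.dtypes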
Output:
SampleID int64
AL2O3 float64
BA float64
CAO float64
CE float64
CO float64
CR float64
CR2O3 float64
CS float64
DY float64
ER float64
EU float64
The same applies to column names or categorical values; here you often have to pay attention to upper and lower case.
# make everything lower case
df['author_lower'] = df['LA'].str.lower()
df['author_lower'].value_counts(dropna=False)
Here, for example, all author names are converted to lowercase, in case the subsequent analysis method requires it.
A common hurdle when processing data is that some or all of the features are not recognized correctly. This typically
occurs in columns containing strings, where one or more spaces may appear before or after the actual value.
These should be removed:
# getting all the columns with string/mixed type values
str_cols = list(df.columns)
str_cols.remove('MNO')

# removing leading and trailing whitespace from columns with str type
for i in str_cols:
    df[i] = df[i].str.strip()
In the context of the Tidy Data structure, paying attention to these hurdles becomes indispensable.
Before setting out to remove missing values, or even using complex methods to fill them in, take a closer look at the
data set. You may be able to derive missing values from valid information that is already present in the dataset.
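A hypothetical sketch: if the 'Units' entry is missing for some rows but all samples of the same 'Item_Group' share one unit, the gaps can be filled from the group itself:

# fill missing 'Units' values from other rows of the same 'Item_Group'
data['Units'] = data.groupby('Item_Group')['Units'].transform(lambda s: s.ffill().bfill())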
If the values are numeric, we can use a histogram and a boxplot to detect outliers. A disadvantage of boxplots can be
that very pronounced outliers compress the rest of the plot and are therefore not represented clearly. Their advantage,
however, is that you can quickly see visually whether the values are roughly normally distributed.
Boxplot of observations
import pandas as pd
import matplotlib.pyplot as plt
# read in data
df = pd.read_csv("restructed_data.csv")

# draw boxplot
fig, ax = plt.subplots()
ax.boxplot(df['MNO'])

# identify outliers via the interquartile range
q1 = df['MNO'].quantile(0.25)
q3 = df['MNO'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers_indices = [i for i, value in enumerate(df['MNO']) if value < lower_bound or value > upper_bound]
outliers_y = [value for value in df['MNO'] if value < lower_bound or value > upper_bound]

# find SampleIDs of outliers
outliers_sample_ids = df.iloc[outliers_indices]['SampleID'].tolist()

# add SampleIDs as text to the boxplot
for index, outlier, sample_id in zip(outliers_indices, outliers_y, outliers_sample_ids):
    ax.annotate(str(sample_id), (1, outlier), textcoords="offset points", xytext=(0, 3), ha='center')

# add x-axis labels
ax.set_xticks([1])           # position of the x-tick
ax.set_xticklabels(['MnO'])  # label of the x-tick
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# replace 'MNO' with the actual column name
column_name = 'MNO'

# create histogram of the selected column
data[column_name].hist(bins=50)
plt.xlabel(column_name)
plt.ylabel('Number of observations')
plt.title(f'Histogram of {column_name} column')
plt.savefig('Hist_mno.png')
plt.show()
Then we can use the methods of descriptive statistics. Here it can be helpful to find out whether a characteristic and
its values are normally distributed or not. For example, the Shapiro-Wilk test can be used for this purpose.
import pandas as pd
import numpy as np
from scipy import stats
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# exclude non-analytical columns
exclude_columns = ['SampleID', 'Longitude', 'Latitude', 'Units', 'Item_Group']
selected_columns = [col for col in data.columns if col not in exclude_columns]

# perform Shapiro-Wilk test for each selected column
alpha = 0.05
for column_name in selected_columns:
    selected_data = data[column_name]
    # check if the data has a value range greater than zero
    if np.ptp(selected_data) > 0:
        statistic, p_value = stats.shapiro(selected_data)
        print(f"\nShapiro-Wilk test for column '{column_name}':")
        print(f"  statistics: {statistic}")
        print(f"  p-value: {p_value}")
        if p_value > alpha:
            print(f"  The data in column '{column_name}' appears to be normally distributed (H0 not rejected).")
        else:
            print(f"  The data in column '{column_name}' does not appear to be normally distributed (H0 rejected).")
    else:
        print(f"\nThe data in column '{column_name}' has a range of zero. Shapiro-Wilk test is not performed.")
Shapiro-Wilk test for column 'AL2O3':
Statistic: 0.9374846816062927
p-value: 0.4661906361579895
Data in column 'AL2O3' appear to be normally distributed (H0 not rejected).
Shapiro-Wilk test for column 'BA':
Statistic: 0.8701132535934448
p-value: 0.0655660405755043
Data in column 'BA' appear to be normally distributed (H0 not rejected).
Shapiro-Wilk test for column 'CAO':
Statistic: 0.777439534664154
p-value: 0.005249298643320799
Data in column 'CAO' does not appear to be normally distributed (H0 rejected).
Further methods such as the IQR rule can be used to statistically distinguish between outliers and errors. The flagged
outliers must of course be assessed, and how to deal with them depends on the data set and the goal of the project.
Outliers are innocent until proven guilty: they should not be removed unless there is a valid reason to do so.
Anomaly and outlier detection algorithms are also suitable for this task. Two algorithms that are included in
scikit-learn by default are presented here:
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# remove the columns you don't want to analyze
data = data.drop(columns=["SampleID", "Longitude", "Latitude", "Units", "Item_Group"])

# scale the data to make sure that all columns have a similar impact
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Create the Local Outlier Factor model with the desired parameters
lof = LocalOutlierFactor(n_neighbors=12, contamination=0.05)

# Fit the model and get the predictions
outlier_predictions = lof.fit_predict(scaled_data)

# Add the predictions to your original data set
data["outlier"] = outlier_predictions

# outliers are marked with -1, inliers with 1
outliers = data[data["outlier"] == -1]
inliers = data[data["outlier"] == 1]
print("Outliers:")
print(outliers)
print("Inliers:")
print(inliers)
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# remove the columns you don't want to analyze
data = data.drop(columns=["SampleID", "Longitude", "Latitude", "Units", "Item_Group"])

# scale the data to make sure that all columns have a similar impact
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Create the isolation forest model with the desired proportion of expected outliers
isolation_forest = IsolationForest(contamination=0.05)

# Fit the model and get the predictions
outlier_predictions = isolation_forest.fit_predict(scaled_data)

# Add the predictions to your original dataset
data["outlier"] = outlier_predictions

# outliers are marked with -1, inliers with 1
outliers = data[data["outlier"] == -1]
inliers = data[data["outlier"] == 1]
print("Outliers:")
print(outliers)
print("Inliers:")
print(inliers)
Depending on the specific goals of a study and the available data, these and other algorithms and methods can help
extract the information and insights needed to answer your scientific questions.
To identify such low-information features, we can create a list of columns in which a high percentage of rows share the
same value. A suitable example is to show the columns where over 95% of the rows have the same value.
import pandas as pd
# Load the data
data = pd.read_csv('path/to/Data.csv')

# Calculate the number of rows in the dataset
total_rows = data.shape[0]

# Create a list to hold columns with low information density
low_info_columns = []

# Go through each column in the dataset
for column in data.columns:
    # Count the number of unique values in the column, including NaN values
    unique_counts = data[column].value_counts(dropna=False)
    # Calculate the percentage of the most common value
    dominant_value_pct = (unique_counts / total_rows).iloc[0]
    # If the percentage is more than 95%, add the column to the list and print relevant info
    if dominant_value_pct > 0.95:
        low_info_columns.append(column)
        print('Column: {0}, Dominant Value Percentage: {1:.5f}%'.format(column, dominant_value_pct * 100))
        print(unique_counts)
        print()

# Create a new DataFrame with the columns of low information density
low_info_data = data[low_info_columns]

# Save the new DataFrame as a CSV file
low_info_data.to_csv("low_information_data.csv", index=False)
In such cases, the individual columns and values must be examined one by one to determine whether or not they are
informative. It is important to find out the decisive reason why the values are repeated. If the values are not
informative, they can be removed.
The quality of the data collected plays an essential role in providing meaningful information for the successful
implementation of a project. Characteristics that are not related to the project’s problem are irrelevant to the analysis.
A systematic examination of all available characteristics is required to determine their relevance. Characteristics
that are not meaningful to the project goals should be removed from the analysis to ensure focused and purposeful results.
import pandas as pd
# Load your Dataset
df = pd.read_csv('path/to/Data.csv')

# remove the row with index label n
df = df.drop(n)

# OR: remove the column 'MNO'
df = df.drop('MNO', axis=1)

# OR: remove the columns 'MNO' and 'FEO'
df = df.drop(['MNO', 'FEO'], axis=1)
Type 3: Duplicates
Duplicate data exists when copies of the same observation are present. It also often happens that several pieces of
information are combined in a single column and have to be handled separately; in that case the split function can be
used, as sketched below.
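A minimal sketch, assuming a hypothetical combined column 'Sample_Info' with values such as 'BH01_topsoil':

# split the combined column at the underscore into two new columns
df[['Site', 'Horizon']] = df['Sample_Info'].str.split('_', expand=True)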
In geochemistry, datasets may contain duplicates where all variable values within an observation are the same.
Identifying and removing such duplicates is an important step in data preparation to ensure the integrity and validity
of statistical analyses.
# Considering the 'MNO' column is unique, let's see the effect of removing it
deduplicated_data = df.drop(columns='MNO').drop_duplicates()

# Compare the shape before and after deduplication
print(f"Original DataFrame shape: {df.shape}")
print(f"Deduplicated DataFrame shape: {deduplicated_data.shape}")
Output:
Original DataFrame shape: (12, 49)
Deduplicated DataFrame shape: (12, 48)
In some cases, it is appropriate to weed out redundant records based on a set of specific, unique identifiers.
For example, in a geochemical context, it is extremely unlikely that two samples will be collected at exactly the same
time and location, and will also have identical chemical compositions.
A grouping of essential characteristics can serve as unique identifiers for such samples. These include, for example,
the time of sampling, geographic coordinates, pH, heavy metal and mineral content, and organic compound content.
These characteristics can be used to identify and remove possible duplicates.
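As a sketch, such a combination of identifying columns can be passed to drop_duplicates; the 'Sampling_Time' column below is an assumption and must be adapted to the actual dataset:

# treat rows as duplicates only if they agree on all identifying columns
id_columns = ['Sampling_Time', 'Longitude', 'Latitude']
df_unique = df.drop_duplicates(subset=id_columns, keep='first')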
Scaling and transformation refer to the fitting of data to a specific scale, such as in the range of 0-100 or 0-1.
This can facilitate the presentation and interpretation of data, especially in reducing skewness and handling outliers.
Examples of transformations include logarithm, square root, and inverse. Using these techniques can help optimize data
visualization and improve comparability of data points within a dataset.
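A minimal sketch of both ideas, using column names from the examples above and assuming strictly positive values for the log transform:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('path/to/Data.csv')

# log transformation to reduce right skew (requires strictly positive values)
data['BA_log'] = np.log10(data['BA'])

# scale selected columns to the range 0-1
scaler = MinMaxScaler()
data[['AL2O3', 'CAO']] = scaler.fit_transform(data[['AL2O3', 'CAO']])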
Missing values in geochemical data can be handled in several ways:
elimination, in which rows or columns with missing values are removed.
imputation, where missing values are calculated based on existing data, using various methods such as statistical values, linear regression, or hot deck imputation.
labeling, in which missing data are replaced with special values or categories to maintain the information content. It is important to distinguish between missing values, default values, and unknown values, and to consider the loss of information when deciding on a method.
Elimination:
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# remove rows with missing values
data_cleaned_rows = data.dropna()

# remove columns with missing values
data_cleaned_columns = data.dropna(axis=1)
Imputation:
import pandas as pd
import numpy as np
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# replace missing values with the median of each numeric column
data_median_imputed = data.fillna(data.median(numeric_only=True))

# Example: Replace missing values in column 'AL2O3' with random values
# within 2 standard deviations of the mean
mean_AL2O3 = data['AL2O3'].mean()
std_AL2O3 = data['AL2O3'].std()
count_nan_AL2O3 = data['AL2O3'].isnull().sum()
rand_AL2O3 = np.random.randint(mean_AL2O3 - 2 * std_AL2O3, mean_AL2O3 + 2 * std_AL2O3, size=count_nan_AL2O3)
data.loc[data['AL2O3'].isnull(), 'AL2O3'] = rand_AL2O3
Labeling:
import pandas as pd
# Load your Dataset
data = pd.read_csv('path/to/Data.csv')

# replace missing numeric values with 0
data_zero_filled = data.fillna(0)

# replace missing categorical values in the 'Units' column with 'Missing'
data_category_flagged = data.fillna({'Units': 'Missing'})
Verification should be performed both before and after scaling, transformation, and normalization. Performing
verification before scaling ensures that the changes made in the data cleaning process are correct and will not have
an unintended effect on the data. Performing verification after scaling, transformation, and normalization allows you
to verify the quality and consistency of the scaled and transformed data and ensure that the methods used have not
changed the data in an undesirable way.
Visualizations:
Re-create visualizations such as histograms, boxplots, or scatterplots to verify that the distribution of the data and
the relationships between columns are as expected.
Check the number of missing values:
You can do this again with the heatmap or the list of percentages of missing values.
Sample check:
Manually check a few data points to make sure the values are correct and consistent.
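For example, a few random rows can be pulled for manual inspection (assuming the cleaned dataset is loaded as data):

# inspect a handful of random rows by eye
print(data.sample(5, random_state=0))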
Check data types:
df = pd.read_csv("/path/to/csv")
df.dtypes
Check the uniqueness of IDs or keys:
If your dataset contains an ID column or unique key, make sure there are no duplicates.
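A quick check, assuming the ID column is called 'SampleID':

# True if every SampleID occurs exactly once
print(data['SampleID'].is_unique)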
Statistical Summaries:
import pandas as pd
from ydata_profiling import ProfileReport
data = pd.read_csv("restructed_data.csv")
columns_to_ignore = ['SampleID', 'Units', 'Item_Group', 'Latitude', 'Longitude']
columns_to_profile = [col for col in data.columns if col not in columns_to_ignore]
profile = ProfileReport(data[columns_to_profile], title="dataset")
profile.to_file("dataset.html")
Documentation of the data cleaning process is an essential part of processing geochemical data sets. Careful
documentation helps to ensure transparency, traceability, and reproducibility of results. This is critical to
maintaining scientific integrity and enhancing the credibility of research results. When documenting data cleaning
related to geochemical datasets, there are several aspects to consider:
Describe the raw data: Document the source and extent of geochemical data, including analytical methods used, sampling processes, and initial dataset structure.
Determine Data Quality Requirements: Determine the data quality and consistency criteria required for your specific geochemical question and document these requirements.
Steps of the Data Cleaning Process: List a detailed description of all steps performed in the data cleaning process, such as identifying and removing outliers, correcting inconsistent data, standardizing units of measure, or removing duplicates.
Tools and Scripts Used: Document all Python libraries, functions, and scripts used in the Data Cleaning process to ensure reproducibility of results.
Decisions and Justifications: Explain the decisions made during the Data Cleaning process and provide the rationale for those decisions. This may include, for example, selecting specific thresholds for removing outliers or applying specific normalization procedures.
Changes and Impacts: Describe how the data cleaning steps performed affected the structure, volume, and quality of the geochemical data and, if applicable, show the impact of these changes on the results of your analysis.
Versioning and Change History: Maintain a change history of the various versions of your Data Cleaning scripts and documentation to facilitate collaboration and tracking of changes over time.
Walker, M. (2020)
Python Data Cleaning Cookbook: Modern Techniques and Python Tools to Detect and Remove Dirty Data and Extract Key Insights. Packt Publishing.
pandas
McKinney, W. (2010). “Data Structures for Statistical Computing in Python.” Proceedings of the 9th Python in Science Conference. pp. 56-61.
missingno
Bilogur, A. (2018). "Missingno: A Missing Data Visualization Suite." Journal of Open Source Software, 3(22), 547. https://doi.org/10.21105/joss.00547
matplotlib
Hunter, J. D. (2007). “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering, 9(3), pp. 90-95.
pyproj
The pyproj library is a Python interface to PROJ (software for converting between geographic coordinate systems). The main reference would therefore be the PROJ software:
PROJ contributors (2020). “PROJ, a library for cartographic projections and coordinate transformations.”
scipy
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., … & SciPy 1.0 Contributors. (2020). “SciPy 1.0: fundamental algorithms for scientific computing in Python.” Nature methods, 17(3), pp. 261-272.
sklearn.ensemble.IsolationForest
Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). “Isolation Forest.” 2008 Eighth IEEE International Conference on Data Mining. pp. 413-422.
sklearn.preprocessing.StandardScaler
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Vanderplas, J. (2011). “Scikit-learn: Machine Learning in Python.” Journal of Machine Learning Research, 12, pp. 2825–2830.
numpy
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., … & Oliphant, T. E. (2020). "Array programming with NumPy." Nature, 585(7825), pp. 357-362.
ydata_profiling.ProfileReport
There is no dedicated scientific publication for the ydata_profiling library; it is recommended to cite the project's GitHub repository instead: https://github.com/ydataai/ydata-quality
With this licence, you may use, modify and share the work as long as you credit the original author. However, you may
not use it for commercial purposes, i.e. you may not make money from it. And if you make changes and share the new work,
it must be shared under the same conditions.