Data Mining Basic and Advanced Concepts - Part 1: Variables
It is important to get to know your data. It's very tempting to jump straight into analysis and forecasting, or into building predictive or regression models. But the data needs to be ready for that!
Real-world data is typically noisy, imperfect and heterogeneous, so cleaning it and getting familiar with its structure is a crucial task.
Usually, this task of data cleaning is time-consuming.
From personal experience, I would say that about 75 to 80 percent of the time spent with data goes into getting it into a shape suitable for analysis and model building.
Getting to know your data
Start by checking the available data and its basic structure: the number of observations, the number of columns, and the type of each column.
An important step is to check the data types you are dealing with. Distinguishing between the different types of data attributes makes the analysis easier. But what is a data attribute, and what are the different types of data attributes?
''An attribute is a data field that represents a characteristic or feature of a data object. It can also be called a dimension, feature, or variable. Statisticians generally prefer the term variable.''
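As a minimal sketch of this first check, the snippet below uses only Python's standard library on a small made-up table. The column names, the values, and the helper names `infer_type` and `describe` are all assumptions for illustration, not part of any real dataset or library. It reports the number of observations, the number of columns, and a rough storage type for each column.

```python
import csv
import io

# Hypothetical sample data, made up for illustration only.
RAW = """age,occupation,satisfaction,subscribed
34,engineer,3,1
51,teacher,4,0
27,nurse,1,1
"""

def infer_type(values):
    """Guess a coarse storage type from a column's string values."""
    for caster, name in ((int, "integer"), (float, "real")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"

def describe(raw_csv):
    """Return (number of rows, number of columns, {column: inferred type})."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    columns = list(rows[0].keys()) if rows else []
    types = {c: infer_type([r[c] for r in rows]) for c in columns}
    return len(rows), len(columns), types

n_rows, n_cols, col_types = describe(RAW)
print(n_rows, "observations,", n_cols, "columns")
for col, kind in col_types.items():
    print(f"  {col}: {kind}")
```

Keep in mind that a storage type is not the same as a variable type: in this toy table `satisfaction` and `subscribed` are stored as integers, yet they would most likely be treated as a rating and a yes/no flag rather than as quantities.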
What are the different types of variables we can have?
Nominal Variables
A nominal variable is a variable with no numerical value. Nominal means "relating to names". The values of a nominal variable are symbols or names of things.
Every value represents a category or code. Examples are gender, occupation, hair color, etc.
Nominal variables are also referred to as categorical. A purely nominal (or categorical) variable is one that simply allows you to assign categories, but you can't meaningfully order those categories. If the categories have a meaningful order, the variable is an ordinal variable.
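Since nominal values have no order, about the only summaries that make sense are counts and the most frequent category (the mode). A minimal sketch, using made-up hair-color observations:

```python
from collections import Counter

# Hypothetical hair-color observations, made up for illustration.
hair_colors = ["brown", "black", "brown", "blond", "black", "brown"]

# Frequency of each category.
counts = Counter(hair_colors)

# The mode is the most frequent category; a mean or median of
# hair colors would be meaningless.
mode_value, mode_count = counts.most_common(1)[0]

print(dict(counts))
print("mode:", mode_value, "with", mode_count, "observations")
```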
Binary Variables
Binary variables are nominal variables with only two possible values, usually coded as 1 and 0. When the two values are "true" and "false", we can call them "Boolean variables".
In general, a binary variable records observations that occur in one of two possible states, often labelled zero and one, e.g. “improved/not improved” or “completed task/failed to complete task”.
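To illustrate the zero/one labelling, here is a small sketch with made-up task outcomes. Which state gets the label 1 is a convention chosen here for the example.

```python
# Hypothetical task outcomes, made up for illustration.
outcomes = ["completed", "failed", "completed", "completed", "failed"]

# Encode the two states as 1 and 0 (which state gets 1 is a convention).
encoded = [1 if o == "completed" else 0 for o in outcomes]

# With 0/1 coding, the mean equals the proportion of observations in state 1.
completion_rate = sum(encoded) / len(encoded)

print(encoded)
print(completion_rate)
```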
Ordinal Variables
An ordinal variable is an attribute whose values have a meaningful order or ranking among them, but the magnitude of the difference between successive values is not known.
An example of an ordinal variable is school grades: the possible values of a grade on a school test are A, B, C, D and F. These values are ordered and can be compared with each other (A is better than D, for example).
Ordinal attributes are often used to register subjective assessments of qualities that can't be measured objectively.
That's why they are used in surveys for ratings. Customer satisfaction can be measured in a survey on a five-point scale from 0 to 4:
0: very dissatisfied
1: somewhat dissatisfied
2: neutral
3: satisfied
4: very satisfied
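The satisfaction scale above can be sketched in code as an ordered list of labels. The survey responses are made up, and the names `SCALE` and `RANK` are just illustrative. The point is that comparisons and the median rely only on the ordering, while differences between the codes carry no meaning.

```python
from statistics import median_low

# The satisfaction scale, ordered from lowest to highest.
SCALE = ["very dissatisfied", "somewhat dissatisfied", "neutral",
         "satisfied", "very satisfied"]
RANK = {label: i for i, label in enumerate(SCALE)}

# Hypothetical survey responses, made up for illustration.
responses = ["satisfied", "neutral", "very satisfied",
             "somewhat dissatisfied", "satisfied"]

ranks = [RANK[r] for r in responses]

# Order comparisons are meaningful for ordinal data.
is_better = RANK["satisfied"] > RANK["neutral"]

# So is the median, which depends only on the ordering.
# median_low always returns an actual data value, so it maps back to a label.
median_label = SCALE[median_low(ranks)]

print(is_better)
print(median_label)
```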
Numeric Variables
A numeric variable is a measurable variable: it has a quantitative value that is represented by an integer or a real number.
Numeric variables can also be in the form of intervals.
There are two types of numeric variables:
- Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing a ranking of values, such attributes allow us to compare and quantify the difference between values.
- Ratio-Scaled Attributes
A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value. In addition, the values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
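Temperature is a common way to illustrate the difference between the two scales; it is used here as an assumed example. Celsius is interval-scaled because its zero is arbitrary, while Kelvin is ratio-scaled because its zero is a true zero point. A short sketch:

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences agree on both scales: the units are the same size.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Ratios do not agree: only Kelvin has an inherent zero point.
ratio_c = t2_c / t1_c  # 2.0, yet 20 degrees C is not "twice as hot" as 10
ratio_k = t2_k / t1_k  # roughly 1.035

print(diff_c, round(diff_k, 2))
print(ratio_c, round(ratio_k, 3))
```

Differences are meaningful on both scales, but a statement like "twice as much" only makes sense on the ratio scale.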
I will cover more details in the next parts of the Data Mining Basic and Advanced Concepts series in the upcoming days.