Data Transformation and Discretization: A Comprehensive Guide
Data transformation and discretization are important steps in the data preprocessing pipeline. They prepare raw data for analysis by converting it into forms suitable for mining, which improves the efficiency and accuracy of data mining algorithms. This article covers the concepts, techniques, and applications of data transformation and discretization.
1. What is data transformation?
Data transformation converts data into forms suitable for mining. This step is necessary because raw data is often noisy, inconsistent, or otherwise unsuitable for direct analysis. Common data transformation strategies include:
- Smoothing: Removing noise from data (for example, using binning or clustering).
- Attribute construction: Creating new features from existing ones (for example, area = height × width).
- Aggregation: Summarizing data (for example, daily sales → monthly sales).
- Normalization: Scaling data to a smaller range (for example, 0.0 to 1.0).
- Discretization: Replacing numeric values with interval or conceptual labels (for example, age → “youth”, “adult”, “senior”).
- Concept hierarchy generation: Generalizing data to higher-level concepts (for example, street → city → country).
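Two of the strategies listed above, summarizing daily sales into monthly sales and deriving an area feature from height and width, can be sketched in a few lines of Python. This is a minimal illustration only: the sample records and the function names `monthly_totals` and `add_area` are hypothetical and do not come from any particular library.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales records (date, amount), made up for illustration.
daily_sales = [
    (date(2024, 1, 5), 120.0),
    (date(2024, 1, 20), 80.0),
    (date(2024, 2, 3), 200.0),
]


def monthly_totals(records):
    """Aggregation: sum daily amounts into (year, month) totals."""
    totals = defaultdict(float)
    for day, amount in records:
        totals[(day.year, day.month)] += amount
    return dict(totals)


def add_area(rows):
    """Attribute construction: derive a new 'area' feature from height and width."""
    return [{**row, "area": row["height"] * row["width"]} for row in rows]


print(monthly_totals(daily_sales))  # {(2024, 1): 200.0, (2024, 2): 200.0}
print(add_area([{"height": 2.0, "width": 3.5}]))
```

Both functions return new structures rather than modifying their inputs, which keeps the raw data available for other preprocessing steps.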
2. Why is data transformation important?
- Improves data quality: Removes noise, inconsistencies, and redundancy.
- Enhances mining efficiency: Reduces the size and complexity of the data, which speeds up algorithms.
- Facilitates better insights: Converts data into forms that are easier to analyze and interpret.
3. Data transformation techniques
3.1 Normalization
Normalization scales numeric features to a specific range, such as [0.0, 1.0] or [-1.0, 1.0]. This is especially useful for distance-based mining algorithms (for example, k-nearest neighbors and clustering), because it prevents features with larger ranges from dominating features with smaller ranges.
3.1.1 Min-max normalization
Formula:
\[ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
where \(v'\) is the normalized value and:
- \(v\): The original value of feature A.
- \(\min_A\): The minimum value of feature A.
- \(\max_A\): The maximum value of feature A.
- \(\mathit{new\_min}_A\): The minimum value of the new range (for example, 0.0).
- \(\mathit{new\_max}_A\): The maximum value of the new range (for example, 1.0).
Example:
- Suppose the “income” feature has a minimum value of $12,000 and a maximum value of $98,000.
- We want to normalize an income value of $73,600 to the range [0.0, 1.0].
- The normalized value is (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0.0) + 0.0 ≈ 0.716.
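The income example above can be checked with a short Python sketch. It assumes the usual min-max definition: subtract the feature minimum, divide by the feature range, then rescale to the new range. The function name `min_max_normalize` and its default arguments are illustrative choices, not part of any library.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # A constant feature has no range to scale; this guard is an assumption
        # about how to handle that case, not something the article specifies.
        raise ValueError("feature has zero range")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income example: minimum $12,000, maximum $98,000, value $73,600.
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Values outside the original [min_a, max_a] range map outside the new range, so new data may need clipping or a recomputed minimum and maximum.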
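3.1.2 Z-score normalization
Z-score normalization rescales each value using the mean and standard deviation of feature A:
\[ v' = \frac{v - \bar{A}}{\sigma_A} \]
- \(\bar{A}\): The mean of feature A.
- \(\sigma_A\): The standard deviation of feature A.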
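A minimal Python sketch of z-score normalization follows, assuming the standard definition: subtract the feature mean from each value and divide by the feature standard deviation. The sample incomes are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    if sigma == 0:
        raise ValueError("feature has zero variance")
    return [(v - mu) / sigma for v in values]


# Hypothetical income values, made up for illustration.
incomes = [12_000, 35_000, 54_000, 73_600, 98_000]
scores = z_score_normalize(incomes)
print([round(s, 3) for s in scores])
```

After this transformation the values have mean 0 and standard deviation 1, which is useful when the true minimum and maximum are unknown or when outliers would dominate min-max normalization.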
3.1.3 Normalization by decimal scaling
Formula:
\[ v' = \frac{v}{10^{j}} \]
- \(j\): The smallest integer such that \(\max(|v'|) < 1\).
Example:
- Suppose the “price” feature has values ranging from -986 to 917.
- The maximum absolute value is 986.
- The smallest integer \(j\) such that \(986 / 10^{j} < 1\) is \(j = 3\).
- Each value is therefore divided by 1,000, so -986 normalizes to -0.986 and 917 normalizes to 0.917.
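The price example above can be reproduced with a small Python sketch that searches for the exponent j and then divides every value by 10^j. The function names are hypothetical, and the sketch only considers non-negative j, which is an assumption that fits the example.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j with max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, moving the decimal point j places."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


prices = [-986, 917]
print(decimal_scaling_exponent(prices))  # 3
print(decimal_scale(prices))             # [-0.986, 0.917]
```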
3.2 Discretization
Discretization replaces numeric values with interval or conceptual labels. This is useful for simplifying data and making patterns easier to understand.
3.2.1 Binning
Binning divides the range of a feature into bins (intervals). There are two main types:
- Equal-width binning:
  - Divide the range into \(k\) intervals of equal width.
  - Example: For “age” with values [12, 15, 18, 20, 22, 25, 30, 35, 40], create 3 bins (boundaries rounded to whole years):
    - Bin 1: [12, 20]
    - Bin 2: [21, 30]
    - Bin 3: [31, 40]
- Equal-frequency binning:
  - Divide the range into \(k\) bins, each containing approximately the same number of values.
  - Example: For the same “age” values, create 3 bins:
    - Bin 1: [12, 15, 18]
    - Bin 2: [20, 22, 25]
    - Bin 3: [30, 35, 40]
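Both binning schemes can be sketched in plain Python using the “age” values from the examples above. The function names are hypothetical. The equal-width version uses exact boundaries (width 28 / 3 ≈ 9.33) instead of whole-year boundaries, which assigns these particular values to the same bins.

```python
def equal_width_bins(values, k):
    """Split values into k bins whose intervals have the same width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; the range has zero width")
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        index = min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins


def equal_frequency_bins(values, k):
    """Split sorted values into k bins holding roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


ages = [12, 15, 18, 20, 22, 25, 30, 35, 40]
print(equal_width_bins(ages, 3))      # [[12, 15, 18, 20], [22, 25, 30], [35, 40]]
print(equal_frequency_bins(ages, 3))  # [[12, 15, 18], [20, 22, 25], [30, 35, 40]]
```

Equal-width bins are simple but can be skewed by outliers, while equal-frequency bins keep the bin sizes balanced at the cost of uneven interval widths.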
3.2.2 Histogram analysis
Histogram analysis partitions the values of a feature into disjoint ranges (buckets). The histogram analysis algorithm can be applied recursively to create a multilevel concept hierarchy.
- Example:
  - For the “price” feature with values [1, 1, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 25, 25, 25, 25, 28, 28, 30, 30, 30]:
  - Create an equal-width histogram with a bucket width of $10. Half-open intervals keep the buckets disjoint, so a price of $10 falls only in bin 1:
    - Bin 1: ($0, $10]
    - Bin 2: ($10, $20]
    - Bin 3: ($20, $30]
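The equal-width histogram for the “price” values above can be computed with a short Python sketch. It assumes half-open buckets of the form (lower, upper], so that a boundary value such as 10 is counted exactly once. The function name `equal_width_histogram` is hypothetical.

```python
import math
from collections import Counter

prices = [
    1, 1, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10,
    12, 14, 14, 15, 15, 15, 15, 15,
    18, 18, 18, 18, 18, 18,
    20, 20, 20, 20, 20, 20, 20,
    21, 21, 21, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]


def equal_width_histogram(values, width):
    """Count values per bucket ((b - 1) * width, b * width]."""
    # ceil(v / width) gives the 1-based bucket number for a half-open bucket.
    counts = Counter(math.ceil(v / width) for v in values)
    return {((b - 1) * width, b * width): counts[b] for b in sorted(counts)}


print(equal_width_histogram(prices, 10))
# {(0, 10): 12, (10, 20): 21, (20, 30): 12}
```

Applying the same function again to the values inside one bucket, with a smaller width, gives the next level of a multilevel hierarchy.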
3.2.3 Cluster, decision tree, and correlation analysis
- Cluster analysis:
  - Group similar values into clusters and replace the raw values with cluster labels.
  - Example: Clustering “age” values into “youth”, “middle-aged”, and “senior”.
- Decision tree analysis:
  - Use decision trees to split numeric features into intervals based on class labels.
  - Example: Splitting “income” into intervals that better predict “credit risk”.
- Correlation analysis:
  - Use measures such as chi-square to merge intervals with similar class distributions.
  - Example: Merging adjacent intervals if they have similar distributions of “purchase behavior”.
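The chi-square comparison behind the last bullet can be sketched as follows. This is an illustration under stated assumptions rather than a full merging algorithm: it computes the Pearson chi-square statistic for two adjacent intervals, each described by hypothetical class counts, and a low value suggests the intervals have similar class distributions and are candidates for merging. The counts and the labels “buy” and “skip” are made up.

```python
def chi_square(interval_a, interval_b):
    """Pearson chi-square statistic for two adjacent intervals.

    Each argument maps a class label to the number of values in that
    interval that carry the label.
    """
    rows = [interval_a, interval_b]
    classes = sorted(set(interval_a) | set(interval_b))
    row_totals = [sum(row.values()) for row in rows]
    col_totals = {c: sum(row.get(c, 0) for row in rows) for c in classes}
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for c in classes:
            expected = row_total * col_totals[c] / total
            if expected > 0:
                stat += (row.get(c, 0) - expected) ** 2 / expected
    return stat


# Hypothetical purchase-behavior counts per interval.
similar = chi_square({"buy": 10, "skip": 10}, {"buy": 20, "skip": 20})
different = chi_square({"buy": 18, "skip": 2}, {"buy": 2, "skip": 18})
print(similar, different)
```

In a merging procedure, the pair of adjacent intervals with the lowest statistic would be merged first, and merging would stop once every remaining pair exceeds a chosen threshold. The threshold is a tuning decision that this sketch does not make.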
3.3 Concept hierarchy generation for nominal data
Concept hierarchies generalize nominal features to higher-level concepts (for example, street → city → country). They can be created manually or generated automatically based on the number of distinct values of each feature.
- Example:
  - For the features “street”, “city”, “province”, and “country”:
    - Sort the features by their number of distinct values:
      - Country (15) → Province (365) → City (3,567) → Street (674,339).
    - Generate the hierarchy, placing features with fewer distinct values at higher levels:
      - Country → Province → City → Street.
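The automatic approach in the example above amounts to sorting features by their distinct-value counts. A minimal Python sketch, using the counts from the example and a hypothetical function name `generate_hierarchy`:

```python
# Distinct-value counts taken from the example above.
distinct_counts = {
    "street": 674_339,
    "city": 3_567,
    "province": 365,
    "country": 15,
}


def generate_hierarchy(counts):
    """Order features from the most general (fewest distinct values) to the most specific."""
    return sorted(counts, key=counts.get)


hierarchy = generate_hierarchy(distinct_counts)
print(" → ".join(hierarchy))  # country → province → city → street
```

This heuristic is only a starting point. A feature such as “weekday” has few distinct values but does not sit above “month” in a time hierarchy, so automatically generated hierarchies may need manual review.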
4. Practical applications
- Customer segmentation: Normalizing income and age features before clustering customers into segments.
- Market basket analysis: Discretizing purchase amounts into intervals to identify patterns.
- Fraud detection: Using concept hierarchies to generalize transaction locations (for example, street → city → country).
5. Conclusion
Data transformation and discretization are essential steps in data preprocessing. They improve data quality, enhance mining efficiency, and facilitate better insights. By applying normalization, discretization, and concept hierarchy generation, you can convert raw data into a form that is ready for analysis.