# Feature Engineering Mindset Patterns This document provides a comprehensive framework for **thinking** about feature engineering, not a list of patterns to apply blindly. ## The Core Philosophy **Feature engineering is not about finding predictive patterns—it's about understanding what data truly means and expressing that meaning in quantifiable ways.** ## 1. Data Semantic Understanding Framework ### Field Deconstruction Methodology **For each field, ask these fundamental questions:** #### What is being measured? - Not just the surface description—what is the actual entity or concept? - Example: Don't think "P/E ratio", think "price divided by earnings per share" - What is the "thing" behind the numbers? #### How is it measured? - Data collection method (survey, sensor, calculation) - Assumptions embedded in measurement - Frequency and timing considerations - Example: Book values are quarterly, audited, historical cost; market cap is continuous, forward-looking #### What is the time dimension? - Instantaneous snapshot (price at moment T) - Cumulative value (total sales to date) - Rate of change (velocity, acceleration) - Memory/persistence (how long effects last) #### Why does this field exist? - What problem was it designed to solve? - Who uses it and for what purpose? - What business process generates it? ### Field Relationship Mapping **Find the story the data tells:** #### Identify connections: - **Causal**: X causes Y (revenue → profit) - **Complementary**: X and Y measure related aspects (price & volume) - **Conflicting**: X and Y can diverge (book value vs. market cap) - **Independent**: X and Y are unrelated (company location vs. stock price) #### Build the narrative: - What is the complete picture these fields paint? - What are the key turning points? - What is missing that would complete the story? ### Data Quality Assessment **Evaluate from the source:** #### Generation mechanisms: - Manual entry (human error, bias, gaming) - Automated collection (sensor precision, calibration) - Calculated values (formula assumptions, input quality) #### Reliability indicators: - Audit trails and verification processes - Consistency checks across sources - Update frequency vs. true change rate ## 2. First-Principles Thinking **Strip away all labels and assumptions.** ### The Process: 1. **Forget what you "know"**: Ignore domain-specific labels 2. **Identify raw components**: What are the fundamental elements? 3. **Question everything**: Why is it measured this way? 4. **Rebuild from basics**: Construct features from fundamental truths ### Example: **Don't say**: "P/E ratio measures valuation" **Do say**: "Price per share divided by earnings per share compares market price to accounting profit" **First principles analysis**: - Price: What market participants collectively believe value is - Earnings: Accounting measure of profit generation - Ratio: Comparison of two different perspectives on value - **Insight**: The spread between perspectives is what matters, not the ratio itself ### Exercise: For any field, write down: - What is literally being measured (no jargon) - What assumptions are built in - What could cause it to be wrong - What it would mean if it were very high or very low ## 3. Question-Driven Feature Generation **Start with questions, not formulas.** ### The Question Bank: #### Q1: "What is stable?" (Invariance) **Purpose**: Find what doesn't change—it's often more meaningful than what does **Leads to features about:** - Stability measures (coefficient of variation) - Invariant relationships (ratios that stay constant) - Structural constants (parameters that define the system) **Examples**: - "Customer acquisition cost stability" = std_dev(CAC) / mean(CAC) - *Meaning*: Is our cost structure predictable? - *High value*: Costs are volatile, business model is unstable - *Low value*: Costs are predictable, scalable model #### Q2: "What is changing?" (Dynamics) **Purpose**: Understand motion, rate, and direction **Leads to features about:** - Velocity and acceleration - Trend vs. noise - Change significance **Examples**: - "Growth acceleration" = (revenue_t - revenue_{t-1}) - (revenue_{t-1} - revenue_{t-2}) - *Meaning*: Is growth speeding up or slowing down? - *High value*: Accelerating growth - *Low value*: Decelerating growth - *Why it matters*: Acceleration is early signal of inflection points #### Q3: "What is anomalous?" (Deviation) **Purpose**: Identify what breaks patterns—the exceptions reveal rules **Leads to features about:** - Outliers and extremes - Deviation from normal - Pattern breaks **Examples**: - "Earnings surprise magnitude" = (actual - expected) / |expected| - *Meaning*: How much did results deviate from expectations? - *High value*: Significant surprise (positive or negative) - *Why it matters*: Surprises often trigger re-evaluation #### Q4: "What is combined?" (Interaction) **Purpose**: Understand how elements affect each other **Leads to features about:** - Synergies and conflicts - Joint effects - Conditional relationships **Examples**: - "Marketing-sales synergy" = (marketing_spend × sales_efficiency) - *Meaning*: Do marketing and sales amplify each other? - *High value*: Strong synergy (1+1=3) - *Low value*: Weak synergy (1+1=1.5) - *Why it matters*: Synergy indicates scalability #### Q5: "What is structural?" (Composition) **Purpose**: Decompose wholes into meaningful parts **Leads to features about:** - Component breakdowns - Proportional relationships - Structure changes **Examples**: - "Recurring revenue quality" = subscription_revenue / total_revenue - *Meaning*: What portion of revenue is predictable? - *High value*: High-quality recurring revenue - *Low value*: Low-quality one-time revenue - *Why it matters*: Predictability affects valuation #### Q6: "What is cumulative?" (Accumulation) **Purpose**: Capture time-based build-up and decay **Leads to features about:** - Running totals and diminishing returns - Memory effects - Time-weighted values **Examples**: - "Customer relationship depth" = Σ(purchase_value × e^{-days_ago / half_life}) - *Meaning*: Time-decayed cumulative purchase value - *High value*: Deep, recent relationship - *Low value*: Shallow or old relationship - *Why it matters*: Recency and frequency predict loyalty #### Q7: "What is relative?" (Comparison) **Purpose**: Understand position in context **Leads to features about:** - Rankings and percentiles - Normalizations - Context-aware measures **Examples**: - "Relative efficiency" = company_efficiency / industry_median_efficiency - *Meaning*: How efficient vs. peers? - *High value*: More efficient than typical - *Low value*: Less efficient than typical - *Why it matters*: Competitiveness indicator #### Q8: "What is essential?" (Essence) **Purpose**: Distill to core truths **Leads to features about:** - First-principles measures - Fundamental relationships - Stripped-down indicators **Examples**: - "Core profitability" = (revenue - variable_costs) / revenue - *Meaning*: Profitability without fixed cost distortions - *Why it matters*: Shows true unit economics ### How to Use the Question Bank: **For any dataset**: 1. Go through each question 2. Ask: "Which fields or combinations can answer this?" 3. Formulate specific feature concepts 4. Validate each concept has clear meaning 5. Document the reasoning **Example Workflow:** ``` Dataset: Sales data with fields [customer_id, order_value, order_date, product_category] Q: "What is stable?" → Average order value per customer over time → Favorite category per customer (most frequent) → Purchase frequency pattern Q: "What is changing?" → Order value trend (increasing/decreasing) → Category preference evolution → Purchase interval changes Q: "What is anomalous?" → Orders far from customer's typical behavior → Sudden category switches → Unusually large/small orders Q: "What is combined?" → Order value × frequency = total value → Category diversity × consistency = loyalty measure → Recency × frequency = engagement score ... (continue through all questions) ``` ## 4. Field Combination Logic Patterns ### When you combine fields, what are you really doing? #### Addition: "X + Y" → What does this sum represent? **Good when**: Combining parts of a whole - Total revenue = product_A_revenue + product_B_revenue **Bad when**: Adding unrelated concepts - Price + volume (What does this mean?) #### Subtraction: "X - Y" → What is the difference telling you? **Good when**: Measuring gap or surplus - Profit = revenue - costs - Shortfall = target - actual **Bad when**: Ignoring that difference scales with magnitude - Revenue_2023 - revenue_2022 (better: percentage change) #### Multiplication: "X × Y" → What is the joint effect? **Good when**: Capturing interaction or scaling - Total_value = price × quantity - Weighted_importance = score × weight **Bad when**: Mixing units without meaning - Revenue × employee_count (What is "dollar-employees"?) #### Division: "X / Y" → What ratio or rate are you computing? **Good when**: Creating relative measures - Efficiency = output / input - Concentration = part / whole **Bad when**: Denominator can be zero or meaningless - Revenue / days_since_founded (early days distort heavily) #### Conditional: "If X then Y" → What condition matters? **Good when**: Threshold effects exist - If temperature > 100°C then phase = "gas" - If churn_risk > 0.8 then intervene = true **Bad when**: Arbitrary thresholds without justification - If customer_age > 30 then category = "old" (why 30?) ### The Deeper Question: **"What new information does this combination create?"** A good combination: - Reveals something the individual fields hide - Creates a new concept with clear meaning - Has intuitive interpretation A bad combination: - Just applies math to numbers - Creates meaningless units (dollar-days per employee) - Is hard to explain ## 5. Escaping Conventional Thinking Traps ### Trap 1: "This is a [field type], so I should..." **Wrong**: "This is price data, so I should calculate moving averages" **Right**: "This is a time series of transaction values—what patterns exist?" **Escaping method**: Pretend you don't know the field name or domain. Just look at: - Data type (number, category, date) - Update frequency - Distribution - Missingness pattern **Ask**: What would a data scientist from a different field see? ### Trap 2: "Everyone uses [conventional feature], so I will too" **Wrong**: Building P/E, moving averages, RSI because "that's what you do" **Right**: Asking "What does this ratio truly mean? Is there a better way to express that concept?" **Example with P/E**: - Conventional: P/E = price / earnings ("valuation metric") - First principles: Compares market's forward-looking assessment to accounting record - Deeper question: Why do these diverge? What does divergence mean? - Better feature: Track divergence trend, not just level ### Trap 3: "Complexity = better" **Wrong**: Adding more variables, interactions, conditions to improve "sophistication" **Right**: Simpler is often more robust and interpretable **Test**: Can you explain the feature in one sentence to a non-expert? - If no → It's too complex - If yes → It might be valuable ### Trap 4: "Feature engineering is separate from domain knowledge" **Wrong**: Applying math without understanding what fields mean **Right**: Deep domain understanding → Better features **Process**: 1. Understand the business process that generates each field 2. Identify pain points and edge cases in that process 3. Build features that capture those nuances 4. Validate with domain experts ## 6. Feature Validation Checklist ### Before finalizing any feature, verify: #### □ Clear Definition - [ ] Can be explained in one sentence - [ ] Uses precise language - [ ] Avoids jargon and buzzwords #### □ Logical Meaning - [ ] Represents a real phenomenon or concept - [ ] Not just a mathematical operation - [ ] Has intuitive interpretation #### □ Business Relevance - [ ] Connects to real-world decision-making - [ ] Answers a meaningful question - [ ] Reveals actionable insight #### □ Directional Understanding - [ ] What does high value mean? - [ ] What does low value mean? - [ ] Is there an optimal range? #### □ Boundary Conditions - [ ] What do extreme values indicate? - [ ] What happens at zero/infinity? - [ ] Are there theoretical limits? #### □ Data Quality Awareness - [ ] What are sources of noise? - [ ] When might this be unreliable? - [ ] What biases could affect it? #### □ Novelty Check - [ ] Does this reveal something new? - [ ] Or just repackage existing information? - [ ] Would an expert learn something? ### Example Validation: **Feature**: Customer purchase velocity = total_purchases / account_age_days - **Clear definition**: "Average number of purchases per day since account creation" - **Logical meaning**: Measures purchase frequency over customer lifetime - **Business relevance**: Indicates customer engagement and habit formation - **Directional**: High = frequent buyer, Low = infrequent buyer - **Boundaries**: Zero = no purchases, Very high = possible data error or bulk buyer - **Data quality**: Affected by returns, multi-item orders, gift purchases - **Novelty**: Reveals engagement pattern beyond simple total purchases ## 7. Creative Thinking Techniques ### A. Lateral Thinking (Borrow from other domains) **Ask**: How would a physicist/biologist/sociologist approach this? **Example - Physics**: - Field: Customer usage frequency - Physics concept: Resonance frequency - Feature idea: "Natural usage cadence" = frequency with highest amplitude - **Meaning**: Inherent rhythm of customer behavior **Example - Biology**: - Field: Product adoption rates - Biology concept: Population growth - Feature idea: "Adoption growth model" = fit logistic growth curve - **Meaning**: Identify inflection point where growth slows **Exercise**: For each field, brainstorm 3 analogies from other disciplines ### B. Vertical Thinking (Keep asking "why?") **The 5 Whys exercise**: 1. Why do customers churn? → Because they stop using the product 2. Why do they stop using it? → Because they don't find value 3. Why don't they find value? → Because their needs changed 4. Why did needs change? → Because their business grew 5. Why did business growth matter? → Because the product didn't scale with them **Resulting feature**: "Scalability mismatch" = customer_growth_rate / product_capability **Process**: Don't stop at surface-level questions. Dig until you hit fundamental truths. ### C. Perspective Shifting (Change your viewpoint) **Time ↔ Space**: - If you have time series data, think about spatial patterns (clustering, distribution) - If you have spatial/cross-sectional data, think about evolution over time **Individual ↔ Collective**: - Zoom in: What does this mean for one entity? - Zoom out: What does this pattern mean for the group? **Quantitative ↔ Qualitative**: - What would the qualitative description be? - How do you quantify that description? ### D. Constraint-Based Creativity (Add restrictions) **Artificial constraints force creative solutions**: - "You can only use one field" → Forces focus on that field's nuances - "You can only use addition/subtraction" → Simplifies relationships - "You must include time" → Adds temporal dimension - "You must be able to explain to a 5-year-old" → Forces simplicity **Example**: "Explain customer value using only purchase timestamps" - Feature: Time-based engagement depth (weighted recency/frequency) - **Meaning**: Recent, frequent purchases = high engagement ## 8. From Concepts to Implementations ### Bridging the Gap: **Concept**: "Customer engagement momentum" (from "What is changing?") - **Meaning**: Is engagement increasing or decreasing in intensity? - **Implementation**: Δ(engagement_score) over time, with acceleration **Steps**: 1. Define engagement_score (purchase frequency × recency_weight) 2. Calculate change: engagement_today - engagement_last_week 3. Calculate acceleration: change_today - change_last_week 4. **Result**: Positive = increasing momentum, Negative = losing momentum ### Common Implementation Patterns: **For stability**: Rolling coefficient of variation, autocorrelation, entropy **For change**: Differences, log differences, second differences **For anomalies**: Z-scores, isolation forest scores, deviation from predicted **For interactions**: Products, ratios, conditional means **For structure**: Component ratios, hierarchical decompositions **For accumulation**: Running sums, exponentially weighted sums, integration **For relativity**: Percentiles, z-scores, min-max scaling **For essence**: Factor analysis, PCA, simple base components ### Quality Metrics for Implementation: **Coverage**: What percentage of entities have data? **Stability**: Does the feature behave consistently across time periods? **Interpretability**: Can you explain the value meaningfully? **Actionability**: Does it suggest a clear action? ## Summary: The Mindset in Seven Words **"Understand deeply, question assumptions, express meaningfully"** --- *This document provides thinking tools, not formulas. True feature engineering happens when you combine deep data understanding with creative questions about what that data means.*