--- name: brain-data-feature-engineering description: >- Automatically analyzes BRAIN dataset fields and generates feature engineering ideas for alpha creation. Input: data category, delay, region parameters; Output: markdown document with deep feature engineering suggestions. The skill performs autonomous analysis based on dataset and field information, proposing meaningful feature concepts. allowed-tools: - Read - Grep - Glob - Write - mcp__brain-mcp__get_datasets - mcp__brain-mcp__get_datafields - mcp__brain-mcp__get_dataset_details --- # BRAIN Data Feature Engineering Workflow **Purpose**: Automatically transform BRAIN dataset fields into deep, meaningful feature engineering ideas. **For Detailed Mindset Patterns**: See `reference.md` for feature engineering philosophy. **For Implementation Examples**: See `examples.md` for case studies. ## Input Requirements ### Required Parameters: - **data_category**: Dataset category (e.g., "fundamental", "analyst", "news", "model") - **delay**: Data delay setting (0 or 1) - **region**: Market region (e.g., "USA", "EUR", "ASI") ### Optional Parameters: - **universe**: Trading universe (default: "TOP3000") - **dataset_id**: Specific dataset ID (if known, skips discovery phase) ## Workflow Overview ### Step 1: Dataset Discovery **Autonomous Action:** - Call `mcp__brain-mcp__get_datasets` with parameters (category, delay, region, universe) - If dataset_id provided: Validate and use it - If dataset_id not provided: Select the most relevant dataset based on metadata analysis - **Output**: Locked dataset_id for analysis ### Step 2: Field Extraction and Deconstruction **Autonomous Action:** - Call `mcp__brain-mcp__get_datafields` for the selected dataset - For each field, extract: id, description, dataType, update frequency, coverage - **Deconstruct each field's meaning**: * What is being measured? (the entity/concept) * How is it measured? (collection/calculation method) * Time dimension? (instantaneous, cumulative, rate of change) * Business context? (why does this field exist?) * Generation logic? (reliability considerations) - **Build field profiles**: Structured understanding of each field's essence ### Step 3: Autonomous Thinking and Analysis **The skill performs deep analysis based on collected information:** **A. Field Relationship Mapping** - Analyze logical connections between fields - Identify: independent fields, related fields, complementary fields - Map the "story" the dataset tells - **Key question**: What relationships are implied by these fields? **B. Question-Driven Feature Generation (Internal Process)** The skill asks itself these questions and generates feature concepts: 1. **"What is stable?"** → Look for invariants - Which fields or combinations remain relatively constant? - What stability measures make sense? 2. **"What is changing?"** → Analyze change patterns - Rate of change, acceleration, volatility - Trend vs. noise separation 3. **"What is anomalous?"** → Identify deviations - Outliers, unusual patterns, breaks from normal - Deviation magnitude and significance 4. **"What is combined?"** → Examine interactions - How fields interact, amplify, or offset each other - Synthesis creates new meaning 5. **"What is structural?"** → Study compositions - Constituent parts, proportional relationships - Structural changes over time 6. **"What is cumulative?"** → Explore accumulation effects - Building up over time, decay effects - Memory and persistence in data 7. **"What is relative?"** → Make comparisons - Relative positioning, ranking, normalization - Context within dataset 8. **"What is essential?"** → Distill to core meaning - First principles thinking - Strip away assumptions, get to essence **C. Feature Concept Generation** For each relevant question-field combination: - Formulate feature concept that answers the question - Define the concept clearly - Identify the logical meaning - Consider directionality (what high/low values mean) - Identify boundary conditions - Note potential issues/limitations ### Step 4: Feature Documentation **For each generated feature concept, document:** - **Concept Name**: Clear, descriptive name - **Definition**: One-sentence definition - **Logical Meaning**: What phenomenon/concept does it represent? - **Why It's Meaningful**: Why does this feature make sense? - **Directionality**: Interpretation of high vs. low values - **Boundary Conditions**: What extremes indicate - **Data Requirements**: What fields are used and any constraints - **Potential Issues**: Known limitations or concerns ### Step 5: Output Generation **Generate structured markdown report including:** 0. **Write the report to ./output_report/region_delay_datasetID_ideas.md** in the following format: 1. **Dataset Understanding** - Dataset description and characteristics - Field inventory (count, types, update patterns) - Key observations about data structure 2. **Field Deconstruction Analysis** - For each field: what it truly measures and why - Logical relationships between fields - "Story" the data tells 3. **Feature Engineering Suggestions by Question Type** **3.1 Stability Features** - Concepts for measuring stability/invariance - Why stability matters in this dataset - Example implementations **3.2 Change Features** - Concepts for capturing change patterns - Rate, acceleration, volatility measures - Temporal dynamics **3.3 Anomaly Features** - Deviation and outlier detection concepts - Normal vs. abnormal identification - Significance measures **3.4 Interaction Features** - Cross-field interaction concepts - Amplification, offset, synthesis effects - Combined meaning creation **3.5 Structure Features** - Composition and relationship concepts - Proportional analysis - Structural change detection **3.6 Cumulative Features** - Accumulation and decay concepts - Memory/persistence measures - Time-weighted effects **3.7 Relative Features** - Comparison and normalization concepts - Ranking and percentile measures - Context-relative positioning **3.8 Essential Features** - First-principles derived concepts - Core meaning extraction - Fundamental measures 4. **Implementation Considerations** - Data quality notes - Coverage considerations - Computational complexity - Potential improvements/extensions 5. **Critical Questions for Further Exploration** - What aspects weren't covered? - What additional data would be helpful? - What assumptions should be challenged? ## Core Analysis Principles 1. **From Data Essence**: Start with what data truly means, not what it's traditionally used for 2. **Autonomous Reasoning**: Skill performs all thinking, no user input required 3. **Question-Driven**: Internal question bank guides feature generation 4. **Meaning Over Patterns**: Prioritize logical meaning over conventional combinations 5. **Transparency**: Show reasoning process in output ## Example Output Structure When analyzing dataset 'BEME' (Balance Sheet and Market Data), the output would include: ### Dataset Understanding **Fields Analyzed**: book_value, market_cap, book_to_market, etc. **Key Observations**: Dataset compares accounting values with market valuations ### Field Deconstruction - **book_value**: Accountant's calculation of net asset value (quarterly, audited, historical cost-based) - **market_cap**: Market participants' valuation (continuous, forward-looking, sentiment-influenced) - **book_to_market**: Ratio comparing these two valuation perspectives ### Feature Concepts Generated **From "What is stable?"** - "Market reevaluation stability": Rolling coefficient of variation of book_to_market - **Logic**: Measures whether market opinion is stable or volatile - **Meaning**: Stable values suggest consensus, volatile values suggest disagreement/uncertainty **From "What is changing?"** - "Value creation vs. market reevaluation decomposition": Separate book_value growth from market_cap growth - **Logic**: Distinguish fundamental value creation from market sentiment changes - **Meaning**: Which component drives changes in book_to_market? **From "What is combined?"** - "Intangible value proportion": (market_cap - book_value) / enterprise_value - **Logic**: Quantify proportion of value from intangibles (brand, growth, etc.) - **Meaning**: What percentage of valuation isn't captured on the balance sheet? **(Additional question-based features would follow...)** ## Implementation Notes ### The skill should: 1. **Analyze first, then generate**: Fully understand dataset before proposing features 2. **Show reasoning**: Explain why each feature concept makes sense 3. **Be specific**: Reference actual field names and their characteristics 4. **Be critical**: Question assumptions and identify limitations 5. **Be creative**: Look beyond traditional financial metrics ### The skill should NOT: 1. **Ask users to think**: All thinking is internal to the skill 2. **Provide generic templates**: Each analysis should be specific to the dataset 3. **Rely on conventional wisdom**: Challenge traditional approaches 4. **Output patterns without meaning**: Every suggestion must have clear logic ## Quality Assurance **Self-Check Process:** - [ ] All fields analyzed, not just skimmed - [ ] Field meanings understood beyond descriptions - [ ] Multiple question types explored - [ ] Each feature has clear logical meaning - [ ] Reasoning is explicit, not implicit - [ ] Limitations are acknowledged - [ ] Output is dataset-specific, not generic **Validation Questions:** - Would this analysis help someone truly understand the data? - Are feature concepts novel yet meaningful? - Is the reasoning process transparent? - Does it avoid conventional thinking traps? --- *This skill performs autonomous deep analysis of BRAIN datasets, generating meaningful feature engineering concepts based on data essence and logical reasoning.*