| name | description | allowed-tools |
|---|---|---|
| brain-data-feature-engineering | Automatically analyzes BRAIN dataset fields and generates feature engineering ideas for alpha creation. Input: data category, delay, region parameters; Output: markdown document with deep feature engineering suggestions. The skill performs autonomous analysis based on dataset and field information, proposing meaningful feature concepts. | [Read Grep Glob Write mcp__brain-mcp__get_datasets mcp__brain-mcp__get_datafields mcp__brain-mcp__get_dataset_details] |
BRAIN Data Feature Engineering Workflow
Purpose: Automatically transform BRAIN dataset fields into deep, meaningful feature engineering ideas.
For Detailed Mindset Patterns: See reference.md for feature engineering philosophy.
For Implementation Examples: See examples.md for case studies.
Input Requirements
Required Parameters:
- data_category: Dataset category (e.g., "fundamental", "analyst", "news", "model")
- delay: Data delay setting (0 or 1)
- region: Market region (e.g., "USA", "EUR", "ASI")
Optional Parameters:
- universe: Trading universe (default: "TOP3000")
- dataset_id: Specific dataset ID (if known, skips discovery phase)
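The parameter list above can be captured in a small validated container. This is a minimal sketch only: the `SkillInput` class is hypothetical and not part of the skill; the field names and the `TOP3000` default come from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkillInput:
    """Hypothetical container for the skill's parameters (not part of the skill itself)."""
    data_category: str                 # e.g. "fundamental", "analyst", "news", "model"
    delay: int                         # 0 or 1
    region: str                        # e.g. "USA", "EUR", "ASI"
    universe: str = "TOP3000"          # documented default
    dataset_id: Optional[str] = None   # if set, the discovery phase is skipped

    def __post_init__(self) -> None:
        # Reject values outside the documented delay settings.
        if self.delay not in (0, 1):
            raise ValueError(f"delay must be 0 or 1, got {self.delay!r}")
        # The two remaining required parameters must be non-empty.
        if not self.data_category or not self.region:
            raise ValueError("data_category and region are required")


params = SkillInput(data_category="fundamental", delay=1, region="USA")
```

Leaving `dataset_id` as `None` signals that Step 1 should run the discovery phase.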
Workflow Overview
Step 1: Dataset Discovery
Autonomous Action:
- Call mcp__brain-mcp__get_datasets with parameters (category, delay, region, universe)
- If dataset_id provided: Validate and use it
- If dataset_id not provided: Select the most relevant dataset based on metadata analysis
- Output: Locked dataset_id for analysis
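The branching in this step can be sketched as follows. The shape of the tool's response is not documented here, so the `id` and `fieldCount` keys are assumptions, and "most fields wins" is only a placeholder for the metadata-based relevance judgment the skill actually makes.

```python
from typing import Optional


def lock_dataset_id(datasets: list[dict], dataset_id: Optional[str] = None) -> str:
    """Return the dataset ID to analyze.

    `datasets` stands in for the result of mcp__brain-mcp__get_datasets;
    the "id" and "fieldCount" keys are assumed, not a documented schema.
    """
    known_ids = {d["id"] for d in datasets}
    if dataset_id is not None:
        # A user-supplied ID is validated rather than trusted blindly.
        if dataset_id not in known_ids:
            raise ValueError(f"dataset_id {dataset_id!r} not found for these settings")
        return dataset_id
    if not datasets:
        raise ValueError("no datasets returned for these settings")
    # Placeholder relevance rule: prefer the dataset with the most fields.
    return max(datasets, key=lambda d: d.get("fieldCount", 0))["id"]


# Made-up metadata, for illustration only.
candidates = [
    {"id": "dataset_a", "fieldCount": 12},
    {"id": "dataset_b", "fieldCount": 40},
]
locked = lock_dataset_id(candidates)
```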
Step 2: Field Extraction and Deconstruction
Autonomous Action:
- Call mcp__brain-mcp__get_datafields for the selected dataset
- For each field, extract: id, description, dataType, update frequency, coverage
- Deconstruct each field's meaning:
  - What is being measured? (the entity/concept)
  - How is it measured? (collection/calculation method)
  - Time dimension? (instantaneous, cumulative, rate of change)
  - Business context? (why does this field exist?)
  - Generation logic? (reliability considerations)
- Build field profiles: Structured understanding of each field's essence
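One possible shape for a field profile is sketched below. The `FieldProfile` class and the question keys are hypothetical names chosen for illustration, and the sample values are made up rather than taken from a real dataset.

```python
from dataclasses import dataclass, field

# Keys for the five deconstruction questions listed above (names are illustrative).
DECONSTRUCTION_QUESTIONS = (
    "what_is_measured",
    "how_is_it_measured",
    "time_dimension",
    "business_context",
    "generation_logic",
)


@dataclass
class FieldProfile:
    # Metadata extracted for each field.
    id: str
    description: str
    data_type: str
    update_frequency: str
    coverage: float
    # Free-text answers, keyed by DECONSTRUCTION_QUESTIONS.
    answers: dict[str, str] = field(default_factory=dict)

    def unanswered(self) -> list[str]:
        """List the deconstruction questions that still have no answer."""
        return [q for q in DECONSTRUCTION_QUESTIONS if not self.answers.get(q)]


# Made-up sample values, for illustration only.
profile = FieldProfile(
    id="book_value",
    description="Net asset value per the balance sheet",
    data_type="MATRIX",
    update_frequency="quarterly",
    coverage=0.9,
)
profile.answers["what_is_measured"] = "Accountant's calculation of net asset value"
profile.answers["time_dimension"] = "instantaneous (point-in-time balance)"
missing = profile.unanswered()
```

Tracking unanswered questions gives a simple way to confirm that every field was deconstructed before analysis moves on.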
Step 3: Autonomous Thinking and Analysis
The skill performs deep analysis based on collected information:
A. Field Relationship Mapping
- Analyze logical connections between fields
- Identify: independent fields, related fields, complementary fields
- Map the "story" the dataset tells
- Key question: What relationships are implied by these fields?
B. Question-Driven Feature Generation (Internal Process)
The skill asks itself these questions and generates feature concepts:
- "What is stable?" → Look for invariants
  - Which fields or combinations remain relatively constant?
  - What stability measures make sense?
- "What is changing?" → Analyze change patterns
  - Rate of change, acceleration, volatility
  - Trend vs. noise separation
- "What is anomalous?" → Identify deviations
  - Outliers, unusual patterns, breaks from normal
  - Deviation magnitude and significance
- "What is combined?" → Examine interactions
  - How fields interact, amplify, or offset each other
  - Synthesis creates new meaning
- "What is structural?" → Study compositions
  - Constituent parts, proportional relationships
  - Structural changes over time
- "What is cumulative?" → Explore accumulation effects
  - Building up over time, decay effects
  - Memory and persistence in data
- "What is relative?" → Make comparisons
  - Relative positioning, ranking, normalization
  - Context within dataset
- "What is essential?" → Distill to core meaning
  - First principles thinking
  - Strip away assumptions, get to essence
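The internal question bank can be pictured as a simple mapping that is crossed with the dataset's fields. The sketch below is an assumption about representation only; the skill's actual reasoning is not a mechanical product of questions and fields, and `candidate_prompts` is a hypothetical helper.

```python
from itertools import product

# The eight internal questions and the theme each one points to.
QUESTION_BANK = {
    "What is stable?": "invariants",
    "What is changing?": "change patterns",
    "What is anomalous?": "deviations",
    "What is combined?": "interactions",
    "What is structural?": "compositions",
    "What is cumulative?": "accumulation effects",
    "What is relative?": "comparisons",
    "What is essential?": "core meaning",
}


def candidate_prompts(field_ids: list[str]) -> list[tuple[str, str, str]]:
    """Pair every question with every field as (question, theme, field_id)."""
    return [
        (question, theme, field_id)
        for (question, theme), field_id in product(QUESTION_BANK.items(), field_ids)
    ]


prompts = candidate_prompts(["book_value", "market_cap"])
```

Each tuple is only a starting point; combinations that yield no meaningful concept should be dropped rather than forced.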
C. Feature Concept Generation
For each relevant question-field combination:
- Formulate a feature concept that answers the question
- Define the concept clearly
- Identify the logical meaning
- Consider directionality (what high/low values mean)
- Identify boundary conditions
- Note potential issues/limitations
Step 4: Feature Documentation
For each generated feature concept, document:
- Concept Name: Clear, descriptive name
- Definition: One-sentence definition
- Logical Meaning: What phenomenon/concept does it represent?
- Why It's Meaningful: Why does this feature make sense?
- Directionality: Interpretation of high vs. low values
- Boundary Conditions: What extremes indicate
- Data Requirements: What fields are used and any constraints
- Potential Issues: Known limitations or concerns
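The documentation items above map naturally onto a record with one attribute per item. The following is a minimal sketch; the `FeatureConcept` class and its markdown layout are assumptions, and the sample text is partly illustrative placeholder content.

```python
from dataclasses import dataclass


@dataclass
class FeatureConcept:
    concept_name: str
    definition: str
    logical_meaning: str
    why_meaningful: str
    directionality: str
    boundary_conditions: str
    data_requirements: str
    potential_issues: str

    def to_markdown(self) -> str:
        """Render the concept as a heading plus one bullet per documented item."""
        rows = [
            ("Definition", self.definition),
            ("Logical Meaning", self.logical_meaning),
            ("Why It's Meaningful", self.why_meaningful),
            ("Directionality", self.directionality),
            ("Boundary Conditions", self.boundary_conditions),
            ("Data Requirements", self.data_requirements),
            ("Potential Issues", self.potential_issues),
        ]
        lines = [f"### {self.concept_name}"]
        lines += [f"- **{label}:** {text}" for label, text in rows]
        return "\n".join(lines)


# Sample values; the last three entries are illustrative placeholders.
concept = FeatureConcept(
    concept_name="Market reevaluation stability",
    definition="Rolling coefficient of variation of book_to_market",
    logical_meaning="Measures whether market opinion is stable or volatile",
    why_meaningful="Stable values suggest consensus, volatile values suggest disagreement",
    directionality="High = volatile valuation, low = stable valuation",
    boundary_conditions="Undefined when the rolling mean is zero",
    data_requirements="book_to_market",
    potential_issues="Sensitive to the chosen window length",
)
md = concept.to_markdown()
```

Making every attribute required means a concept cannot be recorded without its limitations and boundary conditions.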
Step 5: Output Generation
Generate a structured markdown report and write it to ./output_report/region_delay_datasetID_ideas.md in the following format:
1. Dataset Understanding
   - Dataset description and characteristics
   - Field inventory (count, types, update patterns)
   - Key observations about data structure
2. Field Deconstruction Analysis
   - For each field: what it truly measures and why
   - Logical relationships between fields
   - "Story" the data tells
3. Feature Engineering Suggestions by Question Type
   3.1 Stability Features
   - Concepts for measuring stability/invariance
   - Why stability matters in this dataset
   - Example implementations
   3.2 Change Features
   - Concepts for capturing change patterns
   - Rate, acceleration, volatility measures
   - Temporal dynamics
   3.3 Anomaly Features
   - Deviation and outlier detection concepts
   - Normal vs. abnormal identification
   - Significance measures
   3.4 Interaction Features
   - Cross-field interaction concepts
   - Amplification, offset, synthesis effects
   - Combined meaning creation
   3.5 Structure Features
   - Composition and relationship concepts
   - Proportional analysis
   - Structural change detection
   3.6 Cumulative Features
   - Accumulation and decay concepts
   - Memory/persistence measures
   - Time-weighted effects
   3.7 Relative Features
   - Comparison and normalization concepts
   - Ranking and percentile measures
   - Context-relative positioning
   3.8 Essential Features
   - First-principles derived concepts
   - Core meaning extraction
   - Fundamental measures
4. Implementation Considerations
   - Data quality notes
   - Coverage considerations
   - Computational complexity
   - Potential improvements/extensions
5. Critical Questions for Further Exploration
   - What aspects weren't covered?
   - What additional data would be helpful?
   - What assumptions should be challenged?
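The output path and section skeleton described in this step can be sketched as below. Two things are assumptions: that the file name joins region, delay, and dataset ID with underscores, and that sections are rendered as `##`/`###` markdown headings. Both helper functions are hypothetical.

```python
from pathlib import Path

SECTIONS = [
    "Dataset Understanding",
    "Field Deconstruction Analysis",
    "Feature Engineering Suggestions by Question Type",
    "Implementation Considerations",
    "Critical Questions for Further Exploration",
]
QUESTION_TYPES = [
    "Stability", "Change", "Anomaly", "Interaction",
    "Structure", "Cumulative", "Relative", "Essential",
]


def report_path(region: str, delay: int, dataset_id: str,
                root: str = "./output_report") -> Path:
    # Assumes the three values are joined with underscores in this order.
    return Path(root) / f"{region}_{delay}_{dataset_id}_ideas.md"


def report_skeleton() -> str:
    """Build the empty heading structure of the report."""
    lines = []
    for number, title in enumerate(SECTIONS, start=1):
        lines.append(f"## {number}. {title}")
        if title.startswith("Feature Engineering"):
            # One subsection per question type, numbered under the parent section.
            lines += [
                f"### {number}.{i} {name} Features"
                for i, name in enumerate(QUESTION_TYPES, start=1)
            ]
    return "\n".join(lines)


path = report_path("USA", 1, "demo_dataset")  # "demo_dataset" is a made-up ID
skeleton = report_skeleton()
```

The sketch builds the path and headings only; writing the file is left to the skill's Write tool.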
Core Analysis Principles
- From Data Essence: Start with what data truly means, not what it's traditionally used for
- Autonomous Reasoning: The skill performs all thinking; no user input is required
- Question-Driven: Internal question bank guides feature generation
- Meaning Over Patterns: Prioritize logical meaning over conventional combinations
- Transparency: Show reasoning process in output
Example Output Structure
When analyzing dataset 'BEME' (Balance Sheet and Market Data), the output would include:
Dataset Understanding
Fields Analyzed: book_value, market_cap, book_to_market, etc.
Key Observations: Dataset compares accounting values with market valuations
Field Deconstruction
- book_value: Accountant's calculation of net asset value (quarterly, audited, historical cost-based)
- market_cap: Market participants' valuation (continuous, forward-looking, sentiment-influenced)
- book_to_market: Ratio comparing these two valuation perspectives
Feature Concepts Generated
From "What is stable?"
- "Market reevaluation stability": Rolling coefficient of variation of book_to_market
  - Logic: Measures whether market opinion is stable or volatile
  - Meaning: Stable values suggest consensus, volatile values suggest disagreement/uncertainty
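To make the stability concept concrete, here is a minimal sketch of a rolling coefficient of variation for a single stock's series. It uses plain Python rather than BRAIN operators, takes the population standard deviation (one choice among several), and runs on made-up values.

```python
from statistics import mean, pstdev
from typing import Optional


def rolling_cv(values: list[float], window: int) -> list[Optional[float]]:
    """Rolling coefficient of variation (std / |mean|) over a trailing window.

    Returns None until a full window is available or when the window mean is 0.
    """
    out: list[Optional[float]] = []
    for end in range(1, len(values) + 1):
        if end < window:
            out.append(None)
            continue
        chunk = values[end - window:end]
        mu = mean(chunk)
        out.append(pstdev(chunk) / abs(mu) if mu != 0 else None)
    return out


# Made-up book_to_market history for one stock.
book_to_market = [0.50, 0.50, 0.50, 0.40, 0.60]
cv = rolling_cv(book_to_market, window=3)
```

In this toy series the value is zero while the ratio is flat and rises once it starts to swing, which matches the "consensus vs. disagreement" reading above.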
From "What is changing?"
- "Value creation vs. market reevaluation decomposition": Separate book_value growth from market_cap growth
  - Logic: Distinguish fundamental value creation from market sentiment changes
  - Meaning: Which component drives changes in book_to_market?
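The decomposition can be written out explicitly. Assuming book_to_market is defined as book_value / market_cap and both values are positive, the log change in the ratio equals the log growth of book_value minus the log growth of market_cap. The sketch below uses a hypothetical helper and made-up numbers.

```python
import math


def decompose_btm_change(book_prev: float, book_now: float,
                         cap_prev: float, cap_now: float) -> dict[str, float]:
    """Split the log change in book_to_market into its two drivers.

    Assumes book_to_market = book_value / market_cap with positive inputs, so
    dlog(book_to_market) = dlog(book_value) - dlog(market_cap).
    """
    book_growth = math.log(book_now / book_prev)
    cap_growth = math.log(cap_now / cap_prev)
    return {
        "book_value_growth": book_growth,
        "market_cap_growth": cap_growth,
        "book_to_market_change": book_growth - cap_growth,
    }


# Made-up values: book value up 10%, market cap up 21%.
parts = decompose_btm_change(100.0, 110.0, 1000.0, 1210.0)
```

Here the ratio falls even though book value grew, because market cap grew faster; the two growth terms show which side drove the change.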
From "What is combined?"
- "Intangible value proportion": (market_cap - book_value) / enterprise_value
  - Logic: Quantify proportion of value from intangibles (brand, growth, etc.)
  - Meaning: What percentage of valuation isn't captured on the balance sheet?
(Additional question-based features would follow...)
Implementation Notes
The skill should:
- Analyze first, then generate: Fully understand dataset before proposing features
- Show reasoning: Explain why each feature concept makes sense
- Be specific: Reference actual field names and their characteristics
- Be critical: Question assumptions and identify limitations
- Be creative: Look beyond traditional financial metrics
The skill should NOT:
- Ask users to think: All thinking is internal to the skill
- Provide generic templates: Each analysis should be specific to the dataset
- Rely on conventional wisdom: Challenge traditional approaches
- Output patterns without meaning: Every suggestion must have clear logic
Quality Assurance
Self-Check Process:
- All fields analyzed, not just skimmed
- Field meanings understood beyond descriptions
- Multiple question types explored
- Each feature has clear logical meaning
- Reasoning is explicit, not implicit
- Limitations are acknowledged
- Output is dataset-specific, not generic
Validation Questions:
- Would this analysis help someone truly understand the data?
- Are feature concepts novel yet meaningful?
- Is the reasoning process transparent?
- Does it avoid conventional thinking traps?
This skill performs autonomous deep analysis of BRAIN datasets, generating meaningful feature engineering concepts based on data essence and logical reasoning.