
name: brain-data-feature-engineering
description: Automatically analyzes BRAIN dataset fields and generates feature engineering ideas for alpha creation. Input: data category, delay, region parameters; Output: markdown document with deep feature engineering suggestions. The skill performs autonomous analysis based on dataset and field information, proposing meaningful feature concepts.
allowed-tools: Read, Grep, Glob, Write, mcp__brain-mcp__get_datasets, mcp__brain-mcp__get_datafields, mcp__brain-mcp__get_dataset_details

BRAIN Data Feature Engineering Workflow

Purpose: Automatically transform BRAIN dataset fields into deep, meaningful feature engineering ideas.

For Detailed Mindset Patterns: See reference.md for feature engineering philosophy.
For Implementation Examples: See examples.md for case studies.

Input Requirements

Required Parameters:

data_category: Dataset category (e.g., "fundamental", "analyst", "news", "model")
delay: Data delay setting (0 or 1)
region: Market region (e.g., "USA", "EUR", "ASI")

Optional Parameters:

universe: Trading universe (default: "TOP3000")
dataset_id: Specific dataset ID (if known, skips discovery phase)
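
The parameters above could be captured in a small validated container. This is a minimal sketch: the class name `SkillInput`, the `needs_discovery` helper, and the validation rules are illustrative assumptions, while the field names and the "TOP3000" default come from the lists above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SkillInput:
    """Inputs listed above; class name and validation rules are illustrative."""
    data_category: str
    delay: int
    region: str
    universe: str = "TOP3000"         # documented default
    dataset_id: Optional[str] = None  # when set, the discovery phase is skipped

    def __post_init__(self) -> None:
        # The text allows only 0 or 1 for the delay setting.
        if self.delay not in (0, 1):
            raise ValueError(f"delay must be 0 or 1, got {self.delay!r}")
        if not self.data_category or not self.region:
            raise ValueError("data_category and region are required")

    @property
    def needs_discovery(self) -> bool:
        """True when no dataset_id was supplied and Step 1 must pick one."""
        return self.dataset_id is None
```

For example, `SkillInput("fundamental", 1, "USA")` falls back to the default universe and reports that discovery is still needed.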

Workflow Overview

Step 1: Dataset Discovery

Autonomous Action:

Call mcp__brain-mcp__get_datasets with parameters (category, delay, region, universe)
If dataset_id provided: Validate and use it
If dataset_id not provided: Select the most relevant dataset based on metadata analysis
Output: Locked dataset_id for analysis
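
The validate-or-select branch in Step 1 might look like the sketch below. It operates on a plain list of dictionaries standing in for the result of mcp__brain-mcp__get_datasets; the keys "id", "coverage", and "fieldCount" and the scoring rule are assumptions made for illustration, since the actual "metadata analysis" is not specified here.

```python
from typing import Optional


def select_dataset(datasets: list, dataset_id: Optional[str] = None) -> str:
    """Return the dataset id to lock for analysis.

    `datasets` stands in for the get_datasets result. The keys "id",
    "coverage" and "fieldCount" are assumed, not documented.
    """
    if not datasets:
        raise ValueError("no datasets returned for these parameters")
    if dataset_id is not None:
        # A user-supplied id is validated, never silently replaced.
        if dataset_id not in {d["id"] for d in datasets}:
            raise ValueError(f"dataset_id {dataset_id!r} not found")
        return dataset_id
    # Placeholder relevance score: prefer coverage, then number of fields.
    best = max(datasets, key=lambda d: (d.get("coverage", 0.0), d.get("fieldCount", 0)))
    return best["id"]
```

Raising on an unknown dataset_id, rather than falling back to automatic selection, keeps the "locked" id traceable to either the user or the scoring rule.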

Step 2: Field Extraction and Deconstruction

Autonomous Action:

Call mcp__brain-mcp__get_datafields for the selected dataset
For each field, extract: id, description, dataType, update frequency, coverage
Deconstruct each field's meaning:
- What is being measured? (the entity/concept)
- How is it measured? (collection/calculation method)
- Time dimension? (instantaneous, cumulative, rate of change)
- Business context? (why does this field exist?)
- Generation logic? (reliability considerations)
Build field profiles: Structured understanding of each field's essence
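
One possible shape for a field profile is sketched below. The raw attributes mirror the list in Step 2 and the remaining attributes mirror the five deconstruction questions; the class name, attribute names, and the `open_questions` helper are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FieldProfile:
    # Raw attributes extracted from the field metadata.
    id: str
    description: str
    data_type: str
    update_frequency: str = "unknown"
    coverage: Optional[float] = None
    # Deconstruction answers, filled in during analysis.
    what_is_measured: str = ""
    how_is_measured: str = ""
    time_dimension: str = ""   # e.g. "instantaneous", "cumulative", "rate of change"
    business_context: str = ""
    generation_logic: str = ""

    def open_questions(self) -> List[str]:
        """Names of deconstruction answers that are still empty."""
        names = ["what_is_measured", "how_is_measured", "time_dimension",
                 "business_context", "generation_logic"]
        return [n for n in names if not getattr(self, n).strip()]
```

A profile with a non-empty `open_questions()` result signals that the field has only been skimmed, not deconstructed.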

Step 3: Autonomous Thinking and Analysis

The skill performs deep analysis based on collected information:

A. Field Relationship Mapping

Analyze logical connections between fields
Identify: independent fields, related fields, complementary fields
Map the "story" the dataset tells
Key question: What relationships are implied by these fields?

B. Question-Driven Feature Generation (Internal Process)

The skill asks itself these questions and generates feature concepts:

"What is stable?" → Look for invariants
- Which fields or combinations remain relatively constant?
- What stability measures make sense?
"What is changing?" → Analyze change patterns
- Rate of change, acceleration, volatility
- Trend vs. noise separation
"What is anomalous?" → Identify deviations
- Outliers, unusual patterns, breaks from normal
- Deviation magnitude and significance
"What is combined?" → Examine interactions
- How fields interact, amplify, or offset each other
- Synthesis creates new meaning
"What is structural?" → Study compositions
- Constituent parts, proportional relationships
- Structural changes over time
"What is cumulative?" → Explore accumulation effects
- Building up over time, decay effects
- Memory and persistence in data
"What is relative?" → Make comparisons
- Relative positioning, ranking, normalization
- Context within dataset
"What is essential?" → Distill to core meaning
- First principles thinking
- Strip away assumptions, get to essence

C. Feature Concept Generation

For each relevant question-field combination:

Formulate feature concept that answers the question
Define the concept clearly
Identify the logical meaning
Consider directionality (what high/low values mean)
Identify boundary conditions
Note potential issues/limitations
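
The question bank and the question-field pairing can be sketched as follows. The questions and themes are copied from part B; the function name `candidate_prompts` and the `is_relevant` filter are hypothetical, and the default keeps every pair.

```python
from itertools import product

# Question -> analytical theme, copied from the question bank in part B.
QUESTION_BANK = {
    "What is stable?": "Look for invariants",
    "What is changing?": "Analyze change patterns",
    "What is anomalous?": "Identify deviations",
    "What is combined?": "Examine interactions",
    "What is structural?": "Study compositions",
    "What is cumulative?": "Explore accumulation effects",
    "What is relative?": "Make comparisons",
    "What is essential?": "Distill to core meaning",
}


def candidate_prompts(field_ids, is_relevant=lambda question, field_id: True):
    """Yield one prompt per relevant question-field combination.

    `is_relevant` is a hypothetical filter; by default every pair is kept.
    """
    for question, field_id in product(QUESTION_BANK, field_ids):
        if is_relevant(question, field_id):
            yield {"question": question,
                   "theme": QUESTION_BANK[question],
                   "field": field_id}
```

With eight questions, a dataset of n fields yields at most 8n single-field prompts, so a relevance filter matters for large datasets.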

Step 4: Feature Documentation

For each generated feature concept, document:

Concept Name: Clear, descriptive name
Definition: One-sentence definition
Logical Meaning: What phenomenon/concept does it represent?
Why It's Meaningful: Why does this feature make sense?
Directionality: Interpretation of high vs. low values
Boundary Conditions: What extremes indicate
Data Requirements: What fields are used and any constraints
Potential Issues: Known limitations or concerns
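
The eight documentation items map naturally onto a record type. The sketch below is one way to hold and render them; the class name, attribute names, and the markdown layout (a level-4 heading followed by labeled bullets) are assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class FeatureConcept:
    name: str
    definition: str
    logical_meaning: str
    why_meaningful: str
    directionality: str
    boundary_conditions: str
    data_requirements: str
    potential_issues: str

    def to_markdown(self) -> str:
        """Render the concept as a heading plus one bullet per documented item."""
        rows = [
            ("Definition", self.definition),
            ("Logical Meaning", self.logical_meaning),
            ("Why It's Meaningful", self.why_meaningful),
            ("Directionality", self.directionality),
            ("Boundary Conditions", self.boundary_conditions),
            ("Data Requirements", self.data_requirements),
            ("Potential Issues", self.potential_issues),
        ]
        lines = [f"#### {self.name}"]
        lines += [f"- **{label}:** {text}" for label, text in rows]
        return "\n".join(lines)
```

Because every attribute is required, a concept cannot be constructed without stating its directionality, boundary conditions, and limitations.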

Step 5: Output Generation

Generate a structured markdown report and write it to ./output_report/region_delay_datasetID_ideas.md in the following format:

Dataset Understanding
- Dataset description and characteristics
- Field inventory (count, types, update patterns)
- Key observations about data structure
Field Deconstruction Analysis
- For each field: what it truly measures and why
- Logical relationships between fields
- "Story" the data tells
Feature Engineering Suggestions by Question Type

3.1 Stability Features
- Concepts for measuring stability/invariance
- Why stability matters in this dataset
- Example implementations
3.2 Change Features
- Concepts for capturing change patterns
- Rate, acceleration, volatility measures
- Temporal dynamics
3.3 Anomaly Features
- Deviation and outlier detection concepts
- Normal vs. abnormal identification
- Significance measures
3.4 Interaction Features
- Cross-field interaction concepts
- Amplification, offset, synthesis effects
- Combined meaning creation
3.5 Structure Features
- Composition and relationship concepts
- Proportional analysis
- Structural change detection
3.6 Cumulative Features
- Accumulation and decay concepts
- Memory/persistence measures
- Time-weighted effects
3.7 Relative Features
- Comparison and normalization concepts
- Ranking and percentile measures
- Context-relative positioning
3.8 Essential Features
- First-principles derived concepts
- Core meaning extraction
- Fundamental measures
Implementation Considerations
- Data quality notes
- Coverage considerations
- Computational complexity
- Potential improvements/extensions
Critical Questions for Further Exploration
- What aspects weren't covered?
- What additional data would be helpful?
- What assumptions should be challenged?
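
A minimal sketch of how the output path and an empty report skeleton might be assembled. It assumes that region, delay, and datasetID in the file name are placeholders for the actual values, and that the five top-level sections are numbered 1 to 5 (inferred from the 3.1 to 3.8 subsection numbers); neither is stated explicitly above. The function names are hypothetical.

```python
from pathlib import Path

SECTIONS = [
    "Dataset Understanding",
    "Field Deconstruction Analysis",
    "Feature Engineering Suggestions by Question Type",
    "Implementation Considerations",
    "Critical Questions for Further Exploration",
]
FEATURE_TYPES = ["Stability", "Change", "Anomaly", "Interaction",
                 "Structure", "Cumulative", "Relative", "Essential"]


def report_path(region: str, delay: int, dataset_id: str,
                root: str = "./output_report") -> Path:
    # Assumption: the three name parts are substituted with actual values.
    return Path(root) / f"{region}_{delay}_{dataset_id}_ideas.md"


def report_skeleton(title: str) -> str:
    """Return the section headings of the report as markdown, with no content."""
    lines = [f"# {title}", ""]
    for number, section in enumerate(SECTIONS, start=1):
        lines += [f"## {number}. {section}", ""]
        if number == 3:
            # Section 3 is split by question type (3.1 to 3.8).
            for sub, name in enumerate(FEATURE_TYPES, start=1):
                lines += [f"### 3.{sub} {name} Features", ""]
    return "\n".join(lines)
```

Under these assumptions, `report_path("USA", 1, "BEME")` points at a file named `USA_1_BEME_ideas.md` inside the output_report directory.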

Core Analysis Principles

From Data Essence: Start with what data truly means, not what it's traditionally used for
Autonomous Reasoning: The skill performs all thinking; no user input is required
Question-Driven: Internal question bank guides feature generation
Meaning Over Patterns: Prioritize logical meaning over conventional combinations
Transparency: Show reasoning process in output

Example Output Structure

When analyzing dataset 'BEME' (Balance Sheet and Market Data), the output would include:

Dataset Understanding

Fields Analyzed: book_value, market_cap, book_to_market, etc.
Key Observations: Dataset compares accounting values with market valuations

Field Deconstruction

book_value: Accountant's calculation of net asset value (quarterly, audited, historical cost-based)
market_cap: Market participants' valuation (continuous, forward-looking, sentiment-influenced)
book_to_market: Ratio comparing these two valuation perspectives

Feature Concepts Generated

From "What is stable?"

"Market reevaluation stability": Rolling coefficient of variation of book_to_market
Logic: Measures whether market opinion is stable or volatile
Meaning: Stable values suggest consensus, volatile values suggest disagreement/uncertainty
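
As a hedged illustration of the rolling coefficient of variation, the sketch below works on a plain list of book_to_market values for one instrument. The function name, the trailing-window convention, and the use of the population standard deviation are assumptions; an actual alpha expression would use the platform's own time-series operators.

```python
from statistics import mean, pstdev


def rolling_cv(values, window):
    """Rolling coefficient of variation (std / |mean|) over a trailing window.

    Positions without a full window, or with a zero mean, yield None.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
            continue
        chunk = values[i + 1 - window : i + 1]
        m = mean(chunk)
        out.append(pstdev(chunk) / abs(m) if m != 0 else None)
    return out
```

Lower values indicate a more stable ratio over the window; the window length controls how quickly the measure reacts to a change in market opinion.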

From "What is changing?"

"Value creation vs. market reevaluation decomposition": Separate book_value growth from market_cap growth
Logic: Distinguish fundamental value creation from market sentiment changes
Meaning: Which component drives changes in book_to_market?
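
One way to make this decomposition concrete, assuming book_to_market is defined as book_value / market_cap (the text describes it only as a ratio of the two), is to work in logs: the log change in the ratio equals the log growth of book_value minus the log growth of market_cap. The function and key names below are hypothetical.

```python
import math


def decompose_btm_change(book_prev, book_now, mcap_prev, mcap_now):
    """Split the log change in book-to-market into its two drivers.

    Assumes book_to_market = book_value / market_cap and strictly
    positive inputs (log growth is undefined otherwise).
    """
    if min(book_prev, book_now, mcap_prev, mcap_now) <= 0:
        raise ValueError("all inputs must be positive")
    book_growth = math.log(book_now / book_prev)   # fundamental value creation
    mcap_growth = math.log(mcap_now / mcap_prev)   # market reevaluation
    return {
        "book_value_growth": book_growth,
        "market_cap_growth": mcap_growth,
        "btm_change": book_growth - mcap_growth,
    }
```

The positivity check matters in practice because book_value can be zero or negative for some companies, in which case this log-based split does not apply.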

From "What is combined?"

"Intangible value proportion": (market_cap - book_value) / enterprise_value
Logic: Quantify proportion of value from intangibles (brand, growth, etc.)
Meaning: What percentage of valuation isn't captured on the balance sheet?
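
The ratio above can be written as a small guarded function. The function name and the choice to return None for a non-positive enterprise_value are assumptions; the formula itself is the one given in the text.

```python
from typing import Optional


def intangible_value_proportion(market_cap: float, book_value: float,
                                enterprise_value: float) -> Optional[float]:
    """(market_cap - book_value) / enterprise_value, or None if undefined."""
    if enterprise_value <= 0:
        # Assumption: treat a non-positive denominator as "no meaningful value".
        return None
    return (market_cap - book_value) / enterprise_value
```

For instance, a market_cap of 500, a book_value of 200, and an enterprise_value of 600 give 300 / 600 = 0.5. The result is negative whenever book_value exceeds market_cap, which is itself a boundary condition worth documenting.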

(Additional question-based features would follow...)

Implementation Notes

The skill should:

Analyze first, then generate: Fully understand dataset before proposing features
Show reasoning: Explain why each feature concept makes sense
Be specific: Reference actual field names and their characteristics
Be critical: Question assumptions and identify limitations
Be creative: Look beyond traditional financial metrics

The skill should NOT:

Ask users to think: All thinking is internal to the skill
Provide generic templates: Each analysis should be specific to the dataset
Rely on conventional wisdom: Challenge traditional approaches
Output patterns without meaning: Every suggestion must have clear logic

Quality Assurance

Self-Check Process:

All fields analyzed, not just skimmed
Field meanings understood beyond descriptions
Multiple question types explored
Each feature has clear logical meaning
Reasoning is explicit, not implicit
Limitations are acknowledged
Output is dataset-specific, not generic

Validation Questions:

Would this analysis help someone truly understand the data?
Are feature concepts novel yet meaningful?
Is the reasoning process transparent?
Does it avoid conventional thinking traps?

This skill performs autonomous deep analysis of BRAIN datasets, generating meaningful feature engineering concepts based on data essence and logical reasoning.
