---
name: brain-data-feature-engineering
description: Automatically analyzes BRAIN dataset fields and generates feature engineering ideas for alpha creation. Input: data category, delay, region parameters; Output: markdown document with deep feature engineering suggestions. The skill performs autonomous analysis based on dataset and field information, proposing meaningful feature concepts.
allowed-tools: Read, Grep, Glob, Write, mcp__brain-mcp__get_datasets, mcp__brain-mcp__get_datafields, mcp__brain-mcp__get_dataset_details
---

BRAIN Data Feature Engineering Workflow

Purpose: Automatically transform BRAIN dataset fields into deep, meaningful feature engineering ideas.

For Detailed Mindset Patterns: See reference.md for feature engineering philosophy. For Implementation Examples: See examples.md for case studies.

Input Requirements

Required Parameters:

  • data_category: Dataset category (e.g., "fundamental", "analyst", "news", "model")
  • delay: Data delay setting (0 or 1)
  • region: Market region (e.g., "USA", "EUR", "ASI")

Optional Parameters:

  • universe: Trading universe (default: "TOP3000")
  • dataset_id: Specific dataset ID (if known, skips discovery phase)

Workflow Overview

Step 1: Dataset Discovery

Autonomous Action:

  • Call mcp__brain-mcp__get_datasets with parameters (category, delay, region, universe)
  • If dataset_id provided: Validate and use it
  • If dataset_id not provided: Select the most relevant dataset based on metadata analysis (a ranking sketch follows this list)
  • Output: a locked-in dataset_id used for the remaining steps
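
A minimal sketch of the selection heuristic, assuming the tool returns a list of metadata dictionaries; the keys used here (description, coverage, fieldCount) and the keyword scoring are illustrative assumptions, not the tool's confirmed response schema:

```python
# Hypothetical ranking of candidate datasets returned by
# mcp__brain-mcp__get_datasets. The metadata keys ("description",
# "coverage", "fieldCount") are assumptions, not a confirmed schema.
def select_dataset(datasets: list[dict], keywords: list[str]) -> str:
    def score(ds: dict) -> float:
        text = ds.get("description", "").lower()
        keyword_hits = sum(kw in text for kw in keywords)
        # Prefer keyword-relevant, broad-coverage, field-rich datasets.
        return (keyword_hits * 10
                + ds.get("coverage", 0.0) * 5
                + min(ds.get("fieldCount", 0), 100) / 100)

    return max(datasets, key=score)["id"]
```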

Step 2: Field Extraction and Deconstruction

Autonomous Action:

  • Call mcp__brain-mcp__get_datafields for the selected dataset
  • For each field, extract: id, description, dataType, update frequency, coverage
  • Deconstruct each field's meaning:
    • What is being measured? (the entity/concept)
    • How is it measured? (collection/calculation method)
    • Time dimension? (instantaneous, cumulative, rate of change)
    • Business context? (why does this field exist?)
    • Generation logic? (reliability considerations)
  • Build field profiles: Structured understanding of each field's essence (one possible record layout is sketched below)
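
One possible record layout for a field profile; the attribute names mirror the deconstruction checklist above rather than any official BRAIN schema:

```python
from dataclasses import dataclass

# Illustrative field-profile record; attribute names mirror the
# checklist above, not an official BRAIN schema.
@dataclass
class FieldProfile:
    field_id: str            # e.g. "book_value"
    description: str         # raw description from get_datafields
    data_type: str           # e.g. MATRIX, VECTOR, GROUP
    update_frequency: str    # e.g. daily, quarterly
    coverage: float          # fraction of the universe covered
    measured_entity: str     # what is being measured
    measurement_method: str  # how it is collected/calculated
    time_dimension: str      # instantaneous, cumulative, rate of change
    business_context: str    # why this field exists
    reliability_notes: str   # generation logic and reliability caveats
```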

Step 3: Autonomous Thinking and Analysis

The skill performs deep analysis based on collected information:

A. Field Relationship Mapping

  • Analyze logical connections between fields
  • Identify: independent fields, related fields, complementary fields
  • Map the "story" the dataset tells
  • Key question: What relationships are implied by these fields? (A correlation-based cross-check is sketched below.)
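
One way to ground this mapping empirically is a pairwise correlation pass; a hedged pandas sketch, assuming the field time series have already been pulled into a DataFrame (the fetching step is out of scope here):

```python
import pandas as pd

def map_relationships(fields: pd.DataFrame, threshold: float = 0.7) -> dict:
    """Classify field pairs by absolute correlation.

    `fields` is assumed to hold one column per datafield, aligned on
    date; the 0.7 threshold is an arbitrary placeholder.
    """
    corr = fields.corr().abs()
    related, independent = [], []
    cols = list(fields.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = related if corr.loc[a, b] >= threshold else independent
            pair.append((a, b))
    return {"related": related, "independent": independent}
```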

B. Question-Driven Feature Generation (Internal Process)

The skill asks itself these questions and generates feature concepts (illustrative transforms for several of them are sketched after the list):

  1. "What is stable?" → Look for invariants

    • Which fields or combinations remain relatively constant?
    • What stability measures make sense?
  2. "What is changing?" → Analyze change patterns

    • Rate of change, acceleration, volatility
    • Trend vs. noise separation
  3. "What is anomalous?" → Identify deviations

    • Outliers, unusual patterns, breaks from normal
    • Deviation magnitude and significance
  4. "What is combined?" → Examine interactions

    • How fields interact, amplify, or offset each other
    • Synthesis creates new meaning
  5. "What is structural?" → Study compositions

    • Constituent parts, proportional relationships
    • Structural changes over time
  6. "What is cumulative?" → Explore accumulation effects

    • Building up over time, decay effects
    • Memory and persistence in data
  7. "What is relative?" → Make comparisons

    • Relative positioning, ranking, normalization
    • Context within dataset
  8. "What is essential?" → Distill to core meaning

    • First principles thinking
    • Strip away assumptions, get to essence
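
To make several of these questions concrete, an illustrative pandas sketch of the transforms they suggest for a single field; the 63-day window is an arbitrary placeholder, not a tuned recommendation:

```python
import pandas as pd

def question_driven_features(s: pd.Series, window: int = 63) -> pd.DataFrame:
    """Illustrative transforms for one daily field series `s`."""
    roll = s.rolling(window)
    return pd.DataFrame({
        # "What is stable?": rolling coefficient of variation
        "stability": roll.std() / roll.mean(),
        # "What is changing?": rate of change and its acceleration
        "change": s.pct_change(window),
        "acceleration": s.pct_change(window).diff(window),
        # "What is anomalous?": rolling z-score
        "anomaly": (s - roll.mean()) / roll.std(),
        # "What is cumulative?": exponentially decaying memory
        "memory": s.ewm(halflife=window).mean(),
    })
```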

C. Feature Concept Generation

For each relevant question-field combination:

  • Formulate feature concept that answers the question
  • Define the concept clearly
  • Identify the logical meaning
  • Consider directionality (what high/low values mean)
  • Identify boundary conditions
  • Note potential issues/limitations (a record layout for these attributes is sketched below)
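
The attributes above map naturally onto one record per concept; a sketch, with field names of my own choosing rather than a prescribed schema:

```python
from dataclasses import dataclass

# Sketch: one record per generated feature concept; field names
# mirror the checklist above and are not a prescribed schema.
@dataclass
class FeatureConcept:
    name: str                 # clear, descriptive name
    definition: str           # one-sentence definition
    logical_meaning: str      # what phenomenon it represents
    rationale: str            # why the feature makes sense
    directionality: str       # what high vs. low values mean
    boundary_conditions: str  # what extremes indicate
    data_requirements: str    # fields used and any constraints
    potential_issues: str     # known limitations or concerns
```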

Step 4: Feature Documentation

For each generated feature concept, document (a rendering helper is sketched after this list):

  • Concept Name: Clear, descriptive name
  • Definition: One-sentence definition
  • Logical Meaning: What phenomenon/concept does it represent?
  • Why It's Meaningful: Why does this feature make sense?
  • Directionality: Interpretation of high vs. low values
  • Boundary Conditions: What extremes indicate
  • Data Requirements: What fields are used and any constraints
  • Potential Issues: Known limitations or concerns
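
Assuming the FeatureConcept record sketched in Step 3, each documentation entry can be rendered mechanically; an illustrative helper:

```python
def render_concept(c: FeatureConcept) -> str:
    """Render one concept (from the Step 3 sketch) as a report block."""
    return "\n".join([
        f"Concept Name: {c.name}",
        f"Definition: {c.definition}",
        f"Logical Meaning: {c.logical_meaning}",
        f"Why It's Meaningful: {c.rationale}",
        f"Directionality: {c.directionality}",
        f"Boundary Conditions: {c.boundary_conditions}",
        f"Data Requirements: {c.data_requirements}",
        f"Potential Issues: {c.potential_issues}",
    ])
```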

Step 5: Output Generation

Generate a structured markdown report and write it to ./output_report/{region}_{delay}_{datasetID}_ideas.md (a path-construction sketch follows this list), with the following sections:

  1. Dataset Understanding

    • Dataset description and characteristics
    • Field inventory (count, types, update patterns)
    • Key observations about data structure
  2. Field Deconstruction Analysis

    • For each field: what it truly measures and why
    • Logical relationships between fields
    • The "story" the data tells
  3. Feature Engineering Suggestions by Question Type

    3.1 Stability Features

    • Concepts for measuring stability/invariance
    • Why stability matters in this dataset
    • Example implementations

    3.2 Change Features

    • Concepts for capturing change patterns
    • Rate, acceleration, volatility measures
    • Temporal dynamics

    3.3 Anomaly Features

    • Deviation and outlier detection concepts
    • Normal vs. abnormal identification
    • Significance measures

    3.4 Interaction Features

    • Cross-field interaction concepts
    • Amplification, offset, synthesis effects
    • Combined meaning creation

    3.5 Structure Features

    • Composition and relationship concepts
    • Proportional analysis
    • Structural change detection

    3.6 Cumulative Features

    • Accumulation and decay concepts
    • Memory/persistence measures
    • Time-weighted effects

    3.7 Relative Features

    • Comparison and normalization concepts
    • Ranking and percentile measures
    • Context-relative positioning

    3.8 Essential Features

    • First-principles derived concepts
    • Core meaning extraction
    • Fundamental measures
  4. Implementation Considerations

    • Data quality notes
    • Coverage considerations
    • Computational complexity
    • Potential improvements/extensions
  5. Critical Questions for Further Exploration

    • What aspects weren't covered?
    • What additional data would be helpful?
    • What assumptions should be challenged?
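
A minimal sketch of the path convention and write step, assuming the filename placeholders are filled from the input parameters:

```python
from pathlib import Path

def write_report(region: str, delay: int, dataset_id: str, markdown: str) -> Path:
    """Write the report under the {region}_{delay}_{datasetID} convention."""
    out_dir = Path("./output_report")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{region}_{delay}_{dataset_id}_ideas.md"
    path.write_text(markdown, encoding="utf-8")
    return path
```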

Core Analysis Principles

  1. From Data Essence: Start with what data truly means, not what it's traditionally used for
  2. Autonomous Reasoning: Skill performs all thinking, no user input required
  3. Question-Driven: Internal question bank guides feature generation
  4. Meaning Over Patterns: Prioritize logical meaning over conventional combinations
  5. Transparency: Show reasoning process in output

Example Output Structure

When analyzing dataset 'BEME' (Balance Sheet and Market Data), the output would include:

Dataset Understanding

Fields Analyzed: book_value, market_cap, book_to_market, etc.
Key Observations: The dataset compares accounting values with market valuations.

Field Deconstruction

  • book_value: Accountant's calculation of net asset value (quarterly, audited, historical cost-based)
  • market_cap: Market participants' valuation (continuous, forward-looking, sentiment-influenced)
  • book_to_market: Ratio comparing these two valuation perspectives

Feature Concepts Generated

From "What is stable?"

  • "Market reevaluation stability": Rolling coefficient of variation of book_to_market
  • Logic: Measures whether market opinion is stable or volatile
  • Meaning: Stable values suggest consensus, volatile values suggest disagreement/uncertainty

From "What is changing?"

  • "Value creation vs. market reevaluation decomposition": Separate book_value growth from market_cap growth
  • Logic: Distinguish fundamental value creation from market sentiment changes
  • Meaning: Which component drives changes in book_to_market?

From "What is combined?"

  • "Intangible value proportion": (market_cap - book_value) / enterprise_value
  • Logic: Quantify proportion of value from intangibles (brand, growth, etc.)
  • Meaning: What percentage of valuation isn't captured on the balance sheet?

(Additional question-based features would follow...)
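
As an illustration only, the three concepts above might be computed as follows; `df` is assumed to carry the named fields as daily columns, and the 252-day window is an arbitrary placeholder:

```python
import pandas as pd

def beme_example_features(df: pd.DataFrame, window: int = 252) -> pd.DataFrame:
    """Illustrative pandas versions of the three concepts above."""
    roll = df["book_to_market"].rolling(window)
    return pd.DataFrame({
        # "Market reevaluation stability": rolling coefficient of variation
        "btm_stability": roll.std() / roll.mean(),
        # Decomposition: fundamental growth vs. market reevaluation
        "book_value_growth": df["book_value"].pct_change(window),
        "market_cap_growth": df["market_cap"].pct_change(window),
        # "Intangible value proportion"
        "intangible_share": (df["market_cap"] - df["book_value"])
                            / df["enterprise_value"],
    })
```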

Implementation Notes

The skill should:

  1. Analyze first, then generate: Fully understand dataset before proposing features
  2. Show reasoning: Explain why each feature concept makes sense
  3. Be specific: Reference actual field names and their characteristics
  4. Be critical: Question assumptions and identify limitations
  5. Be creative: Look beyond traditional financial metrics

The skill should NOT:

  1. Ask users to think: All thinking is internal to the skill
  2. Provide generic templates: Each analysis should be specific to the dataset
  3. Rely on conventional wisdom: Challenge traditional approaches
  4. Output patterns without meaning: Every suggestion must have clear logic

Quality Assurance

Self-Check Process:

  • All fields analyzed, not just skimmed
  • Field meanings understood beyond descriptions
  • Multiple question types explored
  • Each feature has clear logical meaning
  • Reasoning is explicit, not implicit
  • Limitations are acknowledged
  • Output is dataset-specific, not generic

Validation Questions:

  • Would this analysis help someone truly understand the data?
  • Are feature concepts novel yet meaningful?
  • Is the reasoning process transparent?
  • Does it avoid conventional thinking traps?

This skill performs autonomous deep analysis of BRAIN datasets, generating meaningful feature engineering concepts based on data essence and logical reasoning.