analyst4 Feature Engineering Analysis Report

Dataset: analyst4
Category: Analyst
Region: USA
Analysis Date: 2026-03-30
Fields Analyzed: 653

Executive Summary

Primary Question Answered by Dataset: What do analysts expect for company financials (sales, EPS, cash flow, balance sheet) across quarterly and annual horizons, and how do actual results compare to these expectations?

Key Insights from Analysis:

The dataset captures the full lifecycle of analyst expectations: guidance ranges, point estimates (consensus), and actual reported numbers.
For each financial item, multiple summary statistics (mean, median, low, high, standard deviation, count) are provided, enabling rich uncertainty and disagreement measures.
The inclusion of both quarterly (qfv4) and annual (afv4) horizons allows for temporal gradient analysis (short‑term vs. long‑term expectations).
Recommendation data (buy/hold/sell counts, consensus score) adds a separate dimension of market sentiment.

Critical Field Relationships Identified:

Actual vs. Expected: Actual values are paired with pre‑event estimates, allowing surprise calculation (e.g., actual_eps_value_quarterly vs. anl4_qfv4_eps_mean).
Guidance Ranges: Min and max guidance values define an uncertainty corridor (e.g., min_sales_guidance_value vs. max_sales_guidance_value).
Consensus Disagreement: The spread between high and low estimates (e.g., anl4_qfv4_eps_high – anl4_qfv4_eps_low) quantifies analyst divergence.

Most Promising Feature Concepts:

Guidance Uncertainty Spread – because the width of guidance ranges captures management’s confidence (or uncertainty) about future performance.
Earnings Surprise Momentum – tracking how recent actual surprises alter subsequent estimate revisions captures slow diffusion of information.
Consensus Disagreement – high dispersion in analyst estimates often precedes stock price corrections when actuals are released.

Dataset Deep Understanding

Dataset Description

The analyst4 dataset provides analyst forecasts and actuals for US equities. It covers a wide array of financial statement items (sales, EPS, cash flow, balance sheet metrics) for both quarterly and annual horizons. Key components include:

Guidance values (min/max) – management’s forward‑looking ranges.
Estimate aggregations (mean, median, low, high, standard deviation, count) – derived from individual analyst reports.
Actual reported values – historical realizations against which estimates are compared.
Recommendation data – broker ratings and consensus scores.
Broker‑level detail – individual analyst estimates and revisions (available but not required for core features).

Field Inventory (Representative Sample)

Field ID	Description	Data Type	Update Frequency	Coverage
`sales_estimate_average_quarterly`	Sales – mean of quarterly estimations	Continuous	Event‑based (estimate updates)	High
`actual_sales_value_quarterly`	Sales – actual reported value	Continuous	Event‑based (earnings)	Moderate
`min_sales_guidance_value`	Sales – minimum guidance (annual)	Continuous	Event‑based (guidance)	Moderate
`max_sales_guidance_value`	Sales – maximum guidance (annual)	Continuous	Event‑based (guidance)	Moderate
`earnings_per_share_average`	EPS – mean of estimations (quarterly)	Continuous	Event‑based (estimate updates)	High
`actual_eps_value_quarterly`	EPS – actual reported value	Continuous	Event‑based (earnings)	High
`anl4_afv4_eps_mean`	EPS – mean of annual estimates	Continuous	Event‑based (estimate updates)	High
`anl4_buy`	Number of buy recommendations	Integer	Daily (?)	High
`anl4_hold`	Number of hold recommendations	Integer	Daily (?)	High
`anl4_sell`	Number of sell recommendations	Integer	Daily (?)	High
`anl4_mark`	Recommendation consensus score	Continuous	Daily (?)	High
`total_goodwill_reported_value`	Total Goodwill – actual reported	Continuous	Event‑based (earnings)	Moderate
`net_debt_actual_value`	Net debt – actual reported	Continuous	Event‑based (earnings)	Moderate
`est_eps`	EPS – mean of annual estimates	Continuous	Event‑based	High
`anl4_ebitda_mean`	EBITDA – mean of annual estimates	Continuous	Event‑based	Moderate

Note: All estimate and actual fields are event‑based – they are populated only on days when new data becomes available (e.g., estimate revisions, earnings announcements).
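
To make the event‑based behaviour concrete, here is a minimal Python sketch of carrying the last published value forward for a limited number of days. The helper name `backfill` and its window convention (current day plus the previous `lookback - 1` days) are illustrative assumptions, not the platform's documented ts_backfill semantics.

```python
import math

def backfill(series, lookback):
    """Carry the most recent non-NaN value forward within `lookback` days.

    Assumption: the window covers the current day and the previous
    lookback - 1 days; the real ts_backfill operator may differ.
    """
    filled = []
    for i in range(len(series)):
        value = math.nan
        # Walk backwards from day i until a published value is found.
        for j in range(i, max(i - lookback, -1), -1):
            if not math.isnan(series[j]):
                value = series[j]
                break
        filled.append(value)
    return filled

# Hypothetical consensus series: values appear only on update days.
raw = [math.nan, 1.20, math.nan, math.nan, math.nan, 1.25]
daily = backfill(raw, lookback=3)
```

With a 3‑day lookback the value published on day 1 is held through day 3 and then expires, so day 4 is NaN again until the next update on day 5. A short lookback therefore trades coverage for freshness.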

Field Deconstruction Analysis

`sales_estimate_average_quarterly`

What is being measured?: The average (consensus) quarterly sales forecast across contributing analysts.
How is it measured?: Arithmetic mean of individual analyst forecasts; updated when new estimates are published.
Time dimension: Point‑in‑time snapshot of consensus at each update.
Business context: Represents the market’s expected top‑line performance for the upcoming quarter.
Generation logic: Aggregated from analyst reports; may be backfilled for days without updates.
Reliability considerations: Susceptible to outliers; the count of estimates provides confidence.

`actual_eps_value_quarterly`

What is being measured?: The actual earnings per share reported by the company for the quarter.
How is it measured?: Direct from financial statements (GAAP or adjusted).
Time dimension: Instantaneous on announcement date.
Business context: The ultimate outcome against which forecasts are judged.
Generation logic: Collected from regulatory filings or press releases.
Reliability considerations: Highly reliable but may be restated later.

`anl4_qfv4_eps_high`

What is being measured?: The highest quarterly EPS estimate among all analysts.
How is it measured?: Maximum of individual analyst forecasts.
Time dimension: Point‑in‑time snapshot.
Business context: Reflects the most optimistic outlook.
Generation logic: Derived from the same set of estimates as the mean.
Reliability considerations: May be driven by a single outlier; use in conjunction with count.

`anl4_buy`

What is being measured?: Number of analysts with a “buy” (or equivalent) rating.
How is it measured?: Count of recommendation records.
Time dimension: Updated when analysts change ratings.
Business context: Proxy for bullish sentiment.
Generation logic: Aggregated from broker reports.
Reliability considerations: Definitions of “buy” may vary across brokers.

Field Relationship Mapping

The Story This Data Tells: The dataset traces the journey of market expectations from management guidance through analyst estimates to actual reported results. It reveals how consensus forms, how uncertainty evolves (via range widths and dispersion), and how actuals trigger revisions.

Key Relationships Identified:

Guidance → Estimates: Min/max guidance likely anchors the distribution of analyst estimates; the consensus mean tends to fall within the guidance range.
Estimates → Actuals: The difference between actual reported values and pre‑announcement consensus (surprise) drives subsequent estimate revisions.
Disagreement → Revision: High dispersion (high‑low spread) often precedes large estimate changes when new information arrives.
Recommendations ↔ Estimates: Changes in consensus score may correlate with estimate revisions, especially for EPS.

Missing Pieces That Would Complete the Picture:

Timing of estimate revisions relative to earnings announcement dates (event windows).
Analyst‑level identity to track individual revision momentum.
Historical actuals for the same fiscal period to compute multi‑year trends.

Feature Concepts by Question Type

Q1: “What is stable?” (Invariance Features)

Concept: Guidance Uncertainty Width

Sample Fields Used: min_sales_guidance_value, max_sales_guidance_value
Definition: (max_guidance - min_guidance) / abs(mean_guidance)
Why This Feature: Narrow guidance indicates high management confidence; wide guidance signals uncertainty or volatile business outlook.
Logical Meaning: Relative size of the “cone of uncertainty” around forward guidance.
Is filling nan necessary: No – guidance fields only have values when guidance is issued. NaN indicates no guidance available. We should not fill because absence itself is informative.
Directionality: High values = more uncertainty; low values = more precision.
Boundary Conditions: May be extreme if mean guidance is near zero; use winsorization or truncation.
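Implementation Example: divide(subtract(max_sales_guidance_value, min_sales_guidance_value), abs(sales_estimate_average_annual)) – here the annual sales consensus stands in for mean_guidance; the guidance midpoint, divide(add(max_sales_guidance_value, min_sales_guidance_value), 2), is an alternative denominator.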
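
The boundary handling for this concept can be sketched in plain Python for a single stock‑day. This sketch takes the guidance midpoint as mean_guidance, which is an assumption. The `floor` and `cap` thresholds are hypothetical values chosen for illustration, not values derived from the dataset.

```python
import math

def guidance_width(min_g, max_g, floor=1e-6, cap=5.0):
    """Relative guidance width: (max - min) / |midpoint|, capped at `cap`.

    Returns NaN when guidance is missing or the midpoint is too close
    to zero for the ratio to be meaningful. `floor` and `cap` are
    illustrative thresholds (assumptions).
    """
    if math.isnan(min_g) or math.isnan(max_g):
        return math.nan  # no guidance issued: keep the absence visible
    midpoint = (max_g + min_g) / 2.0
    if abs(midpoint) < floor:
        return math.nan  # the ratio would explode
    # Truncate extreme values instead of letting them dominate.
    return min((max_g - min_g) / abs(midpoint), cap)
```

For example, guidance of 90 to 110 gives a width of roughly 0.2, while a range straddling zero such as -4 to 5 is truncated at the cap.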

Concept: Consensus Stability

Sample Fields Used: anl4_qfv4_eps_std, anl4_qfv4_eps_mean
Definition: std / abs(mean) – coefficient of variation of EPS estimates.
Why This Feature: Low CV indicates high analyst agreement, suggesting the stock is well‑understood and less prone to surprise.
Logical Meaning: Normalized dispersion of analyst opinions.
Is filling nan necessary: The standard deviation may be NaN when fewer than two estimates exist. Use if_else with is_nan to set a default (e.g., 0).
Directionality: High CV = high disagreement, possibly leading to volatile price reactions.
Boundary Conditions: If mean is near zero, the ratio may explode – use with caution.
Implementation Example: divide(anl4_qfv4_eps_std, abs(anl4_qfv4_eps_mean))
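
As a rough illustration of what the std and mean aggregates summarize, the sketch below computes the coefficient of variation directly from a hypothetical list of individual analyst estimates. Whether the dataset uses the sample or the population standard deviation is not stated in this report; the sample version is assumed here.

```python
import math
import statistics

def consensus_cv(estimates):
    """Coefficient of variation of a list of analyst EPS estimates.

    Assumption: dispersion is the sample standard deviation.
    """
    if len(estimates) < 2:
        return math.nan  # sample std needs at least two estimates
    mean = statistics.mean(estimates)
    if mean == 0:
        return math.nan  # ratio undefined for a zero consensus
    return statistics.stdev(estimates) / abs(mean)
```

This sketch returns NaN for a single estimate rather than the default of 0 mentioned above, so that a stock covered by one analyst is not mistaken for one with perfect agreement. Either choice is a judgment call.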

Q2: “What is changing?” (Dynamics Features)

Concept: Estimate Revision Momentum

Sample Fields Used: anl4_qfv4_eps_mean
Definition: ts_delta(anl4_qfv4_eps_mean, 1) / abs(ts_delay(anl4_qfv4_eps_mean, 1)) – relative change in consensus versus the previous day; on a backfilled series this is non‑zero only on days when the consensus is updated.
Why This Feature: Captures the speed and direction of consensus adjustments; large positive changes may signal under‑reaction.
Logical Meaning: Relative revision intensity.
Is filling nan necessary: The consensus field may be NaN on non‑update days. Use ts_backfill to carry forward the last value before applying ts_delta.
Directionality: Positive = upward revision (bullish); negative = downward revision (bearish).
Boundary Conditions: Zero denominator can occur if previous value is zero; use if_else to handle.
Implementation Example: divide(ts_delta(ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1), 1), abs(ts_delay(ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1), 1)))
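
The zero‑denominator and missing‑data guards can be sketched in plain Python on an already backfilled series. The function name and the sample values are hypothetical.

```python
import math

def revision_momentum(consensus):
    """Day-over-day relative change of a backfilled consensus series."""
    out = [math.nan]  # no previous day for the first observation
    for prev, curr in zip(consensus, consensus[1:]):
        if math.isnan(prev) or math.isnan(curr) or prev == 0:
            out.append(math.nan)  # guard missing data and zero denominator
        else:
            out.append((curr - prev) / abs(prev))
    return out

# Hypothetical backfilled consensus: one upward revision on day 3.
series = [2.0, 2.0, 2.0, 2.1, 2.1]
momentum = revision_momentum(series)
```

The result is zero on every day except day 3, where the revision from 2.0 to 2.1 shows up as a change of about 5%. In practice the signal is sparse and may need smoothing or accumulation.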

Concept: Guidance Range Narrowing

Sample Fields Used: min_sales_guidance_value, max_sales_guidance_value
Definition: ts_delta( max_guidance - min_guidance, 1 ) – change in absolute guidance width over time.
Why This Feature: Narrowing range indicates management’s increasing confidence as the fiscal period approaches.
Logical Meaning: Temporal tightening of the forecast corridor.
Is filling nan necessary: Guidance values are event‑based; backfill to get last known range before measuring change.
Directionality: Negative change = increased precision.
Boundary Conditions: May be zero if no update.
Implementation Example: ts_delta( subtract(ts_backfill(max_sales_guidance_value, lookback=180, k=1), ts_backfill(min_sales_guidance_value, lookback=180, k=1)), 1 )

Q3: “What is anomalous?” (Deviation Features)

Concept: Earnings Surprise

Sample Fields Used: actual_eps_value_quarterly, anl4_qfv4_eps_mean
Definition: (actual - consensus) / abs(consensus) – standardized surprise.
Why This Feature: Positive surprises often trigger immediate price jumps; the magnitude determines subsequent drift.
Logical Meaning: Unexpected earnings relative to market expectation.
Is filling nan necessary: The actual field is NaN most days. The feature is only meaningful on the announcement date. No filling needed.
Directionality: Positive = beat; negative = miss.
Boundary Conditions: Consensus may be zero; use if_else to avoid division by zero.
Implementation Example: divide( subtract(actual_eps_value_quarterly, anl4_qfv4_eps_mean), abs(anl4_qfv4_eps_mean) )

Concept: Analyst Over‑optimism

Sample Fields Used: anl4_qfv4_eps_high, anl4_qfv4_eps_mean
Definition: (high - mean) / abs(mean) – relative gap between the most optimistic estimate and consensus.
Why This Feature: A persistently large gap may indicate a subset of overly optimistic analysts; their future revisions can drive price.
Logical Meaning: Skew in analyst expectations toward the upside.
Is filling nan necessary: Backfill high and mean to last known values.
Directionality: High values = optimism skew; low values = the most optimistic estimate sits close to consensus.
Boundary Conditions: May be undefined if mean is zero.
Implementation Example: divide( subtract(ts_backfill(anl4_qfv4_eps_high, lookback=90, k=1), ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1)), abs(ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1)) )

Q4: “What is combined?” (Interaction Features)

Concept: Valuation‑Adjusted Surprise

Sample Fields Used: actual_eps_value_quarterly, anl4_qfv4_eps_mean, book_value_per_share_reported_value
Definition: (actual - consensus) / book_value_per_share – surprise scaled by balance sheet anchor.
Why This Feature: Same surprise magnitude may have different impact depending on firm’s asset base.
Logical Meaning: Earnings surprise relative to the book value – a more fundamental scaling.
Is filling nan necessary: Book value is reported quarterly (event field). Backfill to latest available.
Directionality: Positive = beat relative to asset size.
Boundary Conditions: Book value may be negative; use absolute value or treat as special.
Implementation Example: divide( subtract(actual_eps_value_quarterly, anl4_qfv4_eps_mean), abs(ts_backfill(book_value_per_share_reported_value, lookback=180, k=1)) )

Concept: Cash Flow Quality Adjustment

Sample Fields Used: actual_eps_value_quarterly, actual_cashflow_per_share_value_quarterly
Definition: (eps - cashflow) / abs(eps) – difference between earnings and cash flow per share.
Why This Feature: Large divergence may signal accounting discretion or unsustainable earnings.
Logical Meaning: Accrual component of earnings.
Is filling nan necessary: Both fields are event‑based; combine only when both have values (e.g., on earnings day).
Directionality: Negative = cash flow stronger than earnings (higher quality).
Boundary Conditions: EPS may be zero; guard the division with if_else, since abs() only fixes the sign of the denominator and does not prevent division by zero.
Implementation Example: divide( subtract(actual_eps_value_quarterly, actual_cashflow_per_share_value_quarterly), abs(actual_eps_value_quarterly) )

Q5: “What is structural?” (Composition Features)

Concept: Goodwill Intensity

Sample Fields Used: total_goodwill_reported_value, total_assets_reported_value
Definition: goodwill / total_assets – proportion of assets that is goodwill (intangible).
Why This Feature: High ratio indicates reliance on past acquisitions; such firms may be more volatile post‑earnings.
Logical Meaning: Degree of asset “intangibility”.
Is filling nan necessary: Balance sheet items are reported quarterly; backfill to last known.
Directionality: High = more goodwill‑heavy, potentially riskier.
Boundary Conditions: Total assets must be non‑zero.
Implementation Example: divide( ts_backfill(total_goodwill_reported_value, lookback=180, k=1), ts_backfill(total_assets_reported_value, lookback=180, k=1) )

Concept: Net Debt to Equity

Sample Fields Used: net_debt_actual_value, shareholders_equity_actual_value
Definition: net_debt / equity – leverage measure.
Why This Feature: High leverage amplifies sensitivity to earnings surprises and interest rate changes.
Logical Meaning: Financial risk exposure.
Is filling nan necessary: Backfill both to latest reported.
Directionality: High = more leveraged.
Boundary Conditions: Equity could be negative; use absolute or treat as special.
Implementation Example: divide( ts_backfill(net_debt_actual_value, lookback=180, k=1), ts_backfill(shareholders_equity_actual_value, lookback=180, k=1) )

Q6: “What is cumulative?” (Accumulation Features)

Concept: Cumulative Estimate Revisions Over Quarter

Sample Fields Used: anl4_qfv4_eps_mean
Definition: ts_sum( ts_delta(backfilled_eps, 1), 90 ) – sum of daily consensus changes over the last 90 days.
Why This Feature: Captures net revision pressure as the earnings date approaches.
Logical Meaning: Net momentum of analyst expectations over a quarter.
Is filling nan necessary: Backfill consensus before calculating deltas.
Directionality: Positive = net upward revisions.
Boundary Conditions: Use ts_delta with backfilled series to avoid noise from non‑update days.
Implementation Example: ts_sum( ts_delta( ts_backfill(anl4_qfv4_eps_mean, lookback=120, k=1), 1 ), 90 )
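
One property worth checking before implementing this concept: on a fully backfilled series, a rolling sum of one‑day deltas telescopes to the difference between the current level and the level at the start of the window. The sketch below verifies this with simplified stand‑ins for ts_delta and ts_sum; their exact behaviour here is an assumption and may differ from the platform operators.

```python
def ts_delta(x, d):
    """x[t] - x[t-d]; None where the lagged value does not exist."""
    return [None if t < d else x[t] - x[t - d] for t in range(len(x))]

def ts_sum(x, d):
    """Sum of the last d values; None until the window is complete."""
    out = []
    for t in range(len(x)):
        window = x[t - d + 1 : t + 1] if t >= d - 1 else []
        out.append(sum(window) if window and None not in window else None)
    return out

# Hypothetical backfilled consensus (integers keep the arithmetic exact).
levels = [10, 10, 11, 11, 13, 12, 12]
cumulative = ts_sum(ts_delta(levels, 1), 3)  # sum of daily changes
direct = ts_delta(levels, 3)                 # change over the window
```

Both series are identical here, which suggests that the 90‑day sum of daily changes carries the same information as a single 90‑day delta of the backfilled consensus. The two can diverge when NaNs remain inside the window.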

Concept: Rolling Guidance Width Accumulation

Sample Fields Used: min_sales_guidance_value, max_sales_guidance_value
Definition: ts_sum( max_guidance - min_guidance, 180 ) – total uncertainty “exposure” over past 180 days.
Why This Feature: Repeated wide guidance may indicate chronic uncertainty.
Logical Meaning: Integrated management ambiguity.
Is filling nan necessary: Backfill guidance ranges before summation.
Directionality: High = persistently uncertain outlook.
Boundary Conditions: Sum may become very large; consider normalization by number of non‑NaN days.
Implementation Example: ts_sum( subtract( ts_backfill(max_sales_guidance_value, lookback=200, k=1), ts_backfill(min_sales_guidance_value, lookback=200, k=1) ), 180 )
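
The normalization suggested under Boundary Conditions can be sketched as an average over only those days in the window that have a guidance range. The function name and sample data are hypothetical.

```python
import math

def mean_width_over_valid_days(min_g, max_g, window):
    """Average guidance width over the non-NaN days in the trailing window."""
    out = []
    for t in range(len(min_g)):
        start = max(0, t - window + 1)
        widths = [
            hi - lo
            for lo, hi in zip(min_g[start : t + 1], max_g[start : t + 1])
            if not (math.isnan(lo) or math.isnan(hi))
        ]
        # Divide by the number of valid days, not by the window length.
        out.append(sum(widths) / len(widths) if widths else math.nan)
    return out

# Hypothetical backfilled guidance bounds; day 0 has no guidance yet.
lo = [math.nan, 90.0, 90.0, 95.0]
hi = [math.nan, 110.0, 110.0, 105.0]
avg_width = mean_width_over_valid_days(lo, hi, window=3)
```

Dividing by the count of valid days keeps stocks with short guidance histories comparable to those with a full window.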

Q7: “What is relative?” (Comparison Features)

Concept: Sector‑Relative Disagreement

Sample Fields Used: anl4_qfv4_eps_std, anl4_qfv4_eps_mean, group (e.g., industry)
Definition: group_zscore( anl4_qfv4_eps_std / abs(anl4_qfv4_eps_mean), industry ) – how analyst disagreement deviates from industry norm.
Why This Feature: Abnormal disagreement may indicate stock‑specific uncertainty.
Logical Meaning: Relative uncertainty vs. peers.
Is filling nan necessary: Backfill standard deviation and mean; apply group_zscore only when group is defined.
Directionality: Positive = more disagreement than industry average.
Boundary Conditions: Group must be available; use group_mean to compute industry average if needed.
Implementation Example: group_zscore( divide( ts_backfill(anl4_qfv4_eps_std, lookback=90, k=1), abs( ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1) ) ), industry )
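
For intuition, a cross‑sectional group z‑score for a single day can be sketched as follows. This is an illustrative stand‑in for the platform operator: the population standard deviation and the handling of small groups are assumptions.

```python
import math
import statistics
from collections import defaultdict

def group_zscore(values, groups):
    """Z-score each value against the non-NaN members of its own group."""
    members = defaultdict(list)
    for v, g in zip(values, groups):
        if not math.isnan(v):
            members[g].append(v)
    out = []
    for v, g in zip(values, groups):
        peers = members[g]
        if math.isnan(v) or len(peers) < 2:
            out.append(math.nan)  # no meaningful peer comparison
            continue
        sd = statistics.pstdev(peers)
        out.append((v - statistics.mean(peers)) / sd if sd > 0 else math.nan)
    return out

# Hypothetical CV values for five stocks in two industries.
cv = [0.1, 0.3, 0.2, 0.5, math.nan]
industry = ["tech", "tech", "tech", "energy", "energy"]
z = group_zscore(cv, industry)
```

The energy stock with a valid CV still receives NaN because it has no valid peer, which illustrates why thinly populated industries can reduce coverage for this feature.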

Concept: Consensus Score Deviation from Recommendation Counts

Sample Fields Used: anl4_mark, anl4_buy, anl4_total_rec
Definition: anl4_mark - (anl4_buy / anl4_total_rec) – residual of consensus score relative to simple buy ratio.
Why This Feature: Consensus scores often weight ratings non‑linearly; deviation captures nuanced sentiment.
Logical Meaning: Unexplained sentiment (e.g., strong “sell” recommendations dragging down score).
Is filling nan necessary: Recommendation fields appear to be updated daily (the frequency is unconfirmed in the field inventory). If so, use them directly; otherwise backfill first.
Directionality: Positive = score higher than buy ratio suggests.
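Boundary Conditions: Avoid division by zero if total_rec = 0. The residual is only interpretable if anl4_mark and the buy ratio are on comparable scales; rescale the score first if it is not bounded in [0, 1].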
Implementation Example: subtract( anl4_mark, divide( anl4_buy, anl4_total_rec ) )

Q8: “What is essential?” (Essence Features)

Concept: Surprise‑Adjusted Revision Response

Sample Fields Used: actual_eps_value_quarterly, anl4_qfv4_eps_mean, anl4_qfv4_eps_mean (future)
Definition: ( future_consensus - pre_actual_consensus ) / abs( surprise ) – the revision magnitude following a surprise.
Why This Feature: Measures how quickly and strongly analysts incorporate new information.
Logical Meaning: Information diffusion speed.
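Is filling nan necessary: This feature requires aligning the pre‑actual consensus with the consensus after the earnings announcement (e.g., 5 days later). Backfill the consensus, evaluate the feature 5 days after the announcement, and use ts_delay to reach back to the announcement‑day values.
Directionality: Large response to small surprise indicates over‑reaction.
Boundary Conditions: Surprise may be zero; use conditional.
Implementation Example: divide( subtract( ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1), ts_delay( ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1), 5 ) ), abs( ts_delay( subtract(actual_eps_value_quarterly, ts_backfill(anl4_qfv4_eps_mean, lookback=90, k=1)), 5 ) ) )
Note: A negative delay such as ts_delay( anl4_qfv4_eps_mean, -5 ) would look ahead and is not allowed. The expression above is instead evaluated 5 days after the announcement, so the current consensus is the post‑announcement value and ts_delay( ..., 5 ) recovers the announcement‑day consensus and surprise.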
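
The event alignment can be sketched in plain Python. The function below emits the response `lag` days after the announcement and only reads values that are already known on that day. It assumes the consensus series refers to the same fiscal period before and after the announcement, which this report has not verified. All names and sample values are hypothetical.

```python
import math

def revision_response(actual, consensus, lag=5):
    """Post-announcement revision per unit of surprise, without look-ahead.

    actual:    daily series, NaN except on announcement days
    consensus: daily backfilled consensus (assumed same fiscal period)
    """
    out = [math.nan] * len(actual)
    for t in range(lag, len(actual)):
        event = t - lag  # candidate announcement day, already in the past
        if (math.isnan(actual[event]) or math.isnan(consensus[event])
                or math.isnan(consensus[t])):
            continue
        surprise = actual[event] - consensus[event]
        if surprise == 0:
            continue  # response per unit of surprise is undefined
        out[t] = (consensus[t] - consensus[event]) / abs(surprise)
    return out

# Hypothetical data: announcement on day 1, consensus drifts up afterwards.
actual = [math.nan, 1.5] + [math.nan] * 6
consensus = [1.0, 1.0, 1.0, 1.2, 1.2, 1.2, 1.25, 1.25]
response = revision_response(actual, consensus, lag=5)
```

In this example the surprise is 0.5 and the consensus has moved by 0.25 five days later, so analysts have incorporated half of the surprise by day 6.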

Concept: Fundamental Persistence Score

Sample Fields Used: est_eps, est_cashflow_ps, est_ebitda
Definition: ts_corr( est_eps, est_cashflow_ps, 12 ) – rolling correlation between annual EPS and cash flow estimates.
Why This Feature: High correlation suggests earnings are driven by cash flow (higher quality); low correlation may indicate accounting distortions.
Logical Meaning: Earnings quality persistence.
Is filling nan necessary: Estimate fields are event‑based; backfill to daily before correlation.
Directionality: High = stable, high‑quality earnings.
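Boundary Conditions: Need at least two non‑NaN points for correlation, and both series must vary within the window; a backfilled daily series is often constant over 12 days, which leaves the correlation undefined, so a longer window may be required.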
Implementation Example: ts_corr( ts_backfill(est_eps, lookback=180, k=1), ts_backfill(est_cashflow_ps, lookback=180, k=1), 12 )
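
The sketch below shows, in plain Python, how a rolling correlation behaves on backfilled data. The `pearson` and `rolling_corr` helpers are illustrative stand‑ins for ts_corr, and the sample series are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan  # a constant window has no defined correlation
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def rolling_corr(x, y, window):
    """Correlation over a trailing window; NaN until the window is full."""
    out = []
    for t in range(len(x)):
        if t < window - 1:
            out.append(math.nan)
        else:
            lo = t - window + 1
            out.append(pearson(x[lo : t + 1], y[lo : t + 1]))
    return out

# Hypothetical backfilled estimates: flat for three days, then revised.
eps = [1.0, 1.0, 1.0, 2.0, 3.0]
cfps = [2.0, 2.0, 2.0, 4.0, 6.0]
corr = rolling_corr(eps, cfps, window=3)
```

The first full window is flat and returns NaN; the correlation only becomes defined once a revision enters the window. This is why the window length should be chosen relative to how often estimates actually change.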

Implementation Considerations

Data Quality Notes

Coverage: Estimate fields have high coverage for large‑cap stocks, but may be sparse for smaller names. Use anl4_qfv4_eps_number to gauge reliability.
Timeliness: Actuals and guidance are updated with a delay of 1 day (delay=1). Estimate revisions may appear with the same delay.
Accuracy: Consensus aggregates are robust, but individual broker‑level fields are more volatile and may contain errors.
Potential Biases: Analysts tend to be overly optimistic; the Analyst Over‑optimism feature (gap between high and mean estimates) can help measure this bias.

Computational Complexity

Lightweight features: Simple ratios (e.g., guidance width, buy ratio) – minimal computation.
Medium complexity: Rolling changes, backfill operations – moderate cost.
Heavy computation: ts_corr over backfilled series, group operations across many stocks – consider caching backfilled values.

Recommended Prioritization

Tier 1 (Immediate Implementation):

Earnings Surprise – high predictive power for post‑earnings drift.
Guidance Uncertainty Width – captures management confidence and correlates with future volatility.
Consensus Disagreement (CV) – standard measure of information divergence.

Tier 2 (Secondary Priority):

Estimate Revision Momentum – dynamic signal that often precedes price moves.
Sector‑Relative Disagreement – improves cross‑sectional comparison by removing industry effects.
Cash Flow Quality Adjustment – adds fundamental quality dimension.

Tier 3 (Requires Further Validation):

Surprise‑Adjusted Revision Response – needs careful event‑window alignment; may be computationally intensive.
Fundamental Persistence Score – requires sufficient history; may be noisy for stocks with short estimate history.

Critical Questions for Further Exploration

Unanswered Questions:

How does the dataset treat estimate revisions that occur after the fiscal period end but before the earnings announcement? (Pre‑announcement windows)
Are there adjustments for stock splits or other corporate actions that affect per‑share metrics?
What is the exact mapping between quarterly (qfv4) and annual (afv4) fields – are they independent or do quarterly estimates feed into annual?

Recommended Additional Data:

Daily price data to directly test feature predictive power.
Macroeconomic variables (e.g., sector‑level growth) to contextualize guidance widths.
Market microstructure data (volume, turnover) to measure attention around earnings events.

Assumptions to Challenge:

Analysts update estimates daily – in reality, many estimates remain unchanged for weeks; backfilling may smooth too aggressively.
“Mean” consensus is the best representation of market expectation; median might be more robust to outliers.
Guidance ranges are uniformly distributed – in practice, ranges may be skewed (e.g., management aiming low).

Methodology Notes

Analysis Approach: This report was generated by:

Deep field deconstruction to understand data essence – categorizing fields by item type, frequency, and event nature.
Question‑driven feature generation using 8 fundamental questions (stability, dynamics, anomalies, interactions, structure, accumulation, comparison, essence).
Logical validation of each feature concept – ensuring each answers a specific question and has clear economic interpretation.
Transparent documentation of reasoning, including data quality notes and boundary conditions.

Design Principles:

Focus on logical meaning over conventional patterns – features are grounded in information diffusion and fundamental analysis.
Every feature must answer a specific question (e.g., “What is changing?” → revision momentum).
Clear documentation of “why” for each suggestion, linking to the dataset’s unique strengths (guidance ranges, consensus statistics, actuals).
Emphasis on data understanding over prediction – features are designed to reveal underlying dynamics, not merely maximize backtest performance.

Report generated: 2026-03-30
Analysis depth: Comprehensive field deconstruction + 8‑question framework
Next steps: Implement Tier 1 features, validate assumptions, gather additional data as needed
