Data Processing and Normalization
Data Retrieval
The initial step in the data processing pipeline is retrieving raw financial data from Polygon.io, a leading provider of financial market data. Its comprehensive API provides access to a wide range of financial data, including publicly available filings such as 10-K and 10-Q reports in XBRL format.
At BullishBeat, we focus on analyzing financial data on a rolling 12-month (trailing twelve month, or TTM) basis to provide up-to-date insights into a company’s performance. To achieve this, we leverage Polygon.io’s JSON API to retrieve the relevant data for our analysis.
The data retrieval process involves the following steps (a code sketch follows the list):
- Authenticating and connecting to the Polygon.io API with an API key.
- Querying the API to fetch the desired financial data, specifically the 10-K and 10-Q filings in XBRL format, for the companies of interest.
- Retrieving the JSON-formatted data returned by the API, which contains the translated XBRL data in a structured format.
- Extracting the relevant sections from the JSON data, including the balance sheet, income statement, and cash flow statement, for the rolling 12-month period.
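As a minimal sketch of the retrieval step, the snippet below uses Python’s requests library against Polygon.io’s experimental financials endpoint (vX/reference/financials). The endpoint path, query parameters, and the POLYGON_API_KEY environment variable reflect Polygon.io’s public documentation as we understand it, not necessarily our exact production code:

```python
import os

import requests

# Assumption: Polygon.io's experimental financials endpoint, which serves
# 10-K and 10-Q XBRL data translated into JSON. Verify the path and
# parameters against the current Polygon.io docs before relying on them.
FINANCIALS_URL = "https://api.polygon.io/vX/reference/financials"

def fetch_quarterly_financials(ticker: str, limit: int = 4) -> list[dict]:
    """Fetch the most recent quarterly filings for one ticker."""
    params = {
        "ticker": ticker,
        "timeframe": "quarterly",  # 10-Q filings
        "limit": limit,            # four quarters cover a rolling 12 months
        "apiKey": os.environ["POLYGON_API_KEY"],  # authenticate with the API key
    }
    response = requests.get(FINANCIALS_URL, params=params, timeout=30)
    response.raise_for_status()
    results = response.json().get("results", [])
    # Each result's "financials" object holds the statements we extract:
    # balance_sheet, income_statement, and cash_flow_statement.
    return [r["financials"] for r in results]

filings = fetch_quarterly_financials("AAPL")
```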
Once the raw financial data is retrieved, it is processed to calculate financial ratios and indicators that provide valuable insights into a company’s financial health and performance.
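To make this concrete, the sketch below rolls four quarterly income statements up into trailing-twelve-month figures and derives two common ratios. The line-item keys (revenues, net_income_loss, current_assets, current_liabilities) follow Polygon.io’s JSON naming as best we can tell, and the ratio selection is illustrative rather than exhaustive:

```python
def trailing_twelve_months(quarters: list[dict], item: str) -> float:
    # Flow items (income statement, cash flow) roll up by summing
    # the last four quarterly values.
    return sum(q["income_statement"][item]["value"] for q in quarters)

def compute_ratios(quarters: list[dict]) -> dict:
    revenue = trailing_twelve_months(quarters, "revenues")
    net_income = trailing_twelve_months(quarters, "net_income_loss")
    # Stock items (balance sheet) are point-in-time, so take the most
    # recent quarter; assumes the filings are sorted newest-first.
    balance_sheet = quarters[0]["balance_sheet"]
    return {
        "net_margin": net_income / revenue,
        "current_ratio": balance_sheet["current_assets"]["value"]
        / balance_sheet["current_liabilities"]["value"],
    }

ratios = compute_ratios(filings)  # `filings` from the retrieval sketch above
```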
The calculated ratios and indicators are then stored in a database as part of a comprehensive financial profile for each company. This profile serves as a foundation for further analysis and comparison across different companies and industries.
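For the storage step, something like the following would do; we sketch it with SQLite, though the actual database backend and the financial_profiles schema shown here are assumptions for illustration:

```python
import sqlite3

def store_profile(db_path: str, ticker: str, ratios: dict) -> None:
    """Upsert one company's ratio profile into a simple metric-per-row table."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS financial_profiles (
                   ticker TEXT, metric TEXT, value REAL,
                   PRIMARY KEY (ticker, metric))"""
        )
        conn.executemany(
            "INSERT OR REPLACE INTO financial_profiles VALUES (?, ?, ?)",
            [(ticker, metric, value) for metric, value in ratios.items()],
        )

# Placeholder values for illustration.
store_profile("profiles.db", "AAPL", {"net_margin": 0.25, "current_ratio": 1.1})
```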
By automating the data retrieval process using the Polygon.io API, we ensure a consistent and efficient flow of financial data into our analysis pipeline. This enables us to maintain up-to-date financial profiles for a large number of companies and perform timely analysis based on the most recent data available.
Data Manipulation and Handling Missing Values
For data manipulation and analysis, we rely primarily on the Pandas library in Python, which provides powerful data structures and functions for working with tabular data efficiently.
When handling missing values in the financial data, we have established the following approach (illustrated in the sketch after the list):
- Missing values are populated with an empty string (“”) to indicate the absence of data.
- During the analysis phase, the treatment of missing values depends on the specific module being built and the nature of the comparison being performed.
- In some cases, missing values are filled with a standard value to ensure consistency across the dataset.
- In other cases, companies with missing values for specific metrics are excluded from the analysis to maintain data integrity.
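The sketch below shows both treatments on a toy profile table; the column names and the standard fill value (0.0 here) are stand-ins for whatever a given module specifies:

```python
import numpy as np
import pandas as pd

# Toy profile table; empty strings mark missing values, per our convention.
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "net_margin": [0.12, "", 0.30],
    "current_ratio": [1.5, 2.1, ""],
})

# Convert the empty-string placeholders to NaN so Pandas treats them as missing.
metrics = ["net_margin", "current_ratio"]
df[metrics] = df[metrics].replace("", np.nan).apply(pd.to_numeric)

# Treatment 1: fill with a standard value when a module needs every company.
filled = df.fillna({"net_margin": 0.0})

# Treatment 2: exclude companies missing a metric the comparison requires.
complete = df.dropna(subset=["current_ratio"])
```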
Since our focus is on US-based companies, currency conversions are not necessary, as all financial values are assumed to be in USD.
Data Validation and Manual Audit
To ensure data consistency and reliability, we employ a combination of automated validation checks and manual audits.
- Automated validation scripts are run to identify anomalies, outliers, or inconsistencies in the calculated financial ratios and indicators (see the sketch after this list).
- Flagged data points are manually reviewed by the blogger (me) to determine the appropriate course of action, such as correcting errors or excluding invalid data.
- Periodic manual audits are conducted on a sample of the data to verify the accuracy and completeness of the information stored in the database.
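As a flavor of what the automated checks look like, the sketch below flags values that fall outside a plausible range for a metric; the specific bounds are illustrative assumptions, not our exact rules:

```python
import pandas as pd

def flag_out_of_range(df: pd.DataFrame, metric: str,
                      lower: float, upper: float) -> pd.DataFrame:
    """Return rows whose metric falls outside [lower, upper] for manual review."""
    suspicious = (df[metric] < lower) | (df[metric] > upper)
    return df[suspicious]

profiles = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "current_ratio": [1.5, 2.1, 1.8, 40.0],
})
# A current ratio of 40 is implausible and gets queued for manual review.
to_review = flag_out_of_range(profiles, "current_ratio", lower=0.0, upper=10.0)
```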
By incorporating these data validation and audit processes, we maintain a high level of data quality and integrity, which is crucial for generating reliable financial insights and comparisons.