Database biases and errors


This page provides references for articles that study specific aspects of CRSP, Compustat and other popular sources of data used by researchers at Kellogg. If you know of any additional references, please e-mail

  • Abarbanell, Jeffrey and Reuven Lehavy (2000). "Differences in Commercial Database Reported Earnings: Implications for Inferences Concerning Analyst Forecast Rationality, the Association between Prices and Earnings, and Firm Reporting Discretion." Working paper, October. (available for download from SSRN).
    Abstract: Significant changes in mean and median analysts' forecast errors documented in recent studies are not synchronized across commercial forecast databases over time and are, in large part, a function of the definitions and procedures that determine the reported earnings component of earnings surprises. In this study we describe a number of complications researchers face in drawing inferences from forecast error metrics that are based on reported earnings numbers supplied by the major forecast data providers (FDPs) First Call, Zacks and I/B/E/S rather than metrics based on commonly used definitions of Compustat reported earnings. We show how differences across and intertemporal changes in FDP practices can have an important impact on inferences drawn in studies concerning analyst rationality, the association between stock prices and earnings news, firm earnings management, and stock market mispricing. We illustrate the importance of researchers' choice of reported earnings data on inferences with examples from the literatures on earnings response coefficients (Bradshaw, Moberg and Sloan [2000]), analyst forecast rationality (Matsumoto [1999] and Brown [1999]), and value relevance of earnings (Collins, Maydew and Weiss [1997]).

  • Anderson, Ronald C. and D. Scott Lee (1997). "Ownership Studies: The Data Source Does Matter" Journal of Financial and Quantitative Analysis 32(3), pp. 311-29. (Full text available through ABI/Inform)
    Abstract: We examine the fit between the ownership data provided by four surrogate databases and the data collected from proxy statements. We discover an unambiguous pecking order among the surrogates relative to the benchmark ownership statistics of corporate proxy statements. Corporate Text is first, followed in descending order by Compact Disclosure, Value Line, and Spectrum. Further tests show that reporting discrepancies in the Value Line and Spectrum databases could affect economic inferences drawn from regressions using their ownership data. A field guide describing each data source's reporting conventions, formats, and strategies for data aggregation may be downloaded from the Journal of Financial and Quantitative Analysis' web site (see following entry).

  • Anderson, Ronald C. and D. Scott Lee (1997). "Field Guide for Research Using Ownership Data" Journal of Financial and Quantitative Analysis 32(3), supplement. (Full text available through the JFQA web site)
    Abstract: In this field guide, we discuss some advantages of using the aggregate holdings of all officers and directors statistic. We illustrate the reporting conventions and idiosyncrasies of the alternate sources by reproducing and discussing the information provided by each source for a single firm, Baldor Electric Company. The problems described are typical of those encountered when using the four alternative databases. We also tabulate attributes on which the alternative databases differ, including the years they initiate their coverage, the sources from which they collect data, the extent to which they process that data, their display formats, their storage media, and subscription rates. These differences, along with accuracy, are all relevant when considering rival databases.

  • Ball, Ray; Kothari, S. P. and Wasley, Charles E. (1995). "Can We Implement Research on Stock Trading Rules?" Journal of Portfolio Management; 21(2), pages 54-63.
    Abstract: Research on trading rule profitability typically simulates trading on historical data. These data usually are obtained from files such as CRSP, which estimate closing prices as of the last trade (at the closing bid or the closing ask, or neither), or the bid-ask average (in the absence of a last trade). A trading rule could not normally be implemented at these prices, for even a small number of shares. A simulated contrarian strategy transforms "noise" in closing price estimates into return biases, by buying at predominantly bid prices and shorting at ask, which is not doable for most investors. The bias in estimated contrarian portfolio returns is severe. For example, when returns are calculated from successive bid prices of Nasdaq stocks, short-term contrarian profits largely disappear.

  • Bennin, Robert (1980). "Error Rates in CRSP and COMPUSTAT: A Second Look." Journal of Finance; 35(5), pages 1267-71. (Full text available through JSTOR)
    Abstract: Rosenberg and Houglet (RH) did an error rate comparison study using the Compustat(R) Industrial Tape and the CRSP (Center for Research in Security Prices) Monthly Return Tape. To update their work, this study adds the new Price-Dividend-Earnings (PDE) Tape that has been released by Compustat(R). In this tape, price information is of primary importance. The monthly return for each NYSE company on the PDE tape was matched against its CRSP monthly return; returns including all distributions were used. The period covered was January 1962 through July 1978. The overall error rate was only 1/3 the rate reported in the earlier study, but the number of discrepancies does reaffirm the usefulness of proposed data comparison by a user group. Compustat(R) errors drop markedly after 1970, indicating improved data collection, although CRSP stands out as a more reliable service prior to 1975. Since 1975, the 2 services have been equally reliable, suggesting that Compustat(R) will become as accurate as CRSP.

  • Bhojraj, Sanjeev, Charles M. C. Lee and Derek Oler (2003). "What's My Line? A Comparison of Industry Classification Schemes for Capital Market Research." Journal of Accounting Research; 41(5), pages 745-774. (Full text available through Ingenta)
    Abstract: This study compares four broadly available industry classification schemes in a variety of applications common to capital market research. Standard Industrial Classification (SIC) codes have been available since 1939 but are being replaced by North American Industry Classification System (NAICS) codes. The Global Industry Classifications Standard (GICS) system, jointly developed by Standard & Poor's and Morgan Stanley Capital International (MSCI), is popular among financial practitioners, whereas the Fama and French [1997] algorithm is used primarily by academics. Our results show that GICS classifications are significantly better at explaining stock return comovements, as well as cross-sectional variations in valuation multiples, forecasted and realized growth rates, research and development expenditures, and various key financial ratios. The GICS advantage is consistent from year to year and is most pronounced among large firms. The other three methods differ little from each other in most applications.

  • Canina, Linda et al. (1998). "Caveat Compounder: A Warning about Using the Daily CRSP Equal-Weighted Index to Compute Long-Run Excess Returns." Journal of Finance; 53(1), pages 403-16. (Full text available through Ingenta)
    Abstract: This paper issues a warning that compounding daily returns of the Center for Research in Security Prices (CRSP) equal-weighted index can lead to surprisingly large biases. The difference between the monthly returns compounded from the daily tapes and the monthly CRSP equal-weighted indices is almost 0.43 percent per month, or 6 percent per year. This difference amounts to one-third of the average monthly return and is large enough to reverse the conclusions of a paper using the daily tape to compute the return on the benchmark portfolio. The authors also investigate the sources of these biases and suggest several alternative strategies to avoid them. Coauthors are Roni Michaely, Richard Thaler, and Kent Womack.

  • Cadman, Brian, Sandy Klasa and Steve Matsunaga (2006). "Evidence of how systematic differences between ExecuComp and non-ExecuComp firms can affect empirical research results." (Full text available through Kellogg's Faculty Publication System)
    Abstract: We provide empirical evidence on the implications of using the ExecuComp database to investigate empirical corporate governance and compensation issues by: a) documenting systematic differences in compensation and governance structures between ExecuComp and non-ExecuComp firms; b) providing examples where inferences drawn from tests conducted on ExecuComp samples do not generalize to broader samples; and c) demonstrating that expanding the sample to include non-ExecuComp firms increases the heterogeneity of the sample thereby allowing researchers to uncover previously hidden conditional or nonlinear relations. Overall, our findings indicate that there are important differences in the characteristics of ExecuComp and non-ExecuComp samples beyond firm size. Our study enhances the ability of researchers to draw valid inferences from empirical tests using ExecuComp data and suggests that expanding samples to include non-ExecuComp firms can generate a richer understanding of the underlying economic phenomena of interest.

  • Center for Research in Security Prices (2001). CRSP delisting returns. CRSP White Paper (April).
    Abstract: This paper discusses CRSP Delisting Returns and the concept of a delisting return bias, recent historical research and the guidelines CRSP researchers follow when conducting delisting research, and the methodology we follow to calculate delisting returns and partial-month returns. The concept of replacing missing delisting return codes with a single-replacement value is also discussed.

  • Cohen, Randolph B. and Christopher K. Polk (1998). COMPUSTAT Selection Bias in Tests of the Sharpe-Lintner-Black CAPM. CRSP, University of Chicago, Graduate School of Business
    Abstract: A recent paper of Kothari, Shanken, and Sloan (1995) (KSS) examines the argument of Fama and French (FF) (1992) that, contrary to the Sharpe-Lintner-Black (SLB) model, book-to-market ratio plays an important role in expected asset returns while market beta does not. KSS claim that part of the discrepancy between the FF empirical results and the SLB theory may be caused by bias in the FF study induced by their use of only COMPUSTAT-listed stocks. This paper uses a methodology that enables us to obtain cross-sectional variation in distress level for non-COMPUSTAT as well as COMPUSTAT firms. We find little if any evidence of COMPUSTAT selection bias, and no evidence that any bias that might exist is related to book-to-market ratio.

  • Courtenay, Stephen M. and Keller, Stuart B. (1994). "Errors in Databases Revisited: An Examination of the CRSP Shares-Outstanding Data." Accounting Review; 69(1), January 1994, pages 285-91. (Full text available through JSTOR)
    Abstract: This article considers errors in databases commonly used by researchers. Specifically, the study investigates the manner in which the Center for Research in Security Prices (CRSP) tapes adjust prices for stock distributions, i.e., stock dividends and stock splits. All distributions reported by CRSP during the period January 1 through December 31, 1989 were reviewed for proper distribution code, record date, ex-dividend date, and distribution date by comparison to a secondary data source, Moody's Dividend Record (MDR). Additionally, the accuracy of the factor used by CRSP to adjust the price and number of shares outstanding for the stock distribution was verified by comparison to MDR. Completeness of CRSP reporting was tested by tracing all stock distributions reported by MDR back into the CRSP tapes. In all, 718 distributions were examined. All differences (142, or 20 percent of the distributions) were reconciled where possible (11 could not be resolved) by examination of the respective companies' annual reports. Coding differences (91, or 64 percent of the differences) arose because of CRSP policy of designating as stock dividends all distributions that are less than or equal to 20 percent of the then-existing shares while all others are stock splits. This coding policy may affect research that uses stock distributions to test signalling by management. The remaining 51 exceptions (36 percent) included 20 ex-date differences that were perceived to be of little consequence, and 31 dissimilarities (22 percent) that are considered significant. In 27 of these variances, CRSP was found to be in error, which ranged from 75 percent to 100 percent when compared to primary data sources. In a second phase, the 1990 CRSP tape was examined to follow-up the status of 44 errors observed in the data review of previous work by these authors utilizing 1981 and 1982 data and of errors detected in phase one of this study. The evidence demonstrates that some errors remain undetected for long periods of time. Finally, the degree to which errors may affect the statistical testing of volume variables showed no significant bias.

  • Elton, Edwin J., Martin J. Gruber, and Christopher R. Blake (2001). "A First Look at the Accuracy of the CRSP Mutual Fund Database and a Comparison of the CRSP and Morningstar Mutual Fund Databases." Journal of Finance, 56(6), pp. 2415-30 (Full text available through Ingenta)
    Abstract: This paper examines problems in the CRSP Survivor Bias Free U.S. Mutual Fund Database (CRSP, 1998) and compares returns contained in it to those in Morningstar. The CRSP database has an omission bias that has the same effects as survivorship bias. Although all mutual funds are listed in CRSP, return data is missing for many and the characteristics of these funds differ from the populations. The CRSP return data is biased upward and merger months are inaccurately recorded about half the time. Differences in returns in Morningstar and CRSP are a problem for older data and small funds.

  • Davis, James L. (1996). "The Cross-Section of Stock Returns and Survivorship Bias: Evidence from Delisted Stocks" Quarterly Review of Economics & Finance; Fall 1996, 36(3), pp. 365-375 (Full text available through Business Source Premier)
    Abstract: Studies the effect of the COMPUSTAT survivorship bias on the explanatory power of book-to-market equity, earnings yield and cash flow yield with respect to realized stock returns during the period from July, 1963 to June 1978. How the COMPUSTAT is augmented; Method used to conduct study; Results of study; Summary of firms included on CRSP, COMPUSTAT and the augmented COMPUSTAT database from 1963-1978.

  • García Lara, Juan Manuel, Beatriz García Osma and Belén Gill de Albornoz Noguer (2006). "Effects of Database Choice on International Accounting Research". Abacus, 42(3/4), pages 426-454. (Full text available through Blackwell Synergy)
    Abstract: Data availability is one of the traditional obstacles confronting researchers carrying out international empirical studies in accounting. In recent years several databases have claimed to offer comprehensive coverage of accounting and financial data of firms worldwide. We analyse whether the choice of database has an effect on the results of empirical studies. We find that the results of a simple empirical adaptation of the Ohlson (1995) model for fourteen member states of the European Union change significantly depending on the database chosen (Datastream, Global Vantage (Compustat Global), Company Analysis, Worldscope, Thomson Financial, Extel Financials and BvD Osiris). These differences are mainly attributable to differences in the samples across databases. When we match observations across all databases the differences persist but are much less pronounced. Our main conclusion is that database choice matters, as it leads to different results when the same research design is used.

  • Guenther, David A. and Andrew J. Rosman (1994). "Differences between COMPUSTAT and CRSP SIC Codes and Related Effects on Research". Journal of Accounting and Economics, 18(1), pages 115-28.
    Abstract: Differences between SIC codes assigned to companies by COMPUSTAT and CRSP are examined. Large differences are observed at two-, three-, and four-digit levels. Correlations of intra-industry monthly stock returns are larger and variances of intra-industry financial ratios are smaller for industries based on COMPUSTAT codes. Replication of a portion of Freeman and Tse (1992) produces significant results using COMPUSTAT codes, consistent with the original research, but insignificant results for CRSP codes.

  • Kahle, Kathleen M. and Ralph A. Walkling (1996). "The impact of industry classifications on financial research". Journal of Financial and Quantitative Analysis, 31(3), pages 309-35. (Full text available through ABI/Inform)
    Abstract: Using approximately 10,000 firms jointly covered by Compustat and CRSP from 1974 to 1993, a study finds substantial differences in the SIC codes designated by the 2 databases. More than 36% of the classifications disagree at the 2-digit level and nearly 80% disagree at the 4-digit level. The study examines the impact of these differences upon financial research in several ways: 1. It is shown that classification of utilities, financial firms, and conglomerate acquisitions are affected by the choice of CRSP versus Compustat SIC codes. 2. It is shown that industry classification matters in financial research by illustrating that size- and industry-matched comparisons are more powerful than pure size matches. 3. The specification and power of Compustat versus CRSP classifications are tested by simulating a typical financial experiment in which sample firms are matched to control firms by industry. The results include: 1. Compustat matched samples are more powerful than CRSP matched samples in detecting abnormal performance. 2. Nonparametric tests outperform parametric tests.

  • Kern, Beth B. and Michael H. Morris (1994). "Differences in the COMPUSTAT and Expanded Value Line Databases and the Potential Impact on Empirical Research" Accounting Review, 69(1), pages 274-284. (Full text available through JSTOR)
    Abstract: The Value Line Investment Survey has recently expanded its database of financial information on public firms from approximately 1,600 to over 4,000 companies to be more competitive in the financial database market. In addition, recent research (Philbrick and Ricks 1991) has shown that in determining earnings surprise, Value Line is a better source for actual EPS data. Since most research has been based on samples drawn from the COMPUSTAT database, increasing attention to the Value Line database leads one to question the effects of database choice on empirical research. One purpose of this study is to examine the differences in financial data between COMPUSTAT and Value Line along with the differences in the distribution of the size of firms between the two databases. Significant differences are found between COMPUSTAT and Value Line in the types of financial data reported for commonly used data items such as sales and total assets. In addition, the distribution of the size of firms in the databases has shifted over time. Prior to 1985 the bottom three quartiles of firms were significantly larger in Value Line. Beginning in 1985 the COMPUSTAT database had significantly larger firms across all quartiles. A second purpose is to demonstrate how the differences in the two databases can materially affect inferences about the population of firms. A study analyzing effective tax rates provides a good example because the use of the different databases produces very different results. This study extends prior research by examining which differences in the databases generate these different results. Much of the difference in the results is attributable to the different firms in the two databases. However, after controlling for common firms, the remaining significant differences can only be attributed to differences in the manner in which the financial accounting data are assimilated into the databases.

  • Kinney, Michael R. and Edward P. Swanson (1993). "The accuracy and adequacy of tax data in COMPUSTAT." Journal of the American Taxation Association, 15(1), pages 121-35. (Full text available through Business Source Elite)
    Abstract: The accuracy of the amounts reported by Standard & Poor's Compustat Services Inc. in 19 tax fields is examined. The error rate varies widely, although it is generally higher for items reported in the footnotes than for items reported on the income statement or balance sheet. The error rates are higher for utilities and when special items are reported on the income statement below income from continuing operations. These special items include the use of a net operating loss (NOL) carryforward, discontinued operations, cumulative adjustments, and extraordinary items. In addition to errors, researchers should be aware of COMPUSTAT's coding policies. Under some circumstances, a field may indicate an amount is missing when an amount is reported in the financial statements. This is particularly true for the breakdown of current and deferred taxes into federal, state, and foreign components. The usefulness of the NOL carryforward field is also limited by COMPUSTAT's coding policies and errors.

  • Krishnan, Jayanthi and Eric Press (2003). "The North American Industry Classification System and Its Implications for Accounting Research" Contemporary Accounting Research, 20(4), pages 685-717. (Full text available through Business Source Elite)
    Abstract: Industry classification is an important component of the methodological infrastructure of accounting research. Researchers have generally used the Standard Industrial Classification (SIC) system for assigning firms to industries. In 1999, the major statistical agencies of Canada, Mexico, and the United States began implementing the North American Industry Classification System (NAICS). The new scheme changes industry classification by introducing production as the basis for grouping firms, creating 358 new industries, extensively rearranging SIC categories, and establishing uniformity across all NAFTA nations. We examine the implications of the change for accounting research. We first assess NAICS's effectiveness in forming industry groups. Following Guenther and Rosman 1994, we use financial ratio variances to measure intra-industry homogeneity and find that NAICS offers some improvement over the SIC system in defining manufacturing, transportation, and service industries. We also evaluate whether NAICS might have an impact on empirical research by reproducing part of Lang and Lundholm's 1996 study of information-transfer and industry effects. Using SIC delineations, they focus on whether industry conditions or the level of competition is the main source of uncertainty resolved by earnings announcements. Across all levels of aggregation, we find inferences are similar using either SIC or NAICS. However, we also observe that the regression coefficients in Lang and Lundholm's model show smaller intra-industry dispersion for NAICS, relative to SIC, definitions. Overall, the results suggest that NAICS definitions lead to more cohesive industries. Because of this, researchers may encounter some differences in using NAICS-industry definitions, rather than SIC, but these will depend on research design and industry composition of the sample. [Note: The authors use NAICS and SIC codes assigned by COMPUSTAT to carry out their tests.]

  • Ljungqvist, Alexander, Christopher Malloy and Felicia Marston (2009). "Rewriting History." Journal of Finance, 64(4): 1935-1960. (Full text available through C. Malloy's web site)
    Abstract: We document widespread changes to the historical I/B/E/S analyst stock recommendations database. Across seven I/B/E/S downloads, obtained between 2000 and 2007, we find that between 6,580 (1.6%) and 97,582 (21.7%) of matched observations are different from one download to the next. The changes include alterations of recommendations, additions and deletions of records, and removal of analyst names. These changes are nonrandom, clustering by analyst reputation, broker size and status, and recommendation boldness, and affect trading signal classifications and back-tests of three stylized facts: profitability of trading signals, profitability of consensus recommendation changes, and persistence in individual analyst stock-picking ability. [Note: A Kellogg doctoral student has reported similar changes in Zacks.]

  • Mills, Lillian F., Kaye J. Newberry, and Garth F. Novack (2003). "How Well Do Compustat NOL Data Identify Firms with U.S. Tax Return Loss Carryovers?". Journal of the American Taxation Association, 25(2), pages 55-56. (Full text available through JSTOR)
    Abstract: Corporate tax incentives play an important role in many accounting and tax research settings. Although Compustat net operating loss (NOL) carryforward data is an underlying component of most proxies of corporate tax incentives, there is little empirical evidence regarding how this data item relates to firms' tax-loss positions per their U.S. tax returns. We find that the use of additional Compustat data for U.S. current income tax or total pretax income works well in reducing misclassification errors associated with Compustat's reporting of an NOL carry-forward balance where no tax NOL exists. Our results suggest that Compustat NOL carryforward data are informative. Our findings contribute to accounting and tax research by providing insights to researchers as they construct corporate tax measures in specific research contexts.

  • Mutchler, Jane and Philip Shane (1995). "A Comparative Analysis of Firms Included in and Excluded from the NAARS Database". Journal of Accounting Research, 33(1), pages 193-202. (Full text available through JSTOR)
    Abstract: An analysis provides evidence on the representativeness of the NAARS database by comparing various characteristics of COMPUSTAT/CRSP domestic industrial firms included in the database (included firms) to characteristics of otherwise similar firms that are not included (excluded firms). The analysis covers both 1985 and 1990 and, although somewhat sensitive to the year analyzed and exchange listing, the results suggest that excluded firms tend to be smaller, with higher probabilities of both bankruptcy and qualified opinions. In addition, excluded firms are also more likely to be audited by a non-big-8 firm.

  • Payne, Jeff L. and Wayne B. Thomas (2003). "The Implications of Using Stock-Split Adjusted I/B/E/S Data in Empirical Research." Accounting Review, 78(4), 1049-1067. (Full text available through Business Source Premier)
    Abstract: The purpose of this study is to highlight issues of interest to researchers employing the I/B/E/S earnings and forecast data. I/B/E/S has traditionally provided per share data on a split-adjusted basis, rounded to the nearest penny. In doing so, per share amounts are comparable over time. However, because not all prior forecasts and earnings per share amounts divide precisely to a penny, adjusting for stock splits and rounding to the nearest penny can cause a loss of information. Researchers are prohibited in many cases from determining the amounts actually reported in prior years, leading to misclassified observations. We obtain actual (unadjusted) earnings and forecast data from I/B/E/S and compare results to those generated using the adjusted I/B/E/S data. We replicate prior studies and find that conclusions are affected when using the actual I/B/E/S data.

  • Philbrick, Donna R. and William E. Ricks (1991). "Using Value Line and IBES Analyst Forecasts in Accounting Research." Journal of Accounting Research 29(2), pages 397-417. (Full text available through ABI/Inform)
    Abstract: Descriptive data on standard sources of analyst forecasts used in accounting research - the Value Line Investment Survey, the Institutional Brokers Estimate System (IBES), the Standard & Poor's Earnings Forecaster, and Zacks Investment Research - are provided. Using various combinations of Value Line, IBES, and COMPUSTAT actual and forecast quarterly earnings per share (EPS) data, the forecast error metrics are compared in terms of accuracy and level of association with announcement-period excess returns. When forecasts are taken from Value Line (IBES), the smallest forecast errors result from the use of Value Line (IBES) actual EPS. In comparison, the use of COMPUSTAT reported data produces significantly larger absolute forecast errors. If COMPUSTAT actual EPS data are used, adjusting COMPUSTAT data for the effects of above-the-line special items produces greater accuracy. Comparing Value Line forecasts and actuals to IBES forecasts and actuals, the former produce smaller absolute forecast errors.

  • Rosenberg, Barr and Houglet, Michel (1974). "Error Rates in CRSP and Compustat Data Bases and their Implications." Journal of Finance; 29(4), pages 1303-10. (available for download through JSTOR).
    Abstract: The presence of erroneous data can destroy a research effort and seriously damage the management decisions based upon research. A comparison of monthly price relatives for NYSE-listed stocks available on both the Center for Research in Security Prices (CRSP) and Investors Management Sciences COMPUSTAT data bases shows that large errors are relatively infrequent. Nevertheless, there are a few large errors in both data bases, and these few errors are sufficient to change sharply the apparent nature of the data. The results suggest some cautions that should be observed in using these data. Perhaps more importantly, the method employed in this article could be institutionalized as a means of quality control for competing data bases, thereby virtually assuring a great improvement in the reliability of data available to the general user.

  • San Miguel, Joseph G. (1977). "The Reliability of R&D Data in COMPUSTAT and 10-K Reports." Accounting Review 52(3), pages 638-41. (available for download through JSTOR).
    Abstract: The purpose of this paper is to report on the findings from a study that compared the research and development information in COMPUSTAT with the information contained in Form 10-K reports to the Securities and Exchange Commission. The comparisons indicated that numerous, and sometimes significant, inaccuracies existed between the two widely used sources of financial information. Also, in a number of cases, the information disclosed in 10-K reports was deficient in explaining the differences. Although this investigation was limited to a specific item of financial data for a single period, there are a number of implications for researchers planning to make use of computerized data bases and for those responsible for evaluating their research. To overcome the potential errors in financial information gathered for research purposes, several recommendations are discussed.

  • Sarig, Oded and Arthur Warga (1989). "Bond Price Data and Bond Market Liquidity." Journal of Financial and Quantitative Analysis 24(3), pages 367-378. (available for download through JSTOR).
    Abstract: An attempt is made to characterize liquidity-driven noise in the CRSP Government Bond price data set by comparing these price records to the independently collected Shearson Lehman Brothers Bond Data Base. Discrepancies between the data sets are largely a result of liquidity-driven price errors. It is shown that these discrepancies are systematically related to certain bond characteristics. On the other hand, such discrepancies are small in size and are approximately mean zero. Data filters are examined that are based on observable bond characteristics. It is shown that these filters can reduce the noise in price records while preserving their mean zero nature. The impact of these errors on performance evaluation is studied by comparing results using filtered and unfiltered data. It is found that the elimination of data from bonds that have been trading for more than 3 years appears to be an effective way to both reduce noise and avoid introducing a selection bias.

  • Schwert, G. William (1990). "Indexes of U.S. Stock Prices from 1802 to 1987." Journal of Business 63(3), pages 399-426. (available for download through JStor).
    Abstract: Monthly stock returns from Smith and Cole (1935), Macaulay (1938), and Cowles (1939) are compared and contrasted with the returns to the CRSP value-weighted portfolios of New York Stock Exchange (NYSE) stocks. Daily stock returns from the Dow Jones Index in 1972 and Standard & Poor's Composite Index in 1986 are compared and contrasted with the returns to the CRSP value-weighted portfolios of NYSE and American Stock Exchange stocks. Effects of dividend, nonsynchronous trading, and time averaging are analyzed. Estimates of means and standard deviations of returns by month for the period 1802-1987 and for 20-year subperiods are given, as well as means and standard deviations of returns by day of the week. The combined series of monthly returns from 1802-1987 and daily returns from 1885-1987 provide a long historical record of stock price behavior. The estimates and the plots of volatility show remarkable homogeneity for these series through time. Moreover, the seasonal patterns of monthly and daily stock returns are similar in the 19th and 20th centuries. Deficiencies of some of the early indexes of stock prices are identified and corrected.

  • Shumway, Tyler (1997). "The Delisting Bias in CRSP Data." Journal of Finance 52(1), pages 327-40. (available for download through JStor).
    Abstract: The author documents a delisting bias in the stock return data base maintained by the Center for Research in Security Prices. He finds that delists for bankruptcy and other negative reasons are generally surprises and that correct delisting returns are not available for most of the stocks that have been delisted for negative reasons since 1962. Using over-the-counter price data, the author shows that the omitted delisting returns are large. Implications of the bias are discussed.

  • Shumway, Tyler and Vincent A. Warther (1999). "The Delisting Bias in CRSP's Nasdaq Data and its Implications for Interpretation of the Size Effect." Journal of Finance 54(6), pages 2361-2379.
    Abstract: We investigate the bias in CRSP data due to missing returns for many of the stocks delisted from Nasdaq. We find that missing returns are far more common when the delisting is for reasons of poor performance, and we find the missing returns to be large and negative on average. This implies a bias for studies using Nasdaq data which is 4.7 times larger than the delisting bias previously documented for CRSP's NYSE/AMEX data. We estimate that using a corrected return of -55 percent wherever a performance-related delisting return is missing will correct the bias. We revisit previous work which finds a size effect in Nasdaq data and find that when the data are corrected for the delisting bias, the evidence for a size effect in Nasdaq data disappears.

  • Skantz, Terrance R. and Barbara G. Pierce (1997). "Value Line and I/B/E/S Earnings Forecasts in the Presence of Nonrecurring Gains and Losses: Evidence of Inconsistency." Working paper, Florida Atlantic University Graduate School of Business, November.
    Abstract: Earnings are increasingly affected by "one-time" gains and losses; consequently, knowing whether analysts' forecasts include or exclude such gains and losses is important to our understanding of forecast errors. Using I/B/E/S and Value Line data, this paper presents evidence that quarterly earnings forecasts and the associated "actual" earnings numbers reported by forecast services are inconsistent in the treatment of special gains and losses. Moreover, the fact that the actual earnings number reported by forecast services includes the nonrecurring item does not insure that the related earnings forecast does likewise. The paper examines two types of nonrecurring items, restructuring losses and equity carve-out gains. We find that the "actual" earnings numbers reported by Value Line and I/B/E/S include (exclude) these nonrecurring items in 37% (63%) and 38% (62%) of the cases, respectively. While the treatment of losses in "actual" earnings numbers provides a signal about the forecast target, the treatment of gains does not. Forecasts are reliably related to losses only if the losses are included in actual earnings. And forecast revisions show only a very weak relation to losses if actual earnings excluded those losses, and a significantly stronger relation to losses that are included in earnings. Conversely, forecasts are significantly related to gains whether or not the gain is included; and, forecast revisions are unrelated (or, at best only marginally related) to nonrecurring gains, even when actual earnings numbers include the gain. The findings suggest that the existence of nonrecurring items can make problematic the calculation of an earnings forecast error that is an unbiased measure of unexpected ordinary income.

  • Ulbricht, Niels and Christian Weiner (2005). "Worldscope meets Compustat: A Comparison of Financial Databases." Humboldt-Universität zu Berlin, SFB 649 Discussion Paper 2005-064. (Available for download through Humboldt-Universität zu Berlin, SFB 649).
    Abstract: With this study we are the first to systematically compare today’s two major counterparts as a source of accounting and financial data for researchers: Compustat North America by Standard and Poor’s and Worldscope by Thomson Financial. This investigation is conducted for U.S. and partly Canadian data over an extensive period from 1985 to 2003. We examine more than 650 data items available in both databases and address the question of whether or not the decision for one or the other source may have an impact on the outcome of research projects. It is probably commonly assumed that this impact is minor, but it also leaves room to question certain results. We show that the use of both databases should lead to comparable results, but also find that if, e.g. a size bias, is not treated with care the quality of results may differ considerably. Furthermore after 1998 the number of firms covered by Worldscope exceeds the one covered by Compustat by about one fourth.

  • Vasarhelyi, Miklos A. and Yang, David C.H. (1986). "Financial Accounting Databases: Methodological Implications of Using the Compustat and Value Line Databases" Columbia First Boston Series in Money, Economics and Finance Working Paper: FB-86-30, May 1986.
    Abstract: This paper compares two commonly used financial databases -- Value Line (VL) and Compustat (CMP) in their qualitative and quantitative features. Data is examined using seven variables through eleven years. Data differences found are further analyzed for 1981 data where a sample is compared to figures directly drawn from financial statements. Substantial data differences are found, most of which are attributable to definitional discrepancies and others to direct measurement error. For example 39.5 per cent of the depreciation figures and 23.2 per cent of the inventory numbers were discrepant by more than 1 per cent of the absolute value of the measure. The paper also provides suggestions on the selection and usage of financial databases, and discusses shortcomings that should be expected in using accounting databases. Finally, recommendations are presented for dealing with these problems to preparers of databases as well as standard setters.

  • Venkatesh, P. C. (1992). "Empirical Evidence on the Impact of the Bid-Ask Spread on the Characteristics of CRSP Daily Returns." Journal of Financial Research 15(2), pages 113-25.
    Abstract: It is widely recognized that Center for Research in Security Prices (CRSP) returns may differ from "true" returns because of the bid-ask effect. Using a large sample of New York Stock Exchange and American Stock Exchange securities, the author confirms a discernible bid-ask effect, the magnitude and importance of which decrease with the security's price level (increase with the spread). The author finds volatility estimates using CRSP returns to be greater than those based on quote returns. However, market model properties, such as β and R², are generally unaffected. Bid-ask effects are clearly apparent in event studies, but because of certain offsetting effects commonly used test statistics remain unaffected. Low-priced stocks (below $2.00) do not conform to these patterns. Finally, the evidence raises the possibility that the existing literature on filter rule tests may underestimate the bid-ask spread component of transaction costs.

  • Wood, Robert A. (2000). "Market microstructure research databases: History and projections." Journal of Business & Economic Statistics 18(2), pages 140-145. (Datasets mentioned: TAQ and ISSM). (Full text available through ABI/Inform)
    Abstract: A partial history of the development of transactions databases of securities prices is presented. The availability of these databases, which in part were funded by the National Science Foundation, has fueled the rapid growth of market micro-structure research as a discipline within financial economics. Furthermore, the accelerating growth of securities trading and the implication of this growth for empirical micro-structure research are examined.

  • Yang, David C., Miklos A. Vasarhelyi, and Caixing Liu (2003). Industrial Management + Data Systems 103(3/4), pages 140-145. (Datasets mentioned: Compustat and ValueLine). (Full text available through ABI/Inform Global)
    Abstract: This paper identifies potential data problems when using accounting databases. To examine data errors, two commonly used accounting databases, Value Line and Compustat, are compared in their qualitative and quantitative features. Data are examined using seven variables over a period of 11 years. Substantial data differences are found, most of which are attributable to definitional discrepancies and others to direct measurement error. Finally, recommendations for dealing with these problems are presented, to preparers of databases as well as standard setters.
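
As an illustration of how one of the corrections described above might be applied in practice, the following is a minimal sketch of the replacement rule summarized in the Shumway and Warther (1999) abstract: a return of -55 percent is substituted wherever a performance-related delisting return is missing. Only the -55 percent figure comes from the abstract. The function name, the boolean performance_related flag, and the sample records are hypothetical; deciding which delistings count as performance-related is left to the researcher and the paper itself.

```python
from typing import Optional

# Replacement value quoted in the Shumway and Warther (1999) abstract
# for Nasdaq stocks. Everything else in this sketch is an assumption.
NASDAQ_REPLACEMENT_RETURN = -0.55


def corrected_delisting_return(
    delisting_return: Optional[float],
    performance_related: bool,
    replacement: float = NASDAQ_REPLACEMENT_RETURN,
) -> Optional[float]:
    """Return the delisting return to use for one delisted stock.

    delisting_return: the reported delisting return, or None if missing.
    performance_related: hypothetical flag set by the researcher to mark
        delistings for poor performance.
    """
    if delisting_return is not None:
        # A reported return is always kept unchanged.
        return delisting_return
    if performance_related:
        # Missing return on a performance-related delisting: substitute.
        return replacement
    # Missing return on any other delisting stays missing.
    return None


# Hypothetical records: (stock label, reported return, performance flag).
records = [
    ("A", -0.30, True),
    ("B", None, True),
    ("C", None, False),
]

for label, reported, flag in records:
    print(label, corrected_delisting_return(reported, flag))
```

The replacement value is passed as a parameter so that a different figure can be supplied for other exchanges or sample periods.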

© 2001-2010 Kellogg School of Management, Northwestern University