Big Data: Content and Technology

BY Gio Wiederhold, May 2016

The processing of voluminous data, now primarily found on the Internet, has been making rapid strides. Relationships among diverse sources are routinely established, and doing so no longer requires experts to embody their knowledge in formalized and awkward schemas. Individual entries are linked with a fair amount of confidence using entity resolution technologies. Those data have varied provenance, and can produce exciting results [deSaRRSWWZ:16].

The particular issue I am addressing here is the use of Big Data by folk engaged in analyzing data relevant to national economics and then giving advice to private and public agencies on what should be done to help national and international economies. Having ‘Big Data’ raises high expectations [JagadishEA:14]. The objectives are wonderful: help our economies grow, decrease unemployment, and spread comfort and happiness worldwide.

Imbalanced data

Still, the results of Big Data technology depend on the provenance of the data. And the data available to our economists are woefully imbalanced. Specifically, data needed to measure the processes relevant to high technology enterprises are sparse. That matters, because our high technology industry is a major driver of the current economy. I am concerned about how this imbalance leads to poor advice in governmental decision-making.

Economists have depended mainly on financial data to measure the economy. Corporate production and cost data are aggregated for their think tanks [Brown:09]. Such data are reported down to the pennies by accountants and are required to be presented to the world in annual reports. Many of the financial assets of high-technology multinational corporations are held outside of the U.S. in taxhavens. However, for high-tech enterprises operating globally, these ‘booked’ values tend to be a fraction, about 20% on average, of the market value that investors assign to the corporations. Economists also use governmental sources, such as income data from tax revenues. However, because taxation is imbalanced, those data mislead as well. Investors have insights that are broader.

The economy of the 20th century depended on much labor and substantial financial capital. Plants building aircraft and automobiles, as well as the steel mills and machine shops that supplied them, were tangible evidence of economic prowess. These industries were associated with known locations, and their products were costly to ship. Geography was an important factor.

Even studies that purport to analyze innovation mislead. A recent study, cited in Science, intended to provide guidance to U.S. policies, was based on patent data [NagerHEA:16]. Patents are the means for established industries to protect themselves. Ongoing innovation relies on trade secrets [Wiederhold:13, Chap. 3]. It is no surprise that this study is interpreted to show that established industries are very innovative, that women and Asians contribute little, and not to “think of Bill Gates” as an example [Malakoff:16].

The world has changed

The post-industrial economy is based on intellectual capital. The Apples, Microsofts, Googles and the many smaller, hipper players that create an ever larger fraction of the goods that people purchase are not strapped for financial capital. Furthermore, the GEs, Intels and similar enterprises that do require costly factories have moved much of the labor-intensive production of their tangible products overseas. The critical intangibles embedded in chips, phones, and computers are transmitted to production facilities from far away. Much research, development, testing, and prototyping, and the equally important market research and promotion activities, remain in the US, complemented with laboratories in the EU and Asia.

Intangible products can be copied at negligible cost and shipped freely worldwide over the Internet. Such transfers are not obvious in the Big Data being mined. Containerized shipping has similarly reduced the costs of distributing high-technology tangible products. It costs only about $0.50 per unit to ship a pallet of iPads anywhere in the world. Computerized logistics minimizes inventory investments. On-line payment systems allow revenues from world-wide sales to be collected anywhere, preferably in locations that don’t insist on excessive reporting to their government agencies.

One piece of evidence for the mismatch is the difference between the valuation companies show on their books, based on financial information, and what investors consider the value of the company to be: the market capitalization (the share price times the number of shares on the market). For a traditional enterprise, say a railroad, the two assessments are close. For a high-technology company, the additional market value due to its intellectual capital is typically 4 times the book value. Check it, but subtract the excess cash held in taxhavens first!
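The check suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `intangible_multiple`, its parameters, and all figures are hypothetical, and the sketch assumes that excess cash is counted at face value in both the book value and the market capitalization, so it is removed from both before comparing.

```python
def intangible_multiple(share_price, shares_outstanding, book_value, excess_cash=0.0):
    """Market value beyond book value, expressed as a multiple of book value.

    Assumption (not from the source): excess cash appears at face value in
    both the market capitalization and the book value, so it is subtracted
    from both before the comparison.
    """
    # Market capitalization: share price times the number of shares on the market.
    market_cap = share_price * shares_outstanding
    adjusted_market = market_cap - excess_cash
    adjusted_book = book_value - excess_cash
    if adjusted_book <= 0:
        raise ValueError("book value must exceed excess cash")
    return (adjusted_market - adjusted_book) / adjusted_book


# Hypothetical company; these are made-up figures, not real data.
multiple = intangible_multiple(
    share_price=100.0,
    shares_outstanding=5_000_000,
    book_value=180_000_000.0,
    excess_cash=100_000_000.0,
)
print(f"Market value beyond book value: {multiple:.1f} times book value")
```

With these made-up figures the adjusted book value is one fifth of the adjusted market value, so the extra market value comes out at 4 times book. Leaving the excess cash in would make the same company look much closer to a traditional enterprise.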

Data missing now

To model and give advice for modern enterprises economists need data about the resources and the flow of intellectual capital: the people that create and exploit intellectual property (IP), and the IP itself. Those are the factors that drive modern industry.

If data about the intellectual capital that drives modern enterprises is so important, why don’t the economists that give advice go looking for it? The cycle of data availability and demand is stuck.

Little is being recorded in accessible form by industry, because reporting regulations ignore intellectual property and employee capabilities.

Our leading economists have grown up and been educated in a time when financial capital and cheap labor were the crucial contributors to growth [Nasar:11].

The effect is that economic analyses cannot measure the impact of the intellectual capital, the experts and IP, the factors that drive modern industry. Ignoring its contribution in decision-making leads to selection bias [Zicari:16]. As a result, the needed infrastructure, including education, training, and levels of immigration, as well as protection against external threats, is short-changed, since there is no documentable path from such investments to the outputs of modern industry. There are many anecdotes, but these cannot be placed into a broad coherent economic model.

All inputs to the modern economy need intellectual capital. But the prominent economists, those that have risen to the level of providing advice to governments, continue to focus on financial capital for their metrics and tools [FurmanO:15]. They struggle to explain the rise in income inequality while only using goodwill, booked when companies are purchased for more than their book value, which is certainly a miserable surrogate for intellectual capital. Still, without including goodwill the return for the top companies is over 90% now, while when goodwill is included the returns for the best companies are less than 30%, still great. And those great companies, earning super-normal returns, are the ones that rely on intellectual capital. Other commentators missed that point while reviewing this and the work of many economists. They concluded that, since in the past those best performing companies obtained returns on capital of about 25%, the shift is a sign of growing unfair income distribution [Ip:15].
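To see why including goodwill shrinks the measured return so sharply, consider a minimal sketch. It assumes that return on capital is simply profit divided by the capital base, and that goodwill, when included, is added to that base. The function name and all figures are hypothetical and are not taken from the cited study.

```python
def return_on_capital(operating_profit, invested_capital, goodwill=0.0):
    """Profit divided by the capital base; goodwill, if given, enlarges the base."""
    capital_base = invested_capital + goodwill
    if capital_base <= 0:
        raise ValueError("capital base must be positive")
    return operating_profit / capital_base


# Hypothetical figures in arbitrary units, chosen only to illustrate the effect.
profit = 28.0
capital = 30.0    # booked capital, excluding goodwill
goodwill = 70.0   # booked when a company is bought for more than its book value

without_goodwill = return_on_capital(profit, capital)
with_goodwill = return_on_capital(profit, capital, goodwill)
print(f"Return excluding goodwill: {without_goodwill:.0%}")
print(f"Return including goodwill: {with_goodwill:.0%}")
```

The profit is the same in both cases; only the denominator changes. Goodwill captures intellectual capital only for companies that were acquired, which is why it is such a poor surrogate.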

It is clear that, by focusing on financial capital, a policy such as keeping interest rates low helps primarily the traditional segments of industry, but does very little for high-technology enterprises. Those policy makers fail to realize that conclusions they derive from the historical financial corporate data are ignored by smart investors. Investors in high-technology businesses value enterprises according to future expectations, not by past and current costs. They count on future income due to the smart people and the intellectual property (IP) they generate and exploit to make attractive products [Wiederhold:13]. Predicting the future remains risky, but is critical. Avoiding the collection of data relevant to modern industries because of risks and imprecision is not acceptable.

Relevant data for modern enterprises

What data can be collected to drive future analyses? Amounts spent on research and development to create intellectual property (IP), on maintenance of such IP, and on marketing are available within businesses. The background, experience, and education level of staff help in assessing the future of a company.

Reporting such data consistently can provide useful aggregations by industry. The maturity of an enterprise should be taken into account, since a Snapchat is bound to present a different profile than a Microsoft. Venture capitalists do estimate the overall leverage of their investments and develop useful insights, although those are rarely shared beyond their peers. Prices of startup exits and mergers reflect rational expert opinions. Stock market prices represent the wisdom of the crowds. While much of such data are not based on verifiable accounting data, in the aggregate they are as realistic as values for the tangibles listed.

To make relevant data available, computer scientists and technological workers have to play a role. Computer scientists do express concerns about disturbing trends in progress and jobs [Vardi:15]. Similar discussions address traditional engineering disciplines [Charette:13]. An early study sponsored by the ACM was based on opinions, rather than on data [AsprayMV:06]. My response was [Wiederhold:11].

Computer professionals are at the center of the storm that surrounds the industry. They are willing to advocate for more education, ubiquitous Internet access, and job security. A complicating issue is that some computer scientists advocate that software should be free. That implies that they expect to be supported by public funds or maybe by tax-deductible donations. But their industry is able to hold capital in taxhavens forever as being ‘subject to management’s decision to indefinitely reinvest those earnings’ [Fleischer:12]. Politicians may argue about getting it back somehow, but have no idea of the role of the IP rights that got it there in the first place. Keeping U.S. capital costs low discourages repatriation of those funds for investment in the U.S. [Damodaran:13].

Concerned professionals should not just observe the effects, but try to provide data, analyses, and mechanisms so that they will affect the world around them. Some modern economists will be pleased if more data become available [Damodaran:15].

As long as computer scientists and technologists do not contribute the big data needed to make fair decisions about their livelihood, their needs will be ignored. The rights to their work are now being shipped to taxhavens, and the resulting profits are not available for growth. Without support from professional experts little change in national policies that affect modern industry can be expected [Gibbs:09].


We should be concerned about how poorly the role of intellectual capital is understood in governmental decision-making. The lack of relevant data is a major reason. The individuals that create value in modern industries should take an active role. That role must go beyond complaining and signing petitions, and include supplying the data from which information about their industries can be extracted. As long as the data available are imbalanced, studies will favor traditional industries. If the only big data available support obsolete policies, we cannot expect change.


This blog was triggered by responses to a Viewpoint presented earlier this year [Wiederhold:16]. I have to thank my colleagues for pointing me to recent publications.

[AsprayMV:06] William Aspray, Fred Mayadas, and Moshe Y. Vardi (eds.): Globalization and Offshoring of Software; A Report of the ACM Job Migration Task Force, ACM, 2006.

[AviYonah:12] Reuven S. Avi-Yonah: Statement to Congress; University of Michigan School of Law, Permanent Subcommittee on Investigations, U.S. Congress, 20 Sep.2012.

[Brown:09] Jeffrey Brown (ed): NBER Book Series Tax Policy and the Economy; NBER, 2009-ongoing.

[Charette:13] Robert N. Charette: The STEM Crisis Is a Myth; IEEE Spectrum, 30 August 2013.

[Damodaran:13] Aswath Damodaran: “Unlike the US tax code, Apple is perfectly rational”, Financial Times, 7 May 2013.

[Damodaran:15] Aswath Damodaran: “The Aging of the Tech Sector: The Pricing Divergence of Young and Old Tech Companies”, Musings on Markets, 26 Feb 2015.

[Fleischer:12] Victor Fleischer: “Overseas Cash and the Tax Games Multinationals Play”; New York Times, 3 Oct. 2012.

[FurmanO:15] Jason Furman and Peter Orszag: “A Firm-Level Perspective on the Role of Rents in the Rise in Inequality”; Presentation at “A Just Society” Centennial Event in Honor of Joseph Stiglitz, Columbia University, 16 Oct 2015, available at

[Gibbs:09] Robert Gibbs: Leveling the Playing Field: Curbing Tax Havens and Removing Tax Incentives For Shifting Jobs Overseas; The White House, 4 May 2009.

[Ip:15] Greg Ip: What’s Driving Inequality: CEO Pay or Company Success?; The Wall Street Journal, 5 Nov. 2015.

[JagadishEA:14] H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi: “Big Data and Its Technical Challenges”; Communications of the ACM (CACM), Vol.57, no.7, July 2014, pp. 86-94.

[Kocieniewski:16] David Kocieniewski: “The Sharing Economy Doesn’t Share the Wealth, As Airbnb and Uber inch toward profits, tax authorities worry”; Bloomberg Businessweek, 6 April, 2016; with a video

[Malakoff:16] David Malakoff: What’s the face of U.S. innovation? Don’t think Bill Gates; review in Science, 3 Mar. 2016

[MillerVC:10] Keith W. Miller, Jeffrey Voas, and Tom Costello: “Free and Open Source Software”; IT Professional, IEEE, Nov.2010, p.14-17.

[NagerHEA:16] Adams Nager, David M. Hart, Stephen Ezell, and Robert D. Atkinson; The Demographics of Innovation in the United States; ITIF, 24 Feb.2016.

[Nasar:11] Sylvia Nasar: Grand Pursuit: The Story of Economic Genius; Simon & Schuster, 2011.

[deSaRRSWWZ:16] Christopher De Sa, Alex Ratner, Christopher Ré, Jaeho Shin, Feiran Wang, Sen Wu, Ce Zhang: “DeepDive: Declarative Knowledge Base Construction”; SIGMOD Record 2016 (to appear)

[Sullivan:13] Martin A. Sullivan: Tax Policy in a Knowledge-based Economy;, 21 Oct.2013

[Summers:88] Lawrence H. Summers, ed.: Tax Policy and the Economy 2; MIT Press, 1988.

[Vardi:15] Moshe Vardi: “What do we do when the jobs are gone?”; CACM, Vol.58 no.2, Feb. 2015.

[Wiederhold:11] Gio Wiederhold: “Follow the IP: How does Industry pay Programmers’ Salaries when the required Intellectual Property is offshored?”; CACM, Vol.54 No.1, Jan.2011, pp.65-74.

[Wiederhold:13] Gio Wiederhold: Valuing Intellectual Capital, Multinationals and Taxhavens; series Management for Professionals, Springer Verlag, 2013.

[Wiederhold:16] Gio Wiederhold: “Unbalanced Data Leads to Obsolete Economic Advice”; Viewpoint, CACM, Jan 2016, pp.45-46.

[Zicari:16] Roberto Zicari: On Big Data and Data Science. Interview with James Kobielus; ODBMS Blog, 16 Apr. 2016.

