Journal of Data and Information Science  2017, Vol. 2 Issue (3): 19-36    DOI: 10.1515/jdis-2017-0012
Expert Review     
Big Metadata, Smart Metadata, and Metadata Capital: Toward Greater Synergy Between Data Science and Metadata
Jane Greenberg
College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, USA

Purpose: The purpose of the paper is to provide a framework for addressing the disconnect between metadata and data science. Data science cannot progress without metadata research. This paper takes steps toward advancing the synergy between metadata and data science, and identifies pathways for developing a more cohesive metadata research agenda in data science.

Design/methodology/approach: This paper identifies factors that challenge metadata research in the digital ecosystem, defines metadata and data science, and presents the concepts big metadata, smart metadata, and metadata capital as part of a metadata lingua franca connecting to data science.

Findings: The “utilitarian nature” and “historical and traditional views” of metadata are identified as two intersecting factors that have inhibited metadata research. Big metadata, smart metadata, and metadata capital are presented as part of a metadata lingua franca to help frame research in the data science research space.

Research limitations: There are additional, intersecting factors that likely inhibit metadata research, and other significant metadata concepts to explore.

Practical implications: The immediate contribution of this work is that it may elicit response, critique, or revision, or, more significantly, motivate research. The work presented can encourage more researchers to consider the significance of metadata as a research-worthy topic within data science and the larger digital ecosystem.

Originality/value: Although metadata research has not kept pace with other data science topics, there is little attention directed to this problem. This is surprising, given that metadata is essential for data science endeavors. This examination synthesizes original and prior scholarship to provide new grounding for metadata research in data science.

Key words: Metadata research; Data science; Big metadata; Smart metadata; Metadata capital
Published: 25 August 2017
Cite this article:

Jane Greenberg. Big Metadata, Smart Metadata, and Metadata Capital: Toward Greater Synergy Between Data Science and Metadata. Journal of Data and Information Science, 2017, 2(3): 19-36.


Figure 1. Visual Business Intelligence: A blog by Stephen Few (January 23, 2017).
Five Vs Definition
Volume The quantity and usefulness of metadata generated daily confirms the existence of big metadata. At times metadata is less than or equal to the extent of the data it describes in size (bytes). During other times the metadata exceeds the data being described or tracked, due to the complexity of the data lifecycle activity. Linked data offers an example, with metadata renderings that can be larger than the volume of data object(s) being represented. Like big data, not all big metadata is useful, and a challenge is to identify the big metadata that is useful for data science and analytic endeavors.
Velocity Metadata is generated via automatic processes at immense speed, correlating with the rate of digital transactions. For example, searching Google, answering an email, purchasing an item online, and day-to-day office activities such as word processing all generate log data, as well as associated metadata.
Variety Metadata reflects the wide variety of data formats, types, and genres along with the extensive range of data and metadata lifecycles. In addition, the different types of metadata (e.g. discovery, technical, preservation, etc.) as well as unique domain specific metadata requirements intensify the variety.
Variability There is an unmistakable unevenness of metadata across the digital ecosystem. Lack of uniformity is extensive for data descriptions across different domains, systems, and processes. This unevenness can be profound even within domains, given economic factors supporting metadata generation, competing standards, or, simply, differing adoption policies. For example, two organizations may use the same metadata standard, but have different implementation practices. Even when standardization is imposed, organization, process, and human activity can all contribute to inconsistencies.
Value If data is the new black gold*—akin to petroleum requiring purification, but also a money maker, then metadata is the new platinum—a malleable substance that keeps its toughness, and can serve as a catalyst, sparking a reaction.
Metadata, as the new platinum, can be modified, while remaining a strong, independent data type. Metadata stands as a durable data object that triggers various functions—the catalyst, and achieves results—a reaction. Metadata is vital to accurate data interpretation and use by both humans and machines, and the value of metadata for data science endeavors cannot be overstated or diminished.
Table 1 The five Vs of big metadata.
Figure 2. Smart metadata matrix of principles.
[1]   Abbasi, M., Vassilopoulou, P., & Stergioulas, L. (2017). Technology roadmap for the creative industries. Creative Industries Journal, 10(1), 40-58.
doi: 10.1080/17510694.2016.1247627
[2]   Beall, J. (2004). Dublin Core: An obituary. Library Hi Tech News, 21(8), 40-41.
[3]   Beall, J. (2014). Dublin Core is still dead. Library Hi Tech News, 31(9), 11-13.
doi: 10.1108/LHTN-07-2014-0058
[4]   Bruce, T.R., & Hillmann, D.I. (2004). The continuum of metadata quality: Defining, expressing, exploiting. ALA Editions. Retrieved on July 31, 2017, from .
[5]   Coleman, A.S. (2005). From cataloging to metadata: Dublin Core records for the library catalog. Cataloging & Classification Quarterly, 40(3-4), 153-181.
doi: 10.1300/J104v40n03_08
[6]   Contractor, D., Negi, S., Popat, K., Ikbal, S., Prasad, B., Kakaraparthy, S., Sengupta, B., Vedula, S., & Kumar, V. (2015). Smarter learning content management using the Learning Content Hub. IBM Journal of Research and Development, 59(6), 3:1-3:9.
doi: 10.1147/JRD.2015.2455691
[7]   Data Science Association (DSA). (2017). About data science. Retrieved on June 18, 2017, from .
[8]   DCMI. (2003). Special session: Smart metadata. In 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice-Metadata Research & Applications, Seattle, Washington. Retrieved on June 30, 2017, from .
[9]   Dhar, V. (2013). Data science and prediction. Communications of the ACM, 56(12), 64.
doi: 10.1145/2500499
[10]   Dimitrova, N. (October-December, 2004). Is it time for a moratorium on metadata? IEEE Multimedia, 11(4), 10-17.
doi: 10.1109/MMUL.2004.29
[11]   Doctorow, C. (2001). Metacrap: Putting the torch to seven straw-men of the meta-utopia. Retrieved on June 28, 2017, from .
[12]   Dong, R., Su, F., Yang, S., Xu, L., Cheng, X., & Chen, W. (2016, September). Design and application on metadata management for information supply chain. In the 16th International Symposium on Communications and Information Technologies (ISCIT) (pp. 393-396). Washington, DC: IEEE Computer Society Press.
[13]   ERAC Secretariat. (2016). European Research Area and Innovation Committee. European Union. Brussels, February 3, 2016. Retrieved on June 18, 2017, from .
[14]   Fatima, A., Luca, C., & Wilson, G. (2014, March). New framework for semantic search engine. In 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation (UKSim) (pp. 446-451). Washington, DC: IEEE Computer Society Press.
[15]   Few, S. (2017). Visual business intelligence: A blog by Stephen Few. There is no science of data, January 23, 2017. Retrieved on July 7, 2017, from2560.
[16]   Gaitanou, P., Gergatsoulis, M., Spanoudakis, D., Bountouri, L., & Papatheodorou, C. (2016). Mapping the hierarchy of EAD to VRA Core 4.0 through CIDOC CRM. In the 10th International Conference on Metadata and Semantics Research (MTSR 2016) (pp. 193-204). Cham, Switzerland: Springer International Publishing.
[17]   Greenberg, J. (2005). Understanding metadata and metadata schemes. Cataloging & Classification Quarterly, 40(3-4), 17-36.
[18]   Greenberg, J. (2009). Theoretical considerations of lifecycle modeling: An analysis of the Dryad repository demonstrating automatic metadata propagation, inheritance, and value system adoption. Cataloging & Classification Quarterly, 47(3-4), 380-402.
doi: 10.1080/01639370902737547
[19]   Greenberg, J. (2009). Metadata and digital information. In M.J. Bates & M.N. Maack (Eds.), Encyclopedia of Library and Information Sciences (pp. 3610-3623). Boca Raton, FL: CRC Press.
[20]   Greenberg, J. (2014). Metadata capital: Raising awareness, exploring a new concept. Bulletin of the Association for Information Science and Technology, 40(4), 30-33.
doi: 10.1002/bult.2014.1720400412
[21]   Greenberg, J., & Garoufallou, E. (2013). Change and a future for metadata. In MTSR-2013: Proceedings of the 7th Metadata and Semantics Research Conference (pp. 1-5). Cham, Switzerland: Springer International Publishing.
[22]   Greenberg, J., Murillo, A.P., Ogletree, A., Boyles, R., Martin, N., & Romeo, C. (2014a). Metadata capital: Automating metadata workflows in the NIEHS Viral Vector Core Laboratory. In MTSR-2014: Proceedings of the 8th Metadata and Semantics Research Conference (pp. 1-13). Cham, Switzerland: Springer International Publishing.
[23]   Greenberg, J., Ogletree, A., Murillo, A.P., Caruso, T.P., & Huang, H. (2014b). Metadata capital: Simulating the predictive value of self-generated health information (SGHI). In 2014 IEEE International Conference on Big Data (pp. 31-36). Washington, DC: IEEE Computer Society Press.
[24]   Greenberg, J., Swauger, S., & Feinstein, E.M. (2013). Metadata capital in a data repository. In DC-2013: the International Conference on Dublin Core and Metadata Applications (pp. 140-150). Lisbon, Portugal: Dublin Core Metadata Initiative.
[25]   Greenwald, G. (2013). Edward Snowden: The whistleblower behind the NSA surveillance revelations. The Guardian. Retrieved on June 18, 2017, from
[26]   Hey, T., Tansley, S., & Tolle, K. (2009). The fourth paradigm. Redmond, Washington: Microsoft Research.
[27]   Ilevbare, I., Athanassopoulou, I., & Wooldridge, J. (2017). UK Workshop on Data Metrology and Standards. The National Physical Laboratory and partners at the University of Huddersfield and University of Cambridge. March, 2017. Retrieved on June 18, 2017, from .
[28]   Kogan, D.E., Miller, P.C., & Schobbe, G.A. (2007). Techniques to manage metadata fields for a taxonomy system. US 20080301096 A1. (Also published as WO2008150619A1). Retrieved on June 28, 2017, from .
[29]   Kunze, J. (2001). A metadata kernel for electronic permanence. In International Conference on Dublin Core and Metadata Applications, North America, DC2001. Retrieved on July 31, 2017, from .
[30]   Kunze, J., Calvert, S., DeBarry, J., Hanlon, M., Janée, G., & Sweat, S. (2016a). Persistence statements: Describing digital stickiness. California Digital Library. Retrieved on July 20, 2017, from .
[31]   Kunze, J., DeBarry, J., Hanlon, M., Scout, C., & Sweat, S. (2016b). A vocabulary for persistence. In SciDataCon 2016. September 11-13, 2016, Denver, Colorado. Retrieved on July 21, 2017, from .
[32]   Li, C., & Sugimoto, S. (2017). Provenance description of metadata vocabularies for the long-term maintenance of metadata. Journal of Data and Information Science, 2(2), 41-55.
[33]   Lytras, M.D., Sicilia, M.Á., & Cechinel, C. (2013). The value and cost of metadata (chapter I.3). In M.A. Sicilia (Ed.), Handbook of Metadata, Semantics and Ontologies (pp. 41-62). Hackensack, NJ: World Scientific Publishing Company.
[34]   Manian, D. (2011, Nov. 11). Our pointless pursuit of semantic value. Retrieved on June 29, 2017, from
[35]   Marr, B. (2014). Big data: The 5 Vs everyone must know. LinkedIn: Big data. Retrieved on June 18, 2017, from
[36]   Méndez, E., & van Hooland, S. (2013). Metadata typology and metadata uses (chapter I.2). In M.A. Sicilia (Ed.), Handbook of Metadata, Semantics and Ontologies (pp. 9-40). Hackensack, NJ: World Scientific Publishing Company.
[37]   NITRD. (2016). The Federal Big Data Research and Development Strategic Plan. The Networking and Information Technology Research and Development Program, May 2016. Retrieved on June 15, 2017, from
[38]   Oh, S.G., Yi, M., & Jang, W. (2015). Deploying linked open vocabulary (lov) to enhance library linked data. Journal of Information Science Theory and Practice, 2(2), 6-15.
doi: 10.1633/JISTaP.2015.3.2.1
[39]   Riley, J. (2017). Understanding metadata. Bethesda, MD: NISO Press.
[40]   Shankaranarayanan, G., & Even, A. (2006). The metadata enigma. Communications of the ACM, 49(2), 88-94.
[41]   Shirky, C. (2005). Ontology is overrated: Categories, links, and tags. Economics & Culture, Media & Community. Retrieved on June 20, 2017, from .
[42]   Simon, P. (2013). Too big to ignore: The business case for big data (Vol. 72). Hoboken, NJ: John Wiley & Sons.
[43]   Singh, A. (2013). Is big data the new black gold? Wired. Retrieved on July 7, 2017, from .
[44]   Smith, A. (1776). An inquiry into the nature and causes of the wealth of nations. London: W. Strahan and T. Cadell.
[45]   Smith, K., Seligman, L., Rosenthal, A., Kurcz, C., Greer, M., Macheret, C., ... & Eckstein, A. (2014). Big metadata: The need for principled metadata management in big data ecosystems. In Proceedings of Workshop on Data Analytics in the Cloud (pp. 1-4). New York: ACM.
doi: 10.1145/2627770.2627776
[46]   Stanton, J.M. (2012). Introduction to data science. Syracuse University. Retrieved on June 6, 2017, from
[47]   Sugimoto, S., Li, C., Nagamori, M., & Greenberg, J. (2016). Permanence and temporal interoperability of metadata in the linked open data environment. In Proceedings of the International Conference on Dublin Core and Metadata Applications 2016 (pp. 45-54). Retrieved on June 28, 2017, from .
[48]   Tennant, R. (2002). MARC must die. Library Journal, 127(17), 26-27.
[49]   Thyagaraju, G.S., & Kulkarni, U.P. (2011). Family aware TV program and settings recommender. International Journal of Computer Applications, 29(4), 1-18.
doi: 10.5120/3556-4889
[50]   UK Data Archive. (2012). Research data lifecycle. Retrieved on June 15, 2017, from .
[51]   Vaduva, A., & Dittrich, K.R. (2001). Metadata management for data warehousing: Between vision and reality. In 2001 International Symposium on Database Engineering and Applications (pp. 129-135). Washington, DC: IEEE Computer Society Press.
[52]   van der Aalst, W. (2016). Process mining: Data science in action. Berlin: Springer-Heidelberg.
[53]   van Hemel, S., Paepen, B., & Engelen, J. (2003). Smart search in newspaper archives using topic maps. In Proceedings of the 7th ICCC/IFIP International Conference on Electronic Publishing. Retrieved on June 29, 2017, from .
[54]   Vlachidis, A., Binding, C., May, K., & Tudhope, D. (2013). Automatic metadata generation in an archaeological digital library: Semantic annotation of grey literature. In Computational Linguistics (pp. 187-202). Berlin: Springer-Heidelberg.
doi: 10.1007/978-3-642-34399-5_10
[55]   White, H., Willis, C., & Greenberg, J. (2014). HIVEing: The effect of a semantic web technology on inter-indexer consistency. Journal of Documentation, 70(3), 307-329.
doi: 10.1108/JD-07-2012-0083
[56]   Zavalina O.L.(2011.
[57]   Zeng, M.L. (2017). Smart data for digital humanities. Journal of Data and Information Science, 2(1), 1-12.
[58]   Zeng, M.L., & Qin, J. (2016). Metadata. New York: Neal-Schuman Publishers, Inc.