Data catalogs as enablers for software certification based on data intelligence

Manu Cohen-Yashar
Apr 3, 2020

Software certification is a standard procedure that almost all modern companies carry out before adopting a software tool or package. The process verifies that a new piece of software will not violate the company’s security and privacy policies. Security is a complex matter that touches many aspects of any software product: from identity, through cryptography, to software development practices, the list of things to check when certifying software is long.

Intelligent data management is at the core of any security and privacy policy, yet integrating external software makes it significantly harder. To ensure compliance with security and privacy policies, one needs to answer a long list of complicated questions. For example: What data does the system use? Where is it stored? How is it handled? Which processes are involved? What is the data’s trajectory through the system? The list goes on.

Unfortunately, many of those details are missing from the product documentation that vendors provide, so integration and certification professionals cannot tell what the actual effect of approving the software will be; with respect to data, it is practically a black box. Data catalogs are designed to uncover the real data-related behavior of systems in real time and to answer the questions mentioned above. Using a data catalog in the certification process can ensure that certified software complies with the company’s high security and privacy standards.

Automatic Data Discovery

Discovery is at the core of data intelligence, insight, and analysis, and it needs to be both capable and automated to handle the volume and variety of data that organizations collect. Effective and sustainable privacy, security, and governance programs require in-depth discovery that lets organizations go beyond the surface of their data. That means not only finding and identifying more types of sensitive and personal data with greater accuracy, but also applying context, insight, and perspective to that data, which in turn helps inform policy and controls.

It is no longer enough to identify only common types of sensitive data that match regular expressions (like credit card numbers or social security identifiers). Privacy regulations like the CCPA and GDPR have transformed the very definition of personal data, extending it to a much broader set of data that includes geolocation, friendly names, online activity, and more. Unlike earlier regulations, today’s data privacy initiatives focus on data that can be related to an individual, which means that data discovery solutions need to identify personal data not just by type, but also from contextual clues and relationships to other data points.

Furthermore, organizations are now responsible not only for protecting that data, but also for monitoring and reporting on whose information it is, where it came from, and where it is going. Privacy-centric data discovery (a must for data privacy and cybersecurity in today’s environment) requires a multi-pronged strategy to identify, classify, correlate, and catalog all types of sensitive and personal data in an organization, and that strategy starts with automatic discovery.
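To make the difference between pattern-based and context-based identification concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the patterns, the column-name hints, the 0.8 threshold, and the function names (`classify_column`, `discover`) are all hypothetical, and real discovery tools rely on far richer signals than substring matching on column names.

```python
import re

# Hypothetical pattern-based detectors (illustrative only, not production-grade).
PATTERNS = {
    "credit_card": re.compile(r"^\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

# Hypothetical contextual hints: column-name fragments that suggest personal
# data even when the values match no fixed pattern. Substring matching is
# deliberately naive and would produce false positives on real schemas.
CONTEXT_HINTS = {
    "geolocation": ("lat", "lon", "geo"),
    "friendly_name": ("nickname", "display_name"),
}


def classify_column(name, values, threshold=0.8):
    """Return labels suggested by value patterns and by the column name."""
    labels = set()
    non_empty = [v for v in values if v]
    # Pattern-based: label the column if most values match a known pattern.
    for label, pattern in PATTERNS.items():
        if non_empty:
            hits = sum(1 for v in non_empty if pattern.match(v))
            if hits / len(non_empty) >= threshold:
                labels.add(label)
    # Context-based: label the column if its name contains a known hint.
    lowered = name.lower()
    for label, fragments in CONTEXT_HINTS.items():
        if any(fragment in lowered for fragment in fragments):
            labels.add(label)
    return labels


def discover(table):
    """Classify every column of a table given as {column_name: [values]}."""
    findings = {}
    for name, values in table.items():
        labels = classify_column(name, values)
        if labels:
            findings[name] = labels
    return findings


# Made-up sample data for illustration.
sample = {
    "card": ["4111 1111 1111 1111", "5500-0000-0000-0004"],
    "contact": ["ann@example.com", "bob@example.org"],
    "last_lat": ["32.0853", "40.7128"],
    "notes": ["called twice", "prefers email"],
}
findings = discover(sample)
```

In this sketch the `last_lat` column is flagged only because of its name: its values match no fixed pattern, which is the kind of case that pattern matching alone would miss.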

What is a Data Catalog?

A data catalog is the primary tool for organizing an organization’s thousands or millions of data sets according to business needs. With a data catalog, users can search for specific data and understand its context and flow. Data catalogs are the core of any data management strategy; they enable data-driven decision making and are often essential for regulatory compliance. For example, only with a data catalog can a large organization comply with GDPR requirements such as the ability to find and delete all instances of a specific customer’s information within a short period of time.
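As a rough illustration of the two capabilities described here, searching data sets by business context and locating every instance of one customer’s information, the following Python sketch models a catalog as an in-memory index. Everything in it is an assumption made for the example: the `DatasetEntry` and `DataCatalog` classes, their fields, and the sample data set names are hypothetical and do not reflect any vendor’s API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Catalog metadata for one data set (all fields are illustrative)."""
    name: str
    location: str
    owner: str
    tags: set = field(default_factory=set)
    # Columns that hold a customer identifier, mapped to the IDs seen there.
    customer_ids: dict = field(default_factory=dict)


class DataCatalog:
    """A toy in-memory catalog keyed by data set name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, tag):
        """Return the names of data sets carrying the given tag, sorted."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def locate_customer(self, customer_id):
        """Return sorted (data set, column) pairs where the customer ID appears."""
        hits = []
        for entry in self._entries.values():
            for column, ids in entry.customer_ids.items():
                if customer_id in ids:
                    hits.append((entry.name, column))
        return sorted(hits)


# Made-up entries for illustration.
catalog = DataCatalog()
catalog.register(DatasetEntry(
    name="crm.customers", location="crm database", owner="sales",
    tags={"pii", "customer"},
    customer_ids={"customer_id": {"C-100", "C-200"}},
))
catalog.register(DatasetEntry(
    name="billing.invoices", location="billing database", owner="finance",
    tags={"pii", "finance"},
    customer_ids={"payer_id": {"C-100"}},
))
catalog.register(DatasetEntry(
    name="web.page_views", location="analytics store", owner="marketing",
    tags={"analytics"},
))

pii_datasets = catalog.search("pii")
places_to_erase = catalog.locate_customer("C-100")
```

Here `places_to_erase` lists each data set and column that a deletion request for customer `C-100` would need to cover. A real catalog would build this index through automated discovery instead of manual registration.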

What is Data Lineage?

Data lineage traces the origins, movements, and joins of your data to provide insight into its quality. Data lineage tools often use a graphical interface to show the data’s journey from inception to use (ETL, databases, business intelligence, etc.), including its dependencies, where it is joined with other data, and whether it has been changed or updated. These tools give you more control over your data by allowing error tracking and adjustments when needed. They can also facilitate process changes, metadata management, self-service analytics, and data governance.
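One common way to think about lineage is as a directed graph in which each edge records that data flows from one asset to another. The Python sketch below follows that idea under stated assumptions: the `LineageGraph` class, its method names, and the asset names are hypothetical, and the traversal is a plain breadth-first search, not the algorithm of any specific lineage tool.

```python
from collections import defaultdict, deque


class LineageGraph:
    """A toy lineage graph: an edge means data flows from source to target."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_flow(self, source, target):
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first search returning every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def origins(self, asset):
        """All assets that the given asset is derived from, directly or not."""
        return self._walk(asset, self._upstream)

    def impacted(self, asset):
        """All assets that would be affected by a change to the given asset."""
        return self._walk(asset, self._downstream)


# Made-up flows for illustration: two sources are joined and feed a dashboard.
lineage = LineageGraph()
lineage.add_flow("crm.customers", "etl.clean_customers")
lineage.add_flow("etl.clean_customers", "etl.join_orders")
lineage.add_flow("web.orders", "etl.join_orders")
lineage.add_flow("etl.join_orders", "bi.revenue_dashboard")

dashboard_origins = lineage.origins("bi.revenue_dashboard")
customer_impact = lineage.impacted("crm.customers")
```

Walking upstream supports error tracking (where did a bad value on the dashboard come from?), while walking downstream shows which reports are affected before a source is changed.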

Leading vendors

There are many players in the metadata management space. The following are a few leading products that provide a complete, automated solution for in-depth data discovery, classification, correlation, cataloging, and metadata visualization for enterprise data at large scale.

· BigID

· Informatica Enterprise Data Catalog

· Alation Data Catalog

· Waterline

To summarize

Software certification is never complete without analyzing the software’s real behavior with respect to data. Modern intelligent data catalogs perform real-time, in-depth analytics of software data operations and, as such, should be used as a vital component of the certification process.
