Without Unity Catalog in Databricks


I am a Tech Enthusiast having 13+ years of experience in IT as a Consultant, Corporate Trainer, Mentor, with 12+ years in training and mentoring in Software Engineering, Data Engineering, Test Automation and Data Science. I have trained more than 10,000+ IT Professionals and conducted more than 500+ training sessions in the areas of Software Development, Data Engineering, Cloud, Data Analysis, Data Visualizations, Artificial Intelligence and Machine Learning. I am interested in writing blogs, sharing technical knowledge, solving technical issues, reading and learning new subjects.

Before the introduction of Unity Catalog in Databricks, managing data and controlling access within the platform required different approaches. These methods, though effective in certain scenarios, lacked the unified governance, simplified access management, and centralized metadata handling that Unity Catalog now provides. Below is a summary of how things were done before Unity Catalog in the context of data governance, access control, and data management:

1. Data Governance Without Unity Catalog

Prior to Unity Catalog, data governance in Databricks meant applying policies manually across different data systems. Key tasks such as data lineage, metadata management, and auditing had to be handled through custom implementations and separate tools.

Manual Lineage Tracking:

  • Lineage refers to the tracking of the data's origin and how it is transformed over time.

  • Before Unity Catalog, data lineage was manually tracked using external tools like Apache Atlas or custom scripts. These tools were not fully integrated with Databricks and required extensive configuration to map out how data flows across different jobs, pipelines, and transformations.
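
To make the "custom scripts" approach concrete, here is a minimal sketch of a hand-rolled lineage record. It is only an illustration: the `LineageLog` class, the job names, and the dataset names are all hypothetical, and a real setup would persist this information somewhere rather than keep it in memory.

```python
from collections import defaultdict


class LineageLog:
    """Hypothetical in-memory lineage record: output dataset -> (input, job) pairs."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, job, inputs, output):
        # Every job has to declare its inputs and output by hand.
        for src in inputs:
            self._parents[output].add((src, job))

    def upstream(self, dataset):
        """Return every dataset that feeds `dataset`, directly or indirectly."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for src, _job in self._parents.get(current, ()):
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen


log = LineageLog()
log.record("ingest_orders", ["raw.orders_csv"], "bronze.orders")
log.record("clean_orders", ["bronze.orders"], "silver.orders")
log.record("daily_revenue", ["silver.orders", "silver.customers"], "gold.revenue")

print(sorted(log.upstream("gold.revenue")))
# ['bronze.orders', 'raw.orders_csv', 'silver.customers', 'silver.orders']
```

The weak point of this approach is that lineage is only as complete as the `record` calls: a job that forgets to register itself simply disappears from the graph.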

Decentralized Metadata Storage:

  • Metadata (e.g., table structures, schema definitions, and data types) had to be managed independently for each workspace. There wasn't a centralized location to store and access metadata across Databricks environments.

  • Each Databricks workspace kept metadata in its own Hive metastore, often requiring users to maintain and update their own metadata catalogs or rely on external metadata repositories that were not always synchronized with the platform.
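
One consequence of per-workspace metadata is that the same table can drift apart between environments. The sketch below shows the kind of check teams might script for themselves. It assumes the schemas have already been exported into plain dictionaries; the `schema_drift` function, workspace names, and table definitions are hypothetical.

```python
def schema_drift(workspace_a, workspace_b):
    """Compare two {table: {column: type}} mappings and report the differences."""
    report = {}
    for table in sorted(set(workspace_a) | set(workspace_b)):
        cols_a = workspace_a.get(table)
        cols_b = workspace_b.get(table)
        if cols_a is None or cols_b is None:
            report[table] = "missing in one workspace"
        elif cols_a != cols_b:
            changed = sorted(
                col
                for col in set(cols_a) | set(cols_b)
                if cols_a.get(col) != cols_b.get(col)
            )
            report[table] = "columns differ: " + ", ".join(changed)
    return report


# Hypothetical schema exports from two workspaces.
dev = {
    "sales.orders": {"id": "bigint", "amount": "double"},
    "sales.customers": {"id": "bigint", "name": "string"},
}
prod = {
    "sales.orders": {"id": "bigint", "amount": "decimal(18,2)"},
}

print(schema_drift(dev, prod))
# {'sales.customers': 'missing in one workspace', 'sales.orders': 'columns differ: amount'}
```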

Compliance and Auditing:

  • Auditing and ensuring compliance with data policies and standards were done manually using logs and external tools.

  • There was no unified mechanism for tracking who accessed or modified data across workspaces. Organizations had to configure their own logging mechanisms (often inside Apache Spark jobs or through external systems) to keep track of user activities and changes to the data.
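
As a rough illustration of such a home-grown logging mechanism, the sketch below writes one JSON line per access event and then scans the log to answer "what did this user touch?". The record layout, function names, and user names are assumptions made for the example, not a Databricks format.

```python
import json
from datetime import datetime, timezone


def audit_record(user, action, table, when=None):
    """Serialize one access event as a JSON line (hypothetical record layout)."""
    when = when or datetime.now(timezone.utc)
    return json.dumps(
        {"timestamp": when.isoformat(), "user": user, "action": action, "table": table},
        sort_keys=True,
    )


def tables_touched(log_lines, user):
    """Scan the log and collect the (action, table) pairs for one user."""
    touched = set()
    for line in log_lines:
        event = json.loads(line)
        if event["user"] == user:
            touched.add((event["action"], event["table"]))
    return touched


fixed = datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc)
log_lines = [
    audit_record("asha", "read", "sales.orders", fixed),
    audit_record("ravi", "write", "sales.orders", fixed),
    audit_record("asha", "write", "sales.customers", fixed),
]

print(sorted(tables_touched(log_lines, "asha")))
# [('read', 'sales.orders'), ('write', 'sales.customers')]
```

Every job has to call `audit_record` itself, so coverage depends entirely on developer discipline.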


2. Access Control Before Unity Catalog

Without Unity Catalog, managing access to data in Databricks was done using a combination of workspace-level permissions, cluster configurations, and external tools.

Workspace-Level Permissions:

  • Users and groups were assigned permissions at the Databricks workspace level rather than the data level.

  • Permissions controlled access to the overall workspace but were not fine-grained enough for individual data objects, such as tables, schemas, or views. This made it challenging to manage who could access specific data assets within the workspace.

Cluster-Level Permissions:

  • Users who had cluster access could potentially access all data within the workspace, unless additional restrictions were applied.

  • This broad level of access made it harder to implement least-privilege access, where users are only given the minimum level of access necessary for their work.
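
The gap between cluster-level access and least privilege can be shown with a deliberately simplified model. The sketch below assumes that anyone who can use a cluster can read every table, which ignores any additional restrictions a workspace may have applied; the table and user names are hypothetical.

```python
ALL_TABLES = {"sales.orders", "sales.customers", "hr.salaries"}


def readable_with_cluster_access(user, cluster_users):
    """Coarse model: being able to use the cluster means reading every table."""
    return set(ALL_TABLES) if user in cluster_users else set()


def excess_access(user, cluster_users, needed):
    """Tables the user can read but does not need for their work."""
    return readable_with_cluster_access(user, cluster_users) - set(needed)


cluster_users = {"asha", "ravi"}

# asha only needs the orders table, yet the coarse model exposes the rest too.
print(sorted(excess_access("asha", cluster_users, {"sales.orders"})))
# ['hr.salaries', 'sales.customers']
```

Under least privilege the result of `excess_access` would be empty for every user, which is exactly what this coarse model cannot guarantee.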

External Identity and Permissions Management:

  • For managing user identities and access, Databricks integrated with external identity systems such as Active Directory and with authorization protocols such as OAuth. However, this integration governed workspace-level access rather than data-level access.

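  • Managing fine-grained permissions on individual tables or views often required custom configurations and complex workflows, such as enabling legacy table access control on the workspace's Hive metastore and granting privileges separately in each workspace.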
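
To illustrate why fine-grained permissions meant repeated manual work, here is a minimal sketch that builds the grant statements a team would have to run in every workspace. The statement text follows the general SQL form `GRANT ... ON TABLE ... TO ...`; treat the exact syntax as an assumption to verify for your environment. The workspace names, tables, and the `analysts` group are hypothetical, and nothing here connects to Databricks.

```python
def grant_statements(grants):
    """Build one GRANT statement per (privilege, table, principal) triple."""
    return [
        f"GRANT {privilege} ON TABLE {table} TO `{principal}`"
        for privilege, table, principal in grants
    ]


# Hypothetical workspaces, each with its own metastore and its own permissions.
workspaces = ["dev", "staging", "prod"]
grants = [
    ("SELECT", "sales.orders", "analysts"),
    ("SELECT", "sales.customers", "analysts"),
]

# The same grants have to be repeated for every workspace.
plan = {ws: grant_statements(grants) for ws in workspaces}
total = sum(len(statements) for statements in plan.values())

print(plan["dev"][0])
print(total)  # 2 grants x 3 workspaces = 6 statements to keep in sync
```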

3. Data Management Without Unity Catalog

Data management before Unity Catalog was often fragmented and required manual setup and integration.

Manual Organization of Data:

  • Data organization in Databricks was based on workspaces and DBFS (Databricks File System), where users would create directories and store data in files. Managing large volumes of data in this decentralized structure required additional tools and scripting.

  • There was no standard way to manage data objects like tables, views, and schemas across workspaces.
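
Because there was no standard layout, teams often had to invent and enforce their own path conventions. The sketch below shows one possible helper. The `dbfs:/mnt/...` layout, the mount name, and the bronze/silver/gold layer names are assumptions chosen for illustration, not a Databricks requirement.

```python
from pathlib import PurePosixPath

# Hypothetical team convention: dbfs:/mnt/<mount>/<layer>/<domain>/<table>
ALLOWED_LAYERS = {"bronze", "silver", "gold"}


def dataset_path(mount, layer, domain, table):
    """Build a DBFS path that follows the team's (hypothetical) convention."""
    if layer not in ALLOWED_LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return "dbfs:" + str(PurePosixPath("/mnt") / mount / layer / domain / table)


print(dataset_path("datalake", "silver", "sales", "orders"))
# dbfs:/mnt/datalake/silver/sales/orders
```

A helper like this only works if every notebook and job uses it, which is why such conventions tended to erode over time.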

Limited Integration with External Data Sources:

  • Although Databricks supported integration with various external storage systems (e.g., Amazon S3 and Azure Blob Storage, often holding Delta Lake tables), the management and organization of this external data were not fully centralized. Each workspace would independently manage connections to external sources, making it harder to apply consistent governance or permissions across different sources.

Metadata Management:

  • Managing metadata was a separate task that required integration with external tools or custom scripts.

  • Without Unity Catalog's centralized metastore, metadata for tables, schemas, and other objects was often scattered across different workspaces or metadata repositories, requiring synchronization between various systems.

Challenges Before Unity Catalog:

  • Fragmented Governance: There was no single, unified framework to ensure consistent data governance across the platform. Different tools were needed for lineage tracking, auditing, and managing metadata, often requiring significant overhead to integrate and maintain.

  • Complex Access Management: Managing fine-grained data access at the level of individual tables, schemas, or data assets was difficult without Unity Catalog's centralized access controls. Permissions had to be manually configured for each workspace or cluster, which increased the potential for errors and security risks.

  • Lack of Metadata Centralization: Storing and managing metadata across multiple workspaces was cumbersome. Users had to rely on external systems to manage data definitions, lineage, and schema information, leading to inconsistencies and challenges with synchronization.

Summary of Key Differences:

Naveen P.N's Tech Blog