Data distribution measures how often each value occurs in a column. Because producing a distribution from potentially millions of rows of data is no trivial task, distribution is available only for indexed columns. Indexing a column essentially flags it so that the underlying data is stored for on-the-fly calculations.
Fig: Data distribution for a specific column in a dataset
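As a rough illustration of what a value distribution is, the sketch below counts how often each value occurs in one column of a small in-memory dataset. It is not how the product computes distributions; the `value_distribution` function, the `rows` sample, and the `country` column are all hypothetical names chosen for the example.

```python
from collections import Counter


def value_distribution(rows, column):
    """Count how often each value occurs in `column` across `rows`.

    `rows` is assumed to be a list of dicts, one per record.
    Records that lack the column are counted under None.
    """
    return Counter(row.get(column) for row in rows)


# Hypothetical sample data, for illustration only.
rows = [
    {"country": "US"},
    {"country": "US"},
    {"country": "DE"},
    {"country": None},
]

dist = value_distribution(rows, "country")
for value, count in dist.most_common():
    print(value, count)
```

On millions of rows, a full scan like this is expensive, which is consistent with the distribution being offered only for indexed columns.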
Data distribution lets us detect changes in the underlying dataset that can be a proxy for quality issues. To view data distribution:
- Locate the dataset in the list on the Datasets page, then switch to the Dashboard tab.
- Scroll to the bottom of the page and use the page filter to navigate to the correct page. You can ignore this filter if it offers no other options.
- Choose ‘indexed columns’ from the columns dropdown to show only indexed columns.
- Click the graph icon next to the column name. This opens a pop-up showing the data distribution.
- Use the graph to see how the distribution is trending. You can also click on different datasets in the graph to load their value distribution.
The data distribution graph shows only the 10 most frequently occurring values in the column.
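To make the idea of spotting changes concrete, here is a minimal sketch that keeps only the 10 most frequent values from two versions of a column and reports how each value's share moved between them. The function names, the `TOP_N` constant, and the sample data are assumptions made for this example and are not part of the product.

```python
from collections import Counter

TOP_N = 10  # mirrors the graph, which shows only the 10 most frequent values


def top_shares(values, n=TOP_N):
    """Return {value: share of all rows} for the n most frequent values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.most_common(n)}


def share_changes(old_values, new_values, n=TOP_N):
    """Change in share (new minus old) for every value in either top-n list.

    A value missing from one version's top n is treated as a share of 0.0
    there, so its change may be overstated.
    """
    old = top_shares(old_values, n)
    new = top_shares(new_values, n)
    return {v: new.get(v, 0.0) - old.get(v, 0.0) for v in set(old) | set(new)}


# Hypothetical column values from two versions of a dataset.
previous = ["A"] * 8 + ["B"] * 2
current = ["A"] * 5 + ["B"] * 5

changes = share_changes(previous, current)
for value, delta in sorted(changes.items()):
    print(f"{value}: {delta:+.2f}")
```

A large swing in the share of a common value is the kind of signal worth checking as a possible quality issue. What counts as large depends on the dataset.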
Topics in this section: