Glossary of terms

Frequently used terms and their meanings

Account: An Account represents an individual customer, a business, or a partner organization with whom we do business. Accounts are classified as:

- Multi-recurring
- Recurring
- One-time

Account Owner: An Account Owner is a designated point of contact from Grepsr responsible for delivery, support and account expansion. An Account Owner is reserved for certain account types only.

Platform: The Platform, or Grepsr’s Data Platform, is Grepsr’s proprietary, enterprise-grade system for data project management. It comprises two complementary pieces:

- Data Infrastructure: A state-of-the-art data infrastructure for crawler execution and data acquisition
- Data Management Platform: A purpose-built, web-based platform for data project management

Project: A Project is a vehicle through which customer requirements are translated into workable data and value is delivered. Project requirements are of two types:

- Data requirements: URLs and data points to extract, plus any additional instructions required to pull the data
- Delivery requirements: Frequency of delivery and delivery destinations

Report: Project requirements are grouped into distinct sets called Reports. A Report represents a use case, or a granular set of data and delivery requirements that can be executed at once and delivered together.

Each Report is associated with a Crawler (or Service), which is a set of programmatic instructions for sourcing data. A Report is associated with a unique Crawler version. A successful Project has at least one Report.

Crawler (or Service): A Crawler programmatically opens and interacts with a website and parses content to extract data. Crawlers are versioned to reflect changes in data scope over time. Although multiple versions of a Crawler may exist, only a single active version is associated with a Report at any given time.

Internally, a Crawler is referred to as a Service.
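The relationship between a Report and its Crawler versions can be sketched as a small data model. This is only an illustration: the class names, fields, and the `activate` method are hypothetical and do not describe how the Platform stores these entities.

```python
from dataclasses import dataclass, field


@dataclass
class Crawler:
    """Hypothetical model: a Crawler with one or more numbered versions."""
    name: str
    versions: list = field(default_factory=list)


@dataclass
class Report:
    """Hypothetical model: a Report tied to exactly one active Crawler version."""
    name: str
    crawler: Crawler
    active_version: int

    def activate(self, version: int) -> None:
        # Only a version that exists on the Crawler can become active.
        if version not in self.crawler.versions:
            raise ValueError(f"unknown Crawler version: {version}")
        # Assigning replaces the previous value, so only one version is active.
        self.active_version = version


crawler = Crawler(name="product-listing", versions=[1, 2])
report = Report(name="Daily prices", crawler=crawler, active_version=1)
report.activate(2)  # the data scope changed, so switch to the newer version
```

Keeping the active version as a single field on the Report, rather than a flag on each version, makes it impossible for two versions to be active at once.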

Run: A Run is a single execution of a Crawler. The term is also used as a verb to describe the act of executing a Crawler.

Dataset: Data output resulting from a Run is referred to as a Dataset.

Page: Pages in a Dataset are akin to sheets in a spreadsheet. Each Dataset has at least one Page. Multiple Pages are used when customer requirements call for them, often to normalize the final output as in a relational database, or for separation of concerns.

Each Page contains a data grid consisting of Columns and Rows. Occasionally, the content of a Page is structured as JSON instead.

Columns: The extracted fields in a Dataset (or in a Page of a Dataset) are organized under headers, each of which is referred to as a Column.

Indexed Column: Indexing a Column means that the data output for that Column is stored in a way that allows us to filter, sort, and search across millions of records without delay.

Rows: Each record (line) in a Dataset is referred to as a Row.

Object: A record in JSON output is referred to as an Object. A Row is flat and one-dimensional, whereas an Object can be layered (nested).
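The difference between a Row and an Object can be shown with a minimal sketch. The field names (`name`, `price`, `seller`) are hypothetical and are not taken from any actual Dataset.

```python
# A Row is flat: every Column maps to a single scalar value.
row = {"name": "Widget", "price": 9.99, "seller_name": "Acme"}

# An Object can be layered: a value may itself be an object or a list.
obj = {
    "name": "Widget",
    "price": 9.99,
    "seller": {"name": "Acme", "ratings": [5, 4]},
}


def is_flat(record: dict) -> bool:
    """Return True if no value in the record is a nested dict or list."""
    return not any(isinstance(value, (dict, list)) for value in record.values())
```

Here `is_flat(row)` is true while `is_flat(obj)` is false, because `seller` holds a nested structure.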

Quality: Quality is an umbrella term for the quantitative and qualitative measures of the overall health of a Report. Higher Quality implies that the sourced data meets the required standards and properly represents the source. Quality is measured using various factors such as Accuracy, Completeness, Data Distribution, Rows and Requests.

Accuracy: A percentage score that measures whether sourced data complies with the expected data format. Compliance is validated using rules assigned to different Columns in a Dataset. Full compliance results in 100% Accuracy. If no rules are assigned to any of the Columns in a Dataset, then the resulting Accuracy is null.

Accuracy is measured per Column and aggregated for the entire Dataset. Accuracy is used as a Quality indicator in case the score deviates from the norm.
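As a rough illustration of the Accuracy calculation, the sketch below validates each Column against a rule and aggregates the result. The rule type, the function names, and the sample data are hypothetical. The glossary does not specify how per-Column scores are aggregated, so this sketch assumes a plain average, and it also assumes an empty Dataset yields null.

```python
from typing import Callable, Optional

Rule = Callable[[str], bool]  # hypothetical rule type: True means the cell complies


def column_accuracy(values: list, rule: Rule) -> float:
    """Percentage of cells in one Column that comply with its rule."""
    passed = sum(1 for value in values if rule(value))
    return 100.0 * passed / len(values)


def dataset_accuracy(rows: list, rules: dict) -> Optional[float]:
    """Aggregate Accuracy; None (null) when no rules are assigned or there are no rows."""
    if not rows or not rules:
        return None
    scores = [
        column_accuracy([str(row.get(column, "")) for row in rows], rule)
        for column, rule in rules.items()
    ]
    # Assumption: aggregate as the plain average of the per-Column scores.
    return sum(scores) / len(scores)


def is_decimal(value: str) -> bool:
    """Example rule: digits with at most one decimal point."""
    return value.replace(".", "", 1).isdigit()


rows = [{"price": "9.99", "sku": "A1"}, {"price": "n/a", "sku": "B2"}]
price_accuracy = dataset_accuracy(rows, {"price": is_decimal})  # one of two cells complies
no_rules_accuracy = dataset_accuracy(rows, {})  # no rules assigned to any Column
```

With this sample data, `price_accuracy` is 50.0 and `no_rules_accuracy` is `None`, matching the null Accuracy described above.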

Completeness: Completeness is a state where the data contains all the information available to extract from the source. Completeness is measured using Fill Rate.

Fill Rate: A percentage score that measures data density. An empty cell in a Dataset has a Fill Rate of 0 (or 0%). Conversely, a cell with data has a Fill Rate of 1 (or 100%). The aggregated score for a Column is the average across all cells in that Column. Likewise, the aggregated score for the entire Dataset is the average across all cells in the Dataset.

Fill Rate is used as an indicator for Quality in case the score, for a Column or the entire Dataset, deviates from expected values.
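The Fill Rate arithmetic can be sketched as follows. The function names and sample data are hypothetical, and treating `None` and whitespace-only strings as empty is an assumption of this sketch.

```python
def is_filled(cell) -> bool:
    """Assumption: a cell is empty if it is None or contains only whitespace."""
    return cell is not None and str(cell).strip() != ""


def column_fill_rate(rows: list, column: str) -> float:
    """Average Fill Rate across all cells of one Column (0.0 to 1.0)."""
    cells = [row.get(column) for row in rows]
    return sum(is_filled(cell) for cell in cells) / len(cells)


def dataset_fill_rate(rows: list, columns: list) -> float:
    """Average Fill Rate across every cell in the Dataset (0.0 to 1.0)."""
    cells = [row.get(column) for row in rows for column in columns]
    return sum(is_filled(cell) for cell in cells) / len(cells)


rows = [
    {"name": "Widget", "price": "9.99"},
    {"name": "Gadget", "price": ""},  # empty price cell
]
price_rate = column_fill_rate(rows, "price")
name_rate = column_fill_rate(rows, "name")
overall_rate = dataset_fill_rate(rows, ["name", "price"])
```

In this example one of the two `price` cells is empty, so `price_rate` is 0.5 (50%), `name_rate` is 1.0 (100%), and `overall_rate` is 0.75 (75%) because three of the four cells contain data.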

Data Distribution: Data Distribution measures how often each value occurs in a Column. It is only available for Indexed Columns. It is used as a proxy for Quality in case the distribution deviates from the norm.
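A minimal sketch of a Data Distribution calculation is shown below. The function name and sample values are hypothetical, and expressing occurrences as shares of the total (rather than raw counts) is an assumption of this sketch.

```python
from collections import Counter


def data_distribution(values: list) -> dict:
    """Share of each distinct value among all values in a Column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


availability = ["in stock", "in stock", "out of stock", "in stock"]
distribution = data_distribution(availability)
```

Here `distribution` maps `"in stock"` to 0.75 and `"out of stock"` to 0.25. A later Run in which most values are suddenly `"out of stock"` would be a distribution shift worth investigating.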

Requests: A Request is an HTTP request made to the server to retrieve content. The Crawler makes a series of Requests to load and interact with a web page in order to extract the necessary data. A Request is either:

- Successful: the server served the requested content
- Failed: the server returned an error and the requested content could not be retrieved

The number of Failed Requests is also used as a proxy for Quality. While some failures are expected, a sudden increase in Failed Requests can imply that the sourced data does not meet the required standards.

Retries: A Retry is a subsequent Request made to the server when a previous Request fails. The number of Retries is capped in case of persistent failures.
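Capped Retries can be sketched as a simple loop. The function name, the `send` callable, and the default cap of 3 are hypothetical; the glossary does not state the actual cap or how Crawlers schedule Retries.

```python
from typing import Callable, Tuple


def request_with_retries(send: Callable[[], bool], max_retries: int = 3) -> Tuple[bool, int]:
    """Call send() until it succeeds or the Retry cap is reached.

    send is a hypothetical callable that returns True for a Successful
    Request and False for a Failed one. Returns (succeeded, retries_used).
    """
    retries = 0
    while True:
        if send():
            return True, retries
        if retries >= max_retries:
            # Persistent failure: stop retrying once the cap is hit.
            return False, retries
        retries += 1


# Simulated server that fails twice, then succeeds.
outcomes = iter([False, False, True])
recovered = request_with_retries(lambda: next(outcomes))

# Simulated server that always fails.
gave_up = request_with_retries(lambda: False)
```

In this sketch `recovered` is `(True, 2)`, since the third attempt succeeds after two Retries, and `gave_up` is `(False, 3)`, since the loop stops after the initial Request plus three Retries.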

Team: A set of users belonging to the same Account is referred to as a Team. Individuals in a Team can have one of two system roles:

- Team Manager - with administrative rights and access to all Projects in the Account
- Viewer - with limited rights and access only to the Projects in the Account that they have been added to