
Variables and data types
All analytics begins with recognizing what type of data you are dealing with.
A categorical variable classifies into groups: section, device, country, content type.
A numerical variable expresses quantities: reading time, page views, number of users, conversions.
Within numerical variables, a distinction can be made between discrete variables, which take countable values (page views, number of users), and continuous variables, which can take any value within a range (reading time).
This distinction matters because not all data is summarized or represented in the same way. A device type cannot be meaningfully averaged, only counted; reading time can be summarized with a mean or a median.
Levels of measurement
It is also useful to know, at least at a basic level, the levels of measurement:
• Nominal: categories without order, such as country or device type
• Ordinal: categories with order, such as satisfaction level or priority
• Interval: scales with meaningful differences but without an absolute zero, such as temperature in degrees Celsius or calendar dates
• Ratio: numerical variables with a real zero, such as time, users, or revenue
It is not necessary to turn this into an overly theoretical lesson. It is enough for the student to understand that the type of variable determines which operations and charts make sense.
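As an illustration only, the rule "the type of variable determines which operations make sense" can be written as a small lookup. The table below follows the usual textbook convention (mode for any level, median from ordinal upward, mean from interval upward); the names MEANINGFUL_SUMMARIES and can_summarize are hypothetical and exist only for this sketch.

```python
# Hypothetical lookup: which summaries are meaningful at each level of
# measurement, following the usual textbook convention.
MEANINGFUL_SUMMARIES = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval": {"mode", "median", "mean"},
    "ratio": {"mode", "median", "mean"},
}


def can_summarize(level: str, summary: str) -> bool:
    """Return True if the summary is meaningful for the given level."""
    return summary in MEANINGFUL_SUMMARIES[level]


# Device type is nominal, so averaging it makes no sense.
print("mean of device type:", can_summarize("nominal", "mean"))
# Satisfaction level is ordinal, so a median is acceptable.
print("median of satisfaction:", can_summarize("ordinal", "median"))
# Reading time is a ratio variable, so a mean is acceptable.
print("mean of reading time:", can_summarize("ratio", "mean"))
```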
Frequency tables
A frequency table summarizes how many times each value or category appears. It is one of the simplest and most useful tools for exploring a dataset.
For example, if 100 news articles are classified by section, a frequency table may show how many belong to Politics, Sports, Local, Culture, or Economy. That table already reveals the composition of the output, possible production biases, and areas of concentration.
In numerical variables, frequencies also help visualize distributions by grouping values into ranges.
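A minimal sketch of both cases, using Python's standard library. The article sections and reading times below are invented for illustration, and the 60-second bin width is an arbitrary choice.

```python
from collections import Counter

# Hypothetical sample: the section assigned to each of 10 articles.
sections = [
    "Politics", "Sports", "Local", "Politics", "Culture",
    "Politics", "Economy", "Sports", "Local", "Politics",
]

# Categorical variable: count how many times each category appears.
frequency = Counter(sections)
total = len(sections)
for section, count in frequency.most_common():
    print(f"{section:<10}{count:>3}{count / total:>8.0%}")

# Numerical variable: group values into ranges before counting.
# Hypothetical reading times in seconds, grouped into 60-second bins.
reading_seconds = [45, 70, 95, 110, 130, 150, 185, 240]
bins = Counter((s // 60) * 60 for s in reading_seconds)
for start in sorted(bins):
    print(f"{start:>3}-{start + 59:<3} s: {bins[start]}")
```

Adding the share of the total next to each count makes concentration easier to spot than counts alone.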
Measures of central tendency
Measures of central tendency help summarize a dataset into a representative value.
The mean is the arithmetic average. It is useful but can be affected by extreme values.
The median is the central value of an ordered distribution. It is especially useful when there are outliers or asymmetric distributions.
The mode is the most frequent value. It is mainly interesting in categorical variables or in distributions where repetition matters.
In editorial consumption data, the median is often very valuable because many user behaviors do not follow balanced distributions. A small group of pieces or users may concentrate a disproportionate share of total volume.
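The effect of a skewed distribution on the three measures can be seen with a small sketch. The page-view figures are hypothetical: seven ordinary articles and one viral piece.

```python
from statistics import mean, median, mode

# Hypothetical page views for 8 articles; the last one went viral.
page_views = [120, 150, 150, 180, 200, 220, 250, 9000]

print("mean:  ", mean(page_views))    # pulled upward by the viral piece
print("median:", median(page_views))  # stays close to the ordinary articles
print("mode:  ", mode(page_views))    # the most frequent value
```

In this sample the mean is higher than seven of the eight values, which is why the median is usually the safer summary for this kind of data.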
Measures of dispersion
Knowing the central value is not enough. It is also useful to know how much the data varies.
The range shows the distance between the minimum and maximum value.
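Variance and standard deviation express how far values deviate from the mean. The variance is the average of the squared deviations; the standard deviation is its square root, so it is expressed in the same units as the data.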
In practical terms, dispersion helps answer questions such as:
Are results relatively stable or highly heterogeneous?
Are reading times similar across pieces or do they vary widely?
Does a high average reflect general behavior or only a few extraordinary cases?
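A short sketch of why the central value alone is not enough: the two hypothetical sets of reading times below share the same mean but differ sharply in dispersion. The population formulas (pvariance, pstdev) are used here on the assumption that the list is the whole set of interest; for a sample drawn from a larger population, statistics.variance and statistics.stdev would be the usual choice.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical reading times in seconds for two sections.
stable = [120, 125, 130, 135, 140]
varied = [30, 60, 130, 200, 230]

for name, values in (("stable", stable), ("varied", varied)):
    value_range = max(values) - min(values)
    print(
        f"{name}: mean={mean(values)} "
        f"range={value_range} "
        f"variance={pvariance(values)} "
        f"std dev={pstdev(values):.1f}"
    )
```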
Percentiles and quartiles
Percentiles position a value within a distribution. Saying that an article is in the 90th percentile of reading time means that roughly 90% of pieces have a reading time equal to or lower than it. Quartiles are the 25th, 50th, and 75th percentiles; the 50th percentile is the median.
This logic is very useful in editorial contexts because it allows comparisons without relying only on averages. It is also useful for defining thresholds: top 10%, upper quartile, lower half, and so on.
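As a rough sketch, quartiles and a top-10% threshold can be computed with statistics.quantiles from Python's standard library. The reading times are hypothetical, percentile_rank is a helper defined only for this example, and the "inclusive" method is one of several conventions; other methods give slightly different cut points on small datasets.

```python
from statistics import quantiles

# Hypothetical reading times in seconds for 10 articles.
reading_seconds = [60, 75, 90, 100, 110, 120, 135, 150, 180, 300]


def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


# Quartiles: the three cut points that split the data into four parts.
q1, q2, q3 = quantiles(reading_seconds, n=4, method="inclusive")
print("quartiles:", q1, q2, q3)

# Threshold for the top 10%: the last of the nine decile cut points.
p90 = quantiles(reading_seconds, n=10, method="inclusive")[-1]
print("top 10% threshold:", p90)

# Where does a 180-second article sit within this distribution?
print("percentile rank of 180 s:", percentile_rank(reading_seconds, 180))
```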
Basic representations
A histogram helps visualize how a numerical variable is distributed.
A box plot summarizes distribution, median, dispersion, and possible outliers.
A bar chart is useful for comparing categories.
A line chart works well for temporal evolution.
A scatter plot allows exploration of the relationship between two variables.
A cross table (or contingency table) is very useful for comparing categories with each other, for example section by device, traffic source by content type, or user segment by conversion.
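A cross table can be sketched by counting pairs of categories. The visits below are invented for illustration; in practice the pairs would come from an analytics export.

```python
from collections import Counter

# Hypothetical visits, each recorded as (section, device).
visits = [
    ("Politics", "mobile"), ("Politics", "desktop"), ("Sports", "mobile"),
    ("Sports", "mobile"), ("Culture", "desktop"), ("Politics", "mobile"),
]

# Count each (section, device) combination; missing pairs count as 0.
crosstab = Counter(visits)
sections = sorted({section for section, _ in visits})
devices = sorted({device for _, device in visits})

# Print sections as rows and devices as columns.
print(f"{'':<10}" + "".join(f"{device:>9}" for device in devices))
for section in sections:
    cells = "".join(f"{crosstab[(section, device)]:>9}" for device in devices)
    print(f"{section:<10}" + cells)
```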
Try it yourself
Below are the average reading times for 10 articles published last week:
2:10 · 1:45 · 3:20 · 2:05 · 12:45 · 1:50 · 2:40 · 2:15 · 1:55 · 2:10
(Tip: convert to seconds first to make the calculation easier — 2:10 = 130 seconds, and so on.)
Step 1 — Calculate the mean. Add all values, divide by 10.
Step 2 — Find the median. Sort the values from lowest to highest, then identify the middle value. With 10 values there is no single middle value, so take the average of the fifth and sixth.
Step 3 — Compare.
Consider:
- Are your mean and median very different? Why?
- Which number better represents the “typical” article in this dataset — and why?
- Which single article is almost certainly responsible for the gap? What might explain it?
- If you were reporting to your editor on “how long people are reading,” which figure would you use — and how would you explain your choice?
(Answer: Mean = 197.5 seconds ≈ 3:17 · Median = 130 seconds = 2:10)
Reporting only the mean here is one of the most common misreadings in editorial analytics. A single outlier, the 12:45 article, pulls the mean more than a minute above the median, so the average no longer describes a typical article.
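For readers who prefer to check the exercise in code, here is one possible sketch using the ten reading times listed above. The helper names to_seconds and to_mmss are hypothetical, and to_mmss drops any fraction of a second rather than rounding.

```python
from statistics import mean, median

# The ten reading times from the exercise, as mm:ss strings.
times = ["2:10", "1:45", "3:20", "2:05", "12:45",
         "1:50", "2:40", "2:15", "1:55", "2:10"]


def to_seconds(mmss):
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def to_mmss(total_seconds):
    """Convert seconds back to 'm:ss', dropping any fraction of a second."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes}:{seconds:02d}"


seconds = [to_seconds(t) for t in times]
print("mean:  ", to_mmss(mean(seconds)))
print("median:", to_mmss(median(seconds)))

# Remove the single largest value to see how much it moves the mean.
without_outlier = [s for s in seconds if s != max(seconds)]
print("mean without the longest article:", to_mmss(mean(without_outlier)))
```

With the longest article removed, the mean falls close to the median, which shows that one piece accounts for almost the entire gap.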