0
$\begingroup$

My general rule of thumb is that histograms should be used for continuous data, and bar charts for categorical data. (obviously not my rule)

What about dates? They are non-continuous (unlike, say, datetimes) but are ordered. If grouping by month or year, I would opt for a bar chart, but I am unsure what to use when grouping by day.

$\endgroup$
6
  • 1
    $\begingroup$ Welcome to CV, Bob. I would suggest not using the type of measurement as a guide to any form of display or analysis. Focus instead on what information you want your graphic to convey and how it should do so. Dates in fact are a continuous form of measurement that have been discretized into 24-hour bins; but when you are analyzing a relatively large range of dates, you can safely treat them in the same way you would any continuous variable. Thus, the specific meaning of "continuous" in your question is itself ambiguous. $\endgroup$ Commented May 17 at 13:56
  • $\begingroup$ Note also that there is really not such a difference between bar chart and histogram; for years on the x-axis, a bar chart is just a histogram with a bin size of 1 (with quarters on the x-axis, a bar chart by year is a histogram with a bin size of 4, etc...). And if you have dates, why not display the data simply as a time series? Last, note that the x-axis in a histogram usually is a random variable; dates are not random at all... $\endgroup$ Commented May 17 at 17:09
  • 1
    $\begingroup$ I'd put it quite the other way. A histogram is just a bar chart of special kind, in which the bins are intervals of values (possibly just one value in each) and the bars show frequency, proportion, percent, or density. Regardless of my preferred definitions, I can't follow what the question is here. $\endgroup$ Commented May 17 at 17:59
  • 2
    $\begingroup$ Yet further: what are dates precisely? I routinely see data for daily, weekly, monthly, quarterly, half-yearly, yearly dates, and yet others. Is the question singling out daily dates? And days can be subdivided: there is no difference in statistical or graphical principle because when we have (e.g.) hourly data, we tend to talk about times not dates. $\endgroup$ Commented May 17 at 18:03
  • 1
    $\begingroup$ Why do you insist on one or the other? Perhaps some other plot is better. If you tell us what you are trying to find out or to illustrate, we can give better advice. $\endgroup$ Commented May 17 at 22:21

1 Answer

1
$\begingroup$

Let me start by saying that all data we collect for statistical analysis are discrete, and by adding that it does not matter...

Why is there no actual continuous data? Because anything we measure, we cannot measure with infinite resolution. We may measure elapsed time in years, quarters, months, etc. but not in seconds. Even if we measured elapsed time in seconds, we would not do so in femtoseconds (because for what we are studying, such a fine resolution would be impractical, and of no value). And even if we measured down to femtoseconds, we would not measure down to the Planck time.

And it does not matter. If we measure date/time down to a resolution which matters for our problem, we can very safely treat it as continuous for all our tests/inferences/etc. You measured date/time to a resolution of 1 year; if that serves the purposes of your investigation, so be it; treat it as continuous all you want. And you could have measured date/time by decades, centuries, or quarters, months, days, etc. It is just the resolution of your measurement; if it gives you the resolution you need... And the resolution of our measurements does not affect the continuous/discrete nature of the underlying variable.

However, some variables are inherently continuous in nature (regardless of the resolution used to measure them), while others are inherently discrete. But even there, the distinction is, let's say, fluid (or more fluid than I think it should be...).

Many (most?) numerical variables are inherently continuous; temperature, length, weight, date/time (your case), duration, voltage, concentration of an analyte, etc. are continuous. We decide, as appropriate for our purposes, to measure them with a given resolution, but this does not change their continuous nature; we could have measured them with 100x the resolution (which would have been impractical, prohibitively expensive, etc. but we could have...).

And we have continuous statistical distributions to model/test these variables (Gaussian, t, chi-square, F, beta, gamma, Weibull, etc...).

Other numerical variables (much fewer of them) are inherently discrete; pretty much all of these are counts. For example a coin toss (Bernoulli), or the count of heads in $k$ coin tosses (Binomial), or the number of calls to a service center in an hour (Poisson), etc. Here, using tenths of a count, or tens of counts, does not make sense; the only rational increment is a single count.

And we have dedicated distributions and tests (binomial, Fisher exact, etc.) as well.

Does the distinction matter? Certainly more so than between continuous data (which is a figment of the imagination) and discrete data (which they all are). But maybe not as much as it should?

The "best" (??) counterexample is the $\chi^2$ test of association; it is used to compare 2 (discrete) proportions, but relies on a continuous distribution. Now, under some circumstances, it gives reasonable results. But it treats discrete variables as if they were continuous (normal approximation of the binomial). So much so, that some have developed continuity corrections for it (to get results which come closer to the "exact" discrete tests).
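To make the point about continuity corrections concrete, here is a minimal sketch using only the Python standard library. The 2x2 table is made up for illustration (it is not data from the question), and the helper names `chi2_2x2` and `fisher_exact_2x2` are my own. It computes the Pearson $\chi^2$ p-value with and without the Yates correction, and a two-sided Fisher exact p-value for comparison.

```python
import math

def chi2_2x2(a, b, c, d, yates=False):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates continuity correction: shrink |ad - bc| by n/2, never below zero
        diff = max(0.0, diff - n / 2)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value: sum the hypergeometric probabilities
    of all tables (same margins) that are no more likely than the observed one."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        return math.comb(r1, x) * math.comb(r2, c1 - x) / math.comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so that equally likely tables are included
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: successes/failures in two groups of 20
table = (12, 8, 5, 15)
stat_plain, p_plain = chi2_2x2(*table)
stat_yates, p_yates = chi2_2x2(*table, yates=True)
p_fisher = fisher_exact_2x2(*table)

print(f"chi-square, no correction: p = {p_plain:.4f}")
print(f"chi-square, Yates:         p = {p_yates:.4f}")
print(f"Fisher exact (two-sided):  p = {p_fisher:.4f}")
```

For this particular made-up table, the corrected p-value lands closer to the exact one than the uncorrected p-value does; that is not guaranteed for every table.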

Some practitioners also sometimes discretize continuous variables (by binning the data - which is what a histogram does, btw), or even dichotomize them. This practice is generally frowned upon, as it loses a lot of information, but it can sometimes be a pragmatic approach (fwiw, I have dichotomized data from continuous variables, to obtain Tolerance Intervals, when the data was very non-normal. The price was of course larger sample sizes, but it was still a practical approach).

And it can even go further. If I were to compare, say, the number of events (accidents?) which occurred on a certain stretch of road in 2 different years, now the year would be treated as categorical data.

So, is there a difference between continuous and discrete data? No (continuous data does not exist - and it does not matter). Is there a difference between discrete and continuous variables? Yes, but statisticians can play loose with that difference.

Bottom line; your date/time data is continuous, regardless of the resolution (days or seconds, or years or decades).

Now, can you get a histogram of data from a discrete variable? Of course. Below is a histogram of a sample from a discrete binomial B(100,.3) distribution:

[Figure: Histogram of B(100,.3) sample of size 100]

Could I also have taken a bar chart of it? Of course. See below:

[Figure: Bar chart of B(100,.3)]
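For anyone who wants to reproduce something similar, here is a rough sketch of how one sample can be tabulated both ways, using only the Python standard library. The seed and the bin width of 5 are arbitrary choices of mine, so the output will not match the figures above exactly.

```python
import random
from collections import Counter

random.seed(0)  # arbitrary seed, only for reproducibility
n_trials, p, size = 100, 0.3, 100

# Each draw from B(100, .3) is the number of successes in 100 Bernoulli(.3) trials
sample = [sum(random.random() < p for _ in range(n_trials)) for _ in range(size)]

# "Bar chart" view: one bar per distinct observed value
bar_counts = Counter(sample)

# "Histogram" view: group the same values into bins of width 5,
# keyed by the left edge of each bin
bin_width = 5
hist_counts = Counter((x // bin_width) * bin_width for x in sample)

print("Bar chart (one bar per value):")
for value in sorted(bar_counts):
    print(f"{value:3d} {'#' * bar_counts[value]}")

print("Histogram (bins of width 5):")
for left in sorted(hist_counts):
    print(f"[{left:3d}, {left + bin_width:3d}) {'#' * hist_counts[left]}")
```

Both tabulations account for the same 100 observations; only the grouping differs.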

Could I take a histogram of data from a continuous variable? Of course.
And a bar chart? Yes, I could, but given that there will be few "ties" (exactly equal values), it will have a lot of bars with height 1... Not very informative.
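A quick sketch of why that bar chart is uninformative, again with the standard library only. The normal distribution with mean 30 and standard deviation 4.6 is my own assumption, picked to roughly match the B(100,.3) example.

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed, only for reproducibility

# 100 draws from a continuous distribution (assumed: normal, mean 30, sd 4.6)
continuous = [random.gauss(30.0, 4.6) for _ in range(100)]

# One "bar" per distinct value: with floating-point draws, ties are essentially absent
value_counts = Counter(continuous)
bars_of_height_one = sum(1 for c in value_counts.values() if c == 1)
print(f"{len(value_counts)} bars, {bars_of_height_one} of them with height 1")

# Binning (here: rounding to the nearest integer) is what makes the shape visible
rounded_counts = Counter(round(x) for x in continuous)
for value in sorted(rounded_counts):
    print(f"{value:3d} {'#' * rounded_counts[value]}")
```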

Note that bar charts (or pie charts) are mostly used for purely categorical data (blood types, eye colors, ethnic origins, etc.), and not really for numerical data.

So your intuition that the appropriate chart depends on the nature of either your data or the variable is ill-founded. It depends on what you are trying to get the data to say...

Note also that a bar chart (of years) can be thought of as a histogram (with bin width = 1), and a histogram can be thought of as a bar chart (where each category has the width of the histogram's bins). So there is really no fundamental difference.
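A small sketch of that equivalence, with made-up years and a hypothetical helper `density`. With bins of width 1, the histogram densities times $n$ are exactly the bar heights. Note that this relies on constant-width bins: with unequal bins, a histogram encodes frequency by bar area (density times width), so the heights alone no longer read as counts.

```python
from collections import Counter

# Hypothetical years in which some event was recorded
years = [2019, 2019, 2020, 2021, 2021, 2021, 2022, 2023, 2023, 2023]
n = len(years)

# Bar chart: the height of each bar is the count for that year
bar_heights = Counter(years)

def density(values, edges):
    """Histogram density for bins [edges[i], edges[i+1]): count / (n * width)."""
    result = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(lo <= v < hi for v in values)
        result.append(count / (len(values) * (hi - lo)))
    return result

# Bins of width 1: density * n reproduces the bar chart heights
unit_edges = list(range(2019, 2025))
unit_density = density(years, unit_edges)
print([round(d * n) for d in unit_density])

# Unequal bins: the bar areas (density * width) still sum to 1,
# but the heights are no longer proportional to the counts
uneven_edges = [2019, 2021, 2024]
uneven_density = density(years, uneven_edges)
areas = [d * (hi - lo) for d, lo, hi in zip(uneven_density, uneven_edges, uneven_edges[1:])]
print(uneven_density, areas)
```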

Last, I am not even sure I understand plotting a histogram of years. Now, a histogram of a random variable X shows for each bin how many values of X fall in that bin. But date is usually not a random variable; we know when (year, month, day...) an event occurred. What may be random is how many such events occurred in that span of time. But then that is exactly a bar chart (with increments of years, decades, months, or minutes...). A histogram implies that the date (year, or month, or decades, ...) where "something" occurred is a random variable, and you are plotting how many of those "dates" (no matter what scale) occurred in the various bins. I struggle to think of practical, realistic, situations where that would make sense (maybe the years when my favorite sports team achieved something: won a trophy, scored more than $n$ goals, etc. That histogram will have a lot of 0s...).

Now a bar chart assumes that the x-axis is categorical. But years are at a minimum interval scale. So a bar chart is equally "odd", no matter the scale (year, day, decade...).

What dates usually are is just time-stamps of when a measurement was made. The proper plot for such data is a time series plot, where you plot a count/measurement as a function of time. And there are many methods to display/analyze these.
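As a minimal sketch of preparing such a plot (the event dates are invented for illustration), the timestamps can be turned into a daily count series, keeping the days with zero events so that the time axis stays evenly spaced:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical event dates, one entry per recorded event
events = [
    date(2024, 5, 1), date(2024, 5, 1),
    date(2024, 5, 3),
    date(2024, 5, 4), date(2024, 5, 4), date(2024, 5, 4),
    date(2024, 5, 7),
]

per_day = Counter(events)
start, end = min(events), max(events)
n_days = (end - start).days + 1

# One (day, count) pair per calendar day, including days with no events
series = []
for i in range(n_days):
    day = start + timedelta(days=i)
    series.append((day, per_day.get(day, 0)))

for day, count in series:
    print(day.isoformat(), count)
```

The resulting pairs can be handed to whatever plotting tool you prefer, as a line or as thin vertical bars against time. The same approach works for weeks, months or years by changing how the dates are grouped.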

$\endgroup$
9
  • $\begingroup$ Your philosophy, IMHO, fails to distinguish between how we conceive of data, how we measure them, and how we record them. Those are three distinct things to which (often) the characteristics "continuous" and "discrete" are applied. This lack of distinction leads to confusing advice. $\endgroup$ Commented May 18 at 15:00
  • $\begingroup$ @whuber, could you provide an example of where my advice would lead to confusion? $\endgroup$ Commented May 19 at 0:00
  • $\begingroup$ For starters, you employ at least two different concepts of "continuous variables" in your introductory paragraph and that leads to a confusion between random variables and measurement. Next, I think "a histogram can be thought of as a bar chart" really misses the boat, because the two concepts do differ at bottom: one uses area while the other uses height to depict frequencies or probabilities. Histograms and bar charts appear to be the same things when using constant-width bins--but not otherwise. $\endgroup$ Commented May 19 at 13:05
  • $\begingroup$ For once I disagree with @whuber: There are bar charts with unequal bar width as well as varying bar heights that no statistical person would be likely to call a histogram. Bars for countries or regions with population on one axis and something per head on the other are a case in point. Bar area is then the total something (e.g. energy consumption or waste produced) for each country or region. $\endgroup$ Commented May 19 at 15:02
  • 1
    $\begingroup$ @whuber, thanks for the comment. You are indeed correct in your criticism, and, fwiw, I edited my answer accordingly. Thanks $\endgroup$ Commented May 20 at 16:36
