This document discusses pitfalls in benchmarking data stream classification and proposes ways to avoid them. It analyzes the electricity market dataset, a popular benchmark, and finds that its labels are temporally dependent: consecutive instances tend to share the same class, so a classifier that simply predicts the previous label is hard to beat. To account for this, it introduces new evaluation metrics, such as Kappa Plus, that measure a classifier against a "no change" baseline. It also proposes a temporally aware classifier called SWT, which incorporates previous labels into its predictions. Experiments on the electricity and forest cover datasets indicate that SWT and the new metrics are better suited to temporally dependent streaming data than standard classifiers and measures.
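
To make the comparison against a "no change" baseline concrete, here is a minimal sketch of such a metric. It assumes Kappa Plus takes the usual kappa form, (p0 - pe) / (1 - pe), where p0 is the accuracy of the classifier under test and pe is the accuracy of a classifier that always repeats the previous true label. The function names and the choice to skip the first instance (which has no previous label) are illustrative assumptions, not details taken from the document.

```python
from typing import Hashable, List, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one instance is required")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def no_change_predictions(y_true: Sequence[Hashable]) -> List[Hashable]:
    """Predictions of the no-change baseline for instances 1..n-1.

    The baseline predicts, for each instance, the true label of the
    instance just before it.
    """
    return list(y_true)[:-1]


def kappa_plus(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Kappa-style score with the no-change classifier as the baseline.

    Assumption: the first instance is skipped for both classifiers,
    because the baseline has no previous label to repeat there.
    """
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) < 2:
        raise ValueError("at least two instances are required")

    p0 = accuracy(y_true[1:], y_pred[1:])                      # classifier under test
    pe = accuracy(y_true[1:], no_change_predictions(y_true))   # no-change baseline
    if pe == 1.0:
        # The baseline is already perfect, so the ratio is undefined.
        raise ValueError("no-change baseline is perfect; score undefined")
    return (p0 - pe) / (1.0 - pe)


if __name__ == "__main__":
    # A toy stream with long runs of the same label (temporal dependence).
    stream = ["up", "up", "up", "down", "down", "up", "up", "up"]
    always_up = ["up"] * len(stream)

    print("accuracy of always-up:", accuracy(stream, always_up))
    print("kappa plus of always-up:", kappa_plus(stream, always_up))
```

On this toy stream the always-"up" classifier is right on most instances, yet its score under this sketch is 0 because the no-change baseline does equally well on the evaluated instances. A score above 0 means the classifier beats the baseline, and a negative score means it does worse.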