the reality of really real data
Nov. 18th, 2008 01:11 am
I'm currently working with some really messy time-series data, about power consumption in office buildings. There's missing data, multiple periodicities (daily, weekly, yearly), freakish outliers (e.g. holidays), and bursty anomalies (summer days generally use less power, except during heat waves, when ACs use *tons* of power). The task is daunting.
There are some things I want to do with the data for which I have no probabilistic interpretation (e.g. filter out certain frequencies).
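To make "filter out certain frequencies" concrete, here is a minimal sketch of one way it could be done with a plain FFT. It assumes evenly spaced samples with no gaps (so the missing data would have to be imputed first), and the function name `remove_band` and the period cutoffs are made up for illustration.

```python
import numpy as np

def remove_band(signal, low_period, high_period, sample_spacing=1.0):
    """Zero out Fourier components whose period lies in [low_period, high_period].

    Periods are measured in the same units as sample_spacing (hours here).
    Assumes an evenly sampled signal with no missing values.
    """
    signal = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=sample_spacing)
    # The zero-frequency (mean) term has an infinite period, so it is never removed.
    with np.errstate(divide="ignore"):
        periods = np.where(freqs > 0, 1.0 / freqs, np.inf)
    mask = (periods >= low_period) & (periods <= high_period)
    spectrum[mask] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

# Synthetic example: four weeks of hourly data with a daily and a weekly cycle.
hours = np.arange(24 * 28)
signal = 10 + 3 * np.sin(2 * np.pi * hours / 24) + np.sin(2 * np.pi * hours / 168)
# Remove everything with a period between 20 and 28 hours (the daily cycle).
filtered = remove_band(signal, low_period=20, high_period=28)
```

On this synthetic signal the daily cycle disappears and the mean plus the weekly cycle survive. Real data will leak energy into neighbouring frequencies, so a hard cutoff like this is only a starting point.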
I've spent the first several days exploring the data, making scatterplots, etc. I've seen some weird patterns, puzzling clusters. Modeling these would entail non-parametric density estimation, but this wouldn't tell me what to do wrt making actual predictions.
I should get some basic predictions working.
But there are so many possible models! Even though I'm only considering past power usage! (I'm not even looking at temperature)
Here are some basic ideas that have been floated:
* model the function using Gaussian Processes (for some kernel(s))
* model [prev n hours, next k hours] as a multivariate Gaussian (maybe this is the same as the above idea)
* autoregressive models, e.g. ridge regression on a subset of past times (including polynomial basis expansion, etc.)
* nearest neighbor (for some geometry(s))
* parameterized functional forms: model variation in daily bumps as a parameterized family of bumps, e.g. height, fatness, tail skewness, etc., using splines
* State-Space Models, a.k.a. continuous-state HMMs, (for some family of functions)
* Gaussian Lilypads, (I need to read up on this)
* ... and of course, ensembles of the above.
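As one concrete instance from the list above, the autoregressive idea (ridge regression on a subset of past times) can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the lag choices, the `alpha` value, and the helper names `make_lagged`, `fit_ridge`, and `predict_next` are all hypothetical, and the sketch ignores missing data and basis expansion.

```python
import numpy as np

def make_lagged(series, lags):
    """Build a design matrix whose columns are the series shifted by each lag."""
    series = np.asarray(series, dtype=float)
    max_lag = max(lags)
    X = np.column_stack(
        [series[max_lag - lag : series.size - lag] for lag in lags]
    )
    y = series[max_lag:]
    return X, y

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the intercept
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict_next(series, lags, weights):
    """Predict the value one step past the end of the series."""
    series = np.asarray(series, dtype=float)
    features = np.array([series[-lag] for lag in lags])
    return weights[0] + features @ weights[1:]

# Synthetic example: two weeks of a purely daily cycle, sampled hourly.
t = np.arange(24 * 14)
series = np.sin(2 * np.pi * t / 24)
lags = [1, 24, 168]  # previous hour, same hour yesterday, same hour last week
X, y = make_lagged(series, lags)
weights = fit_ridge(X, y, alpha=1e-3)
prediction = predict_next(series, lags, weights)
```

The 24-hour and 168-hour lag columns are identical for a perfectly periodic series, which is exactly the kind of collinearity the ridge penalty is there to handle.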
Following the principle of starting really simple, I plan to start by modeling daily totals, rather than hourly data.
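A minimal sketch of the hourly-to-daily aggregation, assuming the hourly series starts at midnight and marks missing readings as NaN. The function name `daily_totals` and the `min_hours` threshold are illustrative choices, not anything dictated by the data.

```python
import numpy as np

def daily_totals(hourly, min_hours=24):
    """Sum hourly readings into daily totals.

    Days with fewer than min_hours observed readings get NaN instead of an
    undercounted total. Any trailing partial day is dropped.
    """
    hourly = np.asarray(hourly, dtype=float)
    n_days = hourly.size // 24
    by_day = hourly[: n_days * 24].reshape(n_days, 24)
    observed = np.sum(~np.isnan(by_day), axis=1)
    totals = np.nansum(by_day, axis=1)
    totals[observed < min_hours] = np.nan
    return totals

# Three days of constant usage with one missing reading on the second day.
hourly = np.ones(72)
hourly[30] = np.nan
totals = daily_totals(hourly)
```

Marking incomplete days as missing keeps the gaps visible at the daily level rather than hiding them as artificially low totals. Lowering `min_hours` trades that honesty for more usable days.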
(no subject)
Date: 2008-11-18 07:23 pm (UTC)
(no subject)
Date: 2008-11-18 11:48 pm (UTC)
(no subject)
Date: 2008-11-18 11:54 pm (UTC)
It won't help you with prediction, but it should give you a better idea of the characteristics of this type of data.
(no subject)
Date: 2008-11-20 06:14 am (UTC)
(no subject)
Date: 2008-11-20 09:23 am (UTC)
Many of these inputs you can't predict. Probably the best you can do is factor out some known effects, such as the tendency for power consumption to rise and fall in steady periods that correspond with the seasons. I think this is essentially what goes on at the power and gas supply companies: you can make some prediction at the large scale to predict overall usage over a month, a season or a year, and keep coal and gas reserves at levels that balance the probability of power outages against the cost of holding reserves.
If you can get hold of power consumption plots over differing scales, like the ones in the Dieker paper for computer network usage, you should be able to see some of the main FBM characteristics: self-similarity and long-term dependence.
(no subject)
Date: 2008-11-19 03:06 am (UTC)
(no subject)
Date: 2008-11-19 03:09 am (UTC)
(no subject)
Date: 2008-11-19 07:36 am (UTC)
(no subject)
Date: 2008-11-19 05:32 am (UTC)
I think so. It was Kevin Murphy who mentioned it. Can you give me the link?
<< normal distribution emission models >>
Does this mean that each hidden state corresponds to a different Gaussian cloud?
I find it a bit strange that he's suggesting discretizing the state space, since it is naturally continuous and discretizing only makes it harder to learn. But I guess this isn't so bad since we're in 1D.