[rrd-users] Percentile consolidation
Pablo Chacin
2015-10-24 10:43:48 UTC
Greetings

Being able to pre-calculate and store certain data percentiles, like the
median and the 95th percentile, is a common requirement for any metrics
database, as these aggregation functions are much more stable and
representative of the data than the average or maximum values.

I saw that the mean was recently included as a consolidation function in
rrdtool, but it is still not possible to calculate other arbitrary
percentiles. Interestingly, percentiles have been available when retrieving
data for graphs or reporting.

Is there any compelling reason not to include percentiles as consolidation
functions? Is there any plan to do so in the future?

Regards


---------------------------
Pablo Chacin
CTO
SenseFields SL
Tlf (+34) 93 250 45 98
Gran Via 674, principal 1º
08010 Barcelona, Spain
http://www.sensefields.com


This message was directed exclusively at the recipient and contains
privileged and confidential information. If you receive this message in
error, I beg to inform us immediately by reply email or by phone 0034 93
250 45 98, and proceed to their elimination.
Steve Shipway
2015-10-24 23:43:05 UTC
Percentiles cannot be calculated incrementally: you need the entire dataset to deduce them, whereas mean, max and min only require the last calculation result and possibly the number of samples so far. Hence you cannot have the percentile as a CF.
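
A minimal Python sketch of this distinction. The class and function names are illustrative assumptions, not RRDtool internals, and the nearest-rank percentile definition is one choice among several. The running mean, min and max need only constant-size state, while the exact percentile needs every sample.

```python
import math


class RunningStats:
    """Constant-size state that is updated one sample at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, x):
        self.count += 1
        # The new mean needs only the previous mean and the sample count.
        self.mean += (x - self.mean) / self.count
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)


def percentile_nearest_rank(samples, p):
    """Exact percentile (nearest-rank method); needs the whole dataset."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


samples = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
stats = RunningStats()
for s in samples:
    stats.update(s)  # each sample can be discarded after this call

print(stats.minimum, stats.maximum)
print(percentile_nearest_rank(samples, 95))  # requires the retained list
```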

However, the RRDtool RPN functions include a percentile calculator, so you can still deduce this on the fly as you graph, using the available samples. You would need to be careful to ensure that the data series over which you are aggregating is of maximum granularity, though, if you want maximum accuracy.

Steve

Steve Shipway
University of Auckland ITS
UNIX Systems Design Lead
***@auckland.ac.nz
Ph: +64 9 373 7599 ext 86487

Donovan Baarda
2015-10-25 23:02:29 UTC
Note that variance, and hence stddev, can be calculated incrementally (by
keeping a timeseries of the average of the rate squared; variance =
average(rate^2) - average(rate)^2, stddev = sqrt(variance)), and assuming a
normal distribution, the 95th percentile is roughly average + 2*stddev
(more precisely, average + 1.645*stddev). The accuracy of this depends on
how closely your samples match a normal distribution and is not as
resilient to outliers as calculating a true 95th percentile from all the
samples, but it's a pretty good approximation. If you know your
distribution is closer to log-normal (which it often is for things like
latency), you can calculate a more accurate 95th percentile from the
average and variance like this:

mu = ln(avg) - ln(var/avg**2 + 1)/2
sigma = sqrt(ln(var/avg**2 + 1))
p95 = lognorminv(0.95, mu, sigma)
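
The formulas above can be sketched in Python as follows. This is only an illustration: the function names and the sample figures are assumptions, and lognorminv() is expanded using the identity lognorminv(p, mu, sigma) = exp(mu + z_p * sigma), where z_p is the standard normal quantile (about 1.645 for p = 0.95).

```python
import math
from statistics import NormalDist

# z-score of the one-sided 95th percentile of a standard normal (about 1.645).
Z95 = NormalDist().inv_cdf(0.95)


def p95_normal(avg, var):
    """95th percentile assuming the samples are normally distributed."""
    return avg + Z95 * math.sqrt(var)


def p95_lognormal(avg, var):
    """95th percentile assuming the samples are log-normally distributed."""
    s2 = math.log(var / avg**2 + 1)
    mu = math.log(avg) - s2 / 2
    sigma = math.sqrt(s2)
    # Equivalent to lognorminv(0.95, mu, sigma).
    return math.exp(mu + Z95 * sigma)


# Hypothetical latency statistics: mean 100 ms, standard deviation 50 ms.
avg, var = 100.0, 50.0**2
print(round(p95_normal(avg, var), 1))
print(round(p95_lognormal(avg, var), 1))
```

Only the average and the variance are needed as inputs, which is why storing the mean and the mean of squares is enough for either estimate.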

Unfortunately right now rrd doesn't support RRA's of type variance
(CF=VAR?) or mean value squared (CF=AVERAGE2?). However, if you were going
to request a feature, this is something that is definitely possible. A true
95th percentile RRA is definitely not. Another ugly approximation uses
bucketed distributions, but I wouldn't request that.

Note that having an RRA of type CF=AVERAGE2 is useful for calculating the
"root mean square", something that is also useful for e.g. AC power
calculations. Also, stddev is actually the "root mean square" of the
distance from the mean.
--
Donovan Baarda <***@minkirri.apana.org.au>
Pablo Chacin
2015-10-26 16:59:55 UTC
From the answers by Steve and Donovan, it seems like there's something
fundamental I'm not understanding about RRDtool.
Following is the explanation from Wikipedia:
RRDtool assumes time-variable data in intervals of a certain length. This
interval, usually named step, is specified upon creation of an RRD file and
cannot be changed afterwards. Because data may not always be available at
just the right time, RRDtool will automatically interpolate any submitted
data to fit its internal time-steps.

The value for a specific step, that has been interpolated, is named a
primary data point (PDP). Multiple PDPs may be consolidated according to a
consolidation function (CF) to form a consolidated data point (CDP). Typical
consolidation functions are average, minimum, maximum.

After the data have been consolidated, the resulting CDP is stored in a
round-robin archive (RRA). A round-robin archive stores a fixed number of
CDPs and specifies how many PDPs should be consolidated into one CDP and
which CF to use.

My understanding was that when making the consolidation, all the primary
data points were available to the consolidation function. Therefore, it
should be possible to calculate the percentile.

However, from what you explain, it looks as if RRDtool recalculates the
aggregate point with each arriving primary point. Is that correct?


Regards

In compliance with the Data Protection Act 15/1999 and its regulations,
Sensefields, S.L. informs you that your personal data will be processed and
stored in the computer and document systems owned by this company. You may
exercise your rights of access, rectification, cancellation and opposition
under the Act by writing, with a photocopy of your ID, to Sensefields, S.L.,
Gran Via Corts Catalanes 674, Pral 1ª, 08010 Barcelona (Barcelona), or by
email to ***@sensefields.com

Donovan Baarda
2015-10-27 07:17:43 UTC
Post by Pablo Chacin
From the answers by Steve and Donovan, it seems like there's something
fundamental I'm not understanding about RRDtool.
Following is the explanation from Wikipedia
[...]
My understanding was that when making the consolidation, all the primary
data points were available to the consolidation function. Therefore, it
should be possible to calculate the percentile.
However, from what you explain, it looks as if RRDtool recalculates the
aggregate point with each arriving primary point. Is that correct?

Correct. The aggregation of PDP's into CDP's is done incrementally at each
input sample.
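
As a rough illustration of incremental consolidation, here is a toy Python model. It is an assumption-laden sketch and not RRDtool's actual implementation: the class name, the AVERAGE-only behaviour and the absence of unknown-value handling are all simplifications. Each arriving PDP updates a running sum and is then discarded, so the individual PDPs are never available together when the CDP is written.

```python
from collections import deque


class RoundRobinArchive:
    """Toy RRA: consolidates every `steps` PDPs into one CDP (AVERAGE CF)."""

    def __init__(self, steps, rows):
        self.steps = steps
        self.cdps = deque(maxlen=rows)  # oldest CDP is dropped when full
        self._sum = 0.0
        self._count = 0

    def add_pdp(self, value):
        # Only a running sum and a count are kept; the PDP itself is not stored.
        self._sum += value
        self._count += 1
        if self._count == self.steps:
            self.cdps.append(self._sum / self._count)
            self._sum, self._count = 0.0, 0


rra = RoundRobinArchive(steps=3, rows=2)
for pdp in [1, 2, 3, 4, 5, 6, 7, 8, 9]:
    rra.add_pdp(pdp)

print(list(rra.cdps))  # [5.0, 8.0]
```

The first CDP (2.0) has already been overwritten because the archive holds only two rows.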

Note that even if this was not the case, calculating meaningful percentile
values would require large "steps" values on the corresponding RRA's. To
calculate a reasonable 95th percentile requires at least 20 points.
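
A short Python sketch of why about 20 points is the lower limit, assuming the nearest-rank definition of a percentile (other interpolation methods behave a little differently). With fewer than 20 samples the 95th percentile is simply the maximum, so it carries no extra information.

```python
import math


def nearest_rank(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# For each sample count n, use the values 1..n so the rank is easy to read.
for n in (5, 10, 19, 20, 100):
    samples = list(range(1, n + 1))
    print(n, nearest_rank(samples, 95), max(samples))
```

For n = 5, 10 and 19 the two printed values coincide; at n = 20 the percentile first drops below the maximum.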

This thread made me look at the RRD docs again and I discovered it now
supports DS's of type COMPUTE. That means it can already support
calculating value^2 as another DS, making it possible to do this:

rrdtool create traffic.rrd \
--start now --step 1m \
DS:rate:COUNTER:2m:0:1000000 \
'DS:rate2:COMPUTE:rate,rate,*' \
RRA:AVERAGE:0.5:1m:8d \
RRA:AVERAGE:0.5:1h:64d \
RRA:AVERAGE:0.5:1d:2y

and then get an approximate 95 percentile in your graphs by adding
2*stddev to the average like this:

DEF:rate=/home/rrdtool/data/traffic.rrd:rate:AVERAGE
DEF:rate2=/home/rrdtool/data/traffic.rrd:rate2:AVERAGE
CDEF:variance=rate2,rate,rate,*,-
CDEF:stddev=variance,SQRT
CDEF:95ptile=rate,stddev,2.0,*,+
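
The arithmetic behind these CDEFs can be checked with a small Python simulation. Everything in it is hypothetical: the rates are random demo data, and the percentile estimate is taken here as the average plus the 2*stddev band, which is an assumption of this sketch. The two averages stand in for what the AVERAGE RRAs of rate and rate2 would hold for one consolidated hour.

```python
import math
import random

random.seed(1)
# Hypothetical per-minute rates for one hour (normally distributed demo data).
rates = [random.gauss(500, 50) for _ in range(60)]

# What the AVERAGE RRAs of rate and rate2 would hold for one 1h CDP.
avg_rate = sum(rates) / len(rates)
avg_rate2 = sum(r * r for r in rates) / len(rates)

variance = avg_rate2 - avg_rate * avg_rate  # CDEF:variance=rate2,rate,rate,*,-
stddev = math.sqrt(max(variance, 0.0))      # CDEF:stddev=variance,SQRT
approx_p95 = avg_rate + 2.0 * stddev        # average plus the 2*stddev band

# Exact nearest-rank 95th percentile, which needs all 60 raw samples.
true_p95 = sorted(rates)[math.ceil(95 * len(rates) / 100) - 1]

print(round(approx_p95, 1), round(true_p95, 1))
```

One caveat, if the COMPUTE DS is evaluated once per PDP as I understand it: on an RRA where each CDP holds a single PDP (the 1m RRA above), the average of rate2 is just rate squared, so the variance comes out as zero. The estimate is only meaningful on the consolidated 1h and 1d RRAs.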

Unfortunately it doesn't look possible to calculate an approx 95 percentile
assuming a log-normal distribution, because RRD doesn't seem to have the
functions necessary to calculate lognorminv(), at least not easily.

--
Donovan Baarda <***@minkirri.apana.org.au>
