|
I highly doubt I'm doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for a thousand measurement systems.
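You have measurement systems that often report the previous value when they lose connectivity. They usually lose connectivity in short spurts, but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value was repeating and apply various filters based on that information. Say you are measuring mpg on a car, but it's stuck at 20 mpg for an hour and then moves to 20.1, and so on. You'll want to evaluate the accuracy while it's stuck. You could also add alternative rules that detect when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado: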
    --here's my data, you have different systems, the time of measurement, and the actual measurement
    --as well, the raw data has whether or not it's a repeat (hence the included window function)
    select *
    into temporary table cumulative_repeat_calculator_data
    FROM
        (
        select
            system_measured
            , time_of_measurement
            , measurement
            , case when measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
                   then 1 else 0 end as repeat
        FROM
            (
            SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
            ) as data
        ) as data;

    --unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
    --what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
    --this creates a column that stays the same while your data is repeating - aka something you can partition/group on
    select *
    into temporary table cumulative_repeat_calculator_step_1
    FROM
        (
        select
            *
            , sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
        from cumulative_repeat_calculator_data
        order by system_measured, time_of_measurement
        ) as data;

    --finally, the query. I didn't bother showing my desired output, because this (finally) got it
    --I wanted a sequential count of repeats that restarts when it stops repeating

So, what would you do differently, or what alternative tool would you use, to do this on a huge table? I'm thinking of plpgsql, since I suspect this needs to be done in the database, or possibly during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to subqueries?
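I tested an alternative approach, but it still relies on a subquery, and I think it's faster. With this approach you create a table of starts and stops with start_timestamp, end_timestamp, system. Then you join it to the bigger table, and if the timestamp falls between the two, you classify the row as being in that state. This is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 across thousands of devices and thousands of events. Do you think that's a better way to go?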
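
The start/stop join described above might look roughly like this. It is only a sketch: the table name system_states is hypothetical, the timestamps are integers to match the sample data, and the system column is named system_measured so it lines up with the temp table above.

```sql
-- Hypothetical table of state intervals: one row per run of identical values.
CREATE TEMP TABLE system_states (
    system_measured int
  , start_timestamp int
  , end_timestamp   int
);

-- Classify each measurement by the interval that contains it.
-- Putting the system in the join condition next to the range test
-- avoids a pure 1=1 join that is only filtered afterwards.
SELECT d.*, s.start_timestamp, s.end_timestamp
FROM   cumulative_repeat_calculator_data d
JOIN   system_states s
       ON  s.system_measured = d.system_measured
       AND d.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER  BY d.system_measured, d.time_of_measurement;
```

Whether this beats the running-total approach on billions of rows would have to be measured; a range join like this is typically hard for the planner to do cheaply.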
Solution:

Test case

First, a more useful way to present your data, or even better, in an sqlfiddle, ready to play with:

    CREATE TEMP TABLE data(
       system_measured int
     , time_of_measurement int
     , measurement int
    );

    INSERT INTO data VALUES
     (1, 1, 5)
    ,(1, 2, 150)
    ,(1, 3, 5)
    ,(1, 4, 5)
    ,(2, 1, 5)
    ,(2, 2, 5)
    ,(2, 3, 5)
    ,(2, 4, 5)
    ,(2, 5, 150)
    ,(2, 6, 5)
    ,(2, 7, 5);

Simplified query

Since the requirements remain unclear, I am assuming only what's given above.
Next, I simplified your query to arrive at:

    WITH x AS (
       SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                                  ORDER BY time_of_measurement) = measurement
                      THEN 0 ELSE 1 END AS step
       FROM   data
       )
    , y AS (
       SELECT *, sum(step) OVER (PARTITION BY system_measured
                                 ORDER BY time_of_measurement) AS grp
       FROM   x
       )
    SELECT *
         , row_number() OVER (PARTITION BY system_measured, grp
                              ORDER BY time_of_measurement) - 1 AS repeat_ct
    FROM   y
    ORDER  BY system_measured, time_of_measurement;

Now, while it's all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do the job in a single table scan where this query needs at least three scans.
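
If one row per run is enough rather than a per-row counter, the same grp column can be aggregated directly. A minimal sketch, assuming the data table from the test case; the output names run_start, run_end and repeats are illustrative and not part of the original query:

```sql
WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                              ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step
   FROM   data
   )
, y AS (
   SELECT *, sum(step) OVER (PARTITION BY system_measured
                             ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT system_measured, grp, measurement
     , min(time_of_measurement) AS run_start
     , max(time_of_measurement) AS run_end
     , count(*) - 1             AS repeats   -- 0 means the value never repeated
FROM   y
GROUP  BY system_measured, grp, measurement  -- measurement is constant within a run
ORDER  BY system_measured, grp;
```

Filtering on repeats (for example, keeping only runs longer than some threshold) then becomes a plain HAVING or outer WHERE clause.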
Faster with a plpgsql function

    CREATE OR REPLACE FUNCTION x.f_repeat_ct()
      RETURNS TABLE (
        system_measured int
      , time_of_measurement int
      , measurement int
      , repeat_ct int
      ) LANGUAGE plpgsql AS
    $func$
    DECLARE
       r    data;     -- table name serves as record type
       r0   data;     -- previous row
    BEGIN

    -- SET LOCAL work_mem = '1000 MB';  -- uncomment and adapt if needed, see below!

    repeat_ct := 0;   -- init

    FOR r IN
       SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
    LOOP
       IF  r.system_measured = r0.system_measured
           AND r.measurement = r0.measurement THEN
          repeat_ct := repeat_ct + 1;   -- same value again: increment count
       ELSE
          repeat_ct := 0;               -- start new count
       END IF;

       RETURN QUERY SELECT r.*, repeat_ct;

       r0 := r;                         -- remember last row
    END LOOP;

    END
    $func$;

Call:

    SELECT * FROM x.f_repeat_ct();

Be sure to table-qualify your column names in this kind of plpgsql function, because we use the same names as output parameters, and those would take precedence if the columns were not qualified.
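Billions of rows

If you have billions of rows, you may want to split this operation up. I quote the manual here:

Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.

Consider computing rows for one system at a time, or set a value for work_mem high enough to cope with the load. See the manual's entry on work_mem for more information.

One way would be to set a high value for work_mem with SET LOCAL inside your function, which is only effective for the current transaction. I added a commented-out line in the function for that. Do not set it very high globally, as this could nuke your server. Read the manual. |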
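
A minimal sketch of processing one system at a time in plain SQL, assuming the data table from the test case and a hypothetical result table repeat_ct_result. The system id 1 stands in for whatever value a client-side loop would supply:

```sql
-- Hypothetical table that collects the results batch by batch.
CREATE TEMP TABLE repeat_ct_result (
    system_measured     int
  , time_of_measurement int
  , measurement         int
  , repeat_ct           int
);

-- One batch: only the rows of a single system are sorted and scanned,
-- so no PARTITION BY system_measured is needed inside the batch.
INSERT INTO repeat_ct_result
SELECT system_measured, time_of_measurement, measurement
     , row_number() OVER (PARTITION BY grp ORDER BY time_of_measurement) - 1
FROM  (
   SELECT *, sum(step) OVER (ORDER BY time_of_measurement) AS grp
   FROM  (
      SELECT *, CASE WHEN lag(measurement) OVER (ORDER BY time_of_measurement) = measurement
                     THEN 0 ELSE 1 END AS step
      FROM   data
      WHERE  system_measured = 1      -- the system processed in this batch
      ) x
   ) y;
```

An index on (system_measured, time_of_measurement) would probably help each batch find and order its rows, though that is an assumption about the real table rather than something stated above.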
|