Ordered count of consecutive repeats / duplicates

一世纪

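I highly doubt I'm doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for a thousand measurement systems.

You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity often, usually in short spurts but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value was repeating and apply various filters based on that information. Say you are measuring mpg on a car, but it is stuck at 20 mpg for an hour and then moves to 20.1 and so on. You will want to evaluate the accuracy while it is stuck. You could also add alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado: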
    --here's my data, you have different systems, the time of measurement, and the actual measurement
    --as well, the raw data has whether or not it's a repeat (hence the included window function)
    select * into temporary table cumulative_repeat_calculator_data
    FROM (
        select
            system_measured, time_of_measurement, measurement,
            case when
                measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
            then 1 else 0 end as repeat
        FROM (
            SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
            UNION
            SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
            UNION
            SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
        ) as data
    ) as data;

    --unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
    --what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
    --this creates a column that stays the same while your data is repeating - aka something you can partition/group on
    select * into temporary table cumulative_repeat_calculator_step_1
    FROM (
        select
            *,
            sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
        from cumulative_repeat_calculator_data
        order by system_measured, time_of_measurement
    ) as data;

    --finally, the query. I didn't bother showing my desired output, because this (finally) got it
    --I wanted a sequential count of repeats that restarts when it stops repeating

So, what would you do differently, or what alternative tool would you use, to work on a huge table? I'm thinking of using plpgsql, as I suspect this needs to be done in the database, or possibly during the data insertion process, although I generally work with the data after it's loaded. Is there any way to get this in one sweep without resorting to subqueries?
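The last step described in the comments, a repeat count that restarts whenever the value stops repeating, can be sketched as a running sum of the repeat flag within each group. This is only a sketch under assumptions: it relies on the temp table cumulative_repeat_calculator_step_1 built in the second step, and the alias ordered_repeat is a made-up name.

```sql
-- Sketch only: assumes the temp table cumulative_repeat_calculator_step_1 from the second step.
-- The alias ordered_repeat is an assumption, not taken from the original post.
SELECT *
     , sum(repeat) OVER (PARTITION BY system_measured, cumlative_sum_of_nonrepeats_by_system
                         ORDER BY time_of_measurement) AS ordered_repeat
FROM   cumulative_repeat_calculator_step_1
ORDER  BY system_measured, time_of_measurement;
```

The first row of every group has repeat = 0, so the running sum starts at 0 and grows by one for each further repeated row.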
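I have tested one alternative method, but it still relies on a subquery, and I think it is faster. For that method you create a "starts and stops" table with start_timestamp, end_timestamp, system. Then you join it to the larger table, and if the timestamp falls between the two, you classify the row as being in that state. This is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands of events. Do you think that's a better way to go?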
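To make the "starts and stops" idea concrete, here is a minimal sketch. Everything in it is assumed for illustration: the table names measurements and system_state, the integer timestamps, and the hand-written state rows. The join condition also includes the system, not only the time range, which is an assumption about how one might avoid a pure 1=1 join.

```sql
-- Assumed sample data; table and column layout are hypothetical.
CREATE TEMP TABLE measurements (system_measured int, time_of_measurement int, measurement int);
INSERT INTO measurements VALUES (1,1,5), (1,2,150), (1,3,5), (1,4,5);

-- Hypothetical "starts and stops" table: one row per state of a system.
CREATE TEMP TABLE system_state (system_measured int, start_timestamp int, end_timestamp int);
INSERT INTO system_state VALUES (1,1,1), (1,2,2), (1,3,4);

-- Classify each measurement by the state whose time range contains it,
-- then number the rows inside each state, starting at 0.
SELECT m.*, s.start_timestamp, s.end_timestamp
     , row_number() OVER (PARTITION BY s.system_measured, s.start_timestamp
                          ORDER BY m.time_of_measurement) - 1 AS repeat_ct
FROM   measurements m
JOIN   system_state s ON  s.system_measured = m.system_measured
                      AND m.time_of_measurement BETWEEN s.start_timestamp AND s.end_timestamp
ORDER  BY m.system_measured, m.time_of_measurement;
```

The state ranges must not overlap within a system, otherwise a measurement is matched more than once.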

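Solution:

Test case

First, a more useful way to present your data, or even better, in an sqlfiddle, ready to play with:

    CREATE TEMP TABLE data(
       system_measured int
     , time_of_measurement int
     , measurement int
    );

    INSERT INTO data VALUES
      (1,1,5)
    , (1,2,150)
    , (1,3,5)
    , (1,4,5)
    , (2,1,5)
    , (2,2,5)
    , (2,3,5)
    , (2,4,5)
    , (2,5,150)
    , (2,6,5)
    , (2,7,5);

Simplified query

Since it remains unclear, I am assuming only what is shown above.
Next, I simplified your query to arrive at: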
    WITH x AS (
       SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                                   ORDER BY time_of_measurement) = measurement
                      THEN 0 ELSE 1 END AS step
       FROM   data
       )
    , y AS (
       SELECT *, sum(step) OVER (PARTITION BY system_measured
                                 ORDER BY time_of_measurement) AS grp
       FROM   x
       )
    SELECT *
         , row_number() OVER (PARTITION BY system_measured, grp
                              ORDER BY time_of_measurement) - 1 AS repeat_ct
    FROM   y
    ORDER  BY system_measured, time_of_measurement;

Now, while it is all nice and shiny to use pure SQL, this will be much faster with a plpgsql function, because it can do it in a single table scan where this query needs at least three scans.
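If what is ultimately needed is the length of each stuck stretch rather than a per-row counter, the same grp column can be aggregated instead. A sketch, assuming the data table from the test case; the output names run_start, run_end and run_len are made up for illustration:

```sql
-- Sketch only: assumes the temp table "data" from the test case.
WITH x AS (
   SELECT *, CASE WHEN lag(measurement) OVER (PARTITION BY system_measured
                                               ORDER BY time_of_measurement) = measurement
                  THEN 0 ELSE 1 END AS step   -- 1 marks the start of a new run
   FROM   data
   )
, y AS (
   SELECT *, sum(step) OVER (PARTITION BY system_measured
                             ORDER BY time_of_measurement) AS grp
   FROM   x
   )
SELECT system_measured, grp, measurement
     , min(time_of_measurement) AS run_start   -- hypothetical output names
     , max(time_of_measurement) AS run_end
     , count(*)                 AS run_len
FROM   y
GROUP  BY system_measured, grp, measurement    -- measurement is constant within a run
ORDER  BY system_measured, grp;
```

A filter such as HAVING count(*) > 1 would keep only the runs in which a value actually repeated.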
Faster with a plpgsql function

    CREATE OR REPLACE FUNCTION x.f_repeat_ct()
      RETURNS TABLE (
        system_measured int
      , time_of_measurement int
      , measurement int
      , repeat_ct int
      )
      LANGUAGE plpgsql AS
    $func$
    DECLARE
       r    data;     -- table name serves as record type
       r0   data;     -- previous row
    BEGIN

    -- SET LOCAL work_mem = '1000MB';  -- uncomment and adapt if needed, see below!

    repeat_ct := 0;   -- init

    FOR r IN
       SELECT * FROM data d ORDER BY d.system_measured, d.time_of_measurement
    LOOP
       IF  r.system_measured = r0.system_measured
           AND r.measurement = r0.measurement THEN
          repeat_ct := repeat_ct + 1;   -- same value as the previous row: keep counting
       ELSE
          repeat_ct := 0;               -- start new count
       END IF;

       RETURN QUERY SELECT r.*, repeat_ct;

       r0 := r;                         -- remember last row
    END LOOP;

    END
    $func$;

Call:

    SELECT * FROM x.f_repeat_ct();

Be sure to table-qualify your column names at all times in this kind of plpgsql function, because we use the same names as output parameters, which would conflict with (or take precedence over) the column names if not qualified.
Billions of rows

If you have billions of rows, you may want to split this operation up. I quote the manual here:
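Note: The current implementation of RETURN NEXT and RETURN QUERY stores the entire result set before returning from the function, as discussed above. That means that if a PL/pgSQL function produces a very large result set, performance might be poor: data will be written to disk to avoid memory exhaustion, but the function itself will not return until the entire result set has been generated. A future version of PL/pgSQL might allow users to define set-returning functions that do not have this limitation. Currently, the point at which data begins being written to disk is controlled by the work_mem configuration variable. Administrators who have sufficient memory to store larger result sets in memory should consider increasing this parameter.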
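Consider computing rows for one system at a time, or set a value for work_mem high enough to cope with the load. See the manual section referenced in the quote for more on work_mem.

One way would be to set a very high value for work_mem with SET LOCAL in your function, which is only effective for the current transaction. I added a commented line in the function. Do not set it very high globally, as this might nuke your server. Read the manual.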
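As a rough illustration of both suggestions combined, the pure SQL query can be run for one system per transaction with a transaction-local work_mem. This is a sketch under assumptions: the target table repeat_counts is hypothetical, the value '256MB' is an arbitrary placeholder, and the system id would be supplied by whatever loop drives the batches.

```sql
-- Hypothetical target table for the per-system results (not in the original post).
CREATE TABLE IF NOT EXISTS repeat_counts (
   system_measured     int
 , time_of_measurement int
 , measurement         int
 , repeat_ct           int
);

BEGIN;
SET LOCAL work_mem = '256MB';   -- assumed value; reverts at COMMIT or ROLLBACK

INSERT INTO repeat_counts
SELECT system_measured, time_of_measurement, measurement
     , row_number() OVER (PARTITION BY grp ORDER BY time_of_measurement) - 1 AS repeat_ct
FROM  (
   SELECT *, sum(step) OVER (ORDER BY time_of_measurement) AS grp
   FROM  (
      SELECT *, CASE WHEN lag(measurement) OVER (ORDER BY time_of_measurement) = measurement
                     THEN 0 ELSE 1 END AS step
      FROM   data
      WHERE  system_measured = 1   -- one system per transaction
      ) x
   ) y;
COMMIT;
```

Because the inner query is already restricted to a single system, the window functions no longer need PARTITION BY system_measured.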
