我有一个320万行的表buildings。我需要将这个表扩展到11个不同的句点,以(平衡) Paneldata的形式处理它。这意味着,每一个物体都有11个不同的年份(从2000-2010年)进行观测。这些时期应称为:
2000
2001
...
2009
2010表定义
CREATE TABLE public.buildings
(
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
osm_id character varying(11),
name character varying(48),
type character varying(16),
geom geometry(MultiPolygon,4326),
centroid geometry(Point,4326),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
instdate date,
pvid integer,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.buildings
OWNER TO postgres;
CREATE INDEX build_centroid_gix
ON public.buildings
USING gist
(st_transform(centroid, 31467));
CREATE INDEX buildings_geom_idx
ON public.buildings
USING gist
(geom);我想用这些数据在R中进行回归分析。
ibase_id是gid的数组。ibase_dist是一个相关的数组,与gid到服从的距离是相关的,两个数组的长度总是相同的。
数组中的gid属于buildings的记录,其半径在centroid周围500米以内,是服从对象的中心,具有pv=TRUE (这意味着dist、instdate、instp、capac&pvid是NOT NULL)。
SELECT a.gid AS buildid, array_agg(b.gid) AS ibase_id, array_agg(round(ST_Distance(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467))::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN (SELECT * FROM buildings WHERE pv=TRUE) AS b ON ST_DWithin(ST_Transform(a.centroid, 31467), ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
GROUP BY a.gid示例:
ibase_id: {3075528,409073,322311,226643,833798,322344,226609};
ibase_dist {290,293,398,494,411,381,384}
UPDATE buildings
SET ibase=SUM(1/s)
FROM unnest(SELECT ibasedist FROM buildings WHERE (SELECT instp
FROM buildings
WHERE gid IN unnest(ibase_id))<year) s对于每个时间段,只应考虑数组的入口,其年份早于面板数据的观测周期。(上面的查询还不起作用,因为我需要先连接数组),这两个数组保存了多年的信息。这就是为什么我认为它们应该被添加到每一段时间,所以在扩展到面板数据之后,我计算每个记录的ibase(11x320万)。
我不需要所有的列进行回归分析。如果它将显着地提高乘法的性能,我们可以坚持行(基本上省略几何列):
gid integer NOT NULL DEFAULT nextval('buildings_gid_seq'::regclass),
gembez character varying(50),
gemname character varying(50),
krsbez character varying(50),
krsname character varying(50),
pv boolean,
gr smallint,
capac double precision,
dist double precision,
gemewz integer,
n500 integer,
ibase double precision,
popden integer,
instp smallint,
b2000 double precision,
b2001 double precision,
b2002 double precision,
b2003 double precision,
b2004 double precision,
b2005 double precision,
b2006 double precision,
b2007 double precision,
b2008 double precision,
b2009 double precision,
b2010 double precision,
ibase_id integer[],
ibase_dist integer[],
CONSTRAINT buildings_pkey PRIMARY KEY (gid)
)
WITH (
OIDS=FALSE解方法
我的基本想法是创建包含11个不同句点的第二个表periods,并将该表与表buildings相乘。不知道如何实现这一点。不幸的是,我对R没有太多的经验,而且还没有使用R的数据库接口。
使用PostgreSQL 9.5beta2,由Visual C++ build 1800、64位和R x64 3.2.1编译
发布于 2016-02-16 12:59:50
我通过使用包含句点的临时表t1的交叉连接来创建Paneldata表。
CREATE TABLE public.t1
(
period smallint
)
WITH (
OIDS=FALSE
);
CREATE TABLE paneldata AS
(SELECT *
FROM t1 CROSS JOIN
(SELECT gid,
gemname,
gembez,
krsname,
krsbez,
pv,
gr,
capac,
dist,
gemewz,
n500,
popden,
instp
FROM buildings) AS test
ORDER BY gid)发布于 2016-02-07 01:39:13
从本质上说,面板数据集是long格式的数据,每个记录的重复年份作为时间列。您的当前结构是wide格式。虽然R可以转换这个非常大的数据集,但是PostGreSQL可以用它的引擎在一个联合查询中将所有年叠加在一起,并将结果集传递到R中。请注意一些数据类型,例如几何对象和数组可能不能正确地转换为R数据类型,所以删除它们或将它们转换为字符串/数值类型。
下面是这样一个具有堆叠年数的SQL UNION查询。我不太清楚ibase_id和ibase_dist或“乘”方面是什么意思,但是Year列与相应的b列一起添加。让R脚本通过RPostGreSQL模块调用它。
import("RPostgreSQL")
# CREATE CONNECTION
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "postgres",
host = "localhost", port = ####,
user = "username", password = "password")
strSQL <- "SELECT '2000' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500
popden, instp, b2000 As b, (1/ibase_dist) As ibase
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp = 2000
UNION
...other SELECT statements for years 2001-2010..."
# IMPORT QUERY RESULTSET INTO DATAFRAME
df <- dbGetQuery(con, strSQL)
# CLOSE CONNECTION
dbDisconnect(con)但是,请确保您拥有操作大型数据集所需的随机存取存储器。您可能需要相应地分配内存。或者,您可以迭代地将每年的SELECT语句追加到一个不断增长的dataframe对象中,而不是一次性加载。
# ...SAME CONNECTION SETUP AS ABOVE...
years = c('2000', '2001', '2002', '2003', '2004', '2005',
'2006', '2007', '2008', '2009', '2010')
# CREATES LIST OF YEAR DATA FRAME
dfList = lapply(years,
function(y) {
# NOTICE CONCATENATION OF Y IN SELECT STATEMENT
strSQL <- paste0("SELECT '", y, "' As year, gid, gembez, gemname, krsbez,
krsname, pv, gr, capac, dist, gemewz, n500,
popden, instp, b", y, ", As b, (1/ibase_dist) As ibase,
FROM public.buildings
INNER JOIN
(SELECT a.gid AS buildid,
SUM(round(ST_Distance(
ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467)
)::integer)) AS ibase_dist
FROM buildings a
LEFT JOIN buildings b
ON ST_DWithin(ST_Transform(a.centroid, 31467),
ST_Transform(b.centroid, 31467), 500.0)
AND a.gid <> b.gid
WHERE b.pv=True AND b.instp < a.instp
GROUP BY a.gid) AS distSum
ON public.buildings.gid = distSum.buildid
WHERE public.buildings.instp =", y)
dbGetQuery(con, strSQL)
})
# APPEND LIST OF DATA FRAMES INTO ONE LARGE DATA FRAME
df <- do.call(rbind, dfList)
# REMOVE PREVIOUS LIST FOR MEMORY RESOURCES
rm(dfList)
# CLOSE CONNECTION
dbDisconnect(con)https://stackoverflow.com/questions/35246959
复制相似问题