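I set up R using Microsoft's "Data Science End-to-End Walkthrough", and their example works well.
The example (New York taxi data) uses non-categorical variables (distance, taxi fare, etc.) to predict a categorical variable (1 or 0, indicating whether a tip was paid).
I am trying to predict a similar binary output using categorical variables as input, with linear regression (the rxLinMod function), and I get an error.
The error says that the number of parameters does not match the number of variables, but it looks to me as if the "number of variables" is actually the number of levels in each factor (variable).
To replicate
Create a table called example in SQL Server: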
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
CREATE TABLE [dbo].[example](
[Person] [nvarchar](max) NULL,
[City] [nvarchar](max) NULL,
[Bin] [integer] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];
Put data in it:
insert into [dbo].[example] values ('John','London',0);
insert into [dbo].[example] values ('Paul','New York',0);
insert into [dbo].[example] values ('George','Liverpool',1);
insert into [dbo].[example] values ('Ringo','Paris',1);
insert into [dbo].[example] values ('John','Sydney',1);
insert into [dbo].[example] values ('Paul','Mexico City',1);
insert into [dbo].[example] values ('George','London',1);
insert into [dbo].[example] values ('Ringo','New York',1);
insert into [dbo].[example] values ('John','Liverpool',1);
insert into [dbo].[example] values ('Paul','Paris',0);
insert into [dbo].[example] values ('George','Sydney',0);
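insert into [dbo].[example] values ('Ringo','Mexico City',0);
I also use a SQL function that returns the variables in table format, because that is what the Microsoft example requires. Create the function formatAsTable:
USE [my_database];
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO
-- CREATE FUNCTION must be the first statement in its batch, hence the GO above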
CREATE FUNCTION [dbo].[formatAsTable] (
@City nvarchar(max)='',
@Person nvarchar(max)='')
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
SELECT
@City AS City,
@Person AS Person
);
Now we have a table with two categorical variables, Person and City.
Let's start predicting. In R, run the following:
library(RevoScaleR)
# Set up the database connection
connStr <- "Driver=SQL Server;Server=<servername>;Database=<dbname>;Uid=<uid>;Pwd=<password>"
sqlShareDir <- paste("C:\\AllShare\\",Sys.getenv("USERNAME"),sep="")
sqlWait <- TRUE
sqlConsoleOutput <- FALSE
cc <- RxInSqlServer(connectionString = connStr, shareDir = sqlShareDir,
wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(cc)
# Set the SQL which gets our data base
sampleDataQuery <- "SELECT * from [dbo].[example] "
# Set up the data source
inDataSource <- RxSqlServerData(sqlQuery = sampleDataQuery, connectionString = connStr,
colClasses = c(City = "factor",Bin="logical",Person="factor"
),
rowsPerRead=500)
Now, set up the linear regression model:
isWonObj <- rxLinMod(Bin ~ City+Person,data = inDataSource)
Look at the model object:
isWonObj
Note that it looks like this:
...
Total independent variables: 11 (Including number dropped: 3)
...
Coefficients:
Bin
(Intercept) 6.666667e-01
City=London -1.666667e-01
City=New York 4.450074e-16
City=Liverpool 3.333333e-01
City=Paris 4.720871e-16
City=Sydney -1.666667e-01
City=Mexico City Dropped
Person=John -1.489756e-16
Person=Paul -3.333333e-01
Person=George Dropped
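Person=Ringo Dropped
It says there are 11 variables, which is fine, as this is the sum of the levels in the factors (6 for City plus 4 for Person) plus the intercept.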
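The count of 11 and the 3 dropped coefficients can be checked without SQL Server. The following is a minimal sketch in base R (not RevoScaleR), assuming the same 12 rows as the table above; the name example_df is made up for illustration. It counts one parameter per factor level plus the intercept, then compares that with the rank of the design matrix.

```r
# Same 12 rows as the [dbo].[example] table (example_df is a hypothetical name).
example_df <- data.frame(
  Person = rep(c("John", "Paul", "George", "Ringo"), times = 3),
  City   = rep(c("London", "New York", "Liverpool",
                 "Paris", "Sydney", "Mexico City"), times = 2),
  Bin    = c(0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = TRUE
)

# One parameter per factor level, plus the intercept.
n_params <- 1L + nlevels(example_df$City) + nlevels(example_df$Person)

# Rank of the design matrix = number of coefficients that can be estimated.
design <- model.matrix(~ City + Person, data = example_df)
n_estimable <- qr(design)$rank

n_params                # 11
n_estimable             # 8
n_params - n_estimable  # 3, matching "Including number dropped: 3"
```

In this data John and George only ever appear with London, Liverpool and Sydney, while Paul and Ringo only appear with the other three cities. That adds one linear dependency on top of the usual one per factor, which would explain why three coefficients are dropped rather than two.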
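Now, when I try to predict the Bin value based on City and Person, I get an error.
First, I format the City and Person I want to predict for as a table. Then I predict using this as input.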
sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
, colClasses = c(City = "factor",Person="factor"))
If you check the pred object, it looks as expected:
> head(pred)
City Person
1 London George
Now, when I try to predict, I get an error.
scoredOutput <- RxSqlServerData(
connectionString = connStr,
table = "binaryOutput"
)
rxPredict(modelObject = isWonObj, data = pred, outData = scoredOutput,
predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE,checkFactorLevels = FALSE)
The error says:
INTERNAL ERROR: In rxPredict, the number of parameters does not match the number of variables: 3 vs. 11.
I can see where the 11 comes from, but I only supplied 2 values to the prediction query, so I cannot see where the 3 comes from, or why there is a problem.
Any help is much appreciated!
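Posted on 2016-08-05 15:52:52
The answer seems to lie in how R treats factor variables; however, the error message could distinguish more clearly between factors, levels, variables and parameters.
It seems that the inputs used to generate a prediction cannot simply be characters or factors without levels. They need to be factors with the same levels as the corresponding variables used when the model was fitted.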
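One possible reading of "3 vs. 11" is sketched below in base R. It assumes that rxPredict counts one parameter per factor level plus an intercept; that assumption matches the numbers in both error messages on this page but is not confirmed by the documentation. The helper count_params is hypothetical, not part of RevoScaleR.

```r
# Hypothetical helper: intercept + one parameter per factor level.
count_params <- function(level_sets) {
  1L + sum(lengths(level_sets))
}

# Levels seen when the model was fitted.
train_levels <- list(
  City   = c("London", "New York", "Liverpool", "Paris", "Sydney", "Mexico City"),
  Person = c("John", "Paul", "George", "Ringo")
)

# A one-row scoring set read as factors only knows the levels it contains.
scoring <- data.frame(City = "London", Person = "George",
                      stringsAsFactors = TRUE)
scoring_levels <- lapply(scoring, levels)

count_params(train_levels)    # 11
count_params(scoring_levels)  # 3
```

Under that assumption, the single scoring row contributes one level per factor, so the scoring data describes 3 parameters while the model expects 11.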
As such, the following lines:
sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
, colClasses = c(City = "factor",Person="factor"))
... should be replaced with:
sq<-"SELECT City, Person FROM [dbo].[formatAsTable]('London','George')"
column_information<-list(
City=list(type="factor",levels=c("London","New York","Liverpool","Paris","Sydney","Mexico City")),
Person=list(type="factor",levels=c("John","Paul","George","Ringo")),
Bin=list(type="logical")
)
pred<-RxSqlServerData(sqlQuery = sq,connectionString = connStr
,colInfo=column_information,
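stringsAsFactors=FALSE)
I have seen other examples where categorical variables did not seem to work, but perhaps the levels were in there anyway.
I hope this saves someone as much time as I lost on this!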
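Typing the levels by hand does not scale well, so the colInfo list could be built from the training data instead. This is a sketch under the assumption that the training data (or at least its factor columns) is available as a local data frame; make_col_info and training_df are hypothetical names, and only the list structure (type, levels) is taken from the answer above.

```r
# Hypothetical helper: build a colInfo-style list from the factor columns
# of a training data frame, keeping each factor's level order unchanged.
make_col_info <- function(training_df) {
  is_fac <- vapply(training_df, is.factor, logical(1))
  lapply(training_df[is_fac], function(col) {
    list(type = "factor", levels = levels(col))
  })
}

# Small stand-in for the training data, with the level order used above.
training_df <- data.frame(
  City = factor(c("London", "New York", "Liverpool"),
                levels = c("London", "New York", "Liverpool",
                           "Paris", "Sydney", "Mexico City")),
  Person = factor(c("John", "Paul", "George"),
                  levels = c("John", "Paul", "George", "Ringo")),
  Bin = c(FALSE, FALSE, TRUE)
)

column_information <- make_col_info(training_df)
str(column_information)
```

The resulting list has one entry per factor column and could be passed as colInfo in the same way as the hand-written version; non-factor columns such as Bin are left out and would need to be added separately if required.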
EDIT in response to SLSvenR
I think my point about keeping the prediction levels the same as the training levels still stands.
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1
levels(predictionData$fac)<-levels(trainingData$fac)
# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE,checkFactorLevels = TRUE)
rxPred
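# This result appears correct to me.
I can't comment on whether this is good or bad; however, it seems that one way around this is to apply the levels of the training data to the test set, which I think you could do in real time.
Posted on 2016-09-26 14:58:01
Are you sure that specifying colInfo solved the problem? There seems to be a general problem in rxPredict, rather than one specific to combining rxPredict with SQL Server: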
# lm() and predict() don't have a problem with missing factor levels ("two" in this case):
fac <- c("one", "two", "three")
val = c(1, 2, 3)
trainingData <- data.frame(fac, val, stringsAsFactors = TRUE)
lmModel <- lm(val ~ fac, data = trainingData)
print(summary(lmModel))
predictionData = data.frame(fac = c("one", "three", "one", "one"), stringsAsFactors = TRUE)
lmPred <- predict(lmModel, newdata = predictionData)
lmPred
# The result is OK:
# 1 2 3 4
# 1 3 1 1
# rxLinMod() and rxPredict() behave differently:
rxModel <- rxLinMod(val ~ fac, data = trainingData)
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown:
# "INTERNAL ERROR: In rxPredict, the number of parameters does not match
# the number of variables: 3 vs. 4."
# checkFactorLevels = FALSE doesn't help here, it actually seems to just
# check the order of factor levels.
levels(predictionData$fac) <- c("two", "three", "one")
rxPred <- rxPredict(rxModel, data = predictionData, writeModelVars = TRUE)
# The following error is thrown (twice):
# ERROR:order of factor levels in the data are inconsistent with
# the order of the model coefficients:fac = two versus fac = one. Set
# checkFactorLevels = FALSE to ignore.
rxPred <- rxPredict(rxModel, data = predictionData, checkFactorLevels = FALSE, writeModelVars = TRUE)
rxPred
# val_Pred fac
#1 1 two
#2 3 three
#3 1 two
#4 1 two
# This looks suspicious at best. While the prediction values are still
# correct if you look only at the order of the records in trainingData,
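# the model variables are messed up.
In my scenario, I have one factor with about 10,000 levels (known only when the model is created) and several more factors with about 5 levels each (known before the model is created). It seems impossible to specify the levels for all of them in the "correct" order when calling rxPredict().
Posted on 2016-10-04 06:59:23
While just setting the factor levels (levels(predictionData$fac) <- levels(trainingData$fac)) avoids the error, it also causes the model to use the wrong factor indices, which can be seen if writeModelVars is set to TRUE. Setting colInfo in RxSqlServerData for my factor with almost 10,000 levels caused the application to hang, although the query was passed to SQL Server correctly. I changed strategy: load the data into a data frame without any factors, then apply rxFactors to it: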
rxSetComputeContext("local")
sqlPredictQueryDS <- RxSqlServerData(connectionString = sqlConnString, sqlQuery = sqlQuery, stringsAsFactors = FALSE)
predictQueryDS <- rxImport(sqlPredictQueryDS)
# allItems is presumably the vector of factor levels used when the model was trained
if ("Artikelnummer" %in% colnames(predictQueryDS)) {
  predictQueryDS <- rxFactors(predictQueryDS, factorInfo = list(Artikelnummer = list(levels = allItems)))
}
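Besides setting the required factor levels, rxFactors also reorders the factor indices. I am not saying the colInfo solution is wrong; perhaps it just does not work for factors with "too many" levels.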
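The difference between relabelling levels and re-mapping values to a level set can be seen in base R alone. This is a small sketch, not RevoScaleR code; train_levels is an assumed training-time level order, chosen to match the order used in the checkFactorLevels example above.

```r
# Assumed level order at training time (for illustration only).
train_levels <- c("two", "three", "one")

# Prediction data as a factor; its own levels are "one", "three".
pred <- factor(c("one", "three", "one", "one"))

# Relabelling: levels<- keeps the integer codes and renames the labels,
# so every "one" silently becomes "two".
relabelled <- pred
levels(relabelled) <- train_levels
as.character(relabelled)  # "two" "three" "two" "two"

# Re-mapping: rebuild the factor from the character values, so each value
# keeps its label and gets the code that matches the training levels.
remapped <- factor(as.character(pred), levels = train_levels)
as.character(remapped)    # "one" "three" "one" "one"
as.integer(remapped)      # 3 2 3 3
```

The relabelled result reproduces the suspicious two/three/two/two output shown earlier, while the re-mapped factor keeps the original values and only changes the underlying indices, which appears to be what rxFactors does here.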
https://stackoverflow.com/questions/38790530