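I am trying to analyze some data that I have, but it contains a lot of inconsistencies.
I have a SQL table that I am trying to analyze.
It is a table of universities with the following structure:
name:string, city:string, state:string, country:string
The name is always present, but city, state, and country can be missing. My main problem is that there are a lot of typos and differently spelled university names. For example, here is the breakdown for Stanford that I found when running
SELECT "universities".* FROM "perm_universities" WHERE (name like '%stanford%')
: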
stanford university - stanford - ca - united states of america
the leland stanford junior university - stanford - ca - united states of america
leland stanford jr. university - stanford - ca - united states of america
stanford university graduate school of business - stanford - ca - united states of america
the leland stanford junior university (stanford university) - stanford - ca - united states of america
leland stanford junior university - stanford - ca - united states of america
stanford university - stanford - -
leland stanford jr. university, graduate school of business - stanford - ca - united states of america
stanford law school - stanford - ca - united states of america
stanford - stanford - ca - united states of america
stanford university, graduate school of business - stanford - ca - united states of america
stanford graduate school of business - stanford - ca - united states of america
stanford univerity - stanford - ca - united states of america
stanford university (the leland stanford junior university) - stanford - ca - united states of america
the leland stanford jr. university - palo alto - ca - united states of america
leland stanford junior university, school of law - stanford - ca / n/a - united states of america
stanford universit - stanford - ca - united states of america
the leland stanford university - stanford - ca - united states of america
leland standford stanford junior university - stanford - ca - united states of america
stanford university - cambridge - ma - united states of america
the leland stanford junior university 'stanford university' - stanford - ca - united states of america
stanford university school of law - stanford - ca - united states of america
stanford univresity - stanford - ca - united states of america
the leland stanford jr. university (stanford university) - stanford - ca - united states of america
leeland stanford junior university - stanford - ca - united states of america
leland stanford junion university - - ca - united states of america
leland stanford junior university (stanford university) - stanford - ca - united states of america
the leland stanford junior university - stanford - -
stanford university - graduate school of business - stanford - ca - united states of america
graduate school of business, stanford university - stanford - ca - united states of america
stanford universoty - stanford - ca - united states of america
leland stanford junior university - stanford - -
stanford univeristy - palo alto - ca - united states of america
leland stanford university - palo alto - ca - united states of america
stanford university - stanford - ca / n/a - united states of america
the leland stanford junior university, stanford university - stanford - ca - united states of america
the leland stanford junior university graduate school of business - stanford - ca - united states of america
stanford universtiy - stanford - ca - united states of america
stanford univerisity - stanford - ca - united states of america
stanford university - stanford - ct - united states of america
stanford law scool - stanford - ca - united states of america
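mba: stanford university - stanford - ca - united states of america
These are all the same university, but some entries have typos, some use different names, some have no city, and some have the wrong city. The data is not in good shape.
So I would like to fix it. How can I merge these records?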
Answered 2016-05-09 06:27:31
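Since this dataset is already organized in a table, you can use standard SQL functions to do most of the cleaning. A record appears to consist of 4 fields, for example:
university name, city, state, country
stanford law school - stanford - ca - united states of america
You can get a cleaner representation of this dataset with the following steps: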
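Answered 2016-04-11 23:03:49
It is hard to do this without structuring the dataset first. There is a reason cleaned datasets cost thousands of dollars: they try to solve exactly these problems for you.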
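What you could try is to create a classification system first. Start by giving the general "Stanford University" the ID "1". An institution such as "Stanford Graduate School of Business" would then get the ID "1.2.5", where the "2" refers to the graduate-school division and the "5" refers to the business-school category. How far you take this really depends on your end goal. In short, set up a list of IDs for the possible branches (graduate, undergraduate, and so on), then subdivide further.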
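A minimal Python sketch of such an ID scheme follows. Only the IDs "1" and "1.2.5" come from the example above; the intermediate "1.2" node, the variant table, and the helper names `classify` and `root_id` are assumptions made for illustration, and the lookup table would have to be built by hand for real data.

```python
# Hypothetical hierarchy, following the example above:
# "1" = Stanford University, "1.2" = graduate division, "1.2.5" = business school.
INSTITUTION_IDS = {
    "1": "stanford university",
    "1.2": "stanford university, graduate schools",
    "1.2.5": "stanford graduate school of business",
}

# Hand-maintained lookup from raw name variants to an ID (assumed, not exhaustive).
VARIANT_TO_ID = {
    "stanford university": "1",
    "the leland stanford junior university": "1",
    "leland stanford jr. university": "1",
    "stanford graduate school of business": "1.2.5",
    "stanford university graduate school of business": "1.2.5",
}


def classify(raw_name):
    """Return the hierarchical ID for a raw name, or None if it is unmapped."""
    return VARIANT_TO_ID.get(raw_name.strip().lower())


def root_id(node_id):
    """Return the top-level institution ID for any node in the hierarchy."""
    return node_id.split(".")[0]
```

With this layout, grouping rows by `root_id(classify(name))` would merge every Stanford variant, while the full ID still distinguishes the business school from the university as a whole.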
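For locations, you can usually define a "primary" and a "secondary" location, i.e. "Palo Alto" and "Stanford". You can determine them by taking a histogram count of each value and picking the top two.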
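The histogram idea can be sketched in Python with `collections.Counter`. The function name `top_locations` and the sample rows are assumptions for illustration; in practice the (city, state) pairs would come from the rows already assigned to one institution.

```python
from collections import Counter


def top_locations(rows, n=2):
    """Count (city, state) pairs, skipping rows without a city,
    and return the n most frequent pairs."""
    counts = Counter((city, state) for city, state in rows if city)
    return [location for location, _ in counts.most_common(n)]


# Illustrative sample only, loosely modeled on the rows in the question.
rows = [
    ("stanford", "ca"), ("stanford", "ca"), ("stanford", "ca"),
    ("palo alto", "ca"), ("palo alto", "ca"),
    ("cambridge", "ma"),
    ("", "ca"),  # missing city, ignored by the count
]

primary, secondary = top_locations(rows)
```

Rows whose location is neither the primary nor the secondary one (such as the "cambridge - ma" entry) can then be flagged for manual review instead of being overwritten silently.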
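To correct spelling errors, you could use Google and take advantage of its "Showing results for ..." suggestion, which gives you the correct spelling.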
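If querying a search engine for every row is impractical, one offline alternative (a swapped-in technique, not what this answer proposes) is fuzzy string matching against a short list of known-good names. The sketch below uses `difflib.get_close_matches` from the Python standard library; the canonical list, the `correct_name` helper, and the 0.85 cutoff are all assumptions that would need tuning on real data.

```python
import difflib

# Assumed list of canonical spellings; it would have to be curated by hand.
CANONICAL_NAMES = [
    "stanford university",
    "stanford law school",
    "stanford graduate school of business",
]


def correct_name(raw_name, cutoff=0.85):
    """Return the closest canonical name, or None if nothing is similar enough."""
    candidate = raw_name.strip().lower()
    matches = difflib.get_close_matches(candidate, CANONICAL_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

A call such as `correct_name("stanford univerity")` is expected to map a one-letter typo onto "stanford university", but this only repairs misspellings. Aliases such as "leland stanford junior university" differ too much from the canonical spelling and still need an explicit lookup.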
https://datascience.stackexchange.com/questions/11096