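I am doing text preprocessing for text mining on a large database. I want to turn all the articles in the database into a corpus stored in an array, but it takes a very long time.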
$multiMem = memory_get_usage();
$xstart = microtime(TRUE);
$word = "";
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
$word = $word."".$data['article'];
}
$preprocess = new preprocess($word);
$word= $preprocess->preprocess($word);
print_r($word);
$xfinish = microtime(TRUE);
This is my preprocess class:
class preprocess {
var $teks;
function preprocess($teks){
/*start process segmentation*/
$teks = trim($teks);
//remove punctuation
$teks = str_replace("'", "", $teks);
$teks = str_replace("-", "", $teks);
$teks = str_replace(")", "", $teks);
$teks = str_replace("(", "", $teks);
$teks = str_replace("=", "", $teks);
$teks = str_replace(".", "", $teks);
$teks = str_replace(",", "", $teks);
$teks = str_replace(":", "", $teks);
$teks = str_replace(";", "", $teks);
$teks = str_replace("!", "", $teks);
$teks = str_replace("?", "", $teks);
//remove HTML tags
$teks = strip_tags($teks);
$teks = preg_replace('@<(\w+)\b.*?>.*?</\1>@si', '', $teks);
/*end process segmentation*/
/*start case folding*/
$teks = strtolower($teks);
$teks = preg_replace('/[0-9]+/', '', $teks);
/*end case folding*/
/*start of tokenizing*/
$teks = explode(" ", $teks);
/*end of tokenizing*/
/*start of filtering*/
//stopword
$file = file_get_contents('stopword.txt', FILE_USE_INCLUDE_PATH);
$stopword = explode("\n", $file);
//remove stopword
$teks = preg_replace('/\b('.implode('|',$stopword).')\b/','',$teks);
/*end of filtering*/
/*start of stemming*/
require_once('stemming.php');
foreach($teks as $t => $value){
$teks[$t] = stemming($value);
}
/*end of stemming*/
$teks = array_filter($teks);
$teks = array_values($teks);
return $teks;
}
}
Does anyone have an idea how to make my program process this faster? Please help.
Thanks in advance.
Posted on 2017-04-03 10:01:46
A few things might help:
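After the while loop that fills $word, you can free the query result $sql and the row data:
$word = '';
$sql = mysql_query("SELECT * FROM tbl_content");
while($data = mysql_fetch_assoc($sql)){
$word = $word.$data['article'];
}
mysql_free_result($sql);
unset($sql, $data);
The repeated str_replace calls can be written as a single call:
$teks = str_replace(array("'",'(','-',')',',','.','=',':',';','!','?'), '', $teks);
To remove the digits in the same step, either add them to the str_replace call or add the characters above to the preg_replace:
$teks = str_replace(array('0','1','2','3','4','5','6','7','8','9',"'",'(','-',')',',','.','=',':',';','!','?'), '', $teks);
or
$teks = preg_replace('/[0-9\'()=.,:;!?-]/', '', $teks);
$teks = strip_tags($teks); should be enough to remove the HTML. If it is not, use only the preg_replace below it, since the two do similar things.
Use file() instead of file_get_contents() followed by explode(), since file() returns an array directly. There is also no need to explode $teks before filtering, because preg_replace accepts a string subject:
$stopword = file('stopword.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
array_walk($stopword, function(&$item1){ $item1 = '/\b'.preg_quote($item1, '/').'\b/'; });
$teks = preg_replace($stopword, '', $teks);
Use single quotes '' instead of double quotes "" where nothing needs to be interpolated, because the processor tries to evaluate the contents of double-quoted strings, which takes longer (though the difference is usually small).
If the stopword.txt list does not change, putting it directly in the code as an array would be better and faster than reading it from the filesystem.
https://stackoverflow.com/questions/43180617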
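The suggestions above could be combined roughly as in the following sketch. It is only an illustration under some assumptions: the function name preprocessText and its signature are hypothetical, the stopword list is passed in as a plain array, and the call to the asker's own stemming() is left out so the example stays self-contained. It also swaps the regex-based stopword removal for a hash lookup on the tokens, which is a substitution on my part and not something stated in the answer.

```php
<?php
// Hypothetical helper; the name and signature are illustrative only.
function preprocessText(string $text, array $stopwords): array
{
    // Remove HTML tags first, then lower-case the text.
    $text = strtolower(strip_tags($text));

    // One pass: drop digits and the punctuation listed in the question.
    $text = preg_replace('/[0-9\'()=.,:;!?-]/', '', $text);

    // Tokenize on runs of whitespace; PREG_SPLIT_NO_EMPTY skips empty tokens.
    $tokens = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Flip the stopword list so each membership check is a hash lookup.
    $stopSet = array_flip($stopwords);

    $result = [];
    foreach ($tokens as $token) {
        if (!isset($stopSet[$token])) {
            $result[] = $token; // the asker's stemming($token) would be applied here
        }
    }
    return $result;
}

// Assumed sample stopwords; in practice this would be the content of stopword.txt.
$stopwords = ['the', 'a', 'is'];
print_r(preprocessText('<p>The cat is 3 years old, a good cat!</p>', $stopwords));
```

Processing each article inside the while loop with a helper like this, instead of concatenating every article into one huge string first, may also keep memory use lower, but that depends on how the resulting corpus is used.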