How to use lovinsstemmer imaging

Source: web crawl (WebSpider) Time: 2017-07-08 09:41 Tags: porter stemmer

PostNode.java
/*******************************************************************************
 * Copyright 2012 by the Department of Computer Science (University of Oxford)
 *
 * This file is part of LogMap.
 * LogMap is free software: you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 * LogMap is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU Lesser General Public License for more details.
 * You should have received a copy of the GNU Lesser General Public License
 * along with LogMap.
 * If not, see <http://www.gnu.org/licenses/>.
 ******************************************************************************/
package uk.ac.ox.krr.logmap2.indexing.labelling_; // package name is cut off in the listing

/**
 * Node labelled with postorder intervals. Node, Interval, descIntervals and
 * ascIntervals come from the rest of LogMap and are not shown in this listing.
 *
 * @author Anton Morant
 */
public class PostNode extends Node {

    // Left bound: smallest postorder among children; right bound: own postorder.
    private Interval descOrderInterval;
    private Interval ascOrderInterval;

    public PostNode(int classId) {
        super(classId);
        // (-1, -1) marks an interval that has not been assigned yet.
        descOrderInterval = new Interval(-1, -1);
        descIntervals.add(descOrderInterval);
        ascOrderInterval = new Interval(-1, -1);
        ascIntervals.add(ascOrderInterval);
    }

    public void setDescOrder(int postorder) {
        descOrderInterval.setRightBound(postorder);
    }

    public void setDescChildOrder(int minPostorder) {
        descOrderInterval.setLeftBound(minPostorder);
    }

    public int getDescOrder() {
        return descOrderInterval.getRightBound();
    }

    public int getDescChildOrder() {
        return descOrderInterval.getLeftBound();
    }

    public Interval getDescOrderInterval() {
        return descOrderInterval;
    }

    public void setAscOrder(int postorder) {
        ascOrderInterval.setRightBound(postorder);
    }

    public void setAscChildOrder(int minPostorder) {
        ascOrderInterval.setLeftBound(minPostorder);
    }

    public int getAscOrder() {
        return ascOrderInterval.getRightBound();
    }

    public int getAscChildOrder() {
        return ascOrderInterval.getLeftBound();
    }

    public Interval getAscOrderInterval() {
        return ascOrderInterval;
    }
}
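The setters in PostNode suggest an interval labelling scheme: each node stores its own postorder number as the right bound and the smallest postorder number below it as the left bound. The following is a minimal sketch of that idea for a plain tree. LabelledNode and PostorderLabeller are hypothetical names, not LogMap classes, and the sketch ignores the lists of intervals (descIntervals, ascIntervals) that the real class registers itself in.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative tree node; a stand-in for, not a copy of, LogMap's Node/PostNode. */
final class LabelledNode {
    final String name;
    final List<LabelledNode> children = new ArrayList<>();
    int left = -1;   // smallest postorder number in this node's subtree
    int right = -1;  // this node's own postorder number

    LabelledNode(String name) {
        this.name = name;
    }

    LabelledNode add(LabelledNode child) {
        children.add(child);
        return this;
    }

    /** True if other lies in this node's subtree (the node itself included). */
    boolean subsumes(LabelledNode other) {
        return left <= other.right && other.right <= right;
    }
}

final class PostorderLabeller {
    private int counter = 0;

    /** Labels the subtree rooted at node and returns the node's left bound. */
    int label(LabelledNode node) {
        int min = Integer.MAX_VALUE;
        for (LabelledNode child : node.children) {
            min = Math.min(min, label(child));
        }
        node.right = counter++;                 // own postorder number
        node.left = Math.min(min, node.right);  // a leaf gets left == right
        return node.left;
    }
}
```

Once the tree is labelled, a descendant test is two integer comparisons instead of a walk up the hierarchy, which is presumably why a matcher working on very large class hierarchies would index them this way.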
Download: logmap2_sources_v2.4_oct_2013.zi (file size: 3.17 MB)
LogMap: Logic-based and Scalable Ontology Matching
LogMap is a highly scalable ontology matching system with 'built-in' reasoning and inconsistency repair capabilities. LogMap extracts mappings between classes, properties and instances.
To the best of our knowledge, LogMap is one of the few matching systems that:
1. can efficiently match semantically rich ontologies containing tens (and even hundreds) of thousands of classes,
2. incorporates sophisticated reasoning and repair techniques to minimise the number of logical inconsistencies, and
3. provides support for user intervention during the matching process (see LogMap's web interface, accessible from http://csu6325.cs.ox.ac.uk:8080).
Please refer to the
for official results about LogMap.
Using LogMap
LogMap accepts the same ontology formats as the : e.g., RDF/XML, OWL/XML, OWL Functional, OBO, KRSS, and Turtle (n3).
As an Ontology Matching System
LogMap can be used from the command line with the
or the , or directly from its
LogMap can also be easily integrated in other .
As a Mapping Debugging System
LogMap can also be used as a mapping debugging system from the
or integrated in a .
Main Publications
Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Yujiao Zhou and Ian Horrocks. Large-scale Interactive Ontology Matching: Algorithms and Implementation. In the 20th European Conference on Artificial Intelligence (ECAI 2012).
Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau. LogMap: Logic-based and Scalable Ontology Matching. In the 10th International Semantic Web Conference (ISWC 2011).
Ernesto Jiménez-Ruiz, Christian Meilicke, Bernardo Cuenca Grau and Ian Horrocks. Evaluating Mapping Repair Systems with Large Biomedical Ontologies. In 26th International Workshop on Description Logics (DL 2013).
LogMap is currently developed by ,
and . Anton Morant and
have also contributed to the project in the past.
Acknowledgements
LogMap has been created in the
group at the
of the University of Oxford. Development has been supported by , the
We also thank the organisers of the
for providing test data and infrastructure.
Research and Implementation of a Retrieval System Based on Document Query Information (PDF, 62 pages)
more and more the … rapid development Internet, people rely … enormous base. Search … interface users … knowledge engines, known … use network … information, have … development … Meanwhile, with the … popularization computer increasing people's … and the relevant … technologies up … reading using computers, the digital … from documents and … growing. Obtaining query … submitting query … common of the modern information … two methods information … present, these … knowledge obtaining. At … to combine document browser … significant … obtaining separately. It … to make users … information more and … effectively.
Dawid Weiss - Polish stemmer
Dawid Weiss
Stemming engine for Polish
An update... (January 19th, 2015)
I have left academia and will no longer reply to individual e-mails requesting
support for Lametyzator. All the code and development has been moved to the
project. If
you have questions or wish to download the software, seek help there.
An update... (September 27, 2006)
A major bug has been fixed in the FSA code. All new versions and source code
repository is now available as part of the
An update... (16 August, 2006)
Added a Lametyzator constructor which allows custom dictionary stream, field
delimiters and encoding. Added an option for building stand-alone
JAR that does not include the default Polish dictionary.
An update... (26 May, 2006)
Lametyzator has been updated and includes additional API methods needed
for a cooperative effort with Marcin Miłkowski.
An update... (February 3rd, 2005)
Lametyzator has been updated: a new dictionary is available (more inflected
forms). Also, the API has changed a little bit (repackaged). An additional
bonus is the tight (but optional) integration with
Stempel. If you put Stempel
in CLASSPATH, Lametyzator will become a hybrid stemmer (also known
as Stempelator -- see
for an overview).
An update... (August 7th, 2004)
There is another freely available Polish stemmer -- this one algorithmic (so
it can stem words it doesn't know about). Please check out
-- it is definitely worth taking a look at.
A free stemming engine for the Polish language
A stemmer is usually an algorithm, or more generally a method, for reducing an inflected form
of a word to its base form. For instance, English "coincidental" would be converted to
"coincidence". While there are several such algorithms for English (Google for "Porter stemmer" or "Lovins stemmer"),
there is a significant lack of such methods for Polish. A couple of excellent stemmers exist, but they
are commercial. Because stemming usually improves the quality of text mining methods,
it is a pity that Polish researchers are left without any to experiment with.
I decided to create this project to fill this gap somehow. It is not an algorithm, as in the case
of Porter's stemmer, but a dictionary method. I used existing ispell dictionaries and flection rules
to generate pairs: inflected_form -> base_form. Then I created a very efficient representation
of such pairs as a finite state automaton, using Jan Daciuk's FSA package.
Stemming is done by traversing the automaton, which is very efficient, but of course has some drawbacks.
The most important drawback is that none of the words outside the dictionary are stemmed. I would like
to extend the stemmer somehow, so that it would accept proper names and words outside the dictionary,
but it is not easy considering complex Polish inflection rules.
The idea behind my Polish stemmer is very basic. Let's suppose we have a term A and we
are looking for its base form B. Now, if we only had a mapping of all terms TERM -> BASE FORM,
we would be set.
Problems and solutions
Despite the simplicity of the above idea, it brings a number of new problems to solve:
How to store the mapping A -> B, if we know it can contain thousands, or even millions of entries?
I used a deterministic automaton for storing the mapping. More information about
automata and an implementation
of building one out of a set of words can be found on
Jan Daciuk's web site.
My work in this area consisted of writing a
of the dictionaries produced
by Daciuk's FSA package. The entire dictionary is compressed from 44 MB to only 1.5 MB.
How to make use of the mapping efficiently?
The data structure used for storing the mapping (a deterministic automaton) can be traversed
very efficiently. In general, it takes at most N steps to determine whether a word is in the dictionary or not,
where N is the number of characters in the word.
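To make the N-step claim concrete, here is a minimal sketch using a plain trie, which is the simplest kind of deterministic automaton over words. It is an illustration only: it is not Daciuk's FSA format and does not share suffixes, so it would not reach the compression quoted above. TrieDictionary is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

/** A word set stored as a trie: one state per distinct prefix. */
final class TrieDictionary {

    private static final class State {
        final Map<Character, State> next = new HashMap<>();
        boolean accepting; // true if the path from the start state spells a stored word
    }

    private final State start = new State();

    /** Adds a word, creating missing states along its path. */
    void add(String word) {
        State s = start;
        for (int i = 0; i < word.length(); i++) {
            s = s.next.computeIfAbsent(word.charAt(i), c -> new State());
        }
        s.accepting = true;
    }

    /** Follows at most word.length() transitions, one per character. */
    boolean contains(String word) {
        State s = start;
        for (int i = 0; i < word.length(); i++) {
            s = s.next.get(word.charAt(i));
            if (s == null) {
                return false; // no transition: the word is not in the dictionary
            }
        }
        return s.accepting;
    }
}
```

The lookup cost depends only on the length of the queried word, not on how many words the dictionary holds.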
Where to take the mappings from (a dictionary of inflected forms)?
This is a difficult problem. Most dictionaries are commercial and cannot be distributed
or used in open algorithms or products. It should be stated that a simple collection of
words is not enough; the core problem lies in finding the mapping between an inflected
form and its base form. Not many dictionaries, even commercial ones, provide this information.
I decided to use the flection rules provided with the Polish dictionaries of ispell. These rules
describe how to convert a base form of a term into a set of inflected terms. In ispell this information
is used mainly for spell checking; I use it for generating the desired mappings mentioned above.
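The inversion described here, from rules that expand a base form to a table that maps each inflected form back to its base, can be sketched as follows. The rule format (strip a suffix, append another) and the names SuffixRule and MappingGenerator are assumptions made for this illustration; they are not ispell's actual affix file syntax, and the two rules in the usage note are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical, simplified flection rule: remove one suffix, append another. */
final class SuffixRule {
    final String strip;
    final String append;

    SuffixRule(String strip, String append) {
        this.strip = strip;
        this.append = append;
    }

    /** Returns the inflected form, or null if the base does not end with strip. */
    String apply(String base) {
        if (!base.endsWith(strip)) {
            return null;
        }
        return base.substring(0, base.length() - strip.length()) + append;
    }
}

final class MappingGenerator {

    /** Expands every base form with its rules and inverts the result. */
    static Map<String, Set<String>> generate(Map<String, List<SuffixRule>> lexicon) {
        Map<String, Set<String>> mapping = new TreeMap<>();
        for (Map.Entry<String, List<SuffixRule>> entry : lexicon.entrySet()) {
            String base = entry.getKey();
            // A base form is also a valid surface form of itself.
            mapping.computeIfAbsent(base, k -> new TreeSet<>()).add(base);
            for (SuffixRule rule : entry.getValue()) {
                String inflected = rule.apply(base);
                if (inflected != null) {
                    mapping.computeIfAbsent(inflected, k -> new TreeSet<>()).add(base);
                }
            }
        }
        return mapping;
    }
}
```

For instance, giving "cela" the rule (strip "a", append "i") and "cel" the rule (strip "", append "i") makes both expand to "celi", so the generated table lists two base forms for that entry.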
When will the dictionary contain all the possible terms in Polish?
Never. A language is a living thing, which constantly evolves: new words are added and inflection rules sometimes change.
The most difficult problem for the dictionary is the identification of proper names: people's names, names of everyday objects etc.
These terms also have complex inflection in Polish and the majority of them are not included in ispell's dictionaries.
Zipf's law implies that even a very large dictionary will cover only a part of any given text. For instance,
in the Rzeczpospolita corpus, comprising about 87 million words (884 thousand unique terms), as many as 613 thousand
terms occur fewer than 5 times. In other words, 69% of the unique terms in the corpus have a very low frequency.
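The 69% figure is the ratio of the two term counts quoted above: 613 thousand low-frequency terms out of 884 thousand unique terms, about 0.693. A small sketch of how such a share could be measured on any token list follows; LowFrequencyShare is a hypothetical helper, and tokenisation is assumed to have happened already.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LowFrequencyShare {

    /** Fraction of distinct terms that occur fewer than threshold times. */
    static double share(List<String> tokens, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        if (counts.isEmpty()) {
            return 0.0;
        }
        long rare = counts.values().stream().filter(c -> c < threshold).count();
        return (double) rare / counts.size();
    }
}
```

Note that this measures the share of distinct terms, not the share of running words; rare terms account for a much smaller part of the running text.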
Is there a solution for terms having more than one base form? For example, the term 'celi' may be
an inflection of the base term 'cela' (prison cell) or 'cel' (target).
Currently the dictionary returns all possible (and known) base forms of a word. Choosing the right base form is
not possible without understanding the meaning of its inflected version. Stemmers usually limit their functionality
to returning all possible base forms, which are then fed to a syntactic analyzer.
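The behaviour described in this answer can be sketched with an ordinary map standing in for the automaton. DictionaryStemmer and its methods are hypothetical names chosen for the example and are not the Lametyzator API.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Dictionary lookup that returns every known base form of a word. */
final class DictionaryStemmer {

    private final Map<String, List<String>> baseForms = new HashMap<>();

    /** Registers the base forms of one inflected form. */
    void put(String inflected, String... bases) {
        baseForms.put(inflected, List.of(bases));
    }

    /** Returns all known base forms; an empty list means the word is outside the dictionary. */
    List<String> stem(String word) {
        return baseForms.getOrDefault(word, Collections.emptyList());
    }
}
```

After `put("celi", "cela", "cel")`, a call to `stem("celi")` yields both candidates and leaves the choice to whatever component has the context to disambiguate, while an unknown word yields an empty list, which is the out-of-dictionary drawback mentioned earlier.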
As one can clearly see, my stemmer has many drawbacks. I think its main advantage is the fact that it exists.
There are many stemming engines much more sophisticated than mine, created by people much more knowledgeable in the Polish language
than myself, but none of them is, to my knowledge, available free of charge. This is understandable, since Polish stemmers
usually include quite complex dictionaries, which take a great deal of effort to build. Scientists have the right to sell them. My stemmer can be
used as a cheap alternative, for student research, or in proof-of-concept projects.
Due to low interest, the on-line demo is no longer available.
Lametyzator/Stempelator and FSA code are now part of the
The stemmer is available free of charge now and so it will be in the future. Please follow these
guidelines, however:
Please notify me by e-mail if you downloaded/ used the program. My e-mail address is:
Commercial use note: Jan Daciuk's FSA package is not free. The dictionaries I produced with the FSA are
freely distributable, however, if you want to make your own automata, please contact Jan Daciuk for permission.
Please put a reference to my name in your product, if you used my stemmer. I would also suggest putting
Jan Daciuk's name there and a reference to Polish ispell dictionary.
Todo's (future plans)
I don't claim ownership of the ideas below; if you solve them or decide to work on them, please
let me know so that I can use your results.
Perform experiments on real Polish texts to check how effective the stemmer is.
Check the coverage of the stemmer on a language corpus.
Add new words to the mapping (and to ispell, maybe).
Add an option of fuzzy analysis of terms based on suffixes present in the dictionary. Such quasi-stemmer
has already been written by me and worked quite well in . If added to this
stemmer, proper words could be more effectively recognized.
Acknowledgements
I would like to thank the following individuals for help, resources or both:
Known bugs
No software is ideal (maybe with the exception of Mr. Knuth's software :). The known limitations are listed below;
please feel free to let me know if you think anything else doesn't work as expected.
[Mirosław Prywata]
Some prefixes in ispell create a negation of a term, like społeczny -> a-społeczny.
[Mirosław Prywata]
The K rule in ispell creates a different category of word, for instance: beton -> betoniarka (concrete -> concrete mixer).
