Word segmentation and POS tagging with LingPipe

1. Download LingPipe
http://alias-i.com/lingpipe/

2. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

3. Run gui_pos_en_general_brown.bat

4. Test output
enpos.txt

<?xml version="1.0" encoding="GBK"?><output><s i="0"><token pos="np">Don</token><token pos="'">'</token><token pos="ql">t</token> <token pos="rb">ever</token> <token pos="vb">let</token> <token pos="pn">somebody</token> <token pos="vb">tell</token> <token pos="ppo">you</token> <token pos="ppss">you</token> <token pos="md">can</token><token pos="'">'</token><token pos="rbt">t</token> <token pos="do">do</token> <token pos="pn">something</token><token pos=",">,</token> <token pos="*">not</token> <token pos="vb">even</token> <token pos="ppo">me</token><token pos=".">.</token></s> 
<s i="1"><token pos="ppss">You</token> <token pos="vbd">got</token> <token pos="at">a</token> <token pos="nn">dream</token><token pos=",">,</token> <token pos="ppss">you</token> <token pos="vbn">gotta</token> <token pos="vb">protect</token> <token pos="ppo">it</token><token pos=".">.</token></s> 
<s i="2"><token pos="nns">People</token> <token pos="md">can</token><token pos="nil">’</token><token pos="nil">t</token> <token pos="do">do</token> <token pos="pn">something</token> <token pos="ppls">themselves</token><token pos=",">,</token> <token pos="ppss">they</token> <token pos="vb">wanna</token> <token pos="vb">tell</token> <token pos="ppo">you</token> <token pos="ppss">you</token> <token pos="md">can</token><token pos="nil">’</token><token pos="nil">t</token> <token pos="do">do</token> <token pos="ppo">it</token><token pos=".">.</token></s> 
<s i="3"><token pos="cs">If</token> <token pos="ppss">you</token> <token pos="vb">want</token> <token pos="pn">something</token><token pos=",">,</token> <token pos="vb">go</token> <token pos="vb">get</token> <token pos="ppo">it</token><token pos=".">.</token></s> 
<s i="4"><token pos="nn">Period</token><token pos=".">.</token></s></output>
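
The XML above is compact but hard to scan. A minimal Python sketch, assuming the output keeps the output/s/token layout shown here, that turns it into (token, tag) pairs per sentence. The function name parse_lingpipe_xml is made up for illustration and is not part of LingPipe.

```python
import re
import xml.etree.ElementTree as ET


def parse_lingpipe_xml(xml_text):
    """Return one list of (token, tag) pairs per <s> element."""
    # Drop the XML declaration so parsing does not depend on the declared
    # encoding (GBK in the dump above); the text is already decoded here.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", xml_text)
    root = ET.fromstring(body)
    return [
        [(tok.text, tok.get("pos")) for tok in sent.iter("token")]
        for sent in root.iter("s")
    ]


if __name__ == "__main__":
    sample = (
        '<?xml version="1.0" encoding="GBK"?><output><s i="4">'
        '<token pos="nn">Period</token><token pos=".">.</token></s></output>'
    )
    for sentence in parse_lingpipe_xml(sample):
        print(sentence)
```

Read enpos.txt with the encoding it was written in (GBK here) before passing the text to the function.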

Word segmentation and POS tagging with the Stanford tools

1. Download two tools: the word segmenter and the POS tagger
Stanford Word Segmenter
Stanford POS Tagger
http://nlp.stanford.edu/software/
JDK 8 must be installed.

2. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

3. Commands

segment.bat ctb zh.txt GBK 0 > zhws.txt
segment.bat ctb en.txt GBK 0 > enws.txt

stanford-postagger models/chinese-distsim.tagger zhws.txt > zhpos.txt
stanford-postagger models/english-bidirectional-distsim.tagger enws.txt > enpos.txt

4. Test output
enpos.txt

Do_VB n't_RB ever_RB let_VB somebody_NN tell_VB you_PRP you_PRP ca_MD n't_RB do_VB something_NN ,_, not_RB even_RB me_PRP ._.
You_PRP got_VBD a_DT dream_NN ,_, you_PRP got_VBD ta_RB protect_VB it_PRP ._.
People_NNS can_MD '_POS t_NN do_VBP something_NN themselves_PRP ,_, they_PRP wan_VBP na_TO tell_VB you_PRP you_PRP can_MD '_POS t_NN do_VBP it_PRP ._.
If_IN you_PRP want_VBP something_NN ,_, go_VB get_VB it_PRP ._.
Period_NN ._.
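
In the output above, the straight apostrophe in "can't" is split as ca / n't, while the typographic apostrophe in "can’t" comes out as can / ' / t. One possible fix is to normalize the apostrophes before tagging. The sketch below is an assumption about a useful preprocessing step, not something the Stanford tools require, and the helper name is hypothetical.

```python
def normalize_apostrophes(text):
    """Replace typographic single quotes with the ASCII apostrophe."""
    # U+2019 (right single quote) is what appears in "can’t" in en.txt;
    # U+2018 (left single quote) is handled too for symmetry.
    return text.replace("\u2019", "'").replace("\u2018", "'")


if __name__ == "__main__":
    line = "People can\u2019t do something themselves, they wanna tell you you can\u2019t do it."
    print(normalize_apostrophes(line))
```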

zhpos.txt

别#AD 让#VV 别人#NN 告诉#VV 你#PN 你#PN 成不了#AD 才#AD ,#PU 即使#CS 是#VC 我#PN 也#AD 不#AD 行#VV 。#PU
如果#CS 你#PN 有#VE 梦想#NN 的话#SP ,#PU 就要#AD 去#VV 捍卫#VV 它#PN 。#PU
那些#DT 一事无成#VV 的#DEC 人#NN 想#VV 告诉#VV 你#PN 你#PN 也#AD 成#VV 不了#AD 大器#NN 。#PU
如果#CS 你#PN 有理想#VV 的话#SP ,#PU 就要#AD 去#VV 努力#AD 实现#VV 。#PU
就#AD 这样#VA 。#PU
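
The two result files use different separators between a word and its tag: "_" in enpos.txt and "#" in zhpos.txt. A small Python sketch, under the assumption that every item has the form word, separator, tag, that reads either format. parse_tagged_line is a name chosen for this example.

```python
def parse_tagged_line(line, sep):
    """Split a line of word<sep>TAG items into (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST separator, so tokens that are
        # themselves the separator (for example "_" tagged as "_") still work.
        word, found, tag = item.rpartition(sep)
        if not found:
            raise ValueError("no separator %r in item %r" % (sep, item))
        pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    print(parse_tagged_line("Period_NN ._.", "_"))
    print(parse_tagged_line("就#AD 这样#VA 。#PU", "#"))
```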

Word segmentation and POS tagging with OpenNLP

1. Download OpenNLP
http://opennlp.apache.org/maven-dependency.html

2. Download the model files
http://opennlp.sourceforge.net/models-1.5/

3. Tokenize and tag in code

package com.neohope.opennlp.test;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TestIt {

	@SuppressWarnings("deprecation")
	public static void POSTag() throws IOException {
		POSModel model = new POSModelLoader()
				.load(new File("en-pos-maxent.bin"));
		PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
		POSTaggerME tagger = new POSTaggerME(model);

		String input = "Don't ever let somebody tell you you can't do something, not even me. "
				+ "You got a dream, you gotta protect it. "
				+ "People can’t do something themselves, they wanna tell you you can’t do it. "
				+ "If you want something, go get it. " + "Period.";
		ObjectStream<String> lineStream = new PlainTextByLineStream(
				new StringReader(input));

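		perfMon.start();
		String line;
		// The input contains no line breaks, so the stream yields a single line.
		while ((line = lineStream.read()) != null) {

			// Split on whitespace only; punctuation stays attached to the words.
			String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE
					.tokenize(line);
			// Predict one tag per token.
			String[] tags = tagger.tag(whitespaceTokenizerLine);

			// POSSample.toString() renders the pairs as token_TAG.
			POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
			System.out.println(sample.toString());

			perfMon.incrementCounter();
		}
		perfMon.stopAndPrintFinalResult();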
	}

	public static void main(String[] args) throws IOException {
		POSTag();
	}
}

4. Output

Don't_NNP ever_RB let_VB somebody_NN tell_VB you_PRP you_PRP can't_MD do_VB something,_RB not_RB even_RB me._RBR You_PRP got_VBD a_DT dream,_NN you_PRP gotta_VBP protect_VB it._PRP People_NNS can’t_MD do_VB something_NN themselves,_, they_PRP wanna_MD tell_VB you_PRP you_PRP can’t_MD do_VB it._PRP If_IN you_PRP want_VBP something,_NN go_VB get_VB it._PRP Period._.
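
Tokens such as "something,_RB" and "Period._." appear because the code splits on whitespace only, so punctuation is tagged together with the preceding word. The Python sketch below contrasts the two behaviours. The regular expression is an illustrative stand-in, not OpenNLP's own tokenizer.

```python
import re


def whitespace_tokenize(text):
    """Split on runs of whitespace, like a whitespace-only tokenizer."""
    return text.split()


def punctuation_aware_tokenize(text):
    """Keep words (with internal apostrophes) and emit punctuation separately."""
    return re.findall(r"\w+(?:['\u2019]\w+)*|[^\w\s]", text)


if __name__ == "__main__":
    sentence = "If you want something, go get it."
    print(whitespace_tokenize(sentence))
    print(punctuation_aware_tokenize(sentence))
```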

Word segmentation and POS tagging with ICTCLAS

1. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

2. Output
enout.txt

Don/n  't/wyy   ever/d   let/v   somebody/r   tell/v   you/rzt   you/rzt   can/vyou   't/wyy   do/vyou   something/r   ,/wd   not/d   even/d   me/rzt   ./wj   
You/rzt   got/v   a/rzv   dream/n   ,/wd   you/rzt   gotta/vd   protect/v   it/rzt   ./wj   
People/n   can’t/n   do/vyou   something/r   themselves/rzv   ,/wd   they/rzt   wanna/v   tell/v   you/rzt   you/rzt   can’t/n   do/vyou   it/rzt   ./wj   
If/c   you/rzt   want/v   something/r   ,/wd   go/v   get/v   it/rzt   ./wj   
Period/n   ./wj   

zhout.txt

别/d   让/v   别人/rr   告诉/v   你/rr   你/rr   成/v   不/d   了/ule   才/d   ,/wd   即使/c   是/vshi   我/rr   也/d   不行/a   。/wj   
如果/c   你/rr   有/vyou   梦想/n   的/ude1   话/n   ,/wd   就要/d   去/vf   捍卫/v   它/rr   。/wj   
那些/rz   一事无成/vl   的/ude1   人/n   想/v   告诉/v   你/rr   你/rr   也/d   成/v   不/d   了/ule   大器/n   。/wj   
如果/c   你/rr   有/vyou   理想/a   的/ude1   话/n   ,/wd   就要/d   去/vf   努力/ad   实现/v   。/wj   
就/p   这样/rzv   。/wj   
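
To get a quick overview of which tags a run produced, the word/tag items can be counted. A minimal Python sketch, assuming the whitespace-separated word/tag layout shown above; count_tags is a hypothetical helper name.

```python
from collections import Counter


def count_tags(lines):
    """Count how often each tag occurs in lines of word/tag items."""
    counts = Counter()
    for line in lines:
        for item in line.split():
            # Split on the last "/" so a word containing "/" keeps its tag.
            _, found, tag = item.rpartition("/")
            if found:
                counts[tag] += 1
    return counts


if __name__ == "__main__":
    sample = ["就/p   这样/rzv   。/wj   ", "Period/n   ./wj   "]
    for tag, n in count_tags(sample).most_common():
        print(tag, n)
```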

Word segmentation and POS tagging with FudanDNN

1. Download and unpack
http://homepage.fudan.edu.cn/zhengxq/deeplearning/

2. Test class
cn.edu.fudan.flow.PosTaggerStart

3. Test input
zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

4. Test output
zhout.txt

别让/V 别人/PN 告诉/V 你/PN 你/PN 成/V 不/AD 了/V 才/NN ,/PU 即使/C 是/V 我/PN 也/AD 不行/V 。/PU 
如果/C 你/PN 有/V 梦想/NN 的/U 话/NN ,/PU 就要/AD 去/V 捍卫/V 它/PN 。/PU 
那些/PN 一事无成/I 的/U 人/NN 想/V 告诉/V 你/PN 你/PN 也/AD 成/V 不/AD 了/V 大器/NN 。/PU 
如果/C 你/PN 有/V 理想/JJ 的/U 话/NN ,/PU 就要/AD 去/V 努力/JJ 实现/V 。/PU

LTP tag reference tables

1. POS tag set

Tag   Description          Example
a     adjective            美丽
b     other noun-modifier  大型, 西式
c     conjunction          和, 虽然
d     adverb
e     exclamation
g     morpheme             茨, 甥
h     prefix               阿, 伪
i     idiom                百花齐放
j     abbreviation         公检法
k     suffix               界, 率
m     number               一, 第一
n     general noun         苹果
nd    direction noun       右侧
nh    person name          杜甫, 汤姆
ni    organization name    保险公司
nl    location noun        城郊
ns    geographical name    北京
nt    temporal noun        近日, 明代
nz    other proper noun    诺贝尔奖
o     onomatopoeia         哗啦
p     preposition          在, 把
q     quantity
r     pronoun              我们
u     auxiliary            的, 地
v     verb                 跑, 学习
wp    punctuation          ,。!
ws    foreign words        CPU
x     non-lexeme           萄, 翱
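
When reading LTP output, it helps to have the POS tag set available as a lookup table. The Python sketch below copies the tags and descriptions from the table above. The names LTP_POS_TAGS and describe_tags are invented for this example and are not part of LTP.

```python
# Tag descriptions copied from the POS tag table above.
LTP_POS_TAGS = {
    "a": "adjective", "b": "other noun-modifier", "c": "conjunction",
    "d": "adverb", "e": "exclamation", "g": "morpheme", "h": "prefix",
    "i": "idiom", "j": "abbreviation", "k": "suffix", "m": "number",
    "n": "general noun", "nd": "direction noun", "nh": "person name",
    "ni": "organization name", "nl": "location noun",
    "ns": "geographical name", "nt": "temporal noun",
    "nz": "other proper noun", "o": "onomatopoeia", "p": "preposition",
    "q": "quantity", "r": "pronoun", "u": "auxiliary", "v": "verb",
    "wp": "punctuation", "ws": "foreign words", "x": "non-lexeme",
}


def describe_tags(pairs):
    """Attach a readable description to each (word, tag) pair."""
    return [(word, tag, LTP_POS_TAGS.get(tag, "unknown")) for word, tag in pairs]


if __name__ == "__main__":
    for row in describe_tags([("就", "d"), ("这样", "r"), ("。", "wp")]):
        print(row)
```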

2. Dependency relation labels

Relation type Tag Description Example
主谓关系 SBV subject-verb 我送她一束花 (我 <-- 送)
动宾关系 VOB verb-object (direct object) 我送她一束花 (送 --> 花)
间宾关系 IOB indirect-object 我送她一束花 (送 --> 她)
前置宾语 FOB fronting-object 他什么书都读 (书 <-- 读)
兼语 DBL double 他请我吃饭 (请 --> 我)
定中关系 ATT attribute 红苹果 (红 <-- 苹果)
状中结构 ADV adverbial 非常美丽 (非常 <-- 美丽)
动补结构 CMP complement 做完了作业 (做 --> 完)
并列关系 COO coordinate 大山和大海 (大山 --> 大海)
介宾关系 POB preposition-object 在贸易区内 (在 --> 内)
左附加关系 LAD left adjunct 大山和大海 (和 <-- 大海)
右附加关系 RAD right adjunct 孩子们 (孩子 --> 们)
独立结构 IS independent structure two clauses that are structurally independent of each other
核心关系 HED head the head of the whole sentence

3. Semantic role labels

Tag Description
A0 agent of the action
A1 entity affected by the action
A2-A5 depends on the predicate
ADV adverbial, default tag
BNE beneficiary
CND condition
DIR direction
DGR degree
EXT extent
FRQ frequency
LOC locative
MNR manner
PRP purpose or reason
TMP temporal
TPC topic
CRD coordinated arguments
PRD predicate
PSR possessor
PSE possessee

4. Semantic dependency relation labels

Relation type Tag Description Example
施事关系 Agt Agent 我送她一束花 (我 <-- 送)
当事关系 Exp Experiencer 我跑得快 (跑 --> 我)
感事关系 Aft Affection 我思念家乡 (思念 --> 我)
领事关系 Poss Possessor 他有一本好书 (他 <-- 有)
受事关系 Pat Patient 他打了小明 (打 --> 小明)
客事关系 Cont Content 他听到鞭炮声 (听 --> 鞭炮声)
成事关系 Prod Product 他写了本小说 (写 --> 小说)
源事关系 Orig Origin 我军缴获敌人四辆坦克 (缴获 --> 坦克)
涉事关系 Datv Dative 他告诉我个秘密 (告诉 --> 我)
比较角色 Comp Comitative 他成绩比我好 (他 --> 我)
属事角色 Belg Belongings 老赵有俩女儿 (老赵 <-- 有)
类事角色 Clas Classification 他是中学生 (是 --> 中学生)
依据角色 Accd According 本庭依法宣判 (依法 <-- 宣判)
缘故角色 Reas Reason 他在愁女儿婚事 (愁 --> 婚事)
意图角色 Int Intention 为了金牌他拼命努力 (金牌 <-- 努力)
结局角色 Cons Consequence 他跑了满头大汗 (跑 --> 满头大汗)
方式角色 Mann Manner 球慢慢滚进空门 (慢慢 <-- 滚)
工具角色 Tool Tool 她用砂锅熬粥 (砂锅 <-- 熬粥)
材料角色 Malt Material 她用小米熬粥 (小米 <-- 熬粥)
时间角色 Time Time 唐朝有个李白 (唐朝 <-- 有)
空间角色 Loc Location 这房子朝南 (朝 --> 南)
历程角色 Proc Process 火车正在过长江大桥 (过 --> 大桥)
趋向角色 Dir Direction 部队奔向南方 (奔 --> 南)
范围角色 Sco Scope 产品应该比质量 (比 --> 质量)
数量角色 Quan Quantity 一年有365天 (有 --> 天)
数量数组 Qp Quantity-phrase 三本书 (三 --> 本)
频率角色 Freq Frequency 他每天看书 (每天 <-- 看)
顺序角色 Seq Sequence 他跑第一 (跑 --> 第一)
描写角色 Desc(Feat) Description 他长得胖 (长 --> 胖)
宿主角色 Host Host 住房面积 (住房 <-- 面积)
名字修饰角色 Nmod Name-modifier 果戈里大街 (果戈里 <-- 大街)
时间修饰角色 Tmod Time-modifier 星期一上午 (星期一 <-- 上午)
反角色 r + main role 打篮球的小姑娘 (打篮球 <-- 姑娘)
嵌套角色 d + main role 爷爷看见孙子在跑 (看见 --> 跑)
并列关系 eCoo event Coordination 我喜欢唱歌和跳舞 (唱歌 --> 跳舞)
选择关系 eSelt event Selection 您是喝茶还是喝咖啡 (茶 --> 咖啡)
等同关系 eEqu event Equivalent 他们三个人一起走 (他们 --> 三个人)
先行关系 ePrec event Precedent 首先,先
顺承关系 eSucc event Successor 随后,然后
递进关系 eProg event Progression 况且,并且
转折关系 eAdvt event adversative 却,然而
原因关系 eCau event Cause 因为,既然
结果关系 eResu event Result 因此,以致
推论关系 eInf event Inference 才,则
条件关系 eCond event Condition 只要,除非
假设关系 eSupp event Supposition 如果,要是
让步关系 eConc event Concession 纵使,哪怕
手段关系 eMetd event Method
目的关系 ePurp event Purpose 为了,以便
割舍关系 eAban event Abandonment 与其,也不
选取关系 ePref event Preference 不如,宁愿
总括关系 eSum event Summary 总而言之
分叙关系 eRect event Recount 例如,比方说
连词标记 mConj Recount Marker 和,或
的字标记 mAux Auxiliary 的,地,得
介词标记 mPrep Preposition 把,被
语气标记 mTone Tone 吗,呢
时间标记 mTime Time 才,曾经
范围标记 mRang Range 都,到处
程度标记 mDegr Degree 很,稍微
频率标记 mFreq Frequency Marker 再,常常
趋向标记 mDir Direction Marker 上去,下来
插入语标记 mPars Parenthesis Marker 总的来说,众所周知
否定标记 mNeg Negation Marker 不,没,未
情态标记 mMod Modal Marker 幸亏,会,能
标点标记 mPunc Punctuation Marker ,。!
重复标记 mPept Repetition Marker 走啊走 (走 --> 走)
多数标记 mMaj Majority Marker 们,等
实词虚化标记 mVain Vain Marker
离合标记 mSepa Separation Marker 吃了个饭 (吃 --> 饭) 洗了个澡 (洗 --> 澡)
根节点 Root Root core node of the whole sentence

Word segmentation and POS tagging with LTP

1. Download and unpack
https://github.com/HIT-SCIR/ltp/releases
Note that the output is UTF-8 by default, so it appears garbled when printed to the console.

2. Usage
Documentation: http://ltp.readthedocs.org/zh_CN/latest/install.html

ltp_test in LTP 3.3.1 - (C) 2012-2015 HIT-SCIR
The console application for Language Technology Platform.

usage: ./ltp_test <options>

options:
  --threads arg           The number of threads [default=1].
  --last-stage arg        The last stage of analysis. This option can be used
                          when the user onlywants to perform early stage
                          analysis, like only segment without postagging.value
                          includes:
                          - ws: Chinese word segmentation
                          - pos: Part of speech tagging
                          - ner: Named entity recognization
                          - dp: Dependency parsing
                          - srl: Semantic role labeling (equals to all)
                          - all: The whole pipeline [default]
  --input arg             The path to the input file.
  --segmentor-model arg   The path to the segment model
                          [default=ltp_data/cws.model].
  --segmentor-lexicon arg The path to the external lexicon in segmentor
                          [optional].
  --postagger-model arg   The path to the postag model
                          [default=ltp_data/pos.model].
  --postagger-lexicon arg The path to the external lexicon in postagger
                          [optional].
  --ner-model arg         The path to the NER model [default=ltp_data/ner.model
                          ].
  --parser-model arg      The path to the parser model
                          [default=ltp_data/parser.model].
  --srl-data arg          The path to the SRL model directory
                          [default=ltp_data/srl_data/].
  --debug-level arg       The debug level.
  -h [ --help ]           Show help information

3. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

4. Commands

ltp_test --input en.txt > enout.txt
ltp_test --input zh.txt > zhout.txt

5. Test output
enout.txt

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="Don&apos;t ever let somebody tell you you can&apos;t do something, not even me.">
                <word id="0" cont="Don&apos;t" pos="ws" ne="O" parent="1" relate="ATT" />
                <word id="1" cont="ever" pos="ws" ne="O" parent="7" relate="ATT" />
                <word id="2" cont="let" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="3" cont="somebody" pos="ws" ne="O" parent="4" relate="ATT" />
                <word id="4" cont="tell" pos="ws" ne="O" parent="5" relate="ATT" />
                <word id="5" cont="you" pos="ws" ne="O" parent="6" relate="ATT" />
                <word id="6" cont="you" pos="ws" ne="O" parent="7" relate="ATT" />
                <word id="7" cont="can&apos;t" pos="ws" ne="O" parent="8" relate="ATT" />
                <word id="8" cont="do" pos="ws" ne="O" parent="9" relate="ATT" />
                <word id="9" cont="something" pos="ws" ne="O" parent="-1" relate="HED" />
                <word id="10" cont="," pos="wp" ne="O" parent="9" relate="WP" />
                <word id="11" cont="not" pos="ws" ne="O" parent="13" relate="ATT" />
                <word id="12" cont="even" pos="ws" ne="O" parent="13" relate="ATT" />
                <word id="13" cont="me" pos="ws" ne="O" parent="9" relate="COO" />
                <word id="14" cont="." pos="wp" ne="O" parent="13" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="You got a dream, you gotta protect it.">
                <word id="0" cont="You" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="1" cont="got" pos="ws" ne="O" parent="2" relate="ATT" />
                <word id="2" cont="a" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="3" cont="dream" pos="ws" ne="O" parent="-1" relate="HED" />
                <word id="4" cont="," pos="wp" ne="O" parent="3" relate="WP" />
                <word id="5" cont="you" pos="ws" ne="O" parent="8" relate="ATT" />
                <word id="6" cont="gotta" pos="ws" ne="O" parent="8" relate="ATT" />
                <word id="7" cont="protect" pos="ws" ne="O" parent="8" relate="ATT" />
                <word id="8" cont="it" pos="ws" ne="O" parent="3" relate="COO" />
                <word id="9" cont="." pos="wp" ne="O" parent="3" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="People can’t do something themselves, they wanna tell you you can’t do it.">
                <word id="0" cont="People" pos="ws" ne="O" parent="1" relate="ATT" />
                <word id="1" cont="can" pos="ws" ne="O" parent="-1" relate="HED" />
                <word id="2" cont="’" pos="wp" ne="O" parent="6" relate="WP" />
                <word id="3" cont="t" pos="ws" ne="O" parent="4" relate="ATT" />
                <word id="4" cont="do" pos="ws" ne="O" parent="5" relate="ATT" />
                <word id="5" cont="something" pos="ws" ne="O" parent="6" relate="ATT" />
                <word id="6" cont="themselves" pos="ws" ne="O" parent="1" relate="COO" />
                <word id="7" cont="," pos="wp" ne="O" parent="6" relate="WP" />
                <word id="8" cont="they" pos="ws" ne="O" parent="9" relate="ATT" />
                <word id="9" cont="wanna" pos="ws" ne="O" parent="10" relate="ATT" />
                <word id="10" cont="tell" pos="ws" ne="O" parent="11" relate="ATT" />
                <word id="11" cont="you" pos="ws" ne="O" parent="12" relate="ATT" />
                <word id="12" cont="you" pos="ws" ne="O" parent="13" relate="ATT" />
                <word id="13" cont="can" pos="ws" ne="O" parent="6" relate="COO" />
                <word id="14" cont="’" pos="wp" ne="O" parent="17" relate="WP" />
                <word id="15" cont="t" pos="ws" ne="O" parent="17" relate="ATT" />
                <word id="16" cont="do" pos="ws" ne="O" parent="17" relate="ATT" />
                <word id="17" cont="it" pos="ws" ne="O" parent="13" relate="COO" />
                <word id="18" cont="." pos="wp" ne="O" parent="17" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="If you want something, go get it.">
                <word id="0" cont="If" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="1" cont="you" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="2" cont="want" pos="ws" ne="O" parent="3" relate="ATT" />
                <word id="3" cont="something" pos="ws" ne="O" parent="-1" relate="HED" />
                <word id="4" cont="," pos="wp" ne="O" parent="3" relate="WP" />
                <word id="5" cont="go" pos="ws" ne="O" parent="7" relate="ATT" />
                <word id="6" cont="get" pos="ws" ne="O" parent="7" relate="ATT" />
                <word id="7" cont="it" pos="ws" ne="O" parent="3" relate="COO" />
                <word id="8" cont="." pos="wp" ne="O" parent="3" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="Period.">
                <word id="0" cont="Period" pos="ws" ne="O" parent="-1" relate="HED" />
                <word id="1" cont="." pos="wp" ne="O" parent="0" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

zhout.txt

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="别让别人告诉你你成不了才,即使是我也不行。">
                <word id="0" cont="别" pos="d" ne="O" parent="1" relate="ADV" />
                <word id="1" cont="让" pos="v" ne="O" parent="-1" relate="HED">
                    <arg id="0" type="ù" beg="0" end="0" />
                    <arg id="1" type="" beg="2" end="2" />
                    <arg id="2" type="" beg="3" end="9" />
                </word>
                <word id="2" cont="别人" pos="r" ne="O" parent="1" relate="DBL" />
                <word id="3" cont="告诉" pos="v" ne="O" parent="1" relate="VOB">
                    <arg id="0" type="R&#x07;" beg="4" end="4" />
                    <arg id="1" type="" beg="5" end="9" />
                </word>
                <word id="4" cont="你" pos="r" ne="O" parent="3" relate="IOB" />
                <word id="5" cont="你" pos="r" ne="O" parent="6" relate="SBV" />
                <word id="6" cont="成" pos="v" ne="O" parent="3" relate="VOB">
                    <arg id="0" type="ˆ&#x07;" beg="5" end="5" />
                    <arg id="1" type="" beg="9" end="9" />
                </word>
                <word id="7" cont="不" pos="d" ne="O" parent="8" relate="ADV" />
                <word id="8" cont="了" pos="v" ne="O" parent="6" relate="CMP" />
                <word id="9" cont="才" pos="n" ne="O" parent="6" relate="VOB" />
                <word id="10" cont="," pos="wp" ne="O" parent="1" relate="WP" />
                <word id="11" cont="即使" pos="c" ne="O" parent="12" relate="ADV" />
                <word id="12" cont="是" pos="v" ne="O" parent="1" relate="COO">
                    <arg id="0" type="|&#x1C;S" beg="11" end="11" />
                </word>
                <word id="13" cont="我" pos="r" ne="O" parent="15" relate="SBV" />
                <word id="14" cont="也" pos="d" ne="O" parent="15" relate="ADV" />
                <word id="15" cont="不行" pos="a" ne="O" parent="12" relate="VOB">
                    <arg id="0" type="‘&#x07;" beg="13" end="13" />
                    <arg id="1" type="" beg="14" end="14" />
                </word>
                <word id="16" cont="。" pos="wp" ne="O" parent="1" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="如果你有梦想的话,就要去捍卫它。">
                <word id="0" cont="如果" pos="c" ne="O" parent="2" relate="ADV" />
                <word id="1" cont="你" pos="r" ne="O" parent="2" relate="SBV" />
                <word id="2" cont="有" pos="v" ne="O" parent="-1" relate="HED">
                    <arg id="0" type="&#x02;" beg="0" end="0" />
                    <arg id="1" type="" beg="1" end="1" />
                    <arg id="2" type="" beg="3" end="3" />
                    <arg id="3" type="" beg="5" end="5" />
                </word>
                <word id="3" cont="梦想" pos="n" ne="O" parent="2" relate="VOB" />
                <word id="4" cont="的" pos="u" ne="O" parent="2" relate="RAD" />
                <word id="5" cont="话" pos="n" ne="O" parent="2" relate="VOB" />
                <word id="6" cont="," pos="wp" ne="O" parent="2" relate="WP" />
                <word id="7" cont="就要" pos="d" ne="O" parent="9" relate="ADV" />
                <word id="8" cont="去" pos="v" ne="O" parent="9" relate="ADV" />
                <word id="9" cont="捍卫" pos="v" ne="O" parent="2" relate="COO">
                    <arg id="0" type="£&#x07;V" beg="7" end="7" />
                    <arg id="1" type="" beg="10" end="10" />
                </word>
                <word id="10" cont="它" pos="r" ne="O" parent="9" relate="VOB" />
                <word id="11" cont="。" pos="wp" ne="O" parent="2" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="那些一事无成的人想告诉你你也成不了大器。">
                <word id="0" cont="那些" pos="r" ne="O" parent="3" relate="ATT" />
                <word id="1" cont="一事无成" pos="i" ne="O" parent="3" relate="ATT" />
                <word id="2" cont="的" pos="u" ne="O" parent="1" relate="RAD" />
                <word id="3" cont="人" pos="n" ne="O" parent="4" relate="SBV" />
                <word id="4" cont="想" pos="v" ne="O" parent="-1" relate="HED">
                    <arg id="0" type="‘&#x07;" beg="0" end="3" />
                    <arg id="1" type="" beg="5" end="12" />
                </word>
                <word id="5" cont="告诉" pos="v" ne="O" parent="4" relate="VOB">
                    <arg id="0" type="â&#x07;" beg="6" end="6" />
                    <arg id="1" type="" beg="7" end="12" />
                </word>
                <word id="6" cont="你" pos="r" ne="O" parent="5" relate="IOB" />
                <word id="7" cont="你" pos="r" ne="O" parent="9" relate="SBV" />
                <word id="8" cont="也" pos="d" ne="O" parent="9" relate="ADV" />
                <word id="9" cont="成" pos="v" ne="O" parent="5" relate="VOB">
                    <arg id="0" type="ù" beg="7" end="7" />
                    <arg id="1" type="" beg="8" end="8" />
                    <arg id="2" type="" beg="12" end="12" />
                </word>
                <word id="10" cont="不" pos="d" ne="O" parent="11" relate="ADV" />
                <word id="11" cont="了" pos="v" ne="O" parent="9" relate="CMP" />
                <word id="12" cont="大器" pos="n" ne="O" parent="9" relate="VOB" />
                <word id="13" cont="。" pos="wp" ne="O" parent="4" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="如果你有理想的话,就要去努力实现。">
                <word id="0" cont="如果" pos="c" ne="O" parent="2" relate="ADV" />
                <word id="1" cont="你" pos="r" ne="O" parent="2" relate="SBV" />
                <word id="2" cont="有" pos="v" ne="O" parent="-1" relate="HED">
                    <arg id="0" type="&#x13;&#x01;S" beg="0" end="0" />
                    <arg id="1" type="" beg="1" end="1" />
                    <arg id="2" type="" beg="3" end="3" />
                </word>
                <word id="3" cont="理想" pos="n" ne="O" parent="2" relate="VOB" />
                <word id="4" cont="的话" pos="u" ne="O" parent="2" relate="RAD" />
                <word id="5" cont="," pos="wp" ne="O" parent="2" relate="WP" />
                <word id="6" cont="就要" pos="d" ne="O" parent="9" relate="ADV">
                    <arg id="0" type="" beg="4" end="4" />
                </word>
                <word id="7" cont="去" pos="v" ne="O" parent="9" relate="ADV">
                    <arg id="0" type="³&#x03;D" beg="4" end="4" />
                    <arg id="1" type="" beg="6" end="6" />
                </word>
                <word id="8" cont="努力" pos="a" ne="O" parent="9" relate="ADV" />
                <word id="9" cont="实现" pos="v" ne="O" parent="2" relate="COO">
                    <arg id="0" type=" &#x01;D" beg="4" end="4" />
                    <arg id="1" type="" beg="6" end="6" />
                    <arg id="2" type="" beg="8" end="8" />
                </word>
                <word id="10" cont="。" pos="wp" ne="O" parent="2" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>

<?xml version="1.0" encoding="utf-8" ?>
<xml4nlp>
    <note sent="y" word="y" pos="y" ne="y" parser="y" wsd="n" srl="y" />
    <doc>
        <para id="0">
            <sent id="0" cont="就这样。">
                <word id="0" cont="就" pos="d" ne="O" parent="1" relate="ADV" />
                <word id="1" cont="这样" pos="r" ne="O" parent="-1" relate="HED" />
                <word id="2" cont="。" pos="wp" ne="O" parent="1" relate="WP" />
            </sent>
        </para>
    </doc>
</xml4nlp>
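
The result files above are not single XML documents: each sentence is written as its own xml4nlp document, one after another. A Python sketch, assuming this layout, that splits the file on the XML declaration and collects the word attributes. It removes the arg elements first, because the type attributes in this dump contain control-character references that XML 1.0 parsers reject. parse_ltp_output is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET


def parse_ltp_output(text):
    """Return one list of (word, pos, parent, relate) tuples per <sent>."""
    # Remove <arg .../> elements; their garbled type attributes would
    # otherwise make the parser fail.
    text = re.sub(r"<arg\b[^>]*/>", "", text)
    sentences = []
    # Split the concatenated documents on the XML declaration.
    for chunk in re.split(r"<\?xml[^>]*\?>", text):
        chunk = chunk.strip()
        if not chunk:
            continue
        root = ET.fromstring(chunk)
        for sent in root.iter("sent"):
            sentences.append([
                (w.get("cont"), w.get("pos"), int(w.get("parent")), w.get("relate"))
                for w in sent.iter("word")
            ])
    return sentences


if __name__ == "__main__":
    with open("zhout.txt", encoding="utf-8") as handle:
        for sentence in parse_ltp_output(handle.read()):
            print(sentence)
```

The parent value is the id of the head word within the same sentence, and -1 marks the root (relate="HED").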

Word segmentation and POS tagging with NiuParser

1. Download and unpack
http://www.niuparser.com/

2. Usage

[USAGE]
         NiuParser-v1.3.0-mt-win.exe    <Action>        <OPTIONS>
[ACTION]
        --WS    :  Word Segmentation.
        --POS   :  Part-Of-Speech Tagging.
        --NER   :  Named Entity Recognition.
        --CHK   :  Chunking (shallow syntactic parsing).
        --CP    :  Constituent Parser.
        --DP    :  Dependency Parser.
        --SRL   :  Semantic Role Label.
[OPITION]
>>   Get Options of Word Segmentation
                 NiuParser-v1.3.0-mt-win.exe    --WS
>>   Get Options of POS Tagging
                 NiuParser-v1.3.0-mt-win.exe    --POS
>>   Get Options of Named Entity Recognition
                 NiuParser-v1.3.0-mt-win.exe    --NER
>>   Get Options of Base Phrase Chunking
                 NiuParser-v1.3.0-mt-win.exe    --CHK
>>   Get Options of Constituent Parser
                 NiuParser-v1.3.0-mt-win.exe    --CP
>>   Get Options of Dependency Parser
                 NiuParser-v1.3.0-mt-win.exe    --DP

3. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

4. Commands

NiuParser-v1.3.0-mt-win.exe --WS -c niuparser.config -in en.txt -out enws.txt
NiuParser-v1.3.0-mt-win.exe --POS -c niuparser.config -in enws.txt -out enpos.txt

NiuParser-v1.3.0-mt-win.exe --WS -c niuparser.config -in zh.txt -out zhws.txt
NiuParser-v1.3.0-mt-win.exe --POS -c niuparser.config -in zhws.txt -out zhpos.txt
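
The commands above form a two-step pipeline: segmentation writes a file that POS tagging then reads. A Python sketch that builds and runs the same two commands for any input file. The flags are the ones used above; the function names and the stem naming scheme (en gives enws.txt and enpos.txt) are assumptions made for this example.

```python
import subprocess


def niuparser_commands(exe, config, src, stem):
    """Build the --WS and --POS command lines for one input file."""
    seg_file = stem + "ws.txt"
    pos_file = stem + "pos.txt"
    return [
        [exe, "--WS", "-c", config, "-in", src, "-out", seg_file],
        [exe, "--POS", "-c", config, "-in", seg_file, "-out", pos_file],
    ]


def run_pipeline(exe, config, src, stem):
    """Run segmentation, then tagging; stop at the first failing step."""
    for command in niuparser_commands(exe, config, src, stem):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in niuparser_commands(
            "NiuParser-v1.3.0-mt-win.exe", "niuparser.config", "zh.txt", "zh"):
        print(" ".join(command))
```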

5. Test output
enpos.txt

Don't/NR ever/NN let/VV somebody/NR tell/NR you/NR you/NR can't/NN d/NN o/VV something/JJ ,/NN not/VV even/NR me./NN 
You/NR go/NN t/NN a/AD dream,/VV you/NR gotta/NR protect/NN it./NN 
People/NR can/NR ’/PU t/NN d/NN o/VV something/JJ t/NN hemselves/NN ,/PU the/DT y/NN wanna/NR tell/NR you/NR you/NR can/NR ’/PU t/NN do/VV it./NN 
If/NR you/NR want/VV something/JJ ,/NN go/NN get/VV it./NN 

zhpos.txt

别/AD 让/VV 别人/NN 告诉/VV 你/PN 你/PN 成/VV 不/AD 了/VV 才/AD ,/PU 即使/CS 是/VC 我/PN 也/AD 不/AD 行/VV 。/PU 
如果/CS 你/PN 有/VE 梦想/NN 的话/SP ,/PU 就/AD 要/VV 去/VV 捍卫/VV 它/PN 。/PU 
那些/DT 一事无成/CD 的/DEG 人/NN 想/VV 告诉/VV 你/PN 你/PN 也/AD 成/VV 不/AD 了/VV 大器/NN 。/PU 
如果/CS 你/PN 有/VE 理想/NN 的话/SP ,/PU 就/AD 要/VV 去/VV 努力/AD 实现/VV 。/PU 

Word segmentation and POS tagging with THULAC

1. Download the program and models here
http://thulac.thunlp.org/

2. Unpack; I used the Java version

3. Basic usage

java -jar THULAC_lite_java_run.jar [-t2s] [-seg_only] [-deli delimeter] [-user userword.txt] -input input_file -output output_file

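Where:
-t2s                convert the sentence from Traditional to Simplified Chinese
-seg_only           segment only, without POS tagging
-deli delimeter     set the delimiter between a word and its tag; defaults to underscore _
-filter             use a filter to drop some uninformative words, e.g. “可以”
-user userword.txt  set a user dictionary; words from it are tagged uw. One word per line, UTF-8 encoded (not yet available in the Python version)
-model_dir dir      set the directory containing the model files; defaults to models/
-input input_file   read input from a file; defaults to console input
-output output_file write output to a file; defaults to console output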

4. Test input
en.txt

Don't ever let somebody tell you you can't do something, not even me. 
You got a dream, you gotta protect it. 
People can’t do something themselves, they wanna tell you you can’t do it. 
If you want something, go get it. 
Period.

zh.txt

别让别人告诉你你成不了才,即使是我也不行。
如果你有梦想的话,就要去捍卫它。
那些一事无成的人想告诉你你也成不了大器。
如果你有理想的话,就要去努力实现。
就这样。

5. Commands

java -jar THULAC_lite_java_run.jar -input en.txt -output enout.txt
java -jar THULAC_lite_java_run.jar -input zh.txt -output zhout.txt

6. Test output
enout.txt

Don_n '_w t_g ever_nz let_x somebody_x tell_np you_np you_np can_np '_w t_g d_g o_v something_x ,_w not_np even_np me._np 
You_np got_np a_v dream_np ,_w you_np gotta_x protect_x it._x 
People_x can??_n t_g d_g o_v something_x themselves_x ,_w they_x wanna_n tell_np you_np you_np can??_n t_g do_v it._m 
If_v you_np want_x something_x ,_w go_v get_np it._m 
Period._x 

zhout.txt

别_d 让_v 别人_r 告诉_v 你你_r 成_v 不_d 了_v 才_n ,_w 即使_c 是_v 我_r 也_d 不行_a 。_w 
如果_c 你_r 有_v 梦想_n 的_u 话_n ,_w 就要_d 去_v 捍卫_v 它_r 。_w 
那些_r 一事无成_id 的_u 人_n 想_v 告诉_v 你你_r 也_d 成_v 不_d 了_v 大器_n 。_w 
如果_c 你_r 有_v 理想_n 的_u 话_n ,_w 就要_d 去_v 努力_a 实现_v 。_w 
就_d 这样_r 。_w 
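
To get a quick overview of a result file like zhout.txt, the word_tag tokens can be tallied by tag. The sketch below assumes the default underscore delimiter; count_tags is an illustrative name, not a THULAC function.

```python
from collections import Counter


def count_tags(lines, sep="_"):
    """Count how often each POS tag occurs in 'word<sep>tag' lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            # The tag is whatever follows the last delimiter.
            word, found, tag = token.rpartition(sep)
            if found:
                counts[tag] += 1
    return counts


sample = [
    "如果_c 你_r 有_v 梦想_n 的_u 话_n 就要_d 去_v 捍卫_v 它_r 。_w",
    "就_d 这样_r 。_w",
]
for tag, n in sorted(count_tags(sample).items()):
    print(tag, n)
```

For a real file, the lines could be read with open("zhout.txt", encoding="utf-8"). The encoding is an assumption here and should match whatever the tool actually wrote.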

Word Segmentation and POS Tagging with NLTK

1、Installation
1.1、Install Python 3.4
Note: use the 32-bit version
http://www.python.org/downloads/

1.2、Install NumPy
Note two things: not every release ships a Windows installer, and you need an installer that supports Python 3.4
http://sourceforge.net/projects/numpy/files/NumPy/

1.3、Install NLTK
Note: version 3.2 has a bug; do not use it.
http://pypi.python.org/pypi/nltk

2、Download NLTK Data
Method 1:
Run the following in Python:

import nltk
nltk.download()

Method 2:
Open the address below, look up the package links in it, then download and unzip them yourself
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

3、Tokenization
3.1、Set environment variables

set PYTHON_HOME=C:\NeoLanguages\Python34_x86
set PATH=%PYTHON_HOME%;%PATH%
set NLTK_DATA=D:\NLP\NLTK\nltk_data
@python
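
Before starting NLTK it can help to confirm that NLTK_DATA points at a real directory. The following sketch uses only the standard library. It assumes NLTK_DATA may hold several directories joined by the platform's path separator; nltk_data_dirs is a made-up helper, not an NLTK function.

```python
import os


def nltk_data_dirs(environ=os.environ):
    """Return the directories listed in NLTK_DATA (empty list if unset)."""
    raw = environ.get("NLTK_DATA", "")
    # Multiple directories are separated by os.pathsep (";" on Windows).
    return [p for p in raw.split(os.pathsep) if p]


for path in nltk_data_dirs():
    status = "exists" if os.path.isdir(path) else "missing"
    print(path, status)
```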

3.2、The .py file

#!/usr/bin/python

import nltk

# Test sentence
sentence = "Don’t ever let somebody tell you you can’t do something, not even me. \
You got a dream, you gotta protect it. People can’t do something themselves, \
they wanna tell you you can’t do it. If you want something, go get it. Period."

# Tokenization
tokens = nltk.word_tokenize(sentence)

# POS tagging
tagged = nltk.pos_tag(tokens)

# Named-entity chunking
entities = nltk.chunk.ne_chunk(tagged)

3.3、Run it statement by statement

D:\MyProjects\NLP\NLTK>python
Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> sentence = "Don’t ever let somebody tell you you can’t do something, not even me. \
... You got a dream, you gotta protect it. People can’t do something themselves, \
... they wanna tell you you can’t do it. If you want something, go get it. Period."
>>> tokens = nltk.word_tokenize(sentence)
>>> tagged = nltk.pos_tag(tokens)
>>> entities = nltk.chunk.ne_chunk(tagged)

>>> tokens
['Don’t', 'ever', 'let', 'somebody', 'tell', 'you', 'you', 'can’t', 'do',
'something', ',', 'not', 'even', 'me', '.', 'You', 'got', 'a', 'dream', ',',
'you', 'got', 'ta', 'protect', 'it', '.', 'People', 'can’t', 'do', 'something',
'themselves', ',', 'they', 'wan', 'na', 'tell', 'you', 'you', 'can’t', 'do',
'it', '.', 'If', 'you', 'want', 'something', ',', 'go', 'get', 'it', '.',
'Period', '.']

>>> tagged
[('Don’t', 'NNP'), ('ever', 'RB'), ('let', 'VB'), ('somebody', 'NN'),
('tell', 'VB'), ('you', 'PRP'), ('you', 'PRP'), ('can’t', 'VBP'), ('do', 'VB'),
('something', 'NN'), (',', ','), ('not', 'RB'), ('even', 'RB'), ('me', 'PRP'),
('.', '.'), ('You', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('dream', 'NN'),
(',', ','), ('you', 'PRP'), ('got', 'VBD'), ('ta', 'JJ'), ('protect', 'NN'),
('it', 'PRP'), ('.', '.'), ('People', 'NNS'), ('can’t', 'VBP'), ('do', 'VBP'),
('something', 'NN'), ('themselves', 'PRP'), (',', ','), ('they', 'PRP'),
('wan', 'VBP'), ('na', 'TO'), ('tell', 'VB'), ('you', 'PRP'), ('you', 'PRP'),
('can’t', 'VBP'), ('do', 'VB'), ('it', 'PRP'), ('.', '.'), ('If', 'IN'),
('you', 'PRP'), ('want', 'VBP'), ('something', 'NN'), (',', ','), ('go', 'VBP'),
('get', 'VB'), ('it', 'PRP'), ('.', '.'), ('Period', 'NNP'), ('.', '.')]

>>> entities
Tree('S', [('Don’t', 'NNP'), ('ever', 'RB'), ('let', 'VB'), ('somebody', 'NN'),
('tell', 'VB'), ('you', 'PRP'), ('you', 'PRP'), ('can’t', 'VBP'), ('do', 'VB'),
('something', 'NN'), (',', ','), ('not', 'RB'), ('even', 'RB'), ('me', 'PRP'),
('.', '.'), ('You', 'PRP'), ('got', 'VBD'), ('a', 'DT'), ('dream', 'NN'),
(',', ','), ('you', 'PRP'), ('got', 'VBD'), ('ta', 'JJ'), ('protect', 'NN'),
('it', 'PRP'), ('.', '.'), ('People', 'NNS'), ('can’t', 'VBP'), ('do', 'VBP'),
('something', 'NN'), ('themselves', 'PRP'), (',', ','), ('they', 'PRP'),
('wan', 'VBP'), ('na', 'TO'), ('tell', 'VB'), ('you', 'PRP'), ('you', 'PRP'),
('can’t', 'VBP'), ('do', 'VB'), ('it', 'PRP'), ('.', '.'), ('If', 'IN'),
('you', 'PRP'), ('want', 'VBP'), ('something', 'NN'), (',', ','), ('go', 'VBP'),
('get', 'VB'), ('it', 'PRP'), ('.', '.'), Tree('PERSON', [('Period', 'NNP')]),
('.', '.')])
>>>
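
In the transcript above, 'Don’t' and 'can’t' come out as single tokens, and 'Don’t' is even tagged NNP. A likely cause is that the test sentence uses the typographic apostrophe ’ (U+2019) instead of the ASCII '. As a minimal sketch, the text could be normalized before it is passed to nltk.word_tokenize. The helper name normalize_apostrophes is made up, and how the tokenizer then splits the contractions has not been verified here.

```python
def normalize_apostrophes(text):
    """Replace typographic single quotes with the ASCII apostrophe."""
    return text.replace("\u2019", "'").replace("\u2018", "'")


sentence = "Don\u2019t ever let somebody tell you you can\u2019t do something."
print(normalize_apostrophes(sentence))
```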