What are chunks?
Chunks are made up of words, and the kinds of words a chunk may contain are defined using part-of-speech tags. It is also possible to define a pattern of words that cannot be part of a chunk; such words are known as chinks.
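To make the chunk and chink idea concrete, here is a minimal sketch using NLTK's `RegexpParser`. This is an assumption: the text above does not name a particular chunker, and the grammar and sample sentence are illustrative only. The first rule chunks every tag sequence, and the second rule chinks verbs and prepositions back out.

```python
from nltk.chunk import RegexpParser

# Illustrative grammar: a chunk rule {...} followed by a chink rule }...{
grammar = r"""
NP:
    {<.*>+}        # chunk everything
    }<VBD|IN>+{    # chink sequences of VBD and IN
"""

# Hand-tagged sample sentence (word, part-of-speech tag)
sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

tree = RegexpParser(grammar).parse(sentence)
print(tree)
```

The chinked words ("barked" and "at") end up outside any chunk, which leaves two separate NP chunks.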

What are IOB tags?
IOB is a chunk format. IOB tags are similar to part-of-speech tags, but they mark whether a word is inside (I), outside (O), or at the beginning (B) of a chunk. They are not limited to noun phrases; several different chunk types can be tagged.
Example: this is a fragment from the conll2000 corpus. Each line holds one word, followed by its part-of-speech tag and then its IOB tag:
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
What does this mean?
B-NP: the beginning of a noun phrase.
I-NP: the word is inside the current noun phrase.
O: the token is outside any chunk (for example, sentence-final punctuation).
B-VP and I-VP: the beginning and the inside of a verb phrase.
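The tag meanings above can be sketched in plain Python, without any library. The helper name `iob_to_chunks` is hypothetical, and treating a stray `I-` tag as the start of a new chunk is a design assumption rather than part of the format description above.

```python
def iob_to_chunks(tagged):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    current_type, current_words = None, []
    for word, _pos, iob in tagged:
        prefix, _, chunk_type = iob.partition("-")
        # An I- tag continues the open chunk only if the chunk types match.
        continues = prefix == "I" and chunk_type == current_type
        if not continues:
            # Close whatever chunk was open.
            if current_words:
                chunks.append((current_type, current_words))
            current_type, current_words = None, []
            # B- opens a new chunk; a stray I- is treated like B- (assumption).
            if prefix in ("B", "I"):
                current_type = chunk_type
        if current_type is not None:
            current_words.append(word)
    if current_words:
        chunks.append((current_type, current_words))
    return chunks


fragment = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]
chunks = iob_to_chunks(fragment)
for chunk_type, words in chunks:
    print(chunk_type, " ".join(words))
```

Tokens tagged `O`, such as the final period, do not appear in any chunk.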
Code #1: How it works, grouping words into chunks using IOB tags.
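The original listing is not reproduced here. The following is a minimal sketch of what such code might look like, assuming the fragment above is saved to a hypothetical file named `sample.iob` (written to a temporary directory so the example is self-contained).

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader

# The conll2000 fragment shown earlier, one token per line.
IOB_TEXT = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

with tempfile.TemporaryDirectory() as root:
    # "sample.iob" is a hypothetical file name chosen for this sketch.
    with open(os.path.join(root, "sample.iob"), "w", encoding="utf-8") as f:
        f.write(IOB_TEXT)
    reader = ConllChunkCorpusReader(root, r".*\.iob", ("NP", "VP", "PP"))
    # Materialize the lazy views before the temporary file is removed.
    chunked_words = list(reader.chunked_words())
    iob_words = list(reader.iob_words())

print(chunked_words)
print(iob_words)
```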
Output:
[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), ...]
[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]
Code #2: How it works, chunking a whole sentence with IOB tags.
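The original listing is not reproduced here either. As a rough sketch, and again assuming the fragment is stored in a hypothetical file `sample.iob`, the sentence-level methods could be called like this:

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader

# One sentence in IOB format; a blank line would separate further sentences.
IOB_TEXT = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

with tempfile.TemporaryDirectory() as root:
    # "sample.iob" is a hypothetical file name chosen for this sketch.
    with open(os.path.join(root, "sample.iob"), "w", encoding="utf-8") as f:
        f.write(IOB_TEXT)
    reader = ConllChunkCorpusReader(root, r".*\.iob", ("NP", "VP", "PP"))
    # One Tree per sentence, and one list of (word, pos, iob) per sentence.
    chunked_sents = list(reader.chunked_sents())
    iob_sents = [list(sent) for sent in reader.iob_sents()]

print(chunked_sents)
print(iob_sents)
```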
Output:
[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]), Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]
[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), ('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), ('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), ('.', '.', 'O')]]
Let’s look at the code above:
- The ConllChunkCorpusReader class is used to read the IOB corpus.
- Sentences are separated by a blank line, but there is no paragraph separation, so the paragraph-level (`*_paras`) methods are not available.
- The third argument to ConllChunkCorpusReader is a tuple or list naming the chunk types in the file, such as ('NP', 'VP', 'PP').
- The iob_words() method returns a list of 3-tuples (word, pos, iob); iob_sents() returns one such list per sentence.
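To illustrate the role of the chunk-types argument, here is a small sketch that passes only `('NP',)`. It assumes a hypothetical file `sample.iob` holding the fragment above; with that argument, the reader should keep NP chunks as subtrees and leave VP and PP tokens as plain tagged words.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader
from nltk.tree import Tree

IOB_TEXT = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

with tempfile.TemporaryDirectory() as root:
    # "sample.iob" is a hypothetical file name chosen for this sketch.
    with open(os.path.join(root, "sample.iob"), "w", encoding="utf-8") as f:
        f.write(IOB_TEXT)
    # Only NP is listed, so other chunk types are treated as outside any chunk.
    np_reader = ConllChunkCorpusReader(root, r".*\.iob", ("NP",))
    np_only = list(np_reader.chunked_words())

# Collect the labels of the subtrees that remain.
labels = [item.label() for item in np_only if isinstance(item, Tree)]
print(labels)
print(np_only)
```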
Code #3: Leaves of trees, i.e. tagged tokens.
Output:
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'), ('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'), ('said', 'VBD'), ('.', '.')]
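The listing that produced this output, and the corpus file it read, are not shown. As a sketch of the same idea, `leaves()` can be tried on the Meador fragment instead, assuming it is saved to a hypothetical file `sample.iob`; the printed values therefore differ from the output above.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader

IOB_TEXT = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
. . O
"""

with tempfile.TemporaryDirectory() as root:
    # "sample.iob" is a hypothetical file name chosen for this sketch.
    with open(os.path.join(root, "sample.iob"), "w", encoding="utf-8") as f:
        f.write(IOB_TEXT)
    reader = ConllChunkCorpusReader(root, r".*\.iob", ("NP", "VP", "PP"))
    first_chunk = list(reader.chunked_words())[0]
    first_sent = list(reader.chunked_sents())[0]

# leaves() flattens a tree into its (word, pos) tagged tokens.
chunk_leaves = first_chunk.leaves()
sent_leaves = first_sent.leaves()
print(chunk_leaves)
print(sent_leaves)
```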