Introduction to Apache OpenNLP

1. Overview

Apache OpenNLP is an open-source Natural Language Processing Java library.

It features an API for use cases like Named Entity Recognition, Sentence Detection, POS Tagging, and Tokenization.

In this tutorial, we'll have a look at how to use this API for different use cases.

2. Maven Setup

First, we need to add the main dependency to our pom.xml:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.4</version>
</dependency>

The latest stable version can be found on Maven Central.

Some use cases require trained models. You can download the pre-defined models here and detailed information about these models here.

3. Sentence Detection

Let's start by understanding what a sentence is.

Sentence detection is about identifying the start and the end of a sentence, which usually depends on the language at hand. This is also called "Sentence Boundary Disambiguation" (SBD).

In some cases, sentence detection is quite challenging because of the ambiguous nature of the period character. A period usually denotes the end of a sentence but can also appear in an email address, an abbreviation, a decimal, and lots of other places.

As for most NLP tasks, for sentence detection, we need a trained model as input, which we expect to reside in the /resources folder.

To implement sentence detection, we load the model and pass it into an instance of SentenceDetectorME. Then, we simply pass a text into the sentDetect() method to split it at the sentence boundaries:

@Test
public void givenEnglishModel_whenDetect_thenSentencesAreDetected()
  throws Exception {

    String paragraph = "This is a statement. This is another statement."
      + "Now is an abstract word for time, "
      + "that is always flying. And my email address is [email protected]";

    InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
    SentenceModel model = new SentenceModel(is);

    SentenceDetectorME sdetector = new SentenceDetectorME(model);

    String[] sentences = sdetector.sentDetect(paragraph);
    assertThat(sentences).contains(
      "This is a statement.",
      "This is another statement.",
      "Now is an abstract word for time, that is always flying.",
      "And my email address is [email protected]");
}

Note: the suffix "ME" is used in many class names in Apache OpenNLP and represents an algorithm that is based on "Maximum Entropy".

4. Tokenization

Now that we can divide a corpus of text into sentences, we can start analyzing a sentence in more detail.

The goal of tokenization is to divide a sentence into smaller parts called tokens. Usually, these tokens are words, numbers, or punctuation marks.

There are three types of tokenizers available in OpenNLP.

4.1. Using TokenizerME

In this case, we first need to load the model. We can download the model file from here, put it in the /resources folder, and load it from there.

Next, we'll create an instance of TokenizerME using the loaded model, and use the tokenize() method to perform tokenization on any String:

@Test
public void givenEnglishModel_whenTokenize_thenTokensAreDetected()
  throws Exception {

    InputStream inputStream = getClass()
      .getResourceAsStream("/models/en-token.bin");
    TokenizerModel model = new TokenizerModel(inputStream);

    TokenizerME tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("Baeldung is a Spring Resource.");

    assertThat(tokens).contains(
      "Baeldung", "is", "a", "Spring", "Resource", ".");
}

As we can see, the tokenizer has identified all the words and the period character as separate tokens. This tokenizer can be used with a custom trained model as well.

4.2. WhitespaceTokenizer

As the name suggests, this tokenizer simply splits the sentence into tokens using whitespace characters as the delimiter:

@Test
public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected()
  throws Exception {

    WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("Baeldung is a Spring Resource.");

    assertThat(tokens)
      .contains("Baeldung", "is", "a", "Spring", "Resource.");
}

We can see that the sentence has been split by white spaces, and hence we get "Resource." (with the period character at the end) as a single token instead of two different tokens for the word "Resource" and the period character.

4.3. SimpleTokenizer

This tokenizer is a little more sophisticated than WhitespaceTokenizer and splits the sentence into words, numbers, and punctuation marks. It's the default behavior and doesn't require any model:

@Test
public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected()
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer
      .tokenize("Baeldung is a Spring Resource.");

    assertThat(tokens)
      .contains("Baeldung", "is", "a", "Spring", "Resource", ".");
}

5. Named Entity Recognition

Now that we understand tokenization, let's take a look at a first use case that is based on successful tokenization: named entity recognition (NER).

The goal of NER is to find named entities like people, locations, organizations, and other named things in a given text.

OpenNLP uses pre-defined models for person names, date and time, locations, and organizations. We need to load the model using TokenNameFinderModel and pass it into an instance of NameFinderME. Then we can use the find() method to find named entities in a given text:

@Test
public void givenEnglishPersonModel_whenNER_thenPersonsAreDetected()
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer
      .tokenize("John is 26 years old. His best friend's "
        + "name is Leonard. He has a sister named Penny.");

    InputStream inputStreamNameFinder = getClass()
      .getResourceAsStream("/models/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(
      inputStreamNameFinder);
    NameFinderME nameFinderME = new NameFinderME(model);
    List<Span> spans = Arrays.asList(nameFinderME.find(tokens));

    assertThat(spans.toString())
      .isEqualTo("[[0..1) person, [13..14) person, [20..21) person]");
}

As we can see in the assertion, the result is a list of Span objects containing the start and end indices of the tokens that compose named entities in the text.
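If we want the matched text rather than the indices, the Span class offers a helper for that. Here's a minimal sketch, reusing the tokens and nameFinderME variables from the test above:

// Span.spansToStrings() maps each span back to the tokens it covers
Span[] nameSpans = nameFinderME.find(tokens);
String[] names = Span.spansToStrings(nameSpans, tokens);
// names should now hold "John", "Leonard" and "Penny"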

6. Part-of-Speech Tagging

Another use case that needs a list of tokens as input is part-of-speech tagging.

A part of speech (POS) identifies the type of a word. OpenNLP uses the following tags for the different parts of speech:

  • NN – noun, singular or mass
  • DT – determiner
  • VB – verb, base form
  • VBD – verb, past tense
  • VBZ – verb, third person singular present
  • VBN – verb, past participle
  • IN – preposition or subordinating conjunction
  • NNP – proper noun, singular
  • TO – the word “to”
  • JJ – adjective

These are the same tags as defined in the Penn Treebank. For a complete list, please refer to this list.

Similar to the NER example, we load the appropriate model and then use POSTaggerME and its method tag() on a set of tokens to tag the sentence:

@Test
public void givenPOSModel_whenPOSTagging_thenPOSAreDetected()
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);

    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);

    assertThat(tags).contains("NNP", "VBZ", "DT", "NN", "VBN", "NNP", ".");
}

The tag() method maps the tokens into a list of POS tags. The result in the example is:

  1. “John” – NNP (proper noun)
  2. “has” – VBZ (verb)
  3. “a” – DT (determiner)
  4. “sister” – NN (noun)
  5. “named” – VBN (verb, past participle)
  6. “Penny” – NNP (proper noun)
  7. “.” – period

7. Lemmatization

Now that we have the part-of-speech information of the tokens in a sentence, we can analyze the text even further.

Lemmatization is the process of mapping a word form that can have a tense, gender, mood or other information to the base form of the word – also called its “lemma”.

A lemmatizer takes a token and its part-of-speech tag as input and returns the word's lemma. Hence, before Lemmatization, the sentence should be passed through a tokenizer and POS tagger.

Apache OpenNLP provides two types of lemmatization:

  • Statistical – needs a lemmatizer model built using training data for finding the lemma of a given word
  • Dictionary-based – requires a dictionary which contains all valid combinations of a word, POS tags, and the corresponding lemma

For statistical lemmatization, we need to train a model, whereas for the dictionary lemmatization we just need a dictionary file like this one.
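For completeness, here's a minimal sketch of the statistical variant using LemmatizerME. It assumes a trained model file at /models/en-lemmatizer.bin (a hypothetical location; OpenNLP doesn't ship this model) and reuses tokens and tags produced by a tokenizer and POS tagger as before:

// assumed location of a trained statistical lemmatizer model
InputStream modelIn = getClass()
  .getResourceAsStream("/models/en-lemmatizer.bin");
LemmatizerModel lemmatizerModel = new LemmatizerModel(modelIn);
LemmatizerME statisticalLemmatizer = new LemmatizerME(lemmatizerModel);
// same contract as the dictionary variant: tokens plus their POS tags
String[] lemmas = statisticalLemmatizer.lemmatize(tokens, tags);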

Let's look at a code example using a dictionary file:

@Test
public void givenEnglishDictionary_whenLemmatize_thenLemmasAreDetected()
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);

    InputStream dictLemmatizer = getClass()
      .getResourceAsStream("/models/en-lemmatizer.dict");
    DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(
      dictLemmatizer);
    String[] lemmas = lemmatizer.lemmatize(tokens, tags);

    assertThat(lemmas)
      .contains("O", "have", "a", "sister", "name", "O", "O");
}

As we can see, we get the lemma for every token. “O” indicates that the lemma could not be determined as the word is a proper noun. So, we don't have a lemma for “John” and “Penny”.

But we have identified the lemmas for the other words of the sentence:

  • has – have
  • a – a
  • sister – sister
  • named – name

8. Chunking

Part-of-speech information is also essential in chunking – dividing sentences into grammatically meaningful word groups like noun groups or verb groups.

Similar to before, we tokenize a sentence and use part-of-speech tagging on the tokens before calling the chunk() method:

@Test
public void givenChunkerModel_whenChunk_thenChunksAreDetected()
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(
      "He reckons the current account deficit will narrow to only 8 billion.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);

    InputStream inputStreamChunker = getClass()
      .getResourceAsStream("/models/en-chunker.bin");
    ChunkerModel chunkerModel = new ChunkerModel(inputStreamChunker);
    ChunkerME chunker = new ChunkerME(chunkerModel);
    String[] chunks = chunker.chunk(tokens, tags);

    assertThat(chunks).contains(
      "B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP",
      "B-VP", "I-VP", "B-PP", "B-NP", "I-NP", "I-NP", "O");
}

As we can see, we get an output for each token from the chunker. “B” represents the start of a chunk, “I” represents the continuation of the chunk and “O” represents no chunk.

Parsing the output from our example, we get 6 chunks:

  1. “He” – noun phrase
  2. “reckons” – verb phrase
  3. “the current account deficit” – noun phrase
  4. “will narrow” – verb phrase
  5. “to” – prepositional phrase
  6. “only 8 billion” – noun phrase
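Instead of decoding the B/I/O tags by hand, we can also ask the chunker for whole phrases. A minimal sketch, reusing the tokens, tags, and chunker instances from the test above:

// chunkAsSpans() groups the B/I tags into one Span object per chunk
Span[] chunkSpans = chunker.chunkAsSpans(tokens, tags);
// Span.spansToStrings() resolves each span to the words it covers
String[] phrases = Span.spansToStrings(chunkSpans, tokens);
// e.g. "the current account deficit"; chunkSpans[i].getType() gives "NP", "VP", ...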

9. Language Detection

In addition to the use cases already discussed, OpenNLP also provides a language detection API that makes it possible to identify the language of a certain text.

For language detection, we need a training data file. Such a file contains lines with sentences in a certain language. Each line is tagged with the correct language to provide input to the machine learning algorithms.
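For illustration, such lines might look like this, assuming the tab-separated "language code, then sentence" layout that LanguageDetectorSampleStream expects (both lines are made-up samples):

pob	estava em uma marcenaria na Rua Bruno
fra	ceci est une phrase en français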

A sample training data file for language detection can be downloaded here.

We can load the training data file into a LanguageDetectorSampleStream, define some training data parameters, create a model and then use the model to detect the language of a text:

@Test
public void givenLanguageDictionary_whenLanguageDetect_thenLanguageIsDetected()
  throws FileNotFoundException, IOException {

    InputStreamFactory dataIn = new MarkableFileInputStreamFactory(
      new File("src/main/resources/models/DoccatSample.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    LanguageDetectorSampleStream sampleStream =
      new LanguageDetectorSampleStream(lineStream);

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 100);
    params.put(TrainingParameters.CUTOFF_PARAM, 5);
    params.put("DataIndexer", "TwoPass");
    params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");

    LanguageDetectorModel model = LanguageDetectorME
      .train(sampleStream, params, new LanguageDetectorFactory());

    LanguageDetector ld = new LanguageDetectorME(model);
    Language[] languages = ld
      .predictLanguages("estava em uma marcenaria na Rua Bruno");

    assertThat(Arrays.asList(languages))
      .extracting("lang", "confidence")
      .contains(
        tuple("pob", 0.9999999950605625),
        tuple("ita", 4.939427661577956E-9),
        tuple("spa", 9.665954064665144E-15),
        tuple("fra", 8.250349924885834E-25));
}

The result is a list of the most probable languages along with a confidence score.
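If we're only interested in the best match, the API also offers a convenience method. A short sketch, reusing the ld instance from the test above:

// predictLanguage() returns only the most probable language
Language best = ld.predictLanguage("estava em uma marcenaria na Rua Bruno");
// best.getLang() should be "pob" here, with best.getConfidence() close to 1.0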

And, with rich models, we can achieve very high accuracy with this type of detection.

10. Conclusion

We explored a lot here: the interesting capabilities of OpenNLP. We focused on some interesting features for performing NLP tasks like lemmatization, POS tagging, tokenization, sentence detection, language detection, and more.

As always, the complete implementation of all of the above can be found over on GitHub.