Metadata-Version: 2.1
Name: connlp
Version: 0.0.11
Summary: A bunch of Python codes to analyze text data in the construction industry, mainly reconstituting pre-existing Python libraries for Natural Language Processing (NLP)
Home-page: https://github.com/blank54/connlp.git
Author: Seonghyeon Boris Moon
Author-email: boris.moon514@gmail.com
License: UNKNOWN
Description: # connlp
        A bunch of Python codes to analyze text data in the construction industry.  
        They mainly reconstitute pre-existing Python libraries for Natural Language Processing (NLP).
        
        ## _Project Information_
        - Supported by C!LAB (@Seoul Nat'l Univ.)
        
        ## _Contributors_
        - Seonghyeon Boris Moon (blank54@snu.ac.kr, https://github.com/blank54/)
        - Gitaek Lee (lgt0427@snu.ac.kr)
        - Taeyeon Chang (jgwoon1838@snu.ac.kr, _a.k.a. Kowoon Chang_)
        - Sehwan Chung (hwani751@snu.ac.kr)
        
        # Initialize
        
        ## _Setup_
        
        ```shell
        pip install connlp
        ```
        
        ## _Test_
        
        If the code below runs with no error, _**connlp**_ is installed successfully.
        
        ```python
        from connlp.test import hello
        hello()
        
        # 'Helloworld'
        ```
        
        # Preprocess
        
        The preprocessing module supports English and Korean.  
        NOTE: There is no plan to support other languages (as of 2021.04.02).
        
        ## _Normalizer_
        
        _**Normalizer**_ normalizes the input text by eliminating trash characters and retaining numbers, alphabets, and punctuation marks.
        
        ```python
        from connlp.preprocess import Normalizer
        normalizer = Normalizer()
        
        normalizer.normalize(text='I am a boy!')
        
        # 'i am a boy'
        ```
        
        ## _EnglishTokenizer_
        
        _**EnglishTokenizer**_ tokenizes the input English text based on word spacing.  
        N-gram-based tokenization is in preparation.
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        tokenizer = EnglishTokenizer()
        
        tokenizer.tokenize(text='I am a boy!')
        
        # ['I', 'am', 'a', 'boy!']
        ```
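
        In practice, the two preprocessing steps can be chained: normalize first, then tokenize the cleaned text. A minimal sketch using only the methods shown above:

        ```python
        from connlp.preprocess import Normalizer, EnglishTokenizer
        normalizer = Normalizer()
        tokenizer = EnglishTokenizer()

        # Normalize first, then tokenize the cleaned text.
        tokenizer.tokenize(text=normalizer.normalize(text='I am a boy!'))

        # ['i', 'am', 'a', 'boy']
        ```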
        
        # Embedding
        
        ## _Vectorizer_
        
        _**Vectorizer**_ includes several text embedding methods that have been commonly used for decades.  
        
        ### _tfidf_
        
        TF-IDF is the most commonly used technique for word embedding.  
        The TF-IDF model counts the term frequency (TF) and inverse document frequency (IDF) from the given documents.  
        The results include the following:  
        - TF-IDF Vectorizer (an instance of 'sklearn.feature_extraction.text.TfidfVectorizer')
        - TF-IDF Matrix
        - TF-IDF Vocabulary
        
        ```python
        from connlp.embedding import Vectorizer
        vectorizer = Vectorizer()
        
        docs = ['I am a boy', 'He is a boy', 'She is a girl']
        tfidf_vectorizer, tfidf_matrix, tfidf_vocab = vectorizer.tfidf(docs=docs)
        type(tfidf_vectorizer)
        
        # <class 'sklearn.feature_extraction.text.TfidfVectorizer'>
        ```
        
        The user can get a document vector by indexing the _**tfidf_matrix**_.
        
        ```python
        tfidf_matrix[0]
        
        # (0, 2)    0.444514311537431
        # (0, 0)    0.34520501686496574
        # (0, 1)    0.5844829010200651
        # (0, 5)    0.5844829010200651
        ```
        
        The _**tfidf_vocab**_ returns an index for every token.
        
        ```python
        print(tfidf_vocab)
        
        # {'i': 5, 'am': 1, 'a': 0, 'boy': 2, 'he': 4, 'is': 6, 'she': 7, 'girl': 3}
        ```
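
        Because the matrix is a sparse matrix produced by scikit-learn and the vocabulary maps each token to a column index, a single term weight can be looked up by combining the two; a minimal sketch:

        ```python
        # Look up the TF-IDF weight of 'boy' in the first document.
        # The sparse matrix supports (row, column) indexing.
        tfidf_matrix[0, tfidf_vocab['boy']]

        # 0.444514311537431
        ```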
        
        ### _word2vec_
        
        Word2Vec is a distributed representation language model for word embedding.  
        The Word2Vec model is trained on tokenized docs and returns word vectors.  
        The result is an instance of 'gensim.models.word2vec.Word2Vec'.
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        from connlp.embedding import Vectorizer
        tokenizer = EnglishTokenizer()
        vectorizer = Vectorizer()
        
        docs = ['I am a boy', 'He is a boy', 'She is a girl']
        tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
        w2v_model = vectorizer.word2vec(docs=tokenized_docs)
        type(w2v_model)
        
        # <class 'gensim.models.word2vec.Word2Vec'>
        ```
        
        The user can get a word vector via the _**.wv**_ attribute.
        
        ```python
        w2v_model.wv['boy']
        
        # [-2.0130998e-03 -3.5652996e-03  2.7793974e-03 ...]
        ```
        
        The Word2Vec model provides the _topn_ most similar words and their similarity scores.
        
        ```python
        w2v_model.wv.most_similar('boy', topn=3)
        
        # [('He', 0.05311150848865509), ('a', 0.04154288396239281), ('She', -0.029122961685061455)]
        ```
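
        The pairwise cosine similarity between two words is also available via gensim's keyed vectors:

        ```python
        # Cosine similarity between two word vectors (a float in [-1, 1]).
        w2v_model.wv.similarity('boy', 'girl')
        ```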
        
        ### _word2vec (update)_
        
        The user can update the Word2Vec model with new data.
        
        ```python
        new_docs = ['Tom is a man', 'Sally is not a boy']
        tokenized_new_docs = [tokenizer.tokenize(text=doc) for doc in new_docs]
        w2v_model_updated = vectorizer.word2vec_update(w2v_model=w2v_model, new_docs=tokenized_new_docs)
        
        w2v_model_updated.wv['man']
        
        # [4.9649975e-03  3.8002312e-04 -1.5773597e-03 ...]
        ```
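
        After the update, both the original and the newly added words are in the vocabulary; membership can be checked directly on the keyed vectors:

        ```python
        # Both the original and the new vocabulary are available.
        print('boy' in w2v_model_updated.wv, 'man' in w2v_model_updated.wv)

        # True True
        ```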
        
        ### _doc2vec_
        
        Doc2Vec is a distributed representation language model for embedding longer text (e.g., sentence, paragraph, document).  
        The Doc2Vec model is trained on tokenized docs with tags and returns document vectors.  
        The result is an instance of 'gensim.models.doc2vec.Doc2Vec'.
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        from connlp.embedding import Vectorizer
        tokenizer = EnglishTokenizer()
        vectorizer = Vectorizer()
        
        docs = ['I am a boy', 'He is a boy', 'She is a girl']
        tagged_docs = [(idx, tokenizer.tokenize(text=doc)) for idx, doc in enumerate(docs)]
        d2v_model = vectorizer.doc2vec(tagged_docs=tagged_docs)
        type(d2v_model)
        
        # <class 'gensim.models.doc2vec.Doc2Vec'>
        ```
        
        The Doc2Vec model can infer a vector for an unseen document.
        
        ```python
        test_doc = ['My', 'name', 'is', 'Peter']
        d2v_model.infer_vector(doc_words=test_doc)
        
        # [4.8494316e-03 -4.3647490e-03  1.1437446e-03 ...]
        ```
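
        The document vectors learned during training are stored under their tags. Note that the attribute name depends on the installed gensim version; a sketch assuming the integer indices above serve as tags:

        ```python
        # Trained document vectors are stored under their tags.
        # gensim 3.x exposes them as .docvecs; gensim 4.x renamed this to .dv.
        d2v_model.docvecs[0]
        ```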
        
        # Analysis
        
        ## _TopicModel_
        
        _**TopicModel**_ is a class for topic modeling based on the gensim LDA model.  
        It provides a simple way to train an LDA model and assign topics to docs.  
        
        _**TopicModel**_ requires two arguments:  
        - a dict of docs whose keys are the tags
        - the number of topics for modeling
        
        ```python
        from connlp.analysis import TopicModel
        
        num_topics = 2
        docs = {'doc1': ['I', 'am', 'a', 'boy'],
                'doc2': ['He', 'is', 'a', 'boy'],
                'doc3': ['Cat', 'on', 'the', 'table'],
                'doc4': ['Mike', 'is', 'a', 'boy'],
                'doc5': ['Dog', 'on', 'the', 'table'],
                }
        
        lda_model = TopicModel(docs=docs, num_topics=num_topics)
        ```
        
        ### _learn_
        
        The user can train the model with the _learn_ method.  
        Unless parameters are provided, the model is trained with the default parameters.  
        
        After _learn_, _**TopicModel**_ provides a _model_ attribute, which is an instance of 'gensim.models.ldamodel.LdaModel'.
        
        
        ```python
        parameters = {
            'iterations': 100,
            'alpha': 0.7,
            'eta': 0.05,
        }
        lda_model.learn(parameters=parameters)
        type(lda_model.model)
        
        # <class 'gensim.models.ldamodel.LdaModel'>
        ```
        
        ### _coherence_
        
        _**TopicModel**_ provides a coherence value for model evaluation.  
        The coherence value is automatically calculated right after model training.
        
        ```python
        print(lda_model.coherence)
        
        # 0.3607990279229385
        ```
        
        ### _assign_
        
        The user can easily assign the most appropriate topic to each doc using the _assign_ method.  
        After _assign_, the _**TopicModel**_ provides _tag2topic_ and _topic2tag_ attributes for convenience.
        
        ```python
        lda_model.assign()
        
        print(lda_model.tag2topic)
        print(lda_model.topic2tag)
        
        # defaultdict(<class 'int'>, {'doc1': 1, 'doc2': 1, 'doc3': 0, 'doc4': 1, 'doc5': 0})
        # defaultdict(<class 'list'>, {1: ['doc1', 'doc2', 'doc4'], 0: ['doc3', 'doc5']})
        ```
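
        The representative words of each topic can also be inspected through the underlying gensim model, e.g. with _show_topic_:

        ```python
        # Print the top-3 words of each topic with their probabilities.
        for topic_id in range(num_topics):
            print(topic_id, lda_model.model.show_topic(topic_id, topn=3))
        ```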
        
        ## _Named Entity Recognition_
        
        ### _Labels_
        
        _**NER_Model**_ is a class to conduct named entity recognition using Bidirectional Long Short-Term Memory (Bi-LSTM) and Conditional Random Field (CRF).  
        
        At the beginning, appropriate labels are required.  
        The labels should be numbered starting from 0.
        
        ```python
        from connlp.analysis import NER_Labels
        
        label_dict = {'NON': 0,     #None
                      'PER': 1,     #PERSON
                      'FOD': 2,}    #FOOD
        
        ner_labels = NER_Labels(label_dict=label_dict)
        ```
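
        For decoding predicted label indices back to their names later on, an inverse mapping of _label_dict_ can be handy; a small sketch:

        ```python
        # Inverse mapping from label index to label name.
        index2label = {idx: name for name, idx in label_dict.items()}
        print(index2label)

        # {0: 'NON', 1: 'PER', 2: 'FOD'}
        ```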
        
        ### _Corpus_
        
        Next, the user should prepare data consisting of sentences and labels, where each sentence and its labels are matched by a shared tag.  
        The tokenized sentences and labels are then combined via _**NER_LabeledSentence**_.  
        With the data, the labels, and a proper value of _max_sent_len_ (i.e., the maximum sentence length for analysis), the _**NER_Corpus**_ can be developed.  
        Once the corpus is developed, every sentence and label sequence is padded to the length of _max_sent_len_.  
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        from connlp.analysis import NER_LabeledSentence, NER_Corpus
        tokenizer = EnglishTokenizer()
        
        data_sents = {'sent1': 'Sam likes pizza',
                      'sent2': 'Erik eats pizza',
                      'sent3': 'Erik and Sam are drinking soda',
                      'sent4': 'Flora cooks chicken',
                      'sent5': 'Sam ordered a chicken',
                      'sent6': 'Flora likes chicken sandwich',
                      'sent7': 'Erik likes to drink soda'}
        data_labels = {'sent1': [1, 0, 2],
                       'sent2': [1, 0, 2],
                       'sent3': [1, 0, 1, 0, 0, 2],
                       'sent4': [1, 0, 2],
                       'sent5': [1, 0, 0, 2],
                       'sent6': [1, 0, 2, 2],
                       'sent7': [1, 0, 0, 0, 2]}
        
        docs = []
        for tag, sent in data_sents.items():
            words = [str(w) for w in tokenizer.tokenize(text=sent)]
            labels = data_labels[tag]
            docs.append(NER_LabeledSentence(tag=tag, words=words, labels=labels))
        
        max_sent_len = 10
        ner_corpus = NER_Corpus(docs=docs, ner_labels=ner_labels, max_sent_len=max_sent_len)
        type(ner_corpus)
        
        # <class 'connlp.analysis.NER_Corpus'>
        ```
        
        ### _Word Embedding_
        
        Every word in the _**NER_Corpus**_ should be embedded into a numeric vector space.  
        The user can conduct the embedding with Word2Vec, which is provided in the _**Vectorizer**_ of _**connlp**_.  
        Note that the embedding process of the _**NER_Corpus**_ only requires a dictionary of word vectors and the feature size.  
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        from connlp.embedding import Vectorizer
        tokenizer = EnglishTokenizer()
        vectorizer = Vectorizer()
        
        tokenized_sents = [tokenizer.tokenize(sent) for sent in data_sents.values()]
        w2v_model = vectorizer.word2vec(docs=tokenized_sents)
        
        word2vector = vectorizer.get_word_vectors(w2v_model)
        feature_size = w2v_model.vector_size
        ner_corpus.word_embedding(word2vector=word2vector, feature_size=feature_size)
        print(ner_corpus.X_embedded)
        
        # [[[-2.40120804e-03  1.74632657e-03  ...]
        #   [-3.57543468e-03  2.86567654e-03  ...]
        #   ...
        #   [ 0.00000000e+00  0.00000000e+00  ...]] ...]
        ```
        
        ### _Model Initialization_
        
        The parameters for the Bi-LSTM and for model training should be provided; they can be composed into a single dictionary.  
        The user should initialize the _**NER_Model**_ with the _**NER_Corpus**_ and the parameters.
        
        ```python
        from connlp.analysis import NER_Model
        
        parameters = {
            # Parameters for Bi-LSTM.
            'lstm_units': 512,
            'lstm_return_sequences': True,
            'lstm_recurrent_dropout': 0.2,
            'dense_units': 100,
            'dense_activation': 'relu',
        
            # Parameters for model training.
            'test_size': 0.3,
            'batch_size': 1,
            'epochs': 100,
            'validation_split': 0.1,
        }
        
        ner_model = NER_Model()
        ner_model.initialize(ner_corpus=ner_corpus, parameters=parameters)
        type(ner_model)
        
        # <class 'connlp.analysis.NER_Model'>
        ```
        
        ### _Model Training_
        
        The user can train the _**NER_Model**_ with customized parameters.  
        The model automatically gets the dataset from the _**NER_Corpus**_.  
        
        ```python
        ner_model.train(parameters=parameters)
        
        # Train on 3 samples, validate on 1 samples
        # Epoch 1/100
        # 3/3 [==============================] - 3s 1s/step - loss: 1.4545 - crf_viterbi_accuracy: 0.3000 - val_loss: 1.0767 - val_crf_viterbi_accuracy: 0.8000
        # Epoch 2/100
        # 3/3 [==============================] - 0s 74ms/step - loss: 0.8602 - crf_viterbi_accuracy: 0.7000 - val_loss: 0.5287 - val_crf_viterbi_accuracy: 0.8000
        # ...
        ```
        
        ### _Model Evaluation_
        
        The model performance can be evaluated in terms of a confusion matrix and F1 scores.
        
        ```python
        ner_model.evaluate()
        
        # |--------------------------------------------------
        # |Confusion Matrix:
        # [[ 3  0  3  6]
        #  [ 1  3  0  4]
        #  [ 0  0  2  2]
        #  [ 4  3  5 12]]
        # |--------------------------------------------------
        # |F1 Score: 0.757
        # |--------------------------------------------------
        # |    [NON]: 0.600
        # |    [PER]: 0.857
        # |    [FOD]: 0.571
        ```
        
        ### _Save_
        
        The user can save the _**NER_Model**_.  
        The _save_ method saves the model itself ("\<FileName\>.pk") and the dataset ("\<FileName\>-dataset.pk") used in model development.  
        Note that the directory should exist before saving the model.  
        
        ```python
        from connlp.util import makedir
        
        fpath_model = 'test/ner/model.pk'
        makedir(fpath=fpath_model)
        ner_model.save(fpath_model=fpath_model)
        ```
        
        ### _Load_
        
        If the user wants to load an already trained model, just instantiate _**NER_Model**_ and call _load_.  
        
        ```python
        fpath_model = 'test/ner/model.pk'
        ner_model = NER_Model()
        ner_model.load(fpath_model=fpath_model, ner_corpus=ner_corpus, parameters=parameters)
        ```
        
        ### _Prediction_
        
        _**NER_Model**_ can conduct a new NER task on the given sentence.  
        The result is a class of _**NER_Result**_.  
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        tokenizer = EnglishTokenizer()
        
        new_sent = 'Tom eats apple'
        tokenized_sent = tokenizer.tokenize(new_sent)
        ner_result = ner_model.predict(sent=tokenized_sent)
        print(ner_result)
        
        # Tom/PER eats/NON apple/FOD
        ```
        
        # Visualization
        
        ## _Visualizer_
        
        _**Visualizer**_ includes several simple tools for text visualization.
        
        ### _network_
        
        The _**network**_ method provides a word network for tokenized docs.
        
        ```python
        from connlp.preprocess import EnglishTokenizer
        from connlp.visualize import Visualizer
        tokenizer = EnglishTokenizer()
        visualizer = Visualizer()
        
        docs = ['I am a boy', 'She is a girl']
        tokenized_docs = [tokenizer.tokenize(text=doc) for doc in docs]
        visualizer.network(docs=tokenized_docs, show=True)
        ```
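
        For reference, the idea behind such a word network is a co-occurrence graph: words appearing in the same document are connected. A minimal sketch of the concept with _networkx_ (an illustration, not connlp's internal implementation):

        ```python
        from itertools import combinations
        import networkx as nx

        # Connect every pair of words that co-occur within a document.
        graph = nx.Graph()
        for tokens in tokenized_docs:
            for w1, w2 in combinations(set(tokens), 2):
                graph.add_edge(w1, w2)

        print(graph.number_of_nodes(), graph.number_of_edges())
        ```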
        
        # Extracting Text
        
        ## _TextConverter_
        
        _**TextConverter**_ includes several methods that extract raw text from various types of files (e.g., PDF, HWP) and/or convert the files into plain text files (e.g., TXT).
        
        ### _hwp2txt_
        
        The _**hwp2txt**_ method converts an HWP file into a plain text file.  
        Dependencies: the pyhwp package.
        
        Install pyhwp (the pre-release version is required):
        
        ```
        pip install --pre pyhwp
        ```
        
        Example
        
        ```python
        from connlp.text_extract import TextConverter
        converter = TextConverter()
        
        hwp_fpath = '/data/raw/hwp_file.hwp'
        output_fpath = '/data/processed/extracted_text.txt'
        
        converter.hwp2txt(hwp_fpath, output_fpath) # returns 0 if no error occurs
        ```
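
        The method can also be called in a loop to convert a whole folder; a sketch with hypothetical directories:

        ```python
        from pathlib import Path
        from connlp.text_extract import TextConverter

        converter = TextConverter()

        # Hypothetical directories; adjust to your environment.
        src_dir = Path('/data/raw')
        dst_dir = Path('/data/processed')

        for hwp_fpath in src_dir.glob('*.hwp'):
            output_fpath = dst_dir / (hwp_fpath.stem + '.txt')
            converter.hwp2txt(str(hwp_fpath), str(output_fpath))
        ```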
        
        # GPU Utils
        
        ## _GPUMonitor_
        
        _**GPUMonitor**_ is a class to monitor and display the GPU status based on nvidia-smi.  
        Refer to "https://github.com/anderskm/gputil" and "https://data-newbie.tistory.com/561" for usage.
        
        Install the _GPUtil_ module with _pip_.
        
        ```
        pip install GPUtil
        ```
        
        Write your code between the initialization of the _**GPUMonitor**_ and _**monitor.stop()**_.
        
        ```python
        from connlp.util import GPUMonitor
        
        monitor = GPUMonitor(delay=3)
        # >>>Write your code here<<<
        monitor.stop()
        
        # | ID | GPU | MEM |
        # ------------------
        # |  0 |  0% |  0% |
        # |  1 |  1% |  0% |
        # |  2 |  0% | 94% |
        ```
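
        Since the monitor keeps running until it is stopped, wrapping the monitored code in try/finally makes sure the monitor stops even if the code raises an exception; a small sketch:

        ```python
        from connlp.util import GPUMonitor

        monitor = GPUMonitor(delay=3)
        try:
            pass    # >>>Write your code here<<<
        finally:
            monitor.stop()    # always stop the monitoring thread
        ```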
Platform: UNKNOWN
Description-Content-Type: text/markdown
