[python] Python code that splits a passage into sentences

flowedu 2023. 3. 29. 12:09

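I worked on a MacBook and used a Word file.

It started with a question: would it be possible to split a passage in a Word file into individual sentences?

For reference, I generated the code with ChatGPT 4.

In the past I would not have known how to approach this and would have wandered through Google searches without finding a lead, but since ChatGPT appeared, a coding beginner like me has been getting more help than I could have imagined.

The problem I wanted to solve is as follows.

For example, suppose we have a passage like this: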

In the past there was little genetic pressure to stop people from becoming obese. Genetic mutations that drove people to consume fewer calories were much less likely to be passed on, because in an environment where food was scarcer and its hunting or gathering required considerable energy outlay, an individual with that mutation would probably die before they had a chance to reproduce. Mutations that in our environment of abundant food now drive us towards obesity, on the other hand, were incorporated into the population. Things are of course very different now but the problem is that evolutionary timescales are long. It's only in the last century or so, approximately 0.00004 per cent of mammalian evolutionary time, that we managed to tweak our environment to such a degree that we can pretty much eat whatever we want, whenever we want it. Evolution has another couple of thousand years to go before it can catch up with the current reality of online food shopping and delivery.

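I often need to split a passage like the one above into sentences, as shown below.

It can be done by hand, of course, but that becomes difficult as the number of passages grows, so I tried to see whether it could be done with code.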

In the past there was little genetic pressure to stop people from becoming obese.

Genetic mutations that drove people to consume fewer calories were much less likely to be passed on, because in an environment where food was scarcer and its hunting or gathering required considerable energy outlay, an individual with that mutation would probably die before they had a chance to reproduce.

Mutations that in our environment of abundant food now drive us towards obesity, on the other hand, were incorporated into the population.

Things are of course very different now but the problem is that evolutionary timescales are long.

It's only in the last century or so, approximately 0.00004 per cent of mammalian evolutionary time, that we managed to tweak our environment to such a degree that we can pretty much eat whatever we want, whenever we want it.

Evolution has another couple of thousand years to go before it can catch up with the current reality of online food shopping and delivery.

First, I went to the ChatGPT site (https://chat.openai.com/chat) and had it generate the code with the prompt below.

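ChatGPT 4 Prompt:

I use a MacBook.

Write the code in Python.
Write Python code that splits the selected text into sentences.
Separate the sentences with line breaks and list them.
Each sentence must be a complete sentence.
Also split sentences that end with a question mark.
Many things that end with a period are not sentences, so the code must be able to tell these apart.
Add as many comments to the code as possible.
Also explain in detail how to run it.

Modify it to load a Word file and output the result below the original text.
Let the user choose the Word file through a dialog window.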
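
To illustrate why the prompt asks for special handling of periods, here is a minimal sketch of a naive splitter that uses only Python's standard `re` module. It is not part of the generated script; the function name `naive_split` and the sample sentence are my own assumptions for illustration.

```python
import re

def naive_split(text):
    # Split at whitespace that follows '.', '?' or '!'.
    # This treats every period as a sentence end, including abbreviations.
    return re.split(r"(?<=[.?!])\s+", text.strip())

# Hypothetical sample text containing an abbreviation and a question.
sample = "Some foods, e.g. bread, are cheap. Is that good?"

for piece in naive_split(sample):
    print(piece)
```

The naive version cuts the first sentence in two right after "e.g.", which is the kind of false split the prompt is trying to rule out.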

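And here is the result.

You don't get what you want in one shot, so you have to spell out what you want in a way ChatGPT understands.

When errors came up, I asked again about what I wanted and revised the code.

If you have never run Python code before, you may feel a ~~huge~~ slight barrier to entry T.T

It would be nice if I left an explanation, but I'm not confident enough to do so.

I'd suggest looking at introductory material on YouTube, and if you have questions about how to run it, leave a comment and I'll explain as best I can.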
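
For readers who have not run Python before, a sketch like the following may help check that the modules the script below imports are available. It relies only on the standard library, and the helper name `missing_modules` is hypothetical. As far as I know, the `docx` module comes from the `python-docx` package and `nltk` from the `nltk` package, while `tkinter` normally ships with Python itself.

```python
import importlib.util

def missing_modules(names):
    # Return the top-level module names that cannot be found
    # in the current Python environment.
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level modules imported by the script in this post.
required = ["nltk", "docx", "tkinter"]

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If anything is reported as missing, install the corresponding package in the same environment that PyCharm (or your terminal) uses before running the script.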

import os
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from nltk.tokenize.punkt import PunktTrainer
from docx import Document
import tkinter as tk
from tkinter import filedialog

# Download NLTK's Punkt data (kept from the generated script; the tokenizer
# below is trained on the input text itself).
nltk.download('punkt')

# Custom abbreviations. Punkt stores abbreviations in lowercase and
# without the trailing period.
custom_abbreviations = {"etc", "e.g", "i.e"}

def split_sentences(text):
    # Train Punkt on the input text so it can learn abbreviations from it.
    trainer = PunktTrainer(text)
    params = trainer.get_params()
    # Add the custom abbreviations so that a period after them
    # is not treated as the end of a sentence.
    params.abbrev_types.update(custom_abbreviations)
    tokenizer = PunktSentenceTokenizer(params)

    sentences = tokenizer.tokenize(text)
    return '\n\n'.join(sentences)

def main():
    # Open a file selection dialog
    root = tk.Tk()
    root.withdraw()
    file_path = filedialog.askopenfilename(filetypes=[("Word files", "*.docx")])

    if not file_path:
        print("No file selected.")
        return

    # Read the Word file
    document = Document(file_path)

    # Extract the original text
    text = ""
    for paragraph in document.paragraphs:
        text += paragraph.text + "\n"

    # Split into sentences
    split_text = split_sentences(text)

    # Append the result below the original text
    document.add_paragraph("\n\n---\n\n")
    document.add_paragraph(split_text)

    # Save the result to a new file
    output_file_path = os.path.splitext(file_path)[0] + "_output.docx"
    document.save(output_file_path)

    print(f"Processed file saved at: {output_file_path}")


if __name__ == "__main__":
    main()

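I ran it in PyCharm and confirmed that it works.

When you run the code, it asks you to choose the location of the Word file, and once you select a file, it opens the Word file and creates a new file that lists the sentence-split result after the original text.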
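
The name of the new file follows from the `os.path.splitext` line in the script. Here is a small sketch with a hypothetical input path; the helper name `output_path_for` is my own and does not appear in the script.

```python
import os

def output_path_for(file_path):
    # Drop the extension and append "_output.docx", as the script does.
    return os.path.splitext(file_path)[0] + "_output.docx"

# Hypothetical input path for illustration.
print(output_path_for("/Users/me/Documents/passage.docx"))
# -> /Users/me/Documents/passage_output.docx
```

Because the result is saved under a different name, the original Word file is left unchanged.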

Even for long files, it took only a few seconds.

The results seem to be right more than 95% of the time.

It's not perfect, but this is good enough for my work, so I was satisfied :)
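
Since the output is not perfect, a quick automated check can point out which lines to review by hand. The following is only a rough heuristic sketch under my own assumptions: the function name `suspicious_sentences` and its rules are hypothetical and not part of the generated script. It flags pieces that do not start with an uppercase letter or digit, or that do not end with sentence-final punctuation.

```python
def suspicious_sentences(sentences):
    # Return (index, sentence) pairs that look like bad splits.
    openers = "\"'“‘(["   # characters that may precede the first letter
    closers = "\"'”’)]"   # characters that may follow the final punctuation
    flagged = []
    for i, s in enumerate(sentences):
        s = s.strip()
        if not s:
            continue  # ignore empty lines
        first = s.lstrip(openers)[:1]
        last = s.rstrip(closers)[-1:]
        starts_ok = first.isupper() or first.isdigit()
        ends_ok = last in (".", "?", "!")
        if not (starts_ok and ends_ok):
            flagged.append((i, s))
    return flagged

# Hypothetical split result with one false break after "e.g."
result = ["Some foods, e.g.", "bread, are cheap.", "Is that good?"]
for index, sentence in suspicious_sentences(result):
    print(index, sentence)
```

A check like this will miss some errors and flag some correct sentences, so it only narrows down where to look.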
