
[LangChain] Loading Documents with Loaders

hongeeii 2025. 8. 18.
  • TextLoader
  • DirectoryLoader
  • CSVLoader
  • PyPDFLoader
  • WebBaseLoader
  • RecursiveUrlLoader
  • WikipediaLoader

-- Install the package. The commands are run from an .ipynb notebook, so each one is prefixed with !
!pip install langchain_community==0.3.18
 

Log of the successful install

  Collecting langchain_community==0.3.18
    Downloading langchain_community-0.3.18-py3-none-any.whl.metadata (2.4 kB)
  Requirement already satisfied: langchain-core<1.0.0,>=0.3.37 in c:\users\최민재\appdata\local\programs\python\python312\lib\site-packages (from langchain_community==0.3.18) (0.3.74)
  Collecting langchain<1.0.0,>=0.3.19 (from langchain_community==0.3.18)
    Downloading langchain-0.3.27-py3-none-any.whl.metadata (7.8 kB)
  Collecting SQLAlchemy<3,>=1.4 (from langchain_community==0.3.18)
    Downloading sqlalchemy-2.0.43-cp312-cp312-win_amd64.whl.metadata (9.8 kB)
  Requirement already satisfied: requests<3,>=2 in c:\users\최민재\appdata\local\programs\python\python312\lib\site-packages (from langchain_community==0.3.18) (2.32.4)
  Requirement already satisfied: PyYAML>=5.3 in c:\users\최민재\appdata\local\programs\python\python312\lib\site-packages (from langchain_community==0.3.18) (6.0.2)
  Collecting aiohttp<4.0.0,>=3.8.3 (from langchain_community==0.3.18)
    Downloading aiohttp-3.12.15-cp312-cp312-win_amd64.whl.metadata (7.9 kB)
  Requirement already satisfied: tenacity!=8.4.0,<10,>=8.1.0 in c:\users\최민재\appdata\local\programs\python\python312\lib\site-packages (from langchain_community==0.3.18) (9.1.2)
  Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community==0.3.18)
    Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
  Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community==0.3.18)
    Downloading pydantic_settings-2.10.1-py3-none-any.whl.metadata (3.4 kB)
  Collecting langsmith<0.4,>=0.1.125 (from langchain_community==0.3.18)
    Downloading langsmith-0.3.45-py3-none-any.whl.metadata (15 kB)
  Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain_community==0.3.18)
    Downloading httpx_sse-0.4.1-py3-none-any.whl.metadata (9.4 kB)
  Collecting numpy<3,>=1.26.2 (from langchain_community==0.3.18)
    Downloading numpy-2.3.2-cp312-cp312-win_amd64.whl.metadata (60 kB)
      ---------------------------------------- 0.0/60.9 kB ? eta -:--:--
      ---------------------------------------- 60.9/60.9 kB 3.4 MB/s eta 0:00:00
  Collecting aiohappyeyeballs>=2.5.0 (from aiohttp<4.0.0,>=3.8.3->langchain_community==0.3.18)
  ...
      Found existing installation: langsmith 0.4.13
      Uninstalling langsmith-0.4.13:
        Successfully uninstalled langsmith-0.4.13
  Successfully installed SQLAlchemy-2.0.43 aiohappyeyeballs-2.6.1 aiohttp-3.12.15 aiosignal-1.4.0 attrs-25.3.0 dataclasses-json-0.6.7 frozenlist-1.7.0 greenlet-3.2.4 httpx-sse-0.4.1 langchain-0.3.27 langchain-text-splitters-0.3.9 langchain_community-0.3.18 langsmith-0.3.45 marshmallow-3.26.1 multidict-6.6.4 mypy-extensions-1.1.0 numpy-2.3.2 propcache-0.3.2 pydantic-settings-2.10.1 typing-inspect-0.9.0 yarl-1.20.1

[notice] A new release of pip is available: 24.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

TextLoader

from langchain_community.document_loaders import TextLoader

loader = TextLoader('../docs/sample1.txt', encoding='utf-8')
loader.load()
 

TextLoader reads a text file into a Document.
Loading can fail when the encoding does not match the file, so set the encoding argument deliberately.
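The encoding problem can be reproduced without LangChain. The sketch below writes a file in cp949, a common encoding on Korean Windows, and reads it back by trying a list of candidate encodings in order. The helper name `read_text_with_fallback` and the candidate list are assumptions made for illustration; they are not part of TextLoader.

```python
import tempfile
from pathlib import Path


def read_text_with_fallback(path, encodings=("utf-8", "cp949")):
    """Try each encoding in order and return (text, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return Path(path).read_text(encoding=enc), enc
        except UnicodeDecodeError as err:
            # This encoding does not fit the bytes on disk; try the next one.
            last_error = err
    raise last_error


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample_cp949.txt"
    # Save the file in cp949 on purpose, so that reading it as utf-8 fails.
    sample.write_text("텍스트 파일 샘플 1입니다.", encoding="cp949")
    text, used = read_text_with_fallback(sample)

print(used)
print(text)
```

Once the working encoding is known, it can be passed to the loader as `encoding=...` in the same way as in the cell above.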

[Document(metadata={'source': '../docs/sample1.txt'}, page_content='텍스트 파일 샘플 1입니다.')]
 

DirectoryLoader

from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader('../docs/', glob="*.txt", loader_cls=TextLoader,
                         loader_kwargs={"encoding": "utf-8"})
loader.load()
 

glob - the pattern that selects which files to load

loader_cls - the loader class used to load each matched file in the directory

loader_kwargs - the keyword arguments passed on to that loader class
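To see what a pattern such as `*.txt` selects, it can be tried with `pathlib` on a throwaway directory. This is only a stdlib sketch of glob matching, under the assumption that the loader's pattern behaves like `pathlib` globbing; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    # Hypothetical files: two text files, one CSV, and one nested text file.
    for name in ("sample1.txt", "sample2.txt", "sample.csv", "sub/nested.txt"):
        (root / name).write_text("sample", encoding="utf-8")

    # "*.txt" matches only .txt files directly inside the directory.
    top_level = sorted(p.name for p in root.glob("*.txt"))
    # "**/*.txt" also descends into subdirectories.
    all_levels = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(top_level)
print(all_levels)
```

The CSV file is skipped by both patterns, which matches the DirectoryLoader result for this folder: only the .txt files are returned.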

[Document(metadata={'source': '..\\docs\\sample1.txt'}, page_content='텍스트 파일 샘플 1입니다.'),
Document(metadata={'source': '..\\docs\\sample2.txt'}, page_content='텍스트 파일 샘플 2입니다.')]
 

CSVLoader

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('../docs/sample.csv', encoding='utf-8')
loader.load()
 
[Document(metadata={'source': '../docs/sample.csv', 'row': 0}, page_content='번호: 1\n과일: 사과\n가격: 1000'),
 Document(metadata={'source': '../docs/sample.csv', 'row': 1}, page_content='번호: 2\n과일: 복숭아\n가격: 2000'),
 Document(metadata={'source': '../docs/sample.csv', 'row': 2}, page_content='번호: 3\n과일: 바나나\n가격: 3000'),
 Document(metadata={'source': '../docs/sample.csv', 'row': 3}, page_content='번호: 4\n과일: 오렌지\n가격: 4000')]
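The output shows one Document per CSV row, with page_content written as `column: value` lines and the row index stored in metadata. A rough stdlib sketch of that row-to-text conversion is below. It uses plain dicts instead of Document objects, and `rows_to_records` is a hypothetical helper, not CSVLoader's implementation.

```python
import csv
import io

# Same columns as the sample CSV above, shortened to two rows.
csv_text = "번호,과일,가격\n1,사과,1000\n2,복숭아,2000\n"


def rows_to_records(text, source="sample.csv"):
    """Turn each CSV row into a dict with page_content and metadata."""
    records = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text))):
        # One "column: value" line per column, joined by newlines.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        records.append({"page_content": content,
                        "metadata": {"source": source, "row": i}})
    return records


records = rows_to_records(csv_text)
print(records[0]["page_content"])
```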
 

PyPDFLoader

Working with PDF files in Python requires one extra package.

!pip install pypdf
 

Installation log

Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Downloading pypdf-6.0.0-py3-none-any.whl (310 kB)
   ---------------------------------------- 0.0/310.5 kB ? eta -:--:--
   -------------------------- ------------- 204.8/310.5 kB 6.3 MB/s eta 0:00:01
   ---------------------------------------- 310.5/310.5 kB 6.5 MB/s eta 0:00:00
Installing collected packages: pypdf
Successfully installed pypdf-6.0.0

[notice] A new release of pip is available: 24.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('../docs/sample.pdf')
loader.load()
 
[Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-03-23T10:39:49+09:00', 'author': 'NadoCoding', 'moddate': '2025-03-23T10:39:49+09:00', 'source': '../docs/sample.pdf', 'total_pages': 5, 'page': 0, 'page_label': '1'}, page_content='샘플 PDF 파일 #1 입니다'),
 Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-03-23T10:39:49+09:00', 'author': 'NadoCoding', 'moddate': '2025-03-23T10:39:49+09:00', 'source': '../docs/sample.pdf', 'total_pages': 5, 'page': 1, 'page_label': '2'}, page_content='샘플 PDF 파일 #2 입니다'),
 Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-03-23T10:39:49+09:00', 'author': 'NadoCoding', 'moddate': '2025-03-23T10:39:49+09:00', 'source': '../docs/sample.pdf', 'total_pages': 5, 'page': 2, 'page_label': '3'}, page_content='샘플 PDF 파일 #3 입니다'),
 Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-03-23T10:39:49+09:00', 'author': 'NadoCoding', 'moddate': '2025-03-23T10:39:49+09:00', 'source': '../docs/sample.pdf', 'total_pages': 5, 'page': 3, 'page_label': '4'}, page_content='샘플 PDF 파일 #4 입니다'),
 Document(metadata={'producer': 'Microsoft® Word 2013', 'creator': 'Microsoft® Word 2013', 'creationdate': '2025-03-23T10:39:49+09:00', 'author': 'NadoCoding', 'moddate': '2025-03-23T10:39:49+09:00', 'source': '../docs/sample.pdf', 'total_pages': 5, 'page': 4, 'page_label': '5'}, page_content='샘플 PDF 파일 #5 입니다')]
 

One Document object is returned per page.

metadata - holds the metadata for that page

page_content - holds the text of that page
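Because each page is its own Document, pages can be picked out by their metadata. The sketch below uses a small `Doc` dataclass as a hypothetical stand-in for the loaded Documents (it is not the LangChain class), filled with the same values as the output above. Note that `page` is 0-based while `page_label` is a 1-based string.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Shaped like the PyPDFLoader output above: one entry per page.
docs = [
    Doc(f"샘플 PDF 파일 #{n + 1} 입니다",
        {"source": "../docs/sample.pdf", "total_pages": 5,
         "page": n, "page_label": str(n + 1)})
    for n in range(5)
]


def pages_between(docs, first, last):
    """Join the text of pages whose 0-based index lies in [first, last]."""
    selected = [d for d in docs if first <= d.metadata["page"] <= last]
    return "\n".join(d.page_content for d in selected)


excerpt = pages_between(docs, 1, 2)
print(excerpt)
```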

WebBaseLoader

Install the beautifulsoup4 package.

It is the package used for web scraping.

!pip install beautifulsoup4
 

Installation log

Collecting beautifulsoup4
  Downloading beautifulsoup4-4.13.4-py3-none-any.whl.metadata (3.8 kB)
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.7-py3-none-any.whl.metadata (4.6 kB)
Requirement already satisfied: typing-extensions>=4.0.0 in c:\users\최민재\appdata\local\programs\python\python312\lib\site-packages (from beautifulsoup4) (4.14.1)
Downloading beautifulsoup4-4.13.4-py3-none-any.whl (187 kB)
   ---------------------------------------- 0.0/187.3 kB ? eta -:--:--
   ------ --------------------------------- 30.7/187.3 kB 1.4 MB/s eta 0:00:01
   -------------------------- ------------- 122.9/187.3 kB 1.4 MB/s eta 0:00:01
   ---------------------------------------- 187.3/187.3 kB 1.6 MB/s eta 0:00:00
Downloading soupsieve-2.7-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.13.4 soupsieve-2.7

[notice] A new release of pip is available: 24.1.1 -> 25.2
[notice] To update, run: python.exe -m pip install --upgrade pip

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com")
loader.load()
 

The other loaders take a file path; WebBaseLoader takes a URL instead.

Result

[Document(metadata={'source': 'https://www.langchain.com', 'title': 'LangChain', 'description': 'LangChain’s suite of products supports developers along each step of their development journey.', 'language': 'en'}, page_content="LangChain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nGuidesBlogCustomer StoriesLangChain AcademyCommunityEventsChangelogDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricingGet a demoSign up\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nGuidesBlogCustomer StoriesLangChain AcademyCommunityEventsChangelogDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricingGet a demoSign upThe platform for reliable agents. Tools for every step of the agent development lifecycle -- built to unlock powerful AI\xa0in production.Request a demoSee the docs\n\nJoin us August 19 in San Francisco for LangChain Academy Live — a hands-on workshop to master building reliable agents.Learn More\n\n\nLangChain products power top engineering teams, from startups to global enterprisesAccelerate agent development.Build faster with templates & a visual agent IDE. Reuse, configure, and combine agents to go further with less code.Ship reliable agents.Design agents that can handle sophisticated tasks with control. Add human-in-the-loop to steer and approve agent actions.Gain visibility & improve quality.See what’s happening - so you can quickly trace to root cause and debug issues. 
Evaluate your agent performance to improve over time.The Agent StackORCHESTRATION:Build agents with LangGraphControllable agent orchestration with built-in persistence to handle conversational history, memory, and agent-to-agent collaboration.INTEGRATIONS:Integrate components with LangChainIntegrate with the latest models, databases, and tools with no engineering overhead.EVALS\xa0&\xa0OBSERVABILITY:Gain visibility with LangSmithDebug poor-performing LLM app runs. Evaluate and observe agent performance at scale.DEPLOYMENT:Deploy &\xa0manage with LangGraph PlatformDeploy and scale enterprise-grade agents with long-running workflows. Discover, reuse, and share agents across teams — and iterate faster with LangGraph Studio.CopilotsBuild native co-pilots into your application to unlock new end user experiences for domain-specific tasks.Enterprise GPTGive all employees access\u2028to information and tools\u2028in a compliant manner so they\u2028can perform their best.Customer SupportImprove the speed & efficiency\u2028of support teams that handle customer requests.ResearchSynthesize data, summarize sources & uncover insights faster than ever for knowledge work.Code generationAccelerate software development by automating code writing, refactoring, and documentation for your team.AI SearchOffer a concierge experience to guide users to products or information in a personalized way.\n\n\n\n\n\nLangChain products are designed to be used independently or stack for multiplicative benefit. LangChainLangGraphFrameworksLangSmithLangGraph PlatformPlatformsFrameworksLangChainLangGraphPlatformsLangSmithLangGraph \u2028PlatformSTACK 1:\xa0LangGraph +\xa0LangChain +\xa0LangSmith +\xa0LangGraph\xa0PlatformA full product suite for reliable agents and LLM appsLangChain's products work seamlessly together to provide an integrated solution for every step of the application development journey. 
When you use all LangChain products, you'll build better, get to production quicker, and grow visibility -- all with less set up and friction. LangChain provides the smoothest path to high quality agents.Orchestration:Integrations:Evals + Observability:Deployment:STACK 2: No framework +\xa0LangSmithTrace\xa0and evaluate any LLM appLangSmith is framework-agnostic. Trace using the TypeScript or Python SDK\xa0to gain visibility into your agent interactions -- whether you use LangChain's frameworks or not.Orchestration:Your choiceEvals + Observability:STACK 3:\xa0Any agent framework +\xa0LangGraph PlatformBuild agents any way you want, then deploy and scale with easeLangGraph Platform works with any agent framework, enabling stateful UXs like human-in-the-loop and streaming-native deployments.Orchestration:Your choiceDeployment:Get inspired by companies who have done it.Teams building with LangChain products are driving operational efficiency, increasing discovery & personalization, and delivering premium products that generate revenue.Discover Use Cases\n\n\nFinancial ServicesKlarna's AI assistant has reduced average customer query resolution time by 80%, powered by LangSmith and LangGraph\n\n\nTransportationThis global logistics provider is saving 600 hours a day using an automated order system built on LangGraph and LangSmith\n\n\nSecurityAs a leading cybersecurity firm with 40k+ customers, Trellix cut log parsing from days to minutes using LangGraph and LangSmith.\n\n\nThe biggest developer community in GenAILearn alongside the 1M+ practitioners using our frameworks to push the industry forward.#1Downloaded agent framework100k+GitHub stars#1Downloaded agent framework600+IntegrationsReady to start shipping \u2028reliable agents faster?Get started with tools from the LangChain product suite for every step of the agent development lifecycle.Get a demoSign up for freeProductsLangChainLangSmithLangGraphResourcesGuidesBlogCustomer StoriesLangChain 
AcademyCommunityEventsChangelogExpertsPython DocsLangGraph LangSmithLangChainJS DocsLangGraphLangSmithLangChainCompanyAboutCareersXLinkedInYouTubeMarketing AssetsSecuritySign up for our newsletter to stay up to dateThank you! Your submission has been received!Oops! Something went wrong while submitting the form.All systems operationalPrivacy PolicyTerms of Service\n\n\n\n\n\n\n\n\n")]
 

Passing several URLs as a list loads each page as its own Document.

loader_multiple_pages = WebBaseLoader(["https://www.langchain.com", "https://www.python.org"])
loader_multiple_pages.load()
 

Result

[Document(metadata={'source': 'https://www.langchain.com', 'title': 'LangChain', 'description': 'LangChain’s suite of products supports developers along each step of their development journey.', 'language': 'en'}, page_content="LangChain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nGuidesBlogCustomer StoriesLangChain AcademyCommunityEventsChangelogDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricingGet a demoSign up\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nGuidesBlogCustomer StoriesLangChain AcademyCommunityEventsChangelogDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricingGet a demoSign upThe platform for reliable agents. Tools for every step of the agent development lifecycle -- built to unlock powerful AI\xa0in production.Request a demoSee the docs\n\nJoin us August 19 in San Francisco for LangChain Academy Live — a hands-on workshop to master building reliable agents.Learn More\n\n\nLangChain products power top engineering teams, from startups to global enterprisesAccelerate agent development.Build faster with templates & a visual agent IDE. Reuse, configure, and combine agents to go further with less code.Ship reliable agents.Design agents that can handle sophisticated tasks with control. Add human-in-the-loop to steer and approve agent actions.Gain visibility & improve quality.See what’s happening - so you can quickly trace to root cause and debug issues. 
Evaluate your agent performance to improve over time.The Agent StackORCHESTRATION:Build agents with LangGraphControllable agent orchestration with built-in persistence to handle conversational history, memory, and agent-to-agent collaboration.INTEGRATIONS:Integrate components with LangChainIntegrate with the latest models, databases, and tools with no engineering overhead.EVALS\xa0&\xa0OBSERVABILITY:Gain visibility with LangSmithDebug poor-performing LLM app runs. Evaluate and observe agent performance at scale.DEPLOYMENT:Deploy &\xa0manage with LangGraph PlatformDeploy and scale enterprise-grade agents with long-running workflows. Discover, reuse, and share agents across teams — and iterate faster with LangGraph Studio.CopilotsBuild native co-pilots into your application to unlock new end user experiences for domain-specific tasks.Enterprise GPTGive all employees access\u2028to information and tools\u2028in a compliant manner so they\u2028can perform their best.Customer SupportImprove the speed & efficiency\u2028of support teams that handle customer requests.ResearchSynthesize data, summarize sources & uncover insights faster than ever for knowledge work.Code generationAccelerate software development by automating code writing, refactoring, and documentation for your team.AI SearchOffer a concierge experience to guide users to products or information in a personalized way.\n\n\n\n\n\nLangChain products are designed to be used independently or stack for multiplicative benefit. LangChainLangGraphFrameworksLangSmithLangGraph PlatformPlatformsFrameworksLangChainLangGraphPlatformsLangSmithLangGraph \u2028PlatformSTACK 1:\xa0LangGraph +\xa0LangChain +\xa0LangSmith +\xa0LangGraph\xa0PlatformA full product suite for reliable agents and LLM appsLangChain's products work seamlessly together to provide an integrated solution for every step of the application development journey. 
When you use all LangChain products, you'll build better, get to production quicker, and grow visibility -- all with less set up and friction. LangChain provides the smoothest path to high quality agents.Orchestration:Integrations:Evals + Observability:Deployment:STACK 2: No framework +\xa0LangSmithTrace\xa0and evaluate any LLM appLangSmith is framework-agnostic. Trace using the TypeScript or Python SDK\xa0to gain visibility into your agent interactions -- whether you use LangChain's frameworks or not.Orchestration:Your choiceEvals + Observability:STACK 3:\xa0Any agent framework +\xa0LangGraph PlatformBuild agents any way you want, then deploy and scale with easeLangGraph Platform works with any agent framework, enabling stateful UXs like human-in-the-loop and streaming-native deployments.Orchestration:Your choiceDeployment:Get inspired by companies who have done it.Teams building with LangChain products are driving operational efficiency, increasing discovery & personalization, and delivering premium products that generate revenue.Discover Use Cases\n\n\nFinancial ServicesKlarna's AI assistant has reduced average customer query resolution time by 80%, powered by LangSmith and LangGraph\n\n\nTransportationThis global logistics provider is saving 600 hours a day using an automated order system built on LangGraph and LangSmith\n\n\nSecurityAs a leading cybersecurity firm with 40k+ customers, Trellix cut log parsing from days to minutes using LangGraph and LangSmith.\n\n\nThe biggest developer community in GenAILearn alongside the 1M+ practitioners using our frameworks to push the industry forward.#1Downloaded agent framework100k+GitHub stars#1Downloaded agent framework600+IntegrationsReady to start shipping \u2028reliable agents faster?Get started with tools from the LangChain product suite for every step of the agent development lifecycle.Get a demoSign up for freeProductsLangChainLangSmithLangGraphResourcesGuidesBlogCustomer StoriesLangChain 
AcademyCommunityEventsChangelogExpertsPython DocsLangGraph LangSmithLangChainJS DocsLangGraphLangSmithLangChainCompanyAboutCareersXLinkedInYouTubeMarketing AssetsSecuritySign up for our newsletter to stay up to dateThank you! Your submission has been received!Oops! Something went wrong while submitting the form.All systems operationalPrivacy PolicyTerms of Service\n\n\n\n\n\n\n\n\n"),
 Document(metadata={'source': 'https://www.python.org', 'title': 'Welcome to Python.org', 'description': 'The official home of the Python Programming Language', 'language': 'en'}, page_content='\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWelcome to Python.org\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNotice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. \n\n\n\n\n\n\nSkip to content\n\n\n▼ Close\n                \n\n\nPython\n\n\nPSF\n\n\nDocs\n\n\nPyPI\n\n\nJobs\n\n\nCommunity\n\n\n\n▲ The Python Network\n                \n\n\n\n\n\n\n\n\n\nDonate\n\n≡ Menu\n\n\nSearch This Site\n\n\n                                    GO\n                                \n\n\n\n\n\nA A\n\nSmaller\nLarger\nReset\n\n\n\n\n\n\nSocialize\n\nLinkedIn\nMastodon\nChat on IRC\nTwitter\n\n\n\n\n\n\n\n\n\n\nAbout\n\nApplications\nQuotes\nGetting Started\nHelp\nPython Brochure\n\n\n\nDownloads\n\nAll releases\nSource code\nWindows\nmacOS\nAndroid\nOther Platforms\nLicense\nAlternative Implementations\n\n\n\nDocumentation\n\nDocs\nAudio/Visual Talks\nBeginner\'s Guide\nDeveloper\'s Guide\nFAQ\nNon-English Docs\nPEP Index\nPython Books\nPython Essays\n\n\n\nCommunity\n\nDiversity\nMailing Lists\nIRC\nForums\nPSF Annual Impact Report\nPython Conferences\nSpecial Interest Groups\nPython Logo\nPython Wiki\nCode of Conduct\nCommunity Awards\nGet Involved\nShared Stories\n\n\n\nSuccess Stories\n\nArts\nBusiness\nEducation\nEngineering\nGovernment\nScientific\nSoftware Development\n\n\n\nNews\n\nPython News\nPSF Newsletter\nPSF News\nPyCon US News\nNews from the Community\n\n\n\nEvents\n\nPython Events\nUser Group Events\nPython Events Archive\nUser Group Events Archive\nSubmit an Event\n\n\n\n\n \n\n\n\n>_\n                        Launch Interactive Shell\n\n\n\n\n\n# Python 3: Fibonacci series up to n\r\n>>> def fib(n):\r\n>>>     a, b = 0, 1\r\n>>>     while 
a < n:\r\n>>>         print(a, end=\' \')\r\n>>>         a, b = b, a+b\r\n>>>     print()\r\n>>> fib(1000)\r\n0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987\nFunctions Defined\nThe core of extensible programming is defining functions. Python allows mandatory and optional arguments, keyword arguments, and even arbitrary argument lists. More about defining functions in Python\xa03\n\n\n# Python 3: List comprehensions\r\n>>> fruits = [\'Banana\', \'Apple\', \'Lime\']\r\n>>> loud_fruits = [fruit.upper() for fruit in fruits]\r\n>>> print(loud_fruits)\r\n[\'BANANA\', \'APPLE\', \'LIME\']\r\n\r\n# List and the enumerate function\r\n>>> list(enumerate(fruits))\r\n[(0, \'Banana\'), (1, \'Apple\'), (2, \'Lime\')]\nCompound Data Types\nLists (known as arrays in other languages) are one of the compound data types that Python understands. Lists can be indexed, sliced and manipulated with other built-in functions. More about lists in Python\xa03\n\n\n# Python 3: Simple arithmetic\r\n>>> 1 / 2\r\n0.5\r\n>>> 2 ** 3\r\n8\r\n>>> 17 / 3  # classic division returns a float\r\n5.666666666666667\r\n>>> 17 // 3  # floor division\r\n5\nIntuitive Interpretation\nCalculations are simple with Python, and expression syntax is straightforward: the operators +, -, * and / work as expected; parentheses () can be used for grouping. More about simple math functions in Python\xa03.\n\n\n# For loop on a list\r\n>>> numbers = [2, 4, 6, 8]\r\n>>> product = 1\r\n>>> for number in numbers:\r\n...    product = product * number\r\n... \r\n>>> print(\'The product is:\', product)\r\nThe product is: 384\nAll the Flow You’d Expect\nPython knows the usual control flow statements that other languages speak — if, for, while and range — with some of its own twists, of course. 
More control flow tools in Python\xa03\n\n\n# Simple output (with Unicode)\r\n>>> print("Hello, I\'m Python!")\r\nHello, I\'m Python!\r\n# Input, assignment\r\n>>> name = input(\'What is your name?\\n\')\r\nWhat is your name?\r\nPython\r\n>>> print(f\'Hi, {name}.\')\r\nHi, Python.\r\n\nQuick & Easy to Learn\nExperienced programmers in any other language can pick up Python very quickly, and beginners find the clean syntax and indentation structure easy to learn. Whet your appetite with our Python\xa03 overview.\n\n\n\n\n\nPython is a programming language that lets you work quickly and integrate systems more effectively. Learn More\n\n\n\n\n\n\n\n\n\nGet Started\nWhether you\'re new to programming or an experienced developer, it\'s easy to learn and use Python.\nStart with our Beginner’s Guide\n\n\nDownload\nPython source code and installers are available for download for all versions!\nLatest: Python 3.13.7\n\n\nDocs\nDocumentation for Python\'s standard library, along with tutorials and guides, are available online.\ndocs.python.org\n\n\nJobs\nLooking for work or have a Python related position that you\'re trying to hire for? Our relaunched community-run job board is the place to go.\njobs.python.org\n\n\n\n\n\nLatest News\nMore\n\n\n2025-08-14\nPython 3.14.0rc2 and 3.13.7 are go!\n\n2025-08-14\nAnnouncing the PSF Board Candidates for 2025!\n\n2025-08-08\nAnnouncing Python Software Foundation Fellow Members for Q2 2025! 
🎉\n\n2025-08-07\nUnmasking Phantom Dependencies with Software Bill-of-Materials as Ecosystem Neutral Metadata\n\n2025-08-06\nPython 3.13.6 is now available\n\n\n\n\n\nUpcoming Events\nMore\n\n\n2025-08-18\nEuroSciPy 2025\n\n2025-08-23\nStatusCode 2 Hackathon\n\n2025-08-23\nPyCon Togo 2025\n\n2025-08-26\nPyLadies Amsterdam: Data science that ships: production-ready pipelines with Kedro\n\n2025-08-28\nPyCon Kenya 2025\n\n\n\n\n\n\n\nSuccess Stories\nMore\n\n\nPython programmability on Algorand makes the entire development lifecycle easier and means more affordable and efficient maintenance and upgrades going forward.\n\n\n\n\nUsing Python to build a solution for instant tokenized real estate redemptions by Brian Whippo, Head of Developer Relations, Algorand Foundation\n\n\n\n\n\n\n\n\nUse Python for…\nMore\n\nWeb Development:\r\n        Django, Pyramid, Bottle, Tornado, Flask, web2py\nGUI Development:\r\n        tkInter, PyGObject, PyQt, PySide, Kivy, wxPython, DearPyGui\nScientific and Numeric:\r\n        \nSciPy, Pandas, IPython\nSoftware Development:\r\n        Buildbot, Trac, Roundup\nSystem Administration:\r\n        Ansible, Salt, OpenStack, xonsh\n\n\n\n\n\n\n\n>>> Python Software Foundation\n\nThe mission of the Python Software Foundation is to promote, protect, and advance the Python programming language, and to support and facilitate the growth of a diverse and international community of Python programmers. 
Learn more \n\nBecome a Member\nDonate to the PSF\n\n\n\n\n\n\n\n\n\n▲ Back to Top\n\n\nAbout\n\nApplications\nQuotes\nGetting Started\nHelp\nPython Brochure\n\n\n\nDownloads\n\nAll releases\nSource code\nWindows\nmacOS\nAndroid\nOther Platforms\nLicense\nAlternative Implementations\n\n\n\nDocumentation\n\nDocs\nAudio/Visual Talks\nBeginner\'s Guide\nDeveloper\'s Guide\nFAQ\nNon-English Docs\nPEP Index\nPython Books\nPython Essays\n\n\n\nCommunity\n\nDiversity\nMailing Lists\nIRC\nForums\nPSF Annual Impact Report\nPython Conferences\nSpecial Interest Groups\nPython Logo\nPython Wiki\nCode of Conduct\nCommunity Awards\nGet Involved\nShared Stories\n\n\n\nSuccess Stories\n\nArts\nBusiness\nEducation\nEngineering\nGovernment\nScientific\nSoftware Development\n\n\n\nNews\n\nPython News\nPSF Newsletter\nPSF News\nPyCon US News\nNews from the Community\n\n\n\nEvents\n\nPython Events\nUser Group Events\nPython Events Archive\nUser Group Events Archive\nSubmit an Event\n\n\n\nContributing\n\nDeveloper\'s Guide\nIssue Tracker\npython-dev list\nCore Mentorship\nReport a Security Issue\n\n\n\n▲ Back to Top\n\n \n\n\n\nHelp & General Contact\nDiversity Initiatives\nSubmit Website Bug\n\nStatus \n\n\n\n\nCopyright ©2001-2025.\n                            \xa0Python Software Foundation\n                            \xa0Legal Statements\n                            \xa0Privacy Notice\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n')]
 

RecursiveUrlLoader

This loader can also follow the links found on a loaded web page and fetch more content from them.

from langchain_community.document_loaders import RecursiveUrlLoader
loader = RecursiveUrlLoader("https://www.langchain.com")
loader.load()
 
It fetches far too much to show here, so run the code yourself.
 

Among the RecursiveUrlLoader arguments, max_depth sets how deep the recursion goes.
The default is 2, so adjust it as needed.
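The effect of a depth limit is easier to see on a tiny in-memory link graph than on a live site. In the sketch below, `crawl` and the `links` dict are hypothetical. It counts the start page as depth 0 and skips anything at depth `max_depth` or deeper. That counting is an assumption made for the illustration, so check the library documentation for the exact rule RecursiveUrlLoader applies.

```python
def crawl(start, links, max_depth=2):
    """Depth-limited traversal; returns pages in the order they were visited."""
    visited = []
    seen = set()

    def visit(url, depth):
        # Stop at the depth limit, and never visit the same page twice.
        if depth >= max_depth or url in seen:
            return
        seen.add(url)
        visited.append(url)
        for child in links.get(url, []):
            visit(child, depth + 1)

    visit(start, 0)
    return visited


# Hypothetical site structure: page -> pages it links to.
links = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/intro", "/"],
    "/blog": ["/blog/post-1"],
}

pages_depth_1 = crawl("/", links, max_depth=1)
pages_depth_2 = crawl("/", links, max_depth=2)
pages_depth_3 = crawl("/", links, max_depth=3)
print(len(pages_depth_1), len(pages_depth_2), len(pages_depth_3))
```

Each extra level can multiply the number of pages, which is why even a small limit pulls in a lot of content on a real site.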

RecursiveUrlLoader returns the raw HTML with the tags left in, so the content needs to be filtered.

import re
from bs4 import BeautifulSoup

def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")  # parse the HTML with the html.parser backend
    return re.sub(r"\n\n+", "\n\n", soup.text).strip()  # collapse runs of blank lines with a regex

loader = RecursiveUrlLoader('https://www.langchain.com', extractor=bs4_extractor)
docs = loader.load()
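The extractor is just a function that takes an HTML string and returns text, so its effect can be checked on a small sample without any network access. The sketch below is a stdlib stand-in for `bs4_extractor`: it collects text with `html.parser` instead of BeautifulSoup and then applies the same regex. The name `stdlib_extractor` and the sample HTML are made up for the example.

```python
import re
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def stdlib_extractor(html: str) -> str:
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = "".join(collector.parts)
    # Same cleanup as bs4_extractor: two or more newlines become exactly two.
    return re.sub(r"\n\n+", "\n\n", text).strip()


sample_html = ("<html><body><h1>LangChain</h1>\n\n\n\n"
               "<p>Products</p>\n\n\n<p>Docs</p></body></html>")
cleaned = stdlib_extractor(sample_html)
print(repr(cleaned))
```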
 

I was curious how extractor is used, so I took a look.

 
len(docs)
 
18
 
docs[0].page_content[:200]
 
'LangChain\n\nProducts\n\nFrameworksLangGraphLangChainPlatformsLangSmithLangGraph PlatformResources\n\nGuidesBlogCustomer StoriesLangChain AcademyCommunityEventsChangelogDocs\n\nPythonLangGraphLangSmithLangCha'
 

WikipediaLoader

Install the package used to fetch Wikipedia content.

!pip install wikipedia==1.4.0
 
from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query='지진', lang='ko', load_max_docs=2, doc_content_chars_max=4000)
docs = loader.load()
 
  • query - the search term to look up on Wikipedia
  • lang - the language setting
  • load_max_docs - the maximum number of pages to load
  • doc_content_chars_max - the maximum number of characters of content

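If doc_content_chars_max is too small, the content may be cut off before the end, so choose it with care.

The content is trimmed after the page has already been fetched, so a large doc_content_chars_max probably does not hurt performance much.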
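The trimming described above amounts to slicing text that is already in memory. A minimal sketch, where `truncate_content` is a hypothetical helper and `full_page` is placeholder text standing in for a fetched page:

```python
def truncate_content(text: str, doc_content_chars_max: int) -> tuple[str, bool]:
    """Cut text to at most doc_content_chars_max characters.

    Returns the (possibly shortened) text and whether anything was lost.
    """
    truncated = text[:doc_content_chars_max]
    return truncated, len(truncated) < len(text)


# Placeholder for an already-fetched page: 6000 characters.
full_page = "지진" * 3000

short, was_cut = truncate_content(full_page, 4000)
whole, was_cut_large = truncate_content(full_page, 10000)
print(len(short), was_cut)
print(len(whole), was_cut_large)
```

When `len(doc.page_content)` comes out exactly equal to doc_content_chars_max, that is a hint the page was probably cut and the limit may need raising.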

len(docs)
 
2
 
docs[0].page_content[:300]
 
'지진(地震, 영어: earthquake, quake, tremor, temblor)은 지구 암석권 내부에서 갑작스럽게 에너지를 방출하면서 지진파를 만들어내며 지구 표면까지 흔들리는 현상이다. 지진은 느낄 수 없을 정도로 약한 크기서부터 사람과 여러 물건을 공중으로 들어올리고 도시 전체를 파괴할 수 있을 정도로 매우 격렬한 크기의 지진까지 다양한 강도로 일어난다. 특정 지역의 지진 활동(seismic activity)이란 특정 기간 그 지역에서 발생한 지진의 빈도, 유형, 크기를 말한다. 지진에는 지표면의 진동 외에도 정상 미끄러짐이'