Document loaders
DocumentLoaders load data into the standard LangChain Document format.
Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method. An example use case is as follows:
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
... # <-- Integration specific parameters here
)
data = loader.load()
API Reference: CSVLoader
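To make the shared `.load` pattern concrete, here is a minimal, self-contained sketch that uses only the Python standard library. `Document` and `ToyCSVLoader` are hypothetical stand-ins written for illustration, not LangChain's own classes. The sketch assumes a Document is a piece of text plus a metadata dictionary, and that a CSV loader emits one Document per row.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class ToyCSVLoader:
    """Illustrative loader (not LangChain's CSVLoader): one Document per CSV row."""

    def __init__(self, text: str, source: str = "inline"):
        # Integration-specific parameters live in the constructor.
        self.text = text
        self.source = source

    def load(self) -> list[Document]:
        # Every loader exposes the same entry point and returns Documents.
        reader = csv.DictReader(io.StringIO(self.text))
        docs = []
        for i, row in enumerate(reader):
            content = "\n".join(f"{key}: {value}" for key, value in row.items())
            docs.append(Document(content, {"source": self.source, "row": i}))
        return docs


docs = ToyCSVLoader("name,team\nAda,core\nLin,docs\n").load()
for doc in docs:
    print(doc.metadata, repr(doc.page_content))
```

The point of the design is that downstream code only depends on `load()` returning a list of Documents, so one loader can be swapped for another without changing the rest of a pipeline.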
Webpages
The document loaders below let you load webpages.
Document Loader | Description | Package/API |
---|---|---|
Web | Uses urllib and BeautifulSoup to load and parse HTML web pages | Package |
RecursiveURL | Recursively scrapes all child links from a root URL | Package |
Sitemap | Scrapes all pages on a given sitemap | Package |
Firecrawl | API service that can be deployed locally; the hosted version has free credits | API |
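As a rough illustration of what a webpage loader does once it has the HTML, the sketch below turns markup into text plus metadata. It is an assumption-laden stand-in: it uses the standard library's `html.parser` in place of BeautifulSoup and parses an inline string instead of fetching a URL, so it runs offline. `TextExtractor` and `html_to_document` are hypothetical names, not part of LangChain.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and the <title>, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.title = ""
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title = text
        else:
            self.parts.append(text)


def html_to_document(html: str, source: str) -> dict:
    """Return a Document-like dict: page text plus source and title metadata."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.parts),
        "metadata": {"source": source, "title": parser.title},
    }


page = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>Loaders turn pages into text.</p></body></html>"
)
doc = html_to_document(page, "https://example.com/demo")
print(doc)
```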
PDFs
The document loaders below let you load PDF documents.
Document Loader | Description | Package/API |
---|---|---|
PyPDF | Uses `pypdf` to load and parse PDFs | Package |
Unstructured | Uses Unstructured's open source library to load PDFs | Package |
Amazon Textract | Uses AWS API to load PDFs | API |
MathPix | Uses MathPix to load PDFs | Package |
PDFPlumber | Load PDF files using PDFPlumber | Package |
PyPDFDirectory | Load a directory with PDF files | Package |
PyPDFium2 | Load PDF files using PyPDFium2 | Package |
UnstructuredPDFLoader | Load PDF files using Unstructured | Package |
PyMuPDF | Load PDF files using PyMuPDF | Package |
PDFMiner | Load PDF files using PDFMiner | Package |
Common File Types
The document loaders below let you load data from common data formats.
Document Loader | Data Type |
---|---|
CSVLoader | CSV files |
DirectoryLoader | All files in a given directory |
Unstructured | All file types |
JSONLoader | JSON files |
UnstructuredMarkdownLoader | Markdown files |
BSHTMLLoader | HTML files |
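The idea behind loading every file in a directory can be sketched with the standard library alone. The helper below is hypothetical (`load_directory` and its `pattern` parameter are illustrative names, not LangChain's DirectoryLoader API). It assumes plain UTF-8 text files and records each file's path as its source.

```python
import tempfile
from pathlib import Path


def load_directory(root: str, pattern: str = "**/*.txt") -> list[dict]:
    """Read every file matching `pattern` under `root` into a Document-like dict."""
    docs = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            docs.append(
                {
                    "page_content": path.read_text(encoding="utf-8"),
                    "metadata": {"source": str(path)},
                }
            )
    return docs


# Build a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "notes.md").write_text("ignored", encoding="utf-8")
    docs = load_directory(tmp)

print([d["page_content"] for d in docs])
```

Keeping the file path in the metadata makes it possible to trace any loaded text back to the file it came from.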
All document loaders
Name | Description |
---|---|
acreom | acreom is a dev-first knowledge base with tasks running on local mark... |
AirbyteLoader | Airbyte is a data integration platform for ELT pipelines from APIs, d... |
Airtable | * Get your API key here. |
Alibaba Cloud MaxCompute | Alibaba Cloud MaxCompute (previously known as ODPS) is a general purp... |
Amazon Textract | Amazon Textract is a machine learning (ML) service that automatically... |
Apify Dataset | Apify Dataset is a scalable append-only storage with sequential acces... |
ArcGIS | This notebook demonstrates the use of the langchain_community.document... |
ArxivLoader | arXiv is an open-access archive for 2 million scholarly articles in t... |
AssemblyAI Audio Transcripts | The AssemblyAIAudioTranscriptLoader allows you to transcribe audio files ... |
AstraDB | DataStax Astra DB is a serverless vector-capable database built on Ca... |
Async Chromium | Chromium is one of the browsers supported by Playwright, a library us... |
AsyncHtml | AsyncHtmlLoader loads raw HTML from a list of URLs concurrently. |
Athena | Amazon Athena is a serverless, interactive analytics service built |
AWS S3 Directory | Amazon Simple Storage Service (Amazon S3) is an object storage service |
AWS S3 File | Amazon Simple Storage Service (Amazon S3) is an object storage servic... |
AZLyrics | AZLyrics is a large, legal, every day growing collection of lyrics. |
Azure AI Data | Azure AI Studio provides the capability to upload data assets to clou... |
Azure Blob Storage Container | Azure Blob Storage is Microsoft's object storage solution for the clo... |
Azure Blob Storage File | Azure Files offers fully managed file shares in the cloud that are ac... |
Azure AI Document Intelligence | Azure AI Document Intelligence (formerly known as Azure Form Recogniz... |
BibTeX | BibTeX is a file format and reference management system commonly used... |
BiliBili | Bilibili is one of the most beloved long-form video sites in China. |
Blackboard | Blackboard Learn (previously the Blackboard Learning Management Syste... |
Blockchain | Overview |
Brave Search | Brave Search is a search engine developed by Brave Software. |
Browserbase | Browserbase is a developer platform to reliably run, manage, and moni... |
Browserless | Browserless is a service that allows you to run headless Chrome insta... |
BSHTMLLoader | This notebook provides a quick overview for getting started with Beau... |
Cassandra | Cassandra is a NoSQL, row-oriented, highly scalable and highly availa... |
ChatGPT Data | ChatGPT is an artificial intelligence (AI) chatbot developed by OpenA... |
College Confidential | College Confidential gives information on 3,800+ colleges and univers... |
Concurrent Loader | Works just like the GenericLoader but concurrently for those who choo... |
Confluence | Confluence is a wiki collaboration platform that saves and organizes ... |
CoNLL-U | CoNLL-U is a revised version of the CoNLL-X format. Annotations are enc... |
Copy Paste | This notebook covers how to load a document object from something you... |
Couchbase | Couchbase is an award-winning distributed NoSQL cloud database that d... |
CSV | A comma-separated values (CSV) file is a delimited text file that use... |
Cube Semantic Layer | This notebook demonstrates the process of retrieving Cube's data mode... |
Datadog Logs | Datadog is a monitoring and analytics platform for cloud-scale applic... |
Dedoc | This sample demonstrates the use of Dedoc in combination with LangCha... |
Diffbot | Diffbot is a suite of ML-based products that make it easy to structur... |
Discord | Discord is a VoIP and instant messaging social platform. Users have t... |
Docugami | This notebook covers how to load documents from Docugami. It provides... |
Docusaurus | Docusaurus is a static-site generator which provides out-of-the-box d... |
Dropbox | Dropbox is a file hosting service that brings everything-traditional ... |
DuckDB | DuckDB is an in-process SQL OLAP database management system. |
Email | This notebook shows how to load email (.eml) or Microsoft Outlook (.m... |
EPub | EPUB is an e-book file format that uses the ".epub" file extension. T... |
Etherscan | Etherscan is the leading blockchain explorer, search, API and analyt... |
EverNote | EverNote is intended for archiving and creating notes in which photos... |
Facebook Chat | Messenger is an American proprietary instant messaging app and platf... |
Fauna | Fauna is a Document Database. |
Figma | Figma is a collaborative web application for interface design. |
FireCrawl | FireCrawl crawls and converts any website into LLM-ready data. It craw... |
Geopandas | Geopandas is an open-source project to make working with geospatial d... |
Git | Git is a distributed version control system that tracks changes in an... |
GitBook | GitBook is a modern documentation platform where teams can document e... |
GitHub | This notebook shows how you can load issues and pull requests (PRs) ... |
Glue Catalog | The AWS Glue Data Catalog is a centralized metadata repository that a... |
Google AlloyDB for PostgreSQL | AlloyDB is a fully managed relational database service that offers hi... |
Google BigQuery | Google BigQuery is a serverless and cost-effective enterprise data wa... |
Google Bigtable | Bigtable is a key-value and wide-column store, ideal for fast access ... |
Google Cloud SQL for SQL server | Cloud SQL is a fully managed relational database service that offers ... |
Google Cloud SQL for MySQL | Cloud SQL is a fully managed relational database service that offers ... |
Google Cloud SQL for PostgreSQL | Cloud SQL for PostgreSQL is a fully-managed database service that hel... |
Google Cloud Storage Directory | Google Cloud Storage is a managed service for storing unstructured da... |
Google Cloud Storage File | Google Cloud Storage is a managed service for storing unstructured da... |
Google Firestore in Datastore Mode | Firestore in Datastore Mode is a NoSQL document database built for au... |
Google Drive | Google Drive is a file storage and synchronization service developed ... |
Google El Carro for Oracle Workloads | Google El Carro Oracle Operator |
Google Firestore (Native Mode) | Firestore is a serverless document-oriented database that scales to m... |
Google Memorystore for Redis | Google Memorystore for Redis is a fully-managed service that is power... |
Google Spanner | Spanner is a highly scalable database that combines unlimited scalabi... |
Google Speech-to-Text Audio Transcripts | The GoogleSpeechToTextLoader allows you to transcribe audio files with th... |
Grobid | GROBID is a machine learning library for extracting, parsing, and re-... |
Gutenberg | Project Gutenberg is an online library of free eBooks. |
Hacker News | Hacker News (sometimes abbreviated as HN) is a social news website fo... |
Huawei OBS Directory | The following code demonstrates how to load objects from the Huawei O... |
Huawei OBS File | The following code demonstrates how to load an object from the Huawei... |
HuggingFace dataset | The Hugging Face Hub is home to over 5,000 datasets in more than 100 ... |
iFixit | iFixit is the largest, open repair community on the web. The site con... |
Images | This covers how to load images into a document format that we can use... |
Image captions | By default, the loader utilizes the pre-trained Salesforce BLIP image... |
IMSDb | IMSDb is the Internet Movie Script Database. |
Iugu | Iugu is a Brazilian services and software as a service (SaaS) company... |
Joplin | Joplin is an open-source note-taking app. Capture your thoughts and s... |
JSONLoader | This notebook provides a quick overview for getting started with JSON... |
Jupyter Notebook | Jupyter Notebook (formerly IPython Notebook) is a web-based interacti... |
Kinetica | This notebook goes over how to load documents from Kinetica |
lakeFS | lakeFS provides scalable version control over the data lake, and uses... |
LangSmith | This notebook provides a quick overview for getting started with the ... |
LarkSuite (FeiShu) | LarkSuite is an enterprise collaboration platform developed by ByteDa... |
LLM Sherpa | This notebook covers how to use LLM Sherpa to load files of many type... |
Mastodon | Mastodon is a federated social media and social networking service. |
MathPixPDFLoader | Inspired by Daniel Gross's snippet here: https://gist.github.com/danielgross/... |
MediaWiki Dump | MediaWiki XML Dumps contain the content of a wiki (wiki pages with al... |
Merge Documents Loader | Merge the documents returned from a set of specified data loaders. |
mhtml | MHTML is used both for emails and for archived webpages. MH... |
Microsoft Excel | The UnstructuredExcelLoader is used to load Microsoft Excel files. Th... |
Microsoft OneDrive | Microsoft OneDrive (formerly SkyDrive) is a file hosting service oper... |
Microsoft OneNote | This notebook covers how to load documents from OneNote. |
Microsoft PowerPoint | Microsoft PowerPoint is a presentation program by Microsoft. |
Microsoft SharePoint | Microsoft SharePoint is a website-based collaboration system that use... |
Microsoft Word | Microsoft Word is a word processor developed by Microsoft. |
Near Blockchain | Overview |
Modern Treasury | Modern Treasury simplifies complex payment operations. It is a unifie... |
MongoDB | MongoDB is a NoSQL, document-oriented database that supports JSON-li... |
News URL | This covers how to load HTML news articles from a list of URLs into a... |
Notion DB 1/2 | Notion is a collaboration platform with modified Markdown support tha... |
Notion DB 2/2 | Notion is a collaboration platform with modified Markdown support tha... |
Nuclia | Nuclia automatically indexes your unstructured data from any internal... |
Obsidian | Obsidian is a powerful and extensible knowledge base |
Open Document Format (ODT) | The Open Document Format for Office Applications (ODF), also known as... |
Open City Data | Socrata provides an API for city open data. |
Oracle Autonomous Database | Oracle autonomous database is a cloud database that uses machine lear... |
Oracle AI Vector Search: Document Processing | Oracle AI Vector Search is designed for Artificial Intelligence (AI) ... |
Org-mode | An Org Mode document is a document editing, formatting, and organizing... |
Pandas DataFrame | This notebook goes over how to load data from a pandas DataFrame. |
PDFMiner | Overview |
PDFPlumber | Like PyMuPDF, the output Documents contain detailed metadata about th... |
Pebblo Safe DocumentLoader | Pebblo enables developers to safely load data and promote their Gen A... |
Polars DataFrame | This notebook goes over how to load data from a polars DataFrame. |
Psychic | This notebook covers how to load documents from Psychic. See here for... |
PubMed | PubMed® by The National Center for Biotechnology Information, Nationa... |
PyMuPDF | PyMuPDF is optimized for speed, and contains detailed metadata about ... |
PyPDFDirectoryLoader | This loader loads all PDF files from a specific directory. |
PyPDFium2Loader | This notebook provides a quick overview for getting started with PyPD... |
PyPDFLoader | This notebook provides a quick overview for getting started with PyPD... |
PySpark | This notebook goes over how to load data from a PySpark DataFrame. |
Quip | Quip is a collaborative productivity software suite for mobile and We... |
ReadTheDocs Documentation | Read the Docs is an open-sourced free software documentation hosting ... |
Recursive URL | The RecursiveUrlLoader lets you recursively scrape all child links fr... |
Reddit | Reddit is an American social news aggregation, content rating, and di... |
Roam | ROAM is a note-taking tool for networked thought, designed to create ... |
Rockset | Rockset is a real-time analytics database which enables queries on ma... |
rspace | This notebook shows how to use the RSpace document loader to import r... |
RSS Feeds | This covers how to load HTML news articles from a list of RSS feed UR... |
RST | A reStructured Text (RST) file is a file format for textual data used... |
scrapfly | ScrapFly |
ScrapingAnt | Overview |
Sitemap | Extends from the WebBaseLoader, SitemapLoader loads a sitemap from a ... |
Slack | Slack is an instant messaging program. |
Snowflake | This notebook goes over how to load documents from Snowflake |
Source Code | This notebook covers how to load source code files using a special ap... |
Spider | Spider is the fastest and most affordable crawler and scraper that re... |
Spreedly | Spreedly is a service that allows you to securely store credit cards ... |
Stripe | Stripe is an Irish-American financial services and software as a serv... |
Subtitle | The SubRip file format is described on the Matroska multimedia contai... |
SurrealDB | SurrealDB is an end-to-end cloud-native database designed for modern ... |
Telegram | Telegram Messenger is a globally accessible freemium, cross-platform,... |
Tencent COS Directory | Tencent Cloud Object Storage (COS) is a distributed |
Tencent COS File | Tencent Cloud Object Storage (COS) is a distributed |
TensorFlow Datasets | TensorFlow Datasets is a collection of datasets ready to use, with Te... |
TiDB | TiDB Cloud is a comprehensive Database-as-a-Service (DBaaS) solution... |
2Markdown | 2markdown service transforms website content into structured markdown... |
TOML | TOML is a file format for configuration files. It is intended to be e... |
Trello | Trello is a web-based project management and collaboration tool that ... |
TSV | A tab-separated values (TSV) file is a simple, text-based file format... |
Twitter | Twitter is an online social media and social networking service. |
Unstructured | This notebook covers how to use Unstructured document loader to load ... |
UnstructuredMarkdownLoader | This notebook provides a quick overview for getting started with Unst... |
UnstructuredPDFLoader | Overview |
Upstage | This notebook covers how to get started with UpstageLayoutAnalysisLoa... |
URL | This example covers how to load HTML documents from a list of URLs in... |
Vsdx | A Visio file (with extension .vsdx) is associated with Microsoft Visi... |
Weather | OpenWeatherMap is an open-source weather service provider |
WebBaseLoader | This covers how to use WebBaseLoader to load all text from HTML webpa... |
WhatsApp Chat | WhatsApp (also called WhatsApp Messenger) is a freeware, cross-platfo... |
Wikipedia | Wikipedia is a multilingual free online encyclopedia written and main... |
XML | The UnstructuredXMLLoader is used to load XML files. The loader works... |
Xorbits Pandas DataFrame | This notebook goes over how to load data from a xorbits.pandas DataFr... |
YouTube audio | Building chat or QA applications on YouTube videos is a topic of high... |
YouTube transcripts | YouTube is an online video sharing and social media platform created ... |
Yuque | Yuque is a professional cloud-based knowledge base for team collabora... |