Document loaders

DocumentLoaders load data into the standard LangChain Document format.

Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method. An example use case is as follows:

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    ...  # <-- Integration specific parameters here
)
data = loader.load()
API Reference: CSVLoader
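
To make the shared `.load` pattern concrete without installing any integration, here is a minimal standard-library sketch. `SimpleDocument` and `SimpleCSVLoader` are hypothetical stand-ins, not LangChain classes. They only mirror the general shape described above: a loader takes its own parameters in the constructor and returns a list of documents, each with text content and metadata, from `.load()`. The one-document-per-row layout and the `source`/`row` metadata keys are assumptions chosen for illustration.

```python
import csv
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleCSVLoader:
    """Hypothetical loader: one document per CSV row, exposed through .load()."""

    def __init__(self, file_path: str):
        # Loader-specific parameters live in the constructor.
        self.file_path = file_path

    def load(self) -> list[SimpleDocument]:
        docs = []
        with open(self.file_path, newline="", encoding="utf-8") as f:
            for i, row in enumerate(csv.DictReader(f)):
                # Render each row as "column: value" lines.
                content = "\n".join(f"{k}: {v}" for k, v in row.items())
                docs.append(
                    SimpleDocument(content, {"source": self.file_path, "row": i})
                )
        return docs


def demo() -> list[SimpleDocument]:
    """Write a small CSV to a temporary directory and load it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "people.csv"
        path.write_text("name,role\nAda,engineer\nLin,editor\n", encoding="utf-8")
        return SimpleCSVLoader(str(path)).load()


if __name__ == "__main__":
    for doc in demo():
        print(doc.page_content, doc.metadata)
```

Any other loader written in this style would differ only in its constructor arguments, so calling code can treat every loader the same way.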

Webpages

The below document loaders allow you to load webpages.

| Document Loader | Description | Package/API |
|---|---|---|
| Web | Uses urllib and BeautifulSoup to load and parse HTML web pages | Package |
| RecursiveURL | Recursively scrapes all child links from a root URL | Package |
| Sitemap | Scrapes all pages on a given sitemap | Package |
| Firecrawl | API service that can be deployed locally, hosted version has free credits. | API |
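
The idea behind recursively scraping child links from a root URL can be sketched with the standard library alone. This is not the LangChain implementation: `PAGES`, `LinkParser`, `crawl`, and the `max_depth` parameter are hypothetical names, and the "site" is an in-memory dictionary so the sketch runs without network access. It follows only links that stay under the root URL and stops at a fixed depth.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch needs no network access.
PAGES = {
    "https://example.com/docs/": (
        '<a href="a.html">A</a> <a href="b/">B</a> '
        '<a href="https://other.example/x">X</a>'
    ),
    "https://example.com/docs/a.html": '<a href="./">home</a>',
    "https://example.com/docs/b/": '<a href="c.html">C</a>',
    "https://example.com/docs/b/c.html": "<p>leaf</p>",
}


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(root, max_depth=2):
    """Return the URLs reachable from root, in depth-first visiting order."""
    seen = []

    def visit(url, depth):
        # Skip repeats, unknown pages, and pages beyond the depth limit.
        if url in seen or url not in PAGES or depth > max_depth:
            return
        seen.append(url)
        parser = LinkParser()
        parser.feed(PAGES[url])
        for href in parser.links:
            child = urljoin(url, href)  # resolve relative links
            if child.startswith(root):  # stay under the root URL
                visit(child, depth + 1)

    visit(root, 0)
    return seen


if __name__ == "__main__":
    for url in crawl("https://example.com/docs/"):
        print(url)
```

A real crawler would fetch each page over HTTP and handle errors, redirects, and rate limits. The visited list and the root-prefix check are what keep the recursion bounded here.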

PDFs

The below document loaders allow you to load PDF documents.

| Document Loader | Description | Package/API |
|---|---|---|
| PyPDF | Uses `pypdf` to load and parse PDFs | Package |
| Unstructured | Uses Unstructured's open source library to load PDFs | Package |
| Amazon Textract | Uses AWS API to load PDFs | API |
| MathPix | Uses MathPix to load PDFs | Package |
| PDFPlumber | Load PDF files using PDFPlumber | Package |
| PyPDFDirectory | Load a directory with PDF files | Package |
| PyPDFium2 | Load PDF files using PyPDFium2 | Package |
| UnstructuredPDFLoader | Load PDF files using Unstructured | Package |
| PyMuPDF | Load PDF files using PyMuPDF | Package |
| PDFMiner | Load PDF files using PDFMiner | Package |

Common File Types

The below document loaders allow you to load data from common data formats.

| Document Loader | Data Type |
|---|---|
| CSVLoader | CSV files |
| DirectoryLoader | All files in a given directory |
| Unstructured | All file types |
| JSONLoader | JSON files |
| UnstructuredMarkdownLoader | Markdown files |
| BSHTMLLoader | HTML files |

All document loaders

| Name | Description |
|---|---|
| acreom | acreom is a dev-first knowledge base with tasks running on local mark... |
| AirbyteLoader | Airbyte is a data integration platform for ELT pipelines from APIs, d... |
| Airtable | Get your API key here. |
| Alibaba Cloud MaxCompute | Alibaba Cloud MaxCompute (previously known as ODPS) is a general purp... |
| Amazon Textract | Amazon Textract is a machine learning (ML) service that automatically... |
| Apify Dataset | Apify Dataset is a scalable append-only storage with sequential acces... |
| ArcGIS | This notebook demonstrates the use of the langchain_community.document... |
| ArxivLoader | arXiv is an open-access archive for 2 million scholarly articles in t... |
| AssemblyAI Audio Transcripts | The AssemblyAIAudioTranscriptLoader allows you to transcribe audio files ... |
| AstraDB | DataStax Astra DB is a serverless vector-capable database built on Ca... |
| Async Chromium | Chromium is one of the browsers supported by Playwright, a library us... |
| AsyncHtml | AsyncHtmlLoader loads raw HTML from a list of URLs concurrently. |
| Athena | Amazon Athena is a serverless, interactive analytics service built... |
| AWS S3 Directory | Amazon Simple Storage Service (Amazon S3) is an object storage service |
| AWS S3 File | Amazon Simple Storage Service (Amazon S3) is an object storage servic... |
| AZLyrics | AZLyrics is a large, legal, every day growing collection of lyrics. |
| Azure AI Data | Azure AI Studio provides the capability to upload data assets to clou... |
| Azure Blob Storage Container | Azure Blob Storage is Microsoft's object storage solution for the clo... |
| Azure Blob Storage File | Azure Files offers fully managed file shares in the cloud that are ac... |
| Azure AI Document Intelligence | Azure AI Document Intelligence (formerly known as Azure Form Recogniz... |
| BibTeX | BibTeX is a file format and reference management system commonly used... |
| BiliBili | Bilibili is one of the most beloved long-form video sites in China. |
| Blackboard | Blackboard Learn (previously the Blackboard Learning Management Syste... |
| Blockchain | Overview |
| Brave Search | Brave Search is a search engine developed by Brave Software. |
| Browserbase | Browserbase is a developer platform to reliably run, manage, and moni... |
| Browserless | Browserless is a service that allows you to run headless Chrome insta... |
| BSHTMLLoader | This notebook provides a quick overview for getting started with Beau... |
| Cassandra | Cassandra is a NoSQL, row-oriented, highly scalable and highly availa... |
| ChatGPT Data | ChatGPT is an artificial intelligence (AI) chatbot developed by OpenA... |
| College Confidential | College Confidential gives information on 3,800+ colleges and univers... |
| Concurrent Loader | Works just like the GenericLoader but concurrently for those who choo... |
| Confluence | Confluence is a wiki collaboration platform that saves and organizes ... |
| CoNLL-U | CoNLL-U is a revised version of the CoNLL-X format. Annotations are enc... |
| Copy Paste | This notebook covers how to load a document object from something you... |
| Couchbase | Couchbase is an award-winning distributed NoSQL cloud database that d... |
| CSV | A comma-separated values (CSV) file is a delimited text file that use... |
| Cube Semantic Layer | This notebook demonstrates the process of retrieving Cube's data mode... |
| Datadog Logs | Datadog is a monitoring and analytics platform for cloud-scale applic... |
| Dedoc | This sample demonstrates the use of Dedoc in combination with LangCha... |
| Diffbot | Diffbot is a suite of ML-based products that make it easy to structur... |
| Discord | Discord is a VoIP and instant messaging social platform. Users have t... |
| Docugami | This notebook covers how to load documents from Docugami. It provides... |
| Docusaurus | Docusaurus is a static-site generator which provides out-of-the-box d... |
| Dropbox | Dropbox is a file hosting service that brings everything-traditional ... |
| DuckDB | DuckDB is an in-process SQL OLAP database management system. |
| Email | This notebook shows how to load email (.eml) or Microsoft Outlook (.m... |
| EPub | EPUB is an e-book file format that uses the ".epub" file extension. T... |
| Etherscan | Etherscan is the leading blockchain explorer, search, API and analyt... |
| EverNote | EverNote is intended for archiving and creating notes in which photos... |
| example_data | |
| Facebook Chat | Messenger is an American proprietary instant messaging app and platf... |
| Fauna | Fauna is a Document Database. |
| Figma | Figma is a collaborative web application for interface design. |
| FireCrawl | FireCrawl crawls and converts any website into LLM-ready data. It craw... |
| Geopandas | Geopandas is an open-source project to make working with geospatial d... |
| Git | Git is a distributed version control system that tracks changes in an... |
| GitBook | GitBook is a modern documentation platform where teams can document e... |
| GitHub | This notebook shows how you can load issues and pull requests (PRs) ... |
| Glue Catalog | The AWS Glue Data Catalog is a centralized metadata repository that a... |
| Google AlloyDB for PostgreSQL | AlloyDB is a fully managed relational database service that offers hi... |
| Google BigQuery | Google BigQuery is a serverless and cost-effective enterprise data wa... |
| Google Bigtable | Bigtable is a key-value and wide-column store, ideal for fast access ... |
| Google Cloud SQL for SQL server | Cloud SQL is a fully managed relational database service that offers ... |
| Google Cloud SQL for MySQL | Cloud SQL is a fully managed relational database service that offers ... |
| Google Cloud SQL for PostgreSQL | Cloud SQL for PostgreSQL is a fully-managed database service that hel... |
| Google Cloud Storage Directory | Google Cloud Storage is a managed service for storing unstructured da... |
| Google Cloud Storage File | Google Cloud Storage is a managed service for storing unstructured da... |
| Google Firestore in Datastore Mode | Firestore in Datastore Mode is a NoSQL document database built for au... |
| Google Drive | Google Drive is a file storage and synchronization service developed ... |
| Google El Carro for Oracle Workloads | Google El Carro Oracle Operator |
| Google Firestore (Native Mode) | Firestore is a serverless document-oriented database that scales to m... |
| Google Memorystore for Redis | Google Memorystore for Redis is a fully-managed service that is power... |
| Google Spanner | Spanner is a highly scalable database that combines unlimited scalabi... |
| Google Speech-to-Text Audio Transcripts | The GoogleSpeechToTextLoader allows you to transcribe audio files with th... |
| Grobid | GROBID is a machine learning library for extracting, parsing, and re-... |
| Gutenberg | Project Gutenberg is an online library of free eBooks. |
| Hacker News | Hacker News (sometimes abbreviated as HN) is a social news website fo... |
| Huawei OBS Directory | The following code demonstrates how to load objects from the Huawei O... |
| Huawei OBS File | The following code demonstrates how to load an object from the Huawei... |
| HuggingFace dataset | The Hugging Face Hub is home to over 5,000 datasets in more than 100 ... |
| iFixit | iFixit is the largest, open repair community on the web. The site con... |
| Images | This covers how to load images into a document format that we can use... |
| Image captions | By default, the loader utilizes the pre-trained Salesforce BLIP image... |
| IMSDb | IMSDb is the Internet Movie Script Database. |
| Iugu | Iugu is a Brazilian services and software as a service (SaaS) company... |
| Joplin | Joplin is an open-source note-taking app. Capture your thoughts and s... |
| JSONLoader | This notebook provides a quick overview for getting started with JSON... |
| Jupyter Notebook | Jupyter Notebook (formerly IPython Notebook) is a web-based interacti... |
| Kinetica | This notebook goes over how to load documents from Kinetica |
| lakeFS | lakeFS provides scalable version control over the data lake, and uses... |
| LangSmith | This notebook provides a quick overview for getting started with the ... |
| LarkSuite (FeiShu) | LarkSuite is an enterprise collaboration platform developed by ByteDa... |
| LLM Sherpa | This notebook covers how to use LLM Sherpa to load files of many type... |
| Mastodon | Mastodon is a federated social media and social networking service. |
| MathPixPDFLoader | Inspired by Daniel Gross's snippet here: https://gist.github.com/danielgross/... |
| MediaWiki Dump | MediaWiki XML Dumps contain the content of a wiki (wiki pages with al... |
| Merge Documents Loader | Merge the documents returned from a set of specified data loaders. |
| mhtml | MHTML is used both for emails and for archived webpages. MH... |
| Microsoft Excel | The UnstructuredExcelLoader is used to load Microsoft Excel files. Th... |
| Microsoft OneDrive | Microsoft OneDrive (formerly SkyDrive) is a file hosting service oper... |
| Microsoft OneNote | This notebook covers how to load documents from OneNote. |
| Microsoft PowerPoint | Microsoft PowerPoint is a presentation program by Microsoft. |
| Microsoft SharePoint | Microsoft SharePoint is a website-based collaboration system that use... |
| Microsoft Word | Microsoft Word is a word processor developed by Microsoft. |
| Near Blockchain | Overview |
| Modern Treasury | Modern Treasury simplifies complex payment operations. It is a unifie... |
| MongoDB | MongoDB is a NoSQL, document-oriented database that supports JSON-li... |
| News URL | This covers how to load HTML news articles from a list of URLs into a... |
| Notion DB 1/2 | Notion is a collaboration platform with modified Markdown support tha... |
| Notion DB 2/2 | Notion is a collaboration platform with modified Markdown support tha... |
| Nuclia | Nuclia automatically indexes your unstructured data from any internal... |
| Obsidian | Obsidian is a powerful and extensible knowledge base |
| Open Document Format (ODT) | The Open Document Format for Office Applications (ODF), also known as... |
| Open City Data | Socrata provides an API for city open data. |
| Oracle Autonomous Database | Oracle autonomous database is a cloud database that uses machine lear... |
| Oracle AI Vector Search: Document Processing | Oracle AI Vector Search is designed for Artificial Intelligence (AI) ... |
| Org-mode | An Org Mode document is a document editing, formatting, and organizing... |
| Pandas DataFrame | This notebook goes over how to load data from a pandas DataFrame. |
| PDFMiner | Overview |
| PDFPlumber | Like PyMuPDF, the output Documents contain detailed metadata about th... |
| Pebblo Safe DocumentLoader | Pebblo enables developers to safely load data and promote their Gen A... |
| Polars DataFrame | This notebook goes over how to load data from a polars DataFrame. |
| Psychic | This notebook covers how to load documents from Psychic. See here for... |
| PubMed | PubMed® by The National Center for Biotechnology Information, Nationa... |
| PyMuPDF | PyMuPDF is optimized for speed, and contains detailed metadata about ... |
| PyPDFDirectoryLoader | This loader loads all PDF files from a specific directory. |
| PyPDFium2Loader | This notebook provides a quick overview for getting started with PyPD... |
| PyPDFLoader | This notebook provides a quick overview for getting started with PyPD... |
| PySpark | This notebook goes over how to load data from a PySpark DataFrame. |
| Quip | Quip is a collaborative productivity software suite for mobile and We... |
| ReadTheDocs Documentation | Read the Docs is an open-sourced free software documentation hosting ... |
| Recursive URL | The RecursiveUrlLoader lets you recursively scrape all child links fr... |
| Reddit | Reddit is an American social news aggregation, content rating, and di... |
| Roam | ROAM is a note-taking tool for networked thought, designed to create ... |
| Rockset | Rockset is a real-time analytics database which enables queries on ma... |
| rspace | This notebook shows how to use the RSpace document loader to import r... |
| RSS Feeds | This covers how to load HTML news articles from a list of RSS feed UR... |
| RST | A reStructured Text (RST) file is a file format for textual data used... |
| scrapfly | ScrapFly |
| ScrapingAnt | Overview |
| Sitemap | Extends from the WebBaseLoader, SitemapLoader loads a sitemap from a ... |
| Slack | Slack is an instant messaging program. |
| Snowflake | This notebook goes over how to load documents from Snowflake |
| Source Code | This notebook covers how to load source code files using a special ap... |
| Spider | Spider is the fastest and most affordable crawler and scraper that re... |
| Spreedly | Spreedly is a service that allows you to securely store credit cards ... |
| Stripe | Stripe is an Irish-American financial services and software as a serv... |
| Subtitle | The SubRip file format is described on the Matroska multimedia contai... |
| SurrealDB | SurrealDB is an end-to-end cloud-native database designed for modern ... |
| Telegram | Telegram Messenger is a globally accessible freemium, cross-platform,... |
| Tencent COS Directory | Tencent Cloud Object Storage (COS) is a distributed... |
| Tencent COS File | Tencent Cloud Object Storage (COS) is a distributed... |
| TensorFlow Datasets | TensorFlow Datasets is a collection of datasets ready to use, with Te... |
| TiDB | TiDB Cloud is a comprehensive Database-as-a-Service (DBaaS) solution... |
| 2Markdown | 2markdown service transforms website content into structured markdown... |
| TOML | TOML is a file format for configuration files. It is intended to be e... |
| Trello | Trello is a web-based project management and collaboration tool that ... |
| TSV | A tab-separated values (TSV) file is a simple, text-based file format... |
| Twitter | Twitter is an online social media and social networking service. |
| Unstructured | This notebook covers how to use Unstructured document loader to load ... |
| UnstructuredMarkdownLoader | This notebook provides a quick overview for getting started with Unst... |
| UnstructuredPDFLoader | Overview |
| Upstage | This notebook covers how to get started with UpstageLayoutAnalysisLoa... |
| URL | This example covers how to load HTML documents from a list of URLs in... |
| Vsdx | A Visio file (with extension .vsdx) is associated with Microsoft Visi... |
| Weather | OpenWeatherMap is an open-source weather service provider |
| WebBaseLoader | This covers how to use WebBaseLoader to load all text from HTML webpa... |
| WhatsApp Chat | WhatsApp (also called WhatsApp Messenger) is a freeware, cross-platfo... |
| Wikipedia | Wikipedia is a multilingual free online encyclopedia written and main... |
| XML | The UnstructuredXMLLoader is used to load XML files. The loader works... |
| Xorbits Pandas DataFrame | This notebook goes over how to load data from a xorbits.pandas DataFr... |
| YouTube audio | Building chat or QA applications on YouTube videos is a topic of high... |
| YouTube transcripts | YouTube is an online video sharing and social media platform created ... |
| Yuque | Yuque is a professional cloud-based knowledge base for team collabora... |
