🔍 arXiv Paper Search & Download Tool
A comprehensive Python tool for searching, analyzing, and downloading research papers from arXiv using their public API. Perfect for researchers, students, and anyone interested in academic papers.
✨ Features
- 🔍 Smart Search: Search arXiv papers by title, author, abstract, or any keyword
- 📥 Smart Download: Download PDFs with automatic filename renaming to paper titles
- 📊 Result Parsing: Automatically extract structured information (title, authors, abstract, ID)
- 🖥️ Interactive Mode: Command-line interface for easy searching and downloading
- ⚡ Batch Operations: Search multiple papers and download in sequence
- 📈 Academic Research: Perfect for literature reviews and research discovery
- 🔄 Auto-Rename: Downloaded files are automatically named using paper titles instead of cryptic IDs
🚀 Installation
Prerequisites
- Python 3.6 or higher
- Internet connection for API access
Install Dependencies
pip install requests
Download the Script
# Clone or download arxiv_api.py to your working directory
🎯 Quick Start
Basic Search
python arxiv_api.py "machine learning"
Search with Custom Results
python arxiv_api.py "quantum computing" -n 10
Search and Download First Result
python arxiv_api.py "deep learning" -d
Interactive Mode
python arxiv_api.py -i
Download Paper by ID (with auto-rename)
# In interactive mode:
# 📚 arxiv> download 2502.05218v1
# This will automatically rename the file to the paper's title
🎮 Usage Modes
1. Command Line Mode
Direct search queries from the command line.
Syntax:
python arxiv_api.py [query] [options]
Options:
- -n, --max_results: Maximum number of results (default: 5)
- -d, --download: Download the first result automatically
- -i, --interactive: Start interactive mode
- -h, --help: Show help message
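For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the flags listed above. It is an illustration only: `build_parser` is a hypothetical name, and the actual parser inside `arxiv_api.py` may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch, not the script's own code)."""
    parser = argparse.ArgumentParser(
        prog="arxiv_api.py",
        description="Search arXiv and optionally download the first result.",
    )
    # The query is optional so that -i can be used on its own.
    parser.add_argument("query", nargs="?", help="search query string")
    parser.add_argument("-n", "--max_results", type=int, default=5,
                        help="maximum number of results (default: 5)")
    parser.add_argument("-d", "--download", action="store_true",
                        help="download the first result automatically")
    parser.add_argument("-i", "--interactive", action="store_true",
                        help="start interactive mode")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["quantum computing", "-n", "10"])
    print(args.query, args.max_results, args.download, args.interactive)
```

`-h, --help` needs no explicit definition here because `argparse` adds it automatically.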
2. Interactive Mode
Interactive command-line interface for multiple operations.
Commands:
- search <query> [max_results]: Search for papers
- download <paper_id>: Download a specific paper (with auto-rename)
- help: Show available commands
- quit/exit: Exit the program
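The command grammar above can be sketched as a small parsing function. This is a hedged illustration: `parse_command` is a hypothetical helper, and the fallback of 5 results when `max_results` is omitted is an assumption, since the interactive default is not stated here.

```python
def parse_command(line, default_max_results=5):
    """Split one line of interactive input into (command, arguments).

    default_max_results is an assumed fallback, not a documented value.
    """
    parts = line.strip().split()
    if not parts:
        return None, {}
    command, rest = parts[0].lower(), parts[1:]
    if command == "search":
        max_results = default_max_results
        # A trailing integer is read as max_results; everything before it is the query.
        if len(rest) > 1 and rest[-1].isdigit():
            max_results = int(rest[-1])
            rest = rest[:-1]
        return "search", {"query": " ".join(rest), "max_results": max_results}
    if command == "download":
        return "download", {"paper_id": rest[0] if rest else None}
    if command in ("quit", "exit"):
        return "quit", {}
    if command == "help":
        return "help", {}
    return "unknown", {"raw": line.strip()}


if __name__ == "__main__":
    print(parse_command("search blockchain finance 5"))
    print(parse_command("download 2502.05218v1"))
```

Returning a `(command, arguments)` pair keeps parsing separate from the actions themselves, which makes the loop easier to test without network access.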
🔧 API Functions
Core Functions
search_arxiv(query, max_results=10)
Searches arXiv for papers using the public API.
Parameters:
- query (str): Search query string
- max_results (int): Maximum number of results (default: 10)
Returns:
- str: XML response from arXiv API
Example:
from arxiv_api import search_arxiv
results = search_arxiv("artificial intelligence", max_results=5)
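To make the request behind `search_arxiv` more concrete, the sketch below builds the kind of URL the arXiv API expects, using the documented `search_query`, `id_list`, `start`, and `max_results` parameters. The helper names (`build_search_url`, `build_metadata_url`) are hypothetical, and prefixing bare keywords with `all:` is an assumption about how the script treats unqualified queries.

```python
from urllib.parse import urlencode

ARXIV_API_URL = "http://export.arxiv.org/api/query"


def build_search_url(query, max_results=10, start=0):
    """Return a search URL; bare keywords are searched across all fields (assumption)."""
    # Leave field-prefixed queries such as "ti:deep learning" untouched.
    search_query = query if ":" in query else f"all:{query}"
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API_URL}?{urlencode(params)}"


def build_metadata_url(paper_id):
    """Return a URL that looks up a single paper by its arXiv ID."""
    return f"{ARXIV_API_URL}?{urlencode({'id_list': paper_id})}"


if __name__ == "__main__":
    print(build_search_url("artificial intelligence", max_results=5))
    print(build_metadata_url("2502.05218v1"))
```

`urlencode` takes care of escaping spaces and colons, so the query string can be passed in exactly as typed.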
get_paper_metadata(paper_id)
Fetches paper metadata directly from arXiv API using paper ID.
Parameters:
- paper_id (str): arXiv paper ID (e.g., "2502.05218v1")
Returns:
- dict: Paper information dictionary, or None if not found
Example:
from arxiv_api import get_paper_metadata
paper_info = get_paper_metadata("2502.05218v1")
if paper_info:
    print(f"Title: {paper_info['title']}")
    print(f"Authors: {', '.join(paper_info['authors'])}")
download_paper(paper_id, output_dir=".", paper_title=None)
Downloads a specific paper by its arXiv ID and automatically renames it to the paper title.
Parameters:
- paper_id (str): arXiv paper ID (e.g., "2502.05218v1")
- output_dir (str): Output directory (default: current directory)
- paper_title (str): Paper title for filename (optional, will be fetched automatically if not provided)
Returns:
- str: File path of downloaded PDF, or None if failed
Features:
- Auto-rename: Automatically renames downloaded files to paper titles
- Smart cleaning: Removes special characters and limits filename length
- Fallback: Uses paper ID if title is unavailable
Example:
from arxiv_api import download_paper
# Download with automatic title fetching and renaming
filepath = download_paper("2502.05218v1")
# Download with custom title
filepath = download_paper("2502.05218v1", paper_title="My Custom Title")
parse_search_results(xml_content)
Parses XML search results and extracts structured paper information.
Parameters:
- xml_content (str): XML response from arXiv API
Returns:
- list: List of dictionaries containing paper information
Paper Information Structure:
{
    'title': 'Paper Title',
    'authors': ['Author 1', 'Author 2'],
    'abstract': 'Paper abstract...',
    'paper_id': '2502.05218v1',
    'published': '2025-02-05T12:37:15Z'
}
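As an illustration of how an Atom response could be turned into this structure, here is a minimal sketch built on the standard library's `xml.etree.ElementTree`. The function name `parse_feed` and the sample feed are made up for the example; the real `parse_search_results` may handle more fields and edge cases.

```python
import xml.etree.ElementTree as ET

# arXiv responses are Atom feeds, so every tag lives in the Atom namespace.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2502.05218v1</id>
    <published>2025-02-05T12:37:15Z</published>
    <title>Paper
      Title</title>
    <summary>Paper abstract...</summary>
    <author><name>Author 1</name></author>
    <author><name>Author 2</name></author>
  </entry>
</feed>"""


def text_of(entry, tag):
    """Return the whitespace-normalized text of a child tag, or '' if missing."""
    node = entry.find(f"atom:{tag}", ATOM)
    return " ".join(node.text.split()) if node is not None and node.text else ""


def parse_feed(xml_content):
    """Convert an Atom feed string into a list of paper dictionaries."""
    root = ET.fromstring(xml_content)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        names = entry.findall("atom:author/atom:name", ATOM)
        papers.append({
            "title": text_of(entry, "title"),
            "authors": [" ".join(n.text.split()) for n in names if n.text],
            "abstract": text_of(entry, "summary"),
            # The <id> element is a URL; the paper ID is the part after /abs/.
            "paper_id": text_of(entry, "id").rsplit("/abs/", 1)[-1],
            "published": text_of(entry, "published"),
        })
    return papers


if __name__ == "__main__":
    print(parse_feed(SAMPLE_FEED))
```

Normalizing whitespace matters because titles and abstracts in the feed are often wrapped across several lines.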
search_and_download(query, max_results=5, download_first=False)
Combined function that searches for papers and optionally downloads the first result.
Parameters:
- query (str): Search query string
- max_results (int): Maximum number of results (default: 5)
- download_first (bool): Whether to download first result (default: False)
Example:
from arxiv_api import search_and_download
# Search and display results only
search_and_download("machine learning", max_results=3)
# Search and download first result (with auto-rename)
search_and_download("deep learning", max_results=5, download_first=True)
Interactive Mode Functions
interactive_mode()
Starts the interactive command-line interface.
Features:
- Command history
- Error handling
- User-friendly prompts
- Multiple search sessions
- Smart download with auto-rename
📚 Examples
Example 1: Basic Paper Search
# Search for machine learning papers
python arxiv_api.py "machine learning"
# Output:
# Searching arXiv for: 'machine learning'
# --------------------------------------------------
# Found 5 papers:
#
# 1. Title: Introduction to Machine Learning
# Authors: John Doe, Jane Smith
# Paper ID: 2103.12345
# Published: 2021-03-15T10:30:00Z
# Abstract: This paper introduces...
Example 2: Search with Custom Results
# Get 10 results for quantum computing
python arxiv_api.py "quantum computing" -n 10
Example 3: Search and Download (with auto-rename)
# Search for papers and download the first one
python arxiv_api.py "artificial intelligence" -d
# Downloaded file will be automatically renamed to the paper title
Example 4: Interactive Mode with Smart Download
python arxiv_api.py -i
# 📚 arxiv> search blockchain finance 5
# 📚 arxiv> download 2502.05218v1
# Fetching paper information for 2502.05218v1...
# Found paper: FactorGCL: A Hypergraph-Based Factor Model...
# Downloaded: .\FactorGCL_A_Hypergraph-Based_Factor_Model...pdf
# 📚 arxiv> help
# 📚 arxiv> quit
Example 5: Python Script Integration
from arxiv_api import search_and_download, download_paper, get_paper_metadata
# Search for papers on a specific topic
search_and_download("quantitative finance China", max_results=3)
# Download a specific paper with auto-rename
download_paper("2502.05218v1")
# Get paper metadata
paper_info = get_paper_metadata("2502.05218v1")
if paper_info:
    print(f"Title: {paper_info['title']}")
🔍 Advanced Usage
Smart Download Features
Automatic Filename Generation
from arxiv_api import download_paper
# The tool automatically:
# 1. Fetches paper metadata
# 2. Extracts the title
# 3. Cleans the title for filename use
# 4. Downloads and renames the file
# Example output filename:
# "FactorGCL_A_Hypergraph-Based_Factor_Model_with_Temporal_Residual_Contrastive_Learning_for_Stock_Returns_Prediction.pdf"
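The cleaning step can be sketched roughly as follows. This is not the tool's actual code: `title_to_filename` is a hypothetical helper, the character rules are assumptions, and the 100-character cap is taken from the Troubleshooting section, so the script's real output may differ.

```python
import re

MAX_FILENAME_LENGTH = 100  # cap quoted in the Troubleshooting section


def title_to_filename(title, paper_id, max_length=MAX_FILENAME_LENGTH):
    """Turn a paper title into a safe PDF filename, falling back to the paper ID."""
    cleaned = re.sub(r"[^\w\s-]", "", title or "")   # drop punctuation such as ':' or '/'
    cleaned = re.sub(r"\s+", "_", cleaned.strip())   # collapse whitespace into underscores
    stem = cleaned[:max_length].rstrip("_-")         # truncate, then trim dangling separators
    if not stem:
        # No usable title: use the ID, replacing any '/' found in old-style IDs.
        stem = paper_id.replace("/", "_")
    return f"{stem}.pdf"


if __name__ == "__main__":
    print(title_to_filename("FactorGCL: A Hypergraph-Based Factor Model", "2502.05218v1"))
    print(title_to_filename(None, "2502.05218v1"))
```

In this sketch the cap applies to the name before the `.pdf` extension is added; a different convention would only change the slicing line.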
Custom Search Queries
Field-Specific Searches
# Search by author
python arxiv_api.py "au:Yann LeCun"
# Search by title
python arxiv_api.py "ti:deep learning"
# Search by abstract
python arxiv_api.py "abs:neural networks"
# Search by category
python arxiv_api.py "cat:cs.AI"
Complex Queries
# Multiple terms
python arxiv_api.py "machine learning AND neural networks"
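# Exclude terms (the arXiv API uses ANDNOT rather than NOT)
python arxiv_api.py "deep learning ANDNOT reinforcement"
# Date range (timestamps use the form YYYYMMDDHHMM, in GMT)
python arxiv_api.py "machine learning AND submittedDate:[202301010000 TO 202312312359]"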
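When queries like these are assembled in code rather than typed by hand, a pair of small helpers can reduce quoting mistakes. The sketch below is illustrative: `field_term` and `combine` are hypothetical names, not part of `arxiv_api.py`. It relies on two documented API conventions: multi-word phrases go in double quotes, and the Boolean operators are AND, OR, and ANDNOT.

```python
VALID_OPERATORS = {"AND", "OR", "ANDNOT"}


def field_term(field, value):
    """Return one field-prefixed term, quoting multi-word values as a phrase."""
    value = value.strip()
    if " " in value:
        value = f'"{value}"'
    return f"{field}:{value}"


def combine(terms, operator="AND"):
    """Join terms with one of the Boolean operators the arXiv API accepts."""
    if operator not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f" {operator} ".join(terms)


if __name__ == "__main__":
    query = combine([field_term("ti", "deep learning"), field_term("cat", "cs.AI")])
    print(query)
```

The resulting string can then be passed to the command line or to `search_arxiv` as the query argument.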
Batch Operations
Download Multiple Papers with Auto-Rename
from arxiv_api import search_arxiv, parse_search_results, download_paper
# Search for papers
query = "quantum computing"
results = search_arxiv(query, max_results=10)
papers = parse_search_results(results)
# Download all papers (each will be automatically renamed)
for paper in papers:
    paper_id = paper.get('paper_id')
    if paper_id:
        download_paper(paper_id, output_dir="./quantum_papers")
Custom Output Formatting
from arxiv_api import search_arxiv, parse_search_results
# Custom display function
def custom_display(papers):
    for i, paper in enumerate(papers, 1):
        print(f"📄 Paper {i}: {paper['title']}")
        print(f"👥 Authors: {', '.join(paper['authors'])}")
        print(f"🆔 ID: {paper['paper_id']}")
        print(f"📅 Date: {paper['published']}")
        print(f"📝 Abstract: {paper['abstract'][:150]}...")
        print("-" * 80)
# Fetch and parse the results, then pass them to the custom display
papers = parse_search_results(search_arxiv("blockchain", max_results=3))
custom_display(papers)
🛠️ Troubleshooting
Common Issues
1. No Results Found
Problem: Search returns no papers
Solution:
- Check spelling and use broader terms
- Try different keyword combinations
- Verify internet connection
2. Download Failed
Problem: Paper download fails
Solution:
- Verify paper ID is correct
- Check if paper exists on arXiv
- Ensure write permissions in output directory
3. API Rate Limiting
Problem: Too many requests
Solution:
- Wait between requests
- Reduce batch size
- Use interactive mode for multiple searches
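One way to "wait between requests" in a batch script is a small throttling helper, sketched below. `throttled` is a hypothetical name, and the 3-second default follows the delay suggested in the arXiv API user manual; adjust it to whatever the current terms of use require.

```python
import time


def throttled(items, action, delay_seconds=3.0, sleep=time.sleep):
    """Apply action to each item, pausing between consecutive calls.

    The sleep function is injectable so the helper can be tested without waiting.
    """
    results = []
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first call
        results.append(action(item))
    return results


if __name__ == "__main__":
    # Stand-in action for the demo; in practice this could be download_paper.
    print(throttled(["2502.05218v1"], lambda paper_id: f"would download {paper_id}"))
```

In a real batch run, something like `throttled(paper_ids, download_paper)` would space out the downloads without changing the rest of the script.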
4. XML Parsing Errors
Problem: Error parsing search results
Solution:
- Check internet connection
- Verify API response format
- Update the script if needed
5. Filename Too Long
Problem: Generated filename exceeds system limits
Solution:
- The tool automatically limits filenames to 100 characters
- Special characters are automatically cleaned
- Fallback to paper ID if title is unavailable
Error Messages
Error: Failed to download paper 2502.05218v1
- Paper ID may not exist
- Network connection issue
- arXiv server problem
Error parsing XML: ...
- Malformed API response
- Network interruption
- API format change
Could not find paper information for 2502.05218v1
- Paper ID may be invalid
- arXiv API issue
- Network connectivity problem
📖 API Reference
arXiv API Endpoints
- Search API: http://export.arxiv.org/api/query
- Metadata API: http://export.arxiv.org/api/query?id_list={paper_id}
- Documentation: https://arxiv.org/help/api
- Rate Limits: Be respectful and avoid excessive requests; the arXiv API user manual suggests a 3-second delay between consecutive calls
Data Fields Available
- Title: Paper title
- Authors: List of author names
- Abstract: Paper abstract
- Paper ID: Unique arXiv identifier
- Published Date: Publication timestamp
- Categories: arXiv subject categories
Paper ID Format
- Format: YYMM.NNNNNvN
- Example: 2502.05218v1
- Download URL: https://arxiv.org/pdf/{paper_id}.pdf
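The ID format can be checked before a download is attempted. The sketch below is a hedged example with hypothetical helper names (`is_valid_arxiv_id`, `pdf_url`). It covers new-style IDs only, accepting both the 4-digit sequence numbers used through 2014 and the current 5-digit ones; old-style IDs such as `hep-th/9901001` are deliberately left out.

```python
import re

# YYMM, a dot, a 4- or 5-digit sequence number, and an optional version suffix (vN).
ARXIV_ID_PATTERN = re.compile(r"\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?")


def is_valid_arxiv_id(paper_id):
    """Return True if paper_id looks like a new-style arXiv identifier."""
    return ARXIV_ID_PATTERN.fullmatch(paper_id) is not None


def pdf_url(paper_id):
    """Build the PDF download URL for a validated paper ID."""
    if not is_valid_arxiv_id(paper_id):
        raise ValueError(f"not a new-style arXiv ID: {paper_id!r}")
    return f"https://arxiv.org/pdf/{paper_id}.pdf"


if __name__ == "__main__":
    print(is_valid_arxiv_id("2502.05218v1"))
    print(pdf_url("2502.05218v1"))
```

Validating up front turns a typo in the ID into an immediate, readable error instead of a failed HTTP request.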
Smart Download Features
- Automatic Metadata Fetching: Gets paper information before download
- Intelligent Filename Generation: Converts paper titles to valid filenames
- Character Cleaning: Removes special characters and spaces
- Length Limiting: Ensures filenames don't exceed system limits
- Fallback Naming: Uses paper ID if title is unavailable
🤝 Contributing
Adding New Features
- Fork the repository
- Create a feature branch
- Implement your changes
- Add tests and documentation
- Submit a pull request
Reporting Issues
- Check existing issues first
- Provide detailed error messages
- Include system information
- Describe steps to reproduce
📄 License
This project is open source and available under the MIT License.
🙏 Acknowledgments
- arXiv: For providing the public API
- Python Community: For excellent libraries and tools
- Researchers: For contributing to open science
📞 Support
Getting Help
- Check this documentation first
- Review the examples section
- Search existing issues
- Create a new issue for bugs
Happy Researching! 🎓📚
This tool makes academic research more accessible and efficient. Use it responsibly and respect arXiv's terms of service.