Collecting, processing, and using data efficiently matters more than ever. The GPT Crawler simplifies how we gather and prepare data for Custom GPTs, AI assistants, and the OpenAI Playground. This blog post walks you through the updated process of using the GPT Crawler to scrape the web and generate JSON knowledge files. It also covers ways to manage large files and work around common challenges like lazy loading and infinite scroll.
Introduction to the Enhanced GPT Crawler
As artificial intelligence and machine learning continue to evolve, having access to comprehensive and current data is crucial. The GPT Crawler is a sophisticated tool that automates web data collection, transforming it into a structured format that can be directly used by Custom GPTs and AI models. This tool has become indispensable for developers, content creators, and researchers, offering a streamlined, efficient approach to data acquisition.
Step-by-Step Guide
Getting Started
To get started with the GPT Crawler, first set up your environment by downloading and installing Node.js and Visual Studio Code from their official websites.
Visit the GPT Crawler's GitHub page to clone or download the repository. Follow the provided instructions to install any necessary dependencies and configure the crawler for your needs.
Running the Crawler
Once your environment is ready, you can run the GPT Crawler. The tool is designed to be user-friendly, enabling you to specify the websites you want to scrape and define the output file format. This flexibility makes it an excellent choice for various projects.
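After a run, it can help to sanity-check the output before uploading it. The sketch below assumes each entry in the output file is a JSON object with title, url, and html fields (the fields the scripts later in this post work with). The summarize_entries helper is hypothetical and not part of the GPT Crawler.

```python
def summarize_entries(entries):
    """Return basic counts for a list of crawled entries."""
    summary = {"entries": 0, "missing_title": 0, "missing_url": 0, "html_chars": 0}
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # ignore anything that is not a JSON object
        summary["entries"] += 1
        if not (entry.get("title") or "").strip():
            summary["missing_title"] += 1
        if not (entry.get("url") or "").strip():
            summary["missing_url"] += 1
        # Total size of the page content, in characters
        summary["html_chars"] += len(entry.get("html") or "")
    return summary


# Illustrative data only; in practice, load the crawler output with json.load.
sample = [
    {"title": "Home", "url": "https://example.com/", "html": "Welcome"},
    {"title": "", "url": "https://example.com/about", "html": "About us"},
]
print(summarize_entries(sample))
```

A high missing_title count or a very small html_chars total is a hint that the crawler did not reach the real content of the pages.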
Handling Large Files
The GPT Crawler now includes improved functionality for managing large files, which can sometimes be challenging to upload directly. Here are the two main strategies to handle large datasets:
Splitting Files: Use the maxFileSize option in the config.ts file (the value is in megabytes) to automatically split the output into smaller, manageable files.
Token Limits: Use the maxTokens option in the config.ts file to cap the number of tokens in each output file, which keeps every file within the limits of the tool you upload it to.
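If you already have one oversized JSON file, a similar split can be done after the fact. The sketch below is not the crawler's own implementation. It is a minimal illustration of size-based splitting, assuming the data is a list of JSON objects; split_by_size and the byte limit are hypothetical names and values.

```python
import json


def split_by_size(entries, max_bytes):
    """Group entries into chunks whose compact JSON stays under max_bytes.

    An entry that is larger than max_bytes on its own still gets a chunk
    to itself, so no data is dropped.
    """
    chunks, current = [], []
    current_size = 2  # the surrounding "[" and "]"
    for entry in entries:
        # Serialized size of this entry plus the ", " separator
        entry_size = len(json.dumps(entry, ensure_ascii=False).encode("utf8")) + 2
        if current and current_size + entry_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 2
        current.append(entry)
        current_size += entry_size
    if current:
        chunks.append(current)
    return chunks


# Illustrative data only
entries = [{"title": f"Article {i}", "url": f"https://example.com/{i}"} for i in range(10)]
for number, chunk in enumerate(split_by_size(entries, 200), start=1):
    print(f"chunk {number}: {len(chunk)} entries")
```

Each chunk can then be written to its own file with json.dump. The size estimate applies to compact output, so an indented dump will be somewhat larger.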
Overcoming Common Challenges: Lazy Loading and Infinite Scroll
Websites that use lazy loading or infinite scroll can keep the crawler from reaching all content, because items that only load as a visitor scrolls may never appear in the page the crawler sees. To tackle this:
Pagination: If you control the site, adjust its settings to display content in numbered pages (e.g., 1, 2, 3, 4...) instead of using infinite scroll. This change lets the crawler access all content systematically, page by page.
Crawler Configuration: Update your crawler settings to accommodate these changes. Ensure the crawler navigates through each page correctly and captures all necessary data.
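Once a site exposes numbered pages, the list of page URLs to crawl can be generated up front. This is a minimal sketch, assuming the page number is passed as a query parameter named page. That pattern and the build_page_urls helper are hypothetical, so check how the target site actually builds its page URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_page_urls(listing_url, first_page, last_page, param="page"):
    """Return one URL per numbered page by setting a query parameter."""
    parts = urlsplit(listing_url)
    # Keep existing query parameters, except the page parameter itself
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != param]
    urls = []
    for page in range(first_page, last_page + 1):
        page_query = urlencode(query + [(param, str(page))])
        urls.append(
            urlunsplit((parts.scheme, parts.netloc, parts.path, page_query, parts.fragment))
        )
    return urls


# Illustrative URL only
for url in build_page_urls("https://example.com/blog?sort=new", 1, 3):
    print(url)
```

The resulting URLs can be used as starting points for the crawler, or to confirm that every page is reachable before a full run.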
Practical Solution for Splitting Files
After running the GPT Crawler or any web scraping tool, you may end up with a sizable JSON file containing all the scraped data. Splitting this file into smaller segments helps with:
Handling Large Files: Uploading and processing one massive file can be difficult.
Incremental Updates: If you re-crawl or only add newly published articles, splitting and managing those new portions alone is more efficient than redoing everything from scratch.
Ease of Editing: Working with smaller, separate files (one per article or data entry) makes it simpler to fix or update content.
Why Incremental (Partial) Crawling?
Many websites frequently add new articles or content. Rather than starting the crawler from the first page and going all the way to the last each time, an incremental or partial crawl fetches only the articles added since your last crawl. The benefits include:
Faster: Saves time and reduces unnecessary processing.
Easier Updates: Whenever 10 new articles (for example) are posted, you can have your crawler gather just those new ones, then update your JSON output.
AI Search Engine Integration: An AI-based search engine often only needs article titles and URLs, so reprocessing just the new batch of articles is enough to keep the search index up to date.
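One way to picture an incremental update is as a filter over freshly crawled entries. The sketch below assumes each entry has a url field and that you keep a set of URLs from earlier crawls. The filter_new_entries helper is hypothetical, and the trailing-slash normalization is a simplifying assumption.

```python
def filter_new_entries(crawled_entries, seen_urls):
    """Return (new_entries, updated_urls), skipping URLs seen before."""
    # Normalize trailing slashes so /post and /post/ count as the same page
    updated_urls = {url.rstrip("/") for url in seen_urls}
    new_entries = []
    for entry in crawled_entries:
        url = (entry.get("url") or "").rstrip("/")
        if not url or url in updated_urls:
            continue  # no URL, or already collected
        updated_urls.add(url)
        new_entries.append(entry)
    return new_entries, updated_urls


# Illustrative data only
seen = {"https://example.com/a/"}
crawled = [
    {"title": "A", "url": "https://example.com/a"},
    {"title": "B", "url": "https://example.com/b"},
    {"title": "B again", "url": "https://example.com/b/"},
    {"title": "No URL"},
]
new_entries, seen = filter_new_entries(crawled, seen)
print([entry["title"] for entry in new_entries])
```

Only the entries returned as new need to be split, edited, and added to your existing JSON output.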
Once you have your main JSON file (either the entire dataset or just the latest crawl), splitting it simplifies your workflow. You can edit or revise specific entries easily before potentially recombining them later if needed.
Split JSON Files Snippet
Splitting JSON Files.py
import json
import os
import re


def safe_filename(title):
    """
    Replaces any invalid characters in the title string with underscores
    to ensure the filename is safe for use in the filesystem.
    """
    invalid_chars = r'[<>:"/\\|?*\x00-\x1f\x80-\xff]'
    safe_title = re.sub(invalid_chars, '_', title).strip()
    return safe_title if safe_title else "Untitled"


def extract_filename_from_url(url):
    """
    Extracts a filename from a URL by removing the protocol, domain, and special characters.
    """
    # The trailing slash is optional so bare domains are handled too
    path = re.sub(r'https?://[^/]+/?', '', url)
    path = re.sub(r'[?#].*', '', path)
    path = path.replace('/', '_')
    return safe_filename(path)


def read_existing_titles(titles_file):
    """
    Reads the existing titles from the 'titles_file' if it exists
    and returns them as a set to avoid duplicates.
    """
    if os.path.exists(titles_file):
        with open(titles_file, 'r', encoding='utf8') as file:
            return set(file.read().splitlines())
    return set()


def create_folder_for_input_file(input_file):
    """
    Creates a folder named after the input file without its extension
    if it doesn't already exist.
    """
    folder_name = os.path.splitext(input_file)[0]
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    return folder_name


def split_file(input_file, titles_file='titles_list.txt'):
    """
    Reads the 'input_file', splits its content by JSON objects, and processes each entry.
    Saves each new entry as a separate JSON file within a folder named after the input file.
    If 'titles_file' exists, it reads and updates it with new titles to avoid duplicates.
    """
    # Read existing titles to avoid duplicates
    existing_titles = read_existing_titles(titles_file)

    # Create a folder to save output files
    folder_name = create_folder_for_input_file(input_file)

    # Read the input JSON file
    with open(input_file, 'r', encoding='utf8') as file:
        content = file.read()

    # Try to parse JSON as a list
    try:
        data_list = json.loads(content)
        if not isinstance(data_list, list):
            # A single JSON object is treated as a one-entry list
            data_list = [data_list]
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON as a list: {e}")
        # If parsing fails, attempt manual splitting and parsing of entries.
        # This is a best-effort fallback: it breaks when '},' appears inside
        # a string or a nested object.
        entries = content.strip().lstrip('[').rstrip(']').split('},')
        data_list = []
        for entry in entries:
            entry = entry.strip()
            if not entry:
                continue
            if not entry.endswith('}'):
                entry += '}'
            try:
                data_list.append(json.loads(entry))
            except json.JSONDecodeError as inner_e:
                print(f"Skipping invalid entry: {entry[:30]}... Error: {inner_e}")

    # Track new titles to update the titles list file
    new_titles = []

    # Process each JSON entry
    for data in data_list:
        if not isinstance(data, dict):
            print(f"Skipping non-dictionary entry: {data}")
            continue

        # Determine the title or fall back to extracting it from the URL
        # ('or ""' also covers a null title or url)
        title = str(data.get("title") or "").strip()
        if not title:
            title = extract_filename_from_url(str(data.get("url") or ""))
        else:
            title = safe_filename(title)

        # Skip entries that were already processed in an earlier run
        if title in existing_titles:
            continue

        # Create a unique filename for each entry
        filename = os.path.join(folder_name, f"{title}.json")
        counter = 1
        while os.path.exists(filename):
            filename = os.path.join(folder_name, f"{title}_{counter}.json")
            counter += 1

        try:
            with open(filename, 'w', encoding='utf8') as new_file:
                json.dump(data, new_file, ensure_ascii=False, indent=4)
        except OSError as file_error:
            print(f"Error writing file {filename}: {file_error}")
            continue

        # Record the title once, and only after the file was written
        if title not in new_titles:
            new_titles.append(title)

    # Append any new titles to the titles file
    if new_titles:
        try:
            with open(titles_file, 'a', encoding='utf8') as file:
                for title in new_titles:
                    file.write(f"{title}\n")
        except OSError as e:
            print(f"Error writing titles file: {e}")


if __name__ == "__main__":
    # Call the split_file function with the specified input file
    split_file('output-1.json')  # Replace with your actual file name
Output: A folder named after the input file (e.g., output-1) will be created, containing individual JSON files (one per article).
Titles List: A titles_list.txt file records processed titles to prevent duplicates.
Combining and Further Processing
After splitting files, you might need to recombine (merge) them for simplified uploading or usage in other projects. The two scripts below, Combine.py and Get a list of links.py, are useful for:
Combining multiple JSON files back into a single JSON file.
Removing unneeded fields (e.g., html) and extracting only essential information (like title and url).
Combine JSON Files Snippet
Combine.py
import os
import json


def create_output_directory(directory):
    """
    Creates an output directory if it doesn't already exist.

    Args:
        directory (str): The path to the output directory.
    """
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created directory: {directory}")
    else:
        print(f"Directory already exists: {directory}")


def combine_json_files_in_current_directory():
    """
    Combines all JSON files in the current directory into a single JSON file
    within a newly created output directory.
    """
    # Get the current directory where the script is running
    current_directory = os.path.dirname(os.path.abspath(__file__))

    # Initialize a list to hold combined data
    combined_data = []

    # Loop through all files in the current directory
    # (sorted so the combined order is the same on every run)
    for filename in sorted(os.listdir(current_directory)):
        if filename.endswith(".json"):
            file_path = os.path.join(current_directory, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                try:
                    data = json.load(f)
                    combined_data.append(data)
                except json.JSONDecodeError as e:
                    print(f"Error reading {file_path}: {e}")

    # Create an output directory in the same location as the script
    output_directory = os.path.join(current_directory, 'combined_output')
    create_output_directory(output_directory)

    # Define the output file path
    output_file_path = os.path.join(output_directory, "combined_output.json")

    # Write the combined data to the output file
    with open(output_file_path, 'w', encoding='utf-8') as f:
        json.dump(combined_data, f, ensure_ascii=False, indent=4)

    print(f"Combined {len(combined_data)} JSON files into {output_file_path}")


if __name__ == "__main__":
    # Call the function to combine JSON files in the current directory
    combine_json_files_in_current_directory()
Process Combined Output Snippet
Get a list of links.py
import json
import os


def process_combined_output():
    """
    Reads combined_output.json, removes the 'html' key from each entry,
    writes a list of titles to titles_list.txt, and writes the cleaned JSON data
    to combined_output.txt.
    """
    # Determine the directory where this script is located
    current_dir = os.path.dirname(os.path.abspath(__file__))
    # The combine script writes combined_output.json into the 'combined_output'
    # folder, so place this script in that folder (or adjust the path below).
    input_file_path = os.path.join(current_dir, "combined_output.json")

    # Check if the input file exists
    if not os.path.exists(input_file_path):
        print(f"Input file '{input_file_path}' not found.")
        return

    # Load JSON data from combined_output.json
    try:
        with open(input_file_path, "r", encoding="utf8") as infile:
            data = json.load(infile)
    except json.JSONDecodeError as e:
        print(f"Error loading JSON from '{input_file_path}': {e}")
        return

    titles = []
    cleaned_entries = []

    # Process each entry in the JSON data
    for entry in data:
        if not isinstance(entry, dict):
            print(f"Skipping non-dictionary entry: {entry}")
            continue

        # Remove the 'html' key if it exists
        entry.pop("html", None)

        # Extract the title (defaulting to "Untitled" if missing or empty)
        title = str(entry.get("title") or "").strip() or "Untitled"
        titles.append(title)
        cleaned_entries.append(entry)

    # Write the list of titles to titles_list.txt (one title per line)
    titles_file_path = os.path.join(current_dir, "titles_list.txt")
    try:
        with open(titles_file_path, "w", encoding="utf8") as titles_file:
            for title in titles:
                titles_file.write(title + "\n")
    except OSError as e:
        print(f"Error writing '{titles_file_path}': {e}")
        return

    # Write the cleaned JSON data to combined_output.txt
    output_file_path = os.path.join(current_dir, "combined_output.txt")
    try:
        with open(output_file_path, "w", encoding="utf8") as output_file:
            json.dump(cleaned_entries, output_file, ensure_ascii=False, indent=4)
    except OSError as e:
        print(f"Error writing '{output_file_path}': {e}")
        return

    print("Processing complete.")
    print(f"Titles saved to: {titles_file_path}")
    print(f"Cleaned JSON data saved to: {output_file_path}")


if __name__ == "__main__":
    process_combined_output()
In short, the workflow comes down to:
Incremental Crawling: Only fetch newly published articles rather than re-crawling everything.
Split for Simplicity: Edit or examine articles individually.
Combine for Final Use: Bring them together into a single JSON when needed.
Clean Up: Remove large or unnecessary fields to optimize for AI-based searches.
Titles & URLs: Perfect for building an AI search index or quick lookups.
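To make the quick lookup idea concrete, here is a minimal sketch of a keyword search over a titles-and-URLs list. It is a plain substring match standing in for whatever AI search engine you use, and search_titles is a hypothetical helper. It assumes each entry has title and url fields.

```python
def search_titles(index, query):
    """Return entries whose title contains every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    return [
        entry
        for entry in index
        if all(word in (entry.get("title") or "").lower() for word in words)
    ]


# Illustrative data only
index = [
    {"title": "Getting Started with GPT Crawler", "url": "https://example.com/start"},
    {"title": "Handling Large Files", "url": "https://example.com/large"},
]
for match in search_titles(index, "gpt crawler"):
    print(match["title"], "->", match["url"])
```

Because the index holds only titles and URLs, it stays small enough to load in full and search in memory.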
Conclusion
Implementing an incremental (partial) crawling approach with file splitting creates a more efficient workflow. It saves time by focusing only on new content each time you crawl. Splitting large JSON files into smaller, per-article files then lets you update or edit entries individually without disturbing the rest. If you ever need a consolidated overview, you can recombine the split files. Finally, for AI or search-engine purposes, you can strip out unnecessary details, like html content, and feed the final JSON containing only titles and URLs to your AI-based search index. This modular flow (crawl → split → edit/update → combine → final data) helps keep your project organized, flexible, and ready for ongoing changes.