GPT Crawler Uniqcret style

In today's rapidly advancing digital landscape, the need for efficient data collection, processing, and utilization has never been more vital. The GPT Crawler is revolutionizing the way we gather, create, and utilize data for Custom GPTs, AI assistants, and the OpenAI Playground. This blog post will walk you through the updated process of using the GPT Crawler to scrape the web and generate JSON knowledge files. With these enhancements, content creation and data compilation become more accessible and efficient, offering solutions for managing large files and overcoming common challenges like lazy loading and infinite scroll.

Introduction to the Enhanced GPT Crawler

As artificial intelligence and machine learning continue to evolve, having access to comprehensive and current data is crucial. The GPT Crawler is a sophisticated tool that automates web data collection, transforming it into a structured format that can be directly used by Custom GPTs and AI models. This tool has become indispensable for developers, content creators, and researchers, offering a streamlined, efficient approach to data acquisition.

Step-by-Step Guide

Getting Started

To get started with the GPT Crawler, first set up your environment by downloading and installing the essential software: Node.js and Visual Studio Code.

Installation and Setup

Visit the GPT Crawler's GitHub page to clone or download the repository. Follow the provided instructions to install any necessary dependencies and configure the crawler for your needs.

Running the Crawler

Once your environment is ready, you can run the GPT Crawler. The tool is designed to be user-friendly, enabling you to specify the websites you want to scrape and define the output file format. This flexibility makes it an excellent choice for various projects.

Handling Large Files

The GPT Crawler now includes improved functionality for managing large files, which can sometimes be challenging to upload directly. Here are the two main strategies to handle large datasets:

  1. Splitting Files: Set the maxFileSize option (a size limit in megabytes) in the config.ts file so the crawler automatically splits its output into smaller, manageable files.
  2. Tokenization: Set the maxTokens option in the config.ts file to cap the number of tokens written to each output file, which keeps every file within a token budget.
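
If you already have one large output file, the same splitting idea can be applied after the crawl. The sketch below is not the crawler's own implementation. The names chunk_entries and max_bytes are hypothetical, and the size estimate ignores the brackets and commas of the final file, so treat the limit as approximate.

```python
import json

def chunk_entries(entries, max_bytes):
    """Group entries into consecutive chunks whose serialized size stays near max_bytes."""
    chunks, current, current_size = [], [], 0
    for entry in entries:
        # Size of this entry when written as UTF-8 JSON
        size = len(json.dumps(entry, ensure_ascii=False).encode("utf-8"))
        # Start a new chunk when adding this entry would exceed the limit
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own file with json.dump. An entry that is larger than the limit on its own still gets a chunk to itself rather than being dropped.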

Overcoming Common Challenges: Lazy Loading and Infinite Scroll

It's essential to be aware of potential challenges when using the GPT Crawler on websites that utilize lazy loading or infinite scroll. These features can prevent the crawler from accessing all content, as seen in some initial issues where content wasn't fully loaded. To tackle this:

  1. Pagination: Adjust your website settings to display content in numbered pages (e.g., 1, 2, 3, 4...) instead of using infinite scroll. This change ensures the crawler can systematically access all content, page by page, delivering comprehensive results.
  2. Crawler Configuration: Update your crawler settings to accommodate these changes. Ensure the crawler navigates through each page correctly and captures all necessary data.
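
Once a site is paginated, it can help to list the page URLs you expect the crawler to visit and compare them with what it actually captured. This is a minimal sketch that assumes the site exposes pages through a query parameter such as ?page=2. The function name paginated_urls, the parameter name, and the example URL are illustrative assumptions, so adapt them to your site's real URL scheme.

```python
def paginated_urls(base_url, last_page, param="page"):
    """Build the list of page URLs from page 1 through last_page (inclusive)."""
    # Use '&' if the base URL already carries a query string, otherwise '?'
    separator = "&" if "?" in base_url else "?"
    return [f"{base_url}{separator}{param}={n}" for n in range(1, last_page + 1)]

# Example with a hypothetical blog listing:
# paginated_urls("https://example.com/blog", 3)
```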

Practical Solution for Splitting Files

After running the GPT Crawler or any web scraping tool, you may end up with a sizable JSON file containing all the scraped data. Splitting this file into smaller segments helps with:

  1. Handling Large Files: Uploading and processing one massive file can be difficult.
  2. Incremental Updates: If you re-crawl or only add newly published articles, splitting and managing those new portions alone is more efficient than redoing everything from scratch.
  3. Ease of Editing: Working with smaller, separate files (one per article or data entry) makes it simpler to fix or update content.

Why Incremental (Partial) Crawling?

Many websites frequently add new articles or content. Rather than starting the crawler from the first page and going all the way to the last each time, an incremental or partial crawl lets you fetch only the articles added since your last crawl.

Once you have your main JSON file (either the entire dataset or just the latest crawl), splitting it simplifies your workflow. You can edit or revise specific entries easily before potentially recombining them later if needed.
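
One way to approximate an incremental update is to filter the latest crawl against the URLs you already hold. The sketch below assumes each entry is a dictionary with a url key, as in the scripts in this post. The function name new_entries_only is hypothetical.

```python
def new_entries_only(latest_entries, known_urls):
    """Return entries whose 'url' has not been seen before, preserving order."""
    seen = set(known_urls)
    fresh = []
    for entry in latest_entries:
        url = entry.get("url")
        # Entries without a URL cannot be compared, so they are left out
        if url and url not in seen:
            fresh.append(entry)
            seen.add(url)  # also drops repeats within the same crawl
    return fresh
```

The returned list holds only the newly published articles, which you can then split and edit without touching the older files.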


Split JSON Files Snippet

Splitting JSON Files.py


import json
import os
import re

def safe_filename(title):
    """
    Replaces any invalid characters in the title string with underscores
    to ensure the filename is safe for use in the filesystem.
    """
    invalid_chars = r'[<>:"/\\|?*\x00-\x1f\x80-\xff]'
    safe_title = re.sub(invalid_chars, '_', title).strip()
    return safe_title if safe_title else "Untitled"

def extract_filename_from_url(url):
    """
    Extracts a filename from a URL by removing the protocol, domain, and special characters.
    """
    # The trailing slash is optional so bare domains (no path) are stripped too
    path = re.sub(r'https?://[^/]+/?', '', url)
    path = re.sub(r'[?#].*', '', path)
    path = path.replace('/', '_')
    return safe_filename(path)

def read_existing_titles(titles_file):
    """
    Reads the existing titles from the 'titles_file' if it exists
    and returns them as a set to avoid duplicates.
    """
    if os.path.exists(titles_file):
        with open(titles_file, 'r', encoding='utf8') as file:
            return set(file.read().splitlines())
    return set()

def create_folder_for_input_file(input_file):
    """
    Creates a folder named after the input file without its extension
    if it doesn't already exist.
    """
    folder_name = os.path.splitext(input_file)[0]
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)
    return folder_name

def split_file(input_file, titles_file='titles_list.txt'):
    """
    Reads the 'input_file', splits its content by JSON objects, and processes each entry.
    Saves each entry as a separate JSON file within a folder named after the input file.
    If 'titles_file' exists, it reads and updates it with new titles to avoid duplicates.
    """
    # Read existing titles to avoid duplicates
    existing_titles = read_existing_titles(titles_file)
    
    # Create a folder to save output files
    folder_name = create_folder_for_input_file(input_file)
    
    # Read the input JSON file
    with open(input_file, 'r', encoding='utf8') as file:
        content = file.read()

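    # Try to parse the whole file as JSON
    try:
        data_list = json.loads(content)
        # A single top-level object is treated as a one-entry list
        if isinstance(data_list, dict):
            data_list = [data_list]
        elif not isinstance(data_list, list):
            print("JSON content is neither a list nor an object; nothing to split.")
            return
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON as a list: {e}")
        # Fallback: drop the enclosing brackets, then split on '},' and parse
        # each piece on its own. This is a rough heuristic for flat objects;
        # entries containing nested objects may be skipped as invalid.
        entries = content.strip().lstrip('[').rstrip(']').split('},')
        data_list = []
        for entry in entries:
            entry = entry.strip()
            if not entry:
                continue
            if not entry.endswith('}'):
                entry += '}'
            try:
                data = json.loads(entry)
                data_list.append(data)
            except json.JSONDecodeError as inner_e:
                print(f"Skipping invalid entry: {entry[:30]}... Error: {inner_e}")
                continue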

    # Track new titles to update the titles list file
    new_titles = []

    # Process each JSON entry
    for data in data_list:
        if not isinstance(data, dict):
            print(f"Skipping non-dictionary entry: {data}")
            continue
        
        # Determine the title or fallback to extracting from the URL
        # ('or ""' guards against a null title or url in the JSON)
        title = (data.get("title") or "").strip()
        if not title:
            title = extract_filename_from_url(data.get("url") or "")
        else:
            title = safe_filename(title)

        # Skip entries whose title was already processed in an earlier run
        if title in existing_titles:
            continue

        # Create a unique filename for each entry
        filename = os.path.join(folder_name, f"{title}.json")
        counter = 1
        while os.path.exists(filename):
            filename = os.path.join(folder_name, f"{title}_{counter}.json")
            counter += 1

        # Write the entry, recording the title only after a successful write
        try:
            with open(filename, 'w', encoding='utf8') as new_file:
                json.dump(data, new_file, ensure_ascii=False, indent=4)
        except OSError as file_error:
            print(f"Error writing file {filename}: {file_error}")
            continue
        new_titles.append(title)

    # Append any new titles to the titles file
    if new_titles:
        try:
            with open(titles_file, 'a', encoding='utf8') as file:
                for title in new_titles:
                    file.write(f"{title}\n")
        except OSError as e:
            print(f"Error writing titles file: {e}")

# Call the split_file function with the specified input file
if __name__ == "__main__":
    split_file('output-1.json')  # Replace with your actual file name
    

How to Use:

  1. Place your JSON file (e.g., output-1.json) in the same directory as Splitting JSON Files.py.
  2. Run the script: python "Splitting JSON Files.py" (the quotes are needed because the filename contains spaces).
  3. Output: A folder named after the input file (e.g., output-1) will be created, containing individual JSON files (one per article).
  4. Titles List: A titles_list.txt file records processed titles to prevent duplicates.
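
After a run, a quick sanity check is to compare the number of entries in the input file with the number of files in the output folder. This is a minimal sketch under the conventions described above (the output folder is named after the input file, and the input is a JSON list). The helper name count_split_files is hypothetical.

```python
import json
import os

def count_split_files(input_file):
    """Return (entries in input_file, .json files in its output folder)."""
    # The split script names the folder after the input file, minus extension
    folder = os.path.splitext(input_file)[0]
    with open(input_file, "r", encoding="utf8") as f:
        total_entries = len(json.load(f))
    split_files = [name for name in os.listdir(folder) if name.endswith(".json")]
    return total_entries, len(split_files)

# Example: entries, files = count_split_files("output-1.json")
```

The two numbers can differ legitimately, because entries whose titles are already listed in titles_list.txt are skipped on later runs.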

Combining and Further Processing

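After splitting files, you might need to recombine (merge) them for simplified uploading or usage in other projects. The two scripts below, Combine.py and Get a list of links.py, cover this: the first merges the individual JSON files into a single file, and the second removes the html field from each entry and writes out a list of titles.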

Combine JSON Files Snippet

Combine JSON Files


import os
import json

def create_output_directory(directory):
    """
    Creates an output directory if it doesn't already exist.

    Args:
        directory (str): The path to the output directory.
    """
    if not os.path.exists(directory):
        os.makedirs(directory)
        print(f"Created directory: {directory}")
    else:
        print(f"Directory already exists: {directory}")

def combine_json_files_in_current_directory():
    """
    Combines all JSON files in the current directory into a single JSON file 
    within a newly created output directory.
    """
    # Get the directory where this script is located; place the script in the
    # folder that holds the split JSON files you want to combine
    current_directory = os.path.dirname(os.path.abspath(__file__))

    # Initialize a list to hold combined data
    combined_data = []

    # Loop through all files in the current directory (sorted for a stable order)
    for filename in sorted(os.listdir(current_directory)):
        if filename.endswith(".json"):
            file_path = os.path.join(current_directory, filename)
            with open(file_path, 'r', encoding='utf-8') as f:
                try:
                    data = json.load(f)
                    combined_data.append(data)
                except json.JSONDecodeError as e:
                    print(f"Error reading {file_path}: {e}")

    # Create an output directory in the same location as the script
    output_directory = os.path.join(current_directory, 'combined_output')
    create_output_directory(output_directory)

    # Define the output file path
    output_file_path = os.path.join(output_directory, "combined_output.json")

    # Write the combined data to the output file
    with open(output_file_path, 'w', encoding='utf-8') as f:
        json.dump(combined_data, f, ensure_ascii=False, indent=4)

    print(f"Combined {len(combined_data)} JSON files into {output_file_path}")

# Call the function to combine JSON files in the current directory
combine_json_files_in_current_directory()
    

Process Combined Output (Get a list of links file) Snippet

Process Combined Output


import json
import os

def process_combined_output():
    """
    Reads combined_output.json, removes the 'html' key from each entry,
    writes a list of titles to titles_list.txt, and writes the cleaned JSON data
    to combined_output.txt.
    """
    # Determine the directory where this script is located; combined_output.json
    # must sit next to it (the combine script writes it into combined_output/)
    current_dir = os.path.dirname(os.path.abspath(__file__))
    input_file_path = os.path.join(current_dir, "combined_output.json")
    
    # Check if the input file exists
    if not os.path.exists(input_file_path):
        print(f"Input file '{input_file_path}' not found.")
        return

    # Load JSON data from combined_output.json
    try:
        with open(input_file_path, "r", encoding="utf8") as infile:
            data = json.load(infile)
    except json.JSONDecodeError as e:
        print(f"Error loading JSON from '{input_file_path}': {e}")
        return

    titles = []
    cleaned_entries = []

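    # Process each entry in the JSON data
    for entry in data:
        # Skip anything that is not a JSON object (e.g. a nested list)
        if not isinstance(entry, dict):
            print(f"Skipping non-dictionary entry: {str(entry)[:30]}...")
            continue

        # Remove the 'html' key if it exists
        entry.pop("html", None)

        # Extract the title (defaulting to "Untitled" if missing, null, or empty)
        title = (entry.get("title") or "").strip() or "Untitled"
        titles.append(title)
        cleaned_entries.append(entry)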

    # Write the list of titles to titles_list.txt (one title per line)
    titles_file_path = os.path.join(current_dir, "titles_list.txt")
    try:
        with open(titles_file_path, "w", encoding="utf8") as titles_file:
            for title in titles:
                titles_file.write(title + "\n")
    except OSError as e:
        print(f"Error writing '{titles_file_path}': {e}")
        return

    # Write the cleaned JSON data to combined_output.txt
    output_file_path = os.path.join(current_dir, "combined_output.txt")
    try:
        with open(output_file_path, "w", encoding="utf8") as output_file:
            json.dump(cleaned_entries, output_file, ensure_ascii=False, indent=4)
    except OSError as e:
        print(f"Error writing '{output_file_path}': {e}")
        return

    print("Processing complete.")
    print(f"Titles saved to: {titles_file_path}")
    print(f"Cleaned JSON data saved to: {output_file_path}")

if __name__ == "__main__":
    process_combined_output()
    


Key Takeaways

  1. Incremental Crawling: Only fetch newly published articles rather than re-crawling everything.
  2. Split for Simplicity: Edit or examine articles individually.
  3. Combine for Final Use: Bring them together into a single JSON when needed.
  4. Clean Up: Remove large or unnecessary fields to optimize for AI-based searches.
  5. Titles & URLs: Perfect for building an AI search index or quick lookups.
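
As an illustration of the last point, the cleaned entries can be turned into a simple title-to-URL lookup. This is a minimal sketch that assumes each entry carries title and url keys, as in the scripts above. The function name build_title_index is hypothetical.

```python
def build_title_index(entries):
    """Map each title to the list of URLs that carry it."""
    index = {}
    for entry in entries:
        # Ignore anything that is not a JSON object
        if not isinstance(entry, dict):
            continue
        title = (entry.get("title") or "").strip() or "Untitled"
        url = entry.get("url") or ""
        # Several pages can share a title, so URLs are collected in a list
        index.setdefault(title, []).append(url)
    return index
```

Loading the cleaned output with json.load and passing the resulting list to this function gives a dictionary suitable for quick lookups by title.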

Conclusion

Implementing an incremental (partial) crawling approach with file splitting creates a more efficient workflow. It saves time by focusing only on new content each time you crawl. Splitting large JSON files into smaller, per-article files then lets you update or edit entries individually without disturbing the rest. If you need a consolidated overview, you can recombine the split files. Finally, for AI or search-engine purposes, you can strip out unnecessary fields, such as html, and feed the cleaned JSON and the list of titles to your AI-based search index. This modular flow (crawl → split → edit/update → combine → final data) helps keep your project organized, flexible, and ready for ongoing changes.