How to Use IPFLY Proxies in Vertex AI for Scalable Data Collection Pipelines


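In artificial intelligence, reliable results depend on giving machine learning models access to accurate, real-time data. This article explains how to integrate IPFLY's proxy services into Google Vertex AI Pipelines to build a robust data collection pipeline. By drawing on IPFLY's large IP pool, which includes static residential, dynamic residential, and datacenter proxies, organizations can retrieve web data securely and keep large language models (LLMs) from relying on outdated or fabricated information. The tutorial is designed to take about 25 minutes. It covers Vertex AI Pipelines, the need for external data integration through retrieval-augmented generation (RAG), and the advantages IPFLY proxies offer over traditional approaches: strong anonymity, coverage of more than 190 countries, and unlimited concurrency for enterprise-scale operations.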

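The main objectives are:

Understand what Vertex AI Pipelines does.

Learn how to incorporate IPFLY proxies for data retrieval.

Build a data collection pipeline tailored to fact-checking or market analysis.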

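To get started, create an IPFLY account so you can use these proxy features, which are intended to support high success rates and compliance in cross-border data collection.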

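What Is Vertex AI Pipelines?

Vertex AI Pipelines is a managed service within Google Cloud that automates, orchestrates, and reproduces end-to-end machine learning workflows. It breaks complex processes into modular components that are traceable and version-controlled, and it runs in a serverless environment to simplify machine learning operations (MLOps). This architecture scales efficiently, particularly when combined with external data sources such as IPFLY's proxy network, which offers more than 90 million IPs for high-speed data flows in scenarios such as search engine optimization, market research, and ad verification.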

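Building a Data Collection Pipeline: Why and How

Large language models have a static knowledge base, so without access to current web information they are prone to inaccuracies. Retrieval-augmented generation (RAG) mitigates this by incorporating up-to-date external data before the response is generated, which improves accuracy in applications such as fact-checking.

Some built-in tools for data grounding exist, but they often lack scalability, customization, and fine-grained control over sources. IPFLY's proxies offer an alternative that supports programmatic web access with high anonymity and stability. IPFLY selects IPs rigorously from real end-user devices, providing secure, non-reused connections that fit business needs such as cross-border e-commerce or social media marketing.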

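The data collection pipeline outlined here has three stages:

1. Query extraction: an LLM identifies the key claims and formulates searchable queries.

2. Web data retrieval: IPFLY proxies are used to fetch real-time content securely.

3. Data validation: an LLM processes the retrieved data to produce a validated output.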
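
To make the data flow between the three stages concrete, here is a minimal sketch in plain Python. It is not the Vertex AI implementation built later in this tutorial: run_stages and the three fake_* functions are hypothetical stand-ins that involve no LLM and no network access, and they only show how each stage's output feeds the next.

```python
from typing import Callable, Dict, List


def run_stages(
    text: str,
    extract: Callable[[str], List[str]],
    fetch: Callable[[str], str],
    validate: Callable[[str, List[Dict[str, str]]], str],
) -> str:
    """Chain the three stages: extract queries, fetch content, validate."""
    queries = extract(text)  # stage 1: query extraction
    evidence = [{"query": q, "content": fetch(q)} for q in queries]  # stage 2: retrieval
    return validate(text, evidence)  # stage 3: validation


# Hypothetical stand-in stages; a real pipeline would call an LLM and a proxy here.
def fake_extract(text: str) -> List[str]:
    return [s.strip().lower() for s in text.split(".") if s.strip()]


def fake_fetch(query: str) -> str:
    return f"stub content for: {query}"


def fake_validate(text: str, evidence: List[Dict[str, str]]) -> str:
    return f"checked {len(evidence)} claim(s)"


report = run_stages(
    "Tokyo is in Japan. Japan uses the yen.",
    fake_extract,
    fake_fetch,
    fake_validate,
)
print(report)
```

Passing the stages in as callables keeps the orchestration separate from the model and network code, which is roughly the same separation the pipeline components provide later on.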

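The same approach extends to other applications, including trend analysis, content summarization, and automated testing, supported by IPFLY's 99.9% uptime and millisecond-level response times.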

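How to Integrate IPFLY Proxies into a Vertex AI Pipeline

Prerequisites

An active Google Cloud Console account.

An IPFLY account with proxy credentials configured (administrative access is recommended for setup).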

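Step #1: Create and Configure a New Google Cloud Project

Create a project named "IPFLY Data Collection Pipeline" with an ID such as ipfly-pipeline. Record the project number and project ID. Enable the required APIs, including:

Vertex AI API.

Notebooks API.

Optionally, enable supplementary APIs such as Cloud Resource Manager or Cloud Storage for additional functionality.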

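Step #2: Set Up a Cloud Storage Bucket

Create a uniquely named bucket, for example ipfly-pipeline-artifacts, and choose a multi-region location such as "US" for accessibility. Grant the Storage Admin role to the project's Compute Engine default service account ([project-number]-compute@developer.gserviceaccount.com) to avoid authorization problems during execution. The bucket URI will look like gs://ipfly-pipeline-artifacts.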
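
Bucket names and service account addresses are easy to mistype, so a small sanity check before pasting them into the notebook can help. The sketch below is an illustration only: bucket_uri and compute_default_service_account are hypothetical helpers, the name pattern is a simplified version of the Cloud Storage naming rules (it ignores dotted names and reserved prefixes), and the address format assumes the usual PROJECT_NUMBER-compute pattern of the Compute Engine default service account.

```python
import re

# Simplified rule for bucket names without dots: 3-63 characters, lowercase
# letters, digits, hyphens, and underscores, starting and ending with a
# letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def bucket_uri(name: str) -> str:
    """Return the gs:// URI for a bucket name, or raise if the name looks invalid."""
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return f"gs://{name}"


def compute_default_service_account(project_number: str) -> str:
    """Build the address of the Compute Engine default service account."""
    if not project_number.isdigit():
        raise ValueError("project number must contain digits only")
    return f"{project_number}-compute@developer.gserviceaccount.com"


print(bucket_uri("ipfly-pipeline-artifacts"))
print(compute_default_service_account("123456789012"))
```

The project number used here is a placeholder; use the one recorded in Step #1.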

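Step #3: Configure IAM Permissions

Go to the IAM & Admin section. Grant the following roles to the Compute Engine default service account:

Service Account User.

Vertex AI User.

This configuration authorizes the creation and execution of pipelines.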

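Step #4: Set Up Vertex AI Workbench

Navigate to Vertex AI Workbench in the Google Cloud Console. Create a new instance with standard specifications (for example, the n1-standard-4 machine type with JupyterLab 3). Open JupyterLab and start a Python 3 notebook. Development takes place entirely in the cloud, so no local setup is required.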

Step #5: Install and Initialize the Required Python Libraries

Install the required packages:

!pip install kfp google-cloud-aiplatform google-generativeai requests --quiet --upgrade

Initialize the Vertex AI SDK:

import kfp
from kfp.dsl import component, pipeline, Input, Output, Artifact
from kfp import compiler
from google.cloud import aiplatform
from typing import List

PROJECT_ID = "<YOUR_GC_PROJECT_ID>"
REGION = "<YOUR_REGION>"  # e.g., "us-central1"
BUCKET_URI = "<YOUR_BUCKET_URI>"  # e.g., "gs://ipfly-pipeline-artifacts"

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

Step #6: Define the Query Extraction Component

Use Gemini to derive searchable queries from the input text:

@component(
    base_image="python:3.10",
    packages_to_install=["google-generativeai"],
)
def extract_queries(
    input_text: str,
    project: str,
    location: str,
) -> List[str]:
    import google.generativeai as genai
    import json

    genai.configure(api_key="<YOUR_GEMINI_API_KEY>")  # Use secure key management in production
    model = genai.GenerativeModel('gemini-1.5-flash')  # Updated model reference

    prompt = f"""
    As a data analyst, review the text and extract a list of precise search queries for verifying key claims.
    Output only a JSON array of strings, with no code fences.

    Example:
    Input: "The Great Wall of China is visible from space and was built in the 7th century BC."
    Output: ["is the great wall of china visible from space", "when was the great wall of china built"]

    Text:
    "{input_text}"
    """
    response = model.generate_content(prompt)
    # json.loads expects valid JSON, which is why the prompt asks for a JSON array.
    query_list: List[str] = json.loads(response.text.strip())
    return query_list

This uses a fast model for efficiency.
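
The component parses the model's reply with json.loads, which fails if the model wraps the list in a code fence or uses single-quoted strings. The following is a sketch of a more forgiving parser that could be placed inside the component; parse_query_list is a hypothetical helper, not part of any SDK, and it assumes the reply is either a JSON array or a Python list literal.

```python
import ast
import json
from typing import List

FENCE = "`" * 3  # code-fence marker, built indirectly to keep this example readable


def parse_query_list(raw: str) -> List[str]:
    """Turn model output into a list of strings, tolerating code fences."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]  # drop the opening fence line and its language tag
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        text = "\n".join(lines).strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        try:
            # Single-quoted Python lists are not valid JSON; literal_eval
            # evaluates literals only and does not execute code.
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError) as exc:
            raise ValueError(f"could not parse model output: {text[:80]!r}") from exc
    if not isinstance(parsed, list) or not all(isinstance(q, str) for q in parsed):
        raise ValueError("expected a list of strings")
    return parsed


print(parse_query_list('["capital of japan", "currency of japan"]'))
```

Raising ValueError on anything that is not a list of strings lets the pipeline step fail early instead of sending malformed queries to the retrieval stage.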

Step #7: Create the IPFLY Proxy-Driven Web Data Retrieval Component

Use IPFLY proxies to fetch the content:

@component(
    base_image="python:3.10",
    packages_to_install=["requests"],
)
def fetch_web_data(
    queries: List[str],
    ipfly_proxy_host: str,
    ipfly_proxy_port: str,
    ipfly_username: str,
    ipfly_password: str,
    output_file: Output[Artifact],
):
    import requests
    import json

    # Route both HTTP and HTTPS traffic through the same authenticated proxy.
    proxy_url = f'http://{ipfly_username}:{ipfly_password}@{ipfly_proxy_host}:{ipfly_proxy_port}'
    proxies = {'http': proxy_url, 'https': proxy_url}

    results = []
    for query in queries:
        try:
            # Let requests URL-encode the query instead of building the URL by hand.
            response = requests.get(
                "https://www.google.com/search",
                params={"q": query},
                proxies=proxies,
                timeout=10,
            )
            results.append({"query": query, "content": response.text})  # Parse as needed for production
        except Exception as e:
            # Record the failure and continue with the remaining queries.
            results.append({"query": query, "error": str(e)})

    with open(output_file.path, "w") as f:
        json.dump(results, f)

This component uses IPFLY's dynamic residential proxies to rotate IPs and get past access restrictions, giving anonymous, stable retrieval.
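
One detail worth handling when the proxy URL is assembled from a username and password: characters such as @, : or / inside the credentials would break the URL unless they are percent-encoded. The sketch below shows one way to do that with the standard library. build_proxies is a hypothetical helper, and the host, port, and credentials are placeholders rather than real IPFLY values.

```python
from typing import Dict
from urllib.parse import quote


def build_proxies(host: str, port: str, username: str, password: str) -> Dict[str, str]:
    """Build a requests-style proxies mapping with percent-encoded credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    # The same proxy endpoint handles both plain and TLS traffic.
    return {"http": proxy_url, "https": proxy_url}


proxies = build_proxies("proxy.example.com", "8000", "user-1", "p@ss:word/1")
print(proxies["https"])
```

The resulting mapping can be passed as the proxies argument of requests.get in the same way as the dictionary built inside the component.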

Step #8: Implement the Data Validation Component

Process the data for validation:

@component(
    base_image="python:3.10",
    packages_to_install=["google-generativeai"],
)
def validate_with_web_data(
    input_text: str,
    web_data_file: Input[Artifact],
    project: str,
    location: str,
) -> str:
    import google.generativeai as genai
    import json

    with open(web_data_file.path, "r") as f:
        web_data = json.load(f)

    genai.configure(api_key="<YOUR_GEMINI_API_KEY>")
    model = genai.GenerativeModel('gemini-1.5-pro')

    prompt = f"""
    Validate the original text using the provided web data in JSON format.
    Produce a Markdown report highlighting accuracies and discrepancies.

    [Original Text]
    "{input_text}"

    [Web Data]
    "{json.dumps(web_data)}"
    """
    response = model.generate_content(prompt)
    return response.text

A more capable model is chosen here for the more complex analysis.
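
The retrieval step stores raw HTML, which is mostly markup and scripts and can make the validation prompt very large. As a rough sketch of one way to shrink it, the helper below reduces a page to its visible text and truncates the result using only the standard library. html_to_text and the max_chars limit are hypothetical choices for illustration; production code would more likely use a dedicated HTML parsing library.

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str, max_chars: int = 4000) -> str:
    """Return the page's visible text, truncated to max_chars characters."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)[:max_chars]


page = "<html><head><style>p{}</style></head><body><p>Tokyo is the capital.</p></body></html>"
print(html_to_text(page))
```

Applying a helper like this to each "content" field before building the prompt keeps the request smaller, at the cost of discarding page structure.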

Step #9: Define and Compile the Pipeline

Connect the components:

@pipeline(
    name="ipfly-data-collection-pipeline",
    description="Retrieves web data via IPFLY proxies for validation."
)
def data_collection_pipeline(
    input_text: str,
    ipfly_proxy_host: str,
    ipfly_proxy_port: str,
    ipfly_username: str,
    ipfly_password: str,
    project: str = PROJECT_ID,
    location: str = REGION,
):
    step1 = extract_queries(input_text=input_text, project=project, location=location)
    step2 = fetch_web_data(
        queries=step1.output,
        ipfly_proxy_host=ipfly_proxy_host,
        ipfly_proxy_port=ipfly_proxy_port,
        ipfly_username=ipfly_username,
        ipfly_password=ipfly_password
    )
    step3 = validate_with_web_data(
        input_text=input_text,
        web_data_file=step2.outputs["output_file"],
        project=project,
        location=location
    )

compiler.Compiler().compile(
    pipeline_func=data_collection_pipeline,
    package_path="data_collection_pipeline.json"
)

Step #10: Run the Pipeline

Example input:

"Tokyo is the capital of Japan, which uses the euro as its currency."

TEXT_TO_VALIDATE = """Tokyo is the capital of Japan, which uses the euro as its currency."""

IPFLY_PROXY_HOST = "<YOUR_IPFLY_HOST>"
IPFLY_PROXY_PORT = "<YOUR_IPFLY_PORT>"
IPFLY_USERNAME = "<YOUR_IPFLY_USERNAME>"
IPFLY_PASSWORD = "<YOUR_IPFLY_PASSWORD>"

job = aiplatform.PipelineJob(
    display_name="data-collection-pipeline-run",
    template_path="data_collection_pipeline.json",
    pipeline_root=BUCKET_URI,
    parameter_values={
        "input_text": TEXT_TO_VALIDATE,
        "ipfly_proxy_host": IPFLY_PROXY_HOST,
        "ipfly_proxy_port": IPFLY_PROXY_PORT,
        "ipfly_username": IPFLY_USERNAME,
        "ipfly_password": IPFLY_PASSWORD
    }
)
job.run()

For production, use secure secret management rather than including credentials directly.
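
As a first step away from hard-coded credentials, the notebook can read them from environment variables. The sketch below assumes hypothetical variable names (IPFLY_PROXY_HOST and so on) chosen to mirror the pipeline parameters; load_proxy_settings is an illustrative helper, not an IPFLY or Google API.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names, chosen to mirror the pipeline parameter names.
REQUIRED_VARS = (
    "IPFLY_PROXY_HOST",
    "IPFLY_PROXY_PORT",
    "IPFLY_USERNAME",
    "IPFLY_PASSWORD",
)


def load_proxy_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read proxy settings from the environment, failing loudly if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    # Lower-cased names match the pipeline parameters, e.g. ipfly_proxy_host.
    return {name.lower(): env[name] for name in REQUIRED_VARS}


demo_env = {
    "IPFLY_PROXY_HOST": "proxy.example.com",
    "IPFLY_PROXY_PORT": "8000",
    "IPFLY_USERNAME": "user-1",
    "IPFLY_PASSWORD": "secret",
}
settings = load_proxy_settings(demo_env)
print(sorted(settings))
```

The returned dictionary can be merged into parameter_values. Values passed as pipeline parameters may still be visible in the run's metadata, so a dedicated secret store read from inside the component is the stronger option.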

Step #11: Monitor Pipeline Execution

Follow the run's progress at:

https://console.cloud.google.com/vertex-ai/pipelines?project={PROJECT_ID}

Check the status, logs, and artifacts of each component.

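Step #12: Explore the Output

Extracted queries: for example, "what is the capital of Japan" and "what currency does Japan use".

Retrieved data: a JSON artifact in the storage bucket containing the web content.

Validation report: Markdown output identifying the correct element (Tokyo as the capital) and the incorrect one (the currency is the yen, not the euro).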
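
Once the JSON artifact has been downloaded from the bucket, a quick summary shows which queries returned content and which failed. This sketch relies on the record shape written by the retrieval component (a "query" key plus either "content" or "error"); summarize_results and the sample records are illustrative, not real pipeline output.

```python
import json
from typing import Dict, List


def summarize_results(results: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group queries by whether retrieval succeeded or recorded an error."""
    summary: Dict[str, List[str]] = {"ok": [], "failed": []}
    for item in results:
        key = "failed" if "error" in item else "ok"
        summary[key].append(item["query"])
    return summary


# Sample records in the artifact's format; not real pipeline output.
sample = json.loads(
    '[{"query": "capital of japan", "content": "<html>...</html>"},'
    ' {"query": "currency of japan", "error": "timeout"}]'
)
print(summarize_results(sample))
```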

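This walkthrough showed how to integrate IPFLY proxies into Vertex AI to build a reliable data collection pipeline. With IPFLY's secure, extensive proxy network, organizations can improve data integrity and efficiency in their AI workflows. IPFLY's support for high concurrency and global IP rotation also serves more demanding applications in areas such as data scraping and financial services. We encourage you to explore IPFLY's offerings to strengthen your AI infrastructure with high-quality web data solutions.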
