Search Archives - Sixteen Pieces of Flair

In the last post we went over setting up Experience Edge to set up a webhook whenever a publish is completed. In this post, we’ll handle receiving that webhook event to push published updates to our search index.

First let’s review the high-level architecture. Our webhook fires after a publish to Experience Edge completes. We need to send this to a serverless function in our Next.js app. That function will be responsible for parsing the json payload from the webhook and pushing any changes to our search index. This diagram illustrates the process:

Before we build our serverless function, let’s take a look at the json that gets sent with the webhook :

Note that this webhook contains data about everything in the publish operation. For search index purposes, we’re interested in updates to pages on the website. This is represented by in the payload by "entity_definition": "LayoutData". Unfortunately, all we get is the ID of the item that was updated rather than the specific things that changed. That means we’ll need to query for the page data before pushing it to the search index.

Now that we understand the webhook data we’re dealing with, we need to make our function to handle it. If you’re using Vercel to host your web app, creating a serverless function is easy. Create a typescript file in the /pages/api folder in your app. We’ll call this handler “onPublishEnd.ts”. The function needs to do the following:

Loop over all “LayoutData” entries
Query GraphQL for that item’s field data
Validate the item is part of the site we’re indexing
Push the aggregate content data to the search provider

Let’s look at a sample implementation that will accomplish these tasks:

// Import the Next.js API route handler
import { NextApiRequest, NextApiResponse } from 'next';
import { graphqlRequest, GraphQLRequest } from '@/util/GraphQLQuery';
import { GetDate } from '@/util/GetDate';

// Define the API route handler
export default async function onPublishEnd(req: NextApiRequest, res: NextApiResponse) {
  // Check if the api_key query parameter matches the WEBHOOK_API_KEY environment variable
  if (req.query.api_key !== process.env.WEBHOOK_API_KEY) {
    return res.status(401).json({ message: 'Unauthorized' });
  }

  // If the request method is not POST, return an error
  if (req.method !== 'POST') {
    return res.status(405).json({ message: 'Method not allowed' });
  }

  let data;
  try {
    // Try to parse the JSON data from the request body
    //console.log('Req body:\n' + JSON.stringify(req.body));
    data = req.body;
  } catch (error) {
    console.log('Bad Request: ', error);
    return res.status(400).json({ message: 'Bad Request. Check incoming data.' });
  }

  const items = [];

  // Loop over all the entries in updates
  for (const update of data.updates) {
    // Check if the entity_definition is LayoutData
    if (update.entity_definition === 'LayoutData') {
      // Extract the GUID portion of the identifier
      const guid = update.identifier.split('-')[0]

      try {
        // Create the GraphQL request
        const request: GraphQLRequest = {
          query: itemQuery,
          variables: { id: guid },
        };
        
        // Invoke the GraphQL query with the request
        //console.log(`Getting GQL Data for item ${guid}`);
        const result = await graphqlRequest(request);
        //console.log('Item Data:\n' + JSON.stringify(result));

        // Make sure we got some data from GQL in the result
        if (!result || !result.item) {
            console.log(`No data returned from GraphQL for item ${guid}`);
            continue;
          }
  
          // Check if it's in the right site by comparing the item.path
          if (!result.item.path.startsWith('/sitecore/content/Search Demo/Search Demo/')) {
            console.log(`Item ${guid} is not in the right site`);
            continue;
          }

          // Add the item to the items array
          items.push(result.item)
        
      } catch (error) {
        // If an error occurs while invoking the GraphQL query, return a 500 error
        return res.status(500).json({ message: 'Internal Server Error: GraphQL query failed' })
      }
    }
  }
 // Send the json data to the Yext Push API endpoint
 const pushApiEndpoint = `${process.env.YEXT_PUSH_API_ENDPOINT}?v=${GetDate()}&api_key=${process.env.YEXT_PUSH_API_KEY}`;
 console.log(`Pushing to ${pushApiEndpoint}\nData:\n${JSON.stringify(items)}`);
 
 // Send all the items to the Yext Push API endpoint
 const yextResponse = await fetch(pushApiEndpoint, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(items),
  });

  if (!yextResponse.ok) {
    console.log(`Failed to push data to Yext: ${yextResponse.status} ${yextResponse.statusText}`);
  }

  // Send a response
  return res.status(200).json({ message: 'Webhook event received' })
}

const itemQuery = `
query ($id: String!) {
    item(path: $id, language: "en") {
      id
      name
      path
      url {
        path
        url
      }
      fields {
        name
        jsonValue
      }
    }
  }
  
`;

https://github.com/csulham/nextjs-sandbox/blob/main/pages/api/onPublishEnd.ts

This function uses the Next.js API helpers to create a quick and easy API endpoint. After some validation (including an API key we define to ensure this endpoint isn’t used in unintended manners), the code goes through the json payload from the webhook and an executes the tasks described above. In this case, we’re pushing to Yext as our search provider, and we’re sending all the item’s field data. Sending everything is preferable here because it simplifies the query on the app side and allows us to handle mappings and transformations in our search provider, making future changes easier to manage without deploying new code.

As the previous post stated, CREATE and DELETE are separate operations that will need to be handled with separate webhooks. There may still be other considerations you’ll need to handle as well, such as a very large publish and the need to batch the querying of item data and the pushes to the search provider. Still, this example is a useful POC that you can adapt to your project’s search provider and specific requirements.

Search is one of the biggest pieces of the puzzle when building a composable solution with Sitecore’s XM Cloud. Sitecore offers their own product, Sitecore Search, and there are a couple other search vendors that have native connectors. But what if you need to set up a search product that does not have a native connector to Sitecore, such as Yext? In this post, we’ll discuss how to use GraphQL + Experience Edge to crawl Sitecore via Yext’s API connector.

The first thing we want to do is figure out how we’ll get out content out of Sitecore. Ideally we want to be able to do this with a single query, rather than chaining queries together, in order to simplify the process of setting up the API crawler in Yext. For this, we’ll use a GraphQL search query. Let’s take a look at an example query:

query YextSiteCrawl(
  $numResults: Int
  $after: String
  $rootItem: String!
  $hasLayout: String!
  $noIndex: Int
) {
  search(
    where: {
      AND: [
        { name: "_path", value: $rootItem, operator: EQ }
        { name: "_hasLayout", value: $hasLayout }
        { name: "noIndex", value: $noIndex, operator: NEQ }
      ]
    }
    first: $numResults
    after: $after
  ) {
    total
    pageInfo {
      endCursor
      hasNext
    }
    results {
      id
      name
      path
      url {
        path
        url
      }
      fields {
        name
        jsonValue
      }
    }
  }
}

Let’s take a look at this query, starting with the search filters.

_path allows us to query for every item that contains the rootPath in its path. For our site crawler, we’ll want to pass in the GUID of the site’s home page here.
_hasLayout is a system field. This filter will exclude items that do not have a presenation assigned to them, such as folders and component datasources. We’ll want to pass in “true” here.
noIndex is a custom field we have defined on our page templates. If this box is checked, we want to exclude it from the crawl. We’ll pass in “1” here.
numResults controls how many results we’ll get back from the query. We’ll use 10 to start, but you can increase this if you want your crawl to go faster. (Be wary of the query response size limits!)
after is our page cursor. In our response, we’ll get back a string that points to the next page of results.

In the results area, we’re asking for some system fields like ID, name, path, and url. These are essential for identifying the content in Yext. After that, we’re asking for every field on the item. You may want to tune this to query just the fields you need to index, but for now we’ll grab everything for simplicity’s sake.

A question you may be asking is, “Why so many parameters?” The answer is to work around a limitation with GraphQL for Experience Edge:

Due to a known issue in the Experience Edge GraphQL schema, you must not mix literal and variable notation in your query input. If you use any variables in your GraphQL query, you must replace all literals with variables.
https://doc.sitecore.com/xp/en/developers/hd/21/sitecore-headless-development/limitations-and-restrictions-of-experience-edge-for-xm.html

The only parameter we want to pass here is “after”, which is the page cursor. We’ll need our crawler to be able to page through the results. Unfortunately, that means we have to pass every literal value we need as a parameter.

Let’s look at the result of this query:

{
  "data": {
    "search": {
      "total": 51,
      "pageInfo": {
        "endCursor": "eyJzZWFythycnRlciI6WzE3MDE2OTQ5MDgwMDAsIjYwMzlGQTJE4rs3gRjVCOEFCRDk1AD5gN0VBIiwiNjAzOUZBMkQ5QzIyNDZGNUI4QUJEOTU3NURBRkI3RUEiXSwiY291bnQiOjF9",
        "hasNext": true
      },
      "results": [
        {
          "id": "00000000000000000000000000000000",
          "name": "Normal Page",
          "path": "/sitecore/content/MyProject/MySite/Home/Normal Page",
          "url": {
            "path": "/Normal-Page",
            "url": "https://xmc-myProject-etc-etc.sitecorecloud.io/en/Normal-Page"
          },
          "fields": [
            {
              "name": "title",
              "jsonValue": {
                "value": "Normal Page title"
              }
            },
            {
              "name": "summary",
              "jsonValue": {
                "value": "Normal Summary"
              }
            },
            {
              "name": "noIndex",
              "jsonValue": {
                "value": false
              }
            },
            {
              "name": "topics",
              "jsonValue": [
                {
                  "id": "00000000-0000-0000-0000-000000000000",
                  "url": "/Data/Taxonomies/Topics/Retirement",
                  "name": "Retirement",
                  "displayName": "Retirement",
                  "fields": {}
                },
                {
                  "id": "00000000-0000-0000-0000-000000000000",
                  "url": "/Data/Taxonomies/Topics/Money",
                  "name": "Money",
                  "displayName": "Money",
                  "fields": {}
                }
              ]
            }            
          ]
        }
	...
      ]
    }
  }
}

In the results block we have our pages, along with all the fields we defined on the page template in the fields block. In the pageInfo block, we have endCursor, which is the string we’ll use to page the results in our crawler.

The next step is to set up the crawler in Yext. From Yext, you’ll want to add a “Pull from API” connector. On API Settings page, we can configure the crawler to hit Experience Edge in the Request URL field, pass our API key in the Authentication section, then put our GraphQL request in the Request Body section. Finally, we can set up the Pagination Control with our cursor. Easy, right?

Unfortunately, we’ll hit a problem here. Yext (as of this writing) only supports passing pagination parameters as query parameters. When we’re using GraphQL, we need to pass variables as part of the request body in the variables block. To work around this limitation, we’ll need to wrap our query in a simple API.

Next.js makes creating an API easy. You drop your api into the /pages/api folder and that’s it! Let’s make a simple API wrapper to take our page cursor as a query parameter and then invoke this query on Experience Edge. We’ll call our api file yextCrawl.ts.

import type { NextApiRequest, NextApiResponse } from 'next'

type ResponseData = {
  message: string
}
 
export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<ResponseData>
) {  
  try {
    if (!req) {
      return;
    }
    const cursor = req.query['cursor'];

    const crawlQuery = `
      query YextSiteCrawl(
          $numResults: Int
          $after: String
          $rootItem: String!
          $hasLayout: String!
          $noIndex: Int
      ) {
          search(
          where: {
              AND: [
              { name: "_path", value: $rootItem, operator: EQ }
              { name: "_hasLayout", value: $hasLayout }
              { name: "noIndex", value: $noIndex, operator: NEQ }
              ]
          }
          first: $numResults
          after: $after
          ) {
          total
          pageInfo {
              endCursor
              hasNext
          }
          results {
              id
              name
              path
              url {
                  path
                  url
              }
              fields {
                  name
                  jsonValue
              }
          }
        }
      }
  `;
 
    const headers = {
      'content-type': 'application/json',
      'sc_apikey': process.env.YOUR_API_KEY ?? ''
    };

    const requestBody = {
      query: crawlQuery,
      variables: { 
        "numResults" : 10,
        "after" : cursor ?? "",
        "rootItem": "{00000000-0000-0000-0000-000000000000}",
        "hasLayout": "true",
        "noIndex": 1
       }
    };
	
    const options = {
      method: 'POST',
      headers,
      body: JSON.stringify(requestBody)
    };

    const response = await (await fetch('https://edge.sitecorecloud.io/api/graphql/v1', options)).json();
    res.status(200).json(response?.data);
  }
  catch (err) {
    console.log('Error during fetch:', err);
  }
};

Let’s walk through this code.

We’re making a simple handler taking in a NextRequest and a NextResponse. We’ll check the request for our cursor parameter, if it exists. The GraphQL query we have as a literal string, cut and pasted from the XM Cloud playground where we tested it. The API key gets passed in the header, and we’ve configured this in our env.local and as an environment variable in Vercel.

Our request body will contain the query and the variables. This is where we’ll get around the limitation in the Yext Pull from API crawler. We’ll set up the cursor we pulled from the query parameters here. Our other variables we pass to the query are hard coded for the sake of this example.

Finally we use fetch to query Experience Edge and return the response. The result should be the same JSON we got from testing our query in the playground earlier. Once we deploy this api to Vercel, we can see it working at: https://my-nextjs-project.vercel.app/api/yextCrawl

Try hitting that url and see if you get back your Sitecore content. Then grab the endCursor value and hit it again, passing that value as the cursor parameter in the query string. You should see the next page of results.

Back in Yext, we’ll set up our Pull from API connector again, this time hitting our Vercel hosted API wrapper.

As you can see this is a lot easier to configure! Note the values of our cursor parameter under Pagination Control. These correspond to the cursor query parameter we defined in our wrapper API, and the endCursor data in the json response from our GraphQL query. It’s also important to configure the Max Requests setting. We’re limiting this crawler to 1 request per second so we don’t hit the request limit in Experience Edge.

You can test the connector with the Pull button on the top right. If you’ve set up everything correctly, you should see the View Raw Results button light up and be able to see your results in a modal window.

From here, you can configure the mappings of your Sitecore fields to your Yext entities. That is out of the scope of this post, but Yext’s documentation will help you there. One suggestion I will make is to map the Yext entity’s ID to the page’s Sitecore GUID, defined as id in our crawler query response.

Once it’s all set, save your connector, then you can run your crawler from the connector’s page by clicking “Run Connector”. If you’ve set everything up correctly, you should see your Sitecore content flowing into your Yext tenant.

Category: Search

Real-time Search Updates with Experience Edge Webhooks: Part 2

Crawling Sitecore XM Cloud Content with Yext