Crawling Sitecore XM Cloud Content with Yext

Search is one of the biggest pieces of the puzzle when building a composable solution with Sitecore’s XM Cloud. Sitecore offers their own product, Sitecore Search, and there are a couple other search vendors that have native connectors. But what if you need to set up a search product that does not have a native connector to Sitecore, such as Yext? In this post, we’ll discuss how to use GraphQL + Experience Edge to crawl Sitecore via Yext’s API connector.

The first thing we want to do is figure out how we’ll get out content out of Sitecore. Ideally we want to be able to do this with a single query, rather than chaining queries together, in order to simplify the process of setting up the API crawler in Yext. For this, we’ll use a GraphQL search query. Let’s take a look at an example query:

query YextSiteCrawl(
  $numResults: Int
  $after: String
  $rootItem: String!
  $hasLayout: String!
  $noIndex: Int
) {
  search(
    where: {
      AND: [
        { name: "_path", value: $rootItem, operator: EQ }
        { name: "_hasLayout", value: $hasLayout }
        { name: "noIndex", value: $noIndex, operator: NEQ }
      ]
    }
    first: $numResults
    after: $after
  ) {
    total
    pageInfo {
      endCursor
      hasNext
    }
    results {
      id
      name
      path
      url {
        path
        url
      }
      fields {
        name
        jsonValue
      }
    }
  }
}

Let’s take a look at this query, starting with the search filters.

  • _path allows us to query for every item that contains the rootPath in its path. For our site crawler, we’ll want to pass in the GUID of the site’s home page here.
  • _hasLayout is a system field. This filter will exclude items that do not have a presenation assigned to them, such as folders and component datasources. We’ll want to pass in “true” here.
  • noIndex is a custom field we have defined on our page templates. If this box is checked, we want to exclude it from the crawl. We’ll pass in “1” here.
  • numResults controls how many results we’ll get back from the query. We’ll use 10 to start, but you can increase this if you want your crawl to go faster. (Be wary of the query response size limits!)
  • after is our page cursor. In our response, we’ll get back a string that points to the next page of results.

In the results area, we’re asking for some system fields like ID, name, path, and url. These are essential for identifying the content in Yext. After that, we’re asking for every field on the item. You may want to tune this to query just the fields you need to index, but for now we’ll grab everything for simplicity’s sake.

A question you may be asking is, “Why so many parameters?” The answer is to work around a limitation with GraphQL for Experience Edge:

Due to a known issue in the Experience Edge GraphQL schema, you must not mix literal and variable notation in your query input. If you use any variables in your GraphQL query, you must replace all literals with variables.

https://doc.sitecore.com/xp/en/developers/hd/21/sitecore-headless-development/limitations-and-restrictions-of-experience-edge-for-xm.html

The only parameter we want to pass here is “after”, which is the page cursor. We’ll need our crawler to be able to page through the results. Unfortunately, that means we have to pass every literal value we need as a parameter.

Let’s look at the result of this query:

{
"data": {
"search": {
"total": 51,
"pageInfo": {
"endCursor": "eyJzZWFythycnRlciI6WzE3MDE2OTQ5MDgwMDAsIjYwMzlGQTJE4rs3gRjVCOEFCRDk1AD5gN0VBIiwiNjAzOUZBMkQ5QzIyNDZGNUI4QUJEOTU3NURBRkI3RUEiXSwiY291bnQiOjF9",
"hasNext": true
},
"results": [
{
"id": "00000000000000000000000000000000",
"name": "Normal Page",
"path": "/sitecore/content/MyProject/MySite/Home/Normal Page",
"url": {
"path": "/Normal-Page",
"url": "https://xmc-myProject-etc-etc.sitecorecloud.io/en/Normal-Page"
},
"fields": [
{
"name": "title",
"jsonValue": {
"value": "Normal Page title"
}
},
{
"name": "summary",
"jsonValue": {
"value": "Normal Summary"
}
},
{
"name": "noIndex",
"jsonValue": {
"value": false
}
},
{
"name": "topics",
"jsonValue": [
{
"id": "00000000-0000-0000-0000-000000000000",
"url": "/Data/Taxonomies/Topics/Retirement",
"name": "Retirement",
"displayName": "Retirement",
"fields": {}
},
{
"id": "00000000-0000-0000-0000-000000000000",
"url": "/Data/Taxonomies/Topics/Money",
"name": "Money",
"displayName": "Money",
"fields": {}
}
]
}
]
}
...
]
}
}
}

In the results block we have our pages, along with all the fields we defined on the page template in the fields block. In the pageInfo block, we have endCursor, which is the string we’ll use to page the results in our crawler.

The next step is to set up the crawler in Yext. From Yext, you’ll want to add a “Pull from API” connector. On API Settings page, we can configure the crawler to hit Experience Edge in the Request URL field, pass our API key in the Authentication section, then put our GraphQL request in the Request Body section. Finally, we can set up the Pagination Control with our cursor. Easy, right?

Unfortunately, we’ll hit a problem here. Yext (as of this writing) only supports passing pagination parameters as query parameters. When we’re using GraphQL, we need to pass variables as part of the request body in the variables block. To work around this limitation, we’ll need to wrap our query in a simple API.

Next.js makes creating an API easy. You drop your api into the /pages/api folder and that’s it! Let’s make a simple API wrapper to take our page cursor as a query parameter and then invoke this query on Experience Edge. We’ll call our api file yextCrawl.ts.

import type { NextApiRequest, NextApiResponse } from 'next'

type ResponseData = {
  message: string
}
 
export default async function handler(
  req: NextApiRequest,
  res: NextApiResponse<ResponseData>
) {  
  try {
    if (!req) {
      return;
    }
    const cursor = req.query['cursor'];

    const crawlQuery = `
      query YextSiteCrawl(
          $numResults: Int
          $after: String
          $rootItem: String!
          $hasLayout: String!
          $noIndex: Int
      ) {
          search(
          where: {
              AND: [
              { name: "_path", value: $rootItem, operator: EQ }
              { name: "_hasLayout", value: $hasLayout }
              { name: "noIndex", value: $noIndex, operator: NEQ }
              ]
          }
          first: $numResults
          after: $after
          ) {
          total
          pageInfo {
              endCursor
              hasNext
          }
          results {
              id
              name
              path
              url {
                  path
                  url
              }
              fields {
                  name
                  jsonValue
              }
          }
        }
      }
  `;
 
    const headers = {
      'content-type': 'application/json',
      'sc_apikey': process.env.YOUR_API_KEY ?? ''
    };

    const requestBody = {
      query: crawlQuery,
      variables: { 
        "numResults" : 10,
        "after" : cursor ?? "",
        "rootItem": "{00000000-0000-0000-0000-000000000000}",
        "hasLayout": "true",
        "noIndex": 1
       }
    };
	
    const options = {
      method: 'POST',
      headers,
      body: JSON.stringify(requestBody)
    };

    const response = await (await fetch('https://edge.sitecorecloud.io/api/graphql/v1', options)).json();
    res.status(200).json(response?.data);
  }
  catch (err) {
    console.log('Error during fetch:', err);
  }
};

Let’s walk through this code.

We’re making a simple handler taking in a NextRequest and a NextResponse. We’ll check the request for our cursor parameter, if it exists. The GraphQL query we have as a literal string, cut and pasted from the XM Cloud playground where we tested it. The API key gets passed in the header, and we’ve configured this in our env.local and as an environment variable in Vercel.

Our request body will contain the query and the variables. This is where we’ll get around the limitation in the Yext Pull from API crawler. We’ll set up the cursor we pulled from the query parameters here. Our other variables we pass to the query are hard coded for the sake of this example.

Finally we use fetch to query Experience Edge and return the response. The result should be the same JSON we got from testing our query in the playground earlier. Once we deploy this api to Vercel, we can see it working at: https://my-nextjs-project.vercel.app/api/yextCrawl

Try hitting that url and see if you get back your Sitecore content. Then grab the endCursor value and hit it again, passing that value as the cursor parameter in the query string. You should see the next page of results.

Back in Yext, we’ll set up our Pull from API connector again, this time hitting our Vercel hosted API wrapper.

As you can see this is a lot easier to configure! Note the values of our cursor parameter under Pagination Control. These correspond to the cursor query parameter we defined in our wrapper API, and the endCursor data in the json response from our GraphQL query. It’s also important to configure the Max Requests setting. We’re limiting this crawler to 1 request per second so we don’t hit the request limit in Experience Edge.

You can test the connector with the Pull button on the top right. If you’ve set up everything correctly, you should see the View Raw Results button light up and be able to see your results in a modal window.

From here, you can configure the mappings of your Sitecore fields to your Yext entities. That is out of the scope of this post, but Yext’s documentation will help you there. One suggestion I will make is to map the Yext entity’s ID to the page’s Sitecore GUID, defined as id in our crawler query response.

Once it’s all set, save your connector, then you can run your crawler from the connector’s page by clicking “Run Connector”. If you’ve set everything up correctly, you should see your Sitecore content flowing into your Yext tenant.

What’s new in Sitecore 9.1

Sitecore MouseSitecore 9.1 has just hit, and with it comes a lot of exciting new features. You’ll probably be hearing and reading a lot about the Big Things they’re announcing with this release, such as the general availability of Sitecore Javascript Services (JSS), automated personalization with Cortex, Sitecore’s acquisition of digital asset manager StyleLabs, and their partnership with Salesforce.

However, there are some great quality of life enhancements coming with this release as well, which may be of particular interest to developers. Here’s a few that were highlighted.

Performance

Anyone who’s worked with Sitecore for a while, especially as a developer, has noticed how long it takes to start up the application. This can be a huge drag on productivity when you have to wait and wait for application pool recycles, especially if you’re in a rapid development cycle. You lose momentum, you lose focus, and it’s just annoying. The team at Sitecore has heard these complaints and made some serious strides on this in 9.1.

Sitecore showed some benchmarks and 9.1 is boasting a startup-time that’s cut in half. That’s time from a cold start of a CM instance to loading the Launchpad. Not bad! They’ve also cut the number of .dlls the /bin folder in half, increased the load time of the Content Editor by a factor of 6, and shaved some load time of the Experience Editor as well.

3rd Party Integrations

Sitecore has historically lagged behind in updating their integrations with supporting software. This was highlighted last year with the exposure of a security flaw in their Telerik version. In 9.1, we’ll see support for the latest versions of Sitecore’ supporting software, including Telerik, Newtonsoft Json.net, Solr, and of course .NET Core.

Horizon

The current Sitecore back-end has been essentially the same for many years, some CSS updates notwithstanding, and it’s lagging behind the competition. If you were at Symposium last year, it was mentioned during the closing keynote that Sitecore is working on an overhaul of their UI and authoring experience. This year they’ve announced the early-access availability of Horizon.

So, what is Horizon? Right now we’re not entirely sure. It’s meant to address the concerns of customers with the current Experience Editor. We know it’s an overhaul of the Experience Editor at least, but will it exist next to it, replace it outright, or complement it?

Sitecore is releasing an early access version of Horizon later this month and we’ll know a lot more. They want feedback, so as a developer you should download Horizon when it’s available, beat on it, and let them know what you think!

Native Indexing of Binary Content

Another small but welcome enhancement is the ability for the Content Search crawler to index PDF and MS Word files, out of the box. This was possible before with the installation of 3rd party tools, but Sitecore has heard their users and is wisely including this as a core feature.

That’s all for now. When Sitecore 9.1 hits, make sure to crack it open and put some of these changes through their paces. I certainly will be!

Jabberwocky Updated for Sitecore 9

a jabberwockyVelir’s Jabberwocky framework has been updated for Sitecore 9.0, initial release. This update doesn’t add any new features beyond support for Sitecore 9.

For now, the package is marked prerelease, due in-part to the dependency on Glass.Mapper, which is still in prerelease for Sitecore 9 support.  We’ll be assessing the framework during our upcoming Sitecore 9 upgrades and projects, and we will correct any uncaught issues with the framework. A final release will be available in the coming months.

As always, your feedback is welcomed!

Benchmarking Sitecore Publishing

Publishing has been a sore spot lately for some of our clients due to the high amount of content they have in their Sitecore environment. When you start to get into hundreds of thousands of pieces of content, a full site publish is prohibitive. Any time a change is made that requires a large publish your deployment window goes from an hour to potentially an all-day affair. If a user accidentally starts a large publish, subsequent content publishes will get queued and backed up until that large publish completes, or until someone logs into the server and restarts the application.

Still waiting

There are options available to speed up the publishing process. Starting in Sitecore 7.2, parallel publishing was introduced, along with some experimental optimization settings. In Sitecore 8.2, we have a new option, the Sitecore Publishing Service.

What benefits can we see from these options?  I decided to do some tests of large content publishes using these techniques. Each publishing option has its own caveats of course, but this post is concerning itself mainly with the publishing performance of each of the available options.

Skip to the results!

Methodology

I wanted to run these tests in as pure an environment as possible. I set up 3 Sitecore 8.2 environments using Sitecore Instance Manager on my local machine. Using the FillDB tool, I generated 100,000 content items nested in a folder under the site root. Each of these items is of the Sample Item template that ships with a clean Sitecore installation. Full Publish on the entire site was used in each example. Each time the content was being published for the first time.

For benchmarking purposes, my local machine has the following specs,

  • Intel  i7, 8 Core, 2.3 GHz CPU
  • 16 GB RAM
  • Seagate SSHD (not an SSD, but it claims to perform like an SSD!)
  • Windows 7 x64, SP1
  • SQL Server Express 2015
  • .NET 4.6 and .NET Core installed

Default Publishing

The first test was doing a full site publish after generating 100,000 content items using the out-of-the-box publishing configuration. This is probably how most of Sitecore sites are configured unless you took steps to optimize the publishing processes. The results are, as expected, not great.

21620 12:19:30 INFO  Job started: Publish
21620 13:51:18 INFO  Job ended: Publish (units processed: 106669)

That’s over 90 minutes to publish these items, and the content items themselves only had 2 fields with any data.

Parallel Publishing

Next I tested parallel publishing, introduced in Sitecore 7.2. To use this, you need to enable Sitecore.Publishing.Parallel.config. Since I have an 8 core CPU, I set the Publishing.MaxDegreeOfParallelism setting to 8.

There is also Sitecore.Publishing.Optimizations.config, which contains, as the name implies, some optimization settings for publishing. The file comments state that the settings are experimental, and that you should evaluate them before using them in production. For purposes of this test, I ignored this file.

With parallel publishing enabled we see a much shorter publish time of around 25 minutes.

12164 14:27:10 INFO  Job started: Publish to 'web'
12164 14:52:58 INFO  Job ended: Publish to 'web' (units processed: 106669)

Publishing Optimizations

I reran the previous test with the Sitecore.Publishing.Optimizations.config enabled, along with the parallel publishing. This shortened the publish to around 15 minutes.

9836 15:52:34 INFO  Job started: Publish to 'web'
9836 16:07:20 INFO  Job ended: Publish to 'web' (units processed: 106669)

Sitecore Publishing Service

New in Sitecore 8.2 is the Publishing Service, which is a separate web application written in .NET Core that replaces the existing publishing mechanism in your Sitecore site. The documentation on setting up this service is thorough, so kudos to Sitecore for that, however it can be a bit dense. I found this blog post quite helpful in clearing up my confusion. Using it in conjunction with the official documentation, I was able to set up this service in less than an hour.

I ran into a problem using this method, however. The Publishing Service uses some new logic to gather the items it needs to publish, and one of the things it keys off of is the Revision field. Using the FillDb tool doesn’t explicitly write to the Revision field, therefore the service didn’t publish any of my generated items. I wound up running a script with Sitecore Powershell to make a simple edit to these items forcing the Revision field to be written. After that, my items published as expected.

The results were amazing. The new Publish Service was able to publish the entire site, over 100,000 items, in just over 4 minutes. That’s over 20x faster than the default publish settings.

2016-10-19 16:34:17.027 -04:00 [Information] New Job queued : 980bee8e-a132-4041-82d8-155b8496b19f - Targets: "Internet"
2016-10-19 16:39:07.304 -04:00 [Information] Job Result: 95b88a85-64f4-465e-b33d-a7a901331488 - "Complete" - "OK". Duration: 00:04:05.2786436

Summary

Each of these optimizations come with caveats. Parallel Publishing can introduce concurrency issues if you’re firing events during publish. The optimization config settings need to be vetted before rolling out, as it disables or alters many features you may be using, even if you don’t realize you’re using them.

If you’re on Sitecore 8.2 I strongly recommend giving the Publishing Service a look. Like any change to your system, you’ll want to test the effects it has on your publishing events and other hooks before rolling it out.

Sitecore on Solr Cloud: Part 4 – Tuning Solr for Production

This post is part of a series of posts on setting up your Sitecore application to run with Solr Cloud. We’ll be covering the procedure for setting up a Sitecore environment using the Solr search provider, and the creation of a 3-node Solr cloud cluster. This series is broken into four parts.

For the fourth part of this series, we will discuss updating your Solr settings, using Zookeeper to push those changes to the nodes, and some tuning and optimizations we can make to Solr and Sitecore for production.  Continue reading Sitecore on Solr Cloud: Part 4 – Tuning Solr for Production

Sitecore Solr Support for Chinese Language

If you’re running Sitecore with Solr, you may have noticed crawling errors when you add versions in certain languages. A common requirement for multilingual sites is support for Chinese, which the generated Solr schema Sitecore provides does not support by default.  Fortunately, it’s relatively simple to correct this and add support for Chinese, as well as other languages that aren’t available in the default schema. Continue reading Sitecore Solr Support for Chinese Language

Sitecore on Solr Cloud: Part 3 – Creating Your Sitecore Collection

This post is part of a series of posts on setting up your Sitecore application to run with Solr Cloud. We’ll be covering the procedure for setting up a Sitecore environment using the Solr search provider, and the creation of a 3-node Solr cloud cluster. This series is broken into four parts.

For the third part of this series, we will create our Sitecore collection, add replicas, and connect Sitecore to the collection. We’ll also go over load balancing the requests to distribute them among the Solr cloud nodes.

Continue reading Sitecore on Solr Cloud: Part 3 – Creating Your Sitecore Collection

Sitecore on Solr Cloud: Part 2 – Setting up Zookeeper and Solr

This post is part of a series of posts on setting up your Sitecore application to run with Solr Cloud. We’ll be covering the procedure for setting up a Sitecore environment using the Solr search provider, and the creation of a 3-node Solr cloud cluster. This series is broken into four parts.

For the second part of this series, we will go through the steps to set up a Zookeeper Ensemble, individual Solr nodes, and linking them together in a Solr Cloud configuration. We’ll then create Windows services to start Zookeeper and Solr automatically on each server.

Continue reading Sitecore on Solr Cloud: Part 2 – Setting up Zookeeper and Solr

Sitecore on Solr Cloud: Part 1 – Architecture

This post is part of a series of posts on setting up your Sitecore application to run with Solr Cloud. We’ll be covering the procedure for setting up a Sitecore environment using the Solr search provider, and the creation of a 3-node Solr cloud cluster. This series is broken into four parts.

For the first part of this series, we’ll go over some assumptions, prerequisites, definitions, as well as the architecture of Solr Cloud and how it interacts with a Sitecore application.

Continue reading Sitecore on Solr Cloud: Part 1 – Architecture