File search
Who is this article for?
Instructor, no.
Incydr Professional, Enterprise, Gov F2, and Horizon, no.
Incydr Basic, Advanced, and Gov F1, no.
CrashPlan Cloud, no.
Retired product plans, yes.
CrashPlan for Small Business, no.
Overview
Use file search to search a user's backed-up files based on file name, file content, and file metadata. File search allows authorized security and legal personnel to determine whether an employee:
- Had access to files containing sensitive information
- Obtained unauthorized access to confidential information
- Has data that should be subject to legal hold
How file search works
File search relies on indexing users' files. When indexing is enabled, each storage server processes the archives in its store points to generate searchable indexes. These indexes are stored inside each archive.
Types of indexing
The Code42 platform supports two types of indexing:
- Metadata indexing: Information about the file is indexed, such as filename, created date, and modified date. File contents are not indexed.
- Content indexing: File contents are indexed in addition to file metadata.
Your Code42 servers automatically perform full content indexing for supported file types.
File types supported for content indexing
The Code42 platform can perform content indexing for the following file types:
- TXT
- HTML
- XML
- Microsoft Word (DOC, DOCX)
- Microsoft Excel (XLS, XLSX)
- OpenDocument (ODT)
- RTF
- EPUB
- iWork
- Most plain text files, such as source code
Requirements
To use file search, your Code42 environment must meet the following requirements:
Component Or Configuration | Requirements |
---|---|
Authority server | |
Storage server | |
Backup encryption key policy | Users' archives must use the Standard archive encryption key policy. Archive key password and Custom key are not supported. |
Performance optimization recommendations
Indexing consumes Code42 server system resources. Do not enable indexing if your Code42 servers have an average load of 50% or greater.
We recommend the following storage server configuration to optimize indexing performance:
- At least one CPU core per store point
- As much RAM as possible
Ideally, your Code42 servers should be able to index inbound files in real time or catch up over a 24-hour period by indexing during off-peak hours. See our performance testing results for information about expected performance.
For storage servers that are disk bound (have underutilized CPU cores), add more store points to increase indexing performance.
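The recommendations above can be turned into a simple pre-flight check. The following is a minimal sketch, not part of the Code42 product: the `StorageServer` class, its fields, and the `indexing_readiness` function are hypothetical names, and it assumes you already have the average load expressed as a percentage from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class StorageServer:
    """Hypothetical description of a storage server (not a Code42 API)."""
    name: str
    cpu_cores: int
    store_points: int
    average_load_pct: float  # assumption: load supplied as a percentage


def indexing_readiness(server: StorageServer) -> list[str]:
    """Return warnings; an empty list means no concerns were found."""
    warnings = []
    # Guidance: do not enable indexing at 50% average load or greater.
    if server.average_load_pct >= 50:
        warnings.append("average load is 50% or greater; do not enable indexing")
    # Guidance: at least one CPU core per store point.
    if server.cpu_cores < server.store_points:
        warnings.append("fewer than one CPU core per store point")
    return warnings


if __name__ == "__main__":
    server = StorageServer("storage-01", cpu_cores=8, store_points=4,
                           average_load_pct=35.0)
    print(indexing_readiness(server) or "no concerns found")
```

The check deliberately ignores RAM, because the guidance ("as much RAM as possible") does not give a threshold to test against.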
Enable and use file search
Setting up indexing and file search involves:
- Configuring a destination to allow indexing.
- Enabling indexing for an organization or for specific users.
- Granting authorized users access to search users' files.
After indexing is configured, an administrator or security professional uses the Code42 File Search web app to search users' backed-up files.
Best practices to enable indexing
To avoid overloading your Code42 servers, we recommend taking a phased approach to enabling indexing in your Code42 environment:
- Review our performance testing results to understand how Code42 server configuration and file types impact indexing performance.
- Enable indexing for one organization.
- Use the Code42 console to monitor indexing performance for one week.
- If the organization is fully indexed and indexing keeps up with inbound backup data, enable indexing for an additional organization.
Detailed configuration and usage instructions
The following articles describe how to configure indexing and use the File Search web app in detail:
How Code42 server configuration and file types impact performance
Code42 tested specific file types with baseline hardware to determine expectations for indexing performance. Use this data to understand the variables that impact indexing performance.
Test configuration
This section summarizes the backup rate, storage server hardware, and file types that Code42 used for performance testing.
Backup rate
For testing, 2,000 CrashPlan devices backed up 50 files every 15 minutes. As a result, approximately 6,500 files backed up per minute. Code42 based this configuration on the typical inbound backup load for a Code42 server in the Code42 cloud.
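As a quick sanity check on the stated backup rate, the arithmetic can be sketched as follows. The function name is hypothetical. Note that the nominal figure comes out slightly above the "approximately 6,500" quoted above; the rest of this article uses 6,500 files per minute.

```python
def nominal_backup_rate(devices: int, files_per_interval: int,
                        interval_minutes: float) -> float:
    """Files backed up per minute across all devices."""
    return devices * files_per_interval / interval_minutes


rate = nominal_backup_rate(devices=2000, files_per_interval=50,
                           interval_minutes=15)
print(f"{rate:,.0f} files per minute")  # prints "6,667 files per minute"
```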
Storage server hardware
Code42 conducted indexing performance baseline testing with the following storage server hardware:
Component | Configuration |
---|---|
CPU | AMD Opteron 6212 (8 cores) |
RAM | 32 GB (8 GB allocated to the Code42 server software) |
Database | Hosted on a dedicated volume |
Archive storage | 5.5 TB volume. Code42 performed testing with a single store point and with four store points on a single volume. |
Test file types
Code42 simulated three types of files to test indexing performance:
File Type | Description | Performance Impact | Examples |
---|---|---|---|
Metadata only | Binary files that have contents that cannot be examined for keywords. | Low | |
Plain text | ASCII files that do not need to be parsed to extract keywords from their contents. | Medium | TXT |
Content indexable | Files that must be parsed to extract keywords. | High | |
File sets used for tests
Code42 tested indexing performance using three specific sets of test files:
File Set | File Types |
---|---|
Mostly metadata only files | |
Even mix of files | |
Mostly content indexable files | |
Observed indexing performance
The following table summarizes the observed indexing performance for each file set:
File Set | Single store point: index rate during backup activity¹ | Single store point: index rate without backup activity | Four store points: index rate during backup activity¹ | Four store points: index rate without backup activity |
---|---|---|---|---|
Mostly metadata only files | ~2,000–3,000 files per minute | ~3,500 files per minute | ~6,500 files per minute (keeping up with the backup rate) | ~6,500 files per minute |
Even mix of files | ~2,000 files per minute | ~2,000 files per minute | ~4,000–5,000 files per minute | ~4,000–5,000 files per minute |
Mostly content indexable files | ~1,000 files per minute | ~1,500 files per minute | ~3,500–4,000 files per minute | ~4,000 files per minute |
¹ The backup rate for these tests was approximately 6,500 files per minute.
Single store point test analysis
The single store point configuration offers lower indexing performance because all archives are assigned to a single store point and one CPU core.
- This configuration cannot index backed-up files in real time for any tested file set.
- Assuming each 24-hour day has 12 hours of backup activity (8-hour work day across 4 time zones) and 12 hours without backup activity, this configuration cannot index backed-up files within the same day. It is unlikely that this configuration will ever finish indexing all backed-up files during off-peak hours.
- Based on observed performance, the single store point configuration is not appropriate for this scenario.
File Set | Files Backed Up In 12 Hours | Files Indexed Over 24 Hours | New Files Not Indexed Each Day |
---|---|---|---|
Mostly metadata only files | 4,680,000 | 4,320,000 | 360,000 |
Even mix of files | 4,680,000 | 2,880,000 | 1,800,000 |
Mostly content indexable files | 4,680,000 | 1,800,000 | 2,880,000 |
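The single store point figures follow from the observed index rates and the 12-hour/12-hour assumption. The sketch below reproduces that arithmetic; the function and variable names are illustrative, and it assumes 2,500 files per minute as the midpoint of the "~2,000–3,000" range reported for the mostly metadata only file set.

```python
MINUTES_PER_HOUR = 60


def daily_indexing_summary(backup_rate, busy_index_rate, idle_index_rate,
                           busy_hours=12, idle_hours=12):
    """Return (files backed up, files indexable in 24 hours, files left unindexed).

    All rates are in files per minute.
    """
    backed_up = backup_rate * busy_hours * MINUTES_PER_HOUR
    indexable = (busy_index_rate * busy_hours
                 + idle_index_rate * idle_hours) * MINUTES_PER_HOUR
    unindexed = max(0, backed_up - indexable)
    return backed_up, indexable, unindexed


# (rate during backup activity, rate without backup activity)
# Assumption: 2,500 is the midpoint of the reported "~2,000-3,000" range.
single_store_point = {
    "Mostly metadata only files": (2500, 3500),
    "Even mix of files": (2000, 2000),
    "Mostly content indexable files": (1000, 1500),
}

for file_set, (busy, idle) in single_store_point.items():
    backed_up, indexable, unindexed = daily_indexing_summary(6500, busy, idle)
    print(f"{file_set}: {backed_up:,} backed up, "
          f"{indexable:,} indexed, {unindexed:,} not indexed")
```

Because the unindexed count is positive for every file set, the backlog grows each day under these assumptions, which is why this configuration never catches up.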
Four store point test analysis
The four store point configuration performs better because the archives are spread across multiple store points and CPU cores.
- This configuration can index backed-up files in real time for the mostly metadata only files file set.
- Assuming each 24-hour day has 12 hours of backup activity (8-hour work day across 4 time zones) and 12 hours without backup activity, this configuration can index backed-up files within the same day. Indexing during off-peak hours makes this possible for the file sets that cannot be indexed in real time.
- Based on observed performance, the four store point configuration is appropriate for this scenario.
File Set | Files Backed Up In 12 Hours | Files Indexed Over 24 Hours | New Files Not Indexed Each Day |
---|---|---|---|
Mostly metadata only files | 4,680,000 | 9,360,000 | N/A |
Even mix of files | 4,680,000 | 6,480,000 | N/A |
Mostly content indexable files | 4,680,000 | 5,040,000 | N/A |
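To see how much of the off-peak window the catch-up actually consumes, you can estimate the hours of off-peak indexing needed to clear the backlog that accumulates during backup activity. This is a rough sketch with hypothetical names; as a conservative assumption it uses the low end of each reported index rate range, and it assumes a constant backup rate of 6,500 files per minute for 12 hours.

```python
def off_peak_hours_to_catch_up(backup_rate, busy_index_rate, idle_index_rate,
                               busy_hours=12.0):
    """Hours of off-peak indexing needed to clear the backlog built up
    during backup activity. All rates are in files per minute."""
    shortfall_per_minute = max(0, backup_rate - busy_index_rate)
    backlog_files = shortfall_per_minute * busy_hours * 60
    return backlog_files / idle_index_rate / 60


BACKUP_RATE = 6500
OFF_PEAK_HOURS = 12

# (rate during backup activity, rate without backup activity)
# Assumption: low end of each range reported for four store points.
four_store_points = {
    "Mostly metadata only files": (6500, 6500),
    "Even mix of files": (4000, 4000),
    "Mostly content indexable files": (3500, 4000),
}

for file_set, (busy, idle) in four_store_points.items():
    hours = off_peak_hours_to_catch_up(BACKUP_RATE, busy, idle)
    fits = "fits" if hours <= OFF_PEAK_HOURS else "does not fit"
    print(f"{file_set}: {hours:.1f} off-peak hours needed ({fits})")
```

With these inputs, every file set clears its backlog within the 12 off-peak hours. Running the same function with the single store point rates (for example, 2,000 files per minute in both periods) gives a figure well above 12 hours.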