Object Storage Best Practices
Cyderes has some recommended best practices to ensure seamless ingestion of your data.
- Ensure objects are decorated with the correct ContentEncoding and ContentType metadata. Cyderes uses this metadata to determine how to decode data into individual logs. Without it, Cyderes has to be told what to expect so that logs are delivered intact rather than as garbled, indecipherable data.
  - Currently, Cyderes supports application/gzip for encoding, and plain/text, application/json, application/ndjson, application/jsonl, and text/plain for content type.
  - Binary formats like Protocol Buffers should be avoided. They are often serialized in a way that requires custom code to handle each new schema, whereas text-based formats tend to be more generalized.
- Try to keep the uncompressed size of each object at roughly 1 GB or smaller. Larger objects take longer to process and increase the risk of a transient failure that requires reprocessing. They can also increase the latency before logs appear in downstream SIEMs.
- Compress your objects. This saves cost all around: it reduces the network impact of uploading and retrieving objects, and it reduces the amount of data being stored.
- Objects should not be appended to. Without operationally expensive methods to track which objects Cyderes has and has not seen, it is difficult to identify potential duplicate logs. Treat objects as immutable: once uploaded, they are never modified.
- Cyderes uses regex patterns to match objects to distinct data types, so data types must not be commingled within a path.
- Ensure notifications are sent only for objects that Cyderes is intended to ingest. If Cyderes is not made aware of the paths and data types intended for ingestion, those notifications will be dropped. Ideally, the regex pattern used to filter notifications matches the regex patterns provided to Cyderes for ingestion.
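Several of the practices above (size limits, compression, metadata, immutability) can be combined on the producer side. The following is a minimal sketch, not a Cyderes-provided tool: the function names, bucket name, and key layout are hypothetical, and the returned dictionary is only shaped like the keyword arguments of an S3-style put_object call (for example boto3's), so no cloud SDK is needed to run it. The ContentEncoding value is taken from the supported list above; confirm the exact values with Cyderes before relying on them.

```python
import gzip
import json

# Guideline from the list above: keep uncompressed objects at roughly 1 GB or less.
MAX_UNCOMPRESSED_BYTES = 1_000_000_000


def batch_ndjson(records, max_bytes=MAX_UNCOMPRESSED_BYTES):
    """Yield NDJSON payloads (bytes), each at most max_bytes uncompressed.

    A single record larger than max_bytes is still emitted, alone in its own payload.
    """
    lines, size = [], 0
    for record in records:
        line = json.dumps(record, separators=(",", ":")).encode("utf-8") + b"\n"
        if lines and size + len(line) > max_bytes:
            yield b"".join(lines)
            lines, size = [], 0
        lines.append(line)
        size += len(line)
    if lines:
        yield b"".join(lines)


def build_upload_args(bucket, key, payload):
    """Return keyword arguments shaped for an S3-style put_object call (hypothetical helper)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": gzip.compress(payload),          # compress before upload
        "ContentEncoding": "application/gzip",   # value from the supported list above
        "ContentType": "application/ndjson",
    }


# Each payload gets its own key and is written once, never appended to.
records = [{"id": i} for i in range(3)]
uploads = [
    build_upload_args(
        "example-bucket",                                    # hypothetical bucket
        f"app-logs/2000/06/30/part-{n:05d}.ndjson.gz",       # hypothetical key layout
        payload,
    )
    for n, payload in enumerate(batch_ndjson(records, max_bytes=20))
]
```

The tiny max_bytes in the usage lines only forces a split for demonstration. Writing a new numbered object per batch, rather than appending to an existing one, keeps every object immutable after upload.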
Notifications Versus Polling
A best practice so large it requires its own heading. Cyderes recommends sending notifications for new objects to a Cyderes-owned notification queue wherever and whenever possible. Object storage products do not allow wildcards when retrieving objects under specific paths. This is likely because object storage is not actually hierarchical storage. As a result, picking up new objects in dynamic path structures is difficult. A good example is AWS CloudTrail, which has the following default path structure:
AWSLogs/<AccountID>/CloudTrail/<region>/<year>/<month>/<day>/<filename>
Unfortunately, object storage providers do not let you retrieve data for a specific date using wildcards, even though your first thought might be to request something like the following:
AWSLogs/*/CloudTrail/*/2000/06/30/
To achieve the above, one would have to either make a very large call under AWSLogs/ and filter out the objects that do not match a path regex pattern similar to the above, or build the possible paths by walking the hierarchy of prefixes that should be searched. Both can easily be cost prohibitive depending on the number of objects in the bucket and the poll frequency. (Poll frequency also affects the delay between an object becoming available and its logs being ingested into a downstream SIEM.)
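To make the first option concrete, here is a minimal sketch of the client-side filtering step. It assumes the keys have already been listed under AWSLogs/ by some other means; the regex, the function name, and the sample keys (account IDs, filenames) are illustrative assumptions, not Cyderes's actual patterns.

```python
import re

# Regex equivalent of the wildcard path AWSLogs/*/CloudTrail/*/<year>/<month>/<day>/
# where each "*" is allowed to match exactly one path segment.
CLOUDTRAIL_KEY = re.compile(
    r"^AWSLogs/[^/]+/CloudTrail/[^/]+/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/[^/]+$"
)


def keys_for_day(keys, year, month, day):
    """Yield only the keys whose path matches the CloudTrail layout for the given date."""
    wanted = (f"{year:04d}", f"{month:02d}", f"{day:02d}")
    for key in keys:
        match = CLOUDTRAIL_KEY.match(key)
        if match and (match["year"], match["month"], match["day"]) == wanted:
            yield key


# Hypothetical listing result; a real bucket could return millions of keys here.
listed = [
    "AWSLogs/111111111111/CloudTrail/us-east-1/2000/06/30/a.json.gz",
    "AWSLogs/222222222222/CloudTrail/eu-west-1/2000/06/30/b.json.gz",
    "AWSLogs/111111111111/CloudTrail/us-east-1/2000/07/01/c.json.gz",
    "AWSLogs/111111111111/Config/us-east-1/2000/06/30/d.json.gz",
]
matching = list(keys_for_day(listed, 2000, 6, 30))
```

The filter itself is cheap. The cost lies in the fact that every key under AWSLogs/ has to be listed on each poll before the filter can discard most of them.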
If polling is required, for example when a vendor owns the object storage bucket and does not offer a queue, ensure that object paths are as flat as possible while still including the date and/or time. For example, if AWS CloudTrail wrote logs with the following path structure, polling would be more feasible:
AWSLogs/CloudTrail/<year>/<month>/<day>/<filename>
However, polling is subject to delays that depend on the poll frequency and on how many objects need to be processed within each interval. Notifications allow us to auto-scale based on queue size, reducing the average latency between an object becoming available and being ingested.
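As an illustration of why the flat, date-based layout above is easier to poll, the sketch below builds the small, fixed set of prefixes a poller would need to list on each cycle. The function name, the lookback parameter, and the root prefix are assumptions made for this example; they do not describe how Cyderes's poller is implemented.

```python
from datetime import date, timedelta


def poll_prefixes(today, lookback_days=1, root="AWSLogs/CloudTrail"):
    """Return one listing prefix per day, oldest first, ending with today.

    The lookback covers objects that arrive late for a previous day.
    """
    days = (today - timedelta(days=n) for n in range(lookback_days, -1, -1))
    return [f"{root}/{d:%Y/%m/%d}/" for d in days]


prefixes = poll_prefixes(date(2000, 7, 1))
# prefixes == ["AWSLogs/CloudTrail/2000/06/30/", "AWSLogs/CloudTrail/2000/07/01/"]
```

With the default CloudTrail layout, the same poll would need one prefix per account, region, and day, and the set of accounts and regions would first have to be discovered.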