TrueNAS

From Bitpost wiki
== Overview ==
TrueNAS is installed to the HIVE VM on melange proxmox.
=== Pools ===
TrueNAS provides storage via Pools.  A pool is a bunch of raw drives gathered and managed as a set.  My pools are one of these:
{| class="wikitable"
! Pool type
! Description
|-
| style="color:red" |single drive
|no TrueNAS advantage other than health checks
|-
| style="color:red" |raid1 pair
|mirrored drives give normal write speeds, fast reads, single-fail redundancy, costs half of storage potential
|-
| style="color:red" |raid0 pair
|striped drives give fast writes, normal reads, no redundancy, no storage cost
|-
| style="color:green" |raid of multiple drives
|'''raidz''': optimization of read/write speed, redundancy, storage potential
|}


The three levels of raidz are:
* raidz: one drive's worth of capacity is consumed just for parity (i.e. with n drives you only get (n-1) drives of usable storage), and one drive can be lost without losing any data; fastest; very dangerous to recover from a lost drive (the "resilver" process is brutal on the remaining drives - don't wait)
* raidz2: two drives for parity, two can be lost
* raidz3: three drives for parity, three can be lost; slowest
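The parity cost above can be sanity-checked with a little shell arithmetic (the 8x8TB figures here are just an example, not my actual layout):

```shell
# Usable capacity of an n-drive raidz vdev of equal-size drives:
# subtract the parity drives (1 for raidz, 2 for raidz2, 3 for raidz3).
usable_tb() {
  drives=$1; size_tb=$2; parity=$3
  echo $(( (drives - parity) * size_tb ))
}

usable_tb 8 8 1   # raidz:  56 TB usable out of 64 raw
usable_tb 8 8 2   # raidz2: 48 TB
usable_tb 8 8 3   # raidz3: 40 TB
```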
 
=== Datasets ===
 
Every pool should have one child dataset.  This is where we set the permissions, important for Samba access.  We could have more than one child dataset, but I haven't had the need.
 
==== Adding ====
hive > Storage > Pools > mine (or any newly created pool) > Add Dataset
 
Dataset settings:
name #pool#-ds
share type SMB
 
Save, then continue...
hive > Storage > Pools > mine (or any newly created pool) > mine-ds > Edit ACL
user m
group m
ACL
  who everyone@
  type Allow
  Perm type Basic  (NOTE: "Perm type Basic" is important!)
  Perm Full control (NOTE: this is not the default, you will need to change it)
  Flags type Basic
  Flags Inherit    (NOTE: this is not the default, you will need to change it)
(REMOVE on all other blocks)
SAVE
 
=== Windows SMB Shares ===
 
Share each dataset as a Samba share under:
Sharing > Windows Shares (SMB)
 
* Use the pool name for the share name.
* Use the same ACL as for the dataset.
* Purpose: No presets
 
WARNING: I had to set these Auxiliary Parameters in the SMB config so that symlinks would be followed.
 
* Services > SMB > Actions > configuration > Auxiliary Parameters:
unix extensions = no
follow symlinks = yes
wide links = yes
* Stop and restart SMB service
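With the share up, a Linux client can mount it via cifs; an /etc/fstab entry from my setup (sassy is one of my pools, and /root/samba_credentials holds the SMB user/password):

```
//hive/sassy /spiceflow/sassy cifs credentials=/root/samba_credentials,uid=1000,gid=1000,file_mode=0774,dir_mode=0775,auto  0 0
```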
 
== Maintenance ==
 
=== Burn in a new drive ===
[https://www.truenas.com/community/resources/hard-drive-burn-in-testing.92/ ALWAYS do this] even though it's a PITA.  Less pain than not doing it.
 
I didn't do it for my 7-8TB-drive zraid.  Murphy said FUCK YOU and one of the eight went bad.  So... do the test, dumbass.
 
But of course I found a way to stay lazy... TrueNAS has the ability to run SMART tests directly on a drive, so do it there.  Or just wait for SMART failures to show up.  God damn, laziness rules.  Maybe.  Fool.
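A minimal burn-in sketch for doing it from a shell, assuming smartmontools and badblocks are available; the device name is a placeholder, and badblocks -w destroys everything on the drive, so triple-check it is the NEW one:

```shell
# Burn-in sketch for a NEW drive - badblocks -w is DESTRUCTIVE.
burnin() {
  dev="$1"
  if [ -z "$dev" ]; then
    echo "usage: burnin /dev/adaX"
    return 1
  fi
  smartctl -t short "$dev"   # quick self-test first
  badblocks -ws "$dev"       # full write/read pass with 4 patterns (slow!)
  smartctl -t long "$dev"    # long surface self-test
  smartctl -a "$dev"         # then check reallocated/pending sector counts (want 0)
}
# e.g.: burnin /dev/ada5
```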
 
=== Regularly do SMART, scrub, resilver ===
 
YOU MUST DO THIS REGULARLY!
 
From [https://www.truenas.com/community/threads/self-healed-hard-drive.81138/ here]:
A drive, vdev or pool is declared degraded if ZFS detects problems with the data.  If you reboot, the error count is reset.  A resilver will heal the data errors if there is sufficient redundancy.  ZFS will only spot the data issues on read; that's why we have scrubs, a forced read of all the data to try and determine if there are any errors.  So scheduled regular scrubs are important.  This will not tell you why the data is corrupted; for that you have S.M.A.R.T. tests, which you need to schedule as well, both long and short.
To get a handle on the situation as is, trigger a scrub and long SMART tests.
 
Never do more than one of these at a time, and never do any of them during heavy disk usage (e.g. backups).
 
SMART can be done weekly (not too often or it will contribute to early wear-out of SSDs).
 
Same for scrub.
 
Resilver happens when a drive issue requires the data to be rebalanced or redistributed.  Buckle up for this one!
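TrueNAS schedules all of this from the UI (Tasks > Scrub Tasks and Tasks > S.M.A.R.T. Tests); on a plain ZFS box the equivalent would be cron entries like these - the pool name and device are placeholders, and the runs are staggered so only one happens at a time:

```
# weekly scrub, Sunday 03:00
0 3 * * 0  zpool scrub safe
# weekly long SMART test, Wednesday 04:00 (staggered away from the scrub)
0 4 * * 3  smartctl -t long /dev/ada0
```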
 
=== Pool speed check ===
 
CAST to SAFE: ~114MB/s write (compressed) on 60MB/s network
 
Do this to test raw write speed from anywhere on the LAN to the [safe] pool:
dd if=/dev/zero of=/mnt/safe/safe-dd/speedtest.data bs=4M count=10000
# on hive: 42GB transferred in ~15sec at ~2.9GB/sec, WOW
# on cast: 42GB copied in 371sec at 114MB/s - that seems in line with my network speed (see below)
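The dd numbers can be sanity-checked: bs=4M x count=10000 is ~42GB, so a hypothetical helper like this recovers the MB/s figure:

```shell
# MB/s from bytes transferred and elapsed seconds (integer math).
mb_per_sec() {
  bytes=$1; secs=$2
  echo $(( bytes / secs / 1000000 ))
}

# cast's run: 4 MiB * 10000 blocks over 371 seconds
mb_per_sec $(( 4194304 * 10000 )) 371   # -> 113, matching the ~114MB/s above
```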
 
To test the network bandwidth limit:
# on hive
iperf -s -w 2m # run in server mode with a 2MB TCP window
# on another LAN machine
iperf -c hive -w 2m -t 30 -i 1 # 30-second test, report every second
# on cast: 1.51 GB at 477Mbits/sec aka 60MB/sec
# I have a 1Gb switch, i guess that's all we get out of it?
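To convert iperf's figure, divide bits by 8.  Note that a 1Gb switch should be good for ~940Mbits/s (~117MB/s) in practice, so 477Mbits/s likely points at a NIC, VM, or window-size bottleneck rather than the switch itself:

```shell
# MB/s (decimal) from Mbits/s: divide by 8.
mbits_to_mbytes() { echo $(( $1 / 8 )); }

mbits_to_mbytes 477    # -> 59, i.e. the ~60MB/sec above
mbits_to_mbytes 940    # -> 117, what a healthy 1Gb link can do
```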
 
=== Replace a bad disk in a raidz pool ===
My 7-drive raidz arrays can only lose ONE drive before they go boom, so you MUST replace bad disks immediately.  raidz2 can survive two lost drives, raidz3 three, but SSD raidz's you-can-lose-one-drive is, to me, a sweet spot.
 
* Watch TrueNAS for '''CRITICAL''' alerts that indicate a drive is failing its self-tests.
* Make note of its serial number.
* Find the drive in the pool, make note of its drive id (not needed but no harm).
* Change the pool drive status from FAULTED to OFFLINE
Storage > Pools > badpool > triple-dot Status > baddrive > triple-dot-status > FAULTED to OFFLINE
* Power down the whole fucking PROXMOX machine
* Pull it, and swap out bad drive for good
* Replace it
Storage > Pools > badpool > baddrive > triple-dot-status > REPLACE
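For reference, a dry-run sketch of the ZFS-level equivalents of the UI steps (it only prints the commands; badpool and ada3 are placeholders - in TrueNAS, drive the replacement from the UI as above):

```shell
# Print the zpool commands that correspond to the UI steps above.
replace_disk() {
  pool="$1"; disk="$2"
  echo "zpool offline $pool $disk"   # FAULTED -> OFFLINE
  echo "# power down, physically swap the drive"
  echo "zpool replace $pool $disk"   # kicks off the resilver
  echo "zpool status -v $pool"       # watch resilver progress
}

replace_disk badpool ada3
```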
 
=== Remove a bad pool ===
 
* Make note of which drives use the pool; likely some are bad and some are good and perhaps worth reusing elsewhere.
* Disconnect SMB connections to the pool
** Update valid shares in mh-setup-samba-shares
** Rerun mh-setup-samba-shares everywhere (eventually anyway)
** One possible easier way to get SMB disconnected from the pool is to stop SMB service in TrueNAS
** Sadly, to get through this for my splat pool, I had to try the pool removal, watch it fail, restart hive, then remove the pool again.
* Pool > (gear) > Export/disconnect
** [x] Delete configuration of shares that use this pool (to remove the associated SMB share)
** [x] Destroy data on this pool (you MUST select this or the silly thing will attempt to export the data)
 
=== Update TrueNAS ===
Updating is baked into the UI, nice!  And I have auto-updates enabled.  So nice.
 
These guys work hard on this, to make sure releases are well tested.  Watch for alerts about newly available updates.  Do not update past the current release!
 
System > Update > [Train] (ensure you have a good one selected; on occasion, you'll want to CHANGE it to select a newer stable release!)
Give the system a minute to load available updates...
Press Download available updates > The modal will ask if you want to apply and restart > Say yes
 
That's about it!


== Configuration ==

=== Set up user ===
I set up m user (1000) and m group (1000)


=== Set up alert emails ===
Go to one of your Google accounts to get an App password.  It has to be an account that has 2FA turned on, bleh, so don't use moodboom@gmail.com.  I went with abettersoftwaretrader@gmail.com.


  Accounts > Users > root > edit password > abettersoftwaretrader@gmail.com
  System > Email > from email > abettersoftwaretrader@gmail.com, smtp.gmail.com 465 Implicit SSL, SMTP auth: (email/API password)


Then you can test it here:
System > Email > (at bottom, next to Save...) Send Test Email
 
=== Set up user ssh ===
This was not fun.
* Set up user
* You have to set password ON and make sure to check [x] Allow sudo
* Make sure to allow Samba Authentication for m user that is used for samba
* Add public key to user
* Create a valid folder on the /mnt NAS shares for the user's home; you can mkdir using samba; I created:
/mnt/safe/safe-ds/software/apps/hive-home
* set the user's home to that ^; turn off password auth
* Turn on SSH service
* System > SSH Keypairs > Add SSH keypair for main user m
* System > SSH Connections > Add, use localhost, keypair from prev step
It should work but it does not!
* Open a TrueNAS prompt via the proxmox console
* Go to the home dir, there should be an .ssh there now
* Reduce permissions on both HOME DIR (700) and .ssh/KEY (400)
* Get a shell and run `sudo visudo` and add this line:
m ALL=(ALL) NOPASSWD: ALL
 
Finally!  It works!
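The permission fix above can be sketched as a tiny helper (authorized_keys is my assumption for the ".ssh/KEY" file mentioned above; adjust to whatever actually landed in .ssh):

```shell
# Tighten the perms sshd insists on: home 700, .ssh 700, key file 400.
fix_ssh_perms() {
  home="$1"
  chmod 700 "$home" "$home/.ssh"
  chmod 400 "$home/.ssh/authorized_keys"
}
# e.g.: fix_ssh_perms /mnt/safe/safe-ds/software/apps/hive-home
```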
 
== Troubleshooting ==


SOME of my shares were throwing '''Permission Denied''' errors on mv.  Solutions:
* I applied permissions again, recursively, then restarted the SMB service on hive and the problem went away.
* You can also always go to the melange hive console, request a shell, and things always seem to work from there (but you're in FreeBSD world and don't have any beauty scripts like mh-move-torrent!)

Latest revision as of 18:00, 16 January 2024
